Introduction to Hadoop
Hadoop: the Adam and Eve of the open-source big-data world.
At its core are HDFS, the data storage system, and MapReduce, the distributed computation framework.
HDFS
The idea is to chop large data into blocks,
replicate each block three times, and spread the copies across three cheap machines, so that three mutually redundant copies of the data are available at all times. On read, only one of the copies needs to be fetched and the block is served.
The nodes that store the data are called DataNodes (the lockers); the node that manages the DataNodes is called the NameNode (the overseer).
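Once a cluster is up you can watch replication from the command line; a minimal sketch, using commands available inside the container introduced later in this article (/demo.txt is a hypothetical path, and on a single-node setup the factor will typically be 1 rather than the usual 3):
bash-4.1# bin/hdfs dfs -put /etc/hosts /demo.txt           # store a small file in HDFS
bash-4.1# bin/hdfs dfs -stat 'replication=%r' /demo.txt    # print its replication factor
bash-4.1# bin/hdfs fsck /demo.txt -files -blocks           # list the blocks behind the file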
MapReduce
The idea is that a big task is first processed in pieces (Map), and the partial results are then aggregated (Reduce). Both the splitting and the aggregation run in parallel across many servers, which is where the power of the cluster shows. The hard part is decomposing a task into the split-and-aggregate shape MapReduce expects, and working out what the intermediate <k,v> inputs and outputs should be at each step.
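As a loose analogy (plain shell, not Hadoop code), the classic word count fits the model: the map phase emits <word, 1> pairs, the shuffle groups equal keys, and the reduce phase sums each group; input.txt is a hypothetical local file:
bash-4.1# tr -s ' ' '\n' < input.txt | sort | uniq -c
# "map": tr emits one word per line (a <word,1> record); "shuffle": sort groups identical keys; "reduce": uniq -c sums each group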
Introduction to single-node Hadoop
For anyone studying Hadoop internals or doing Hadoop development, setting up a Hadoop system is a must. But
- configuring the system is a huge headache, and many people give up partway through the configuration;
- you may have no servers to use.
This article introduces a configuration-free way to install and run a single-node Hadoop, so you can quickly run Hadoop examples to support learning, development, and testing.
It assumes your laptop runs a Linux virtual machine, with Docker installed inside the VM.
Installation
Use docker to pull the sequenceiq/hadoop-docker:2.7.0 image and run it.
[root@bogon ~]# docker pull sequenceiq/hadoop-docker:2.7.0
2.7.0: Pulling from sequenceiq/hadoop-docker
860d0823bcab: Pulling fs layer
e592c61b2522: Pulling fs layer
A successful download ends with:
Digest: sha256:a40761746eca036fee6aafdf9fdbd6878ac3dd9a7cd83c0f3f5d8a0e6350c76a
Status: Downloaded newer image for sequenceiq/hadoop-docker:2.7.0
Startup
[root@bogon ~]# docker run -it --privileged=true sequenceiq/hadoop-docker:2.7.0 /etc/bootstrap.sh -bash
Starting sshd: [ OK ]
Starting namenodes on [b7a42f79339c]
b7a42f79339c: starting namenode, logging to /usr/local/hadoop/logs/hadoop-root-namenode-b7a42f79339c.out
localhost: starting datanode, logging to /usr/local/hadoop/logs/hadoop-root-datanode-b7a42f79339c.out
Starting secondary namenodes [0.0.0.0]
0.0.0.0: starting secondarynamenode, logging to /usr/local/hadoop/logs/hadoop-root-secondarynamenode-b7a42f79339c.out
starting yarn daemons
starting resourcemanager, logging to /usr/local/hadoop/logs/yarn--resourcemanager-b7a42f79339c.out
localhost: starting nodemanager, logging to /usr/local/hadoop/logs/yarn-root-nodemanager-b7a42f79339c.out
Once startup succeeds, the shell drops you straight into the Hadoop container environment; there is no need to run docker exec. Inside the container, change to /usr/local/hadoop/sbin and run ./start-all.sh and ./mr-jobhistory-daemon.sh start historyserver, as follows:
bash-4.1# cd /usr/local/hadoop/sbin
bash-4.1# ./start-all.sh
This script is Deprecated. Instead use start-dfs.sh and start-yarn.sh
Starting namenodes on [b7a42f79339c]
b7a42f79339c: namenode running as process 128. Stop it first.
localhost: datanode running as process 219. Stop it first.
Starting secondary namenodes [0.0.0.0]
0.0.0.0: secondarynamenode running as process 402. Stop it first.
starting yarn daemons
resourcemanager running as process 547. Stop it first.
localhost: nodemanager running as process 641. Stop it first.
bash-4.1# ./mr-jobhistory-daemon.sh start historyserver
chown: missing operand after `/usr/local/hadoop/logs'
Try `chown --help' for more information.
starting historyserver, logging to /usr/local/hadoop/logs/mapred--historyserver-b7a42f79339c.out
Hadoop is now fully up. That simple.
If you wonder how painful a distributed deployment is, just count the configuration files! I once watched a Hadoop veteran spend an entire morning because a new server's hostname contained a hyphen "-", and the environment simply refused to come up.
Running the bundled example
Return to the Hadoop home directory and run the example program:
bash-4.1# cd /usr/local/hadoop
bash-4.1# bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.0.jar grep input output 'dfs[a-z.]+'
20/07/05 22:34:41 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
20/07/05 22:34:43 INFO input.FileInputFormat: Total input paths to process : 31
20/07/05 22:34:43 INFO mapreduce.JobSubmitter: number of splits:31
20/07/05 22:34:44 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1594002714328_0001
20/07/05 22:34:44 INFO impl.YarnClientImpl: Submitted Application application_1594002714328_0001
20/07/05 22:34:45 INFO mapreduce.Job: The url to track the job: http://b7a42f79339c:8088/proxy/application_1594002714328_0001/
20/07/05 22:34:45 INFO mapreduce.Job: Running job: job_1594002714328_0001
20/07/05 22:35:04 INFO mapreduce.Job: Job job_1594002714328_0001 running in uber mode : false
20/07/05 22:35:04 INFO mapreduce.Job: map 0% reduce 0%
20/07/05 22:37:59 INFO mapreduce.Job: map 11% reduce 0%
20/07/05 22:38:05 INFO mapreduce.Job: map 12% reduce 0%
When the MapReduce job completes, it prints the following:
20/07/05 22:55:26 INFO mapreduce.Job: Counters: 49
File System Counters
FILE: Number of bytes read=291
FILE: Number of bytes written=230541
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=569
HDFS: Number of bytes written=197
HDFS: Number of read operations=7
HDFS: Number of large read operations=0
HDFS: Number of write operations=2
Job Counters
Launched map tasks=1
Launched reduce tasks=1
Data-local map tasks=1
Total time spent by all maps in occupied slots (ms)=5929
Total time spent by all reduces in occupied slots (ms)=8545
Total time spent by all map tasks (ms)=5929
Total time spent by all reduce tasks (ms)=8545
Total vcore-seconds taken by all map tasks=5929
Total vcore-seconds taken by all reduce tasks=8545
Total megabyte-seconds taken by all map tasks=6071296
Total megabyte-seconds taken by all reduce tasks=8750080
Map-Reduce Framework
Map input records=11
Map output records=11
Map output bytes=263
Map output materialized bytes=291
Input split bytes=132
Combine input records=0
Combine output records=0
Reduce input groups=5
Reduce shuffle bytes=291
Reduce input records=11
Reduce output records=11
Spilled Records=22
Shuffled Maps =1
Failed Shuffles=0
Merged Map outputs=1
GC time elapsed (ms)=159
CPU time spent (ms)=1280
Physical memory (bytes) snapshot=303452160
Virtual memory (bytes) snapshot=1291390976
Total committed heap usage (bytes)=136450048
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=437
File Output Format Counters
Bytes Written=197
Use the hdfs command to view the output:
bash-4.1# bin/hdfs dfs -cat output/*
6 dfs.audit.logger
4 dfs.class
3 dfs.server.namenode.
2 dfs.period
2 dfs.audit.log.maxfilesize
2 dfs.audit.log.maxbackupindex
1 dfsmetrics.log
1 dfsadmin
1 dfs.servers
1 dfs.replication
1 dfs.file
Walking through the example
grep is a MapReduce program that counts regular-expression matches in its input: it extracts each string matching the regex, along with its occurrence count.
Unlike shell grep, which prints the entire matching line, this program prints only the matched substring within the line:
grep input output 'dfs[a-z.]+'
The regular expression dfs[a-z.]+ matches strings that start with dfs, followed by one or more characters that are each either a lowercase letter or a literal dot.
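To see the contrast with ordinary shell grep for yourself (a side illustration; hdfs-site.xml is simply a convenient local file to test against), the -o flag makes grep print only the matched substrings instead of whole lines:
bash-4.1# grep -E 'dfs[a-z.]+' /usr/local/hadoop/etc/hadoop/hdfs-site.xml                   # whole matching lines
bash-4.1# grep -oE 'dfs[a-z.]+' /usr/local/hadoop/etc/hadoop/hdfs-site.xml | sort | uniq -c  # matched substrings, with counts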
The input is every file under the input directory:
bash-4.1# ls -lrt
total 48
-rw-r--r--. 1 root root 690 May 16 2015 yarn-site.xml
-rw-r--r--. 1 root root 5511 May 16 2015 kms-site.xml
-rw-r--r--. 1 root root 3518 May 16 2015 kms-acls.xml
-rw-r--r--. 1 root root 620 May 16 2015 httpfs-site.xml
-rw-r--r--. 1 root root 775 May 16 2015 hdfs-site.xml
-rw-r--r--. 1 root root 9683 May 16 2015 hadoop-policy.xml
-rw-r--r--. 1 root root 774 May 16 2015 core-site.xml
-rw-r--r--. 1 root root 4436 May 16 2015 capacity-scheduler.xml
The results are written to the output directory.
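If you want to run the example on your own data, the usual route is to upload files into HDFS first; a minimal sketch, where myinput is a hypothetical directory name:
bash-4.1# bin/hdfs dfs -mkdir -p myinput                                 # create an input directory in HDFS
bash-4.1# bin/hdfs dfs -put /usr/local/hadoop/etc/hadoop/*.xml myinput   # upload local files into it
bash-4.1# bin/hdfs dfs -ls myinput                                       # confirm the upload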
The computation flow is as follows.
One slight twist: there are two reduce passes here; the second reduce simply sorts the results by occurrence count. Developers are free to combine map and reduce stages however they like, as long as each stage's output lines up with the next stage's input; a rough sketch of the pattern follows.
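As a loose sketch of the chaining idea (the bundled grep example does this chaining internally; the sketch below only imitates the shape), you can treat one job's output directory as the next stage's input. Here the bundled wordcount is the first stage and a local sort stands in for the "second reduce" purely for inspection, with wc-out a hypothetical output directory:
bash-4.1# bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.0.jar wordcount input wc-out
bash-4.1# bin/hdfs dfs -cat wc-out/* | sort -k2 -n -r | head   # order words by descending count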
The web management UIs
Hadoop provides web-based management UIs on the following ports:
Port   Purpose
50070  Hadoop NameNode UI
50075  Hadoop DataNode UI
50090  Hadoop SecondaryNameNode
50030  JobTracker monitoring
50060  TaskTracker
8088   YARN job monitoring
60010  HBase HMaster monitoring UI
60030  HBase HRegionServer
8080   Spark monitoring UI
4040   Spark job UI
Add flags to the command
The docker run command needs extra port flags before the management pages can be reached:
docker run -it --privileged=true -p 50070:50070 -p 8088:8088 -p 50075:50075 sequenceiq/hadoop-docker:2.7.0 /etc/bootstrap.sh -bash
With this command running, you can browse the UIs from the host machine; if your Linux VM has a browser, that works too. My Linux has no graphical interface, so I browse from the host.
50070: the Hadoop NameNode UI
50075: the Hadoop DataNode UI
8088: the YARN job-monitoring UI
Both completed and currently running MapReduce jobs can be viewed on port 8088; the screenshot above shows two such jobs, grep and wordcount.
A few gotchas
1. ./sbin/mr-jobhistory-daemon.sh start historyserver must be run; otherwise jobs fail during execution with:
20/06/29 21:18:49 INFO ipc.Client: Retrying connect to server: 0.0.0.0/0.0.0.0:10020. Already tried 9 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
java.io.IOException: java.net.ConnectException: Call From 87a4217b9f8a/172.17.0.1 to 0.0.0.0:10020 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
2. ./start-all.sh must be run, or you get errors of the form Unknown Job job_1592960164748_0001.
3. The docker run command must include --privileged=true (placed before the image name, since docker treats anything after the image as arguments to the container command); otherwise jobs fail with java.io.IOException: Job status not available.
4. Note that Hadoop does not overwrite result files by default, so rerunning the example above will report an error until you delete the output directory in HDFS, or switch to a fresh name such as output01; see the commands below.
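A quick sketch of both options (output01 is just an arbitrary unused name):
bash-4.1# bin/hdfs dfs -rm -r output    # remove the previous results, or...
bash-4.1# bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.0.jar grep input output01 'dfs[a-z.]+'   # ...write to a fresh directory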
Summary
The approach in this article gets a working Hadoop installation at very low cost, which helps with learning, understanding, and development testing alike. To run your own Hadoop program, package it as a jar, upload it to the share/hadoop/mapreduce/ directory, and execute
bin/hadoop jar share/hadoop/mapreduce/yourtest.jar
to run the program and observe the results.
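One way to get the jar into the running container is docker cp from the host; a sketch, where yourtest.jar is the hypothetical jar from above and the container ID comes from docker ps:
[root@bogon ~]# docker ps --format '{{.ID}} {{.Image}}'   # find the running container's ID
[root@bogon ~]# docker cp yourtest.jar <container-id>:/usr/local/hadoop/share/hadoop/mapreduce/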