Big Data Analytics in Practice (Advanced Hands-On Course)

Introduction to Big Data and Its Analytics Applications
(Hands-On Course)
莊家雋

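Outline
• Another operating system: Linux
• Starting Hadoop
• Using the distributed storage system: HDFS
• Using the distributed computing system: MapReduce
• Using off-the-shelf tools for clustering and recommendation: Mahout
• Receiving a continuous stream of data: Flume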

A brief introduction to Linux
• Open a terminal
– Ctrl+Alt+T
• Commands used today
– Basic file operations
– The Vim text editor

Basic Linux commands: ls, cp
http://linux.vbird.org/linux_basic/0220filemanager/0220filemanager-fc4.php#
• Copy a file: cp
• List files: ls

Basic Linux commands: mv, rm
http://linux.vbird.org/linux_basic/0220filemanager/0220filemanager-fc4.php
• Move or rename a file: mv
• Delete a file: rm

Basic Linux commands: cat, mkdir
• Create a directory: mkdir
• View a file's contents: cat
http://linux.vbird.org/linux_basic/0220filemanager/0220filemanager-fc4.php

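Introduction to the Vim text editor
• Run "vi filename" to open the file in normal (command) mode
• Press i to enter insert mode and start editing
• Press [Esc] to return to normal mode
• Type : to enter command-line mode, then w to save and q to quit vi (for example, :wq)
http://linux.vbird.org/linux_basic/0310vi.php#vi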

Hadoop system architecture
• Master/slave architecture
– NameNode (NN), DataNode (DN)
– ResourceManager (RM), NodeManager (NM)
Host     Daemons
master   NN, RM
slave1   DN, NM
slave2   DN, NM

Poor man's Hadoop system architecture
• All daemons run on the same host
Host     Daemons
master   NN, DN, RM, NM

Starting HDFS
• start-dfs.sh
• http://master:50070

Starting MapReduce
• start-yarn.sh
• http://master:50030/cluster

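Distributed file system: HDFS
• Provides a single directory tree on top of a distributed storage environment
• Each file is split into many blocks, and each block is replicated on other nodes
[Figure: File 1 and File 2 stored as blocks in HDFS]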
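The two bullets above (splitting into blocks, then replicating them) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the 128 MB block size and replication factor 3 are common Hadoop 2 defaults rather than values taken from this course, the DataNode names are hypothetical, and the round-robin placement is a simplification of the rack-aware policy that real HDFS uses.

```python
def split_into_blocks(file_size, block_size):
    """Return the sizes of the blocks that a file of file_size bytes is cut into."""
    full, rest = divmod(file_size, block_size)
    return [block_size] * full + ([rest] if rest else [])


def place_replicas(num_blocks, datanodes, replication):
    """Assign each block to `replication` distinct DataNodes, round-robin (simplified)."""
    if replication > len(datanodes):
        raise ValueError("not enough DataNodes for the requested replication")
    return {
        block: [datanodes[(block + r) % len(datanodes)] for r in range(replication)]
        for block in range(num_blocks)
    }


MB = 1024 * 1024
# Assumed values: a 300 MB file, 128 MB blocks, 3 replicas, 4 hypothetical DataNodes.
blocks = split_into_blocks(300 * MB, 128 * MB)
placement = place_replicas(len(blocks), ["dn1", "dn2", "dn3", "dn4"], 3)

for block, nodes in placement.items():
    print(f"block {block}: {blocks[block] // MB} MB on {nodes}")
```

The last block is smaller than the others because a block only occupies as much space as the data it holds.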

http://www.ewdna.com/2013/04/Hadoop-HDFS-Comics.html
http://bradhedlund.com/2011/09/10/understanding-hadoop-clusters-and-the-network/


HDFS command-line operations
• Basic commands
– hadoop fs -ls <file_in_hdfs>
– hadoop fs -lsr <dir_in_hdfs> (deprecated in Hadoop 2; use -ls -R)
– hadoop fs -rm <file_in_hdfs>
– hadoop fs -rmr <dir_in_hdfs> (deprecated in Hadoop 2; use -rm -r)
– hadoop fs -mkdir <dir_in_hdfs>
– hadoop fs -cat <file_in_hdfs>
– hadoop fs -get <file_in_hdfs> <file_in_local>
– hadoop fs -put <file_in_local> <file_in_hdfs>

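Distributed computing system: MapReduce
• A problem is split into smaller subproblems; solving the problem amounts to solving all of its subproblems.
• Divide and conquer, one subproblem after another
– The traditional approach
• Divide and conquer, with all subproblems tackled "at the same time"
– MapReduce
• Map: solve each subproblem
• Reduce: combine the subproblem answers into a summary
• The analysis operates on key/value data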

How MapReduce counts words
Input, split across three map tasks
map1: This is a book
      This is a pen
map2: This is a desk
      That is my book
map3: That is my pen
Map output
<This,1>, <is,1>, <a,1>, <book,1>
<This,1>, <is,1>, <a,1>, <pen,1>
<This,1>, <is,1>, <a,1>, <desk,1>
<That,1>, <is,1>, <my,1>, <book,1>
<That,1>, <is,1>, <my,1>, <pen,1>
Input to reduce, grouped by key
<This,[1,1,1]>
<That,[1,1]>
<is,[1,1,1,1,1]>
<my,[1,1]>
<a,[1,1,1]>
<book,[1,1]>
<pen,[1,1]>
<desk,[1]>
Reduce output
<This,3>, <That,2>, <is,5>, <my,2>, <a,3>
<book,2>, <desk,1>, <pen,2>
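The word-count flow above can be imitated on a single machine with plain Python. This is a minimal sketch of the map, group-by-key, and reduce steps, not Hadoop code: the function names are made up for illustration, and everything runs sequentially in one process, whereas Hadoop would run the map tasks in parallel.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a <word, 1> pair for every word in one input line."""
    return [(word, 1) for word in line.split()]


def shuffle(pairs):
    """Group the values of all pairs by key, e.g. <This, [1, 1, 1]>."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: sum the 1s collected for one word."""
    return key, sum(values)


lines = [
    "This is a book",
    "This is a pen",
    "This is a desk",
    "That is my book",
    "That is my pen",
]

mapped = [pair for line in lines for pair in map_phase(line)]
grouped = shuffle(mapped)
counts = dict(reduce_phase(key, values) for key, values in grouped.items())
print(counts)
```

Because each line is mapped independently and each key is reduced independently, both steps can be spread across machines, which is the point of the model.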

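1. The RM (ResourceManager) allocates resources for the whole cluster
2. Each NM (NodeManager) periodically reports its current resource usage
3. Each job has its own AppMaster that controls the job
4. Resource management is separated from job control
5. YARN is a general-purpose resource management system, so multiple frameworks can run on top of it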

This is what a MapReduce program looks like…

Step by Step
# vim wordcount.data
aaa bbb ccc ddd
bbb ccc ddd eee
# hadoop fs -mkdir mr.wordcount
# hadoop fs -put wordcount.data mr.wordcount
# hadoop fs -ls mr.wordcount
# hadoop jar MR-sample.jar org.nchc.train.mr.wordcount.WordCount mr.wordcount/wordcount.data output
...omit...
File Input Format Counters
Bytes Read=32
File Output Format Counters
Bytes Written=30
# hadoop fs -cat output/part-r-00000
aaa 1
bbb 2
ccc 2
ddd 2
eee 1

Hands-on: clustering the data
ID   Chinese  Math
1    0        10
2    10       0
3    10       10
4    20       10
5    10       20
6    20       20
7    50       60
8    60       50
9    60       60
10   90       90

Step by Step
# vi clustering.data
0 10
10 0
10 10
20 10
10 20
20 20
50 60
60 50
60 60
90 90
# hadoop fs -mkdir testdata
# hadoop fs -put clustering.data testdata
# hadoop fs -ls -R testdata
-rw-r--r-- 3 root hdfs 288374 2014-02-05 21:53 testdata/clustering.data
# mahout org.apache.mahout.clustering.syntheticcontrol.canopy.Job -t1 3 -t2 2 -i testdata -o output
...omit...
14/09/08 01:31:07 INFO clustering.ClusterDumper: Wrote 3 clusters
14/09/08 01:31:07 INFO driver.MahoutDriver: Program took 104405 ms (Minutes: 1.7400833333333334)
# mahout clusterdump --input output/clusters-0-final --pointsDir output/clusteredPoints
C-0{n=1 c=[9.000, 9.000] r=[]}
Weight : [props - optional]: Point:
1.0: [9.000, 9.000]
C-1{n=2 c=[5.833, 5.583] r=[0.167, 0.083]}
1.0: [5.000, 6.000]
1.0: [6.000, 5.000]
1.0: [6.000, 6.000]
C-2{n=4 c=[1.313, 1.333] r=[0.345, 0.527]}
1.0: [1:1.000]
1.0: [0:1.000]
1.0: [1.000, 1.000]
1.0: [2.000, 1.000]
1.0: [1.000, 2.000]
1.0: [2.000, 2.000]

Things to think about
• Data preprocessing
– Convert the data into fields that Mahout can process
• Domain knowledge
– Why two clusters rather than three?

Recommender systems are all around you
• YouTube
• 博客來 (Books.com.tw)

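How a recommender system works
Known ratings (blank = not yet rated)
         book-a  book-b  book-c
User 1   5       4       5
User 2   4       5       4
User 3   5       4
User 4   1       2
User 5   2       1       1
Ratings with the blanks estimated
         book-a  book-b  book-c
User 1   5       4       5
User 2   4       5       4
User 3   5       4       4~5
User 4   1       2       1~2
User 5   2       1       1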

Step by Step
# vi recom.data
1,1,5
1,2,4
1,3,5
2,1,4
2,2,5
2,3,4
3,1,5
3,2,4
4,1,1
4,2,2
5,1,2
5,2,1
5,3,1
# hadoop fs -mkdir testdata
# hadoop fs -put recom.data testdata
# hadoop fs -ls -R testdata
-rw-r--r-- 3 root hdfs 288374 2014-02-05 21:53 testdata/recom.data
# mahout recommenditembased -s SIMILARITY_EUCLIDEAN_DISTANCE -i testdata -o output
...omit...
File Input Format Counters
Bytes Read=287
File Output Format Counters
Bytes Written=32
14/09/04 05:46:56 INFO driver.MahoutDriver: Program took 434965 ms (Minutes: 7.249416666666667)
# hadoop fs -cat output/part-r-00000
3 [3:4.4787264]
4 [3:1.5212735]

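Analysis results
         book-a  book-b  book-c
User 1   5       4       5
User 2   4       5       4
User 3   5       4       4~5
User 4   1       2       1~2
User 5   2       1       1
1. The predicted rating of User 4 for book-c is low (about 1.5), so we do not recommend book-c to User 4
2. The predicted rating of User 3 for book-c is high (about 4.5), so we recommend book-c to User 3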
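To see where predictions like 4.48 and 1.52 can come from, here is a small item-based sketch in Python using the ratings from recom.data. The formulas are assumptions chosen because they reproduce the job output shown earlier, not a statement of how Mahout is implemented: item similarity is taken as 1 / (1 + Euclidean distance) over the full rating columns with a missing rating counted as 0, and the prediction is the similarity-weighted average of the user's own ratings.

```python
import math

# user -> {item: rating}, copied from recom.data (items 1, 2, 3 = book-a, book-b, book-c)
ratings = {
    1: {1: 5, 2: 4, 3: 5},
    2: {1: 4, 2: 5, 3: 4},
    3: {1: 5, 2: 4},
    4: {1: 1, 2: 2},
    5: {1: 2, 2: 1, 3: 1},
}


def item_vector(item):
    """Ratings of one item by every user; a missing rating counts as 0 (assumption)."""
    return [ratings[user].get(item, 0) for user in sorted(ratings)]


def euclidean_similarity(a, b):
    """Assumed similarity: 1 / (1 + Euclidean distance between the two item columns)."""
    return 1.0 / (1.0 + math.dist(item_vector(a), item_vector(b)))


def predict(user, target):
    """Similarity-weighted average of the user's ratings on the items they did rate."""
    rated = ratings[user]
    weights = {item: euclidean_similarity(item, target) for item in rated}
    return sum(weights[item] * rated[item] for item in rated) / sum(weights.values())


print(round(predict(3, 3), 4))  # 4.4787
print(round(predict(4, 3), 4))  # 1.5213
```

User 3 rated book-a and book-b highly, so the weighted average for book-c is high; User 4 rated both low, so the prediction is low.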

Try It!
book1 book2 book3 book4 book5 book6 book7 book8 book9
User1 3 2 1 5 5 1 3 1
User2 2 3 1 3 5 4 3
User3 1 2 3 3 2 1
User4 2 1 2 1 1 2
User5 3 3 1 3 2 2 3 3 2
User6 1 3 2 2 1
User7 4 4 1 5 1 3 3 4
Table of user ratings for each book

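When data keeps arriving continuously
• Put the data onto HDFS by hand
• Or use Flume to collect the data
[Figure: files in a data directory → Flume (memory channel → HDFS sink) → HDFS]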

No programming needed, and it still runs automatically
• Defining a config file is all it takes
# vim example
agent.sources = source1
agent.channels = channel1
agent.sinks = sink1
agent.sources.source1.type = spooldir
agent.sources.source1.channels = channel1
agent.sources.source1.spoolDir = /home/hadoop/flumedata
agent.sources.source1.fileHeader = false
agent.sinks.sink1.type=hdfs
agent.sinks.sink1.channel=channel1
agent.sinks.sink1.hdfs.path=hdfs://master:9000/user/hadoop
agent.sinks.sink1.hdfs.fileType=DataStream
agent.sinks.sink1.hdfs.writeFormat=TEXT
agent.sinks.sink1.hdfs.rollSize = 0
agent.sinks.sink1.hdfs.rollCount = 0
agent.sinks.sink1.hdfs.idleTimeout = 0
agent.channels.channel1.type = memory
agent.channels.channel1.capacity = 100
# cd ~/flume/conf
# flume-ng agent -n agent -c . -f ./example
…
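Flume itself needs no code, but the source → channel → sink flow that the config describes can be made concrete with a toy Python sketch. Everything here is a stand-in: a bounded in-memory queue plays the memory channel (capacity 100, as in the config), a local text file plays HDFS, and the function names and temporary paths are hypothetical. Renaming an ingested file with a .COMPLETED suffix mirrors the default behaviour of Flume's spooling directory source.

```python
import os
import queue
import tempfile


def spool_source(spool_dir, channel):
    """Source: push every line of each new file into the channel, then mark the file done."""
    for name in sorted(os.listdir(spool_dir)):
        if name.endswith(".COMPLETED"):
            continue  # already ingested on an earlier pass
        path = os.path.join(spool_dir, name)
        with open(path, encoding="utf-8") as f:
            for line in f:
                # put_nowait raises queue.Full if the channel capacity is exceeded
                channel.put_nowait(line.rstrip("\n"))
        os.rename(path, path + ".COMPLETED")


def file_sink(channel, out_path):
    """Sink: drain the channel and append each event to the output file."""
    with open(out_path, "a", encoding="utf-8") as out:
        while not channel.empty():
            out.write(channel.get() + "\n")


with tempfile.TemporaryDirectory() as tmp:
    spool = os.path.join(tmp, "flumedata")
    os.mkdir(spool)
    with open(os.path.join(spool, "a.log"), "w", encoding="utf-8") as f:
        f.write("line 1\nline 2\n")

    channel = queue.Queue(maxsize=100)  # stands in for the memory channel
    spool_source(spool, channel)
    out_path = os.path.join(tmp, "hdfs_stand_in.txt")  # stands in for the HDFS sink
    file_sink(channel, out_path)

    with open(out_path, encoding="utf-8") as f:
        delivered = f.read().splitlines()
    remaining = sorted(os.listdir(spool))

print(delivered, remaining)
```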

Summary
• Virtual machine skills +1
• Linux skills +1
• HDFS skills +1
• Flume skills +1
• MapReduce skills +1
• Clustering with Mahout skills +1
• Recommendation with Mahout skills +1

…canopy.Job -t1 3 -t2 2 -i testdata
Finds 3 clusters

…canopy.Job -t1 6 -t2 5 -i testdata
Finds 2 clusters
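The effect of the two thresholds can be explored with a small canopy sketch in Python. It is a simplified illustration and not Mahout's code: the two-pass structure (canopies over the points, then canopies over the resulting centroids) is an assumption loosely modelled on a map step followed by a reduce step, and the points are written on the 0 to 9 scale that appears in the clusterdump output. A point closer than T1 to a canopy center joins that canopy; a point closer than T2 is not allowed to start a new canopy.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def canopy_pass(points, t1, t2):
    """One greedy pass; returns a list of (center, members) canopies."""
    canopies = []
    for p in points:
        strongly_bound = False
        for center, members in canopies:
            d2 = sq_dist(center, p)
            if d2 < t1 * t1:  # loosely inside: the point joins this canopy
                members.append(p)
            if d2 < t2 * t2:  # tightly inside: the point may not start a new canopy
                strongly_bound = True
        if not strongly_bound:
            canopies.append((p, [p]))
    return canopies


def centroid(members):
    """Mean of a list of points."""
    return tuple(sum(axis) / len(members) for axis in zip(*members))


def canopy_centers(points, t1, t2):
    """Two passes: canopies over the points, then canopies over their centroids."""
    first = [centroid(members) for _, members in canopy_pass(points, t1, t2)]
    return [centroid(members) for _, members in canopy_pass(first, t1, t2)]


points = [(0, 1), (1, 0), (1, 1), (2, 1), (1, 2), (2, 2), (5, 6), (6, 5), (6, 6), (9, 9)]
tight = canopy_centers(points, 3, 2)
loose = canopy_centers(points, 6, 5)
print(len(tight), len(loose))  # prints: 3 2
```

With the larger thresholds the point (9, 9) is absorbed into the middle group, which is why the number of clusters drops from 3 to 2.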
