Introduction to Hadoop
Institute of Manufacturing Information and Systems (製造資訊與系統研究所)
Institute of Engineering Management (工程管理碩士在職專班)
National Cheng Kung University (國立成功大學)
Topic: Hadoop (HDFS, MapReduce)
Advisor: Dr. 李家岩
Presenter: 洪紹嚴 (Shao-Yen Hung)
Date: 2015/10/08
Productivity Optimization Lab Shao-Yen Hung
Origin of the name “Hadoop”?
This toy’s name is Hadoop.
This guy is Doug Cutting.
2004: Google Labs publishes the MapReduce algorithm.
2006: He created the Hadoop framework (Yahoo!).
Architecture
Masters: Name Node and Secondary Name Node (Data Store); Job Tracker (Data Processing)
Slaves: four machines, each running a Data Node (Data Store) & a Task Tracker (Data Processing)
HDFS(Hadoop Distributed File System)
• In HDFS, the three roles are the Client, the Name Node, and the Data Nodes.
HDFS—Write data
[Figure: in steps (1) to (4), the Client splits the Original File (140MB) into blocks of 64MB, 64MB, and 12MB (the block size is usually 64 or 128 MB). The Name Node keeps the metadata; the Data Nodes DN1 to DN5 store the blocks.]
Name Node metadata:
Block1: DN1, DN3, DN5
Block2: DN1, DN2, DN3
Block3: DN1, DN4, DN5
…and so on….
‧Each block has 3 replicas by default‧ (e.g. Block 1 is stored on DN1, DN3, and DN5)
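The block arithmetic on this slide can be sketched in a few lines of Python. This is only an illustration: the function names `split_into_blocks` and `total_stored_mb` are made up for this example and are not part of any Hadoop API, and the replication factor of 3 is the value shown on the slide.

```python
def split_into_blocks(file_size_mb, block_size_mb=64):
    """Return the sizes (in MB) of the blocks a file is cut into."""
    if file_size_mb < 0 or block_size_mb <= 0:
        raise ValueError("sizes must be non-negative and block size positive")
    full_blocks, remainder = divmod(file_size_mb, block_size_mb)
    sizes = [block_size_mb] * full_blocks
    if remainder:
        # The last block only holds what is left of the file.
        sizes.append(remainder)
    return sizes


def total_stored_mb(block_sizes, replication=3):
    """Raw storage used across the cluster when every block is replicated."""
    return sum(block_sizes) * replication


blocks = split_into_blocks(140, 64)
print(blocks)                   # [64, 64, 12]
print(total_stored_mb(blocks))  # 420
```

With a 128 MB block size the same file would be cut into two blocks (128 MB and 12 MB) instead of three.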
HDFS—Replica Strategy(1/4)
Name Node metadata:
Block1: DN1, DN4, DN5
Block2: DN4, DN7, DN8
Block3: DN9, DN1, DN2
Rack 1: DN1, DN2, DN3
Rack 2: DN4, DN5, DN6
Rack 3: DN7, DN8, DN9
➢ In-rack latency < cross-rack latency
➢ In-rack bandwidth > cross-rack bandwidth
(1) Put the 1st replica in a random location.
(2) Put the next 2 replicas on two nodes in one different rack.
HDFS—Replica Strategy(2/4)
Blk 1: 1st replica on DN1 (Rack 1); next 2 replicas on DN4 and DN5 (Rack 2).
HDFS—Replica Strategy(3/4)
Blk 2: 1st replica on DN4 (Rack 2); next 2 replicas on DN7 and DN8 (Rack 3).
HDFS—Replica Strategy(4/4)
Blk 1: 1st replica on DN1 (Rack 1); next 2 replicas on DN4 and DN5 (Rack 2).
Blk 2: 1st replica on DN4 (Rack 2); next 2 replicas on DN7 and DN8 (Rack 3).
Blk 3: 1st replica on DN9 (Rack 3); next 2 replicas on DN1 and DN2 (Rack 1).
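The two placement rules from the Replica Strategy slides can be sketched as follows. This is a simplified illustration of the slide's rules, not the actual HDFS placement code: the rack layout in `RACKS` (three Data Nodes per rack) is an assumption read off the diagram, and `place_replicas` is a hypothetical helper name.

```python
import random

# Assumed rack layout, inferred from the diagram (not stated explicitly).
RACKS = {
    "Rack 1": ["DN1", "DN2", "DN3"],
    "Rack 2": ["DN4", "DN5", "DN6"],
    "Rack 3": ["DN7", "DN8", "DN9"],
}


def place_replicas(racks, rng=random):
    """Pick 3 Data Nodes for one block following the slide's two rules."""
    # Rule (1): the 1st replica goes to a random node.
    first_rack = rng.choice(sorted(racks))
    first_node = rng.choice(racks[first_rack])
    # Rule (2): the next 2 replicas go to two nodes of one other rack.
    candidates = [r for r in sorted(racks)
                  if r != first_rack and len(racks[r]) >= 2]
    if not candidates:
        raise ValueError("need another rack with at least two nodes")
    second_rack = rng.choice(candidates)
    second_node, third_node = rng.sample(racks[second_rack], 2)
    return [first_node, second_node, third_node]


print(place_replicas(RACKS))
```

Keeping two replicas in one rack limits cross-rack traffic during the write, while the replica in the other rack survives the loss of a whole rack.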
HDFS—Read data
(1) The Client sends the Filename to the Name Node.
(2) The Name Node returns the block locations:
Block1: DN1, DN3, DN5
Block2: DN1, DN2, DN3
Block3: DN1, DN4, DN5
…and so on….
(3) The Client asks the Data Nodes (DN1 to DN5) for each block, e.g. "Please give me Block 2", and receives Block 2, Block 3, and so on.
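The read path can be mimicked with plain dictionaries. The sketch below is a toy model under stated assumptions: `name_node`, `data_nodes`, `contents`, and `read_file` are hypothetical in-memory stand-ins, not the HDFS client API, and the block contents are invented placeholder bytes.

```python
# Toy metadata: file name -> ordered list of (block id, Data Nodes holding it).
name_node = {
    "File": [
        ("Block1", ["DN1", "DN3", "DN5"]),
        ("Block2", ["DN1", "DN2", "DN3"]),
        ("Block3", ["DN1", "DN4", "DN5"]),
    ]
}

# Invented block contents, copied onto every Data Node listed for the block.
contents = {"Block1": b"aaa", "Block2": b"bbb", "Block3": b"cc"}
data_nodes = {dn: {} for dn in ["DN1", "DN2", "DN3", "DN4", "DN5"]}
for block_id, holders in name_node["File"]:
    for dn in holders:
        data_nodes[dn][block_id] = contents[block_id]


def read_file(filename, name_node, data_nodes):
    """Ask the Name Node for block locations, then fetch each block."""
    data = b""
    for block_id, holders in name_node[filename]:
        # Try the replicas in order until one Data Node has the block.
        for dn in holders:
            block = data_nodes.get(dn, {}).get(block_id)
            if block is not None:
                data += block
                break
        else:
            raise IOError(f"no reachable replica of {block_id}")
    return data


print(read_file("File", name_node, data_nodes))  # b'aaabbbcc'
```

Because every block has several holders, the same call still succeeds if one Data Node is removed from `data_nodes`.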
HDFS—Name Node Failure(1/2)
• Name Node failure
[Figure: the Name Node with its block metadata (Block1: DN1, DN3, DN5; Block2: DN1, DN2, DN3; Block3: DN1, DN4, DN5; …and so on….) and the Data Nodes DN1 to DN5.]
➢ Single Point of Failure (if this one node fails, everything fails)
HDFS—Name Node Failure(2/2)
• Name Node failure
[Figure: the Name Node with its block metadata, the Data Nodes DN1 to DN5, and a Secondary Name Node.]
Secondary Name Node
➢ Connects to the Name Node every hour (default).
➢ Keeps a checkpoint copy of the Name Node metadata (it is not a hot standby).
➢ The checkpoint can be used to rebuild the Name Node if it fails.
HDFS—Data Nodes Failure(1/2)
• Data Nodes failure
[Figure: the Name Node with its block metadata and the Data Nodes DN1 to DN5, with a "?".]
HDFS—Data Nodes Failure(2/2)
• Data Nodes failure
[Figure: the Data Nodes DN1 to DN5 send Heartbeat messages to the Name Node.]
➢ Data Nodes send a heartbeat to the Name Node every 3 seconds.
➢ A Data Node is regarded as “DEAD” if it does not send a heartbeat for about 10 minutes.
➢ When a Data Node is dead, the Name Node has its blocks re-replicated on other Data Nodes.
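The heartbeat rule can be illustrated with a small sketch. It assumes the timings given on the slide (3 seconds, 10 minutes) and uses invented timestamps; `dead_nodes` and `under_replicated` are hypothetical helper names, not Hadoop functions.

```python
HEARTBEAT_INTERVAL_S = 3   # a Data Node reports this often
DEAD_AFTER_S = 10 * 60     # silence longer than this means "DEAD"


def dead_nodes(last_heartbeat, now, dead_after=DEAD_AFTER_S):
    """Return the Data Nodes whose last heartbeat is too old."""
    return sorted(dn for dn, seen in last_heartbeat.items()
                  if now - seen > dead_after)


def under_replicated(block_map, dead, replication=3):
    """Map each affected block to the number of replicas it is missing."""
    missing = {}
    for block, holders in block_map.items():
        live = [dn for dn in holders if dn not in dead]
        if len(live) < replication:
            missing[block] = replication - len(live)
    return missing


block_map = {
    "Block1": ["DN1", "DN3", "DN5"],
    "Block2": ["DN1", "DN2", "DN3"],
    "Block3": ["DN1", "DN4", "DN5"],
}
# Invented timestamps in seconds; DN5 has been silent for 603 seconds.
last_heartbeat = {"DN1": 1000, "DN2": 1000, "DN3": 1000, "DN4": 1000, "DN5": 398}

dead = dead_nodes(last_heartbeat, now=1001)
print(dead)                               # ['DN5']
print(under_replicated(block_map, dead))  # {'Block1': 1, 'Block3': 1}
```

In this example the Name Node would schedule one new copy each of Block 1 and Block 3 on Data Nodes that are still alive.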
Architecture
Masters: Name Node and Secondary Name Node (Data Store); Job Tracker (Data Processing)
Slaves: four machines, each running a Data Node (Data Store) & a Task Tracker (Data Processing)
MapReduce Algorithm
Client: (e.g.) How many times does “POLab” occur in File?
[Figure: steps (1) and (2) connect the Client, the Job Tracker, and the Name Node, which holds the block locations.]
Blk1: DN1, DN7, DN8
Blk2: DN2, DN5, DN6
Blk3: DN4, DN12, DN13
Task Trackers run on DN1, DN2, DN3, and DN4.
(3) Map: Blk 1: POLab = 3; Blk 2: POLab = 0; Blk 3: POLab = 11
(4) Reduce: POLab = 3 + 0 + 11 = 14
➢ A divide and conquer algorithm
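The “POLab” counting example can be reproduced on a single machine as a rough sketch of the Map and Reduce steps. The block texts below are invented so that they contain 3, 0, and 11 occurrences, matching the slide; `map_count` and `reduce_counts` are hypothetical names, and real Hadoop jobs are written against the MapReduce API rather than like this.

```python
def map_count(block_text, word):
    """Map step: count the word inside one block, independently of the others."""
    return block_text.split().count(word)


def reduce_counts(partial_counts):
    """Reduce step: combine the per-block counts into one answer."""
    return sum(partial_counts)


# Invented block contents with 3, 0, and 11 occurrences of "POLab".
blocks = {
    "Blk 1": "POLab " * 3 + "other words",
    "Blk 2": "nothing to see here",
    "Blk 3": "POLab " * 11,
}

partial = {name: map_count(text, "POLab") for name, text in blocks.items()}
print(partial)                          # {'Blk 1': 3, 'Blk 2': 0, 'Blk 3': 11}
print(reduce_counts(partial.values()))  # 14
```

Each `map_count` call touches only its own block, which is why the Map step can run on the Data Node that already stores that block.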
Hadoop Ecosystem
http://www.inside.com.tw/2015/03/12/big-data-4-hadoop
References (learning map)
• 認識大數據的黃色小象幫手 –– Hadoop ("Meet Big Data's Little Yellow Elephant Helper: Hadoop")
• HDFS Explained as Comics
• Understanding Hadoop Clusters and the Network
• How to run Hadoop on Linux? (Practice)*
