20110227 hadoop disk-linuxfb

  1. What does a Hadoop Process do on Your Machine
     Wang Xu, gnawux@gmail.com, Feb 2011
  2. Outline
     1. Hadoop: a Clone of Google Infrastructure
     2. What's MapReduce
     3. How HDFS Supports MapReduce and Others
     4. What's DataNode Doing
     5. What's TaskTracker Doing
  3. Apache Hadoop: History & Dreams
     - Nutch, Lucene...
     - Yahoo and search engines...
     - Doug Cutting...
     - Yahoo, Cloudera, & Facebook
  4. The Hadoop Family
     Projects and Their Relatives in Google
     - Common: IPC, utils, and other common stuff
     - HDFS ⇔ Google GFS: distributed file system
     - MapReduce ⇔ Google MapReduce: framework for distributed computing
     - HBase ⇔ BigTable: column-family based non-relational database
     - ZooKeeper ⇔ Chubby: distributed lock service, for quorum
     - Avro ⇔ Protocol Buffers: cross-language data serialization and exchange
     - Hive & Pig: data warehouse based on the MapReduce platform
     - Oozie: data flow engine
  5. How Hadoop Helps Your Business
     Usages of Hadoop
     - Search engines: the Nutch project, Yahoo (now Bing based), and some others
     - Log analysis: user behavior, network signalling, etc.
     - The new messaging system of Facebook is based on HBase
     - Advertising: Yahoo and other companies
     - Hive is used at Facebook
  6. The Nature of MapReduce
     Map in Functional Programming
     - Map: map({1,2,3,4}, (×2)) ⇒ {2,4,6,8}
     - Every element is processed with the given function
     - Elements do not affect each other
     - The input is immutable, and the output is a new list
     - Fit for parallel processing
     Reduce in Functional Programming
     - Reduce: reduce({1,2,3,4}, (×)) ⇒ {24}
     - All the elements in the list are processed together
     - The input is immutable, and the output is a new list
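
The two functional primitives on this slide can be tried directly in Python. This is only an illustrative sketch of the idea, not Hadoop code; the variable names are made up for the example.

```python
from functools import reduce
from operator import mul

numbers = [1, 2, 3, 4]

# Map: each element is transformed independently of the others,
# and the input list is left untouched.
doubled = list(map(lambda x: x * 2, numbers))

# Reduce: all elements are folded together with one binary operator.
product = reduce(mul, numbers)

print(doubled)  # [2, 4, 6, 8]
print(product)  # 24
```

Because no element of the map depends on another, the map calls could run on different machines, which is the property MapReduce exploits.
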
  7. Distributed MapReduce
     A Map Task's Life
     - Input: a segment of the input records (from DFS)
     - Job: process the records one by one and emit zero, one, or more K-V pairs per record
     - Then: work as a server, waiting for the reducers' requests to retrieve K-V pairs
     A Reduce Task's Life
     - Shuffle: retrieve the pairs for its specific keys from all map tasks
     - Sort: group and merge the K-V pairs
     - Reduce: process the grouped pairs and write the output file back to DFS
  8. The Landscape of MapReduce
     1. Maps read data from DFS separately
     2. Maps process the data and do not communicate with each other
     3. Maps keep their results in node-local storage (local disk)
     4. Reduces retrieve data from all the Maps
     5. Reduces do not communicate with each other either
     6. Reduces write their results back to DFS
     Figure: Data Flow of MapReduce (Map 1 to Map 4 feeding Reduce 1 to Reduce 3)
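
The data flow above can be imitated in a single process. The following is a minimal word-count sketch under stated assumptions: `map_task`, `partition`, and `run_job` are hypothetical names, the partitioner is a toy byte-sum hash rather than Hadoop's, and plain lists stand in for DFS splits and local disks.

```python
from collections import defaultdict

def map_task(records):
    """Emit zero or more (key, value) pairs for each input record."""
    for line in records:
        for word in line.split():
            yield word, 1

def partition(key, num_reducers):
    # Toy partitioner (an assumption): a stable hash so runs are reproducible.
    return sum(key.encode()) % num_reducers

def run_job(splits, num_reducers):
    # Steps 1-3: each map task reads its own split and keeps its output
    # locally, bucketed by the reducer that will later ask for it.
    map_outputs = []
    for split in splits:
        buckets = defaultdict(list)
        for key, value in map_task(split):
            buckets[partition(key, num_reducers)].append((key, value))
        map_outputs.append(buckets)

    results = {}
    for r in range(num_reducers):
        # Step 4 (shuffle): reducer r fetches its bucket from every map task.
        fetched = [kv for buckets in map_outputs for kv in buckets.get(r, [])]
        # Sort: bring equal keys together.
        grouped = defaultdict(list)
        for key, value in sorted(fetched):
            grouped[key].append(value)
        # Step 6 (reduce): one output record per key.
        for key, values in grouped.items():
            results[key] = sum(values)
    return results

splits = [["a b a"], ["b c"], ["a c c"]]
counts = run_job(splits, num_reducers=2)
print(counts)
```

Each key lands in exactly one reducer's bucket, so reducers never need to talk to each other, matching step 5.
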
  9. Hadoop Distributed File System
     Commodity PC based Massive Data Storage System
     - Redundancy: blocks are replicated to different nodes in different racks
     - Location awareness: tasks can be scheduled onto the nodes that store the data
     - Write once, read many times
     - Large files are split into blocks
  10. The Role of a DataNode
      Block (chunk) container of HDFS
      - Manages its data directories like a soft RAID0: block files are written round-robin
      - Keeps a block-to-directory map in memory
      - DataNodeProtocol (served by the NameNode): communicates with the NameNode to report, send heartbeats, and get commands
      - DataTransferProtocol: communicates with clients and other DataNodes to transfer blocks
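
The first two bullets can be pictured with a toy model. This is a sketch only: `BlockStore` and its methods are hypothetical names, and the directory paths and block ids are invented for the example; the real DataNode logic is more involved.

```python
from itertools import cycle

class BlockStore:
    """Toy model of round-robin directory choice plus an in-memory block map."""

    def __init__(self, data_dirs):
        self._next_dir = cycle(data_dirs)  # round-robin over the data directories
        self.block_dir = {}                # block id -> directory holding it

    def add_block(self, block_id):
        directory = next(self._next_dir)
        self.block_dir[block_id] = directory
        return directory

    def locate(self, block_id):
        # Answered from memory, without scanning the disks.
        return self.block_dir[block_id]

store = BlockStore(["/data1", "/data2", "/data3"])
placed = [store.add_block(f"blk_{i}") for i in range(5)]
print(placed)
```
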
  11. DataNode on Disk
      Block Files
      - Those blk_XXX files
      - 64 MB or 128 MB blocks
      Meta Files
      - Those blk_XXX.meta files
      - Header: layout version and bytes per checksum
      - Checksums
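
To make the header-plus-checksums idea concrete, here is a small sketch that builds and verifies such a meta blob. The field widths (a 2-byte version and a 4-byte bytes-per-checksum value) and the use of CRC32 are assumptions chosen for illustration, not a description of the exact on-disk format; `build_meta` and `verify` are hypothetical helpers.

```python
import struct
import zlib

# Assumed header layout: 2-byte layout version, 4-byte bytes-per-checksum.
HEADER = struct.Struct(">hI")

def build_meta(block_data, layout_version, bytes_per_checksum):
    """Return the header followed by one CRC32 per chunk of the block."""
    out = bytearray(HEADER.pack(layout_version, bytes_per_checksum))
    for off in range(0, len(block_data), bytes_per_checksum):
        chunk = block_data[off:off + bytes_per_checksum]
        out += struct.pack(">I", zlib.crc32(chunk))
    return bytes(out)

def verify(block_data, meta):
    """Recompute the checksums from the data and compare with the meta bytes."""
    version, bytes_per_checksum = HEADER.unpack_from(meta, 0)
    return build_meta(block_data, version, bytes_per_checksum) == meta

data = b"x" * 1300
meta = build_meta(data, layout_version=1, bytes_per_checksum=512)
print(len(meta), verify(data, meta))
```

Checksumming small chunks rather than the whole block means a reader can validate just the part of the block it actually reads.
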
  12. Block Writing to a DataNode
      The Pipeline
      - Set up the pipeline: Client → DataNode1 → DataNode2 → DataNode3
      - DataNode: receives a packet and forwards it to the next DataNode
      - DataNode writes the received data buffer
      - DataNode then writes the corresponding meta data
      - DataNode flushes the file stream
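
The forwarding order in the pipeline can be sketched with three in-memory objects. `ToyDataNode` is a hypothetical class; acknowledgements, checksums, and failure handling, which a real pipeline needs, are deliberately left out.

```python
class ToyDataNode:
    """In-memory stand-in for one DataNode in a write pipeline."""

    def __init__(self, name, downstream=None):
        self.name = name
        self.downstream = downstream  # next node in the pipeline, if any
        self.block = bytearray()      # stands in for the block file

    def receive(self, packet):
        # Forward the packet to the next node first, then store it locally.
        if self.downstream is not None:
            self.downstream.receive(packet)
        self.block += packet

# Client -> dn1 -> dn2 -> dn3
dn3 = ToyDataNode("dn3")
dn2 = ToyDataNode("dn2", downstream=dn3)
dn1 = ToyDataNode("dn1", downstream=dn2)

for packet in (b"abc", b"def", b"g"):
    dn1.receive(packet)  # the client only ever talks to the first node

print(bytes(dn3.block))
```

The client sends each packet once; replication bandwidth is spread across the DataNodes instead of being paid three times by the client.
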
  13. The Role of a TaskTracker
      Local Commander of a Node
      - Runs from beginning to end
      - Gets tasks from the JobTracker, the Big BOSS
      - Both Map and Reduce tasks are run by the TaskTracker
      - Assigns tasks to Mapper and Reducer processes
      - Works as an HTTP server (Jetty) for data transfer between TaskTrackers
  14. Daily Life of a Mapper
      Direct Mapper Output
      - Run map() against every record and collect the K-Vs
      - Write each K-V pair into the file (in OutputFormat) as soon as it is produced
      - Flush the file
      Buffered Mapper (The Normal Case)
      - Run map() against every record and collect the K-Vs
      - Collect the K-Vs into a buffer whose size is set by io.sort.mb
      - Spill to an external file when the map output fills the buffer
      - Finally, do an external sort (with an optional combiner) and write to the final file
      - File: $local/taskTracker/jobcache/jobid/taskid/file.out
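
The buffered case (collect, spill, then merge with an optional combiner) can be sketched as follows. This is a simplification under assumptions: `buffered_map_output` is a hypothetical function, the buffer limit counts records whereas io.sort.mb is a size in megabytes, and spills are kept in memory instead of in files.

```python
import heapq

def buffered_map_output(pairs, buffer_limit, combiner=None):
    """Collect pairs in a buffer, spill sorted runs, then merge the runs."""
    spills, buffer = [], []
    for kv in pairs:
        buffer.append(kv)
        if len(buffer) >= buffer_limit:
            spills.append(sorted(buffer))  # spill a sorted run
            buffer = []
    if buffer:
        spills.append(sorted(buffer))      # last partial run

    merged = heapq.merge(*spills)          # external-sort style k-way merge
    if combiner is None:
        return list(merged)

    # Optional combiner: fold adjacent values that share a key.
    combined = []
    for key, value in merged:
        if combined and combined[-1][0] == key:
            combined[-1] = (key, combiner(combined[-1][1], value))
        else:
            combined.append((key, value))
    return combined

pairs = [("b", 1), ("a", 1), ("b", 1), ("c", 1), ("a", 1)]
final = buffered_map_output(pairs, buffer_limit=2, combiner=lambda x, y: x + y)
print(final)
```

Running the combiner on the map side shrinks the data that reducers later have to copy over the network.
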
  15. Illustration of Map and Combiner from Yahoo
      Combiner step inserted into the MapReduce data flow
      Figure: http://developer.yahoo.com/hadoop/tutorial/module4.html
  16. Life of a Reducer
      Shuffle & Sort
      - Copy the map results from all Maps
      - Store the map outputs on disk or in memory
      - File: $local/taskTracker/jobcache/jobid/taskid/output/maplocationid.out
      - Sort: merge the map outputs (like the combiner step in Map; or rather, the combiner is like this sort)
      Reduce
      - Write the result out to HDFS with the OutputFormat
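
Since each copied map output is already sorted, the reducer's sort phase is essentially a merge. A minimal sketch, assuming sorted in-memory lists in place of the copied files; `reduce_task` is a hypothetical name, and summing is just one possible reduce function.

```python
import heapq
from itertools import groupby
from operator import itemgetter

def reduce_task(map_outputs, reduce_fn):
    """Merge sorted map outputs, group by key, and apply reduce_fn per key."""
    merged = heapq.merge(*map_outputs)  # k-way merge of the sorted runs
    return [
        (key, reduce_fn(key, [value for _, value in group]))
        for key, group in groupby(merged, key=itemgetter(0))
    ]

# One sorted list per map task, as fetched during the shuffle.
map_outputs = [
    [("a", 2), ("b", 1)],
    [("a", 1), ("c", 3)],
    [("b", 4)],
]
result = reduce_task(map_outputs, lambda key, values: sum(values))
print(result)
```
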
  17. Q&A
