20110227 hadoop disk-linuxfb
Published in: Technology
Transcript

  • 1. What does a Hadoop Process do on Your Machine
      Wang Xu, gnawux@gmail.com
      Feb 2011
  • 2. Outline
      1. Hadoop: a Clone of Google Infrastructure
      2. What’s MapReduce
      3. How HDFS Supports MapReduce and Others
      4. What’s DataNode Doing
      5. What’s TaskTracker Doing
  • 3. Apache Hadoop: History & Dreams
        • Nutch, Lucene...
        • Yahoo and search engines...
        • Doug Cutting...
        • Yahoo, Cloudera, & Facebook
  • 4. The Hadoop Family
      Projects and Their Relatives in Google
        • Common: IPC, utils, and other common code
        • HDFS ⇔ Google GFS: distributed file system
        • MapReduce ⇔ Google MapReduce: framework for distributed computing
        • HBase ⇔ BigTable: column-family-based non-relational database
        • ZooKeeper ⇔ Chubby: distributed lock service, for quorum
        • Avro ⇔ Protocol Buffers: cross-language data serialization and exchange
        • Hive & Pig: data warehouse based on the MapReduce platform
        • Oozie: data flow engine
  • 5. How Hadoop Helps Your Business
      Usages of Hadoop
        • Search engines: the Nutch project, Yahoo (now Bing-based), and some others
        • Log analysis: user behavior, network signalling, etc.
        • The new messaging system of Facebook is based on HBase
        • Advertising: Yahoo and other companies
        • Hive is used at Facebook
  • 6. The Nature of MapReduce
      Map in Functional Programming
        • Map: map({1,2,3,4}, (×2)) ⇒ {2,4,6,8}
        • Every element is processed with the given method
        • Elements do not affect each other
        • The input is immutable, and the output is a new list
        • Fit for parallel processing
      Reduce in Functional Programming
        • Reduce: reduce({1,2,3,4}, (×)) ⇒ {24}
        • All the elements in the list are processed together
        • The input is immutable, and the output is a new list
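The two functional primitives on slide 6 can be tried directly in Python. This is only an illustration of the idea, not Hadoop code; the names `numbers`, `doubled`, and `product` are made up for the example.

```python
from functools import reduce
from operator import mul

numbers = (1, 2, 3, 4)  # a tuple, so the input cannot be modified in place

# Map: apply the same function to every element independently.
doubled = list(map(lambda x: x * 2, numbers))

# Reduce: fold all elements together with one binary operation.
product = reduce(mul, numbers)

print(doubled)  # [2, 4, 6, 8]
print(product)  # 24
```

Because each call of the mapped function sees only its own element, the map step can be spread over many workers, whereas the reduce step needs all of its inputs brought together.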
  • 7. Distributed MapReduce
      A Map Task’s Life
        • Input: a segment of the input records (from DFS)
        • Job: process records one by one, emitting 0, 1, or more K-V pairs per record
        • Then: work as a server, waiting for the reducers’ K-V retrieval requests
      A Reduce Task’s Life
        • Shuffle: retrieve the K-V pairs for specific keys from all map tasks
        • Sort: group and merge the K-V pairs
        • Reduce: write the result file back to DFS
  • 8. The Landscape of MapReduce
      1. Maps read data from DFS separately
      2. Maps process the data and do not communicate with each other
      3. Maps keep their results in node-local storage (local disk)
      4. Reduces retrieve data from all the Maps
      5. Reduces do not communicate with each other either
      6. Reduces write the results back to DFS
      Figure: Data Flow of MapReduce (Map 1 to Map 4 feeding Reduce 1 to Reduce 3)
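The six steps on slide 8 can be simulated in a single process with a word count. This is a rough sketch under assumptions: lists and dicts stand in for DFS and local disk, `crc32` stands in for the key's hash code, and all function names are hypothetical.

```python
import zlib
from collections import defaultdict

def partition(key, num_reducers):
    # crc32 is a deterministic stand-in for the key's hash code
    return zlib.crc32(key.encode("utf-8")) % num_reducers

def run_map(split, num_reducers):
    """Steps 1-3: read one split, emit (word, 1), keep output 'locally' per reducer."""
    local = defaultdict(list)
    for line in split:
        for word in line.split():
            local[partition(word, num_reducers)].append((word, 1))
    return local

def run_reduce(reducer_id, map_outputs):
    """Steps 4-6: fetch this reducer's partition from every map, sort, group, sum."""
    fetched = []
    for local in map_outputs:
        fetched.extend(local.get(reducer_id, []))
    fetched.sort()
    totals = {}
    for word, count in fetched:
        totals[word] = totals.get(word, 0) + count
    return totals

splits = [["a b a"], ["b c"], ["a"], ["c c"]]  # one split per map task
num_reducers = 3
map_outputs = [run_map(s, num_reducers) for s in splits]
results = [run_reduce(r, map_outputs) for r in range(num_reducers)]

merged = {}
for part in results:
    merged.update(part)
print(merged)  # counts: a=3, b=2, c=3 (key order depends on the partitioning)
```

Maps never talk to each other and neither do reduces; the only exchange is each reduce pulling its own partition from every map, which is what makes every key land in exactly one reduce.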
  • 9. Hadoop Distributed File System
      Commodity-PC-based Massive Data Storage System
        • Redundancy: blocks are replicated to different nodes in different racks
        • Location awareness: tasks can be scheduled to the nodes storing the data
        • Write once, read many times
        • Large files are split into blocks
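As a small worked example of the splitting and redundancy points on slide 9, the following sketch computes how a file is cut into blocks. The 64 MB block size is taken from slide 11; the 200 MB file and the replication factor of 3 are assumptions chosen for illustration, and `split_into_blocks` is a hypothetical helper.

```python
MB = 1024 * 1024

def split_into_blocks(file_size, block_size=64 * MB):
    """Return the sizes of the blocks a file of file_size bytes is cut into."""
    full, rest = divmod(file_size, block_size)
    return [block_size] * full + ([rest] if rest else [])

blocks = split_into_blocks(200 * MB)
replication = 3  # assumed replication factor

print(len(blocks), [b // MB for b in blocks])  # 4 [64, 64, 64, 8]
print(len(blocks) * replication)               # 12 block replicas in the cluster
```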
  • 10. The Role of a DataNode
      Block (chunk) container of HDFS
        • Manages directories like a software RAID0: block files are written round-robin
        • Keeps a block-to-directory map in memory
        • DataNodeProtocol (by NameNode): communicates with the NameNode to report, send heartbeats, and get commands
        • DataTransferProtocol: communicates with clients and other DataNodes to transfer blocks
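The first two points on slide 10 (round-robin placement and the in-memory block-to-directory map) can be modelled in a few lines. This is a toy model, not the DataNode implementation; the class, its methods, and the directory names are all hypothetical.

```python
import itertools

class BlockStore:
    """Toy model of a DataNode's storage directories (all names hypothetical)."""

    def __init__(self, dirs):
        self._next_dir = itertools.cycle(dirs)  # round-robin over the directories
        self.block_to_dir = {}                  # in-memory block -> directory map

    def add_block(self, block_id):
        directory = next(self._next_dir)
        self.block_to_dir[block_id] = directory
        return directory

    def locate(self, block_id):
        return self.block_to_dir[block_id]

store = BlockStore(["/data1", "/data2", "/data3"])
placed = [store.add_block(f"blk_{i}") for i in range(4)]
print(placed)  # ['/data1', '/data2', '/data3', '/data1']
```

Keeping the map in memory means a read request can be answered with one dictionary lookup instead of a scan over every directory.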
  • 11. DataNode on Disk
      Block Files
        • Those blk_XXX files
        • 64MB or 128MB blocks
      Meta Files
        • Those blk_XXX.meta files
        • Header: layout version and bytes per checksum
        • Checksums
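To make the meta file idea on slide 11 concrete, here is a minimal sketch of per-chunk checksumming. It assumes CRC32 and 512 bytes per checksum; the header packing below (a 2-byte version followed by a 4-byte chunk size) is a hypothetical layout for illustration and should not be read as the actual on-disk format.

```python
import struct
import zlib

def build_meta(data: bytes, bytes_per_checksum: int = 512, version: int = 1) -> bytes:
    """Pack a toy header followed by one CRC32 per chunk of data."""
    header = struct.pack(">hi", version, bytes_per_checksum)  # hypothetical layout
    sums = b"".join(
        struct.pack(">I", zlib.crc32(data[i:i + bytes_per_checksum]))
        for i in range(0, len(data), bytes_per_checksum)
    )
    return header + sums

def verify(data: bytes, meta: bytes) -> bool:
    """Recompute the checksums and compare them with the stored ones."""
    version, bytes_per_checksum = struct.unpack(">hi", meta[:6])
    return build_meta(data, bytes_per_checksum, version) == meta

block = bytes(range(256)) * 5  # 1280 bytes, so 3 chunks of up to 512 bytes
meta = build_meta(block)
print(len(meta))               # 18: a 6-byte header plus 3 checksums of 4 bytes
print(verify(block, meta))     # True
```

Storing one checksum per small chunk rather than one per block lets a reader detect corruption in the part it actually read without rescanning the whole block.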
  • 12. Block Writing to DataNode
      The Pipeline
        • Set up the pipeline: Client → DataNode1 → DataNode2 → DataNode3
        • Each DataNode receives a packet and forwards it to the next DataNode
        • The DataNode writes the received data buffer
        • The DataNode then writes the corresponding meta
        • The DataNode flushes the file stream
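The write pipeline on slide 12 can be sketched as a chain of objects that forward each packet before storing it. This is a simplified model under assumptions: the `DataNode` class and its `receive` method are hypothetical, in-memory buffers stand in for the block and meta files, and acknowledgements and flushing are left out.

```python
import zlib

class DataNode:
    """Toy DataNode: stores packets and forwards them downstream (hypothetical API)."""

    def __init__(self, name, downstream=None):
        self.name = name
        self.downstream = downstream
        self.block = bytearray()  # stands in for the block file
        self.checksums = []       # stands in for the meta file

    def receive(self, packet: bytes):
        if self.downstream is not None:
            self.downstream.receive(packet)        # forward to the next DataNode
        self.block.extend(packet)                  # write the received data
        self.checksums.append(zlib.crc32(packet))  # then write the matching meta

# Set up the pipeline: client -> dn1 -> dn2 -> dn3
dn3 = DataNode("dn3")
dn2 = DataNode("dn2", downstream=dn3)
dn1 = DataNode("dn1", downstream=dn2)

for packet in (b"hello ", b"hdfs"):
    dn1.receive(packet)  # the client only talks to the first DataNode

print(bytes(dn3.block))  # b'hello hdfs'
```

The client sends each packet once and the DataNodes relay it among themselves, so the client's outgoing bandwidth does not grow with the number of replicas.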
  • 13. The Role of a TaskTracker
      Local Commander of a Node
        • Runs from beginning to end
        • Gets tasks from the JobTracker, the Big BOSS
        • Both Map and Reduce tasks are run by the TaskTracker
        • Assigns tasks to Mapper and Reducer processes
        • Works as an HTTP server (Jetty) for data transfer between TaskTrackers
  • 14. Daily Life of a Mapper
      Direct Mapper Output
        • Run map() against every record and collect the K-V pairs
        • Write each K-V pair into the file (in OutputFormat) as soon as it is produced
        • Flush the file
      Buffered Mapper (The Normal Case)
        • Run map() against every record and collect the K-V pairs
        • Collect the K-V pairs into a buffer whose size is set by io.sort.mb
        • Spill to an external file when the map output fills the buffer
        • Finally, do an external sort (with an optional Combiner) and write to the final files
        • file: $local/taskTracker/jobcache/jobid/taskid/file.out
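The buffered case on slide 14 follows the classic external-sort pattern: sorted runs are spilled whenever the buffer fills, then merged at the end. The sketch below is an assumption-laden model: the buffer limit counts pairs rather than megabytes as io.sort.mb does, Python lists stand in for spill files, the combiner is omitted, and `buffered_map` and `word_pairs` are hypothetical names.

```python
import heapq

def buffered_map(records, map_fn, buffer_limit):
    """Collect map output in a bounded buffer, spill sorted runs, then merge them."""
    spills, buffer = [], []
    for record in records:
        for pair in map_fn(record):
            buffer.append(pair)
            if len(buffer) >= buffer_limit:  # buffer full: sort and spill
                spills.append(sorted(buffer))
                buffer = []
    if buffer:
        spills.append(sorted(buffer))        # final spill of the leftovers
    return list(heapq.merge(*spills)), len(spills)

def word_pairs(line):
    return [(word, 1) for word in line.split()]

output, spill_count = buffered_map(["b a", "c a", "b"], word_pairs, buffer_limit=2)
print(spill_count)  # 3
print(output)       # [('a', 1), ('a', 1), ('b', 1), ('b', 1), ('c', 1)]
```

A larger buffer means fewer spills and therefore fewer runs to merge, which is why the buffer size is a common tuning knob for map-heavy jobs.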
  • 15. Illustration of Map and Combiner from Yahoo
      Combiner step inserted into the MapReduce data flow
      Figure: http://developer.yahoo.com/hadoop/tutorial/module4.html
  • 16. Life of a Reducer
      Shuffle & Sort
        • Copy the map results from all Maps
        • Store the map output on disk or in memory
        • file: $local/taskTracker/jobcache/jobid/taskid/output/maplocationid.out
        • Sort: merge the map outputs (similar to the Combiner on the map side; or rather, the Combiner works like this sort)
      Reduce
        • Write the result out to HDFS with the OutputFormat
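The sort and reduce phases on slide 16 amount to merging already-sorted map outputs and handing each key's values to the reduce function. A minimal sketch, assuming the copied map outputs are sorted lists of (key, value) pairs; `reduce_side` is a hypothetical name and `sum` plays the role of the user's reduce function.

```python
import heapq
from itertools import groupby
from operator import itemgetter

def reduce_side(map_outputs, reduce_fn):
    """Merge sorted map outputs, group by key, and apply reduce_fn per key."""
    merged = heapq.merge(*map_outputs)  # sort phase: k-way merge of sorted inputs
    for key, group in groupby(merged, key=itemgetter(0)):
        yield key, reduce_fn(value for _, value in group)

map_outputs = [
    [("a", 1), ("b", 1)],  # sorted output copied from map 1
    [("a", 1), ("c", 1)],  # sorted output copied from map 2
    [("b", 1)],            # sorted output copied from map 3
]
result = list(reduce_side(map_outputs, sum))
print(result)  # [('a', 2), ('b', 2), ('c', 1)]
```

Since the merge is lazy, the values for one key are streamed to the reduce function without holding the whole input in memory.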
  • 17. Q&A
