20110227 hadoop disk-linuxfb
    20110227 hadoop disk-linuxfb Presentation Transcript

    • What does a Hadoop Process do on Your Machine Wang Xu gnawux@gmail.com Feb, 2011 What does a Hadoop Process do on Your Machine 1 / 17.▲
    • Outline
        1. Hadoop: a Clone of Google Infrastructure
        2. What’s MapReduce
        3. How HDFS Supports MapReduce and Others
        4. What’s DataNode Doing
        5. What’s TaskTracker Doing
    • Apache Hadoop: History & Dreams
        • Nutch, Lucene...
        • Yahoo and search engines...
        • Doug Cutting...
        • Yahoo, Cloudera, & Facebook
    • The Hadoop Family: Projects and Their Relatives in Google
        • Common: IPC, utils, and other common stuff
        • HDFS ⇐⇒ Google GFS: distributed file system
        • MapReduce ⇐⇒ Google MapReduce: framework for distributed computing
        • HBase ⇐⇒ BigTable: column-family-based non-relational database
        • ZooKeeper ⇐⇒ Chubby: distributed lock service, for quorum
        • Avro ⇐⇒ Protocol Buffers: cross-language data serialization and exchange
        • Hive & Pig: data warehouse based on the MapReduce platform
        • Oozie: data flow engine
    • How Hadoop Helps Your Business: Usages of Hadoop
        • Search engines: the Nutch project, Yahoo (now Bing-based), and some others
        • Log analysis: for user behavior, network signalling, etc.
        • The new messaging system of Facebook is based on HBase
        • Advertisement: Yahoo and other companies
        • Hive is used at Facebook
    • The Nature of MapReduce
        Map in Functional Programming
        • Map: map({1,2,3,4}, (×2)) ⇒ {2,4,6,8}
        • Every element is processed with the given function
        • Elements do not affect each other
        • The input is immutable, and the output is a new list
        • Fit for parallel processing
        Reduce in Functional Programming
        • Reduce: reduce({1,2,3,4}, (×)) ⇒ {24}
        • All the elements in the list are processed together
        • The input is immutable, and the output is a new list (here, a single element)
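The two functional primitives on the slide above can be tried directly in Python. This is only a minimal sketch of the idea, not Hadoop code; the names `numbers`, `doubled`, and `product` are made up for the illustration.

```python
from functools import reduce
from operator import mul

# Immutable input, as on the slide.
numbers = (1, 2, 3, 4)

# map: each element is processed independently, producing a new list.
doubled = list(map(lambda x: x * 2, numbers))

# reduce: all elements are folded together with one binary operation.
product = reduce(mul, numbers)

print(doubled, product)
```

Because no element of the `map` step depends on another, the calls could run on different machines, which is the property MapReduce exploits.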
    • Distributed MapReduce
        A Map Task’s Life
        • Input: a segment of the input records (from DFS)
        • Job: process records one by one, emitting 0, 1, or more K-V pairs per record
        • Then: work as a server, waiting for the Reduces’ K-V retrieval requests
        A Reduce Task’s Life
        • Shuffle: retrieve the K-V pairs for its specific keys from all Map tasks
        • Sort: group and merge the K-V pairs
        • Reduce: write the result file back to DFS
    • The Landscape of MapReduce
        1. Maps read data from DFS separately
        2. Maps process the data and do not communicate with each other
        3. Maps keep their results in node-local storage (local disk)
        4. Reduces retrieve data from all the Maps
        5. Reduces do not communicate with each other either
        6. Reduces write results back to DFS
        Figure: Data Flow of MapReduce (Map 1 to Map 4, Reduce 1 to Reduce 3)
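The six steps above can be imitated in a single process with a word count. This is a toy sketch under stated assumptions: `map_task`, `partition`, `reduce_task`, and `run_job` are hypothetical names, lists stand in for DFS segments and local disks, and the CRC-based partitioner is just one deterministic way to assign keys to Reduces.

```python
import zlib
from collections import defaultdict

def map_task(segment):
    """Emit zero or more (word, 1) pairs per record; no contact with other maps."""
    return [(word, 1) for record in segment for word in record.split()]

def partition(key, num_reduces):
    """Hypothetical partitioner: decides which reduce owns a key."""
    return zlib.crc32(key.encode("utf-8")) % num_reduces

def reduce_task(reduce_id, map_outputs, num_reduces):
    """Shuffle from every map, group by key in sorted order, then sum."""
    grouped = defaultdict(list)
    for output in map_outputs:            # retrieve data from all the maps
        for key, value in output:
            if partition(key, num_reduces) == reduce_id:
                grouped[key].append(value)
    return {key: sum(values) for key, values in sorted(grouped.items())}

def run_job(segments, num_reduces=2):
    map_outputs = [map_task(s) for s in segments]   # kept "locally" per map
    results = {}
    for reduce_id in range(num_reduces):            # reduces never talk to each other
        results.update(reduce_task(reduce_id, map_outputs, num_reduces))
    return results

print(run_job([["a b a"], ["b c"]]))
```

Each key lands in exactly one Reduce, so merging the per-Reduce results never produces conflicts.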
    • Hadoop Distributed File System: Commodity-PC-based Massive Data Storage System
        • Redundancy: blocks are replicated to different nodes in different racks
        • Location awareness: tasks can be scheduled to the nodes storing the data
        • Write once, read many times
        • Large files are split into blocks
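Block splitting and rack-spread replication can be pictured with a small sketch. Everything here is an assumption made for illustration: `split_into_blocks` and `place_replicas` are hypothetical helpers, and the placement rule (one node from each of several distinct racks) is a toy policy, not the actual HDFS placement algorithm.

```python
def split_into_blocks(file_size, block_size=64 * 1024 * 1024):
    """Return (offset, length) pairs covering a file of file_size bytes."""
    blocks = []
    offset = 0
    while offset < file_size:
        length = min(block_size, file_size - offset)
        blocks.append((offset, length))
        offset += length
    return blocks

def place_replicas(block_index, nodes_by_rack, replication=3):
    """Toy policy: pick one node from each of `replication` distinct racks."""
    racks = sorted(nodes_by_rack)
    if replication > len(racks):
        raise ValueError("not enough racks for the requested replication")
    chosen = []
    for i in range(replication):
        rack = racks[(block_index + i) % len(racks)]
        nodes = nodes_by_rack[rack]
        chosen.append((rack, nodes[block_index % len(nodes)]))
    return chosen

MB = 1024 * 1024
cluster = {"rack1": ["n1", "n2"], "rack2": ["n3"], "rack3": ["n4"]}
for index, block in enumerate(split_into_blocks(150 * MB)):
    print(block, place_replicas(index, cluster))
```

A scheduler with location awareness could then prefer any of the nodes returned for a block when assigning the Map task that reads it.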
    • The Role of a DataNode: Block (chunk) container of HDFS
        • Manages dirs as a soft RAID0: writes block files round-robin
        • Keeps a block-to-dir map in memory
        • DataNodeProtocol (by NameNode): communicates with the NameNode (report, heartbeat, and get commands)
        • DataTransferProtocol: communicates with clients and other DataNodes (transfers blocks)
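The round-robin directory choice and the in-memory block-to-dir map can be sketched as follows. `ToyDataNodeStorage` and its methods are hypothetical names for this illustration, and the directories are plain strings; no files are actually created.

```python
import itertools

class ToyDataNodeStorage:
    """Spreads blocks over data dirs round-robin and remembers where each went."""

    def __init__(self, data_dirs):
        self._next_dir = itertools.cycle(data_dirs)  # soft-RAID0-like rotation
        self.block_to_dir = {}                       # in-memory block-dir map

    def add_block(self, block_id):
        directory = next(self._next_dir)
        self.block_to_dir[block_id] = directory
        return f"{directory}/blk_{block_id}"

    def locate(self, block_id):
        return self.block_to_dir[block_id]

storage = ToyDataNodeStorage(["/data1", "/data2"])
paths = [storage.add_block(block_id) for block_id in range(3)]
print(paths, storage.locate(1))
```

Keeping the map in memory means a lookup never has to scan the directories, at the cost of rebuilding the map when the process restarts.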
    • DataNode on Disk
        Block Files
        • Those blk_XXX
        • 64MB or 128MB blocks
        Meta Files
        • Those blk_XXX.meta
        • Header: layout version and bytes per checksum
        • Checksums
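The idea of a meta file (a small header followed by one checksum per fixed-size chunk of the block) can be sketched like this. The byte layout below is an assumption chosen for the illustration (a 16-bit version, a 32-bit bytes-per-checksum field, then CRC32 values); it is not claimed to match the exact on-disk format, and `build_meta` and `verify_block` are hypothetical helpers.

```python
import struct
import zlib

# Assumed header layout: big-endian int16 version + uint32 bytes per checksum.
HEADER = struct.Struct(">hI")

def build_meta(block_data, version=1, bytes_per_checksum=512):
    """Return header + one CRC32 per bytes_per_checksum-sized chunk."""
    parts = [HEADER.pack(version, bytes_per_checksum)]
    for start in range(0, len(block_data), bytes_per_checksum):
        chunk = block_data[start:start + bytes_per_checksum]
        parts.append(struct.pack(">I", zlib.crc32(chunk)))
    return b"".join(parts)

def verify_block(block_data, meta):
    """Recompute the checksums using the header's parameters and compare."""
    version, bytes_per_checksum = HEADER.unpack_from(meta, 0)
    return build_meta(block_data, version, bytes_per_checksum) == meta

data = b"x" * 1300
meta = build_meta(data)
print(len(meta), verify_block(data, meta), verify_block(b"y" + data[1:], meta))
```

Checksumming small chunks rather than the whole block lets a reader detect corruption without rereading all 64MB or 128MB.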
    • Block Writing to DataNode: The Pipeline
        • Set up the pipeline: Client → DataNode1 → DataNode2 → DataNode3
        • DataNode: receives a packet and forwards it to the next DataNode
        • DataNode writes the received data buffer
        • DataNode then writes the corresponding meta data
        • DataNode flushes the file stream
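The pipeline steps above can be modelled with chained in-memory objects. This is a sketch only: `ToyDataNode` and `write_block` are hypothetical names, packets are passed by direct method calls rather than over sockets, and a CRC32 per packet stands in for the meta data.

```python
import zlib

class ToyDataNode:
    """Receives packets, forwards them downstream, then stores data and meta."""

    def __init__(self, name, downstream=None):
        self.name = name
        self.downstream = downstream
        self.block = bytearray()   # stands in for the block file
        self.checksums = []        # stands in for the meta file
        self.flushed = False

    def receive_packet(self, data, checksum):
        if self.downstream is not None:
            self.downstream.receive_packet(data, checksum)  # forward first
        self.block.extend(data)                             # write data buffer
        self.checksums.append(checksum)                     # write matching meta

    def close(self):
        if self.downstream is not None:
            self.downstream.close()
        self.flushed = True                                 # flush the stream

def write_block(payload, first_node, packet_size=4):
    """Client side: cut the payload into packets and push them into the pipeline."""
    for start in range(0, len(payload), packet_size):
        packet = payload[start:start + packet_size]
        first_node.receive_packet(packet, zlib.crc32(packet))
    first_node.close()

dn3 = ToyDataNode("DataNode3")
dn2 = ToyDataNode("DataNode2", downstream=dn3)
dn1 = ToyDataNode("DataNode1", downstream=dn2)
write_block(b"hello hadoop", dn1)
print([bytes(node.block) for node in (dn1, dn2, dn3)])
```

The client talks only to the first node, so its outgoing bandwidth is used once no matter how many replicas are written.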
    • The Role of a TaskTracker: Local Commander of a Node
        • Runs from the beginning to the end
        • Gets tasks from the JobTracker, the Big BOSS
        • Both Map and Reduce tasks are run by the TaskTracker
        • Assigns tasks to Mapper and Reducer processes
        • Works as an HTTP server (Jetty) for data transfer between TTs
    • Daily Life of a Mapper
        Direct Mapper Output
        • Run map() against every record and collect the K-Vs
        • Write each K-V into a file (in OutputFormat) as soon as it is produced
        • Flush the file
        Buffered Mapper (The Normal Case)
        • Run map() against every record and collect the K-Vs
        • Collect the K-Vs into a buffer whose size is set by io.sort.mb
        • Spill to an external file when the map output fills the buffer
        • Finally, do an external sort (with an optional Combiner) and write to the final file
        • File: $local/taskTracker/jobcache/jobid/taskid/file.out
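The buffered case (collect, spill when full, then merge the sorted spills with an optional combiner) can be sketched as below. The assumptions are loud ones: `buffered_map_output` is a hypothetical function, the buffer limit counts pairs instead of megabytes as io.sort.mb does, and spills are kept as in-memory lists rather than files on local disk.

```python
import heapq
from itertools import groupby
from operator import itemgetter

def buffered_map_output(pairs, buffer_limit, combiner=None):
    """Collect K-V pairs, spill sorted runs when the buffer fills, then merge."""
    spills, buffer = [], []
    for pair in pairs:
        buffer.append(pair)
        if len(buffer) >= buffer_limit:      # buffer is full: spill a sorted run
            spills.append(sorted(buffer))
            buffer = []
    if buffer:                               # spill whatever is left at the end
        spills.append(sorted(buffer))

    merged = heapq.merge(*spills)            # external-sort style k-way merge
    if combiner is None:
        return list(merged)
    # Optional combiner: pre-aggregate the values of each key.
    return [
        (key, combiner(value for _, value in group))
        for key, group in groupby(merged, key=itemgetter(0))
    ]

pairs = [("b", 1), ("a", 1), ("b", 1), ("c", 1), ("a", 1)]
print(buffered_map_output(pairs, buffer_limit=2))
print(buffered_map_output(pairs, buffer_limit=2, combiner=sum))
```

Sorting each spill while it is still in memory is what makes the final step a cheap merge instead of a full sort of the whole map output.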
    • Illustration of Map and Combiner from Yahoo
        Figure: Combiner step inserted into the MapReduce data flow (http://developer.yahoo.com/hadoop/tutorial/module4.html)
    • Life of a Reducer
        Shuffle & Sort
        • Copy map results from all Maps
        • Store map output on disk or in memory
        • File: $local/taskTracker/jobcache/jobid/taskid/output/maplocationid.out
        • Sort: merge the map outputs (like the Combiner in Map; or rather, the Combiner is like Sort)
        Reduce
        • Write the result out with the OutputFormat to HDFS
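The Reduce side can be sketched the same way: merge the already-sorted outputs copied from each Map, group by key, and apply the reduce function. `reduce_side` is a hypothetical helper, the map outputs are assumed to be lists sorted by key, and the result is returned instead of being written to HDFS.

```python
import heapq
from itertools import groupby
from operator import itemgetter

def reduce_side(map_outputs, reduce_fn):
    """Merge per-map sorted outputs, group by key, and reduce each group."""
    merged = heapq.merge(*map_outputs, key=itemgetter(0))   # the Sort phase
    return [
        (key, reduce_fn(key, [value for _, value in group]))
        for key, group in groupby(merged, key=itemgetter(0))
    ]

copied = [
    [("a", 1), ("c", 1)],   # output fetched from one Map
    [("a", 2), ("b", 5)],   # output fetched from another Map
]
print(reduce_side(copied, lambda key, values: sum(values)))
```

Since every map output arrives sorted, the Reduce only needs a streaming merge and never has to hold all values for all keys at once.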
    • Q&A