Intro To Hadoop

These are slides from a lecture given at the UC Berkeley School of Information for the Analyzing Big Data with Twitter class. A video of the talk can be found at http://blogs.ischool.berkeley.edu/i290-abdt-s12/2012/08/31/video-lecture-posted-intro-to-hadoop/


    1. Intro To Hadoop Bill Graham - @billgraham Data Systems Engineer, Analytics Infrastructure Info 290 - Analyzing Big Data With Twitter UC Berkeley Information School, September 2012. This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States license. See http://creativecommons.org/licenses/by-nc-sa/3.0/us/ for details.
    2. Outline • What is Big Data? • Hadoop • HDFS • MapReduce • Twitter Analytics and Hadoop
    3. What is big data? • A bunch of data? • An industry? • An expertise? • A trend? • A cliché?
    4. Wikipedia on big data: “In information technology, big data is a loosely-defined term used to describe data sets so large and complex that they become awkward to work with using on-hand database management tools.” Source: http://en.wikipedia.org/wiki/Big_data
    5. How big is big? • 2008: Google processes 20 PB a day • 2009: Facebook has 2.5 PB user data + 15 TB/day • 2009: eBay has 6.5 PB user data + 50 TB/day • 2011: Yahoo! has 180-200 PB of data • 2012: Facebook ingests 500 TB/day
    11. That’s a lot of data Credit: http://www.flickr.com/photos/19779889@N00/1367404058/
    12. So what? s/data/knowledge/g
    14. No really, what do you do with it? • User behavior analysis • A/B test analysis • Ad targeting • Trending topics • User and topic modeling • Recommendations • And more...
    22. How to scale data?
    23. Divide and Conquer (diagram): the “Work” is partitioned into w1, w2, w3; three “workers” process the partitions into results r1, r2, r3; the results are combined into the final “Result”.
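    A minimal sketch of the same pattern in plain Java (all names and numbers are illustrative, not from the slides): partition the work among a fixed pool of workers, then combine the partial results.

        import java.util.ArrayList;
        import java.util.Arrays;
        import java.util.List;
        import java.util.concurrent.ExecutorService;
        import java.util.concurrent.Executors;
        import java.util.concurrent.Future;

        public class DivideAndConquerSum {
          public static void main(String[] args) throws Exception {
            long[] work = new long[1_000_000];
            Arrays.fill(work, 1L);

            int workers = 3;
            ExecutorService pool = Executors.newFixedThreadPool(workers);
            List<Future<Long>> results = new ArrayList<>();

            // Partition: give each worker a contiguous slice (w1, w2, w3).
            int chunk = work.length / workers;
            for (int i = 0; i < workers; i++) {
              final int start = i * chunk;
              final int end = (i == workers - 1) ? work.length : start + chunk;
              results.add(pool.submit(() -> {
                long sum = 0;                      // each worker computes a partial result
                for (int j = start; j < end; j++) sum += work[j];
                return sum;
              }));
            }

            // Combine: merge the partial results (r1, r2, r3) into the final result.
            long total = 0;
            for (Future<Long> r : results) total += r.get();
            pool.shutdown();
            System.out.println(total);             // prints 1000000
          }
        }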
    24. Parallel processing is complicated • How do we assign tasks to workers? • What if we have more tasks than slots? • What happens when tasks fail? • How do we handle distributed synchronization? Credit: http://www.flickr.com/photos/sybrenstuvel/2468506922/
    29. Data storage is not trivial • Data volumes are massive • Reliably storing PBs of data is challenging • Disk/hardware/network failures • Probability of a failure event increases with the number of machines. For example: 1000 hosts, each with 10 disks; a disk lasts 3 years; how many failures per day?
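    (From the speaker notes, the answer is about 9 disk failures per day: 1,000 hosts × 10 disks = 10,000 disks; a 3-year lifetime is roughly 1,095 days; 10,000 / 1,095 ≈ 9.)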
    30. Hadoop cluster Cluster of machines running Hadoop at Yahoo! (credit: Yahoo!)
    31. Hadoop
    32. Hadoop provides • Redundant, fault-tolerant data storage • Parallel computation framework • Job coordination http://hadoop.apache.org
    33. Joy Credit: http://www.flickr.com/photos/spyndle/3480602438/
    34. Hadoop origins • Hadoop is an open-source implementation based on GFS and MapReduce from Google • Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung. (2003) The Google File System • Jeffrey Dean and Sanjay Ghemawat. (2004) MapReduce: Simplified Data Processing on Large Clusters. OSDI 2004
    39. Hadoop Stack (diagram, top to bottom): Pig (Data Flow), Hive (SQL), and Cascading (Java); HBase (Columnar Database); MapReduce (Distributed Programming Framework); HDFS (Hadoop Distributed File System)
    40. HDFS
    41. HDFS is... • A distributed file system • Redundant storage • Designed to reliably store data using commodity hardware • Designed to expect hardware failures • Intended for large files • Designed for batch inserts • The Hadoop Distributed File System
    49. HDFS - files and blocks • Files are stored as a collection of blocks • Blocks are 64 MB chunks of a file (configurable) • Blocks are replicated on 3 nodes (configurable) • The NameNode (NN) manages metadata about files and blocks • The SecondaryNameNode (SNN) holds a backup of the NN data • DataNodes (DN) store and serve blocks
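    To make the model concrete, here is a hedged sketch of writing and reading a file through Hadoop’s Java FileSystem API; the path and payload are illustrative, and the NameNode/DataNode mechanics described above happen behind these calls.

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.FSDataInputStream;
        import org.apache.hadoop.fs.FSDataOutputStream;
        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.Path;

        public class HdfsExample {
          public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();  // reads cluster config (core-site.xml etc.)
            FileSystem fs = FileSystem.get(conf);      // HDFS when the default FS points at a NameNode

            Path file = new Path("/tmp/hello.txt");    // illustrative path

            // Write: the client streams bytes; HDFS splits them into blocks
            // and replicates each block across DataNodes.
            FSDataOutputStream out = fs.create(file);
            out.writeUTF("hello, HDFS");
            out.close();

            // Read: the client fetches blocks from DataNodes and re-assembles them.
            FSDataInputStream in = fs.open(file);
            System.out.println(in.readUTF());
            in.close();
          }
        }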
    56. Replication • Multiple copies of a block are stored • Replication strategy: copy #1 on another node on the same rack; copy #2 on another node on a different rack
    57. HDFS - writes (diagram): the client gets block placement from the NameNode (master), then writes the block to DataNodes on slave nodes spread across racks (Rack #1, Rack #2). Note: the write path for a single block is shown; a client writes multiple blocks in parallel.
    58. HDFS - reads (diagram): the client asks the NameNode (master) for the locations of blocks 1..N, reads them in parallel from the DataNodes on the slave nodes, and re-assembles them into a file.
    59. What about DataNode failures? • DNs check in with the NN to report health • Upon failure the NN orders DNs to replicate under-replicated blocks Credit: http://www.flickr.com/photos/18536761@N00/367661087/
    60. MapReduce
    61. MapReduce is... • A programming model for expressing distributed computations at a massive scale • An execution framework for organizing and performing such computations • An open-source implementation called Hadoop
    65. Typical large-data problem • Iterate over a large number of records (Map) • Extract something of interest from each (Map) • Shuffle and sort intermediate results • Aggregate intermediate results (Reduce) • Generate final output (Reduce) (Dean and Ghemawat, OSDI 2004)
    73. MapReduce paradigm • Implement two functions: Map(k1, v1) -> list(k2, v2) and Reduce(k2, list(v2)) -> list(v3) • Framework handles everything else* • Values with the same key go to the same reducer
    79. MapReduce Flow (diagram): input pairs (k1,v1)...(k6,v6) go to map tasks, which emit intermediate pairs such as (a,1), (b,2), (c,3), (c,6), (a,5), (c,2), (b,7), (c,8); shuffle and sort aggregates values by key — a:[1,5], b:[2,7], c:[2,3,6,8] — and reduce tasks produce the outputs (r1,s1), (r2,s2), (r3,s3).
    80. MapReduce - word count example

        function map(String name, String document):
          for each word w in document:
            emit(w, 1)

        function reduce(String word, Iterator partialCounts):
          totalCount = 0
          for each count in partialCounts:
            totalCount += count
          emit(word, totalCount)
    81. MapReduce paradigm - part 2 • There’s more! • Partitioners decide which key goes to which reducer: partition(k', numPartitions) -> partNumber • Divides the key space into chunks, one per parallel reducer • Default is hash-based • Combiners can combine mapper output before it is sent to the reducer • A combiner has the same signature as the reducer: Reduce(k2, list(v2)) -> list(v3)
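    To make the default concrete, here is a minimal sketch of a hash partitioner in the classic org.apache.hadoop.mapred API (the class name is illustrative); it would be registered on the job with conf.setPartitionerClass(HashKeyPartitioner.class).

        import org.apache.hadoop.io.IntWritable;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.mapred.JobConf;
        import org.apache.hadoop.mapred.Partitioner;

        // Routes each intermediate key to a reducer; this mirrors the
        // hash-based default behavior described above.
        public class HashKeyPartitioner implements Partitioner<Text, IntWritable> {
          public void configure(JobConf job) {}  // no configuration needed

          public int getPartition(Text key, IntWritable value, int numPartitions) {
            // Mask off the sign bit so the partition number is non-negative.
            return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
          }
        }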
    89. MapReduce flow with combiners and partitioners (diagram): as in slide 79, but each map task’s output first passes through a combine step (e.g., (c,3) and (c,6) combine to (c,9)) and a partition step before the shuffle and sort delivers it to the reduce tasks.
    90. MapReduce additional details • Reduce starts after all mappers complete • Mapper output gets written to disk • Intermediate data can be copied sooner • A reducer gets its keys in sorted order • Keys are not sorted across reducers • A global sort requires 1 reducer or smart partitioning
    97. MapReduce - jobs and tasks • Job: a user-submitted map and reduce implementation to apply to a data set • Task: a single mapper or reducer task • Failed tasks get retried automatically • Tasks run local to their data, ideally • JobTracker (JT) manages job submission and task delegation • TaskTrackers (TT) ask for work and execute tasks
    104. MapReduce architecture (diagram): a client submits a Job to the JobTracker on the master; TaskTrackers on the slave nodes run the individual Tasks.
    105. What about failed tasks? • Tasks will fail • The JT will retry failed tasks up to N attempts • After N failed attempts for a task, the job fails • Some tasks are slower than others • Speculative execution: the JT starts multiple copies of the same task • The first one to complete wins; the others are killed Credit: http://www.flickr.com/photos/phobia/2308371224/
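    A hedged sketch of how these behaviors are tuned on the classic JobConf (the wrapper class is illustrative; 4 is Hadoop’s usual default attempt budget):

        import org.apache.hadoop.mapred.JobConf;

        public class JobRetrySettings {
          public static JobConf configure(JobConf conf) {
            conf.setMaxMapAttempts(4);                 // N attempts before a map task is declared failed
            conf.setMaxReduceAttempts(4);              // same budget for reduce tasks
            conf.setMapSpeculativeExecution(true);     // allow backup copies of slow map tasks
            conf.setReduceSpeculativeExecution(true);  // and of slow reduce tasks
            return conf;
          }
        }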
    112. MapReduce data locality • Move computation to the data • Moving data between nodes has a cost • MapReduce tries to schedule tasks on nodes with the data • When that is not possible, the TT has to fetch the data from a DN
    117. MapReduce - Java API
      • Mapper:
          void map(WritableComparable key, Writable value,
                   OutputCollector output, Reporter reporter)
      • Reducer:
          void reduce(WritableComparable key, Iterator values,
                      OutputCollector output, Reporter reporter)
    118. MapReduce - Java API • Writable: Hadoop wrapper interface; Text, IntWritable, LongWritable, etc. • WritableComparable: Writable classes implement WritableComparable • OutputCollector: class that collects keys and values • Reporter: reports progress, updates counters • InputFormat: reads data and provides InputSplits; examples: TextInputFormat, KeyValueTextInputFormat • OutputFormat: writes data; examples: TextOutputFormat, SequenceFileOutputFormat
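    To show what the Writable contract involves, a minimal custom Writable might look like this (an illustrative sketch, not from the slides):

        import java.io.DataInput;
        import java.io.DataOutput;
        import java.io.IOException;
        import org.apache.hadoop.io.Writable;

        // A pair of counts that Hadoop can serialize between map and reduce.
        public class CountPair implements Writable {
          private long clicks;
          private long views;

          public void write(DataOutput out) throws IOException {
            out.writeLong(clicks);   // serialization order must match readFields
            out.writeLong(views);
          }

          public void readFields(DataInput in) throws IOException {
            clicks = in.readLong();
            views = in.readLong();
          }
        }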
    134. MapReduce - Counters are... • A distributed count of events during a job • A way to indicate job metrics without logging • Your friend • Bad: System.out.println(“Couldn’t parse value”); • Good: reporter.incrCounter(BadParseEnum, 1L);
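    A minimal sketch of how that counter call sits inside a mapper (the class, enum, and parsing logic are illustrative, not from the slides):

        import java.io.IOException;
        import org.apache.hadoop.io.IntWritable;
        import org.apache.hadoop.io.LongWritable;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.mapred.MapReduceBase;
        import org.apache.hadoop.mapred.Mapper;
        import org.apache.hadoop.mapred.OutputCollector;
        import org.apache.hadoop.mapred.Reporter;

        public class ParsingMapper extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, IntWritable> {

          // Counter groups are usually declared as enums on the job class.
          public enum ParseCounters { BAD_PARSE }

          public void map(LongWritable key, Text value,
                          OutputCollector<Text, IntWritable> output, Reporter reporter)
              throws IOException {
            try {
              int n = Integer.parseInt(value.toString().trim());  // illustrative parse
              output.collect(new Text("sum"), new IntWritable(n));
            } catch (NumberFormatException e) {
              // Each task increments locally; the framework aggregates a job-wide total.
              reporter.incrCounter(ParseCounters.BAD_PARSE, 1L);
            }
          }
        }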
    143. MapReduce - word count mapper

        public static class Map extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, IntWritable> {
          private final static IntWritable one = new IntWritable(1);
          private Text word = new Text();

          public void map(LongWritable key, Text value,
                          OutputCollector<Text, IntWritable> output, Reporter reporter)
              throws IOException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
              word.set(tokenizer.nextToken());
              output.collect(word, one);
            }
          }
        }
    144. MapReduce - word count reducer

        public static class Reduce extends MapReduceBase
            implements Reducer<Text, IntWritable, Text, IntWritable> {
          public void reduce(Text key, Iterator<IntWritable> values,
                             OutputCollector<Text, IntWritable> output, Reporter reporter)
              throws IOException {
            int sum = 0;
            while (values.hasNext()) {
              sum += values.next().get();
            }
            output.collect(key, new IntWritable(sum));
          }
        }
    145. MapReduce - word count main

        public static void main(String[] args) throws Exception {
          JobConf conf = new JobConf(WordCount.class);
          conf.setJobName("wordcount");

          conf.setOutputKeyClass(Text.class);
          conf.setOutputValueClass(IntWritable.class);

          conf.setMapperClass(Map.class);
          conf.setCombinerClass(Reduce.class);
          conf.setReducerClass(Reduce.class);

          conf.setInputFormat(TextInputFormat.class);
          conf.setOutputFormat(TextOutputFormat.class);

          FileInputFormat.setInputPaths(conf, new Path(args[0]));
          FileOutputFormat.setOutputPath(conf, new Path(args[1]));

          JobClient.runJob(conf);
        }
    146. MapReduce - running a job • To run word count, add files to HDFS and do:

        $ bin/hadoop jar wordcount.jar org.myorg.WordCount input_dir output_dir
    147. MapReduce is good for... • Embarrassingly parallel algorithms • Summing, grouping, filtering, joining • Off-line batch jobs on massive data sets • Analyzing an entire large dataset
    152. MapReduce is ok for... • Iterative jobs (e.g., graph algorithms) • Each iteration must read/write data to disk • IO and latency cost of an iteration is high
    156. MapReduce is not good for... • Jobs that need shared state/coordination • Tasks are shared-nothing • Shared state requires a scalable state store • Low-latency jobs • Jobs on small datasets • Finding individual records
    163. Hadoop combined architecture (diagram): the master runs the JobTracker and the NameNode, with the SecondaryNameNode on a backup host; each slave node runs a TaskTracker and a DataNode.
    164. NameNode UI • Tool for browsing HDFS
    165. JobTracker UI • Tool to see running/completed/failed jobs
    166. Running Hadoop • Multiple options • On your local machine (standalone or pseudo-distributed) • Local with a virtual machine • On the cloud (e.g., Amazon EC2) • In your own datacenter
    172. Cloudera VM • Virtual machine with Hadoop and related technologies pre-loaded • Great tool for learning Hadoop • Eases the pain of downloading/installing • Pre-loaded with sample data and jobs • Documented tutorials • VM: https://ccp.cloudera.com/display/SUPPORT/Clouderas+Hadoop+Demo+VM+for+CDH4 • Tutorial: https://ccp.cloudera.com/display/SUPPORT/Hadoop+Tutorial
    180. Twitter Analytics and Hadoop
    181. Multiple teams use Hadoop • Analytics • Revenue • Personalization & Recommendations • Growth
    182. Twitter Analytics data flow (diagram, built up across several slides): production hosts emit application event logs, which Scribe aggregators collect; a distributed crawler and a log mover bring the logs through a staging Hadoop cluster into the main Hadoop DW (HDFS). Crane imports third-party data and the social graph, tweets, and user profiles from MySQL/Gizzard; Oink moves data between the main Hadoop DW, HBase, Vertica, and MySQL; Rasvelg runs aggregations on the DW; HCatalog sits in front of the DW. Analysts, engineers, PMs, and sales consume the results through analytics web tools.
    189. Example: active users (job DAG, built up across several slides): Scribe delivers web_events and sms_events from production hosts; the log mover (via the staging cluster) lands them in the main Hadoop DW. Oink/Pig jobs cleanse, filter, transform, geo-lookup, union, and distinct the events into user_sessions. Crane imports user_profiles from MySQL/Gizzard. Rasvelg joins, groups, and counts to produce aggregations such as active_by_geo, active_by_device, and active_by_client, and Crane exports the active_by_* tables to MySQL/Vertica for the analytics dashboards.
    196. Credits • Data-Intensive Information Processing Applications ― Session #1, Jimmy Lin, University of Maryland
    197. Questions? Bill Graham - @billgraham
