Intro to HDFS and MapReduce

An introduction to HDFS and MapReduce for beginners.


Transcript

  • 1. Introduction to HDFS and MapReduce. Copyright © 2012-2013, Think Big Analytics, All Rights Reserved. Thursday, January 10, 13
  • 2. Who Am I - Ryan Tabora - Data Developer at Think Big Analytics - Big Data Consulting - Experience working with Hadoop, HBase, Hive, Solr, Cassandra, etc.
  • 4. Think Big is the leading professional services firm that's purpose-built for Big Data. • One of Silicon Valley's fastest-growing Big Data start-ups • 100% focus on Big Data consulting & Data Science solution services • Management background: Cambridge Technology, C-bridge, Oracle, Sun Microsystems, Quantcast, Accenture; C-bridge Internet Solutions (CBIS) founders & executives, founded 1996, IPO 1999 • Clients: 40+ • North America locations • US East: Boston, New York, Washington D.C. • US Central: Chicago, Austin • US West: HQ Mountain View, San Diego, Salt Lake City • EMEA & APAC
  • 5. Think Big recognized as a top pure-play Big Data vendor. Source: Forbes, February 2012
  • 6. Agenda - Big Data - Hadoop Ecosystem - HDFS - MapReduce in Hadoop - The Hadoop Java API - Conclusions
  • 7. Big Data
  • 8. A Data Shift... Source: EMC Digital Universe Study
  • 9. Motivation: "Simple algorithms and lots of data trump complex models." Halevy, Norvig, and Pereira (Google), IEEE Intelligent Systems
  • 10. Pioneers • Google and Yahoo: - Index 850+ million websites, over one trillion URLs. • Facebook ad targeting: - 840+ million users, > 50% of whom are active daily.
  • 11. Hadoop Ecosystem
  • 12. Common Tool? • Hadoop - Cluster: distributed computing platform. - Commodity, server-class hardware. - Extensible platform.
  • 13. Hadoop Origins • MapReduce and Google File System (GFS) pioneered at Google. • Hadoop is the commercially-supported open-source equivalent.
  • 14. What Is Hadoop? • Hadoop is a platform. • Distributes and replicates data. • Manages parallel tasks created by users. • Runs as several processes on a cluster. • The term Hadoop generally refers to a toolset, not a single tool.
  • 15. Why Hadoop? • Handles unstructured to semi-structured to structured data. • Handles enormous data volumes. • Flexible data analysis and machine learning tools. • Cost-effective scalability.
  • 16. The Hadoop Ecosystem • HDFS - Hadoop Distributed File System. • Map/Reduce - A distributed framework for executing work in parallel. • Hive - A SQL-like syntax with a metastore that allows SQL manipulation of data stored on HDFS. • Pig - A top-down scripting language for manipulating data. • HBase - A NoSQL, non-sequential data store.
  • 18. HDFS
  • 19. What Is HDFS? • Hadoop Distributed File System. • Stores files in blocks across many nodes in a cluster. • Replicates the blocks across nodes for durability. • Master/Slave architecture.
  • 20. HDFS Traits • Not fully POSIX compliant. • No file updates. • Write once, read many times. • Large blocks, sequential read patterns. • Designed for batch processing.
  • 21. HDFS Master • NameNode - Runs on a single node as a master process ‣ Holds file metadata (which blocks are where) ‣ Directs client access to files in HDFS • SecondaryNameNode - Not a hot failover - Maintains a copy of the NameNode metadata
  • 22. HDFS Slaves • DataNode - Generally runs on all nodes in the cluster ‣ Block creation/replication/deletion/reads ‣ Takes orders from the NameNode
  • 23. HDFS Illustrated (diagram, built up over slides 23-29): a client issues a "Put File" request; the NameNode assigns blocks to DataNodes 1-6, the client writes each block to a DataNode, and each block is replicated to two more nodes. After the put, the NameNode's metadata records the replica locations for the file's three blocks, e.g., block replicas on DataNodes 1,4,6; 2,5,3; and 3,2,6.
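The put sequence above can be sketched as a toy simulation (hypothetical names and a simple round-robin policy; real HDFS placement is rack-aware): a NameNode-style map records which DataNodes hold a replica of each block.

```python
import itertools

def put_file(blocks, datanodes, replication=3):
    # Toy NameNode: assign each block to `replication` distinct
    # DataNodes round-robin and record the locations in metadata.
    metadata = {}  # block -> list of DataNodes holding a replica
    cycle = itertools.cycle(datanodes)
    for block in blocks:
        replicas = []
        while len(replicas) < replication:
            node = next(cycle)
            if node not in replicas:
                replicas.append(node)
        metadata[block] = replicas
    return metadata

meta = put_file(["blk1", "blk2", "blk3"], [f"dn{i}" for i in range(1, 7)])
# Every block ends up on 3 distinct nodes, so losing any single
# DataNode still leaves at least two replicas of each block.
```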
  • 30. Power of Hadoop (diagram, built up over slides 30-37): a client issues a "Read File" request; the NameNode returns the block locations and the client reads the blocks from several DataNodes in parallel. If DataNode 1 dies, the NameNode simply points the reader at a surviving replica (e.g., DataNode 5).
  • 36. Read performance scales with the cluster: aggregate transfer rate = per-node transfer rate x number of machines read in parallel.
  • 37. Example: 100 MB/s per node x 3 nodes = 300 MB/s aggregate transfer rate.
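The back-of-the-envelope math on the slide generalizes to: aggregate read rate is roughly the per-node transfer rate times the number of nodes read in parallel, assuming the blocks live on different nodes and the network is not the bottleneck.

```python
def aggregate_read_rate(per_node_mb_s, parallel_nodes):
    # Idealized throughput when each block is read from a different node.
    return per_node_mb_s * parallel_nodes

# The slide's example: 3 blocks on 3 DataNodes at 100 MB/s each.
rate = aggregate_read_rate(100, 3)

# Time to read a 3000 MB file split across those 3 nodes,
# versus reading it from a single disk.
parallel_seconds = 3000 / rate
single_seconds = 3000 / 100
```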
  • 38. HDFS Shell • Easy-to-use command-line interface. • Create, copy, move, and delete files. • Administrative duties - chmod, chown, chgrp. • Set replication factor for a file. • Head, tail, cat to view files.
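For reference, the operations listed above map onto `hadoop fs` subcommands roughly as follows (paths and names are hypothetical; these require a running cluster, and on newer installs the entry point is `hdfs dfs`):

```shell
# Create a directory and copy a local file into HDFS
hadoop fs -mkdir /user/ryan/docs
hadoop fs -put local.txt /user/ryan/docs/

# Copy, move, and delete files
hadoop fs -cp /user/ryan/docs/local.txt /tmp/copy.txt
hadoop fs -mv /tmp/copy.txt /tmp/moved.txt
hadoop fs -rm /tmp/moved.txt

# Administrative duties - chmod, chown, chgrp
hadoop fs -chmod 640 /user/ryan/docs/local.txt
hadoop fs -chown ryan:analysts /user/ryan/docs/local.txt

# Set the replication factor for a file
hadoop fs -setrep 2 /user/ryan/docs/local.txt

# Tail and cat to view files
hadoop fs -cat /user/ryan/docs/local.txt
hadoop fs -tail /user/ryan/docs/local.txt
```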
  • 39. The Hadoop Ecosystem • HDFS - Hadoop Distributed File System. • Map/Reduce - A distributed framework for executing work in parallel. • Hive - A SQL-like syntax with a metastore that allows SQL manipulation of data stored on HDFS. • Pig - A top-down scripting language for manipulating data. • HBase - A NoSQL, non-sequential data store.
  • 41. MapReduce in Hadoop
  • 42. MapReduce Basics • Logical functions: Mappers and Reducers. • Developers write map and reduce functions, then submit a jar to the Hadoop cluster. • Hadoop handles distributing the Map and Reduce tasks across the cluster. • Typically batch oriented.
  • 43. MapReduce Daemons • JobTracker (Master) - Manages MapReduce jobs, giving tasks to different nodes, managing task failure • TaskTracker (Slave) - Creates individual map and reduce tasks - Reports task status to JobTracker
  • 44. MapReduce in Hadoop
  • 45. Let's look at how MapReduce actually works in Hadoop, using WordCount.
  • 46. WordCount data flow (Input, Mappers, Sort/Shuffle, Reducers, Output; diagram built up over slides 46-54): we need to convert the Input into the Output. The input lines "Hadoop uses MapReduce", "There is a Map phase", and "There is a Reduce phase" arrive at the mappers as (docID, contents) pairs such as (doc1, "…"). Each mapper emits one (word, 1) pair per token, e.g., (hadoop, 1), (uses, 1), (mapreduce, 1). The sort/shuffle phase partitions the keys alphabetically across the reducers (0-9 and a-l, m-q, r-z) and groups the values per key, e.g., (a, [1,1]), (is, [1,1]), (phase, [1,1]). Each reducer sums its lists and writes the final counts: a 2, hadoop 1, is 2, map 1, mapreduce 1, phase 2, reduce 1, there 2, uses 1.
  • 55. Map: transform one input to 0-N outputs.
  • 56. Reduce: collect multiple inputs into one output.
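The flow on these slides can be simulated in a few lines of Python (a sketch of the idea, not Hadoop itself): map each document to (word, 1) pairs, shuffle by partitioning keys into the alphabetical ranges the diagram uses, group values per key, then reduce by summing.

```python
from collections import defaultdict

DOCS = ["Hadoop uses MapReduce", "There is a Map phase", "", "There is a Reduce phase"]

def map_fn(doc):
    # Emit (word, 1) per token, lowercased, as in the slides.
    return [(w.lower(), 1) for w in doc.split()]

def partition(key):
    # Mirror the slide's three reducer key ranges: 0-9/a-l, m-q, r-z.
    c = key[0]
    if c <= 'l':
        return 0
    if c <= 'q':
        return 1
    return 2

def shuffle(pairs):
    # One dict of key -> [values] per reducer partition.
    groups = [defaultdict(list) for _ in range(3)]
    for k, v in pairs:
        groups[partition(k)][k].append(v)
    return groups

def reduce_fn(key, values):
    return key, sum(values)

pairs = [kv for doc in DOCS for kv in map_fn(doc)]
counts = dict(reduce_fn(k, vs) for g in shuffle(pairs) for k, vs in g.items())
# counts matches the Output column: a 2, hadoop 1, is 2, map 1,
# mapreduce 1, phase 2, reduce 1, there 2, uses 1.
```

Note the empty document contributes nothing, just like (doc3, "") on the slides.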
  • 57. Cluster View of MapReduce (diagram, built up over slides 57-67): a client submits a jar containing the Map (M) and Reduce (R) code to the JobTracker. The JobTracker hands map tasks to the TaskTrackers running alongside the DataNodes that hold the input blocks (Map Phase). Each map task writes its intermediate (k,v) pairs to local disk (intermediate data is stored locally, not in HDFS). The Shuffle/Sort moves and groups the intermediate pairs across nodes, the TaskTrackers run the reduce tasks (Reduce Phase), and the job completes.
  • 68. The Hadoop Java API
  • 69. MapReduce in Java
  • 70. Let's look at WordCount written in the MapReduce Java API.
  • 71. Map Code:

public class SimpleWordCountMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {

  static final Text word = new Text();
  static final IntWritable one = new IntWritable(1);

  @Override
  public void map(LongWritable key, Text documentContents,
      OutputCollector<Text, IntWritable> collector, Reporter reporter)
      throws IOException {
    String[] tokens = documentContents.toString().split("\\s+");
    for (String wordString : tokens) {
      if (wordString.length() > 0) {
        word.set(wordString.toLowerCase());
        collector.collect(word, one);
      }
    }
  }
}

Let's drill into this code...
  • 74. Mapper class with 4 type parameters for the input key-value types and the output key-value types.
  • 75. Output key-value objects we'll reuse.
  • 76. Map method with input, output "collector", and reporting object.
  • 77. Tokenize the line, "collect" each (word, 1).
  • 78. Reduce Code:

public class SimpleWordCountReducer extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {

  @Override
  public void reduce(Text key, Iterator<IntWritable> counts,
      OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    int count = 0;
    while (counts.hasNext()) {
      count += counts.next().get();
    }
    output.collect(key, new IntWritable(count));
  }
}

Let's drill into this code...
  • 81. Reducer class with 4 type parameters for the input key-value types and the output key-value types.
  • 82. Reduce method with input, output "collector", and reporting object.
  • 83. Count the counts per word and emit (word, N).
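A line-for-line Python analogue of the two classes above (a sketch of the semantics only, not the Hadoop API): the mapper splits on whitespace, lowercases, and emits (word, 1) for every non-empty token; the reducer is handed an iterator of counts for one key and sums it.

```python
import re

def simple_word_count_mapper(key, document_contents, collector):
    # Mirrors SimpleWordCountMapper.map: split("\\s+"), skip empty
    # tokens, lowercase, collect (word, 1).
    for token in re.split(r"\s+", document_contents):
        if token:
            collector.append((token.lower(), 1))

def simple_word_count_reducer(key, counts, output):
    # Mirrors SimpleWordCountReducer.reduce: sum the iterator of
    # counts and emit (word, N).
    output.append((key, sum(counts)))

collected = []
simple_word_count_mapper(0, "There is a Map phase", collected)

out = []
simple_word_count_reducer("phase", iter([1, 1]), out)
```

One detail the Python hides: the Java mapper reuses the `word` and `one` Writable objects across calls to avoid allocation, which is why they are fields rather than locals.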
  • 84. Other Options • HDFS - Hadoop Distributed File System. • Map/Reduce - A distributed framework for executing work in parallel. • Hive - A SQL-like syntax with a metastore that allows SQL manipulation of data stored on HDFS. • Pig - A top-down scripting language for manipulating data. • HBase - A NoSQL, non-sequential data store.
  • 87. Conclusions
  • 88. Hadoop Benefits • A cost-effective, scalable way to: - Store massive data sets. - Perform arbitrary analyses on those data sets.
  • 89. Hadoop Tools • Offers a variety of tools for: - Application development. - Integration with other platforms (e.g., databases).
  • 90. Hadoop Distributions • A rich, open-source ecosystem. - Free to use. - Commercially-supported distributions.
  • 91. Thank You! - Feel free to contact me at ‣ ryan.tabora@thinkbiganalytics.com - Or our solutions consultant ‣ matt.mcdevitt@thinkbiganalytics.com - As always, THINK BIG!
  • 92. Bonus Content
  • 93. The Hadoop Ecosystem • HDFS - Hadoop Distributed File System. • Map/Reduce - A distributed framework for executing work in parallel. • Hive - A SQL-like syntax with a metastore that allows SQL manipulation of data stored on HDFS. • Pig - A top-down scripting language for manipulating data. • HBase - A NoSQL, non-sequential data store.
  • 95. Hive: SQL for Hadoop
  • 96. Hive
  • 97. Let's look at WordCount written in Hive, the SQL for Hadoop.
  • 98. CREATE TABLE docs (line STRING); LOAD DATA INPATH docs OVERWRITE INTO TABLE docs; CREATE TABLE word_counts AS SELECT word, count(1) AS count FROM (SELECT explode(split(line, s)) AS word FROM docs) w GROUP BY word ORDER BY word; Copyright © 2012-2013, Think Big Analytics, All 59 Rights ReservedThursday, January 10, 13
  • 99. CREATE TABLE docs (line STRING); LOAD DATA INPATH 'docs' OVERWRITE INTO TABLE docs; CREATE TABLE word_counts AS SELECT word, count(1) AS count FROM (SELECT explode(split(line, '\s')) AS word FROM docs) w GROUP BY word ORDER BY word; Let’s drill into this code...
  • 101. Create a table to hold the raw text we’re counting. Each line is a “column”. CREATE TABLE docs (line STRING); LOAD DATA INPATH 'docs' OVERWRITE INTO TABLE docs; CREATE TABLE word_counts AS SELECT word, count(1) AS count FROM (SELECT explode(split(line, '\s')) AS word FROM docs) w GROUP BY word ORDER BY word;
  • 102. Load the text in the “docs” directory into the table. CREATE TABLE docs (line STRING); LOAD DATA INPATH 'docs' OVERWRITE INTO TABLE docs; CREATE TABLE word_counts AS SELECT word, count(1) AS count FROM (SELECT explode(split(line, '\s')) AS word FROM docs) w GROUP BY word ORDER BY word;
  • 103. Create the final table and fill it with the results from a nested query of the docs table that performs WordCount on the fly. CREATE TABLE docs (line STRING); LOAD DATA INPATH 'docs' OVERWRITE INTO TABLE docs; CREATE TABLE word_counts AS SELECT word, count(1) AS count FROM (SELECT explode(split(line, '\s')) AS word FROM docs) w GROUP BY word ORDER BY word;
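The Hive query's logic can be sketched in plain Python to make each stage concrete. This is a minimal illustration, not Hive itself; the sample lines in `docs` are hypothetical stand-ins for the loaded table:

```python
import re
from collections import Counter

# Stand-in for the "docs" table: each element is one line of raw text
# (hypothetical sample data).
docs = [
    "hello world",
    "hello hadoop",
]

# explode(split(line, '\s')): split every line on whitespace,
# producing one word per row.
words = [word for line in docs for word in re.split(r"\s+", line) if word]

# GROUP BY word with count(1) AS count, then ORDER BY word.
word_counts = sorted(Counter(words).items())

print(word_counts)  # [('hadoop', 1), ('hello', 2), ('world', 1)]
```

The nested SELECT maps to the list comprehension, and the GROUP BY/ORDER BY maps to `Counter` plus `sorted`.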
  • 105. Hive Because so many Hadoop users come from SQL backgrounds, Hive is one of the most essential tools in the ecosystem!
  • 107. The Hadoop Ecosystem • HDFS - Hadoop Distributed File System. • Map/Reduce - A distributed framework for executing work in parallel. • Hive - A SQL-like syntax with a metastore to allow SQL manipulation of data stored on HDFS. • Pig - A top-down scripting language for manipulating data. • HBase - A NoSQL data store built for random, non-sequential access.
  • 108. Pig: Data Flow for Hadoop
  • 110. Pig Let’s look at WordCount written in Pig, the Data Flow language for Hadoop.
  • 112. inpt = LOAD 'docs' USING TextLoader AS (line:chararray); words = FOREACH inpt GENERATE flatten(TOKENIZE(line)) AS word; grpd = GROUP words BY word; cntd = FOREACH grpd GENERATE group, COUNT(words); STORE cntd INTO 'output'; Let’s drill into this code...
  • 114. Like the Hive example, load the “docs” content; each line is a “field”. inpt = LOAD 'docs' USING TextLoader AS (line:chararray); words = FOREACH inpt GENERATE flatten(TOKENIZE(line)) AS word; grpd = GROUP words BY word; cntd = FOREACH grpd GENERATE group, COUNT(words); STORE cntd INTO 'output';
  • 115. Tokenize each line into words (an array) and “flatten” the array into separate records. inpt = LOAD 'docs' USING TextLoader AS (line:chararray); words = FOREACH inpt GENERATE flatten(TOKENIZE(line)) AS word; grpd = GROUP words BY word; cntd = FOREACH grpd GENERATE group, COUNT(words); STORE cntd INTO 'output';
  • 116. Collect the same words together. inpt = LOAD 'docs' USING TextLoader AS (line:chararray); words = FOREACH inpt GENERATE flatten(TOKENIZE(line)) AS word; grpd = GROUP words BY word; cntd = FOREACH grpd GENERATE group, COUNT(words); STORE cntd INTO 'output';
  • 117. Count each word. inpt = LOAD 'docs' USING TextLoader AS (line:chararray); words = FOREACH inpt GENERATE flatten(TOKENIZE(line)) AS word; grpd = GROUP words BY word; cntd = FOREACH grpd GENERATE group, COUNT(words); STORE cntd INTO 'output';
  • 118. Save the results. Profit! inpt = LOAD 'docs' USING TextLoader AS (line:chararray); words = FOREACH inpt GENERATE flatten(TOKENIZE(line)) AS word; grpd = GROUP words BY word; cntd = FOREACH grpd GENERATE group, COUNT(words); STORE cntd INTO 'output';
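The Pig dataflow above maps step for step onto ordinary collection operations. Here is a minimal Python sketch of the same pipeline; the input lines are hypothetical, and simple whitespace splitting stands in for TOKENIZE:

```python
from itertools import groupby

# LOAD 'docs' ... AS (line:chararray): one line per record
# (hypothetical sample data).
inpt = ["think big data", "big data"]

# FOREACH inpt GENERATE flatten(TOKENIZE(line)): tokenize each line,
# then flatten the resulting bags into one word per record.
words = [word for line in inpt for word in line.split()]

# GROUP words BY word: collect identical words into bags.
grpd = {key: list(group) for key, group in groupby(sorted(words))}

# FOREACH grpd GENERATE group, COUNT(words): count the size of each bag.
cntd = sorted((word, len(bag)) for word, bag in grpd.items())

print(cntd)  # [('big', 2), ('data', 2), ('think', 1)]
```

Each Pig statement becomes one line of Python, which is why Pig is often described as a step-by-step dataflow language rather than a declarative one like Hive.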
  • 120. Pig Pig and Hive overlap, but Pig is popular for ETL, e.g., data transformation, cleansing, ingestion, etc.
  • 121. Questions?