Intro to HDFS and MapReduce
 

An introduction to HDFS and MapReduce for beginners.


    Intro to HDFS and MapReduce: Presentation Transcript

    • Introduction to HDFS and MapReduce. Copyright © 2012-2013, Think Big Analytics, All Rights Reserved. Thursday, January 10, 13.
    • Who Am I: Ryan Tabora - Data Developer at Think Big Analytics - Big Data consulting - Experience working with Hadoop, HBase, Hive, Solr, Cassandra, etc.
    • Think Big is the leading professional services firm purpose-built for Big Data. One of Silicon Valley's fastest-growing Big Data startups, with a 100% focus on Big Data consulting and Data Science solution services. Management background: Cambridge Technology, C-bridge, Oracle, Sun Microsystems, Quantcast, Accenture; C-bridge Internet Solutions (CBIS) founders and executives, founded 1996, IPO 1999. Clients: 40+. North America locations - US East: Boston, New York, Washington D.C.; US Central: Chicago, Austin; US West: HQ Mountain View, San Diego, Salt Lake City; plus EMEA & APAC.
    • Think Big recognized as a top pure-play Big Data vendor. Source: Forbes, February 2012.
    • Agenda - Big Data - Hadoop Ecosystem - HDFS - MapReduce in Hadoop - The Hadoop Java API - Conclusions
    • Big Data
    • A Data Shift... Source: EMC Digital Universe Study
    • Motivation: "Simple algorithms and lots of data trump complex models." Halevy, Norvig, and Pereira (Google), IEEE Intelligent Systems
    • Pioneers: Google and Yahoo index 850+ million websites, over one trillion URLs. Facebook ad targeting: 840+ million users, more than 50% of whom are active daily.
    • Hadoop Ecosystem
    • Common Tool? Hadoop - a cluster: a distributed computing platform - commodity, server-class hardware - an extensible platform.
    • Hadoop Origins: MapReduce and the Google File System (GFS) were pioneered at Google. Hadoop is the commercially supported open-source equivalent.
    • What Is Hadoop? Hadoop is a platform: it distributes and replicates data, manages parallel tasks created by users, and runs as several processes on a cluster. The term Hadoop generally refers to a toolset, not a single tool.
    • Why Hadoop? It handles unstructured, semi-structured, and structured data; handles enormous data volumes; offers flexible data analysis and machine learning tools; and scales cost-effectively.
    • The Hadoop Ecosystem: HDFS - the Hadoop Distributed File System. MapReduce - a distributed framework for executing work in parallel. Hive - a SQL-like language with a metastore that allows SQL manipulation of data stored on HDFS. Pig - a top-down scripting language for manipulating data. HBase - a NoSQL, non-sequential data store.
    • HDFS
    • What Is HDFS? The Hadoop Distributed File System. It stores files in blocks across many nodes in a cluster and replicates the blocks across nodes for durability. Master/slave architecture.
    • HDFS Traits: not fully POSIX compliant. No file updates: write once, read many times. Large blocks and sequential read patterns. Designed for batch processing.
    • HDFS Master: NameNode - runs on a single node as a master process; holds file metadata (which blocks are where) and directs client access to files in HDFS. SecondaryNameNode - not a hot failover; maintains a copy of the NameNode metadata.
    • HDFS Slaves: DataNode - generally runs on all nodes in the cluster; handles block creation, replication, deletion, and reads; takes orders from the NameNode.
    • HDFS Illustrated: a client asks the NameNode to put a file. The file is split into blocks, and each block is written to a DataNode and then replicated to others - in the slides, one block lands on DataNodes 1, 4, and 6; another on 2, 5, and 3; a third on 3, 2, and 6. The NameNode records which DataNodes hold each block.
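The put sequence above can be sketched as a toy simulation. All names here (ToyNameNode, putBlock, locate) are hypothetical, and real HDFS placement is also rack- and load-aware; this only mimics the "pick N DataNodes per block and remember the mapping" idea:

```java
import java.util.*;

// Toy model of the put sequence: the "NameNode" assigns each block to
// `replication` distinct DataNodes and remembers the mapping so that
// later reads can be directed to the right nodes.
public class ToyNameNode {
    private final int dataNodes;     // e.g. 6 DataNodes, as in the slides
    private final int replication;   // e.g. 3 replicas per block
    private final Map<Integer, List<Integer>> blockMap = new HashMap<>();
    private final Random rng;

    public ToyNameNode(int dataNodes, int replication, long seed) {
        this.dataNodes = dataNodes;
        this.replication = replication;
        this.rng = new Random(seed);
    }

    // "Put" one block: choose `replication` distinct DataNodes for it.
    public List<Integer> putBlock(int blockId) {
        List<Integer> nodes = new ArrayList<>();
        for (int n = 1; n <= dataNodes; n++) nodes.add(n);
        Collections.shuffle(nodes, rng);
        List<Integer> placement = new ArrayList<>(nodes.subList(0, replication));
        blockMap.put(blockId, placement);
        return placement;
    }

    // A read first asks the NameNode which DataNodes hold a block.
    public List<Integer> locate(int blockId) {
        return blockMap.get(blockId);
    }

    public static void main(String[] args) {
        ToyNameNode nn = new ToyNameNode(6, 3, 42);
        for (int block = 1; block <= 3; block++) {
            System.out.println("block " + block + " -> DataNodes " + nn.putBlock(block));
        }
    }
}
```

Losing one DataNode in this model never loses a block, since every block lives on three nodes, which is the durability point the slide is making.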
    • Power of Hadoop: a client reading the file pulls different blocks from different DataNodes in parallel, and if a DataNode fails the NameNode redirects the read to a replica. Aggregate read throughput = per-node transfer rate x number of machines: at 100 MB/s per node across 3 nodes, the file is read at 300 MB/s.
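The throughput claim is simple arithmetic; a minimal sketch using the slide's example figures (100 MB/s per node, 3 nodes). The 900 MB file size is a hypothetical addition, and the model ignores network and coordination overhead:

```java
// Aggregate read throughput when blocks are read from several DataNodes
// in parallel, per the slide's simplified model (no network overhead).
public class ParallelRead {
    static double aggregateThroughput(double perNodeMBps, int machines) {
        return perNodeMBps * machines;
    }

    static double readTimeSeconds(double fileMB, double perNodeMBps, int machines) {
        return fileMB / aggregateThroughput(perNodeMBps, machines);
    }

    public static void main(String[] args) {
        // Slide figures: 100 MB/s per node x 3 nodes = 300 MB/s aggregate.
        System.out.println(aggregateThroughput(100, 3) + " MB/s");
        // A hypothetical 900 MB file then takes 3 s instead of 9 s.
        System.out.println(readTimeSeconds(900, 100, 3) + " s");
    }
}
```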
    • HDFS Shell: an easy-to-use command line interface. Create, copy, move, and delete files. Administrative duties: chmod, chown, chgrp. Set the replication factor for a file. head, tail, and cat to view files.
    • The Hadoop Ecosystem (recap): HDFS - the Hadoop Distributed File System. MapReduce - a distributed framework for executing work in parallel. Hive - a SQL-like language with a metastore that allows SQL manipulation of data stored on HDFS. Pig - a top-down scripting language for manipulating data. HBase - a NoSQL, non-sequential data store.
    • MapReduce in Hadoop
    • MapReduce Basics: logical functions: mappers and reducers. Developers write map and reduce functions, then submit a jar to the Hadoop cluster. Hadoop handles distributing the map and reduce tasks across the cluster. Typically batch oriented.
    • MapReduce Daemons: JobTracker (master) - manages MapReduce jobs, hands tasks to different nodes, and manages task failure. TaskTracker (slave) - creates individual map and reduce tasks and reports task status to the JobTracker.
    • MapReduce in Hadoop: let's look at how MapReduce actually works in Hadoop, using WordCount.
    • WordCount, end to end: Input -> Mappers -> Sort/Shuffle -> Reducers -> Output. We need to convert the input into the output.
      - Input: the lines "Hadoop uses MapReduce", "There is a Map phase", "" (an empty document), and "There is a Reduce phase" arrive at the mappers as (doc, contents) pairs.
      - Mappers: each mapper emits one (word, 1) pair per word: (hadoop, 1), (uses, 1), (mapreduce, 1), (there, 1), (is, 1), (a, 1), (map, 1), (phase, 1), and so on.
      - Sort/Shuffle: keys are partitioned across reducers by range (0-9 and a-l; m-q; r-z) and the values for each key are grouped: (a, [1,1]), (hadoop, [1]), (is, [1,1]), (map, [1]), (mapreduce, [1]), (phase, [1,1]), (reduce, [1]), (there, [1,1]), (uses, [1]).
      - Reducers: each reducer sums its value lists and writes the output: a 2, hadoop 1, is 2, map 1, mapreduce 1, phase 2, reduce 1, there 2, uses 1.
      - Map: transform one input into 0-N outputs. Reduce: collect multiple inputs into one output.
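The WordCount data flow above can be mimicked in a few lines of plain Java, with no Hadoop dependency. This is a sketch of the pipeline's shape only: ordinary collections stand in for the framework's collectors and shuffle, and the class name WordCountSim is made up:

```java
import java.util.*;

// In-memory mimic of the WordCount pipeline: map emits (word, 1) pairs,
// the "shuffle" groups values by key in sorted order, reduce sums them.
public class WordCountSim {
    public static SortedMap<String, Integer> wordCount(List<String> docs) {
        // Map phase: one input line -> 0..N (word, 1) pairs.
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : docs) {
            for (String token : line.toLowerCase().split("\\s+")) {
                if (token.length() > 0) {
                    pairs.add(Map.entry(token, 1));
                }
            }
        }
        // Sort/shuffle: group the 1s under each word, keys in sorted order.
        SortedMap<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        }
        // Reduce phase: many (word, [1,1,...]) inputs -> one (word, N) output.
        SortedMap<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> g : grouped.entrySet()) {
            int sum = 0;
            for (int c : g.getValue()) sum += c;
            counts.put(g.getKey(), sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> docs = List.of("Hadoop uses MapReduce",
                "There is a Map phase", "", "There is a Reduce phase");
        wordCount(docs).forEach((w, n) -> System.out.println(w + " " + n));
    }
}
```

Running it on the slide's four input lines reproduces the slide's output table (a 2, hadoop 1, is 2, ... uses 1). In real Hadoop the three stages run on different machines, which is exactly what the local collections hide here.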
    • Cluster View of MapReduce: the client submits the job jar to the JobTracker. The JobTracker starts map tasks on the TaskTrackers, which run alongside the DataNodes holding the input blocks. During the map phase each task emits (k, v) pairs; intermediate data is stored locally, not in HDFS. The shuffle/sort stage moves each key range to the node that will reduce it. The reduce phase runs reduce tasks on the TaskTrackers and writes the final output. Job complete!
    • The Hadoop Java API
    • MapReduce in Java: let's look at WordCount written in the MapReduce Java API.
    • Map Code:

        public class SimpleWordCountMapper extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, IntWritable> {

          static final Text word = new Text();
          static final IntWritable one = new IntWritable(1);

          @Override
          public void map(LongWritable key, Text documentContents,
              OutputCollector<Text, IntWritable> collector, Reporter reporter)
              throws IOException {
            String[] tokens = documentContents.toString().split("\\s+");
            for (String wordString : tokens) {
              if (wordString.length() > 0) {
                word.set(wordString.toLowerCase());
                collector.collect(word, one);
              }
            }
          }
        }

      Let's drill into this code: the Mapper class takes 4 type parameters for the input key-value types and the output types. The Text and IntWritable fields are output key-value objects we'll reuse across calls. The map method receives the input key and value, an output "collector", and a reporting object. It tokenizes the line on whitespace and "collects" each (word, 1) pair.
    • Reduce Code:

        public class SimpleWordCountReducer extends MapReduceBase
            implements Reducer<Text, IntWritable, Text, IntWritable> {

          @Override
          public void reduce(Text key, Iterator<IntWritable> counts,
              OutputCollector<Text, IntWritable> output, Reporter reporter)
              throws IOException {
            int count = 0;
            while (counts.hasNext()) {
              count += counts.next().get();
            }
            output.collect(key, new IntWritable(count));
          }
        }

      Let's drill into this code: the Reducer class takes 4 type parameters for the input key-value types and the output types. The reduce method receives the key, an iterator over that key's values, an output "collector", and a reporting object. It sums the counts per word and emits (word, N).
    • Other Options: HDFS - the Hadoop Distributed File System. MapReduce - a distributed framework for executing work in parallel. Hive - a SQL-like language with a metastore that allows SQL manipulation of data stored on HDFS. Pig - a top-down scripting language for manipulating data. HBase - a NoSQL, non-sequential data store.
    • Conclusions
    • Hadoop Benefits: a cost-effective, scalable way to store massive data sets and perform arbitrary analyses on those data sets.
    • Hadoop Tools: a variety of tools for application development and for integration with other platforms (e.g., databases).
    • Hadoop Distributions: a rich, open-source ecosystem - free to use, with commercially supported distributions.
    • Thank You! Feel free to contact me at ryan.tabora@thinkbiganalytics.com, or our solutions consultant at matt.mcdevitt@thinkbiganalytics.com. As always, THINK BIG!
    • Bonus Content
    • The Hadoop Ecosystem: HDFS - the Hadoop Distributed File System. MapReduce - a distributed framework for executing work in parallel. Hive - a SQL-like language with a metastore that allows SQL manipulation of data stored on HDFS. Pig - a top-down scripting language for manipulating data. HBase - a NoSQL, non-sequential data store.
    • Hive: SQL for Hadoop
    • Hive: let's look at WordCount written in Hive, the SQL for Hadoop.
    • CREATE TABLE docs (line STRING); LOAD DATA INPATH 'docs' OVERWRITE INTO TABLE docs; CREATE TABLE word_counts AS SELECT word, count(1) AS count FROM (SELECT explode(split(line, '\s')) AS word FROM docs) w GROUP BY word ORDER BY word; Let’s drill into this code...
    • Create a table to hold the raw text we’re counting. Each line is a “column”. CREATE TABLE docs (line STRING); LOAD DATA INPATH 'docs' OVERWRITE INTO TABLE docs; CREATE TABLE word_counts AS SELECT word, count(1) AS count FROM (SELECT explode(split(line, '\s')) AS word FROM docs) w GROUP BY word ORDER BY word;
    • Load the text in the “docs” directory into the table. CREATE TABLE docs (line STRING); LOAD DATA INPATH 'docs' OVERWRITE INTO TABLE docs; CREATE TABLE word_counts AS SELECT word, count(1) AS count FROM (SELECT explode(split(line, '\s')) AS word FROM docs) w GROUP BY word ORDER BY word;
    • Create the final table and fill it with the results from a nested query of the docs table that performs WordCount on the fly. CREATE TABLE docs (line STRING); LOAD DATA INPATH 'docs' OVERWRITE INTO TABLE docs; CREATE TABLE word_counts AS SELECT word, count(1) AS count FROM (SELECT explode(split(line, '\s')) AS word FROM docs) w GROUP BY word ORDER BY word;
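The three steps the Hive query performs (split lines into words, group by word, count and sort) can be mirrored in plain Python. This is a conceptual sketch of what the query computes, not how Hive actually executes it, and the sample data is made up for illustration.

```python
# Conceptual sketch of what the Hive WordCount query computes
# (NOT how Hive executes it on a cluster).
import re
from collections import Counter

docs = ["the quick brown fox", "the lazy dog"]   # the "docs" table: one line per row

# explode(split(line, '\s')): split each line on whitespace, one word per row
words = [word for line in docs for word in re.split(r"\s+", line) if word]

# GROUP BY word + count(1), then ORDER BY word
word_counts = sorted(Counter(words).items())
print(word_counts)
# → [('brown', 1), ('dog', 1), ('fox', 1), ('lazy', 1), ('quick', 1), ('the', 2)]
```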
    • Hive Because so many Hadoop users come from SQL backgrounds, Hive is one of the most essential tools in the ecosystem!
    • The Hadoop Ecosystem • HDFS - Hadoop Distributed File System. • Map/Reduce - A distributed framework for executing work in parallel. • Hive - A SQL-like syntax with a metastore that allows SQL manipulation of data stored on HDFS. • Pig - A top-down scripting language for manipulating data. • HBase - A NoSQL, non-sequential data store.
    • Pig: Data Flow for Hadoop
    • Pig Let’s look at WordCount written in Pig, the Data Flow language for Hadoop.
    • inpt = LOAD 'docs' USING TextLoader() AS (line:chararray); words = FOREACH inpt GENERATE flatten(TOKENIZE(line)) AS word; grpd = GROUP words BY word; cntd = FOREACH grpd GENERATE group, COUNT(words); STORE cntd INTO 'output'; Let’s drill into this code...
    • Like the Hive example, load the “docs” content; each line is a “field”. inpt = LOAD 'docs' USING TextLoader() AS (line:chararray); words = FOREACH inpt GENERATE flatten(TOKENIZE(line)) AS word; grpd = GROUP words BY word; cntd = FOREACH grpd GENERATE group, COUNT(words); STORE cntd INTO 'output';
    • Tokenize each line into words (an array) and “flatten” them into separate records. inpt = LOAD 'docs' USING TextLoader() AS (line:chararray); words = FOREACH inpt GENERATE flatten(TOKENIZE(line)) AS word; grpd = GROUP words BY word; cntd = FOREACH grpd GENERATE group, COUNT(words); STORE cntd INTO 'output';
    • Collect the same words together. inpt = LOAD 'docs' USING TextLoader() AS (line:chararray); words = FOREACH inpt GENERATE flatten(TOKENIZE(line)) AS word; grpd = GROUP words BY word; cntd = FOREACH grpd GENERATE group, COUNT(words); STORE cntd INTO 'output';
    • Count each word. inpt = LOAD 'docs' USING TextLoader() AS (line:chararray); words = FOREACH inpt GENERATE flatten(TOKENIZE(line)) AS word; grpd = GROUP words BY word; cntd = FOREACH grpd GENERATE group, COUNT(words); STORE cntd INTO 'output';
    • Save the results. Profit! inpt = LOAD 'docs' USING TextLoader() AS (line:chararray); words = FOREACH inpt GENERATE flatten(TOKENIZE(line)) AS word; grpd = GROUP words BY word; cntd = FOREACH grpd GENERATE group, COUNT(words); STORE cntd INTO 'output';
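The Pig script's data flow can be sketched step by step in plain Python, with one variable per Pig relation (inpt, words, grpd, cntd). This is a conceptual illustration only, not how Pig executes the script, and the input lines are made up for the example.

```python
# Conceptual sketch of the Pig data flow, one variable per relation.
# This illustrates what each step produces; Pig itself compiles the
# script into MapReduce jobs rather than running it like this.
from collections import defaultdict

inpt = ["hello world", "hello hadoop"]           # LOAD 'docs'

# FOREACH inpt GENERATE flatten(TOKENIZE(line)): one record per word
words = [word for line in inpt for word in line.split()]

# GROUP words BY word: collect identical words together
grpd = defaultdict(list)
for word in words:
    grpd[word].append(word)

# FOREACH grpd GENERATE group, COUNT(words): count each group
cntd = {word: len(group) for word, group in grpd.items()}
print(sorted(cntd.items()))                      # STORE cntd INTO 'output'
# → [('hadoop', 1), ('hello', 2), ('world', 1)]
```

Each assignment corresponds to one relation in the script, which is exactly the "data flow" style Pig is named for.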
    • Pig Pig and Hive overlap, but Pig is popular for ETL: data transformation, cleansing, ingestion, etc.
    • Questions?