Cluster View of                                           MapReduce                                                       ...
Cluster View of                                           MapReduce                                                       ...
The                           Hadoop                           Java API                                      Copyright © 2...
MapReduce in Java                                  Copyright © 2012-2013, Think Big Analytics, All                        ...
MapReduce in Java                           Let’s look at WordCount                                 written in the        ...
Map Codepublic class SimpleWordCountMapper extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> ...
Map Codepublic class SimpleWordCountMapper extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> ...
Map Codepublic class SimpleWordCountMapper extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> ...
Map Codepublic class SimpleWordCountMapper                               Mapper class with 4 extends MapReduceBase impleme...
Map Codepublic class SimpleWordCountMapper extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> ...
Map Codepublic class SimpleWordCountMapper extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> ...
Map Codepublic class SimpleWordCountMapper extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> ...
Reduce Codepublic class SimpleWordCountReducer extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritab...
Reduce Codepublic class SimpleWordCountReducer extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritab...
Reduce Codepublic class SimpleWordCountReducer extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritab...
Reduce Codepublic class SimpleWordCountReducer                              Reducer class with 4 extends MapReduceBase imp...
Reduce Codepublic class SimpleWordCountReducer extends MapReduceBase implements                               Reduce metho...
Reduce Codepublic class SimpleWordCountReducer extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritab...
Other Options                     • HDFS - Hadoop Distributed File System.                     • Map/Reduce - A distribute...
Other Options                     • HDFS - Hadoop Distributed File System.                     • Map/Reduce - A distribute...
Other Options                     • HDFS - Hadoop Distributed File System.                     • Map/Reduce - A distribute...
Conclusions                                     Copyright © 2012-2013, Think Big Analytics, All                           ...
Hadoop Benefits                     • A cost-effective, scalable way to:                           - Store massive data set...
Hadoop Tools                     • Offers a variety of tools for:                           - Application development.    ...
Hadoop                               Distributions                     • A rich, open-source ecosystem.                   ...
Thank You!             - Feel free to contact me at               ‣ ryan.tabora@thinkbiganalytics.com             - Or our...
Bonus                           Content                                   Copyright © 2012-2013, Think Big Analytics, All ...
The Hadoop Ecosystem                     • HDFS - Hadoop Distributed File System.                     • Map/Reduce - A dis...
The Hadoop Ecosystem                     • HDFS - Hadoop Distributed File System.                     • Map/Reduce - A dis...
Hive:                           SQL for                           Hadoop                                     Copyright © 2...
Hive                                  Copyright © 2012-2013, Think Big Analytics, All                            58       ...
Hive                           Let’s look at WordCount                                written in Hive,                    ...
CREATE TABLE docs (line STRING);  LOAD DATA INPATH docs  OVERWRITE INTO TABLE docs;  CREATE TABLE word_counts AS  SELECT w...
CREATE TABLE docs (line STRING);  LOAD DATA INPATH docs  OVERWRITE INTO TABLE docs;  CREATE TABLE word_counts AS  SELECT w...
CREATE TABLE docs (line STRING);  LOAD DATA INPATH docs  OVERWRITE INTO TABLE docs;  CREATE TABLE word_counts AS  SELECT w...
Create a table to hold  CREATE TABLE docs (line STRING);               the raw text we’re                                 ...
CREATE TABLE docs (line STRING);  LOAD DATA INPATH docs                      Load the text in the                         ...
CREATE TABLE docs (line STRING);                                                Create the final table  LOAD DATA INPATH do...
Hive                                  Copyright © 2012-2013, Think Big Analytics, All                            63       ...
Hive               Because so many Hadoop users                come from SQL backgrounds,                   Hive is one of...
The Hadoop Ecosystem                     • HDFS - Hadoop Distributed File System.                     • Map/Reduce - A dis...
The Hadoop Ecosystem                     • HDFS - Hadoop Distributed File System.                     • Map/Reduce - A dis...
Pig:                            Data Flow                           for Hadoop                                    Copyrigh...
Pig                                 Copyright © 2012-2013, Think Big Analytics, All                            66         ...
Pig                           Let’s look at WordCount                                 written in Pig,                     ...
inpt = LOAD docs using TextLoader     AS (line:chararray);  words = FOREACH inpt     GENERATE flatten(TOKENIZE(line)) AS w...
inpt = LOAD docs using TextLoader     AS (line:chararray);  words = FOREACH inpt     GENERATE flatten(TOKENIZE(line)) AS w...
inpt = LOAD docs using TextLoader     AS (line:chararray);  words = FOREACH inpt     GENERATE flatten(TOKENIZE(line)) AS w...
inpt = LOAD docs using TextLoader     AS (line:chararray);            Like the Hive example,                              ...
inpt = LOAD docs using TextLoader     AS (line:chararray);            Tokenize into words (an                             ...
inpt = LOAD docs using TextLoader     AS (line:chararray);  words = FOREACH inpt     GENERATE flatten(TOKENIZE(line)) AS w...
inpt = LOAD docs using TextLoader     AS (line:chararray);  words = FOREACH inpt     GENERATE flatten(TOKENIZE(line)) AS w...
inpt = LOAD docs using TextLoader     AS (line:chararray);  words = FOREACH inpt     GENERATE flatten(TOKENIZE(line)) AS w...
Pig                                 Copyright © 2012-2013, Think Big Analytics, All                            73         ...
Pig                              Pig and Hive overlap,                           but Pig is popular for ETL,              ...
Questions?                                    Copyright © 2012-2013, Think Big Analytics, All                             ...
Intro to HDFS and MapReduce

An introduction to HDFS and MapReduce for beginners.

Published in: Technology
Intro to HDFS and MapReduce

  1. Introduction to HDFS and MapReduce. Copyright © 2012-2013, Think Big Analytics, All Rights Reserved. Thursday, January 10, 2013
  2. Who Am I
      - Ryan Tabora
      - Data Developer at Think Big Analytics
      - Big Data Consulting
      - Experience working with Hadoop, HBase, Hive, Solr, Cassandra, etc.
  4. Think Big is the leading professional services firm that’s purpose-built for Big Data.
      • One of Silicon Valley’s fastest-growing Big Data startups
      • 100% focus on Big Data consulting & Data Science solution services
      • Management background: Cambridge Technology, C-bridge, Oracle, Sun Microsystems, Quantcast, Accenture
        - C-bridge Internet Solutions (CBIS) founder 1996 & executives, IPO 1999
      • Clients: 40+
      • North America locations
        - US East: Boston, New York, Washington D.C.
        - US Central: Chicago, Austin
        - US West: HQ Mountain View, San Diego, Salt Lake City
      • EMEA & APAC
  5. Think Big Recognized as a Top Pure-Play Big Data Vendor. Source: Forbes, February 2012
  6. Agenda
      - Big Data
      - Hadoop Ecosystem
      - HDFS
      - MapReduce in Hadoop
      - The Hadoop Java API
      - Conclusions
  7. Big Data
  8. A Data Shift... Source: EMC Digital Universe Study*
  9. Motivation
      “Simple algorithms and lots of data trump complex models.”
      Halevy, Norvig, and Pereira (Google), IEEE Intelligent Systems
  10. Pioneers
      • Google and Yahoo:
        - Index 850+ million websites, over one trillion URLs.
      • Facebook ad targeting:
        - 840+ million users, > 50% of whom are active daily.
  11. Hadoop Ecosystem
  12. Common Tool?
      • Hadoop
        - Cluster: distributed computing platform.
        - Commodity*, server-class hardware.
        - Extensible platform.
  13. Hadoop Origins
      • MapReduce and Google File System (GFS) pioneered at Google.
      • Hadoop is the commercially-supported open-source equivalent.
  14. What Is Hadoop?
      • Hadoop is a platform.
      • Distributes and replicates data.
      • Manages parallel tasks created by users.
      • Runs as several processes on a cluster.
      • The term Hadoop generally refers to a toolset, not a single tool.
  15. Why Hadoop?
      • Handles unstructured to semi-structured to structured data.
      • Handles enormous data volumes.
      • Flexible data analysis and machine learning tools.
      • Cost-effective scalability.
  16. The Hadoop Ecosystem
      • HDFS - Hadoop Distributed File System.
      • Map/Reduce - A distributed framework for executing work in parallel.
      • Hive - A SQL-like syntax with a metastore to allow SQL manipulation of data stored on HDFS.
      • Pig - A top-down scripting language to manipulate data.
      • HBase - A NoSQL, non-sequential data store.
  18. HDFS
  19. What Is HDFS?
      • Hadoop Distributed File System.
      • Stores files in blocks across many nodes in a cluster.
      • Replicates the blocks across nodes for durability.
      • Master/Slave architecture.
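
To make "stores files in blocks" and "replicates the blocks" concrete, here is a minimal arithmetic sketch in plain Java. It does not use the Hadoop API. The class name `HdfsBlockMath`, the 200 MB file, the 64 MB block size, and the replication factor of 3 are all assumptions chosen for illustration; none of them come from the slides.

```java
// A sketch only: the 64 MB block size and replication factor 3 used in main
// are assumed example values, not numbers taken from the slides.
final class HdfsBlockMath {
    static final long MB = 1024L * 1024L;

    /** Number of blocks needed for a file, rounding the last partial block up. */
    static long blockCount(long fileBytes, long blockBytes) {
        if (fileBytes < 0 || blockBytes <= 0) {
            throw new IllegalArgumentException("file size must be >= 0 and block size > 0");
        }
        return (fileBytes + blockBytes - 1) / blockBytes;
    }

    /** Raw bytes held across the cluster when every block has `replication` copies. */
    static long rawBytesStored(long fileBytes, int replication) {
        return fileBytes * replication;
    }

    public static void main(String[] args) {
        long fileBytes = 200 * MB;   // hypothetical file size
        long blockBytes = 64 * MB;   // assumed block size
        int replication = 3;         // assumed replication factor
        System.out.println("blocks: " + blockCount(fileBytes, blockBytes));
        System.out.println("raw MB stored: " + rawBytesStored(fileBytes, replication) / MB);
    }
}
```

Under these assumed numbers the file needs 4 blocks and occupies 600 MB of raw disk across the cluster, which is the price paid for durability.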
  20. HDFS Traits
      • Not fully POSIX compliant.
      • No file updates.
      • Write once, read many times.
      • Large blocks, sequential read patterns.
      • Designed for batch processing.
  21. HDFS Master
      • NameNode
        - Runs on a single node as a master process
          ‣ Holds file metadata (which blocks are where)
          ‣ Directs client access to files in HDFS
      • SecondaryNameNode
        - Not a hot failover
        - Maintains a copy of the NameNode metadata
  22. HDFS Slaves
      • DataNode
        - Generally runs on all nodes in the cluster
          ‣ Block creation/replication/deletion/reads
          ‣ Takes orders from the NameNode
  23. HDFS Illustrated (diagram built up over slides 23-29). Labels: NameNode, Put File, File, DataNode 1 through DataNode 6. The file is split into blocks 1, 2, and 3, and the NameNode ends up showing one replica list per block: 1,4,6 / 2,5,3 / 3,2,6.
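
The NameNode's role as the holder of "which blocks are where" can be sketched as a small lookup table. The example below is plain Java with no Hadoop dependency. It assumes that the diagram labels 1,4,6 / 2,5,3 / 3,2,6 are the DataNode ids holding blocks 1, 2, and 3, which is an interpretation of the slide. `BlockMapSketch` and its method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustration only: a toy version of NameNode metadata, not Hadoop code.
final class BlockMapSketch {
    /** Block id -> DataNode ids holding a replica (assumed reading of the slide labels). */
    static Map<Integer, List<Integer>> slideBlockMap() {
        Map<Integer, List<Integer>> blockMap = new TreeMap<>();
        blockMap.put(1, Arrays.asList(1, 4, 6));
        blockMap.put(2, Arrays.asList(2, 5, 3));
        blockMap.put(3, Arrays.asList(3, 2, 6));
        return blockMap;
    }

    /** Inverts the table: DataNode id -> block ids stored on that node. */
    static Map<Integer, List<Integer>> blocksPerDataNode(Map<Integer, List<Integer>> blockMap) {
        Map<Integer, List<Integer>> perNode = new TreeMap<>();
        for (Map.Entry<Integer, List<Integer>> entry : blockMap.entrySet()) {
            for (Integer dataNode : entry.getValue()) {
                perNode.computeIfAbsent(dataNode, k -> new ArrayList<>()).add(entry.getKey());
            }
        }
        return perNode;
    }

    public static void main(String[] args) {
        Map<Integer, List<Integer>> blockMap = slideBlockMap();
        System.out.println("block 2 is on DataNodes " + blockMap.get(2));
        System.out.println("blocks per DataNode: " + blocksPerDataNode(blockMap));
    }
}
```

A client that wants block 2 asks this table and is pointed at DataNodes 2, 5, or 3; the block data itself never has to pass through the master.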
  30. Power of Hadoop (diagram built up over slides 30-37). Labels: NameNode, Read File, DataNode 1 through DataNode 6, with the replica lists 1,4,6 / 2,5,3 / 3,2,6. DataNode 1 then disappears from the diagram: the first replica list drops to 4,6 and becomes 5,4,6 once DataNode 5 takes over the lost replica. The file's blocks can be read from several machines at once, so the aggregate read rate = Transfer Rate x Number of Machines*, for example 100 MB/s x 3 = 300 MB/s.
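
The two ideas on these slides, repairing a replica list after a DataNode is lost and reading blocks from several machines in parallel, can be sketched in plain Java. The "least loaded live node" placement rule below is a hypothetical policy picked because it reproduces the slide's 5,4,6 result; it is not a description of HDFS's actual placement logic. `FailureAndReadSketch`, its method names, and the 900 MB file size are likewise assumptions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustration only: toy re-replication and read-rate arithmetic, not Hadoop code.
final class FailureAndReadSketch {
    /**
     * Drops replicas on nodes that are no longer live, then tops every block back up
     * to `replication` copies using the live node that currently holds the fewest
     * blocks (ties go to the lowest node id). Hypothetical policy.
     */
    static Map<Integer, List<Integer>> reReplicate(Map<Integer, List<Integer>> blockMap,
                                                   List<Integer> liveNodes,
                                                   int replication) {
        Map<Integer, Integer> load = new TreeMap<>();
        for (Integer node : liveNodes) {
            load.put(node, 0);
        }
        Map<Integer, List<Integer>> repaired = new TreeMap<>();
        for (Map.Entry<Integer, List<Integer>> entry : blockMap.entrySet()) {
            List<Integer> holders = new ArrayList<>();
            for (Integer node : entry.getValue()) {
                if (load.containsKey(node)) {          // keep replicas on live nodes only
                    holders.add(node);
                    load.put(node, load.get(node) + 1);
                }
            }
            repaired.put(entry.getKey(), holders);
        }
        for (List<Integer> holders : repaired.values()) {
            while (holders.size() < replication) {
                Integer best = null;
                for (Integer node : load.keySet()) {   // ascending ids
                    if (!holders.contains(node)
                            && (best == null || load.get(node) < load.get(best))) {
                        best = node;
                    }
                }
                if (best == null) {
                    break;                             // no live node left to copy to
                }
                holders.add(best);
                load.put(best, load.get(best) + 1);
            }
        }
        return repaired;
    }

    /** Aggregate rate when `machines` nodes each stream at perNodeMBps. */
    static double aggregateReadRate(double perNodeMBps, int machines) {
        return perNodeMBps * machines;
    }

    /** Time to read fileMB at the given aggregate rate. */
    static double readSeconds(double fileMB, double aggregateMBps) {
        return fileMB / aggregateMBps;
    }

    public static void main(String[] args) {
        Map<Integer, List<Integer>> blockMap = new TreeMap<>();
        blockMap.put(1, Arrays.asList(1, 4, 6));
        blockMap.put(2, Arrays.asList(2, 5, 3));
        blockMap.put(3, Arrays.asList(3, 2, 6));
        // DataNode 1 has failed, so only nodes 2-6 are live.
        System.out.println(reReplicate(blockMap, Arrays.asList(2, 3, 4, 5, 6), 3));
        double rate = aggregateReadRate(100.0, 3);     // the slide's numbers
        System.out.println(rate + " MB/s, " + readSeconds(900.0, rate) + " s for a hypothetical 900 MB file");
    }
}
```

The read-rate product is a best case: it assumes each block is served by a different machine and that nothing else, such as the network or the client, is the bottleneck.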
  38. HDFS Shell
      • Easy to use command line interface.
      • Create, copy, move, and delete files.
      • Administrative duties - chmod, chown, chgrp.
      • Set replication factor for a file.
      • Head, tail, cat to view files.
  39. The Hadoop Ecosystem
      • HDFS - Hadoop Distributed File System.
      • Map/Reduce - A distributed framework for executing work in parallel.
      • Hive - A SQL-like syntax with a metastore to allow SQL manipulation of data stored on HDFS.
      • Pig - A top-down scripting language to manipulate data.
      • HBase - A NoSQL, non-sequential data store.
  41. MapReduce in Hadoop
  42. MapReduce Basics
      • Logical functions: Mappers and Reducers.
      • Developers write map and reduce functions, then submit a jar to the Hadoop cluster.
      • Hadoop handles distributing the Map and Reduce tasks across the cluster.
      • Typically batch oriented.
  43. MapReduce Daemons
      • JobTracker (Master)
        - Manages MapReduce jobs, giving tasks to different nodes, managing task failure
      • TaskTracker (Slave)
        - Creates individual map and reduce tasks
        - Reports task status to JobTracker
  44. MapReduce in Hadoop
      Let’s look at how MapReduce actually works in Hadoop, using WordCount.
  46. WordCount data flow (diagram built up over slides 46-56). Stages: Input, Mappers, Sort/Shuffle, Reducers, Output. We need to convert the Input into the Output.
      • Input: four documents
        - doc1: "Hadoop uses MapReduce"
        - doc2: "There is a Map phase"
        - doc3: "" (empty)
        - doc4: "There is a Reduce phase"
      • Mappers: emit one (word, 1) pair per word
        - doc1: (hadoop, 1), (uses, 1), (mapreduce, 1)
        - doc2: (there, 1), (is, 1), (a, 1), (map, 1), (phase, 1)
        - doc3: no output
        - doc4: (there, 1), (is, 1), (a, 1), (reduce, 1), (phase, 1)
      • Sort/Shuffle: keys are routed to three reducers by range: 0-9 and a-l, m-q, r-z
      • Reducers: each receives its keys with the list of values for each key
        - 0-9, a-l: (a, [1,1]), (hadoop, [1]), (is, [1,1])
        - m-q: (map, [1]), (mapreduce, [1]), (phase, [1,1])
        - r-z: (reduce, [1]), (there, [1,1]), (uses, [1])
      • Output: a 2, hadoop 1, is 2, map 1, mapreduce 1, phase 2, reduce 1, there 2, uses 1
      • Map: transform one input to 0-N outputs.
      • Reduce: collect multiple inputs into one output.
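
The same data flow can be run end to end in a few lines of plain Java. This is an in-memory sketch of the map, sort/shuffle, and reduce steps from the diagram, not Hadoop API code. `WordCountSketch` and its method names are hypothetical, and the range partitioner simply mirrors the three key ranges drawn on the slide.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

// Illustration only: single-process WordCount that follows the slide's stages.
final class WordCountSketch {
    static final int REDUCERS = 3;

    /** Map: one document in, zero or more (word, 1) pairs out. */
    static List<Map.Entry<String, Integer>> map(String document) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String token : document.toLowerCase().split("\\s+")) {
            if (!token.isEmpty()) {
                pairs.add(new AbstractMap.SimpleEntry<>(token, 1));
            }
        }
        return pairs;
    }

    /** Partition: pick a reducer from the key's first character (0-9 and a-l, m-q, r-z). */
    static int partition(String key) {
        char first = key.charAt(0);
        if (first >= 'r') {
            return 2;
        }
        if (first >= 'm') {
            return 1;
        }
        return 0;
    }

    /** Sort/Shuffle: group values by key within each partition, keys in sorted order. */
    static List<SortedMap<String, List<Integer>>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        List<SortedMap<String, List<Integer>>> partitions = new ArrayList<>();
        for (int i = 0; i < REDUCERS; i++) {
            partitions.add(new TreeMap<String, List<Integer>>());
        }
        for (Map.Entry<String, Integer> pair : pairs) {
            partitions.get(partition(pair.getKey()))
                      .computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                      .add(pair.getValue());
        }
        return partitions;
    }

    /** Reduce: many values for one key in, one total out. */
    static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int value : values) {
            sum += value;
        }
        return sum;
    }

    /** Runs all three stages over the documents and merges the reducer outputs. */
    static SortedMap<String, Integer> wordCount(List<String> documents) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String document : documents) {
            pairs.addAll(map(document));
        }
        SortedMap<String, Integer> output = new TreeMap<>();
        for (SortedMap<String, List<Integer>> partition : shuffle(pairs)) {
            for (Map.Entry<String, List<Integer>> entry : partition.entrySet()) {
                output.put(entry.getKey(), reduce(entry.getKey(), entry.getValue()));
            }
        }
        return output;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList(
                "Hadoop uses MapReduce", "There is a Map phase", "", "There is a Reduce phase");
        // prints {a=2, hadoop=1, is=2, map=1, mapreduce=1, phase=2, reduce=1, there=2, uses=1}
        System.out.println(wordCount(docs));
    }
}
```

On a real cluster each call to `map` and `reduce` would run as a separate task on a different node; here they run in one loop so the grouping performed by the shuffle step is easy to see.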
  57. Cluster View of MapReduce
      [Diagram: a NameNode and a JobTracker above three worker nodes, each running a TaskTracker and a DataNode. The job jar, holding the Map (M) and Reduce (R) code, sits with the JobTracker.]
      • Map Phase: a map task (M) runs on each TaskTracker and emits (k,v) pairs.
      • Intermediate data is stored locally.
      • Shuffle/Sort: the (k,v) pairs move between the nodes.
      • Reduce Phase: a reduce task (R) runs on each TaskTracker and consumes the (k,v) pairs.
      • Job Complete!
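      During Shuffle/Sort, each intermediate (k,v) pair has to reach the reducer responsible for its key. A minimal sketch of two routing rules follows, assuming three reducers: rangePartition mirrors the key ranges from the WordCount example (0-9 and a-l, m-q, r-z), and hashPartition follows the hash-modulo approach of Hadoop's default partitioner. The class and method names are hypothetical.

```java
// Hypothetical sketch of how Shuffle/Sort can pick a reducer for each key; not Hadoop code.
class PartitionSketch {

    // Range rule from the WordCount example: reducer 0 takes 0-9 and a-l,
    // reducer 1 takes m-q, reducer 2 takes r-z.
    static int rangePartition(String key) {
        if (key.isEmpty()) {
            return 0; // assumption: empty keys go to the first reducer
        }
        char first = Character.toLowerCase(key.charAt(0));
        if (first >= 'm' && first <= 'q') {
            return 1;
        }
        if (first >= 'r' && first <= 'z') {
            return 2;
        }
        return 0; // digits, a-l, and anything else
    }

    // Hash rule: mask off the sign bit, then take the remainder.
    static int hashPartition(String key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    public static void main(String[] args) {
        for (String key : new String[] {"hadoop", "mapreduce", "there"}) {
            System.out.println(key + " -> range " + rangePartition(key)
                + ", hash " + hashPartition(key, 3));
        }
    }
}
```

      Either rule sends every pair with the same key to the same reducer, which is what lets a reducer see the complete list for a key, such as (is, [1,1]).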
  68. The Hadoop Java API
  69. MapReduce in Java
      Let’s look at WordCount written in the MapReduce Java API.
  71. Map Code
      Let’s drill into this code...

      import java.io.IOException;
      import org.apache.hadoop.io.IntWritable;
      import org.apache.hadoop.io.LongWritable;
      import org.apache.hadoop.io.Text;
      import org.apache.hadoop.mapred.*;

      // Mapper class with 4 type parameters for the input key-value types and output types.
      public class SimpleWordCountMapper extends MapReduceBase
          implements Mapper<LongWritable, Text, Text, IntWritable> {

        // Output key-value objects we'll reuse.
        static final Text word = new Text();
        static final IntWritable one = new IntWritable(1);

        // Map method with input, output "collector", and reporting object.
        @Override
        public void map(LongWritable key, Text documentContents,
            OutputCollector<Text, IntWritable> collector, Reporter reporter)
            throws IOException {
          // Tokenize the line on whitespace, "collect" each (word, 1).
          String[] tokens = documentContents.toString().split("\\s+");
          for (String wordString : tokens) {
            if (wordString.length() > 0) {
              word.set(wordString.toLowerCase());
              collector.collect(word, one);
            }
          }
        }
      }
  78. Reduce Code
      Let’s drill into this code...

      import java.io.IOException;
      import java.util.Iterator;
      import org.apache.hadoop.io.IntWritable;
      import org.apache.hadoop.io.Text;
      import org.apache.hadoop.mapred.*;

      // Reducer class with 4 type parameters for the input key-value types and output types.
      public class SimpleWordCountReducer extends MapReduceBase
          implements Reducer<Text, IntWritable, Text, IntWritable> {

        // Reduce method with input, output "collector", and reporting object.
        @Override
        public void reduce(Text key, Iterator<IntWritable> counts,
            OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
          // Count the counts per word and emit (word, N).
          int count = 0;
          while (counts.hasNext()) {
            count += counts.next().get();
          }
          output.collect(key, new IntWritable(count));
        }
      }
  84. Other Options
      • HDFS - Hadoop Distributed File System.
      • Map/Reduce - A distributed framework for executing work in parallel.
      • Hive - A SQL-like syntax with a metastore to allow SQL manipulation of data stored on HDFS.
      • Pig - A top-down scripting language to manipulate data.
      • HBase - A NoSQL, non-sequential (random-access) data store.
  87. Conclusions
  88. Hadoop Benefits
      • A cost-effective, scalable way to:
        - Store massive data sets.
        - Perform arbitrary analyses on those data sets.
  89. Hadoop Tools
      • Offers a variety of tools for:
        - Application development.
        - Integration with other platforms (e.g., databases).
  90. Hadoop Distributions
      • A rich, open-source ecosystem.
        - Free to use.
        - Commercially-supported distributions.
  91. Thank You!
      - Feel free to contact me at
        ‣ ryan.tabora@thinkbiganalytics.com
      - Or our solutions consultant
        ‣ matt.mcdevitt@thinkbiganalytics.com
      - As always, THINK BIG!
  92. Bonus Content
  93. The Hadoop Ecosystem
      • HDFS - Hadoop Distributed File System.
      • Map/Reduce - A distributed framework for executing work in parallel.
      • Hive - A SQL-like syntax with a metastore to allow SQL manipulation of data stored on HDFS.
      • Pig - A top-down scripting language to manipulate data.
      • HBase - A NoSQL, non-sequential (random-access) data store.
  95. Hive: SQL for Hadoop
  96. Hive
      Let’s look at WordCount written in Hive, the SQL for Hadoop.
  98. Let’s drill into this code...

      -- Create a table to hold the raw text we're counting. Each line is a "column".
      CREATE TABLE docs (line STRING);

      -- Load the text in the "docs" directory into the table.
      LOAD DATA INPATH 'docs' OVERWRITE INTO TABLE docs;

      -- Create the final table and fill it with the results from a nested query
      -- of the docs table that performs WordCount on the fly.
      CREATE TABLE word_counts AS
      SELECT word, count(1) AS count FROM
        (SELECT explode(split(line, '\\s')) AS word FROM docs) w
      GROUP BY word
      ORDER BY word;
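      For readers more at home in Java than SQL, the nested query can be read as a flatten, group, count pipeline. A rough equivalent over an in-memory list follows. HiveQuerySketch is a hypothetical name, this is only an analogy for what the query computes rather than how Hive executes it, and edge cases such as empty tokens may behave differently in Hive.

```java
import java.util.Arrays;
import java.util.List;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Hypothetical in-memory analogy for the word_counts query; not how Hive runs it.
class HiveQuerySketch {

    static TreeMap<String, Long> wordCounts(List<String> docs) {
        return docs.stream()
            // Inner query: explode(split(line, ...)) turns each line into one row per word.
            .flatMap(line -> Arrays.stream(line.split("\\s")))
            // Outer query: GROUP BY word with count(1); the TreeMap keeps keys sorted,
            // standing in for ORDER BY word.
            .collect(Collectors.groupingBy(word -> word, TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList("There is a Map phase", "There is a Reduce phase");
        System.out.println(wordCounts(docs));
    }
}
```

      Unlike the Java mapper, the query does not lowercase words, so "There" and "there" would be counted separately.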
  104. Hive
      Because so many Hadoop users come from SQL backgrounds, Hive is one of the most essential tools in the ecosystem!
  106. The Hadoop Ecosystem
      • HDFS - Hadoop Distributed File System.
      • Map/Reduce - A distributed framework for executing work in parallel.
      • Hive - A SQL-like syntax with a metastore to allow SQL manipulation of data stored on HDFS.
      • Pig - A top-down scripting language to manipulate data.
      • HBase - A NoSQL, non-sequential (random-access) data store.
  108. Pig: Data Flow for Hadoop
  109. Pig
      Let’s look at WordCount written in Pig, the Data Flow language for Hadoop.
  111. Let’s drill into this code...

      -- Like the Hive example, load "docs" content; each line is a "field".
      inpt = LOAD 'docs' USING TextLoader AS (line:chararray);

      -- Tokenize into words (an array) and "flatten" into separate records.
      words = FOREACH inpt GENERATE flatten(TOKENIZE(line)) AS word;

      -- Collect the same words together.
      grpd = GROUP words BY word;

      -- Count each word.
      cntd = FOREACH grpd GENERATE group, COUNT(words);

      -- Save the results. Profit!
      STORE cntd INTO 'output';
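      What sets the Pig script apart is that every step names an intermediate relation. The same data flow can be traced in Java with one variable per relation. This is a sketch under stated assumptions: PigFlowSketch is a hypothetical name, tokens are split on whitespace only (a simplification of TOKENIZE), and in-memory collections stand in for Pig relations.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical step-by-step analogy for the Pig script; variable names follow the script.
class PigFlowSketch {

    static Map<String, Long> run(List<String> inpt) {
        // words = FOREACH inpt GENERATE flatten(TOKENIZE(line)) AS word;
        // Assumption: whitespace-only tokenizing.
        List<String> words = new ArrayList<>();
        for (String line : inpt) {
            for (String token : line.split("\\s+")) {
                if (!token.isEmpty()) {
                    words.add(token);
                }
            }
        }

        // grpd = GROUP words BY word;
        Map<String, List<String>> grpd = new TreeMap<>();
        for (String word : words) {
            grpd.computeIfAbsent(word, k -> new ArrayList<>()).add(word);
        }

        // cntd = FOREACH grpd GENERATE group, COUNT(words);
        Map<String, Long> cntd = new TreeMap<>();
        for (Map.Entry<String, List<String>> group : grpd.entrySet()) {
            cntd.put(group.getKey(), (long) group.getValue().size());
        }

        // STORE cntd INTO 'output'; here the result is simply returned.
        return cntd;
    }

    public static void main(String[] args) {
        System.out.println(run(Arrays.asList("There is a Map phase", "There is a Reduce phase")));
    }
}
```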
  119. Pig
      Pig and Hive overlap, but Pig is popular for ETL, e.g., data transformation, cleansing, ingestion, etc.
  121. Questions?