Apache Hadoop
                          an introduction

                             Todd Lipcon
Apache Hadoop an Introduction - Todd Lipcon - Gluecon 2010

Todd gives his perspective on Apache Hadoop

  1. Apache Hadoop: an introduction. Todd Lipcon, todd@cloudera.com, @tlipcon, @cloudera. Thursday, May 27, 2010
  2. Hi there! Software Engineer at Cloudera. Hadoop contributor, HBase committer. Previously: systems programming, operations, large scale data analysis. I love data and data systems.
  3. Outline: Why should you care? (Intro); What is Hadoop?; The MapReduce Model; HDFS, Hadoop Map/Reduce; The Hadoop Ecosystem; Questions
  4. Data is everywhere. Data is important.
  5-8. (image-only slides, no text)
  9. “I keep saying that the sexy job in the next 10 years will be statisticians, and I’m not kidding.” Hal Varian (Google’s chief economist)
  10. Are you throwing away data? Data comes in many shapes and sizes: relational tuples, log files, semistructured textual data (e.g., e-mail), etc. Are you throwing it away because it doesn’t “fit”?
  11. So, what’s Hadoop? (The Little Prince, Antoine de Saint-Exupéry, Irene Testot-Ferry)
  12. Apache Hadoop is an open-source system to reliably store and process gobs of information across many commodity computers. (The Little Prince, Antoine de Saint-Exupéry, Irene Testot-Ferry)
  13. Two Core Components. HDFS: self-healing, high-bandwidth clustered storage. Map/Reduce: fault-tolerant distributed computing.
  14. What makes Hadoop special?
  15. Assumption 1: Machines can be reliable... (Image: MadMan the Mighty, CC BY-NC-SA)
  16. Hadoop separates distributed system fault-tolerance code from application logic. (Diagram labels: Unicorns, Systems Programmers, Statisticians)
  17. Assumption 2: Machines have identities... (Image: Laughing Squid, CC BY-NC-SA)
  18. Hadoop lets you interact with a cluster, not a bunch of machines. (Image: Yahoo! Hadoop cluster [OSCON ’07])
  19. Assumption 3: Your analysis fits on one machine (Image: Matthew J. Stinson, CC-BY-NC)
  20. Hadoop scales linearly with data size or analysis complexity. Data-parallel or compute-parallel. For example: extensive machine learning on <100GB of image data; simple SQL-style queries on >100TB of clickstream data. One Hadoop works for both applications!
  21. A Typical Look... 5-4000 commodity servers (8-core, 8-24GB RAM, 4-12 TB, gig-E); 2-level network architecture; 20-40 nodes per rack
  22. (Image: Josh Hough, CC BY-NC-SA) STOP! REAL METAL? Isn’t this some kind of “Cloud Computing” conference? Hadoop runs as a cloud (a cluster) and maybe in a cloud (e.g., EC2).
  23. Hadoop sounds like magic. How is it possible?
  24. dramatis personae. Starring... NameNode (metadata server and database), SecondaryNameNode (assistant to NameNode), JobTracker (scheduler). The Chorus… DataNodes (block storage), TaskTrackers (task execution). Thanks to Zak Stone for earmuff image!
  25. HDFS. Namenode (fs metadata); Datanodes. Files shown: 3x64MB file, 3 rep; 4x64MB file, 3 rep; small file, 7 rep. Racks: One Rack, A Different Rack.
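
The block arithmetic behind slide 25 can be sketched in a few lines. This is only an illustration: it assumes the 64 MB block size shown on the slide, and `BlockMath`, `blocks`, and `replicas` are hypothetical names, not part of any HDFS API.

```java
class BlockMath {
    // 64 MB block size, as shown on the slide (an assumption for this sketch).
    static final long BLOCK_SIZE = 64L * 1024 * 1024;

    // Number of blocks needed to hold a file of the given size (ceiling division).
    static long blocks(long fileBytes) {
        return (fileBytes + BLOCK_SIZE - 1) / BLOCK_SIZE;
    }

    // Total block replicas stored across all datanodes for one file.
    static long replicas(long fileBytes, int replication) {
        return blocks(fileBytes) * replication;
    }

    public static void main(String[] args) {
        System.out.println(replicas(3 * BLOCK_SIZE, 3)); // 3 blocks x 3 replicas = 9
        System.out.println(replicas(4 * BLOCK_SIZE, 3)); // 4 blocks x 3 replicas = 12
        System.out.println(replicas(1024, 7));           // small file: 1 block x 7 replicas = 7
    }
}
```
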
  26. HDFS Write Path
  27. HDFS Failures? Datanode crash? Clients read another copy; background rebalance/rereplicate. Namenode crash? Uh-oh. (Not responsible for majority of downtime!)
  28. The M/R Programming Model
  29. You specify map() and reduce() functions. The framework does the rest.
  30. fault-tolerance (that’s what’s important) (and that’s why Hadoop)
  31. map()
      map: K₁,V₁→list(K₂,V₂)

          public class Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {
            /**
             * Called once for each key/value pair in the input split. Most applications
             * should override this, but the default is the identity function.
             */
            protected void map(KEYIN key, VALUEIN value, Context context)
                throws IOException, InterruptedException {
              // context.write() can be called many times
              // this is default “identity mapper” implementation
              context.write((KEYOUT) key, (VALUEOUT) value);
            }
          }
  32. (the shuffle) Map output is assigned to a “reducer”; map output is sorted by key.
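
The two shuffle steps on slide 32 can be sketched in memory as follows. The class `ShuffleSketch` and its helpers are hypothetical names used only for illustration. The `partition` arithmetic mirrors Hadoop's default hash partitioner (non-negative hash modulo the reducer count), but it runs on plain `String` keys here, so the assignments will not match a real cluster.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

class ShuffleSketch {
    // Assigns a key to a reducer: non-negative hash modulo the number of reducers.
    static int partition(String key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    // Convenience constructor for one (key, value) pair of map output.
    static Map.Entry<String, Long> kv(String key, long value) {
        return new AbstractMap.SimpleEntry<String, Long>(key, value);
    }

    // Groups map output per reducer; within a reducer, keys are kept in sorted order.
    static List<SortedMap<String, List<Long>>> shuffle(
            List<Map.Entry<String, Long>> mapOutput, int numReducers) {
        List<SortedMap<String, List<Long>>> perReducer = new ArrayList<>();
        for (int i = 0; i < numReducers; i++) {
            perReducer.add(new TreeMap<String, List<Long>>());
        }
        for (Map.Entry<String, Long> pair : mapOutput) {
            perReducer.get(partition(pair.getKey(), numReducers))
                      .computeIfAbsent(pair.getKey(), k -> new ArrayList<Long>())
                      .add(pair.getValue());
        }
        return perReducer;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Long>> mapOutput =
                Arrays.asList(kv("b", 1), kv("a", 1), kv("b", 2));
        // Every value for a given key ends up at the same reducer.
        System.out.println(shuffle(mapOutput, 2));
    }
}
```
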
  33. reduce()
      reduce: K₂,iter(V₂)→list(K₃,V₃)

          public class Reducer<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {
            /**
             * This method is called once for each key. Most applications will define
             * their reduce class by overriding this method. The default implementation
             * is an identity function.
             */
            @SuppressWarnings("unchecked")
            protected void reduce(KEYIN key, Iterable<VALUEIN> values, Context context)
                throws IOException, InterruptedException {
              for (VALUEIN value : values) {
                context.write((KEYOUT) key, (VALUEOUT) value);
              }
            }
          }
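
To make the map() and reduce() signatures concrete, here is a minimal in-memory word count. It is a sketch under stated assumptions: it does not use the Hadoop API, it runs on one machine with no fault tolerance, and `WordCountSketch` and its methods are hypothetical names. It only shows how K₁,V₁→list(K₂,V₂) and K₂,iter(V₂)→K₃,V₃ fit together.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

class WordCountSketch {
    // map: (offset, line) -> list of (word, 1)
    static List<Map.Entry<String, Long>> map(long offset, String line) {
        List<Map.Entry<String, Long>> out = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                out.add(new AbstractMap.SimpleEntry<String, Long>(word, 1L));
            }
        }
        return out;
    }

    // reduce: (word, iter(counts)) -> sum of counts for that word
    static long reduce(String key, Iterable<Long> values) {
        long sum = 0;
        for (long v : values) {
            sum += v;
        }
        return sum;
    }

    // Stand-in for the framework: run map, group by key in sorted order, run reduce.
    static SortedMap<String, Long> run(List<String> lines) {
        SortedMap<String, List<Long>> grouped = new TreeMap<>();
        long offset = 0;
        for (String line : lines) {
            for (Map.Entry<String, Long> pair : map(offset, line)) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<Long>())
                       .add(pair.getValue());
            }
            offset += line.length() + 1;
        }
        SortedMap<String, Long> result = new TreeMap<>();
        for (Map.Entry<String, List<Long>> group : grouped.entrySet()) {
            result.put(group.getKey(), reduce(group.getKey(), group.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(run(Arrays.asList("a b a", "b c"))); // prints {a=2, b=2, c=1}
    }
}
```
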
  34. Putting it together... (Diagram labels: Logical, Physical Flow, Physical)
  35. Some samples... Build an inverted index. Summarize data grouped by a key. Build map tiles from geographic data. OCRing many images. Learning ML models (e.g., Naive Bayes for text classification). Augment traditional BI/DWH technologies (by archiving raw data).
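
As one illustration of the first sample, an inverted index has a natural map/reduce shape: map emits (word, document id), and reduce collects the ids per word. The sketch below does this in memory; `InvertedIndexSketch`, the document ids, and the tokenization rule (lowercase, split on non-word characters) are all assumptions made for the example.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.SortedMap;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

class InvertedIndexSketch {
    // docs maps a document id to its text; the result maps each word to the ids containing it.
    static SortedMap<String, SortedSet<String>> build(Map<String, String> docs) {
        SortedMap<String, SortedSet<String>> index = new TreeMap<>();
        for (Map.Entry<String, String> doc : docs.entrySet()) {
            // "map" step: emit (word, docId) for every word in the document
            for (String word : doc.getValue().toLowerCase().split("\\W+")) {
                if (word.isEmpty()) {
                    continue;
                }
                // "shuffle + reduce" step: collect the distinct ids seen for each word
                index.computeIfAbsent(word, k -> new TreeSet<String>()).add(doc.getKey());
            }
        }
        return index;
    }

    public static void main(String[] args) {
        Map<String, String> docs = new LinkedHashMap<>();
        docs.put("d1", "Hadoop stores data");
        docs.put("d2", "Hadoop processes data");
        System.out.println(build(docs).get("hadoop")); // prints [d1, d2]
    }
}
```
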
  36. M/R. Tasktrackers on the same machines as datanodes. (Diagram legend: Job on stars, Different job, Idle; One Rack, A Different Rack)
  37. M/R
  38. M/R Failures. Task fails: Try again? Try again somewhere else? Report failure. Retries possible because of idempotence.
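
The retry policy on slide 38 depends on tasks being idempotent: re-running a task must produce the same result as running it once. A minimal sketch of "try again, then report failure" follows. `RetrySketch`, `runWithRetries`, and the `maxAttempts` parameter are hypothetical, and a real scheduler would also move the retry to a different machine.

```java
import java.util.function.Supplier;

class RetrySketch {
    // Runs the task until it succeeds or maxAttempts is exhausted, then reports failure.
    // Only safe when the task is idempotent.
    static <T> T runWithRetries(Supplier<T> task, int maxAttempts) {
        RuntimeException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return task.get();
            } catch (RuntimeException e) {
                last = e; // remember the failure and try again
            }
        }
        throw new IllegalStateException("task failed after " + maxAttempts + " attempts", last);
    }

    public static void main(String[] args) {
        int[] calls = {0};
        // Simulated flaky task: fails on the first two attempts, succeeds on the third.
        String result = runWithRetries(() -> {
            if (++calls[0] < 3) {
                throw new RuntimeException("simulated task failure");
            }
            return "done";
        }, 4);
        System.out.println(result + " after " + calls[0] + " attempts"); // prints done after 3 attempts
    }
}
```
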
  39. There’s more than the Java API. Streaming: perl, python, ruby, whatever; stdin/stdout/stderr. Pig: higher-level dataflow language for easy ad-hoc analysis; developed at Yahoo!. Hive: SQL interface; great for analysts; developed at Facebook. Many tasks actually require a series of M/R jobs; that’s ok!
  40. The Hadoop Ecosystem. ETL Tools, BI Reporting, RDBMS; Pig (Data Flow), Hive (SQL), Sqoop; ZooKeeper (Coordination); Avro (Serialization); MapReduce (Job Scheduling/Execution System); HBase (Column DB, Key-Value store); HDFS (Hadoop Distributed File System)
  41. Hadoop in the Wild (yes, it’s used in production). Yahoo! Hadoop clusters: >82PB, >25k machines (Eric14, HadoopWorld NYC ’09). Facebook: 15TB new data per day; 10000+ cores, 12+ PB. Twitter: ~1TB per day, ~80 nodes. Lots of 5-40 node clusters at companies without petabytes of data (web, retail, finance, telecom, research).
  42. Ok, fine, what next? Get Hadoop! Cloudera’s Distribution for Hadoop; http://hadoop.apache.org/. Try it out! (Locally, or on EC2.) Door Prize. Watch free training videos on http://cloudera.com/.
  43. Questions? todd@cloudera.com (feedback? yes!) (hiring? yes!)
  44. Backup Slides
  45. Important APIs (→ is 1:many)
      M/R Flow:
        Input Format   data→K₁,V₁
        Mapper         K₁,V₁→K₂,V₂
        Combiner       K₂,iter(V₂)→K₂,V₂
        Partitioner    K₂,V₂→int
        Reducer        K₂,iter(V₂)→K₃,V₃
        Out. Format    K₃,V₃→data
      Other: Writable, JobClient, *Context, Filesystem
  46. the “grep” example

          public int run(String[] args) throws Exception {
            if (args.length < 3) {
              System.out.println("Grep <inDir> <outDir> <regex> [<group>]");
              ToolRunner.printGenericCommandUsage(System.out);
              return -1;
            }

            Path tempDir = new Path("grep-temp-" +
                Integer.toString(new Random().nextInt(Integer.MAX_VALUE)));

            JobConf grepJob = new JobConf(getConf(), Grep.class);

            try {
              grepJob.setJobName("grep-search");

              FileInputFormat.setInputPaths(grepJob, args[0]);

              grepJob.setMapperClass(RegexMapper.class);
              grepJob.set("mapred.mapper.regex", args[2]);
              if (args.length == 4)
                grepJob.set("mapred.mapper.regex.group", args[3]);

              grepJob.setCombinerClass(LongSumReducer.class);
              grepJob.setReducerClass(LongSumReducer.class);

              FileOutputFormat.setOutputPath(grepJob, tempDir);
              grepJob.setOutputFormat(SequenceFileOutputFormat.class);
              grepJob.setOutputKeyClass(Text.class);
              grepJob.setOutputValueClass(LongWritable.class);

              JobClient.runJob(grepJob);

              JobConf sortJob = new JobConf(Grep.class);
              sortJob.setJobName("grep-sort");

              FileInputFormat.setInputPaths(sortJob, tempDir);
              sortJob.setInputFormat(SequenceFileInputFormat.class);

              sortJob.setMapperClass(InverseMapper.class);

              // write a single file
              sortJob.setNumReduceTasks(1);
              FileOutputFormat.setOutputPath(sortJob, new Path(args[1]));
              // sort by decreasing freq
              sortJob.setOutputKeyComparatorClass(
                  LongWritable.DecreasingComparator.class);

              JobClient.runJob(sortJob);
            } finally {
              FileSystem.get(grepJob).delete(tempDir, true);
            }
            return 0;
          }
  47. $ cat input.txt
      adams dunster kirkland dunster
      kirland dudley dunster
      adams dunster winthrop

      $ bin/hadoop jar hadoop-0.18.3-examples.jar grep input.txt output1 'dunster|adams'

      $ cat output1/part-00000
      4 dunster
      2 adams
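
The run on slide 47 can be reproduced in memory to see what the two chained jobs compute: first count regex matches, then sort by decreasing count. This is a single-process sketch with hypothetical names (`GrepSketch`, `grep`); it does not use the Hadoop classes from the example, and it joins count and match with a space only to mirror the slide.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GrepSketch {
    static List<String> grep(List<String> lines, String regex) {
        Pattern pattern = Pattern.compile(regex);

        // Job 1: "map" emits (match, 1) for each regex match; "reduce" sums per match.
        Map<String, Long> counts = new HashMap<>();
        for (String line : lines) {
            Matcher matcher = pattern.matcher(line);
            while (matcher.find()) {
                counts.merge(matcher.group(), 1L, Long::sum);
            }
        }

        // Job 2: invert to (count, match) and sort by decreasing count.
        List<Map.Entry<String, Long>> sorted = new ArrayList<>(counts.entrySet());
        sorted.sort((a, b) -> Long.compare(b.getValue(), a.getValue()));

        List<String> out = new ArrayList<>();
        for (Map.Entry<String, Long> entry : sorted) {
            out.add(entry.getValue() + " " + entry.getKey());
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList(
                "adams dunster kirkland dunster",
                "kirland dudley dunster",
                "adams dunster winthrop");
        for (String row : grep(input, "dunster|adams")) {
            System.out.println(row); // prints "4 dunster" then "2 adams"
        }
    }
}
```
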
  48. Job 1 of 2

          JobConf grepJob = new JobConf(getConf(), Grep.class);

          try {
            grepJob.setJobName("grep-search");

            FileInputFormat.setInputPaths(grepJob, args[0]);

            grepJob.setMapperClass(RegexMapper.class);
            grepJob.set("mapred.mapper.regex", args[2]);
            if (args.length == 4)
              grepJob.set("mapred.mapper.regex.group", args[3]);

            grepJob.setCombinerClass(LongSumReducer.class);
            grepJob.setReducerClass(LongSumReducer.class);

            FileOutputFormat.setOutputPath(grepJob, tempDir);
            grepJob.setOutputFormat(SequenceFileOutputFormat.class);
            grepJob.setOutputKeyClass(Text.class);
            grepJob.setOutputValueClass(LongWritable.class);

            JobClient.runJob(grepJob);
          } ...
  49. Job 2 of 2

            JobConf sortJob = new JobConf(Grep.class);
            sortJob.setJobName("grep-sort");

            FileInputFormat.setInputPaths(sortJob, tempDir);
            sortJob.setInputFormat(SequenceFileInputFormat.class);

            sortJob.setMapperClass(InverseMapper.class);
            // (implicit identity reducer)

            // write a single file
            sortJob.setNumReduceTasks(1);
            FileOutputFormat.setOutputPath(sortJob, new Path(args[1]));
            // sort by decreasing freq
            sortJob.setOutputKeyComparatorClass(
                LongWritable.DecreasingComparator.class);

            JobClient.runJob(sortJob);
          } finally {
            FileSystem.get(grepJob).delete(tempDir, true);
          }
          return 0;
        }
  50. The types there...
        ?, Text
        Text, Long
        Text, list(Long)
        Text, Long
        Long, Text
  51. Facebook Data Infrastructure. Facebook’s DWH, 2008: Scribe Tier, MySQL Tier, Hadoop Tier, Oracle RAC Servers. il 1, 2009