Apache Hadoop, Big Data, and You
Philip Zeyliger

Apache Hadoop Talk at QCon

  1. Apache Hadoop, Big Data, and You. Philip Zeyliger, philip@cloudera.com, @philz42, @cloudera. Wednesday, November 18, 2009
  2. Hi there! Software Engineer. Worked at
  3. I work on stuff...
  4. Outline: Why should you care? (Intro); Challenging yesteryear’s assumptions; The MapReduce Model; HDFS, Hadoop Map/Reduce; The Hadoop Ecosystem; Questions
  5. Data is everywhere. Data is important.
  6. (no text on slide)
  7. (no text on slide)
  8. (no text on slide)
  9. “I keep saying that the sexy job in the next 10 years will be statisticians, and I’m not kidding.” Hal Varian (Google’s chief economist)
  10. So, what’s Hadoop? The Little Prince, Antoine de Saint-Exupéry, Irene Testot-Ferry
  11. Apache Hadoop is an open-source system (written in Java!) to store and process gobs of data across many commodity computers. The Little Prince, Antoine de Saint-Exupéry, Irene Testot-Ferry
  12. Two Big Components
      HDFS: Self-healing high-bandwidth clustered storage.
      Map/Reduce: Fault-tolerant distributed computing.
  13. Challenging some of yesteryear’s assumptions...
  14. Assumption 1: Machines can be reliable... Image: MadMan the Mighty CC BY-NC-SA
  15. Hadoop Goal: Separate distributed system fault-tolerance code from application logic. (Systems Programmers | Statisticians)
  16. Assumption 2: Machines have identities... Image: Laughing Squid CC BY-NC-SA
  17. Hadoop Goal: Users should interact with clusters, not machines.
  18. Assumption 3: A data set fits on one machine... Image: Matthew J. Stinson CC BY-NC
  19. Hadoop Goal: System should scale linearly (or better) with data size.
  20. The M/R Programming Model
  21. You specify map() and reduce() functions. The framework does the rest.
  22. map()
      map: K₁,V₁ → list(K₂,V₂)

      public class Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {
        /**
         * Called once for each key/value pair in the input split. Most applications
         * should override this, but the default is the identity function.
         */
        protected void map(KEYIN key, VALUEIN value, Context context)
            throws IOException, InterruptedException {
          // context.write() can be called many times
          // this is the default “identity mapper” implementation
          context.write((KEYOUT) key, (VALUEOUT) value);
        }
      }
  23. (the shuffle)
      Map output is assigned to a “reducer”.
      Map output is sorted by key.
  24. reduce()
      reduce: K₂, iter(V₂) → list(K₃,V₃)

      public class Reducer<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {
        /**
         * This method is called once for each key. Most applications will define
         * their reduce class by overriding this method. The default implementation
         * is an identity function.
         */
        @SuppressWarnings("unchecked")
        protected void reduce(KEYIN key, Iterable<VALUEIN> values, Context context)
            throws IOException, InterruptedException {
          for (VALUEIN value : values) {
            context.write((KEYOUT) key, (VALUEOUT) value);
          }
        }
      }
  25. Putting it together... (diagram labels: Logical, Physical Flow, Physical)
  26. Some samples...
      Build an inverted index.
      Summarize data grouped by a key.
      Build map tiles from geographic data.
      OCRing many images.
      Learning ML models (e.g., Naive Bayes for text classification).
      Augment traditional BI/DW technologies (by archiving raw data).
  27. There’s more than the Java API
      Streaming: perl, python, ruby, whatever. stdin/stdout/stderr.
      Pig: Higher-level dataflow language for easy ad-hoc analysis. Developed at Yahoo!
      Hive: SQL interface. Great for analysts. Developed at Facebook. Friday, @10:10
  28. A typical look...
      Commodity servers (8-core, 8-16GB RAM, 4-12 TB, 2x1 GbE NIC)
      2-level network architecture
      20-40 nodes per rack
  29. The cast...
      Starring...
        NameNode (metadata server and database)
        SecondaryNameNode (assistant to NameNode)
        JobTracker (scheduler)
      The Chorus…
        DataNodes (block storage)
        TaskTrackers (task execution)
      Thanks to Zak Stone for earmuff image!
  30. HDFS (diagram labels: Namenode; Datanodes; One Rack; A Different Rack; 3x64MB file, 3 rep; 4x64MB file, 3 rep; Small file, 7 rep)
  31. HDFS Write Path
  32. HDFS Failures?
      Datanode crash? Clients read another copy. Background rebalance.
      Namenode crash? uh-oh
  33. M/R (diagram labels: Tasktrackers on the same machines as datanodes; Job on stars; Different job; Idle; One Rack; A Different Rack)
  34. M/R
  35. M/R Failures
      Task fails
        Try again?
        Try again somewhere else?
        Report failure
      Retries possible because of idempotence
  36. Hadoop in the Wild
      Yahoo! Hadoop Clusters: > 82PB, >25k machines (Eric14, HadoopWorld NYC ’09)
      Google: 40 GB/s GFS read/write load (Jeff Dean, LADIS ’09) [~3,500 TB/day]
      Facebook: 4TB new data per day; DW: 4800 cores, 5.5 PB (Dhruba Borthakur, HadoopWorld)
  37. The Hadoop Ecosystem (diagram, top to bottom)
      ETL Tools, BI Reporting, RDBMS
      Pig (Data Flow), Hive (SQL), Sqoop
      ZooKeeper (Coordination), Avro (Serialization)
      MapReduce (Job Scheduling/Execution System)
      HBase (Key-Value store)
      HDFS (Hadoop Distributed File System)
  38. Ok, fine, what next?
      Get Hadoop!
        http://hadoop.apache.org/
        Cloudera Distribution for Hadoop
      Try it out! (Locally, or on EC2)
  39. Just one slide...
      Software: Cloudera Distribution for Hadoop, Cloudera Desktop, more…
      Training and certification…
      Free on-line training materials (including video)
      Support & Professional Services
      @cloudera, blog, etc.
  40. Questions? philip@cloudera.com
  41. Backup Slides
  42. Important APIs (→ is 1:many)
      M/R Flow
        Input Format   data → K₁,V₁
        Mapper         K₁,V₁ → K₂,V₂
        Combiner       K₂, iter(V₂) → K₂,V₂
        Partitioner    K₂,V₂ → int
        Reducer        K₂, iter(V₂) → K₃,V₃
        Out. Format    K₃,V₃ → data
      Other
        Writable
        JobClient
        *Context
        Filesystem
  43. $ cat input.txt
      adams dunster kirkland dunster
      kirland dudley dunster
      adams dunster winthrop

      $ bin/hadoop jar hadoop-0.18.3-examples.jar grep input.txt output1 'dunster|adams'
      $ cat output1/part-00000
      4 dunster
      2 adams
  44. The “grep” example

      public int run(String[] args) throws Exception {
        if (args.length < 3) {
          System.out.println("Grep <inDir> <outDir> <regex> [<group>]");
          ToolRunner.printGenericCommandUsage(System.out);
          return -1;
        }

        Path tempDir = new Path("grep-temp-" +
            Integer.toString(new Random().nextInt(Integer.MAX_VALUE)));

        JobConf grepJob = new JobConf(getConf(), Grep.class);

        try {
          grepJob.setJobName("grep-search");

          FileInputFormat.setInputPaths(grepJob, args[0]);

          grepJob.setMapperClass(RegexMapper.class);
          grepJob.set("mapred.mapper.regex", args[2]);
          if (args.length == 4)
            grepJob.set("mapred.mapper.regex.group", args[3]);

          grepJob.setCombinerClass(LongSumReducer.class);
          grepJob.setReducerClass(LongSumReducer.class);

          FileOutputFormat.setOutputPath(grepJob, tempDir);
          grepJob.setOutputFormat(SequenceFileOutputFormat.class);
          grepJob.setOutputKeyClass(Text.class);
          grepJob.setOutputValueClass(LongWritable.class);

          JobClient.runJob(grepJob);

          JobConf sortJob = new JobConf(Grep.class);
          sortJob.setJobName("grep-sort");

          FileInputFormat.setInputPaths(sortJob, tempDir);
          sortJob.setInputFormat(SequenceFileInputFormat.class);

          sortJob.setMapperClass(InverseMapper.class);

          // write a single file
          sortJob.setNumReduceTasks(1);
          FileOutputFormat.setOutputPath(sortJob, new Path(args[1]));
          // sort by decreasing freq
          sortJob.setOutputKeyComparatorClass(LongWritable.DecreasingComparator.class);

          JobClient.runJob(sortJob);
        }
        finally {
          FileSystem.get(grepJob).delete(tempDir, true);
        }
        return 0;
      }
  45. Job 1 of 2

      JobConf grepJob = new JobConf(getConf(), Grep.class);

      try {
        grepJob.setJobName("grep-search");

        FileInputFormat.setInputPaths(grepJob, args[0]);

        grepJob.setMapperClass(RegexMapper.class);
        grepJob.set("mapred.mapper.regex", args[2]);
        if (args.length == 4)
          grepJob.set("mapred.mapper.regex.group", args[3]);

        grepJob.setCombinerClass(LongSumReducer.class);
        grepJob.setReducerClass(LongSumReducer.class);

        FileOutputFormat.setOutputPath(grepJob, tempDir);
        grepJob.setOutputFormat(SequenceFileOutputFormat.class);
        grepJob.setOutputKeyClass(Text.class);
        grepJob.setOutputValueClass(LongWritable.class);

        JobClient.runJob(grepJob);
      } ...
  46. Job 2 of 2

        JobConf sortJob = new JobConf(Grep.class);
        sortJob.setJobName("grep-sort");

        FileInputFormat.setInputPaths(sortJob, tempDir);
        sortJob.setInputFormat(SequenceFileInputFormat.class);

        sortJob.setMapperClass(InverseMapper.class);
        // (implicit identity reducer)

        // write a single file
        sortJob.setNumReduceTasks(1);
        FileOutputFormat.setOutputPath(sortJob, new Path(args[1]));
        // sort by decreasing freq
        sortJob.setOutputKeyComparatorClass(
            LongWritable.DecreasingComparator.class);

        JobClient.runJob(sortJob);
      } finally {
        FileSystem.get(grepJob).delete(tempDir, true);
      }
      return 0;
      }
  47. The types there...
      ?, Text
      Text, Long
      Text, list(Long)
      Text, Long
      Long, Text
  48. A Simple Join

      People
        Id  Last        First
        1   Washington  George
        2   Lincoln     Abraham

      Key Entry Log
        Location  Id  Time
        Dunster   1   11:00am
        Dunster   2   11:02am
        Kirkland  2   11:08am

      You want to track individuals throughout the day. How would you do this in M/R, if you had to?
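
One common way to approach the question on slide 48 is a reduce-side join: both tables are mapped to the shared Id as the key, each value is tagged with the table it came from, and the reducer pairs the person with that person's log entries. The talk does not give an answer, so the following is only a minimal sketch of that general technique in plain Java, with no Hadoop classes. The class name `JoinSketch`, the comma-separated row format, and the `P:` / `L:` tags are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java simulation of a reduce-side join. Hypothetical names throughout;
// a real job would use the Mapper/Reducer classes shown on slides 22 and 24.
class JoinSketch {

    // Stand-in for the shuffle: values are grouped by key and keys stay sorted.
    static void emit(Map<Integer, List<String>> shuffle, int key, String value) {
        shuffle.computeIfAbsent(key, k -> new ArrayList<>()).add(value);
    }

    // map() for a People row "id,last,first": key by Id, tag the value "P:".
    static void mapPerson(String row, Map<Integer, List<String>> shuffle) {
        String[] f = row.split(",");
        emit(shuffle, Integer.parseInt(f[0]), "P:" + f[2] + " " + f[1]);
    }

    // map() for a log row "location,id,time": key by Id, tag the value "L:".
    static void mapLogEntry(String row, Map<Integer, List<String>> shuffle) {
        String[] f = row.split(",");
        emit(shuffle, Integer.parseInt(f[1]), "L:" + f[2] + " " + f[0]);
    }

    // reduce(): called once per Id with every tagged value for that Id.
    // Inner-join semantics: Ids missing from either table produce no output.
    static List<String> reduce(List<String> values) {
        String name = null;
        List<String> visits = new ArrayList<>();
        for (String v : values) {
            if (v.startsWith("P:")) {
                name = v.substring(2);
            } else {
                visits.add(v.substring(2));
            }
        }
        List<String> joined = new ArrayList<>();
        if (name != null) {
            for (String visit : visits) {
                joined.add(name + ": " + visit);
            }
        }
        return joined;
    }

    // Runs both map phases, then one reduce call per Id in key order.
    static List<String> run(List<String> people, List<String> log) {
        Map<Integer, List<String>> shuffle = new TreeMap<>();
        for (String row : people) mapPerson(row, shuffle);
        for (String row : log) mapLogEntry(row, shuffle);
        List<String> out = new ArrayList<>();
        for (List<String> values : shuffle.values()) out.addAll(reduce(values));
        return out;
    }

    public static void main(String[] args) {
        List<String> people = Arrays.asList("1,Washington,George", "2,Lincoln,Abraham");
        List<String> log = Arrays.asList(
            "Dunster,1,11:00am", "Dunster,2,11:02am", "Kirkland,2,11:08am");
        for (String line : run(people, log)) {
            System.out.println(line);
        }
    }
}
```

The reducer here buffers all log entries for one Id in memory, which is fine for a sketch but would need care for very active Ids on a real cluster.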

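The two-job grep example on slides 43 to 47 can be traced in memory to see how the types flow: lines of text become (match, 1) pairs, the shuffle groups them into (match, list of counts), the reducer sums them, and a second pass swaps key and value and sorts by decreasing count. This is a rough sketch in plain Java rather than the Hadoop API; `GrepSketch` and its method names are hypothetical stand-ins for the roles that `RegexMapper`, `LongSumReducer`, and `InverseMapper` play in the slides.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// In-memory walk-through of the grep example; no Hadoop classes are used.
class GrepSketch {

    // map(): one line in, one (match, 1) pair out per regex match.
    static List<Map.Entry<String, Long>> map(String line, Pattern regex) {
        List<Map.Entry<String, Long>> out = new ArrayList<>();
        Matcher m = regex.matcher(line);
        while (m.find()) {
            out.add(new SimpleEntry<>(m.group(), 1L));
        }
        return out;
    }

    // The shuffle: group values by key, with keys kept in sorted order.
    static TreeMap<String, List<Long>> shuffle(List<Map.Entry<String, Long>> pairs) {
        TreeMap<String, List<Long>> grouped = new TreeMap<>();
        for (Map.Entry<String, Long> p : pairs) {
            grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        }
        return grouped;
    }

    // reduce(): sum the counts for one key.
    static long reduce(List<Long> values) {
        long sum = 0;
        for (long v : values) {
            sum += v;
        }
        return sum;
    }

    // Job 1 (search and count) followed by job 2 (invert and sort descending).
    static List<String> run(List<String> lines, String regex) {
        Pattern pattern = Pattern.compile(regex);
        List<Map.Entry<String, Long>> mapped = new ArrayList<>();
        for (String line : lines) {
            mapped.addAll(map(line, pattern));
        }
        List<Map.Entry<Long, String>> inverted = new ArrayList<>();
        for (Map.Entry<String, List<Long>> e : shuffle(mapped).entrySet()) {
            inverted.add(new SimpleEntry<>(reduce(e.getValue()), e.getKey()));
        }
        inverted.sort((a, b) -> Long.compare(b.getKey(), a.getKey()));
        List<String> out = new ArrayList<>();
        for (Map.Entry<Long, String> e : inverted) {
            out.add(e.getKey() + "\t" + e.getValue());
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            "adams dunster kirkland dunster",
            "kirland dudley dunster",
            "adams dunster winthrop");
        for (String line : run(lines, "dunster|adams")) {
            System.out.println(line);
        }
    }
}
```

With the input from slide 43, this yields the same two rows as the slide: a count of 4 for dunster and 2 for adams.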