Apache Hadoop & Friends at Utah Java User's Group

Talk about Apache Hadoop at the February 18, 2010 meeting of the Utah Java User's Group.

http://www.ujug.org/web/

Published in: Technology


  1. Apache Hadoop & Friends. Philip Zeyliger, philip@cloudera.com, @philz42 @cloudera. February 18, 2010
  2. Hi there! Software Engineer. Worked at... And, oh, I ski (though mostly in Tahoe)
  3. @cloudera, I work on... Apache Avro, Cloudera Desktop
  4. Outline: Why should you care? (Intro). Challenging yesteryear’s assumptions. The MapReduce Model. HDFS, Hadoop Map/Reduce. The Hadoop Ecosystem. Questions
  5. Data is everywhere. Data is important.
  6. (image slide)
  7. (image slide)
  8. (image slide)
  9. “I keep saying that the sexy job in the next 10 years will be statisticians, and I’m not kidding.” Hal Varian (Google’s chief economist)
  10. Are you throwing away data? Data comes in many shapes and sizes: relational tuples, log files, semistructured textual data (e.g., e-mail), … Are you throwing it away because it doesn’t ‘fit’?
  11. So, what’s Hadoop? (Image: The Little Prince, Antoine de Saint-Exupéry, Irene Testot-Ferry)
  12. Apache Hadoop is an open-source system (written in Java!) to reliably store and process gobs of data across many commodity computers. (Image: The Little Prince, Antoine de Saint-Exupéry, Irene Testot-Ferry)
  13. Two Big Components. HDFS: self-healing, high-bandwidth clustered storage. Map/Reduce: fault-tolerant distributed computing.
  14. Challenging some of yesteryear’s assumptions...
  15. Assumption 1: Machines can be reliable... (Image: MadMan the Mighty, CC BY-NC-SA)
  16. Hadoop Goal: Separate distributed system fault-tolerance code (systems programmers) from application logic (statisticians).
  17. Assumption 2: Machines have identities... (Image: Laughing Squid, CC BY-NC-SA)
  18. Hadoop Goal: Users should interact with clusters, not machines.
  19. Assumption 3: A data set fits on one machine... (Image: Matthew J. Stinson, CC BY-NC)
  20. Hadoop Goal: System should scale linearly (or better) with data size.
  21. The M/R Programming Model
  22. You specify map() and reduce() functions. The framework does the rest.
  23. fault-tolerance (that’s what’s important) (and that’s why Hadoop)
  24. map()    map: K₁,V₁→list K₂,V₂

      public class Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {
        /**
         * Called once for each key/value pair in the input split. Most applications
         * should override this, but the default is the identity function.
         */
        protected void map(KEYIN key, VALUEIN value, Context context)
            throws IOException, InterruptedException {
          // context.write() can be called many times
          // this is default “identity mapper” implementation
          context.write((KEYOUT) key, (VALUEOUT) value);
        }
      }
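The signature on the slide above (map: K₁,V₁→list K₂,V₂) is easy to model without a cluster. Below is a plain-Java, Hadoop-free sketch of that contract using a word-count-style mapper; the names (WordMap, mapRecord) are made up for illustration and are not part of the Hadoop API.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// A minimal, Hadoop-free sketch of the map() contract: one input record
// (K1,V1) may emit any number of intermediate (K2,V2) pairs.
// WordMap and mapRecord are illustrative names, not Hadoop's API.
public class WordMap {
    // Input record: (byte offset, line of text); output: one (word, 1) per word.
    public static List<Map.Entry<String, Integer>> mapRecord(long offset, String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                out.add(new AbstractMap.SimpleEntry<>(word, 1));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // One input pair in, four intermediate pairs out.
        System.out.println(mapRecord(0L, "adams dunster kirkland dunster"));
    }
}
```

The framework calls this once per input record; nothing in the function knows or cares which machine it runs on, which is exactly why the framework is free to re-run it anywhere.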
  25. (the shuffle) map output is assigned to a “reducer”; map output is sorted by key
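What the shuffle does to map output can also be sketched in a few lines: group intermediate pairs by key, with keys coming out sorted. On a real cluster this happens across machines, with a Partitioner deciding which reducer owns each key; this toy version (hypothetical Shuffle class) does everything in one TreeMap.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

// Toy model of the shuffle: intermediate (K2,V2) pairs are grouped by key,
// and keys are sorted. Real Hadoop partitions this work across reducers;
// here one TreeMap stands in for the whole phase. Illustrative names only.
public class Shuffle {
    public static SortedMap<String, List<Integer>> shuffle(
            List<Map.Entry<String, Integer>> mapOutput) {
        SortedMap<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> kv : mapOutput) {
            grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>())
                   .add(kv.getValue());
        }
        return grouped; // keys sorted, values grouped per key
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        out.add(new AbstractMap.SimpleEntry<>("dunster", 1));
        out.add(new AbstractMap.SimpleEntry<>("adams", 1));
        out.add(new AbstractMap.SimpleEntry<>("dunster", 1));
        System.out.println(shuffle(out));
    }
}
```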
  26. reduce()    K₂, iter(V₂)→list(K₃,V₃)

      public class Reducer<KEYIN,VALUEIN,KEYOUT,VALUEOUT> {
        /**
         * This method is called once for each key. Most applications will define
         * their reduce class by overriding this method. The default implementation
         * is an identity function.
         */
        @SuppressWarnings("unchecked")
        protected void reduce(KEYIN key, Iterable<VALUEIN> values, Context context)
            throws IOException, InterruptedException {
          for (VALUEIN value : values) {
            context.write((KEYOUT) key, (VALUEOUT) value);
          }
        }
      }
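To close the loop on the model, here is a matching toy reducer that sums the grouped counts per key, the same role LongSumReducer plays in the grep example later in the deck. SumReduce and reduceAll are illustrative names, not Hadoop API.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

// Toy model of reduce(): each key K2 arrives with its grouped values
// iter(V2); the reducer emits (K3,V3) pairs. Here it sums counts,
// completing an in-memory word count. Illustrative names only.
public class SumReduce {
    public static Map<String, Integer> reduceAll(SortedMap<String, List<Integer>> grouped) {
        Map<String, Integer> result = new LinkedHashMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            int sum = 0;
            for (int v : e.getValue()) sum += v;   // reduce one key's values
            result.put(e.getKey(), sum);           // emit (word, total)
        }
        return result;
    }

    public static void main(String[] args) {
        SortedMap<String, List<Integer>> grouped = new TreeMap<>();
        grouped.put("adams", List.of(1, 1));
        grouped.put("dunster", List.of(1, 1, 1, 1));
        System.out.println(reduceAll(grouped));
    }
}
```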
  27. Putting it together... Logical. Physical Flow. Physical. (diagram slide)
  28. Some samples... Build an inverted index. Summarize data grouped by a key. Build map tiles from geographic data. OCRing many images. Learning ML models (e.g., Naive Bayes for text classification). Augment traditional BI/DW technologies (by archiving raw data).
  29. There’s more than the Java API. Streaming: perl, python, ruby, whatever; great for stdin/stdout/stderr. Pig: higher-level dataflow language; developed at Yahoo!. Hive: SQL interface for easy ad-hoc analysis; great for analysts; developed at Facebook. Many tasks actually require a series of M/R jobs; that’s ok!
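Streaming’s contract is nothing more than lines on stdin in and tab-separated key/value pairs on stdout. A streaming mapper is usually a perl/python/ruby script, but for consistency with the rest of this deck here is the same idea sketched in Java (hypothetical StreamMapper class, not part of Hadoop):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.PrintWriter;

// Sketch of a streaming-style word-count mapper: one record per stdin
// line in, "key<TAB>value" per pair on stdout. Hadoop Streaming would
// launch this as a child process; here it is just a stdin/stdout filter.
public class StreamMapper {
    public static void run(BufferedReader in, PrintWriter out) throws IOException {
        String line;
        while ((line = in.readLine()) != null) {
            for (String word : line.trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    out.println(word + "\t1"); // emit (word, 1)
                }
            }
        }
        out.flush();
    }

    public static void main(String[] args) throws IOException {
        run(new BufferedReader(new InputStreamReader(System.in)),
            new PrintWriter(System.out));
    }
}
```

Because the protocol is just text streams, the same mapper logic could be a five-line python script; the framework neither knows nor cares.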
  30. A Typical Look... Commodity servers (8-core, 8-16GB RAM, 4-12 TB, 2x1 GbE NIC). 2-level network architecture; 20-40 nodes per rack.
  31. So, how does it all work?
  32. dramatis personae. Starring... NameNode (metadata server and database), SecondaryNameNode (assistant to NameNode), JobTracker (scheduler). The Chorus... DataNodes (block storage), TaskTrackers (task execution). Thanks to Zak Stone for earmuff image!
  33. HDFS. Namenode (fs metadata) and Datanodes holding a 3x64MB file (3 replicas), a 4x64MB file (3 replicas), and a small file (7 replicas), spread across One Rack and A Different Rack. (diagram slide)
  34. HDFS Write Path (diagram slide)
  35. HDFS Failures? Datanode crash? Clients read another copy; background rebalance. Namenode crash? uh-oh
  36. M/R. Tasktrackers on the same machines as datanodes; a job on stars, a different job, and idle slots across One Rack and A Different Rack. (diagram slide)
  37. M/R (diagram slide)
  38. M/R Failures. Task fails: Try again? Try again somewhere else? Report failure. Retries possible because of idempotence
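The idempotence point on the slide above is what makes “try again (somewhere else)” a legal strategy: re-running a task produces the same output, so a crash costs only time. A minimal sketch of that retry policy, with made-up names (Retry, runWithRetries) and none of the real JobTracker’s scheduling logic:

```java
import java.util.function.Supplier;

// Toy model of M/R task retries: because tasks are idempotent, the
// framework can simply re-run a failed task, possibly on another
// machine, and report failure only after exhausting its attempts.
public class Retry {
    public static <T> T runWithRetries(Supplier<T> task, int maxAttempts) {
        RuntimeException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return task.get(); // safe to retry: the task is idempotent
            } catch (RuntimeException e) {
                last = e; // a real cluster might reschedule on another node here
            }
        }
        // report failure after all attempts are used up
        throw last != null ? last : new IllegalArgumentException("maxAttempts must be >= 1");
    }

    public static void main(String[] args) {
        int[] attempts = {0};
        int result = runWithRetries(() -> {
            attempts[0]++;
            if (attempts[0] < 3) throw new RuntimeException("simulated task crash");
            return 42;
        }, 4);
        System.out.println("succeeded after " + attempts[0] + " attempts: " + result);
    }
}
```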
  39. Hadoop in the Wild (yes, it’s used in production). Yahoo! Hadoop Clusters: >82PB, >25k machines (Eric14, HadoopWorld NYC ’09). Google: 40 GB/s GFS read/write load, ~3,500 TB/day (M-R, but not Hadoop) (Jeff Dean, LADIS ’09). Facebook: 4TB new data per day; DW: 4800 cores, 5.5 PB (Dhruba Borthakur, HadoopWorld)
  40. The Hadoop Ecosystem. ETL Tools, BI Reporting, RDBMS on top; Pig (Data Flow), Hive (SQL), Sqoop; Zookeeper (Coordination); Avro (Serialization); MapReduce (Job Scheduling/Execution System); HBase (Key-Value store); HDFS (Hadoop Distributed File System)
  41. Ok, fine, what next? Get Hadoop! Cloudera’s Distribution for Hadoop; http://hadoop.apache.org/. Try it out! (Locally, or on EC2.) Door Prize
  42. Just one slide... Software: Cloudera Distribution for Hadoop, Cloudera Desktop, more… Training and certification… Free on-line training materials (including video). Support & Professional Services. @cloudera, blog, etc.
  43. Okay, two slides. Talk to us if you’re going to do Hadoop in your organization.
  44. Questions? philip@cloudera.com (feedback? yes!) (hiring? yes!)
  45. Backup Slides (i’ve got your back)
  46. Important APIs (→ is 1:many). M/R Flow: InputFormat data→K₁,V₁; Mapper K₁,V₁→K₂,V₂; Combiner K₂,iter(V₂)→K₂,V₂; Partitioner K₂,V₂→int; Reducer K₂,iter(V₂)→K₃,V₃; OutputFormat K₃,V₃→data. Other: Writable, JobClient, *Context, Filesystem.
  47. The “grep” example: the full run() method (the two jobs are shown separately on slides 49 and 50):

      public int run(String[] args) throws Exception {
        if (args.length < 3) {
          System.out.println("Grep <inDir> <outDir> <regex> [<group>]");
          ToolRunner.printGenericCommandUsage(System.out);
          return -1;
        }
        Path tempDir = new Path("grep-temp-" +
            Integer.toString(new Random().nextInt(Integer.MAX_VALUE)));
        JobConf grepJob = new JobConf(getConf(), Grep.class);
        try {
          grepJob.setJobName("grep-search");
          FileInputFormat.setInputPaths(grepJob, args[0]);
          grepJob.setMapperClass(RegexMapper.class);
          grepJob.set("mapred.mapper.regex", args[2]);
          if (args.length == 4)
            grepJob.set("mapred.mapper.regex.group", args[3]);
          grepJob.setCombinerClass(LongSumReducer.class);
          grepJob.setReducerClass(LongSumReducer.class);
          FileOutputFormat.setOutputPath(grepJob, tempDir);
          grepJob.setOutputFormat(SequenceFileOutputFormat.class);
          grepJob.setOutputKeyClass(Text.class);
          grepJob.setOutputValueClass(LongWritable.class);
          JobClient.runJob(grepJob);

          JobConf sortJob = new JobConf(Grep.class);
          sortJob.setJobName("grep-sort");
          FileInputFormat.setInputPaths(sortJob, tempDir);
          sortJob.setInputFormat(SequenceFileInputFormat.class);
          sortJob.setMapperClass(InverseMapper.class);
          // write a single file
          sortJob.setNumReduceTasks(1);
          FileOutputFormat.setOutputPath(sortJob, new Path(args[1]));
          // sort by decreasing freq
          sortJob.setOutputKeyComparatorClass(
              LongWritable.DecreasingComparator.class);
          JobClient.runJob(sortJob);
        } finally {
          FileSystem.get(grepJob).delete(tempDir, true);
        }
        return 0;
      }
  48. $ cat input.txt
      adams dunster kirkland dunster
      kirland dudley dunster
      adams dunster winthrop

      $ bin/hadoop jar hadoop-0.18.3-examples.jar grep input.txt output1 'dunster|adams'

      $ cat output1/part-00000
      4       dunster
      2       adams
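The shell session above can be reproduced in memory: count regex matches per matched string, then sort by decreasing count, which is what the two jobs on the following slides do with RegexMapper, LongSumReducer, and InverseMapper. This single-JVM simulation (hypothetical GrepSim class, no Hadoop involved) yields the same part-00000 contents:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// In-memory sketch of the two-job grep pipeline: job 1 counts regex
// matches per matched string; job 2 sorts (count, word) by decreasing
// count. Same logic as the slides, but in one JVM with no Hadoop.
public class GrepSim {
    public static List<String> grep(List<String> lines, String regex) {
        Map<String, Long> counts = new TreeMap<>();
        Pattern p = Pattern.compile(regex);
        for (String line : lines) {
            Matcher m = p.matcher(line);
            while (m.find()) counts.merge(m.group(), 1L, Long::sum);
        }
        // "sort by decreasing freq", as LongWritable.DecreasingComparator does
        List<Map.Entry<String, Long>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> Long.compare(b.getValue(), a.getValue()));
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, Long> e : entries) {
            out.add(e.getValue() + "\t" + e.getKey());
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList(
            "adams dunster kirkland dunster",
            "kirland dudley dunster",
            "adams dunster winthrop");
        for (String s : grep(input, "dunster|adams")) System.out.println(s);
    }
}
```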
  49. Job 1 of 2:

      JobConf grepJob = new JobConf(getConf(), Grep.class);
      try {
        grepJob.setJobName("grep-search");
        FileInputFormat.setInputPaths(grepJob, args[0]);
        grepJob.setMapperClass(RegexMapper.class);
        grepJob.set("mapred.mapper.regex", args[2]);
        if (args.length == 4)
          grepJob.set("mapred.mapper.regex.group", args[3]);
        grepJob.setCombinerClass(LongSumReducer.class);
        grepJob.setReducerClass(LongSumReducer.class);
        FileOutputFormat.setOutputPath(grepJob, tempDir);
        grepJob.setOutputFormat(SequenceFileOutputFormat.class);
        grepJob.setOutputKeyClass(Text.class);
        grepJob.setOutputValueClass(LongWritable.class);
        JobClient.runJob(grepJob);
      } ...
  50. Job 2 of 2 (implicit identity reducer):

      JobConf sortJob = new JobConf(Grep.class);
      sortJob.setJobName("grep-sort");
      FileInputFormat.setInputPaths(sortJob, tempDir);
      sortJob.setInputFormat(SequenceFileInputFormat.class);
      sortJob.setMapperClass(InverseMapper.class);
      // write a single file
      sortJob.setNumReduceTasks(1);
      FileOutputFormat.setOutputPath(sortJob, new Path(args[1]));
      // sort by decreasing freq
      sortJob.setOutputKeyComparatorClass(
          LongWritable.DecreasingComparator.class);
      JobClient.runJob(sortJob);
      } finally {
        FileSystem.get(grepJob).delete(tempDir, true);
      }
      return 0;
      }
  51. The types there... ?, Text; Text, Long; Text, list(Long); Text, Long; Long, Text
  52. Traditional DW stack (top to bottom): BI / Reporting and Ad-hoc Queries; RDBMS (Aggregates); Data Mining; ETL (Extraction, Transform, Load); Storage (Raw Data); Collection; Instrumentation
  53. Facebook’s DW (phase M, 2008). Facebook Data Infrastructure: Scribe Tier, MySQL Tier, Hadoop Tier, Oracle RAC Servers. (Wednesday, April 1, 2009)