A (very) short intro to Hadoop

A very short introduction to Hadoop, from the talk I gave at the BigDataCamp held in Washington DC in November 2011. Some of this content is also covered in the various big data classes we offer via on-site training (see http://www.scaleunlimited.com/training/).

Published in: Technology, Business

  1. Scale Unlimited, Inc: A very short Intro to Hadoop (photo by: i_pinz, flickr)
     Copyright (c) 2008-2011 Scale Unlimited. All Rights Reserved. Reproduction or distribution of this document in any form without prior written permission is forbidden.
     Monday, December 19, 2011
  2. Welcome to Intro to Hadoop
     - Usually a 4 hour online class
     - Focus on Why, What, How of Hadoop
  3. Meet Your Instructor
     - Developer of Bixo web mining toolkit
     - Committer on Apache Tika
     - Hadoop developer and trainer
     - Solr developer and trainer
  4. Overview: A very short Intro to Hadoop (photo by: exfordy, flickr)
  5. How to Crunch a Petabyte?
     - Lots of disks, spinning all the time
     - Redundancy, since disks die
     - Lots of CPU cores, working all the time
     - Retry, since network errors happen
  6. Hadoop to the Rescue
     - Scalable - many servers with lots of cores and spindles
     - Reliable - detect failures, redundant storage
     - Fault-tolerant - auto-retry, self-healing
     - Simple - use many servers as one really big computer
  7. Logical Architecture
     - [Diagram: a Cluster containing an Execution layer and a Storage layer]
     - Logically, Hadoop is simply a computing cluster that provides:
       - a Storage layer, and
       - an Execution layer
  8. Storage Layer
     - Hadoop Distributed File System (aka HDFS, or Hadoop DFS)
     - Runs on top of regular OS file system, typically Linux ext3
     - Fixed-size blocks (64MB by default) that are replicated
     - Write once, read many; optimized for streaming in and out
  9. Execution Layer
     - Hadoop Map-Reduce
     - Responsible for running a job in parallel on many servers
     - Handles re-trying a task that fails, validating complete results
     - Jobs consist of special “map” and “reduce” operations
  10. Scalable
      - [Diagram: a Cluster of Racks, each Rack holding Nodes; every Node contributes cpu to the Execution layer and disk to the Storage layer]
      - Virtual execution and storage layers span many nodes (servers)
      - Scales linearly (sort of) with cores and disks.
  11. Reliable
      - Each block is replicated, typically three times.
      - Each task must succeed, or the job fails
  12. Fault-tolerant
      - Failed tasks are automatically retried.
      - Failed data transfers are automatically retried.
      - Servers can join and leave the cluster at any time.
  13. Simple
      - [Diagram: Nodes whose cpu forms the Execution layer and whose disk forms the Storage layer]
      - Reduces complexity
      - Conceptual “operating system” that spans many CPUs & disks
  14. Typical Hadoop Cluster
      - Has one “master” server - high quality, beefy box.
        - NameNode process - manages file system
        - JobTracker process - manages tasks
      - Has multiple “slave” servers - commodity hardware.
        - DataNode process - manages file system blocks on local drives
        - TaskTracker process - runs tasks on server
      - Uses high speed network between all servers
  15. Architectural Components
      - Legend:
        - Solid boxes are unique applications
        - Dashed boxes are child JVM instances (on same node as parent)
        - Dotted boxes are blocks of managed files (on same node as parent)
      - [Diagram: the Master runs the NameNode and the JobTracker; the Client sends jobs to the JobTracker, which sends tasks to the TaskTrackers; the Slaves run DataNodes holding data blocks and TaskTrackers with mapper and reducer child JVMs]
  16. Distributed File System: A very short Intro to Hadoop (photo by: Graham Racher, flickr)
  17. Virtual File System
      - Treats many disks on many servers as one huge, logical volume
      - Data is stored in 1...n blocks
      - The “DataNode” process manages blocks of data on a slave.
      - The “NameNode” process keeps track of file metadata on the master.
  18. Replication
      - Each block is stored on several different disks (default is 3)
      - Hadoop tries to copy blocks to different servers and racks.
      - Protects data against disk, server, rack failures.
      - Reduces the need to move data to code.
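To make the block and replication numbers concrete, here is a minimal arithmetic sketch. It assumes the defaults quoted on the slides (64MB blocks, 3 replicas); the function name `hdfs_footprint` and the 1000MB example file are made up for illustration.

```python
import math

BLOCK_SIZE_MB = 64   # default block size quoted on the Storage Layer slide
REPLICATION = 3      # default replication factor quoted on the Replication slide


def hdfs_footprint(file_size_mb, block_size_mb=BLOCK_SIZE_MB, replication=REPLICATION):
    """Return (logical blocks, stored block replicas, raw MB used) for one file."""
    blocks = math.ceil(file_size_mb / block_size_mb)
    replicas = blocks * replication
    # A partly filled last block only occupies as much disk as it holds,
    # so raw usage is file size * replication, not blocks * block size.
    raw_mb = file_size_mb * replication
    return blocks, replicas, raw_mb


# Hypothetical 1000MB file
blocks, replicas, raw_mb = hdfs_footprint(1000)
print(blocks, replicas, raw_mb)
```

The point of the exercise: every logical megabyte costs roughly three raw megabytes of disk at the default replication factor, which is worth remembering when sizing a cluster.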
  19. Error Recovery
      - Slaves constantly “check in” with the master.
      - Data is automatically replicated if a disk or server “goes away”.
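The "self-healing" idea can be sketched as a toy simulation. This is not how HDFS is implemented; it only assumes that the master knows which nodes hold each block and which nodes are still checking in. All names (`under_replicated`, `re_replicate`, the node and block ids) are hypothetical, and real HDFS also weighs rack placement when picking targets.

```python
REPLICATION = 3  # target number of copies per block


def under_replicated(block_locations, live_nodes, replication=REPLICATION):
    """Return {block: missing copies} for blocks with too few live replicas."""
    missing = {}
    for block, nodes in block_locations.items():
        live = [n for n in nodes if n in live_nodes]
        if len(live) < replication:
            missing[block] = replication - len(live)
    return missing


def re_replicate(block_locations, live_nodes, replication=REPLICATION):
    """Copy under-replicated blocks to live nodes that do not hold them yet."""
    healed = {}
    for block, nodes in block_locations.items():
        live = [n for n in nodes if n in live_nodes]
        candidates = sorted(n for n in live_nodes if n not in live)
        needed = max(0, replication - len(live))
        healed[block] = live + candidates[:needed]
    return healed


locations = {
    "blk_1": ["node-a", "node-b", "node-c"],
    "blk_2": ["node-b", "node-c", "node-d"],
}
live = {"node-a", "node-b", "node-d", "node-e"}  # node-c stopped checking in

print(under_replicated(locations, live))
healed = re_replicate(locations, live)
print(healed)
```

After `re_replicate` runs, every block is back to three live copies without any action from the user, which is the behaviour the slide describes.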
  20. Limitations
      - Optimized for streaming data in/out
        - So no random access to data in a file
        - Data rates ≈ 30% - 50% of max raw disk rate
      - No append support currently
        - Write once, read many
  21. NameNode
      - Runs on master node
      - Is a single point of failure
        - There are no built-in software hot failover mechanisms
      - Maintains filesystem metadata, the “Namespace”
        - files and hierarchical directories
  22. DataNodes
      - [Diagram: several DataNodes, each holding data blocks]
      - Files stored on HDFS are chunked and stored as blocks on DataNodes
      - DataNodes manage the storage attached to the nodes that they run on
      - Data never flows through NameNode, only DataNodes
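The split between metadata (NameNode) and data (DataNodes) can be illustrated with a toy model. The classes below are stand-ins written for this example, not the Hadoop API: the client asks the NameNode where the blocks of a file live, then fetches the bytes directly from the DataNodes.

```python
class DataNode:
    """Holds block contents on its 'local disk' (a dict in this sketch)."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}  # block id -> bytes

    def read_block(self, block_id):
        return self.blocks[block_id]


class NameNode:
    """Holds metadata only: which blocks make up a file and where they live."""

    def __init__(self):
        self.namespace = {}  # path -> ordered list of block ids
        self.locations = {}  # block id -> list of DataNode names

    def get_block_locations(self, path):
        return [(b, self.locations[b]) for b in self.namespace[path]]


def read_file(path, namenode, datanodes):
    """Client read: look up block locations, then pull bytes from DataNodes."""
    data = b""
    for block_id, holders in namenode.get_block_locations(path):
        data += datanodes[holders[0]].read_block(block_id)
    return data


# Two DataNodes each hold one block of a hypothetical two-block file.
dn1, dn2 = DataNode("dn1"), DataNode("dn2")
dn1.blocks["blk_1"] = b"hello "
dn2.blocks["blk_2"] = b"world"

namenode = NameNode()
namenode.namespace["/demo/greeting.txt"] = ["blk_1", "blk_2"]
namenode.locations = {"blk_1": ["dn1"], "blk_2": ["dn2"]}

contents = read_file("/demo/greeting.txt", namenode, {"dn1": dn1, "dn2": dn2})
print(contents)
```

Note that the `NameNode` class never touches file bytes, which mirrors the slide's point that data never flows through the NameNode.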
  23. Map-Reduce Paradigm: A very short Intro to Hadoop (photo by: Graham Racher, flickr)
  24. Definitions
      - Key Value Pair -> two units of data, exchanged between Map & Reduce
      - Map -> The ‘map’ function in the MapReduce algorithm
        - user defined
        - converts each input Key Value Pair to 0...n output Key Value Pairs
      - Reduce -> The ‘reduce’ function in the MapReduce algorithm
        - user defined
        - converts each input Key + all Values to 0...n output Key Value Pairs
      - Group -> A built-in operation that happens between Map and Reduce
        - ensures each Key passed to Reduce includes all Values
  25. All Together
      - [Diagram: several Mappers feed the Shuffle, which feeds several Reducers]
  26. How MapReduce Works
      - Map translates input keys and values to new keys and values
        - [K1,V1] -> Map -> [K2,V2]
      - System Groups each unique key with all its values
        - [K2,V2] -> Group -> [K2,{V2,V2,....}]
      - Reduce translates the values of each unique key to new keys and values
        - [K2,{V2,V2,....}] -> Reduce -> [K3,V3]
  27. Canonical Example - Word Count
      - Word Count
        - Read a document, parse out the words, count the frequency of each word
      - Specifically, in MapReduce
        - With a document consisting of lines of text
        - Translate each line of text into key = word and value = 1
          - e.g. <“the”,1> <“quick”,1> <“brown”,1> <“fox”,1> <“jumped”,1>
        - Sum the values (ones) for each unique word
  28. Word Count data flow (Map and Reduce are user defined)
      - Input [K1,V1] = [n, {k,k,k,k,...}]
        - [1, When in the Course of human events, it becomes]
        - [2, necessary for one people to dissolve the political bands]
        - [3, which have connected them with another, and to assume]
      - Map -> [K2,V2] = [k,1]
        - [When,1] [in,1] [the,1] [Course,1]
      - Group -> [K2,{V2,V2,....}] = [k,{v,v,...}]
        - [When,{1,1,1,1}] [people,{1,1,1,1,1,1}] [dissolve,{1,1,1}] [connected,{1,1,1,1,1}]
      - Reduce -> [K3,V3] = [k,sum(v,v,v,..)]
        - [When,4] [people,6] [dissolve,3] [connected,5]
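The Map, Group and Reduce steps of Word Count can be sketched in a few lines of plain Python. This runs in a single process and is only meant to show the data flow, not Hadoop's API; the function names are made up here, and stripping trailing punctuation is an assumption about how words are parsed. Only the three input lines from the slide are used, so the counts are smaller than the illustrative totals shown on the slide.

```python
from collections import defaultdict


def map_word_count(line_number, line):
    """Map: [K1,V1] = [line number, line text] -> many [K2,V2] = [word, 1]."""
    for token in line.split():
        word = token.strip(",.;:")  # assumption: drop surrounding punctuation
        if word:
            yield word, 1


def group(pairs):
    """Group: collect all values for each unique key, [K2,{V2,V2,...}]."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_word_count(word, ones):
    """Reduce: [K2,{V2,V2,...}] -> [K3,V3] = [word, sum of ones]."""
    yield word, sum(ones)


def word_count(lines):
    mapped = [
        pair
        for number, line in enumerate(lines, start=1)
        for pair in map_word_count(number, line)
    ]
    grouped = group(mapped)
    return dict(
        pair
        for word, ones in sorted(grouped.items())
        for pair in reduce_word_count(word, ones)
    )


lines = [
    "When in the Course of human events, it becomes",
    "necessary for one people to dissolve the political bands",
    "which have connected them with another, and to assume",
]
counts = word_count(lines)
print(counts["the"], counts["to"], counts["When"])
```

Only `map_word_count` and `reduce_word_count` would be user code in a real job; the grouping step is what the framework provides between them.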
  29. Divide & Conquer (splitting data)
      - Because
        - The Map function only cares about the current key and value, and
        - The Reduce function only cares about the current key and its values
      - Then
        - A Mapper can invoke Map on an arbitrary number of input keys and values
          - or just some fraction of the input data set
        - A Reducer can invoke Reduce on an arbitrary number of the unique keys
          - but all the values for that key
  30. Data flow through a job
      - [Diagram: a file is divided into split1, split2, split3, split4, ...; each split feeds a Mapper ([K1,V1] -> [K2,V2]); the Shuffle groups the output ([K2,{V2,V2,...}]); each Reducer emits [K3,V3] into its own file (part-00000, part-00001, ... part-000N) in the output directory]
      - Mappers must complete before Reducers can begin
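The shuffle only works if every value for a given key ends up at the same Reducer. A minimal sketch of that routing follows, assuming a hash-modulo rule. `zlib.crc32` is used here as a stable stand-in hash (Hadoop's default partitioner uses the key's Java hash code), and all function names are hypothetical.

```python
import zlib
from collections import defaultdict


def partition(key, num_reducers):
    """Pick the reducer for a key; the same key always maps to the same reducer."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def shuffle(mapper_outputs, num_reducers):
    """Route every [K2,V2] pair to one reducer and group values by key."""
    reducers = [defaultdict(list) for _ in range(num_reducers)]
    for output in mapper_outputs:
        for key, value in output:
            reducers[partition(key, num_reducers)][key].append(value)
    return reducers


def part_name(reducer_index):
    """Output file name for a reducer, following the part-0000N pattern."""
    return f"part-{reducer_index:05d}"


mapper_outputs = [
    [("the", 1), ("quick", 1), ("the", 1)],  # output of the Mapper for split1
    [("fox", 1), ("the", 1), ("quick", 1)],  # output of the Mapper for split2
]
reducers = shuffle(mapper_outputs, num_reducers=2)

# Each reducer sums its own keys and would write them to its own part file.
parts = {
    part_name(i): {key: sum(values) for key, values in groups.items()}
    for i, groups in enumerate(reducers)
}
print(parts)
```

Because a key lives in exactly one part file, the full result is simply the union of all the part files in the output directory.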
  31. Divide & Conquer (parallelizable)
      - Because
        - Each Mapper is independent and processes part of the whole, and
        - Each Reducer is independent and processes part of the whole
      - Then
        - Any number of Mappers can run on each node, and
        - Any number of Reducers can run on each node, and
        - The cluster can contain any number of nodes
  32. JobTracker
      - [Diagram: the Client submits jobs to the JobTracker]
      - Is a single point of failure
      - Determines # Mapper Tasks from file splits via InputFormat
      - Uses predefined value for # Reducer Tasks
      - Client applications use JobClient to submit jobs and query status
      - Command line: use hadoop job <commands>
      - Web status console: use http://jobtracker-server:50030/
  33. TaskTracker
      - [Diagram: a TaskTracker with mapper child JVMs and reducer child JVMs]
      - Spawns each Task as a new child JVM
      - Max # mapper and reducer tasks set independently
      - Can pass child JVM opts via mapred.child.java.opts
      - Can re-use JVM to avoid overhead of task initialization
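As an illustration of the TaskTracker settings mentioned above, a configuration fragment might look like the following. Only mapred.child.java.opts is named on the slide; the other property names are the ones used by Hadoop 0.20 / 1.x releases for the same settings, and every value is a placeholder rather than a recommendation.

```xml
<!-- mapred-site.xml (Hadoop 0.20 / 1.x property names); values are illustrative only -->
<configuration>
  <!-- Max number of map tasks run at once by one TaskTracker -->
  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>4</value>
  </property>
  <!-- Max number of reduce tasks run at once by one TaskTracker -->
  <property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>2</value>
  </property>
  <!-- JVM options passed to each child task JVM -->
  <property>
    <name>mapred.child.java.opts</name>
    <value>-Xmx512m</value>
  </property>
  <!-- Tasks of the same job that may share one JVM; -1 means no limit -->
  <property>
    <name>mapred.job.reuse.jvm.num.tasks</name>
    <value>-1</value>
  </property>
</configuration>
```

The map and reduce maximums are separate properties, which is what "set independently" refers to; a sensible starting point depends on the cores and memory of each slave.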
  34. Hadoop (sort of) Deep Thoughts: A very short Intro to Hadoop (photo by: Graham Racher, flickr)
  35. Avoiding Hadoop
      - Hadoop is a big hammer - but not every problem is a nail
        - Small data
        - Real-time data
      - Beware the Hadoopaphile
  36. Avoiding Map-Reduce
      - Writing Hadoop code is painful and error prone
        - Hive & Pig are good solutions for People Who Like SQL
        - Cascading is a good solution for complex, stable workflows
  37. Leveraging the Eco-System
      - Many open source projects built on top of Hadoop
        - HBase - scalable NoSQL data store
        - Sqoop - getting SQL data in/out of Hadoop
      - Other projects work well with Hadoop
        - Kafka/Scribe - getting log data in/out of Hadoop
        - Avro - data serialization/storage format
  38. Get Involved
      - Join the mailing list - http://hadoop.apache.org/mailing_lists.html
      - Go to user group meetings - e.g. http://hadoop.meetup.com/
  39. Learn More
      - Buy the book - Hadoop: The Definitive Guide, 2nd edition
      - Try the tutorials
        - http://hadoop.apache.org/common/docs/current/mapred_tutorial.html
      - Get training (danger, personal plug)
        - http://www.scaleunlimited.com/training
        - http://www.cloudera.com/hadoop-training
  40. Resources
      - Scale Unlimited Alumni list - scale-unlimited-alumni@googlegroups.com
      - Hadoop mailing lists - http://hadoop.apache.org/mailing_lists.html
      - Users groups - e.g. http://hadoop.meetup.com/
      - Hadoop API - http://hadoop.apache.org/common/docs/current/api/
      - Hadoop: The Definitive Guide, 2nd edition by Tom White
      - Cascading: http://www.cascading.org
      - Datameer: http://www.datameer.com