A very short introduction to Hadoop, from the talk I gave at the BigDataCamp held in Washington DC this past November 2011. Some of this content is also covered in the various big data classes we offer via on-site training (see http://www.scaleunlimited.com/training/)
A (very) short intro to Hadoop
1. 1
Scale Unlimited, Inc
A very short
Intro to
Hadoop
photo by: i_pinz, flickr
Copyright (c) 2008-2011 Scale Unlimited. All Rights Reserved. Reproduction or distribution of this document in any form without prior written permission is forbidden.
Monday, December 19, 2011
2. 2
Welcome to Intro to Hadoop
Usually a 4 hour online class
Focus on Why, What, How of Hadoop
3. 3
Meet Your Instructor
Developer of Bixo web mining toolkit
Committer on Apache Tika
Hadoop developer and trainer
Solr developer and trainer
4. 4
Overview
A very short
Intro to
photo by: exfordy, flickr
Hadoop
5. 5
How to Crunch a Petabyte?
Lots of disks, spinning all the time
Redundancy, since disks die
Lots of CPU cores, working all the time
Retry, since network errors happen
6. 6
Hadoop to the Rescue
Scalable - many servers with lots of cores and spindles
Reliable - detect failures, redundant storage
Fault-tolerant - auto-retry, self-healing
Simple - use many servers as one really big computer
7. 7
Logical Architecture
Cluster
Execution
Storage
Logically, Hadoop is simply a computing cluster that provides:
a Storage layer, and
an Execution layer
8. 8
Storage Layer
Hadoop Distributed File System (aka HDFS, or Hadoop DFS)
Runs on top of regular OS file system, typically Linux ext3
Fixed-size blocks (64MB by default) that are replicated
Write once, read many; optimized for streaming in and out
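To make the block math concrete, here is a tiny sketch (plain Python, not part of Hadoop; hdfs_footprint is an invented helper name) of how a file maps onto 64MB blocks with the default 3x replication:

```python
import math

BLOCK_SIZE = 64 * 1024 * 1024   # default HDFS block size (64 MB)
REPLICATION = 3                 # default replication factor

def hdfs_footprint(file_size_bytes):
    """Return (block_count, raw_bytes_stored) for a single file."""
    blocks = math.ceil(file_size_bytes / BLOCK_SIZE)
    # HDFS does not pad the final, partially-full block, so the raw
    # footprint is just the file size times the replication factor.
    return blocks, file_size_bytes * REPLICATION

# A 1 GB file occupies 16 blocks and 3 GB of raw cluster storage.
print(hdfs_footprint(1024 ** 3))
```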
9. 9
Execution Layer
Hadoop Map-Reduce
Responsible for running a job in parallel on many servers
Handles retrying failed tasks and validating that results are complete
Jobs consist of special “map” and “reduce” operations
10. 10
Scalable
[Diagram: a Cluster contains Racks; each Rack contains Nodes; each node's cpu contributes to the Execution layer and its disk to the Storage layer]
Virtual execution and storage layers span many nodes (servers)
Scales linearly (sort of) with cores and disks.
11. 11
Reliable
Each block is replicated, typically three times.
Each task must succeed, or the job fails
12. 12
Fault-tolerant
Failed tasks are automatically retried.
Failed data transfers are automatically retried.
Servers can join and leave the cluster at any time.
13. 13
Simple
[Diagram: many Nodes, whose cpus form a single Execution layer and whose disks form a single Storage layer]
Reduces complexity
Conceptual “operating system” that spans many CPUs & disks
14. 14
Typical Hadoop Cluster
Has one “master” server - high quality, beefy box.
NameNode process - manages file system
JobTracker process - manages tasks
Has multiple “slave” servers - commodity hardware.
DataNode process - manages file system blocks on local drives
TaskTracker process - runs tasks on server
Uses high speed network between all servers
15. 15
Architectural Components
Solid boxes are unique applications
Dashed boxes are child JVM instances (on the same node as their parent)
Dotted boxes are blocks of managed files (on the same node as their parent)
[Diagram: on the Master, a Client submits jobs to the JobTracker, and the NameNode manages file metadata; on the Slaves, the JobTracker hands tasks to each TaskTracker, which spawns mapper and reducer child JVMs, while DataNodes hold the data blocks]
16. 16
Distributed File
System
A very short
Intro to
photo by: Graham Racher, flickr
Hadoop
17. 17
Virtual File System
Treats many disks on many servers as one huge, logical volume
Data is stored in 1...n blocks
The “DataNode” process manages blocks of data on a slave.
The “NameNode” process keeps track of file metadata on the master.
18. 18
Replication
Each block is stored on several different disks (default is 3)
Hadoop tries to copy blocks to different servers and racks.
Protects data against disk, server, rack failures.
Reduces the need to move data to code.
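The rack-aware idea can be sketched in a few lines; this is a toy illustration of the placement shape (one replica near the writer, the other two on a different rack), not HDFS's actual placement code:

```python
def place_replicas(racks, writer_rack):
    """racks: dict mapping rack name -> list of node names.
    Returns three nodes: one on the writer's rack, two on another rack,
    so a single rack failure cannot take out all copies of a block."""
    first = racks[writer_rack][0]                      # replica 1: local rack
    other_rack = next(r for r in racks if r != writer_rack)
    second, third = racks[other_rack][:2]              # replicas 2+3: remote rack
    return [first, second, third]

racks = {"rack1": ["node1", "node2"], "rack2": ["node3", "node4"]}
print(place_replicas(racks, "rack1"))  # ['node1', 'node3', 'node4']
```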
19. 19
Error Recovery
Slaves constantly “check in” with the master.
Data is automatically replicated if a disk or server “goes away”.
20. 20
Limitations
Optimized for streaming data in/out
So no random access to data in a file
Data rates ≈ 30% - 50% of max raw disk rate
No append support currently
Write once, read many
21. 21
NameNode
Runs on master node
Is a single point of failure
There is no built-in hot-failover mechanism
Maintains filesystem metadata, the “Namespace”
files and hierarchical directories
22. 22
DataNodes
[Diagram: several DataNodes, each holding data blocks]
Files stored in HDFS are chunked and stored as blocks on DataNodes
Each DataNode manages the storage attached to the node it runs on
Data never flows through the NameNode, only through DataNodes
23. 23
Map-Reduce
Paradigm
A very short
Intro to
photo by: Graham Racher, flickr
Hadoop
24. 24
Definitions
Key Value Pair -> two units of data, exchanged between Map & Reduce
Map -> The ‘map’ function in the MapReduce algorithm
user defined
converts each input Key Value Pair to 0...n output Key Value Pairs
Reduce -> The ‘reduce’ function in the MapReduce algorithm
user defined
converts each input Key + all Values to 0...n output Key Value Pairs
Group -> A built-in operation that happens between Map and Reduce
ensures each Key passed to Reduce includes all Values
25. 25
All Together
[Diagram: many Mappers feed the Shuffle stage, which feeds the Reducers]
26. 26
How MapReduce Works
Map translates input keys and values to new keys and values
[K1,V1] → Map → [K2,V2]
The system Groups each unique key with all of its values
[K2,V2] → Group → [K2,{V2,V2,...}]
Reduce translates the values of each unique key to new keys and values
[K2,{V2,V2,...}] → Reduce → [K3,V3]
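The three steps above can be sketched in a few lines of plain Python; this simulates the data flow only (run_mapreduce is an invented name, not the Hadoop API):

```python
from collections import defaultdict

def run_mapreduce(records, map_fn, reduce_fn):
    # Map phase: each [K1,V1] record yields 0...n [K2,V2] pairs.
    grouped = defaultdict(list)
    for k1, v1 in records:
        for k2, v2 in map_fn(k1, v1):
            grouped[k2].append(v2)      # Group: collect all values per key
    # Reduce phase: one call per unique key, with all of its values.
    out = []
    for k2, values in grouped.items():
        out.extend(reduce_fn(k2, values))
    return out
```

For example, word count is map_fn = "split a line into (word, 1) pairs" and reduce_fn = "sum the values".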
27. 27
Canonical Example - Word Count
Word Count
Read a document, parse out the words, count the frequency of each word
Specifically, in MapReduce
With a document consisting of lines of text
Translate each line of text into key = word and value = 1
e.g. <“the”,1> <“quick”,1> <“brown”,1> <“fox”,1> <“jumped”,1>
Sum the values (ones) for each unique word
28. 28
[K1,V1] = [line number, line of text] (in general, line n holds words {k,k,k,k,...}):
[1, When in the Course of human events, it becomes]
[2, necessary for one people to dissolve the political bands]
[3, which have connected them with another, and to assume]
User-defined Map → [K2,V2]:
[When,1] [in,1] [the,1] [Course,1] ... [k,1]
Group → [K2,{V2,V2,...}]:
[When,{1,1,1,1}] [people,{1,1,1,1,1,1}] [dissolve,{1,1,1}] [connected,{1,1,1,1,1}] ... [k,{v,v,...}]
User-defined Reduce → [K3,V3]:
[When,4] [people,6] [dissolve,3] [connected,5] ... [k,sum(v,v,...)]
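The same flow as runnable code (plain Python, not the Hadoop API), using the slide's first line of input plus a padding line so the sums are easy to check:

```python
from collections import defaultdict

lines = [(1, "When in the Course of human events it becomes"),
         (2, "When When When")]        # extra line so "When" sums to 4

# Map: each line becomes (word, 1) pairs; Group: collect the ones per word.
grouped = defaultdict(list)
for line_no, text in lines:
    for word in text.split():
        grouped[word].append(1)

# Reduce: sum the ones for each unique word.
counts = {word: sum(ones) for word, ones in grouped.items()}
print(counts["When"])   # 4
```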
29. 29
Divide & Conquer (splitting data)
Because
The Map function only cares about the current key and value, and
The Reduce function only cares about the current key and its values
Then
A Mapper can invoke Map on an arbitrary number of input keys and values
or just some fraction of the input data set
A Reducer can invoke Reduce on an arbitrary number of the unique keys
but all the values for that key
30. 30
[K1,V1] → [K2,V2] → [K2,{V2,V2,...}] → [K3,V3]
[Diagram: the input file's splits (split1, split2, split3, split4, ...) feed the Mappers; the Shuffle routes each key's values to one Reducer; Mappers must complete before Reducers can begin; each Reducer writes one part file (part-00000, part-00001, ... part-000N) into the output directory]
31. 31
Divide & Conquer (parallelizable)
Because
Each Mapper is independent and processes part of the whole, and
Each Reducer is independent and processes part of the whole
Then
Any number of Mappers can run on each node, and
Any number of Reducers can run on each node, and
The cluster can contain any number of nodes
32. 32
JobTracker
[Diagram: a Client submits jobs to the JobTracker]
Is a single point of failure
Determines # Mapper Tasks from file splits via InputFormat
Uses predefined value for # Reducer Tasks
Client applications use JobClient to submit jobs and query status
Command line: hadoop job <commands>
Web status console: http://jobtracker-server:50030/
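The split arithmetic can be sketched as follows; this is a simplification under stated assumptions (by default the split size equals the HDFS block size, each input file is split independently, and num_map_tasks is an invented name):

```python
import math

SPLIT_SIZE = 64 * 1024 * 1024   # default: one split per 64 MB HDFS block

def num_map_tasks(input_file_sizes):
    """One map task per split; every non-empty file gets at least one."""
    return sum(max(1, math.ceil(size / SPLIT_SIZE))
               for size in input_file_sizes)

# Two input files of 200 MB and 10 MB -> 4 + 1 = 5 map tasks.
print(num_map_tasks([200 * 1024 * 1024, 10 * 1024 * 1024]))
```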
33. 33
TaskTracker
[Diagram: a TaskTracker spawning mapper and reducer tasks, each in its own child JVM]
Spawns each Task as a new child JVM
Max # mapper and reducer tasks set independently
Can pass child JVM opts via mapred.child.java.opts
Can re-use JVM to avoid overhead of task initialization
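For example, these knobs live in mapred-site.xml (property names from the Hadoop 0.20/1.x era this deck covers; the values shown are illustrative, not recommendations):

```xml
<!-- mapred-site.xml (illustrative values) -->
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>4</value>          <!-- max concurrent map tasks per TaskTracker -->
</property>
<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>2</value>          <!-- max concurrent reduce tasks per TaskTracker -->
</property>
<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx512m</value>   <!-- JVM options for each child task -->
</property>
<property>
  <name>mapred.job.reuse.jvm.num.tasks</name>
  <value>10</value>         <!-- reuse a JVM for up to 10 tasks; -1 = unlimited -->
</property>
```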
34. 34
Hadoop (sort of)
Deep Thoughts
A very short
Intro to
photo by: Graham Racher, flickr
Hadoop
35. 35
Avoiding Hadoop
Hadoop is a big hammer - but not every problem is a nail
Small data
Real-time data
Beware the Hadoopaphile
36. 36
Avoiding Map-Reduce
Writing raw Hadoop MapReduce code is painful and error-prone
Hive & Pig are good solutions for People Who Like SQL
Cascading is a good solution for complex, stable workflows
37. 37
Leveraging the Eco-System
Many open source projects built on top of Hadoop
HBase - scalable NoSQL data store
Sqoop - getting SQL data in/out of Hadoop
Other projects work well with Hadoop
Kafka/Scribe - getting log data in/out of Hadoop
Avro - data serialization/storage format
38. 38
Get Involved
Join the mailing list - http://hadoop.apache.org/mailing_lists.html
Go to user group meetings - e.g. http://hadoop.meetup.com/
39. 39
Learn More
Buy the book - Hadoop: The Definitive Guide, 2nd edition
Try the tutorials
http://hadoop.apache.org/common/docs/current/mapred_tutorial.html
Get training (danger, personal plug)
http://www.scaleunlimited.com/training
http://www.cloudera.com/hadoop-training
40. 40
Resources
Scale Unlimited Alumni list - scale-unlimited-alumni@googlegroups.com
Hadoop mailing lists - http://hadoop.apache.org/mailing_lists.html
Users groups - e.g. http://hadoop.meetup.com/
Hadoop API - http://hadoop.apache.org/common/docs/current/api/
Hadoop: The Definitive Guide, 2nd edition by Tom White
Cascading: http://www.cascading.org
Datameer: http://www.datameer.com