Hadoop is an open-source software framework for distributed storage and processing of large datasets across clusters of computers. It uses HDFS (the Hadoop Distributed File System) for storage and MapReduce as its programming model for distributed computing. HDFS stores data as blocks replicated across the machines of a cluster, achieving high fault tolerance. MapReduce processes large datasets in parallel by dividing the work into independent map and reduce tasks. Hadoop has seen widespread adoption for applications involving massive datasets and is used by companies such as Yahoo!, Facebook, and Amazon.
4. Hadoop
• It is an open-source software framework
• Licensed under the Apache License 2.0
• Created by Doug Cutting and Mike Cafarella in 2005
• Doug, who was working at Yahoo! at the time, named it after his son's toy elephant
• Derived from the Google MapReduce and Google File System papers
• Written in the Java programming language
5. Why Hadoop?
[Bar chart: Amount of Stored Data By Sector (in Petabytes, 2009) — sector totals of 966, 848, 715, 619, 434, 364, 269, and 227 PB]
• 1 zettabyte = 1 million petabytes = 1 billion terabytes = 1 trillion gigabytes
• If you like analogies… 35 ZB is enough data to fill a stack of DVDs reaching halfway to Mars
Sources: "Big Data: The Next Frontier for Innovation, Competition and Productivity," US Bureau of Labor Statistics | McKinsey Global Institute Analysis
6. Why Hadoop?
• Need to process 100 TB datasets
• On 1 node:
– scanning @ 50 MB/s = 23 days
• On a 1000-node cluster:
– scanning @ 50 MB/s = 33 min
• Need an efficient, reliable, and usable framework
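The arithmetic behind those scan times can be checked with a quick sketch (assuming 100 TB = 100 × 10¹² bytes and a sustained 50 MB/s per node):

```python
# Back-of-the-envelope scan times from the slide above.
DATASET_BYTES = 100 * 10**12   # 100 TB
SCAN_RATE = 50 * 10**6         # 50 MB/s per node, in bytes/second

def scan_time_days(nodes: int) -> float:
    """Time to scan the dataset when reads are spread evenly over `nodes`."""
    seconds = DATASET_BYTES / (SCAN_RATE * nodes)
    return seconds / 86400     # seconds per day

print(f"1 node:     {scan_time_days(1):.1f} days")                   # ~23 days
print(f"1000 nodes: {scan_time_days(1000) * 24 * 60:.0f} minutes")   # ~33 minutes
```

The thousand-fold speedup is exactly linear here because scanning is embarrassingly parallel; the rest of the deck is about making that parallelism reliable and usable.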
7. Distributed File System (DFS)
• Classical model of a file system distributed across multiple machines
• Allows access to files located on a remote host as though working on the local host
• Lets multiple users on multiple machines share files and storage resources
8. Distributed File System (DFS)
• One or more central servers store files that can be accessed,
– with proper authorization rights, by any number of remote clients in the network
• Provides facilities for transparent replication and fault tolerance
• DFS operations should be fast to keep system performance high
– Operations: open, close, read, write file, send and receive file/object
9. Hadoop Distributed File System (HDFS)
• A type of distributed file system
• Originally built as infrastructure for the Apache Nutch web search engine project
• Significant differences from other distributed file systems:
– High fault tolerance
– High throughput
– Easy deployment on low-cost hardware
• Suitable for applications that process massive data
10. HDFS Architecture
• Master-slave architecture
• HDFS master: "NameNode"
– Manages all file system metadata (e.g., which blocks live on which DataNodes)
• Transactions are logged and merged at startup
– Controls read/write access to files
– Maps blocks to DataNodes
– Manages block replication
• HDFS slaves: "DataNodes"
– Notify the NameNode about the block IDs they hold
– Serve read/write requests from clients
– Perform block creation and replication tasks on instruction from the NameNode
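The NameNode's core metadata job can be pictured as a map from block IDs to the DataNodes holding replicas. The following is a toy in-memory sketch of that bookkeeping (names and structure are illustrative, not the real HDFS implementation):

```python
import random

REPLICATION = 3  # HDFS's default replication factor

class NameNode:
    """Toy NameNode: tracks which DataNodes hold which blocks."""

    def __init__(self, datanodes):
        self.datanodes = list(datanodes)
        self.block_map = {}  # block_id -> set of DataNode names

    def allocate_block(self, block_id):
        """Pick REPLICATION distinct DataNodes to hold a new block.
        (Real HDFS placement is rack-aware; this just samples randomly.)"""
        targets = random.sample(self.datanodes, REPLICATION)
        self.block_map[block_id] = set(targets)
        return targets

    def locations(self, block_id):
        """What a client asks the NameNode for before reading a block."""
        return self.block_map[block_id]

nn = NameNode([f"dn{i}" for i in range(5)])
nn.allocate_block("blk_001")
print(nn.locations("blk_001"))  # a set of 3 distinct DataNodes
```

Note that only metadata lives on the NameNode; the block bytes themselves never pass through it.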
13. Getting Files From HDFS
[Diagram: an HDFS client requests a giant file from the NameNode (backed by a BackupNode); the NameNode returns the locations of the file's blocks; the client then streams the blocks directly from the DataNodes]
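The read path in the diagram can be sketched in a few lines: the client gets block locations from a (simulated) NameNode, then streams each block from one of the DataNodes holding a replica. All names and data here are illustrative, not the real HDFS API:

```python
# What the (simulated) NameNode returns to the client for one file.
block_locations = {
    "blk_0": ["dn1", "dn3", "dn4"],
    "blk_1": ["dn2", "dn3", "dn5"],
}

# Block contents held by each (simulated) DataNode.
datanode_storage = {
    "dn1": {"blk_0": b"1100101010"},
    "dn2": {"blk_1": b"0101001010"},
    "dn3": {"blk_0": b"1100101010", "blk_1": b"0101001010"},
    "dn4": {"blk_0": b"1100101010"},
    "dn5": {"blk_1": b"0101001010"},
}

def read_file(block_ids):
    """Reassemble a file by streaming each block from one replica holder."""
    data = b""
    for blk in block_ids:
        replica = block_locations[blk][0]  # pick the first (e.g. closest) replica
        data += datanode_storage[replica][blk]
    return data

print(read_file(["blk_0", "blk_1"]))
```

The key design point visible here is that the NameNode answers only the "where" question; the bulk data transfer happens client-to-DataNode, keeping the NameNode off the data path.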
14. Failures, Failures, Failures
• HDFS was designed with the expectation that failures (both hardware and software) would occur frequently
• Failure types:
– Disk errors and failures
– DataNode failures
– Switch/rack failures
– NameNode failures
– Datacenter failures
15. Fault Tolerance (DataNode Failure)
[Diagram: NameNode and BackupNode above a row of DataNodes, one of which fails]
• The NameNode detects DataNode loss
• Blocks are auto-replicated on the remaining nodes to satisfy the replication factor
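The re-replication step can be sketched as follows: when a DataNode is lost, every block whose replica count drops below the replication factor is copied to another surviving node. This is a simplified illustration, not the real HDFS logic:

```python
REPLICATION = 3

def handle_datanode_loss(block_map, live_nodes, dead_node):
    """block_map: block_id -> set of DataNode names holding a replica."""
    for blk, holders in block_map.items():
        holders.discard(dead_node)           # drop replicas on the dead node
        while len(holders) < REPLICATION:
            candidates = [n for n in live_nodes if n not in holders]
            if not candidates:
                break                        # cluster too small to re-replicate
            holders.add(candidates[0])       # real HDFS picks targets rack-aware

block_map = {"blk_0": {"dn1", "dn2", "dn3"}, "blk_1": {"dn2", "dn4", "dn5"}}
live = ["dn1", "dn3", "dn4", "dn5"]          # dn2 has failed
handle_datanode_loss(block_map, live, "dn2")
print(block_map)  # every block is back at 3 replicas, none on dn2
```

Because any surviving replica can serve as the copy source, recovery work is spread across the cluster rather than bottlenecked on one machine.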
16. Fault Tolerance (NameNode Failure)
[Diagram: NameNode and BackupNode above a row of DataNodes]
• Not an epic failure, because you have the BackupNode
• NameNode loss requires manual intervention
• Automatic failover is in the works
17. Live Horizontal Scaling and Rebalancing
[Diagram: NameNode and BackupNode above a row of DataNodes, with a new DataNode joining]
• The NameNode detects that a new DataNode has been added to the cluster
• Blocks are re-balanced and re-distributed across the nodes
18. HDFS Summary
• Highly scalable
– 1000s of nodes and massive (100s of TB) files
– Large block sizes to maximize sequential I/O performance
• No use of mirroring or RAID. Why?
– Reduces cost
– Uses one mechanism (triply replicated blocks) to deal with a wide variety of failure types, rather than multiple different mechanisms
19. Hadoop MapReduce (MR)
• Programming framework (library and runtime) for analyzing data sets stored in HDFS
• MapReduce jobs are composed of two functions:
– map(): sub-divide & conquer
– reduce(): combine & reduce cardinality
• The user only writes the map and reduce functions
20. Essentially, it's…
1. Take a large problem and divide it into sub-problems
2. Perform the same function on all sub-problems
3. Combine the output from all sub-problems
[Diagram: many parallel DoWork() tasks (the MAP step) feeding into a single combined Output (the REDUCE step)]
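The three steps above can be shown with the classic word-count example, expressed as map and reduce functions. This is a minimal single-process sketch of the model, with the shuffle done by hand; no Hadoop runtime is involved:

```python
from collections import defaultdict

def map_fn(line):
    """Map: emit a (word, 1) pair for every word in one line of input."""
    return [(word, 1) for word in line.split()]

def reduce_fn(word, counts):
    """Reduce: combine all counts emitted for one word."""
    return word, sum(counts)

lines = ["the quick brown fox", "the lazy dog", "the fox"]

# Shuffle phase: group the intermediate (word, count) pairs by key.
grouped = defaultdict(list)
for line in lines:
    for word, count in map_fn(line):
        grouped[word].append(count)

result = dict(reduce_fn(w, c) for w, c in grouped.items())
print(result["the"], result["fox"])  # 3 2
```

In real Hadoop, the framework runs map_fn on many nodes in parallel, performs the shuffle over the network, and partitions the grouped keys among reducers; the user writes only the two functions.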
22. Job Submission
[Diagram: MR client, JobTracker, and a row of TaskTrackers]
1. The client submits jobs to the JobTracker, where they are queued
2. map() tasks are assigned to TaskTrackers (HDFS DataNode locality aware)
3. Mappers are spawned in separate JVMs and execute
4. Mappers store temporary results to the local file system
5. The reduce phase begins, and reducers run on the TaskTrackers
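The locality-aware assignment in step 2 can be sketched as a simple preference rule: the JobTracker tries to run a map task on a TaskTracker whose node already stores the task's input block, and only falls back to a remote node when no local slot is free. This is an illustrative toy, not the real Hadoop scheduler:

```python
def assign_map_task(input_block_nodes, free_tasktrackers):
    """Return the TaskTracker chosen for one map task.

    input_block_nodes: set of node names holding the input block's replicas
    free_tasktrackers: list of node names with a free map slot
    """
    for node in free_tasktrackers:
        if node in input_block_nodes:
            return node              # data-local: no network copy of the input
    return free_tasktrackers[0]      # fallback: remote read over the network

print(assign_map_task({"dn2", "dn4"}, ["dn1", "dn2", "dn3"]))  # dn2
print(assign_map_task({"dn7"}, ["dn1", "dn2"]))                # dn1
```

Moving computation to the data, rather than data to the computation, is what makes the 1000-node scan numbers from slide 6 achievable in practice.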
25. Dealing With Failures
• Like HDFS, the MapReduce framework is designed to be highly fault tolerant
• Worker (map or reduce) failures
– Detected by periodic master pings
– Map or reduce tasks that fail are reset and then given to a different node
– If a node fails after its map tasks have completed, those tasks are redone and all reducers are notified
• Master failure
– If the master fails for any reason, the entire computation is redone
26. Applications
• Hadoop is used in a wide range of applications. Some examples:
– Search (Yahoo!, Amazon, Zvents)
– Log processing (Facebook, Yahoo!, ContextWeb, Joost, Last.fm)
– Recommendation systems (Facebook)
– Data warehousing (Facebook, AOL)
– Video and image analysis (New York Times, Eyealike)
28. References
• "Meet Hadoop! Open Source Grid Computing" by Devaraj Das
• "The Hadoop Distributed File System: Architecture and Design" by Dhruba Borthakur
• Hadoop Design and Architecture: http://hadoop.apache.org/docs/r0.18.0/hdfs_design.pdf
• "An Introduction to the Hadoop Distributed File System": http://www.ibm.com/developerworks/library/wa-introhdfs/
• "Big Data: What's the Big Deal?" by David J. DeWitt and Rimma Nehme, Microsoft Jim Gray Systems Lab