MapReduce and Hadoop
Salil Navgire
Big Data Explosion
• 90% of today's data was created in the last 2 years
• Data volume doubles roughly every 18 months, a growth rate often likened to Moore's law
• YouTube: 13 million hours of video and 700 billion views in 2010
• Facebook: 20 TB of new (compressed) data per day
• CERN/LHC: 40 TB/day (15 PB/year)
• Many more examples
Solution: Scalability
How?
Divide and Conquer
Challenges!
• How to assign units of work to the workers?
• What if there are more units of work than workers?
• What if the workers need to share intermediate, incomplete data?
• How do we aggregate such intermediate data?
• How do we know when all the workers have completed their assignments?
• What if some workers fail?
History
• 2000: Apache Lucene: batch index updates and sort/merge with an on-disk index
• 2002: Apache Nutch: a distributed, scalable, open-source web crawler
• 2004: Google publishes the GFS and MapReduce papers
• 2006: Apache Hadoop: an open-source Java implementation of GFS and MapReduce, built to solve Nutch's scaling problem; later becomes a standalone project
What is MapReduce?
• A programming model for distributing a task across multiple nodes
• Used to develop solutions that process large amounts of data in parallel on clusters of computing nodes
• Introduced in the original MapReduce paper from Google
• Features of MapReduce: fault tolerance, status and monitoring tools, and a clean abstraction for programmers
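
The abstraction is just two functions: map turns each input record into (key, value) pairs, the framework groups the pairs by key, and reduce aggregates each group. A toy, single-process sketch of the model in plain Java (Java 16+; the Pair record and the explicit grouping step, which stands in for the framework's shuffle, are illustrative only, not Hadoop API):

    import java.util.Arrays;
    import java.util.List;
    import java.util.Map;
    import java.util.stream.Collectors;
    import java.util.stream.Stream;

    // Toy, single-process illustration of the MapReduce model (not Hadoop itself).
    public class MapReduceModel {

        // A (key, value) pair; illustrative only.
        record Pair(String key, int value) {}

        // map: one input record -> zero or more (key, value) pairs
        static Stream<Pair> map(String line) {
            return Arrays.stream(line.split("\\s+")).map(w -> new Pair(w, 1));
        }

        // reduce: one key plus all of its values -> an aggregate
        static int reduce(String key, List<Integer> values) {
            return values.stream().mapToInt(Integer::intValue).sum();
        }

        public static void main(String[] args) {
            List<String> input = List.of("the quick brown fox", "the lazy dog");

            // The grouping step plays the role of the framework's shuffle/sort.
            Map<String, List<Integer>> grouped = input.stream()
                    .flatMap(MapReduceModel::map)
                    .collect(Collectors.groupingBy(Pair::key,
                            Collectors.mapping(Pair::value, Collectors.toList())));

            grouped.forEach((k, v) -> System.out.println(k + "\t" + reduce(k, v)));
        }
    }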
MapReduce Execution Overview
[Figure: execution diagram from the Google MapReduce paper. The user program forks a master and a pool of workers; the master assigns map tasks over the input splits (Split 0, Split 1, Split 2) and assigns reduce tasks; map workers read their splits and write intermediate results to local disk; reduce workers remotely read and sort those results, then write the final output files (Output File 0, Output File 1).]
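
From the user program's point of view, this whole flow hangs off a single Job handle. A minimal sketch of submitting a job and watching the two phases progress, using Hadoop's org.apache.hadoop.mapreduce.Job API (the job is assumed to be fully configured beforehand, e.g. as in the word-count driver later in this deck):

    import org.apache.hadoop.mapreduce.Job;

    public class JobMonitor {
        // 'job' is assumed to be fully configured (mapper, reducer,
        // input/output paths) before this method is called.
        static void runAndWatch(Job job) throws Exception {
            job.submit();  // hands the work to the cluster and returns immediately

            // Poll the master for progress until both phases finish.
            while (!job.isComplete()) {
                System.out.printf("map %3.0f%%  reduce %3.0f%%%n",
                        job.mapProgress() * 100, job.reduceProgress() * 100);
                Thread.sleep(5000);
            }
            System.out.println(job.isSuccessful() ? "job succeeded" : "job failed");
        }
    }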
Hadoop Components
• Storage: HDFS, a self-healing, high-bandwidth clustered storage layer
• Processing: MapReduce, a fault-tolerant distributed processing layer
HDFS Architecture
HDFS Basics
• HDFS is a filesystem written in Java
• Sits on top of a native filesystem
• Provides redundant storage for massive amounts of data
• Runs on commodity hardware
HDFS Data
• Data is split into blocks and stored on multiple nodes in the cluster
• Each block is usually 64 MB or 128 MB
• Each block is replicated multiple times
• Replicas are stored on different DataNodes
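
Client code can inspect both of these settings for any stored file. A small sketch using Hadoop's Java FileSystem API (the path /data/example.log is hypothetical):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BlockInfo {
        public static void main(String[] args) throws Exception {
            // Picks up fs.defaultFS and friends from the cluster config files.
            FileSystem fs = FileSystem.get(new Configuration());

            // Hypothetical path; substitute a file that exists in your cluster.
            FileStatus st = fs.getFileStatus(new Path("/data/example.log"));
            System.out.println("block size (bytes): " + st.getBlockSize());   // e.g. 64 MB or 128 MB
            System.out.println("replication factor: " + st.getReplication()); // e.g. 3
        }
    }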
2 Types of Nodes
• Master Nodes
• Slave Nodes
Master Node
• NameNode
  • only 1 per cluster
  • metadata server and database
  • the SecondaryNameNode helps with some housekeeping
• JobTracker
  • only 1 per cluster
  • job scheduler
Slave Nodes
• DataNodes
  • 1-4000 per cluster
  • block data storage
• TaskTrackers
  • 1-4000 per cluster
  • task execution
NameNode
• A single NameNode stores all metadata and manages block replication and read/write access to files
• Filenames, the locations on DataNodes of each block, owner, group, etc.
• All information is maintained in RAM for fast lookup
Secondary NameNode
• Performs memory-intensive administrative functions on behalf of the NameNode
• Should run on a separate machine
DataNode
• DataNodes store file contents
• Different blocks of the same file are stored on different DataNodes
• The same block is stored on three (or more) DataNodes for redundancy
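
This placement is visible from client code: HDFS reports, per block, which DataNodes hold a replica. A sketch with Hadoop's FileSystem API (the path is hypothetical, as before):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class WhereAreMyBlocks {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            // Hypothetical path; substitute a real file.
            FileStatus st = fs.getFileStatus(new Path("/data/example.log"));

            // One BlockLocation per block; each lists the DataNodes holding a replica.
            for (BlockLocation b : fs.getFileBlockLocations(st, 0, st.getLen())) {
                System.out.println("offset " + b.getOffset()
                        + ", length " + b.getLength()
                        + ", hosts " + String.join(",", b.getHosts()));
            }
        }
    }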
Word Count Example
• Input
  • Text files
• Output
  • Single file containing (Word <TAB> Count)
• Map Phase
  • Generates (Word, Count) pairs
  • [{a,1}, {b,1}, {a,1}]  [{a,2}, {b,3}, {c,5}]  [{a,3}, {b,1}, {c,1}]
• Reduce Phase
  • For each word, calculates the aggregate count
  • [{a,7}, {b,5}, {c,6}]
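
For reference, a minimal Hadoop implementation of this job, closely following the WordCount example from the Hadoop MapReduce tutorial (input and output paths are taken from the command line):

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        // Map phase: emit (word, 1) for every token in the input split.
        public static class TokenizerMapper
                extends Mapper<Object, Text, Text, IntWritable> {
            private final static IntWritable one = new IntWritable(1);
            private final Text word = new Text();

            public void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, one);
                }
            }
        }

        // Reduce phase: sum the counts for each word.
        public static class IntSumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            private final IntWritable result = new IntWritable();

            public void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable val : values) {
                    sum += val.get();
                }
                result.set(sum);
                context.write(key, result);
            }
        }

        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            // The combiner pre-aggregates per split, producing per-split
            // lists like those shown on the slide above.
            job.setCombinerClass(IntSumReducer.class);
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

Because the slide asks for a single output file, run the job with one reducer (job.setNumReduceTasks(1)); with several parallel reducers the output is split across multiple part files.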
Typical Cluster
• 3-4000 commodity servers
• Each server:
  • 2x quad-core CPUs
  • 16-24 GB RAM
  • 4-12 TB disk space
• 20-30 servers per rack
When Should I Use It?
Good choice for jobs that can be broken into parallel subtasks:
• Indexing/analysis of log files
• Sorting of large data sets
• Image processing / machine learning

Bad choice for serial or low-latency jobs:
• Real-time processing
• Processing-intensive tasks with little data
• Replacing MySQL
Who uses Hadoop?