This document provides an overview of large-scale data processing and storage. It introduces the MapReduce programming model and its implementation in Hadoop, and discusses using MapReduce for web mining tasks. It also describes the Hadoop Distributed File System (HDFS) and how data is stored and accessed in it. It explains benchmarking of Hadoop using terabyte and petabyte sorting, discusses cloud computing with Elastic MapReduce on AWS, and finally summarizes BigTable, a distributed storage system from Google.
1. Large Scale Data Processing and
Storage
Ilayaraja Prabakaran
Product Engineer
ilayaraja@rediff.co.in
2. Agenda
Introduction to large data problem
MapReduce programming model
Web mining using MapReduce
MapReduce with Hadoop
Hadoop Distributed File System
Elastic MapReduce
Scalable storage architecture
7. Internet 2009 !
Websites
234 million – The number of websites by December 2009.
47 million – Websites added during 2009.
Social Media
126 million – The number of blogs on the Internet (as
tracked by BlogPulse).
27.3 million – Number of tweets on Twitter per day
(November, 2009)
350 million – People on Facebook.
8. Internet 2009 !
Images
4 billion – Photos hosted by Flickr (October 2009).
2.5 billion – Photos uploaded each month to Facebook.
Videos
1 billion – The total number of videos YouTube serves in
one day.
924 million – Videos viewed per month on Hulu in the US
(November 2009).
9. The good news is that “Big Data” is here.
The bad news is that we are struggling to store and
analyze it.
Anyway, should you worry about it?
10. 3 papers ..
Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung, The
Google File System, 19th ACM Symposium on Operating
Systems Principles, Lake George, NY, October, 2003.
Jeffrey Dean and Sanjay Ghemawat,
MapReduce: Simplified Data Processing on Large Clusters,
OSDI'04: Sixth Symposium on Operating System Design and
Implementation, San Francisco, CA, December, 2004.
Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh,
Deborah A. Wallach, Mike Burrows, Tushar Chandra, Andrew
Fikes, and Robert E. Gruber, Bigtable: A Distributed Storage
System for Structured Data, OSDI'06: Seventh Symposium on
Operating System Design and Implementation, Seattle, WA,
November, 2006.
12. MapReduce
A programming model for processing multi-
terabyte datasets on hundreds of CPUs in
parallel.
MapReduce provides:
- Automatic parallelization and distribution
- Fault tolerance
- I/O scheduling
- Status and Monitoring
13. Programming model
Input & Output: set of key/value pairs
Programmer specifies two functions:
map(in_key, in_value) -> list(out_key, intermediate_value)
- Processes an input key/value pair
- Produces a set of intermediate pairs
reduce(out_key, list(intermediate_value)) -> list(out_key, out_value)
- Combines all intermediate values for a
particular key
- Produces a set of merged output values
(usually just one)
17. Sam’s Mother
Believed “an apple a day keeps the
doctor away”
[Slide images: Mother, Sam, an apple]
Ref. SALSA HPC Group at Community Grids Labs
18. One day
Sam thought of drinking the apple
He used a [knife] to cut
the [apple] and a [juicer] to
make juice.
19. Next Day
Sam applied his invention to all the fruits he
could find in the fruit basket
(map ‘([apple] [orange] [pineapple] …))
A list of values mapped into another
list of values, which gets reduced into
a single value
(a → [apple juice], o → [orange juice], p → [pineapple juice], …)
reduce: the classical notion of MapReduce in
functional programming
20. 18 Years Later
Sam got his first job in JuiceRUs for his talent in
making juice
Wait!
Now, it’s not just one basket
but a whole container of fruits
Large data, and a list of values as
output
Also, they produce a list of
juice types separately
But Sam had just ONE [knife]
and ONE [juicer]
NOT ENOUGH !!
21. Brave Sam
Implemented a parallel version of his innovation
(a, [juice]), (o, [juice]), (p, [juice]), … from one worker
(a, [juice]), (o, [juice]), (p, [juice]), … from another
Grouped by key:
each input to a reduce is a (key, value-list) pair
(possibly a list of these, depending on the
grouping/hashing mechanism)
e.g. a, ([apple juice], [apple juice], …)
Reduced into a list of values
22. Brave Sam
Implemented a parallel version of his innovation
A list of (key, value) pairs is mapped into
another list of (key, value) pairs, which gets
grouped by key and reduced into a list of
values
The idea of MapReduce in Data
Intensive Computing
23. Word Count
• map(String input_key, String input_value):
// input_key: document name
// input_value: document contents
for each word w in input_value:
EmitIntermediate(w, "1");
• reduce(String output_key, Iterator intermediate_values):
// output_key: a word
// intermediate_values: a list of counts
int result = 0;
for each v in intermediate_values:
result += ParseInt(v);
Emit(output_key, AsString(result));
24. Word Count: Example
Input: a rose is a rose is a rose is a rose
Map output:
(a,1) (rose,1) (is,1) (a,1) (rose,1) (is,1) (a,1) (rose,1) (is,1) (a,1) (rose,1)
Grouped by key:
a → (1,1,1,1)
rose → (1,1,1,1)
is → (1,1,1)
Reduce output:
(a,4) (rose,4) (is,3)
28. Web Graph: MapReduce
• map(String input_key, String input_value):
// input_key: from-url
// input_value: document contents
for each outlink x in input_value: // parsed page contents
to-url = x.url // outgoing link
anchor = x.anchor // click-able text
from-url = input_key
EmitIntermediate(to-url, (from-url, anchor));
29. Web Graph: MapReduce
• reduce(String output_key, Iterator
intermediate_values):
// output_key: a to-url
// output_values: a list of InLinks,
// i.e. (from-url, anchor) pairs
result = new InLinks( )
for each v in intermediate_values:
result.add(v.url, v.anchor)
Emit(output_key, result);
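A hedged Java sketch of this link-inversion map step, assuming the classic
org.apache.hadoop.mapred API with an input format that supplies (from-url,
page contents) as Text pairs; the crude regex stands in for a real HTML
parser, and the tab-separated value encoding is an assumption, not the
deck's code.
import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// For every <a href="...">anchor</a> in a page, emit
// (to-url, "from-url <TAB> anchor"), inverting the link graph.
public class LinkInverter extends MapReduceBase
    implements Mapper<Text, Text, Text, Text> {
  private static final Pattern LINK =
      Pattern.compile("<a\\s+href=\"([^\"]+)\"[^>]*>([^<]*)</a>",
                      Pattern.CASE_INSENSITIVE);

  public void map(Text fromUrl, Text contents,
                  OutputCollector<Text, Text> output,
                  Reporter reporter) throws IOException {
    Matcher m = LINK.matcher(contents.toString());
    while (m.find()) {
      String toUrl = m.group(1);   // outgoing link
      String anchor = m.group(2);  // click-able text
      output.collect(new Text(toUrl),
                     new Text(fromUrl.toString() + "\t" + anchor));
    }
  }
}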
32. Anchor text mining: MapReduce
map(key,value)
Key: to-url; value: Inlinks
for each inlink ‘i’ in value:
for each n-gram ‘ng’ in i.anchor:
score = calc_rank(ng)
emit( (to-url, ng), score )
33. Anchor text mining: MapReduce
reduce(key,values)
Key: (to-url, ng) pair; values: an iterator over
scores
agg_score = 0
for each score ‘s’ in values:
agg_score = agg_score + s
emit( (to-url, ng), agg_score )
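A hedged Java sketch of this reducer; encoding the (to-url, ng) pair as a
single tab-separated Text key is an assumed convention for composite keys,
not necessarily what the original code did.
import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Sums the per-inlink scores for each (to-url, n-gram) composite key.
public class ScoreSumReducer extends MapReduceBase
    implements Reducer<Text, DoubleWritable, Text, DoubleWritable> {
  public void reduce(Text toUrlAndNgram, Iterator<DoubleWritable> scores,
                     OutputCollector<Text, DoubleWritable> output,
                     Reporter reporter) throws IOException {
    double aggScore = 0.0;
    while (scores.hasNext()) {
      aggScore += scores.next().get();   // agg_score = agg_score + s
    }
    output.collect(toUrlAndNgram, new DoubleWritable(aggScore));
  }
}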
35. Hadoop
Created by Doug Cutting
Originated for Apache Nutch
Why Hadoop?
Doug Cutting: “The name my kid gave a stuffed yellow
elephant. Short, relatively easy to spell and pronounce,
meaningless, and not used elsewhere: those are my naming
criteria. Kids are good at generating such.”
36. Implementation
Hadoop: MapReduce APIs
HDFS: Storage
Mapper Interface
map(WritableComparable key, Writable value,
OutputCollector output, Reporter reporter)
Reducer Interface
reduce(WritableComparable key, Iterator values,
OutputCollector output, Reporter reporter)
Programmers just have to override these
methods, which makes life easier!
Hadoop takes care of splitting the work, data flow,
execution, failure handling, and so on.
(A word-count sketch against these interfaces follows.)
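To make this concrete, here is a minimal word-count sketch written against
the classic (pre-0.20) org.apache.hadoop.mapred interfaces named above. It
assumes plain text input, where the framework hands map a byte offset as
the key and one line of text as the value; the class names are
illustrative, not the deck's actual code.
import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class WordCount {
  // Mapper: for every word in the line, emit (word, 1).
  public static class Map extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();
    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output,
                    Reporter reporter) throws IOException {
      StringTokenizer tokens = new StringTokenizer(value.toString());
      while (tokens.hasMoreTokens()) {
        word.set(tokens.nextToken());
        output.collect(word, ONE);
      }
    }
  }
  // Reducer: sum the 1s for each word and emit (word, count).
  public static class Reduce extends MapReduceBase
      implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output,
                       Reporter reporter) throws IOException {
      int sum = 0;
      while (values.hasNext()) {
        sum += values.next().get();
      }
      output.collect(key, new IntWritable(sum));
    }
  }
}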
41. Combiner
Performs local aggregation of the
intermediate map outputs.
Cuts down the amount of data transferred
from the Mapper to the Reducer.
Example (input “a rose is a rose is a rose”, two mappers;
driver wiring is sketched below):
Map output:
Mapper 1: (a,1) (rose,1) (is,1) (a,1)
Mapper 2: (rose,1) (is,1) (a,1) (rose,1)
Combiner output:
Mapper 1: a: (1,1) into (a,2); rose: (1) into (rose,1); is: (1) into (is,1)
Mapper 2: rose: (1,1) into (rose,2); is: (1) into (is,1); a: (1) into (a,1)
Reduce output: (a,3) (rose,3) (is,2)
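A hedged driver sketch that wires up the word-count classes from the
earlier sketch and registers the Reducer as the Combiner; reusing the
reducer is safe here only because summing is associative and commutative.
Input and output paths come from the command line.
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class WordCountDriver {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(WordCountDriver.class);
    conf.setJobName("wordcount");
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);
    conf.setMapperClass(WordCount.Map.class);
    conf.setCombinerClass(WordCount.Reduce.class); // local map-side aggregation
    conf.setReducerClass(WordCount.Reduce.class);
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    JobClient.runJob(conf); // submit and wait for completion
  }
}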
42. Variations
Identity Reducer
- Zero reduce tasks
- Examples:
“Cleaning a web link graph”
“Populating HDFS from other data sources”
- The map does the job and writes its output straight to HDFS.
MapReduce Chain
- For problems that are not solvable by a single map and
reduce phase.
- A series of map and reduce functions is defined;
- the output of each job becomes the input to the next.
(Both variations are sketched after this list.)
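A hedged sketch of both variations against the same classic API, assuming
the JobConf objects are already fully configured elsewhere; the
intermediate path is illustrative.
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class Variations {
  // Identity-reducer variant: with zero reduce tasks there is no shuffle
  // or sort, and the map output is written directly to HDFS.
  public static void makeMapOnly(JobConf conf) {
    conf.setNumReduceTasks(0);
  }
  // Chain variant: run two jobs in sequence, pointing the second job's
  // input at the first job's output directory.
  public static void runChain(JobConf first, JobConf second) throws Exception {
    Path intermediate = new Path("/tmp/stage1");
    FileOutputFormat.setOutputPath(first, intermediate);
    JobClient.runJob(first); // blocks until stage 1 completes
    FileInputFormat.setInputPaths(second, intermediate);
    JobClient.runJob(second);
  }
}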
43. Streaming
Allows you to write map/reduce in any
programming language,
e.g. Python, C++, Perl, Bash.
I/O is represented textually:
read from stdin and written to stdout as
tab-separated key/value pairs.
Format: key \t value \n
$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
  -input myInputDirs \
  -output myOutputDir \
  -mapper myPythonMapper.py \
  -reducer myPythonReducer.py
44. Pipes
An API that provides strong coupling
between C++ code and Hadoop.
Improved performance over
streaming.
Key and value pairs are STL strings.
APIs: getInputKey(), getInputValue()
bin/hadoop pipes -input inputPath -output outputPath \
  -program path/to/pipes/program/executable
46. HDFS design principles
Handling hardware failures
Streaming data access
Storing very large files
Running on clusters of commodity
hardware
Simple coherency model
Data locality
Portability
50. HDFS Robustness
NameNode failure, DataNode failure,
and network partitions
Heartbeats and Re-replication
Cluster Rebalancing
Data Integrity: checksum
Metadata disk failure: FsImage, Editlog
Snapshots
52. Map/Reduce Processes
Launching Application
- User application code
- Submits a specific kind of Map/Reduce job
JobTracker
- Handles all jobs
- Makes all scheduling decisions
TaskTracker
- Manager for all tasks on a given node
Task
- Runs an individual map or reduce fragment
- Forks from the TaskTracker
54. Job Control Flow
Application launcher creates and submits job.
JobTracker initializes job, creates FileSplits, and
adds tasks to queue.
TaskTrackers ask for a new map or reduce task
every 10 seconds or when the previous task
finishes.
As tasks run, the TaskTracker reports status to
the JobTracker every 10 seconds.
Application launcher stops waiting when the job
completes.
59. Jim Gray’s Sort Benchmark
Started by Jim Gray at Microsoft in 1998
Currently managed by 3 of the previous
winners
Sorting varying numbers of 100-byte
records
- 10 byte key
- 90 byte value
Multiple variants:
Minute Sort: sort must finish within 60 seconds
Terabyte Sort: sort 10^12 bytes
Gray Sort: sort ≥ 10^14 bytes, taking ≥ 1 hour
60. Hadoop won Terabyte Sort ☺
Hadoop won this in 2008
Took 209 seconds to complete
910 nodes, 1800 maps and 1800
reduces
2 quad-core Xeons @ 2.0 GHz per
node
8 GB RAM per node
64. Notes on Petabyte Sort
80,000 maps and 20,000 reduces
Each node ran 2 maps and 2 reduces at a
time
Tail of maps was 100 minutes
Tail of reduces was 80 minutes
- caused by one slow node
Used speculative execution
- the “wasted” tasks at the end were mostly
speculative executions
67. Definition &amp; Characteristics
“A pool of highly scalable, abstracted infrastructure,
capable of hosting end-customer applications,
that is billed by consumption”
Characteristics:
Dynamic computing infrastructure
Service-centric approach
Self service based usage model
Minimally or self-managed platform
Consumption based billing
69. Elastic MapReduce (EMR)
Automatically spins up a Hadoop implementation
of the MapReduce framework on an EC2 cluster.
Sub-divides the data in a job flow into smaller
chunks so that they can be processed in parallel
(the “map” function).
Recombines the processed data into the final
solution (the “reduce” function).
S3 serves as the source and destination of input and
output data, respectively.
Easy-to-use console for launching jobs with
dynamic configuration. (An illustrative command-line
invocation follows.)
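For scripted launches, a hedged sketch of a streaming job using Amazon's
elastic-mapreduce command-line client of that era; the bucket names and
script paths are illustrative, and the exact flags should be checked
against the AWS documentation.
elastic-mapreduce --create --name "anchor-mining" \
  --stream \
  --num-instances 4 \
  --input s3n://my-bucket/input \
  --output s3n://my-bucket/output \
  --mapper s3n://my-bucket/mapper.py \
  --reducer s3n://my-bucket/reducer.py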
71. Motivation
Lots of (semi-)structured data
– URLs:
• Contents, crawl metadata, links, anchors,
pagerank, …
– Per-user Data:
• User preference settings, recent queries/search
results, …
– Geographic locations:
• Physical entities (shops, restaurants, etc.),
roads, satellite image data, …
Scale is large
– Billions of URLs, many versions/page(~20K/version)
– Hundreds of millions of users, thousands of q/sec
– 100TB+ of satellite image data
72. Why not just use commercial DB?
Scale is too large for most commercial databases
Even if it weren’t, cost would be very high
– Building internally means system can be
applied across many projects for low
incremental cost
Low-level storage optimizations help
performance significantly
– Much harder to do when running on top of a
database layer
– Also fun and challenging to build large-scale
systems ☺
73. Goals
Want asynchronous processes to be
continuously updating different pieces of data
– Want access to most current data at any time
Need to support
– Very high read/write rates (millions of ops per
second)
– Efficient scans over all or interesting subsets
of data
Often want to examine data changes over time
– E.g. Contents of a web page over multiple
crawls
74. BigTable
Distributed multi-level map
– With an interesting data model
Fault-tolerant, persistent
Scalable
– Thousands of servers
– Terabytes of in-memory data
– Petabytes of disk-based data
– Millions of reads/writes per second, efficient
scans
Self-managing
– Servers can be added/removed dynamically
– Servers adjust to load imbalance
75. HBase / Hypertable
Both use a data model similar to BigTable's:
a sparse, distributed, persistent, multi-
dimensional sorted map
The map is indexed by
- row key
- column key
- timestamp
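Written as a type signature, as in the BigTable paper:
(row:string, column:string, time:int64) -> string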
79. Range Server
Manages ranges of table data
Caches updates in memory (CellCache)
Periodically spills (compacts) cached updates to
disk (CellStore)
hypertable.org
80. Master
Single Master (hot standbys)
Directs meta operations
– CREATE TABLE
– DROP TABLE
– ALTER TABLE
Handles recovery of RangeServer
Manages RangeServer Load Balancing
Client data does not move through Master
hypertable.org
81. Hyperspace
Chubby equivalent
– Distributed Lock Manager
– Filesystem for storing small amounts of
metadata
– Highly available
“Root of distributed data structures”
hypertable.org
82. Optimizations
Compression: Cell Store blocks are compressed
Caching: Block Cache, Query Cache
Bloom Filter: quickly indicates when a key is not present
Access Groups: minimize I/O by exploiting locality
85. References
Jeffrey Dean and Sanjay Ghemawat, MapReduce: Simplified
Data Processing on Large Clusters
SALSA HPC Group at Community Grids Labs
http://code.google.com/edu/parallel/mapreduce-tutorial.html
http://developer.yahoo.net/blogs/hadoop/2009/05/hadoop_sorts_a_petabyte_in_162.html
http://hadoop.apache.org/
http://aws.amazon.com
http://www.emc.com
http://pingdom.com/