Large Scale Data Processing and
Storage

           Ilayaraja Prabakaran
                Product Engineer

              ilayaraja@rediff.co.in
Agenda
 Introduction to large data problem
 MapReduce programming model
 Web mining using MapReduce
 MapReduce with Hadoop
 Hadoop Distributed File System
 Elastic MapReduce
 Scalable storage architecture
Large Data !
Internet 2009 !
 Websites
   234 million - The number of websites by December 2009.
   47 million - Websites added in 2009


 Social Media
   126 million – The number of blogs on the Internet (as
   tracked by BlogPulse).
   27.3 million – Number of tweets on Twitter per day
   (November, 2009)
   350 million – People on Facebook.
Internet 2009 !
 Images
   4 billion – Photos hosted by Flickr (October 2009).
   2.5 billion – Photos uploaded each month to Facebook.


 Videos
   1 billion – The total number of videos YouTube serves in
   one day.
   924 million – Videos viewed per month on Hulu in the US
   (November 2009).
The good news is that “Big Data” is here.

The bad news is that we are struggling to store and
                   analyze it.

    Anyway, should you worry about it?
3 papers ..
 Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung, The
 Google File System, 19th ACM Symposium on Operating
 Systems Principles, Lake George, NY, October, 2003.
 Jeffrey Dean and Sanjay Ghemawat,
 MapReduce: Simplified Data Processing on Large Clusters,
 OSDI'04: Sixth Symposium on Operating System Design and
 Implementation, San Francisco, CA, December, 2004.
 Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh,
 Deborah A. Wallach, Mike Burrows, Tushar Chandra, Andrew
 Fikes, and Robert E. Gruber, Bigtable: A Distributed Storage
 System for Structured Data, OSDI'06: Seventh Symposium on
 Operating System Design and Implementation, Seattle, WA,
 November, 2006.
Open-source Solutions

    MapReduce  ->  Hadoop MapReduce

     GFS       ->  HDFS (Hadoop Distributed File System)

    BigTable   ->  HBase / Hypertable
MapReduce
A programming model for processing
multi-terabyte data on hundreds of CPUs
in parallel.
MapReduce provides:
- Automatic parallelization and distribution
- Fault tolerance
- I/O scheduling
- Status and Monitoring
Programming model
  Input & Output: set of key/value pairs
  Programmer specifies two functions:
map(in_key, in_value) -> list(out_key, intermediate_value)
  Processes input key/value pair
  Produces set of intermediate pairs
reduce(out_key, list(intermediate_value)) -> list(out_key, out_value)
  Combines all intermediate values for a
  particular key
  Produces a set of merged output values
  (usually just one)
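
A toy, single-process sketch in Python (not part of the talk) to make the two signatures above concrete; run_mapreduce is a made-up helper standing in for what a real framework like Hadoop does with distributed execution:

  # A minimal in-memory model of MapReduce: map every input pair, group the
  # intermediate pairs by key, then reduce each group.
  from collections import defaultdict

  def run_mapreduce(records, mapper, reducer):
      intermediate = defaultdict(list)
      for in_key, in_value in records:                 # map phase
          for out_key, value in mapper(in_key, in_value):
              intermediate[out_key].append(value)
      results = []                                     # reduce phase (after grouping by key)
      for out_key, values in intermediate.items():
          results.extend(reducer(out_key, values))
      return results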
Execution
Parallel Execution
Example



     Thinking in MapReduce
Sam’s Mother
        Believed “an apple a day keeps a doctor away”

  [Figure: Sam’s mother giving Sam an apple]

Ref. SALSA HPC Group at Community Grids Labs
One day
 Sam thought of “drinking” the apple.

                      He used a knife to cut the apple
                      and a blender to make juice.
Next Day
 Sam applied his invention to all the fruits he
 could find in the fruit basket

   (map ‘( <fruits> ))  ->  (a, <juice>, o, <juice>, p, <juice>, …)  ->  reduce

 A list of values mapped into another list of values,
 which gets reduced into a single value:
 the Classical Notion of MapReduce in Functional Programming
18 Years Later
 Sam got his first job in JuiceRUs for his talent in
 making juice
                                    Wait!
  Now, it’s not just one basket but a whole container
  of fruits (large data), and they produce a list of
  juice types separately (a list of values for output).

  But Sam had just ONE knife and ONE blender: NOT ENOUGH !!
Brave Sam
 Implemented a parallel version of his innovation

   [Figure: several (fruit, juice) lists such as (a, <juice>, o, <juice>, p, <juice>, …) produced in parallel]

   Grouped by key: each input to a reduce is a key, value-list pair
   (possibly a list of these, depending on the grouping/hashing mechanism),
   e.g. a, ( <juices> … )

   Reduced into a list of values
Brave Sam
 Implemented a parallel version of his innovation

   A list of key, value pairs mapped into another list of
   key, value pairs, which gets grouped by the key and
   reduced into a list of values:
   the idea of MapReduce in Data Intensive Computing
Word Count
• map(String input_key, String input_value):
  // input_key: document name
  // input_value: document contents
  for each word w in input_value:
        EmitIntermediate(w, 1);

• reduce(String output_key, Iterator intermediate_values):
  //output_key: a word
  // output_values: a list of counts
  int result = 0;
  for each v in intermediate_values:
        result += ParseInt(v);
  Emit(output_key, AsString(result));
Word Count: Example
 Input:      a rose is a rose is a rose is a rose

 Map output: (a,1) (rose,1) (is,1) (a,1) (rose,1) (is,1) (a,1) (rose,1) (is,1) (a,1) (rose,1)
 Grouped:    a -> [1,1,1,1]     rose -> [1,1,1,1]     is -> [1,1,1]
 Reduced:    (a,4) (rose,4) (is,3)
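
The same job expressed as runnable Python against the toy run_mapreduce helper sketched earlier (illustrative only; these are not Hadoop APIs):

  def wc_map(doc_name, doc_contents):
      # Emit (word, 1) for every word in the document.
      for word in doc_contents.split():
          yield word, 1

  def wc_reduce(word, counts):
      # Sum the counts collected for each word.
      yield word, sum(counts)

  docs = [("doc1", "a rose is a rose is a rose is a rose")]
  print(run_mapreduce(docs, wc_map, wc_reduce))
  # [('a', 4), ('rose', 4), ('is', 3)]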
Demo Time

Let's have some fun ☺
rediff uses MapReduce for..
 Web crawling and indexing
 Web data mining
 - Reverse web-link graph
 - ngram database
 - Anchor text analysis
 Mining usage logs
 - Related queries
 - Search  Suggest
 - Query classification
Reverse Web-link Graph
Key: http://www.rediff.com/news
Values:
fromUrl: http://www.rediff.com anchor: news
fromUrl: http://en.wikipedia.org/wiki/Rediff.com
   anchor: rediff news
   anchor: rediff headlines
fromUrl: http://www.alexa.com/siteinfo/rediff.com
   anchor: rediff.com
…….
Web Graph: MapReduce
• map(String input_key, String input_value):
  // input_key: from-url
  // input_value: document contents

  for each outlink x in input_value: // parsed data
       to-url = x.url        // outgoing link
       anchor = x.anchor     // clickable text
       from-url = input_key
       EmitIntermediate(to-url, (from-url, anchor));
Web Graph: MapReduce
• reduce(String output_key, Iterator
  intermediate_values):
  // output_key: a to-url
  // intermediate_values: a list of InLinks,
  // i.e. (from-url, anchor) pairs

  result = new InLinks( )
  for each v in intermediate_values:
       result.add(v.url, v.anchor)
  Emit(output_key, result);
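
A compact sketch of the same job in Python, using the toy helper from earlier; parse_outlinks is a hypothetical stand-in for the crawl parser implied by the slides:

  def webgraph_map(from_url, doc_contents):
      # Invert each outgoing link: key by the target URL.
      for to_url, anchor in parse_outlinks(doc_contents):   # assumed parser
          yield to_url, (from_url, anchor)

  def webgraph_reduce(to_url, inlinks):
      # inlinks is the list of (from-url, anchor) pairs pointing at to_url.
      yield to_url, list(inlinks)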
Navigational Search
Anchor text mining
 Input: Web Graph
 Output: ranked set of anchors.
Anchor text mining: MapReduce
   map(key,value)
Key: to-url; value: Inlinks
for each inlink ‘i’ in value:
  for each n-gram ‘ng’ in i.anchor:
      score = calc_rank(ng)
      emit( (to-url, ng), score )
Anchor text mining: MapReduce
   reduce(key,values)
Key: (to-url, ng) pair; values: an iterator over scores

agg_score = 0
for each score ‘s’ in values:
  agg_score = agg_score + s
emit( (to-url, ng), agg_score )
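
The same two phases as a Python sketch (toy helper again); ngrams and calc_rank are assumed helpers, and (to-url, n-gram) is modelled as a composite key so the reducer aggregates one score per anchor n-gram:

  def anchor_map(to_url, inlinks):
      for from_url, anchor in inlinks:
          for ng in ngrams(anchor):               # assumed n-gram generator
              yield (to_url, ng), calc_rank(ng)   # assumed scoring function

  def anchor_reduce(key, scores):
      to_url, ng = key
      yield (to_url, ng), sum(scores)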
Hadoop
An open-source implementation of MapReduce
Hadoop
 Created by Doug Cutting
 Originated for Apache Nutch
Why “Hadoop”?
Doug Cutting: “The name my kid gave a stuffed yellow
  elephant. Short, relatively easy to spell and pronounce,
  meaningless, and not used elsewhere: those are my naming
  criteria. Kids are good at generating such.”
Implementation
 Hadoop: MapReduce APIs
 HDFS: Storage
 Mapper Interface
    map(WritableComparable key, Writable value,
    OutputCollector output, Reporter reporter)
 Reducer Interface
    reduce(WritableComparable key, Iterator values,
    OutputCollector output, Reporter reporter)
 Programmers just have to override these
 methods, which makes life easier!
 The framework takes care of splitting the work, data flow,
 execution, handling failures, and so on.
Data flow
Map
Reduce
Driver Method
Combiner
 Performs local aggregation of the
 intermediate outputs.
 Cuts down the amount of data transferred
 from the Mapper to the Reducer (a sketch follows below).

   Map task 1 output: a,1  rose,1  is,1  a,1
     combined locally into: (a,2) (rose,1) (is,1)
   Map task 2 output: rose,1  is,1  a,1  rose,1
     combined locally into: (rose,2) (is,1) (a,1)
   Final reduce output: a,3  rose,3  is,2
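
In the toy model, a combiner is just the reduce function applied locally to each map task's output before the shuffle; a sketch reusing defaultdict and wc_reduce from the earlier sketches:

  def with_combiner(map_output, combiner):
      local = defaultdict(list)
      for key, value in map_output:               # group this map task's output
          local[key].append(value)
      combined = []
      for key, values in local.items():           # run the combiner per key
          combined.extend(combiner(key, values))
      return combined                             # far fewer pairs cross the network

  map1 = [("a", 1), ("rose", 1), ("is", 1), ("a", 1)]
  print(with_combiner(map1, wc_reduce))           # [('a', 2), ('rose', 1), ('is', 1)]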
Variations
 Identity Reducer
 - Zero reduce tasks
 - Examples:
      “Cleaning web link graph”
      “Populating HDFS from other data sources”
 - Map does the job and writes the output to HDFS.
 MapReduce Chain
 - For problems that are not solvable by a single map and
 reduce phase.
 - A series of map and reduce functions is defined.
 - Output of the previous job goes as input to the next job
 (a toy chaining sketch follows this slide).
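
A toy illustration of chaining with the earlier run_mapreduce helper: the output list of one job simply becomes the input of the next (on a real cluster the hand-off happens through files in HDFS). The second job here, which groups words by their count, is only an example and not from the talk:

  def invert_map(word, count):
      yield count, word                     # re-key by count

  def invert_reduce(count, words):
      yield count, sorted(words)

  stage1 = run_mapreduce(docs, wc_map, wc_reduce)           # (word, count) pairs
  stage2 = run_mapreduce(stage1, invert_map, invert_reduce)
  print(stage2)                                             # [(4, ['a', 'rose']), (3, ['is'])]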
Streaming
  Allows you to write map/reduce in any
  programming language.
  Ex. Python, C++, Perl, bash
  I/O is represented textually:
  read from stdin and written to stdout as
  tab-separated key/value pairs.
  Format: key \t value \n
$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
    -input myInputDirs \
    -output myOutputDir \
    -mapper myPythonMapper.py \
    -reducer myPythonReducer.py
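
A minimal sketch of what the two Python scripts referenced above could look like (word count again; the details are assumptions, not from the talk):

  # myPythonMapper.py -- read lines from stdin, write tab-separated (word, 1) pairs.
  import sys
  for line in sys.stdin:
      for word in line.split():
          print("%s\t%d" % (word, 1))

  # myPythonReducer.py -- streaming input arrives sorted by key, so counts can be
  # accumulated until the key changes.
  import sys
  current, total = None, 0
  for line in sys.stdin:
      word, count = line.rstrip("\n").split("\t", 1)
      if word == current:
          total += int(count)
      else:
          if current is not None:
              print("%s\t%d" % (current, total))
          current, total = word, int(count)
  if current is not None:
      print("%s\t%d" % (current, total))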
Pipes
 An API that provides strong coupling
 between C++ code and Hadoop.
 Improved performance over streaming.
 Key and value pairs are STL strings.
 APIs: getInputKey(), getInputValue()
  bin/hadoop pipes -input inputPath -output outputPath -program path/to/pipes/program/executable
Hadoop Distributed File System
            (HDFS)
HDFS design principles
 Handling hardware failures
 Streaming data access
 Storing very large files
 Running on clusters of commodity hardware
 Simple coherency model
 Data locality
 Portability
HDFS Architecture
HDFS Operation (Read)
HDFS operation (Write)
HDFS Robustness
 NameNode failure, DataNode failure,
 and network partitions
 Heartbeats and Re-replication
 Cluster Rebalancing
 Data Integrity: checksum
 Metadata disk failure: FsImage, Editlog
 Snapshots
Anatomy of Hadoop MapReduce
      Job run on HDFS
Map/Reduce Processes
 Launching Application
 - User application code
 - Submits a specific kind of Map/Reduce job
 JobTracker
 - Handles all jobs
 - Makes all scheduling decisions
 TaskTracker
 - Manager for all tasks on a given node
 Task
 - Runs an individual map or reduce fragment
 - Forks from the TaskTracker
Process Diagram
Job Control Flow
 Application launcher creates and submits job.
 JobTracker initializes job, creates FileSplits, and
 adds tasks to queue.
 TaskTrackers ask for a new map or reduce task
 every 10 seconds or when the previous task
 finishes.
 As tasks run, the TaskTracker reports status to
 the JobTracker every 10 seconds.
 Application launcher stops waiting when the job
 completes.
Hadoop Map/Reduce Job Admin.
Progress of reduce phase
HDFS
Hadoop Benchmarking
Jim Gray’s Sort Benchmark
 Started by Jim Gray at Microsoft in 1998
 Currently managed by 3 of the previous
 winners
 Sorting varying numbers of 100-byte records
 - 10-byte key
 - 90-byte value
 Multiple variants:
   Minute Sort: sort must finish in < 60.0 secs
   Terabyte Sort: sort 10^12 bytes
   Gray Sort: >= 10^14 bytes and >= 1 hour
Hadoop won Terabyte Sort ☺
 Hadoop won this in 2008
 Took 209 seconds to complete
 910 nodes, 1800 maps and 1800 reduces
 2 quad-core Xeons @ 2.0 GHz per node
 8 GB RAM per node
Terabyte Sort Task Timeline
Further stats.

Bytes     Nodes   Maps     Reduces   Replication   Time
500 GB    1406    8000     2600      1             59 s
1 TB      1460    8000     2700      1             62 s
100 TB    3452    190000   10000     2             173 m
1000 TB   3658    80000    20000     2             975 m
Petabyte Sort Task Timeline
Notes on Petabyte Sort
 80,000 maps and 20,000 reduces
 Each node ran 2 maps and 2 reduces at a
 time
 Tail of maps was 100 minutes
 Tail of reduces was 80 minutes
 - caused by one slow node
 Used speculative execution
 The “waste” tasks at the end are mostly
 speculative execution
Cloud Computing & Elastic MapReduce
Impact of Cloud
Definition & Characteristics
     “A pool of highly scalable, abstracted infrastructure,
        capable of hosting end-customer applications,
                that is billed by consumption”

Characteristics:
  Dynamic computing infrastructure
  Service-centric approach
  Self service based usage model
  Minimally or self-managed platform
  Consumption based billing
Amazon web services (AWS)
 Elastic Compute Cloud (EC2)
 Elastic MapReduce
 Simple Storage Service (S3)
 Elastic Block Storage
 Elastic Load Balancing
 Amazon CloudWatch
Elastic MapReduce (EMR)
 Automatically spins up a Hadoop implementation
 of the MapReduce framework on an EC2 cluster.
 Sub-divides data in a job flow into smaller
 chunks so that they can be processed (the “map”
 function) in parallel.
 Recombines the processed data into the final
 solution (the “reduce” function).
 S3 is the source and destination of the input and
 output data, respectively.
 Easy-to-use console (and API) for launching jobs with
 dynamic configuration (a programmatic sketch follows).
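
A rough sketch of launching a streaming step on EMR from Python with the modern boto3 SDK (not part of the original talk); bucket names, roles, release label, and instance settings below are placeholders and assumptions:

  import boto3

  emr = boto3.client("emr", region_name="us-east-1")
  response = emr.run_job_flow(
      Name="wordcount-demo",
      ReleaseLabel="emr-6.15.0",               # assumed release label
      LogUri="s3://my-bucket/logs/",           # hypothetical bucket
      Instances={
          "MasterInstanceType": "m5.xlarge",
          "SlaveInstanceType": "m5.xlarge",
          "InstanceCount": 3,
          "KeepJobFlowAliveWhenNoSteps": False,
      },
      Steps=[{
          "Name": "streaming word count",
          "ActionOnFailure": "TERMINATE_CLUSTER",
          "HadoopJarStep": {
              "Jar": "command-runner.jar",
              "Args": [
                  "hadoop-streaming",
                  "-files", "s3://my-bucket/myPythonMapper.py,s3://my-bucket/myPythonReducer.py",
                  "-mapper", "myPythonMapper.py",
                  "-reducer", "myPythonReducer.py",
                  "-input", "s3://my-bucket/input/",
                  "-output", "s3://my-bucket/output/",
              ],
          },
      }],
      JobFlowRole="EMR_EC2_DefaultRole",        # default EMR roles (assumption)
      ServiceRole="EMR_DefaultRole",
  )
  print(response["JobFlowId"])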
BigTable
Motivation
 Lots of (semi-)structured data
  – URLs:
      • Contents, crawl metadata, links, anchors,
        pagerank, …
  – Per-user Data:
      • User preference settings, recent queries/search
        results, …
  – Geographic locations:
      • Physical entities (shops, restaurants, etc.), roads,
        satellite image data, …
 Scale is large
  – Billions of URLs, many versions/page(~20K/version)
  – Hundreds of millions of users, thousands of q/sec
  – 100TB+ of satellite image data
Why not just use commercial DB?

 Scale is too large for most commercial databases
 Even if it weren’t, cost would be very high
  – Building internally means system can be
    applied across many projects for low
    incremental cost
 Low-level storage optimizations help
 performance significantly
  – Much harder to do when running on top of a
    database layer

  – Also fun and challenging to build large-scale
    systems ☺
Goals
 Want asynchronous processes to be
 continuously updating different pieces of data
  – Want access to most current data at any time
 Need to support
  – Very high read/write rates (millions of ops per
    second)
  – Efficient scans over all or interesting subsets
    of data
 Often want to examine data changes over time
  – E.g. Contents of a web page over multiple
    crawls
BigTable
 Distributed multi-level map
 – With an interesting data model
 Fault-tolerant, persistent
 Scalable
 –   Thousands of servers
 –   Terabytes of in-memory data
 –   Petabytes of disk-based data
 –   Millions of reads/writes per second, efficient
     scans
 Self-managing
 – Servers can be added/removed dynamically
 – Servers adjust to load imbalance
HBase & Hypertable
 Use a data model similar to BigTable:
 a sparse, distributed, persistent,
 multi-dimensional sorted map
 Map is indexed by
 - row key
 - column key
 - timestamp
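
A toy Python illustration (not HBase or Hypertable code) of that map: every cell is addressed by (row key, column key, timestamp), and the row/column names below are made up for the example:

  table = {}

  def put(row, column, timestamp, value):
      table[(row, column, timestamp)] = value

  def get_latest(row, column):
      # Return the newest version of this cell, if any.
      versions = [(ts, v) for (r, c, ts), v in table.items()
                  if r == row and c == column]
      return max(versions)[1] if versions else None

  put("com.rediff.www/news", "anchor:rediff.com", 1, "news")
  put("com.rediff.www/news", "anchor:rediff.com", 2, "rediff news")
  print(get_latest("com.rediff.www/news", "anchor:rediff.com"))   # "rediff news"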
Table: Visual representation




                           hypertable.org
Table: Actual Representation




                           hypertable.org
System Overview




                  hypertable.org
Range Server
 Manages ranges of table data
 Caches updates in memory (CellCache)
 Periodically spills (compacts) cached updates to
 disk (CellStore)




                                         hypertable.org
Master
 Single Master (hot standbys)
 Directs meta operations
 – CREATE TABLE
 – DROP TABLE
 – ALTER TABLE
 Handles recovery of RangeServer
 Manages RangeServer Load Balancing
 Client data does not move through Master

                                hypertable.org
Hyperspace
 Chubby equivalent
 – Distributed Lock Manager
 – Filesystem for storing small amounts of
   metadata
 – Highly available
 “Root of distributed data structures”


                             hypertable.org
Optimizations
 Compression: Cell Store blocks are compressed
 Caching: Block Cache & Query Cache
 Bloom Filter: Indicates if key is not present
 Access Groups: minimizing I/O by locality
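
A minimal Bloom-filter sketch in Python (illustrative, not the Hypertable implementation): k hash functions set k bits per key, and a lookup that finds any unset bit proves the key is absent, so the CellStore read can be skipped:

  import hashlib

  class BloomFilter:
      def __init__(self, m_bits=1024, k_hashes=3):
          self.m, self.k = m_bits, k_hashes
          self.bits = bytearray(m_bits)

      def _positions(self, key):
          for i in range(self.k):
              digest = hashlib.md5(("%d:%s" % (i, key)).encode()).hexdigest()
              yield int(digest, 16) % self.m

      def add(self, key):
          for pos in self._positions(key):
              self.bits[pos] = 1

      def might_contain(self, key):
          return all(self.bits[pos] for pos in self._positions(key))

  bf = BloomFilter()
  bf.add("row123")
  print(bf.might_contain("row123"))    # True
  print(bf.might_contain("row999"))    # almost certainly False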
Q & A
Thanks Much !
References
 Jeffrey Dean and Sanjay Ghemawat, MapReduce: Simplified
 Data Processing on Large Clusters
 SALSA HPC Group at Community Grids Labs
 http://code.google.com/edu/parallel/mapreduce-tutorial.html
 http://developer.yahoo.net/blogs/hadoop/2009/05/hadoop_sorts_a_petabyte_in_162.html
 http://hadoop.apache.org/
 http://aws.amazon.com
 http://www.emc.com
 http://pingdom.com/
