MapReduce succinctly

Data everywhere
Problem - We are drowning in data

Hadoop’s place
Effective storage and processing of large chunks of data

Google GFS and MapReduce
• Google was dealing with large amounts of data over 10 years ago
• Documented its experience in a series of papers
• The MapReduce programming model
• Google File System
• Scalable model that was implemented in Hadoop

Disk speeds
• Processing a 10 TB file on a single machine
• Time: ~430 minutes
• Same file stored as 1 TB on each of 10 machines and read in parallel
• Time: ~43 minutes
To store data at scale you need to use multiple disks/machines
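The arithmetic behind these two figures can be sketched as follows. This is a rough model, not a benchmark: it assumes the file is split evenly, that every machine reads at the same rate, and that 1 TB = 10^6 MB. The function names are illustrative only.

```python
# Figures taken from the slide: one machine needs ~430 minutes for 10 TB.
FILE_SIZE_TB = 10
SINGLE_MACHINE_MINUTES = 430


def implied_throughput_mb_per_s(size_tb, minutes):
    """Read rate implied by processing size_tb in the given time (1 TB = 10**6 MB)."""
    return size_tb * 1_000_000 / (minutes * 60)


def parallel_read_minutes(size_tb, machines, throughput_mb_per_s):
    """Time when the file is split evenly and all machines read their share at once."""
    share_mb = size_tb * 1_000_000 / machines
    return share_mb / throughput_mb_per_s / 60


throughput = implied_throughput_mb_per_s(FILE_SIZE_TB, SINGLE_MACHINE_MINUTES)
ten_machines = parallel_read_minutes(FILE_SIZE_TB, 10, throughput)

print(f"implied read rate: {throughput:.0f} MB/s")
print(f"10 machines in parallel: {ten_machines:.0f} minutes")
```

Under these assumptions the time falls linearly with the number of machines, which is the point of the slide.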

Processor trends
• CPU speeds are not growing exponentially
• Processors take less power
• Processors are able to do more in one cycle
Product name                Intel® Core™ i7-920 Processor   Intel® Core™ i7-6700K Processor
                            (8M Cache, 2.66 GHz,            (8M Cache, up to 4.20 GHz)
                            4.80 GT/s Intel® QPI)
Code name                   Bloomfield                      Skylake
Launch date                 Q4'08                           Q3'15
Lithography                 45 nm                           14 nm
Recommended customer price  BOX: $305.00                    BOX: $350.00
# of cores                  4                               4
# of threads                8                               8
Processor base frequency    2.66 GHz                        4 GHz
Max turbo frequency         2.93 GHz                        4.2 GHz
TDP                         130 W                           91 W
Source - http://ark.intel.com/compare/88195,37147
To scale you need to use multiple CPUs/machines

Network speeds
• Gigabit Ethernet speed: 1000 Mbps
• Size: 1 TB
• Transfer time: ~2 hours
Don’t move data unless you have to
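The ~2 hour figure can be checked with a quick calculation. The sketch below assumes the link is fully used with no protocol overhead, 1 TB = 8 × 10^12 bits, and 1 Mbps = 10^6 bits per second; real transfers will be slower.

```python
def transfer_hours(size_tb, link_mbps):
    """Idealized time to move size_tb over a link of link_mbps (no overhead assumed)."""
    bits = size_tb * 8 * 10**12
    seconds = bits / (link_mbps * 10**6)
    return seconds / 3600


hours = transfer_hours(1, 1000)
print(f"1 TB over gigabit Ethernet: {hours:.1f} hours")
```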

Example scenario
• Example that we will use to understand the problem
• Data on favorite beverage
• Calculate average cups consumed per day for each beverage
Input (name, beverage, cups per day):
Brianna, coffee, 3
Cameron, milk, 5
Thomas, milk, 4
Wyatt, coffee, 5
Output (beverage, average cups per day):
coffee, 4
milk, 4.5

Example – Single Threaded
• Transform
• Group by beverage
• Summarize and display results
Sample output: Average cups consumed by tea drinkers is 3.33
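A minimal single-threaded sketch of the three steps, using the four sample records from the example scenario. The function names (`transform`, `group_by_beverage`, `summarize`) are illustrative and are not taken from the demo code.

```python
from collections import defaultdict

records = [
    "Brianna, coffee, 3",
    "Cameron, milk, 5",
    "Thomas, milk, 4",
    "Wyatt, coffee, 5",
]


def transform(line):
    """Parse one 'name, beverage, cups' record into a (beverage, cups) pair."""
    _name, beverage, cups = [part.strip() for part in line.split(",")]
    return beverage, int(cups)


def group_by_beverage(pairs):
    """Collect the cup counts reported for each beverage."""
    groups = defaultdict(list)
    for beverage, cups in pairs:
        groups[beverage].append(cups)
    return groups


def summarize(groups):
    """Average the cup counts of each beverage."""
    return {beverage: sum(cups) / len(cups) for beverage, cups in groups.items()}


averages = summarize(group_by_beverage(transform(r) for r in records))
for beverage, average in sorted(averages.items()):
    print(f"Average cups consumed by {beverage} drinkers is {average:.2f}")
```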

The problem of shared state
Can we avoid shared state?

Key idea – cooperating units
• Organize program into independent but cooperating units
• Programs need to be broken into a structure that will minimize
the need for any shared state
• Cooperating units can work in parallel without sharing resources
and cooperate as needed

Key idea – avoid shared state
Sum a large list
• Split it into three smaller lists
• Add list 1, add list 2 and add list 3 independently
• Add the partial sums and display the total
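The idea can be sketched with a thread pool: each worker adds up its own sub-list and returns a partial sum, so no worker ever touches a shared running total. The `split` helper and the choice of three parts are assumptions made for illustration. In CPython this shows the structure rather than a real speed-up, because threads do not run Python bytecode in parallel.

```python
from concurrent.futures import ThreadPoolExecutor


def split(values, parts):
    """Cut values into at most `parts` consecutive chunks of near-equal size."""
    size = -(-len(values) // parts)  # ceiling division
    return [values[i:i + size] for i in range(0, len(values), size)]


numbers = list(range(1, 1001))
chunks = split(numbers, 3)

# Each worker sums its own chunk; the only shared step is the final addition.
with ThreadPoolExecutor(max_workers=3) as pool:
    partial_sums = list(pool.map(sum, chunks))

total = sum(partial_sums)
print(partial_sums, total)
```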

How can we apply to our problem?
• Data can be split into blocks
• Each block of data can be processed by a thread
Stage 1 - input       Stage 1 - output   Stage 2 - output     Stage 3 - output
Brianna, coffee, 1    coffee, 1          coffee, {1, 3, 4}    coffee, 2.67
Cameron, milk, 5      milk, 5            milk, {5, 4}         milk, 4.5
Thomas, milk, 4       milk, 4            tea, {1, 4}          tea, 2.5
Wyatt, tea, 1         tea, 1
Victoria, coffee, 3   coffee, 3
Grace, coffee, 4      coffee, 4
David, tea, 4         tea, 4
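The three stages can be sketched as below. The split of the seven records into two blocks is an assumption made for illustration, and the stage function names are hypothetical. Only stage 1 runs on worker threads; grouping and averaging happen after all blocks are done.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

# Two blocks of input; the block boundary is arbitrary.
blocks = [
    ["Brianna, coffee, 1", "Cameron, milk, 5", "Thomas, milk, 4", "Wyatt, tea, 1"],
    ["Victoria, coffee, 3", "Grace, coffee, 4", "David, tea, 4"],
]


def stage1(block):
    """Turn every record of one block into a (beverage, cups) pair."""
    pairs = []
    for line in block:
        _name, beverage, cups = [part.strip() for part in line.split(",")]
        pairs.append((beverage, int(cups)))
    return pairs


def stage2(per_block_pairs):
    """Group the cup counts from all blocks by beverage."""
    groups = defaultdict(list)
    for pairs in per_block_pairs:
        for beverage, cups in pairs:
            groups[beverage].append(cups)
    return dict(groups)


def stage3(groups):
    """Average the cup counts of each beverage, rounded to two decimals."""
    return {b: round(sum(c) / len(c), 2) for b, c in groups.items()}


# Each block is handled by its own worker; blocks share no state.
with ThreadPoolExecutor() as pool:
    stage1_output = list(pool.map(stage1, blocks))

grouped = stage2(stage1_output)
averages = stage3(grouped)
print(grouped)
print(averages)
```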

The Akka Actor model
• Units can send and receive messages
• Mailbox
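The mailbox idea can be illustrated without Akka itself. The sketch below is not the Akka API: it uses a plain Python thread with a `queue.Queue` as the mailbox, and the class name and message shapes are invented for this example. The unit owns its totals, and other code can only influence them by sending messages.

```python
import queue
import threading


class AveragingActor:
    """Owns its state; the outside world reaches it only through the mailbox."""

    def __init__(self):
        self.mailbox = queue.Queue()
        self._totals = {}  # beverage -> (sum of cups, number of records)
        self._thread = threading.Thread(target=self._run)
        self._thread.start()

    def send(self, message):
        self.mailbox.put(message)

    def _run(self):
        # Messages are handled one at a time, in arrival order.
        while True:
            message = self.mailbox.get()
            kind = message[0]
            if kind == "cups":
                _, beverage, cups = message
                total, count = self._totals.get(beverage, (0, 0))
                self._totals[beverage] = (total + cups, count + 1)
            elif kind == "report":
                _, reply_to = message
                reply_to.put({b: t / c for b, (t, c) in self._totals.items()})
            elif kind == "stop":
                break

    def join(self):
        self._thread.join()


actor = AveragingActor()
for beverage, cups in [("coffee", 1), ("milk", 5), ("milk", 4), ("coffee", 3)]:
    actor.send(("cups", beverage, cups))

reply = queue.Queue()
actor.send(("report", reply))
averages = reply.get()  # blocks until the actor answers

actor.send(("stop",))
actor.join()
print(averages)
```

Because the mailbox is first in, first out, the report is produced only after the four earlier messages have been processed, so no lock around the totals is needed.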

Implementation structured to avoid shared state

Implementation – Take 3
MapReduce framework: sorts, groups and sends data by key [Sort/Shuffle step]

The MapReduce framework
Preparation:
• Break files into blocks that can be processed independently
• Locate and use code to read each record

Map - input           Map - output   Sort/shuffle - output   Reduce - output
Brianna, coffee, 1    coffee, 1      coffee, {1, 3, 4}       coffee, 2.67
Cameron, milk, 5      milk, 5        milk, {5, 4}            milk, 4.5
Thomas, milk, 4       milk, 4        tea, {1, 4}             tea, 2.5
Wyatt, tea, 1         tea, 1
Victoria, coffee, 3   coffee, 3
Grace, coffee, 4      coffee, 4
David, tea, 4         tea, 4
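The division of labor can be sketched with a tiny in-process stand-in for the framework. This is not Hadoop code: `map_reduce`, `cups_mapper` and `average_reducer` are hypothetical names, and everything runs in one process. The user supplies only the mapper and the reducer; the stand-in does the sort/shuffle step in between.

```python
from itertools import groupby
from operator import itemgetter


def map_reduce(records, mapper, reducer):
    """Toy framework: run the mapper, sort and group by key, then run the reducer."""
    mapped = [pair for record in records for pair in mapper(record)]
    mapped.sort(key=itemgetter(0))  # sort/shuffle: bring equal keys together
    return [
        reducer(key, [value for _key, value in group])
        for key, group in groupby(mapped, key=itemgetter(0))
    ]


def cups_mapper(line):
    """Map: emit (beverage, cups) for one input record."""
    _name, beverage, cups = [part.strip() for part in line.split(",")]
    yield beverage, int(cups)


def average_reducer(beverage, cups):
    """Reduce: average all cup counts seen for one beverage."""
    return beverage, round(sum(cups) / len(cups), 2)


records = [
    "Brianna, coffee, 1",
    "Cameron, milk, 5",
    "Thomas, milk, 4",
    "Wyatt, tea, 1",
    "Victoria, coffee, 3",
    "Grace, coffee, 4",
    "David, tea, 4",
]

result = map_reduce(records, cups_mapper, average_reducer)
print(result)
```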

Hadoop Distributed File System
• Files are split into large blocks
• Each block is stored on multiple nodes
• Namenode tracks block location
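A toy model of the bookkeeping may help. The sketch below assumes a 128 MB block size and a replication factor of 3, which are the commonly cited HDFS defaults, and places replicas round-robin. The round-robin placement and the `place_blocks` helper are simplifications invented for this example; real HDFS uses a rack-aware placement policy.

```python
BLOCK_SIZE_MB = 128  # assumed block size
REPLICATION = 3      # assumed number of copies per block


def place_blocks(file_size_mb, nodes, block_size_mb=BLOCK_SIZE_MB,
                 replication=REPLICATION):
    """Return the kind of map a namenode keeps: block index -> nodes holding a copy.

    Assumes replication <= len(nodes) so that the copies land on distinct nodes.
    """
    block_count = -(-file_size_mb // block_size_mb)  # ceiling division
    locations = {}
    for block in range(block_count):
        locations[block] = [
            nodes[(block + copy) % len(nodes)] for copy in range(replication)
        ]
    return locations


nodes = ["node1", "node2", "node3", "node4"]
block_map = place_blocks(500, nodes)
for block, holders in block_map.items():
    print(block, holders)
```

With every block held on three nodes, losing any single node still leaves two copies of each block it stored.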

Other aspects
• Framework does a lot of the heavy lifting
• Machines can fail
• Tasks can fail
• Stragglers
• Users just write the Map and Reduce functions
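One piece of that heavy lifting, rerunning a failed task, can be sketched as follows. This is only an illustration of the idea: `run_with_retries`, `max_attempts` and the simulated flaky task are hypothetical and do not reflect how Hadoop schedules or configures retries.

```python
def run_with_retries(task, max_attempts=3):
    """Run task() until it succeeds; return (result, attempts used)."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return task(), attempt
        except Exception as error:  # a real framework would be more selective
            last_error = error
    raise RuntimeError(f"task failed after {max_attempts} attempts") from last_error


def make_flaky_task(failures_before_success, result):
    """Build a task that fails a fixed number of times, then returns result."""
    calls = {"count": 0}

    def task():
        calls["count"] += 1
        if calls["count"] <= failures_before_success:
            raise IOError("simulated task failure")
        return result

    return task


value, attempts = run_with_retries(make_flaky_task(2, "coffee, 2.67"), max_attempts=4)
print(value, attempts)
```

This works cleanly because map and reduce tasks do not share state: a failed task can be rerun without undoing anything elsewhere.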

Cup count demo – Apache Hadoop
• Demo
• Program is almost identical to what we wrote
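The demo code itself is in the GitHub repository linked under Next steps and is not reproduced here. As a rough idea of what a Hadoop job for this example might look like, here is a sketch in the style of Hadoop Streaming, where the mapper and reducer exchange tab-separated text lines. It is an assumption that this resembles the demo. In a real Streaming job the two functions would live in separate scripts that read standard input and print to standard output; here they take and return lines so the flow can be run locally, with `sorted` standing in for the framework's sort/shuffle step.

```python
from itertools import groupby


def mapper(lines):
    """Emit 'beverage<TAB>cups' for every well-formed input record."""
    for line in lines:
        parts = [part.strip() for part in line.split(",")]
        if len(parts) != 3:
            continue  # skip malformed records
        _name, beverage, cups = parts
        yield f"{beverage}\t{cups}"


def reducer(sorted_lines):
    """Average the cups per beverage; input lines must already be sorted by key."""
    keyed = (line.split("\t") for line in sorted_lines)
    for beverage, group in groupby(keyed, key=lambda pair: pair[0]):
        cups = [int(value) for _key, value in group]
        yield f"{beverage}\t{sum(cups) / len(cups):.2f}"


input_lines = [
    "Brianna, coffee, 1",
    "Cameron, milk, 5",
    "Thomas, milk, 4",
    "Wyatt, tea, 1",
    "Victoria, coffee, 3",
    "Grace, coffee, 4",
    "David, tea, 4",
]

shuffled = sorted(mapper(input_lines))  # stand-in for the sort/shuffle step
output = list(reducer(shuffled))
print("\n".join(output))
```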

Next steps
• Check out sample files on GitHub - https://github.com/danjebaraj/hadoopmr
• Read Google’s papers on MapReduce and GFS (the design that HDFS is modeled on)
• http://research.google.com/archive/mapreduce.html
• http://research.google.com/archive/gfs.html
• Get familiar with Hadoop and Apache Spark
• Become familiar with functional programming
• Scala, F#, Clojure
• Check out Syncfusion’s free e-Books on related topics
• If working with Windows, check out Syncfusion’s easy-to-use Big Data Platform -
http://www.syncfusion.com/products/big-data

Related links
http://www.syncfusion.com/products/big-data
http://www.syncfusion.com/resources/techportal/ebooks

Thank you
Daniel Jebaraj
www.syncfusion.com
