Introduction to MapReduce & Hadoop

Introduction to Hadoop and MapReduce
Colin Su, Tagtoo

Advertisement System Architecture (now)

Advertisement System Architecture (future)
• Grid
• Ad Server
• Data Highway
• Stream Computing

Grid
• Core:
• Data mining
• Machine Learning
• Collect data from users and logs, and compute the strategy
• Sort our data into a proper form, so that we can use it at any time

Data -> Information

Ad Server
• Ranking
• According to the “information” in Grid, decide which ad should be shown
• show proper ads to website visitors

Data Highway
• Transfer your data to the proper place

Stream Computing
• Core:
• logging
• feedback
• anti-cheating
• pricing
• post-process everything emitted by Ad Server, and feed useful information back to Grid
• serve as the entry point of the advertisement system

Hadoop
• an open-source software framework for data scientists
• derives from Google’s MapReduce and Google File System (GFS) papers
• written in Java
• can be divided into 2 components:
• MapReduce
• HDFS (Hadoop Distributed File System)
• a yellow elephant

Why Hadoop?
• moving computation is much cheaper and easier than moving data
• “Big Data”: the amount of data has become too large, so we need an effective way to manage it
• so does computation
• high fault-tolerance
• developed largely at Yahoo! in its early years

MapReduce
• a programming model for processing “large data sets” with a “parallel, distributed” algorithm on a cluster
• different from map/reduce, the functional programming concept, but both share the same idea: “divide and conquer”
• proposed by Google

Functional “map/reduce”
• map()/reduce() in Python
• map(function(elem), list) -> list
• reduce(function(elem1, elem2), list) -> single result
• e.g.
• map(lambda x: x*2, [1,2,3,4]) => [2,4,6,8]
• reduce(lambda x,y: x+y, [1,2,3,4]) => 10
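The two calls above can be run as-is in Python 2. A minimal sketch for Python 3 follows, assuming a current interpreter: there, reduce() lives in functools and map() returns a lazy iterator, so the result is wrapped in list(). The variable names are illustrative only.

```python
from functools import reduce  # in Python 3, reduce() is no longer a builtin

numbers = [1, 2, 3, 4]

# map() applies the function to every element; list() materializes
# the lazy iterator that Python 3 returns.
doubled = list(map(lambda x: x * 2, numbers))

# reduce() folds the list into a single value, two elements at a time.
total = reduce(lambda x, y: x + y, numbers)

print(doubled)  # [2, 4, 6, 8]
print(total)    # 10
```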

Parallel “MapReduce” 5 Steps
• prepare the map() input for mappers
• mappers run the map() code -> generate intermediate pairs
• dispatch intermediate pairs to reducers
• reducers run the reduce() code and aggregate the results
• prepare output from the result of reduce()
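The five steps can be sketched in a single process to show how data flows between them. This is only an illustration, not how Hadoop implements it: run_mapreduce, map_fn, reduce_fn, and num_mappers are hypothetical names, and the "mappers" and "reducers" here are plain loops rather than machines in a cluster.

```python
from collections import defaultdict


def run_mapreduce(records, map_fn, reduce_fn, num_mappers=2):
    """Single-process sketch of the five MapReduce steps (hypothetical helper)."""
    # Step 1: prepare the map() input by splitting records among mappers.
    splits = [records[i::num_mappers] for i in range(num_mappers)]

    # Step 2: each mapper runs map_fn and emits intermediate (key, value) pairs.
    intermediate = []
    for split in splits:
        for record in split:
            intermediate.extend(map_fn(record))

    # Step 3: dispatch pairs so all values for one key reach the same reducer.
    grouped = defaultdict(list)
    for key, value in intermediate:
        grouped[key].append(value)

    # Step 4: reducers run reduce_fn once per key and aggregate the values.
    reduced = {key: reduce_fn(key, values) for key, values in grouped.items()}

    # Step 5: prepare the output, here as a list sorted by key.
    return sorted(reduced.items())


# Usage: sum numbers grouped by parity.
result = run_mapreduce(
    [1, 2, 3, 4, 5],
    map_fn=lambda n: [("even" if n % 2 == 0 else "odd", n)],
    reduce_fn=lambda key, values: sum(values),
)
print(result)  # [('even', 6), ('odd', 9)]
```

In a real cluster, steps 2 and 4 run in parallel on different nodes, and step 3 moves data over the network.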

Example of “MapReduce” Word Count

• Original Input
Apple Orange Mango
Orange Grapes Plum
...

• Prepare data for mappers
Apple Orange Mango
Orange Grapes Plum
...

• map() to useful record (intermediate key/value pairs)
Apple Orange Mango -> (Apple, 1), (Orange, 1), (Mango, 1)

• sort and shuffle to Reducers
unsorted: (Apple, 1), (Mango, 1), (Apple, 1), (Orange, 1), ...
sorted:
Reducer: (Apple, 1), (Apple, 1)
Reducer: (Mango, 1), (Mango, 1)
Reducer: (Orange, 1), (Orange, 1), (Orange, 1)

• reduce()
Reducer: (Apple, 1), (Apple, 1) -> (Apple, 2)
Reducer: (Orange, 1), (Orange, 1), (Orange, 1) -> (Orange, 3)

• Generate Output (WordCount.txt)
Apple 2
Orange 3
Grapes 1
Plum 5
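The word count walkthrough can be approximated in a few lines of plain Python. This is a rough sketch under stated assumptions: the input lines are made up for illustration (they are not the slide's full data set), mapper and reducer are hypothetical names, and sorting plus itertools.groupby stands in for the sort-and-shuffle phase that Hadoop performs across machines.

```python
from itertools import groupby
from operator import itemgetter


def mapper(line):
    # map(): emit an intermediate (word, 1) pair for every word in the line.
    for word in line.split():
        yield (word, 1)


def reducer(word, counts):
    # reduce(): add up all the 1s collected for a single word.
    return (word, sum(counts))


# Hypothetical input lines, one per mapper call.
lines = ["Apple Orange Plum", "Orange Grapes Plum", "Apple Orange"]

# Map phase: collect the intermediate pairs from every line.
pairs = [pair for line in lines for pair in mapper(line)]

# Sort and shuffle: sorting by key puts equal words next to each other,
# so groupby() can hand each word's pairs to one reducer call.
pairs.sort(key=itemgetter(0))
word_counts = [
    reducer(word, (count for _, count in group))
    for word, group in groupby(pairs, key=itemgetter(0))
]

# Generate output: one "word count" line per word.
for word, count in word_counts:
    print(word, count)
```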

Hadoop Infrastructure
• Pig: programming language for MapReduce
• Thrift: cross-language communication, similar to Google’s Protocol Buffers
• ZooKeeper: cluster management

[Figure: Hadoop infrastructure diagram showing HDFS, MapReduce, Pig, Thrift, ZooKeeper, and other services]
