Introduction to Hadoop and MapReduce
Colin Su, Tagtoo

Tagtoo internal seminar


Introduction to MapReduce & Hadoop

  1. Introduction to Hadoop and MapReduce
     Colin Su, Tagtoo
  2. Advertisement System Architecture (now)
  3. Advertisement System Architecture (future)
     • Grid
     • Ad Server
     • Data Highway
     • Stream Computing
  4. Grid
     • Core:
       • Data mining
       • Machine learning
     • Collects data from users and logs, and calculates the strategy
     • Sorts our data into a proper form so that we can use it anytime
     • Data -> Information
  5. Ad Server
     • Ranking
     • According to the “information” in Grid, decides which ad should be shown
     • Shows proper ads to website visitors
  6. Data Highway
     • Transfers your data to the proper place
  7. Stream Computing
     • Core:
       • logging
       • feedback
       • anti-cheating
       • pricing
     • Post-processes everything emitted by the Ad Server and feeds useful information back to Grid
     • Serves as the entrance to the advertisement system
  8. Hadoop
     • an open-source software framework for distributed storage and processing of large data sets
     • derived from Google’s MapReduce and Google File System (GFS) papers
     • written in Java
     • can be divided into 2 components:
       • MapReduce
       • HDFS (Hadoop Distributed File System)
     • its logo is a yellow elephant
  9. Why Hadoop?
     • moving computation is much cheaper and easier than moving data
     • “Big Data”: the amount of data has become too large, so we need an effective way to manage it
     • the computation grows with the data, too
     • high fault tolerance
     • developed in large part at Yahoo!
  10. MapReduce
      • a programming model for processing “large data sets” with a “parallel, distributed” algorithm on a cluster
      • distinct from map/reduce, the functional programming concept, though both share the same idea: “divide and conquer”
      • proposed by Google
  11. Functional “map/reduce”
      • map()/reduce() in Python
      • map(function(elem), list) -> list
      • reduce(function(elem1, elem2), list) -> single result
      • e.g.
        • map(lambda x: x*2, [1,2,3,4]) => [2,4,6,8]
        • reduce(lambda x,y: x+y, [1,2,3,4]) => 10
  12. Parallel “MapReduce” 5 Steps
      • prepare the map() input for mappers
      • mappers run the map() code -> generate intermediate pairs
      • dispatch intermediate pairs to reducers
      • reducers run the reduce() code and aggregate the results
      • prepare output from the result of reduce()
  13. Example of “MapReduce” Word Count
      • (diagram labels: map(), reduce())
  14. Example of “MapReduce” Word Count
      • Original Input
        Apple Orange Mongo
        Orange Grapes Plum
        ...
  15. Example of “MapReduce” Word Count
      • Prepare data for mappers
        Apple Orange Mongo
        Orange Grapes Plum
        ...
  16. Example of “MapReduce” Word Count
      • map() to useful record
      • Apple Orange Mongo -> (Apple, 1), (Orange, 1), (Mongo, 1)
      • each result is an intermediate key/value pair
  17. Example of “MapReduce” Word Count
      • sort and shuffle
      • (diagram: unsorted intermediate pairs such as (Apple, 1), (Mongo, 1) and (Orange, 1) are sorted by key and shuffled to reducers, so that all pairs with the same key reach the same reducer)
  18. Example of “MapReduce” Word Count
      • Reduce()
      • (Apple, 1), (Apple, 1) -> Reducer -> (Apple, 2)
      • (Orange, 1), (Orange, 1), (Orange, 1) -> Reducer -> (Orange, 3)
  19. Example of “MapReduce” Word Count
      • Generate Output
      • (Apple, 2), (Orange, 3), (Grapes, 1), (Plum, 5) -> WordCount.txt:
        Apple 2
        Orange 3
        Grapes 1
        Plum 5
  20. Hadoop Infrastructure
      • Pig: a high-level programming language for writing MapReduce jobs
      • Thrift: cross-language communication, similar to Google’s Protocol Buffers
      • ZooKeeper: cluster management
      • (diagram: Hadoop stack with HDFS, MapReduce, Pig, Thrift, ZooKeeper and other services)
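
The two calls on slide 11 are written in the Python 2 style. As a minimal sketch, assuming Python 3, the same example could look like the following; there `map()` returns a lazy iterator and `reduce()` has to be imported from `functools`. The variable names (`numbers`, `doubled`, `total`) are illustrative and not from the slides.

```python
from functools import reduce  # reduce() is no longer a builtin in Python 3

numbers = [1, 2, 3, 4]

# map() applies the function to every element; list() materializes the iterator.
doubled = list(map(lambda x: x * 2, numbers))

# reduce() folds the list pairwise into a single value: ((1 + 2) + 3) + 4.
total = reduce(lambda x, y: x + y, numbers)

print(doubled)  # [2, 4, 6, 8]
print(total)    # 10
```

Note that `map()` transforms each element independently, which is what makes it easy to spread across machines, while `reduce()` combines many values into one.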
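
The five steps on slide 12 and the word count walk-through on slides 13 to 19 can be imitated inside a single Python process. The sketch below is a toy simulation under that assumption, not Hadoop's API: the function names (`map_words`, `shuffle`, `reduce_counts`, `word_count`) are hypothetical, and only the two input lines visible on slide 14 are used, so the counts differ from the totals shown on slide 19, whose remaining input is elided.

```python
from collections import defaultdict
from functools import reduce


def map_words(line):
    """Step 2: emit one intermediate (word, 1) pair per word in the line."""
    return [(word, 1) for word in line.split()]


def shuffle(pairs):
    """Step 3: group values by key and sort by key, as if each key went to one reducer."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_counts(key, values):
    """Step 4: aggregate all values for one key into a single count."""
    return key, reduce(lambda x, y: x + y, values)


def word_count(lines):
    # Step 1: each input line stands in for the input handed to one mapper.
    intermediate = [pair for line in lines for pair in map_words(line)]
    grouped = shuffle(intermediate)
    # Step 5: collect the reducer results as the final output.
    return [reduce_counts(key, values) for key, values in grouped]


if __name__ == "__main__":
    # Input lines copied from slide 14, spelling included.
    lines = ["Apple Orange Mongo", "Orange Grapes Plum"]
    for word, count in word_count(lines):
        print(word, count)
```

In a real cluster the mappers and reducers run on different machines and the shuffle moves data over the network; here everything is an in-memory list, which keeps the data flow visible.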