MapReduce & Hadoop●   Juanjo Mostazo●   SeedRocket BCN●   June 2011
M/R: Motivation●   Process big amount of data toproduce other data●   Scale up VS Scale out
M/R: What is it?●   Different programming paradigm●   Based on a google paper (2004)●   Automatic parallelization and dist...
M/R: Basics●   Input & Output: set of key/value pairs●   Big amount of data group & sort●   Job = Two phases = Mapper & Re...
M/R: Example(word counter)
M/R: Workflow
Hadoop: What is it?●   Framework based on GMR / GFS●   Apache project●   Developed in Java●   Multiple applications●   Use...
Hadoop: HDFS dist. systems●   Data is sent tocomputation(whereas herecomputation goes todata)●   Nodes must get infofrom e...
Hadoop: HDFS concepts●   Distributed file system.Layer on top ext3, xfs...●   Works better on huge files●   Redundancy (de...
Hadoop: HDFS Architecture
Hadoop: Architecture v1
Hadoop: Architecture v2
Hadoop: Architecture v3
M/R: Example(word counter)
Hadoop: EcoSystem         ●   FLUME                     ●   ZOOKEEPER
Hadoop: Advanced stuff●   Distributed caches●   Partitioner●   Sort comparator●   Group comparator●   Combiner●   Input fo...
Hadoop: Conclussions●   Simplify large-scale computation●   Hide parallel programming issues●   Easy to get into & develop...
MapReduce & Hadoop●   Juanjo Mostazo●   juanj.mostazo@gmail.com
Upcoming SlideShare
Loading in...5
×

Mr hadoop seedrocket

1,212

Published on

Published in: Technology
0 Comments
3 Likes
Statistics
Notes
  • Be the first to comment

No Downloads
Views
Total Views
1,212
On Slideshare
0
From Embeds
0
Number of Embeds
0
Actions
Shares
0
Downloads
11
Comments
0
Likes
3
Embeds 0
No embeds

No notes for slide

Mr hadoop seedrocket

  1. 1. MapReduce & Hadoop● Juanjo Mostazo● SeedRocket BCN● June 2011
  2. 2. M/R: Motivation● Process big amount of data toproduce other data● Scale up VS Scale out
  3. 3. M/R: What is it?● Different programming paradigm● Based on a google paper (2004)● Automatic parallelization and distribution● I/O Scheduling● Fault tolerance● Status and monitoring
  4. 4. M/R: Basics● Input & Output: set of key/value pairs● Big amount of data group & sort● Job = Two phases = Mapper & Reducer● Map (in_key, in_value) → list(interm_key, interm_value)● Reduce (interm_key, list(interm_value)) → list (out_key, out_value)
  5. 5. M/R: Example(word counter)
  6. 6. M/R: Workflow
  7. 7. Hadoop: What is it?● Framework based on GMR / GFS● Apache project● Developed in Java● Multiple applications● Used by many companies● And growing!
  8. 8. Hadoop: HDFS dist. systems● Data is sent tocomputation(whereas herecomputation goes todata)● Nodes must get infofrom each other● Very difficult torecover from partialfailure
  9. 9. Hadoop: HDFS concepts● Distributed file system.Layer on top ext3, xfs...● Works better on huge files● Redundancy (default 3)● Bad seeking, no append!● Good rack scale. Notgood data center scale● File divided in 64Mb –128Mb blocks
  10. 10. Hadoop: HDFS Architecture
  11. 11. Hadoop: Architecture v1
  12. 12. Hadoop: Architecture v2
  13. 13. Hadoop: Architecture v3
  14. 14. M/R: Example(word counter)
  15. 15. Hadoop: EcoSystem ● FLUME ● ZOOKEEPER
  16. 16. Hadoop: Advanced stuff● Distributed caches● Partitioner● Sort comparator● Group comparator● Combiner● Input format & Record reader● MultiInput● MultiOutput● Compression (LZO)
  17. 17. Hadoop: Conclussions● Simplify large-scale computation● Hide parallel programming issues● Easy to get into & develop● Deeply used & maintained by community● Possibility yo throw away RDBMs! (Bottleneck)
  18. 18. MapReduce & Hadoop● Juanjo Mostazo● juanj.mostazo@gmail.com
  1. A particular slide catching your eye?

    Clipping is a handy way to collect important slides you want to go back to later.

×