MapReduce succinctly

Slides from a presentation on the MapReduce algorithm. Complete code is available from https://github.com/danjebaraj/hadoopmr.

The code is a working implementation of the MapReduce algorithm, built in stages using Scala. Work through it to gain a deeper understanding of how MapReduce works under the hood. A video recording of this presentation is available on YouTube - https://goo.gl/EDOpCp.

Published in: Software

  1. MAPREDUCE SUCCINCTLY
  2. Data everywhere
     Problem - We are drowning in data
  3. Hadoop’s place
     Effective storage and processing of large chunks of data
  4. Google GFS and MapReduce
     • Google was dealing with a large amount of data over 10 years ago
     • Documented its experience in a series of papers
       • The MapReduce programming model
       • Google File System
     • Scalable model that was implemented in Hadoop
  5. Disk speeds
     • Processing a 10 TB file
       • Time: ~430 minutes
     • Stored as 1 TB on each of 10 machines
       • Time: ~43 minutes
     To store data at scale you need to use multiple disks/machines
  6. Processor trends
     • CPU speeds are not growing exponentially
     • Processors take less power
     • Processors are able to do more in one cycle

     Product Name               | Intel® Core™ i7-920 Processor (8M Cache, 2.66 GHz, 4.80 GT/s Intel® QPI) | Intel® Core™ i7-6700K Processor (8M Cache, up to 4.20 GHz)
     Code Name                  | Bloomfield    | Skylake
     Launch Date                | Q4'08         | Q3'15
     Lithography                | 45 nm         | 14 nm
     Recommended Customer Price | BOX : $305.00 | BOX : $350.00
     # of Cores                 | 4             | 4
     # of Threads               | 8             | 8
     Processor Base Frequency   | 2.66 GHz      | 4 GHz
     Max Turbo Frequency        | 2.93 GHz      | 4.2 GHz
     TDP                        | 130 W         | 91 W

     Source - http://ark.intel.com/compare/88195,37147
     To scale you need to use multiple CPUs/machines
  7. Network speeds
     • Gigabit - Speed: 1000 Mbps
     • Size: 1 TB
     • ~2 hours
     Don’t move data unless you have to
  8. Example scenario
     • Example that we will use to understand the problem
     • Data on favorite beverage
     • Calculate average cups consumed per day for each beverage
     Input:
       Brianna, coffee, 3
       Cameron, milk, 5
       Thomas, milk, 4
       Wyatt, coffee, 5
     Output:
       coffee, 4
       milk, 4.5
  9. Example – Single Threaded
     • Transform
     • Group by beverage
     • Summarize and display results
     Sample output: Average cups consumed by tea drinkers is 3.33
  10. The problem of shared state
      Can we avoid shared state?
  11. Key idea – cooperating units
      • Organize the program into independent but cooperating units
      • Break the program into a structure that minimizes the need for shared state
      • Cooperating units can work in parallel without sharing resources and cooperate as needed
  12. Key idea – avoid shared state
      Sum large list
      • Add list 1
      • Add list 2
      • Add list 3
      Add and display sum
  13. How can we apply this to our problem?
      • Data can be split into blocks
      • Each block of data can be processed by a thread

      Stage 1 - input     | Stage 1 - output | Stage 2 - output  | Stage 3 - output
      Brianna, coffee, 1  | coffee, 1        | coffee, {1, 3, 4} | Coffee – 2.67
      Cameron, milk, 5    | milk, 5          | milk, {5, 4}      | Milk – 4.5
      Thomas, milk, 4     | milk, 4          | tea, {1, 4}       | Tea – 2.5
      Wyatt, tea, 1       | tea, 1           |                   |
      Victoria, coffee, 3 | coffee, 3        |                   |
      Grace, coffee, 4    | coffee, 4        |                   |
      David, tea, 4       | tea, 4           |                   |
  14. The Akka Actor model
      • Units can send and receive messages
      • Mailbox
  15. Implementation structured to avoid shared state
  16. Implementation – Take 2
  17. Implementation – Take 3
      MapReduce Framework: sorts, groups and sends data by key [Sort/Shuffle step]
  18. The MapReduce framework
      Preparation
      • Break files into blocks that can be processed independently
      • Locate and use code to read each record

      Map - input         | Map - output | Sort/shuffle - output | Reduce - output
      Brianna, coffee, 1  | coffee, 1    | coffee, {1, 3, 4}     | Coffee – 2.67
      Cameron, milk, 5    | milk, 5      | milk, {5, 4}          | Milk – 4.5
      Thomas, milk, 4     | milk, 4      | tea, {1, 4}           | Tea – 2.5
      Wyatt, tea, 1       | tea, 1       |                       |
      Victoria, coffee, 3 | coffee, 3    |                       |
      Grace, coffee, 4    | coffee, 4    |                       |
      David, tea, 4       | tea, 4       |                       |
  19. Hadoop Distributed File System
      • Files are split into large blocks
      • Each block is stored on multiple nodes
      • Namenode tracks block location
  20. Other aspects
      • Framework does a lot of the heavy lifting
        • Machines can fail
        • Tasks can fail
        • Stragglers
      • Users just write the Map and Reduce functions
  21. Cup count demo – Apache Hadoop
      • Demo
      • Program is almost identical to what we wrote
  22. Next steps
      • Check out sample files on GitHub - https://github.com/danjebaraj/hadoopmr
      • Read Google’s papers on MapReduce and GFS (HDFS)
        • http://research.google.com/archive/mapreduce.html
        • http://research.google.com/archive/gfs.html
      • Get familiar with Hadoop and Apache Spark
      • Become familiar with functional programming
        • Scala, F#, Clojure
      • Check out Syncfusion’s free e-Books on related topics
      • If working with Windows, check out Syncfusion’s easy-to-use Big Data Platform - http://www.syncfusion.com/products/big-data
  23. Related links
      • http://www.syncfusion.com/products/big-data
      • http://www.syncfusion.com/resources/techportal/ebooks
  24. Thank you
      Daniel Jebaraj
      www.syncfusion.com
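
The network and disk figures on slides 5 and 7 can be checked with a little arithmetic. The sketch below is not from the presentation. It assumes decimal units (1 TB = 1e12 bytes, 1000 Mbps = 1e9 bits per second), ignores protocol overhead, and assumes work divides evenly across machines. The names `TransferTime`, `seconds`, and `parallelMinutes` are made up for illustration.

```scala
object TransferTime {
  // Assumption: decimal units, so 1 TB = 1e12 bytes and 1000 Mbps = 1e9 bits per second.
  val BytesPerTerabyte: Double = 1e12
  val GigabitPerSecond: Double = 1e9

  /** Seconds needed to move `bytes` over a link, ignoring protocol overhead and latency. */
  def seconds(bytes: Double, bitsPerSecond: Double): Double =
    bytes * 8 / bitsPerSecond

  /** Minutes needed when a job is split evenly across `machines` working in parallel. */
  def parallelMinutes(singleMachineMinutes: Double, machines: Int): Double =
    singleMachineMinutes / machines

  def main(args: Array[String]): Unit = {
    val networkHours = seconds(BytesPerTerabyte, GigabitPerSecond) / 3600
    println(s"1 TB over a gigabit link: about $networkHours hours")
    println(s"10 TB read by 10 machines: about ${parallelMinutes(430, 10)} minutes")
  }
}
```

Under these assumptions 1 TB takes 8000 seconds on a gigabit link, a little over two hours, which matches the "~2 Hours" on slide 7. Dividing 430 minutes across 10 machines gives the 43 minutes on slide 5.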
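
The single-threaded approach on slide 9 (transform, group by beverage, summarize) might look roughly like the sketch below. It is not the code from the GitHub repository. `CupAverages`, `Record`, `parse`, and `averageByBeverage` are hypothetical names, and the input is the four-record sample from slide 8.

```scala
object CupAverages {
  // One survey record: who answered, which beverage, and cups per day.
  final case class Record(name: String, beverage: String, cups: Int)

  /** Transform: parse a line such as "Brianna, coffee, 3" into a Record. */
  def parse(line: String): Record =
    line.split(",").map(_.trim) match {
      case Array(name, beverage, cups) => Record(name, beverage, cups.toInt)
      case _ => throw new IllegalArgumentException(s"Malformed record: $line")
    }

  /** Group by beverage, then summarize each group as the mean cups per day. */
  def averageByBeverage(lines: Seq[String]): Map[String, Double] =
    lines
      .map(parse)
      .groupBy(_.beverage)
      .map { case (beverage, records) =>
        (beverage, records.map(_.cups).sum.toDouble / records.size)
      }

  def main(args: Array[String]): Unit = {
    val input = Seq(
      "Brianna, coffee, 3",
      "Cameron, milk, 5",
      "Thomas, milk, 4",
      "Wyatt, coffee, 5"
    )
    // Display results sorted by beverage name.
    averageByBeverage(input).toSeq.sortBy(_._1).foreach {
      case (beverage, average) => println(s"$beverage, $average")
    }
  }
}
```

For the slide 8 sample this yields an average of 4.0 for coffee and 4.5 for milk, in line with the output listed on that slide.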
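
Slide 12's idea of summing a large list without shared state can be sketched with standard-library Futures. This is an illustration only: the presentation uses Akka actors, and Futures are swapped in here to avoid an external dependency. `ParallelSum` and its `parts` parameter are hypothetical.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object ParallelSum {
  /** Sum `numbers` by splitting it into about `parts` chunks and summing each chunk in its own Future. */
  def sum(numbers: Vector[Long], parts: Int): Long = {
    require(parts > 0, "parts must be positive")
    val chunkSize = math.max(1, math.ceil(numbers.size.toDouble / parts).toInt)

    // "Add list 1", "Add list 2", ...: each Future reads only its own chunk
    // and returns a value, so no variable is shared or mutated between tasks.
    val partialSums: Vector[Future[Long]] =
      numbers.grouped(chunkSize).map(chunk => Future(chunk.sum)).toVector

    // "Add and display sum": the only coordination point is combining the results.
    Await.result(Future.sequence(partialSums), 1.minute).sum
  }

  def main(args: Array[String]): Unit =
    println(sum((1L to 1000000L).toVector, parts = 3))
}
```

Because each task returns its partial sum instead of updating a shared counter, no locks are needed. The units cooperate only when the partial results are added together.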
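
The stages shown on slides 13 and 18 (map, sort/shuffle, reduce) can be approximated in a few lines of plain Scala. This is a single-process sketch, not Hadoop's API and not the repository's code. `MiniMapReduce`, `mapRecord`, `shuffle`, `reduce`, and `run` are hypothetical names, and the shuffle step only groups by key without the sorting a real framework performs.

```scala
object MiniMapReduce {
  /** Map: turn one input record into a (key, value) pair, e.g. "Wyatt, tea, 1" becomes ("tea", 1). */
  def mapRecord(line: String): (String, Int) =
    line.split(",").map(_.trim) match {
      case Array(_, beverage, cups) => (beverage, cups.toInt)
      case _ => throw new IllegalArgumentException(s"Malformed record: $line")
    }

  /** Sort/shuffle (simplified): collect all values that share a key. */
  def shuffle(pairs: Seq[(String, Int)]): Map[String, Seq[Int]] =
    pairs.groupBy(_._1).map { case (key, group) => (key, group.map(_._2)) }

  /** Reduce: collapse the values for one key into a single result, here the mean. */
  def reduce(key: String, values: Seq[Int]): (String, Double) =
    (key, values.sum.toDouble / values.size)

  /** Run all stages over input that has already been split into blocks. */
  def run(blocks: Seq[Seq[String]]): Map[String, Double] = {
    // Each block is mapped on its own; a framework would do this in parallel.
    val mapped = blocks.flatMap(block => block.map(mapRecord))
    shuffle(mapped).map { case (key, values) => reduce(key, values) }
  }

  def main(args: Array[String]): Unit = {
    val blocks = Seq(
      Seq("Brianna, coffee, 1", "Cameron, milk, 5", "Thomas, milk, 4", "Wyatt, tea, 1"),
      Seq("Victoria, coffee, 3", "Grace, coffee, 4", "David, tea, 4")
    )
    run(blocks).toSeq.sortBy(_._1).foreach {
      case (beverage, average) => println(f"$beverage: $average%.2f")
    }
  }
}
```

Only `mapRecord` and `reduce` are specific to the cup-count problem. The rest is generic plumbing, which is the split slide 20 describes: the framework does the heavy lifting and users write the Map and Reduce functions.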
