AJUG April 2011


My presentation on MapReduce, Hadoop and Cascading from the April 2011 Atlanta Java Users Group


  1. Introduction to MapReduce
     Christopher Curtin
  2. About Me
     20+ years in Technology
     Background in Factory Automation, Warehouse Management and Food Safety system development before Silverpop
     CTO of Silverpop
     Silverpop is a leading marketing automation and email marketing company
  3. Contrived Example
  4. What is MapReduce
     “MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key.”
     http://labs.google.com/papers/mapreduce.html
  5. Back to the example
     I need to know:
     Number of M&Ms of each color
     Average weight of each color
     Average width of each color
  6.
  7. Traditional approach
     Initialize data structure
     Read CSV
     Split each row into parts
     Find color in data structure
     Increment count, add width, weight
     Write final result
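The traditional approach can be sketched in plain Java as below. This is only an illustration: the `color,weight,width` column layout, the class name `TraditionalMnmStats` and the sample rows are assumptions, since the original CSV is not shown in the slides.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class TraditionalMnmStats {
    /** Running totals for one color. */
    static final class Totals {
        long count;
        double weightSum;
        double widthSum;

        double averageWeight() { return weightSum / count; }
        double averageWidth() { return widthSum / count; }
    }

    /** Accumulates totals per color from rows shaped "color,weight,width" (assumed layout). */
    static Map<String, Totals> summarize(Iterable<String> rows) {
        Map<String, Totals> byColor = new TreeMap<>();            // initialize data structure
        for (String row : rows) {                                 // rows as read from the CSV
            if (row.trim().isEmpty()) {
                continue;                                         // skip blank lines
            }
            String[] parts = row.split(",");                      // split each row into parts
            Totals t = byColor.computeIfAbsent(parts[0].trim(), c -> new Totals()); // find color
            t.count++;                                            // increment count
            t.weightSum += Double.parseDouble(parts[1].trim());   // add weight
            t.widthSum += Double.parseDouble(parts[2].trim());    // add width
        }
        return byColor;
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList("red,0.9,1.0", "blue,1.1,1.2", "red,1.1,1.4");
        summarize(sample).forEach((color, t) -> System.out.printf(  // write final result
                "%s count=%d avgWeight=%.2f avgWidth=%.2f%n",
                color, t.count, t.averageWeight(), t.averageWidth()));
    }
}
```

Everything runs on one thread against one in-memory map, which is the limitation the following slides pick up on.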
  8. ASSume with me
     Determining weight is a CPU-intensive step
     8-core machine
     5,000,000,000 pieces per shift to process
     Files ‘rotated’ hourly
  9. Thread It!
     Write logic to start multiple threads, pass each one a row (or 1000 rows) to evaluate
  10. Issues with threading
      Have to write coordination logic
      Locking of the color data structure
      Disk/Network I/O becomes next bottleneck
      As volume increases, cost of CPUs/Disks isn’t linear
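A rough sketch of the threaded version shows where the coordination and locking come from. The class name `ThreadedMnmStats`, the chunking scheme and the `color,weight,width` row layout are assumptions made for illustration; it uses only `java.util.concurrent` from the JDK.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class ThreadedMnmStats {
    /** Running totals for one color; only mutated inside ConcurrentHashMap.compute. */
    static final class Totals {
        long count;
        double weightSum;
        double widthSum;
    }

    static Map<String, Totals> summarize(List<String> rows, int threads, int chunkSize) {
        Map<String, Totals> byColor = new ConcurrentHashMap<>();   // shared, so updates must be guarded
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            // Coordination logic: hand each worker a chunk of rows and remember its Future.
            List<Future<?>> pending = new ArrayList<>();
            for (int start = 0; start < rows.size(); start += chunkSize) {
                List<String> chunk = rows.subList(start, Math.min(start + chunkSize, rows.size()));
                pending.add(pool.submit(() -> {
                    for (String row : chunk) {
                        String[] parts = row.split(",");
                        double weight = Double.parseDouble(parts[1].trim());
                        double width = Double.parseDouble(parts[2].trim());
                        // Locking: compute() applies the update atomically for this key.
                        byColor.compute(parts[0].trim(), (color, t) -> {
                            if (t == null) {
                                t = new Totals();
                            }
                            t.count++;
                            t.weightSum += weight;
                            t.widthSum += width;
                            return t;
                        });
                    }
                }));
            }
            // More coordination: wait for every worker before reading the totals.
            for (Future<?> f : pending) {
                f.get();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for workers", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("worker failed", e.getCause());
        } finally {
            pool.shutdown();
        }
        return byColor;
    }
}
```

Most of the code is now about chunking, waiting and guarding the shared map rather than about M&Ms, and it still only scales to the cores and disks of a single machine.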
  11. Ideas to solve these problems?
      Put it in a database
      Multiple machines, each processes a file
  12. MapReduce
      Map
      Parse the data into name/value pairs
      Can be fast or expensive
      Reduce
      Collect the name/value pairs and perform function on each ‘name’
      Framework makes sure you get all the distinct ‘names’ and only one per invocation
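The map and reduce roles can be illustrated with a small in-memory sketch. This is not Hadoop code: `MiniMapReduce`, its `run` method and the `color,weight,...` row layout are assumptions, and `run` merely stands in for the grouping a real framework does across machines.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class MiniMapReduce {
    /** Map: parse one row into a name/value pair, here color -> weight. */
    static Map.Entry<String, Double> map(String row) {
        String[] parts = row.split(",");
        return new AbstractMap.SimpleEntry<>(parts[0].trim(), Double.parseDouble(parts[1].trim()));
    }

    /** Reduce: receives one 'name' and all of its values, returns the average. */
    static double reduce(String color, List<Double> weights) {
        double sum = 0;
        for (double w : weights) {
            sum += w;
        }
        return sum / weights.size();
    }

    /** Stand-in for the framework: map every row, group by name, reduce once per distinct name. */
    static Map<String, Double> run(List<String> rows) {
        Map<String, List<Double>> grouped = new TreeMap<>();
        for (String row : rows) {
            Map.Entry<String, Double> pair = map(row);
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        Map<String, Double> averages = new TreeMap<>();
        for (Map.Entry<String, List<Double>> group : grouped.entrySet()) {
            averages.put(group.getKey(), reduce(group.getKey(), group.getValue()));
        }
        return averages;
    }

    public static void main(String[] args) {
        System.out.println(run(Arrays.asList("red,0.9,1.0", "blue,1.1,1.2", "red,1.1,1.4")));
    }
}
```

Neither `map` nor `reduce` touches shared state, which is what lets a framework run many copies of each in parallel.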
  13. Distributed File System
      System takes the files and makes copies across all the machines in the cluster
      Often files are broken apart and spread around
  14. Move processing to the data!
      Rather than copying files to the processes, push the application to the machine where the data lives!
      System pushes jar files and launches JVMs to process
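The idea of spreading file blocks around and then sending work to a machine that already holds the block can be shown with a toy model. Everything here is hypothetical: `LocalityDemo`, the round-robin placement and the least-loaded choice are simplifications for illustration and do not describe how HDFS or Hadoop's scheduler actually decide.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class LocalityDemo {
    /** Toy placement: block i gets `replicas` copies on consecutive nodes (round-robin). */
    static Map<Integer, List<String>> placeBlocks(int blockCount, List<String> nodes, int replicas) {
        if (replicas > nodes.size()) {
            throw new IllegalArgumentException("need at least as many nodes as replicas");
        }
        Map<Integer, List<String>> placement = new TreeMap<>();
        for (int block = 0; block < blockCount; block++) {
            List<String> holders = new ArrayList<>();
            for (int r = 0; r < replicas; r++) {
                holders.add(nodes.get((block + r) % nodes.size()));
            }
            placement.put(block, holders);
        }
        return placement;
    }

    /** Toy scheduling: run each block's task on the least-loaded node that already holds the block. */
    static Map<Integer, String> assignTasks(Map<Integer, List<String>> placement) {
        Map<String, Integer> load = new HashMap<>();
        Map<Integer, String> assignment = new TreeMap<>();
        for (Map.Entry<Integer, List<String>> entry : placement.entrySet()) {
            String best = null;
            for (String node : entry.getValue()) {
                if (best == null || load.getOrDefault(node, 0) < load.getOrDefault(best, 0)) {
                    best = node;
                }
            }
            load.merge(best, 1, Integer::sum);   // the task goes to the data, not the reverse
            assignment.put(entry.getKey(), best);
        }
        return assignment;
    }

    public static void main(String[] args) {
        List<String> nodes = Arrays.asList("node-a", "node-b", "node-c", "node-d");
        Map<Integer, List<String>> placement = placeBlocks(6, nodes, 3);
        System.out.println(placement);
        System.out.println(assignTasks(placement));
    }
}
```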
  15. Runtime Distribution © Concurrent 2009
  16. Hadoop
      Apache’s MapReduce implementation
      Lots of third-party support
      Yahoo
      Cloudera
      Others announcing almost daily
  17. Example
  18. Issues with example
      /ajug/output can’t exist!
      What’s with all the ‘Writable’ classes?
      Data Structures have a lot of coding overhead
      What if I want to do multiple things off the source?
      What if I want to do something after the Reduce?
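The example code from slide 17 did not survive in this transcript, so the following sketch only illustrates the kind of overhead a ‘Writable’ data structure carries. To keep it runnable without Hadoop on the classpath, it declares a local `Writable` interface with the same two methods as Hadoop's `org.apache.hadoop.io.Writable`; the `MnmMeasurement` class and `roundTrip` helper are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

final class WritableSketch {
    /** Local stand-in mirroring the two methods of Hadoop's Writable interface. */
    interface Writable {
        void write(DataOutput out) throws IOException;
        void readFields(DataInput in) throws IOException;
    }

    /** A two-field value object; every field must be written and read by hand, in the same order. */
    static final class MnmMeasurement implements Writable {
        double weight;
        double width;

        MnmMeasurement() { }                       // no-arg constructor so an empty instance can be filled in

        MnmMeasurement(double weight, double width) {
            this.weight = weight;
            this.width = width;
        }

        @Override
        public void write(DataOutput out) throws IOException {
            out.writeDouble(weight);
            out.writeDouble(width);
        }

        @Override
        public void readFields(DataInput in) throws IOException {
            weight = in.readDouble();
            width = in.readDouble();
        }
    }

    /** Serializes a value to bytes and reads it back, as would happen between map and reduce. */
    static MnmMeasurement roundTrip(MnmMeasurement original) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            original.write(new DataOutputStream(bytes));
            MnmMeasurement copy = new MnmMeasurement();
            copy.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
            return copy;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Two doubles already need a constructor pair and two hand-written serialization methods, which is the coding overhead the slide complains about.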
  19. Cascading
      Layer on top of Hadoop
      Introduces Pipes to abstract when mappers or reducers are needed
      Can easily string together logic steps
      No need to think about when to map, when to reduce
      No need for intermediate data structures
  20. Sample Example in Cascading
  21. Multiple Output example in Cascading
  22. Unit testing
      Kind of hard without some upfront thought
      Separate business logic from Hadoop/Cascading-specific parts
      Try to use domain objects or primitives in business logic, not Tuples or Hadoop structures
      Cascading has a nice testing framework to implement
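One way to apply the separation advice is sketched below, assuming the M&M example from earlier slides. `ColorStatsLogic`, `ColorStats` and `addRow` are hypothetical names; the point is that the business logic takes primitives only, so it can be tested without a cluster, and a thin adapter is the only code that knows the record layout.

```java
final class ColorStatsLogic {
    /** Pure business logic: primitives in, primitives out, no Hadoop or Cascading types. */
    static final class ColorStats {
        private long count;
        private double weightSum;
        private double widthSum;

        void add(double weight, double width) {
            count++;
            weightSum += weight;
            widthSum += width;
        }

        long count() { return count; }
        double averageWeight() { return weightSum / count; }
        double averageWidth() { return widthSum / count; }
    }

    /**
     * Thin adapter: the only place that knows the record layout (assumed "color,weight,width").
     * In a real job this is where a Tuple or Writable would be unpacked.
     */
    static void addRow(ColorStats stats, String[] fields) {
        stats.add(Double.parseDouble(fields[1].trim()), Double.parseDouble(fields[2].trim()));
    }

    public static void main(String[] args) {
        ColorStats red = new ColorStats();
        addRow(red, new String[] {"red", "0.9", "1.0"});
        addRow(red, new String[] {"red", "1.1", "1.4"});
        System.out.printf("count=%d avgWeight=%.2f avgWidth=%.2f%n",
                red.count(), red.averageWeight(), red.averageWidth());
    }
}
```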
  23. Other testing
      Known sets of data are critical at volume
  24. Common Use Cases
      Evaluation of large volumes of data at a regular frequency
      Algorithms that take a single pass through the data
      Sensor data, log files, web analytics, transactional data
      First pass ‘what is going on’ evaluation before building/paying for ‘real’ reports
  25. Things it is not good for
      Ad-hoc queries (though there are some tools on top of Hadoop to help)
      Fast/real-time evaluations
      OLTP
      Well-known analysis may be better off in a data warehouse
  26. Issues to watch out for
      Lots of small files
      Default scheduler is pretty poor
      Users need shell-level access?!?
  27. Getting started
      Download latest from Cloudera or Apache
      Set up a local-only cluster (really easy to do)
      Download Cascading
      Optionally download Karmasphere if using Eclipse (http://www.karmasphere.com/)
      Build some simple tests/apps
      Running locally is almost the same as in the cluster
  28. Elastic Map Reduce
      Amazon EC2-based Hadoop
      Define as many servers as you want
      Load the data and go
      60 CENTS per hour per machine for a decent size
  29. So ask yourself
      What could I do with 100 machines in an hour?
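Taking the 60 cents per machine-hour quoted on the Elastic Map Reduce slide at face value, the cost of that question works out as follows (this ignores storage and data transfer charges, which the slides do not mention):

```latex
\text{cost} = 100\ \text{machines} \times \$0.60\ \text{per machine-hour} \times 1\ \text{hour} = \$60
```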
  30. Ask yourself again …
      What design/architecture do I have because I didn’t have a good way to store the data?
      Or
      What have I shoved into an RDBMS because I had one?
  31. Other Solutions
      Apache Pig: http://hadoop.apache.org/pig/
      More ‘SQL-like’
      Not as easy to mix regular Java into processes
      More ‘ad hoc’ than Cascading
      Yahoo! Oozie: http://yahoo.github.com/oozie/
      Work coordination via configuration not code
      Allows integration of non-Hadoop jobs into process
  32. Resources
      Me: ccurtin@silverpop.com @ChrisCurtin
      Chris Wensel: @cwensel
      Web site: www.cascading.org, Mailing list off website
      Atlanta Hadoop Users Group: http://www.meetup.com/Atlanta-Hadoop-Users-Group/
      Cloud Computing Atlanta Meetup: http://www.meetup.com/acloud/
      O’Reilly Hadoop Book: http://oreilly.com/catalog/9780596521974/