About Me
- 20+ years in Technology
- Background in Factory Automation, Warehouse Management and Food Safety system development before Silverpop
- CTO of Silverpop
- Silverpop is a leading marketing automation and email marketing company

What is MapReduce
“MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key.”
http://labs.google.com/papers/mapreduce.html

Back to the example
I need to know:
- Number of M&Ms of each color
- Average weight of each color
- Average width of each color

Traditional approach
- Initialize data structure
- Read CSV
- Split each row into parts
- Find color in data structure
- Increment count, add width, weight
- Write final result

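The steps above can be sketched as a single-threaded pass. This is only an illustration: the CSV layout (`color,weight,width`), the sample values, and the function name `summarize` are assumptions, not part of the original talk.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample data; the column names are assumed for illustration.
SAMPLE_CSV = """color,weight,width
red,0.90,1.04
blue,0.88,1.02
red,0.92,1.06
"""

def summarize(csv_file):
    """One sequential pass: count, average weight, average width per color."""
    # Initialize the data structure: color -> running totals.
    totals = defaultdict(lambda: {"count": 0, "weight": 0.0, "width": 0.0})
    # Read the CSV and split each row into parts.
    for row in csv.DictReader(csv_file):
        entry = totals[row["color"]]      # find color in data structure
        entry["count"] += 1               # increment count
        entry["weight"] += float(row["weight"])
        entry["width"] += float(row["width"])
    # Produce the final result.
    return {
        color: {
            "count": t["count"],
            "avg_weight": t["weight"] / t["count"],
            "avg_width": t["width"] / t["count"],
        }
        for color, t in totals.items()
    }

color_stats = summarize(io.StringIO(SAMPLE_CSV))
for color, stats in sorted(color_stats.items()):
    print(color, stats)
```

Everything runs on one core and holds one shared dictionary, which is exactly what becomes a problem at the volumes discussed next.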
ASSume with me
- Determining weight is a CPU-intensive step
- 8-core machine
- 5,000,000,000 pieces per shift to process
- Files ‘rotated’ hourly

Thread It!
- Write logic to start multiple threads, pass each one a row (or 1000 rows) to evaluate

Issues with threading
- Have to write coordination logic
- Locking of the color data structure
- Disk/Network I/O becomes next bottleneck
- As volume increases, cost of CPUs/Disks isn’t linear

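A minimal sketch of what the coordination and locking look like, assuming rows are already parsed into `(color, weight, width)` tuples. The names `ROWS`, `process_chunk`, and `totals_lock` are hypothetical. Note that in standard CPython the global interpreter lock limits the speedup for CPU-bound work, so this shows the shape of the coordination code rather than a performance recipe.

```python
import threading
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

# Hypothetical pre-parsed rows: (color, weight, width).
ROWS = [("red", 0.90, 1.04), ("blue", 0.88, 1.02), ("red", 0.92, 1.06)] * 1000

# Shared color data structure: color -> [count, total weight, total width].
totals = defaultdict(lambda: [0, 0.0, 0.0])
totals_lock = threading.Lock()

def process_chunk(chunk):
    # Aggregate locally first so the lock is taken once per chunk, not per row.
    local = defaultdict(lambda: [0, 0.0, 0.0])
    for color, weight, width in chunk:
        entry = local[color]
        entry[0] += 1
        entry[1] += weight
        entry[2] += width
    # Merge into the shared structure under the lock.
    with totals_lock:
        for color, (count, weight, width) in local.items():
            shared = totals[color]
            shared[0] += count
            shared[1] += weight
            shared[2] += width

def chunks(rows, size):
    for start in range(0, len(rows), size):
        yield rows[start:start + size]

with ThreadPoolExecutor(max_workers=8) as pool:
    # list() waits for all chunks and re-raises any worker exception.
    list(pool.map(process_chunk, chunks(ROWS, 1000)))

print({color: entry[0] for color, entry in totals.items()})
```

Even in this small form, the chunking, the lock, and the merge step are all code you have to write and get right yourself, and none of it helps once a single machine's disk or network is the limit.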
Ideas to solve these problems?
- Put it in a database
- Multiple machines, each processes a file

MapReduce
- Map
  - Parse the data into name/value pairs
  - Can be fast or expensive
- Reduce
  - Collect the name/value pairs and perform function on each ‘name’
  - Framework makes sure you get all the distinct ‘names’ and only one per invocation

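To make the two roles concrete, here is a toy, in-memory sketch of the M&M example. It is plain Python, not the Hadoop API: `map_line`, `reduce_color`, and `run` are hypothetical names, and the grouping inside `run` stands in for what a real framework does across machines.

```python
from collections import defaultdict

# Hypothetical input lines in the assumed layout: color,weight,width
LINES = ["red,0.90,1.04", "blue,0.88,1.02", "red,0.92,1.06"]

def map_line(line):
    """Map: parse one record into (name, value) pairs."""
    color, weight, width = line.split(",")
    yield color, (float(weight), float(width))

def reduce_color(color, values):
    """Reduce: invoked once per distinct name with all of that name's values."""
    values = list(values)
    count = len(values)
    return color, {
        "count": count,
        "avg_weight": sum(weight for weight, _ in values) / count,
        "avg_width": sum(width for _, width in values) / count,
    }

def run(lines):
    # The 'framework' part: collect intermediate pairs and group them by name.
    grouped = defaultdict(list)
    for line in lines:
        for name, value in map_line(line):
            grouped[name].append(value)
    # Each distinct name is handed to the reducer exactly once.
    return dict(reduce_color(name, values) for name, values in grouped.items())

summary = run(LINES)
print(summary)
```

The mapper and reducer hold no shared state and no locks; the grouping step is the only place where data for one color has to come together.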
Distributed File System
- System takes the files and makes copies across all the machines in the cluster
- Often files are broken apart and spread around

Move processing to the data!
- Rather than copying files to the processes, push the application to the machine where the data lives!
- System pushes jar files and launches JVMs to process

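The idea of spreading file blocks around and then sending work to a machine that already holds a block can be sketched with a toy simulation. Everything here is an assumption for illustration: the node names, the replica count of 3, and the round-robin placement. A real distributed file system uses more sophisticated placement and scheduling.

```python
NODES = ["node-a", "node-b", "node-c", "node-d"]  # hypothetical machine names
REPLICAS = 3  # assumed number of copies per block for this sketch

def place_blocks(num_blocks, nodes, replicas):
    """Assign each block of a file to `replicas` distinct nodes, round-robin."""
    return {
        block: [nodes[(block + i) % len(nodes)] for i in range(replicas)]
        for block in range(num_blocks)
    }

def assign_tasks(placement):
    """Send each block's task to the least-loaded node that holds a copy."""
    load = {}
    assignment = {}
    for block, holders in placement.items():
        chosen = min(holders, key=lambda node: load.get(node, 0))
        assignment[block] = chosen
        load[chosen] = load.get(chosen, 0) + 1
    return assignment

placement = place_blocks(8, NODES, REPLICAS)
assignment = assign_tasks(placement)
for block in sorted(assignment):
    print(block, placement[block], "->", assignment[block])
```

Every task lands on a node that already stores its block, so only the small application is shipped over the network, not the data.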
Issues with example
- /ajug/output can’t exist!
- What’s with all the ‘Writable’ classes?
- Data Structures have a lot of coding overhead
- What if I want to do multiple things off the source?
- What if I want to do something after the Reduce?

Cascading
- Layer on top of Hadoop
- Introduces Pipes to abstract when mappers or reducers are needed
- Can easily string together logic steps
- No need to think about when to map, when to reduce
- No need for intermediate data structures

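The appeal of stringing logic steps together can be illustrated with a toy pipeline in plain Python. This is emphatically not the Cascading API; `pipeline`, `parse`, `group_by`, and `average` are hypothetical helpers that only mimic the idea of composing steps without deciding by hand which ones are maps and which are reduces.

```python
from collections import defaultdict

def parse(lines):
    """Step: turn raw lines into records (assumed layout: color,weight,width)."""
    for line in lines:
        color, weight, width = line.split(",")
        yield {"color": color, "weight": float(weight), "width": float(width)}

def group_by(field):
    """Step factory: group records by the given field."""
    def step(records):
        groups = defaultdict(list)
        for record in records:
            groups[record[field]].append(record)
        return groups.items()
    return step

def average(field):
    """Step factory: average one field within each group."""
    def step(groups):
        for key, records in groups:
            yield key, sum(r[field] for r in records) / len(records)
    return step

def pipeline(*steps):
    """Compose steps left to right; each step consumes the previous output."""
    def run(source):
        data = source
        for step in steps:
            data = step(data)
        return data
    return run

LINES = ["red,0.90,1.04", "blue,0.88,1.02", "red,0.92,1.06"]
avg_weight = pipeline(parse, group_by("color"), average("weight"))
avg_weight_by_color = dict(avg_weight(LINES))
print(avg_weight_by_color)
```

Adding a second calculation off the same source, or a step after the aggregation, is a matter of composing another pipeline rather than writing another job by hand.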
Unit testing
- Kind of hard without some upfront thought
- Separate business logic from Hadoop/Cascading-specific parts
- Try to use domain objects or primitives in business logic, not Tuples or Hadoop structures
- Cascading has a nice testing framework to implement

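A small sketch of that separation, assuming a hypothetical `Candy` domain object and a thin adapter that is the only code aware of the framework's record layout. The business logic can then be tested with an ordinary unit test and no cluster.

```python
import unittest
from dataclasses import dataclass

@dataclass(frozen=True)
class Candy:
    """Hypothetical domain object; no framework types involved."""
    color: str
    weight: float
    width: float

def average_weight(candies):
    """Pure business logic: domain objects in, primitive out."""
    candies = list(candies)
    if not candies:
        raise ValueError("no candies to average")
    return sum(c.weight for c in candies) / len(candies)

def candy_from_fields(fields):
    """Thin adapter: converts a framework-style record into a domain object."""
    color, weight, width = fields
    return Candy(color, float(weight), float(width))

class AverageWeightTest(unittest.TestCase):
    def test_average(self):
        candies = [
            candy_from_fields(("red", "0.90", "1.04")),
            candy_from_fields(("red", "0.92", "1.06")),
        ]
        self.assertAlmostEqual(average_weight(candies), 0.91)

    def test_empty_input_is_rejected(self):
        with self.assertRaises(ValueError):
            average_weight([])

# Run the tests explicitly so the script does not exit the interpreter.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(AverageWeightTest)
outcome = unittest.TextTestRunner(verbosity=0).run(suite)
```

Only `candy_from_fields` would need to change if the surrounding framework or record format changed; the tested logic stays untouched.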
Other testing
- Known sets of data are critical at volume

Common Use Cases
- Evaluation of large volumes of data at a regular frequency
- Algorithms that take a single pass through the data
- Sensor data, log files, web analytics, transactional data
- First pass ‘what is going on’ evaluation before building/paying for ‘real’ reports

Things it is not good for
- Ad-hoc queries (though there are some tools on top of Hadoop to help)
- Fast/real-time evaluations
- OLTP
- Well-known analysis may be better off in a data warehouse

Issues to watch out for
- Lots of small files
- Default scheduler is pretty poor
- Users need shell-level access?!?

Getting started
- Download latest from Cloudera or Apache
- Set up a local-only cluster (really easy to do)
- Download Cascading
- Optionally download Karmasphere if using Eclipse (http://www.karmasphere.com/)
- Build some simple tests/apps
- Running locally is almost the same as in the cluster

Elastic Map Reduce
- Amazon EC2-based Hadoop
- Define as many servers as you want
- Load the data and go
- 60 CENTS per hour per machine for a decent size

So ask yourself
- What could I do with 100 machines in an hour?

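For a sense of scale, the arithmetic behind that question can be worked out using the 60 cents per machine-hour figure quoted on the Elastic Map Reduce slide. The function name is hypothetical, and actual pricing varies by instance type and over time.

```python
# Rate taken from the slide; treat it as an illustrative figure only.
RATE_CENTS_PER_MACHINE_HOUR = 60

def cluster_cost_dollars(machines, hours, rate_cents=RATE_CENTS_PER_MACHINE_HOUR):
    """Cost in dollars of renting `machines` machines for `hours` hours."""
    return machines * hours * rate_cents / 100

print(cluster_cost_dollars(100, 1))  # 60.0
```

At that rate, an hour on 100 machines costs about 60 dollars.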
Ask yourself again …
- What design/architecture do I have because I didn’t have a good way to store the data?
- Or
- What have I shoved into an RDBMS because I had one?

Other Solutions
- Apache Pig: http://hadoop.apache.org/pig/
  - More ‘sql-like’
  - Not as easy to mix regular Java into processes
  - More ‘ad hoc’ than Cascading
- Yahoo! Oozie: http://yahoo.github.com/oozie/
  - Work coordination via configuration not code
  - Allows integration of non-hadoop jobs into process

Resources
- Me: firstname.lastname@example.org @ChrisCurtin
- Chris Wensel: @cwensel
- Web site: www.cascading.org, Mailing list off website
- Atlanta Hadoop Users Group: http://www.meetup.com/Atlanta-Hadoop-Users-Group/
- Cloud Computing Atlanta Meetup: http://www.meetup.com/acloud/
- O’Reilly Hadoop Book: http://oreilly.com/catalog/9780596521974/