AJUG April 2011
My presentation on MapReduce, Hadoop and Cascading from the April 2011 Atlanta Java Users group


    AJUG April 2011 Presentation Transcript

    • Introduction to MapReduce
      Christopher Curtin
    • About Me
      20+ years in Technology
      Background in Factory Automation, Warehouse Management and Food Safety system development before Silverpop
      CTO of Silverpop
      Silverpop is a leading marketing automation and email marketing company
    • Contrived Example
    • What is MapReduce
      “MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key.”
      http://labs.google.com/papers/mapreduce.html
    • Back to the example
      I need to know:
      # of each color M&M
      Average weight of each color
      Average width of each color
    • Traditional approach
      Initialize data structure
      Read CSV
      Split each row into parts
      Find color in data structure
      Increment count, add width, weight
      Write final result
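The steps on this slide can be sketched in plain Java. The CSV layout (color,weight,width) and the class/method names are illustrative assumptions, not from the deck:

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Minimal sketch of the single-threaded "traditional approach":
// initialize a data structure, split each row, find the color,
// increment the count and add width/weight.
public class SingleThreadTally {
    // Per-color running totals.
    public static class Stats {
        public long count;
        public double weightSum, widthSum;
        public double avgWeight() { return weightSum / count; }
        public double avgWidth()  { return widthSum / count; }
    }

    public static Map<String, Stats> tally(List<String> csvRows) {
        Map<String, Stats> byColor = new LinkedHashMap<>();
        for (String row : csvRows) {
            String[] parts = row.split(",");          // color,weight,width (assumed)
            Stats s = byColor.computeIfAbsent(parts[0], k -> new Stats());
            s.count++;
            s.weightSum += Double.parseDouble(parts[1]);
            s.widthSum  += Double.parseDouble(parts[2]);
        }
        return byColor;
    }

    public static void main(String[] args) {
        Map<String, Stats> result = tally(List.of(
                "red,0.9,1.4", "red,1.1,1.6", "blue,1.0,1.5"));
        result.forEach((color, s) -> System.out.printf(
                "%s: count=%d avgWeight=%.2f avgWidth=%.2f%n",
                color, s.count, s.avgWeight(), s.avgWidth()));
    }
}
```

This works fine on one core; the next slides show where it falls over.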
    • ASSume with me
      Determining weight is a CPU intensive step
      8 core machine
      5,000,000,000 pieces per shift to process
      Files ‘rotated’ hourly
    • Thread It!
      Write logic to start multiple threads, pass each one a row (or 1000 rows) to evaluate
    • Issues with threading
      Have to write coordination logic
      Locking of the color data structure
      Disk/Network I/O becomes next bottleneck
      As volume increases, cost of CPUs/Disks isn’t linear
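A minimal sketch of the threaded variant, showing the coordination logic the slide warns about: a thread pool plus a concurrent map guarding the shared color structure. Names and the CSV layout are assumed for illustration:

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.LongAdder;

// Threaded color count: all the pool setup, locking, and shutdown
// below is coordination code you have to write and maintain yourself.
public class ThreadedTally {
    public static Map<String, LongAdder> countColors(List<String> rows, int threads) {
        ConcurrentHashMap<String, LongAdder> counts = new ConcurrentHashMap<>();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (String row : rows) {
            pool.submit(() -> {
                String color = row.split(",")[0];
                // Shared structure must be thread-safe (the locking issue above).
                counts.computeIfAbsent(color, k -> new LongAdder()).increment();
            });
        }
        pool.shutdown();
        try {
            pool.awaitTermination(1, TimeUnit.MINUTES);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return counts;
    }
}
```

Even once this is correct, all threads still share one machine's disk and network, which is the next bottleneck the slide names.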
    • Ideas to solve these problems?
      Put it in a database
      Multiple machines, each processes a file
    • MapReduce
      Map
      Parse the data into name/value pairs
      Can be fast or expensive
      Reduce
      Collect the name/value pairs and perform function on each ‘name’
      Framework makes sure you get all the distinct ‘names’ and only one per invocation
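The contract described here can be simulated in a few lines of plain Java (no Hadoop): map turns each record into a name/value pair, the "framework" groups pairs by name, and reduce is invoked exactly once per distinct name. This is a conceptual sketch, not Hadoop's actual API:

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BiFunction;
import java.util.function.Function;

// Toy illustration of the map -> group-by-key -> reduce contract.
public class MiniMapReduce {
    public static <K, V, R> Map<K, R> run(
            List<String> records,
            Function<String, SimpleEntry<K, V>> map,   // record -> name/value pair
            BiFunction<K, List<V>, R> reduce) {        // one call per distinct name
        // "Shuffle": collect every value emitted for the same key.
        Map<K, List<V>> grouped = new LinkedHashMap<>();
        for (String record : records) {
            SimpleEntry<K, V> pair = map.apply(record);
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                   .add(pair.getValue());
        }
        // Reduce: the framework guarantees one invocation per key.
        Map<K, R> out = new LinkedHashMap<>();
        grouped.forEach((k, vs) -> out.put(k, reduce.apply(k, vs)));
        return out;
    }
}
```

Counting colors then becomes: map emits (color, 1) for each row, and reduce returns the size of each color's value list.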
    • Distributed File System
      System takes the files and makes copies across all the machines in the cluster
      Often files are broken apart and spread around
    • Move processing to the data!
      Rather than copying files to the processes, push the application to the machine where the data lives!
      System pushes jar files and launches JVMs to process
    • Runtime Distribution © Concurrent 2009
    • Hadoop
      Apache’s MapReduce implementation
      Lots of third party support
      Yahoo
      Cloudera
      Others announcing almost daily
    • Example
    • Issues with example
      /ajug/output can’t exist!
      What’s with all the ‘Writable’ classes?
      Data Structures have a lot of coding overhead
      What if I want to do multiple things off the source?
      What if I want to do something after the Reduce?
    • Cascading
      Layer on top of Hadoop
      Introduces Pipes to abstract when mappers or reducers are needed
      Can easily string together logic steps
      No need to think about when to map, when to reduce
      No need for intermediate data structures
    • Sample Example in Cascading
    • Multiple Output example in Cascading
    • Unit testing
      Kind of hard without some upfront thought
      Separate business logic from hadoop/cascading specific parts
      Try to use domain objects or primitives in business logic, not Tuples or Hadoop structures
      Cascading has a nice testing framework to implement
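The separation advised here can be as simple as keeping the arithmetic and parsing in a plain class on primitives, which a mapper or Cascading operation then delegates to. The class and method names below are a hypothetical sketch:

```java
// Business logic with no Tuples, Writables, or Hadoop types:
// trivially unit-testable without standing up a cluster.
public class ColorStatsLogic {
    // Pure arithmetic on primitives.
    public static double average(double sum, long count) {
        return count == 0 ? 0.0 : sum / count;
    }

    // Pure parsing: CSV row -> color key (layout color,weight,width assumed).
    public static String colorOf(String csvRow) {
        return csvRow.split(",")[0];
    }
}
```

A mapper's `map()` or a Cascading operation then becomes a thin adapter that unpacks the framework's types, calls these methods, and packs the result back up.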
    • Other testing
      Known sets of data are critical at volume
    • Common Use Cases
      Evaluation of large volumes of data at a regular frequency
      Algorithms that take a single pass through the data
      Sensor data, log files, web analytics, transactional data
      First pass ‘what is going on’ evaluation before building/paying for ‘real’ reports
    • Things it is not good for
      Ad-hoc queries (though there are some tools on top of Hadoop to help)
      Fast/real-time evaluations
      OLTP
      Well known analysis may be better off in a data warehouse
    • Issues to watch out for
      Lots of small files
      Default scheduler is pretty poor
      Users need shell-level access?!?
    • Getting started
      Download latest from Cloudera or Apache
      Setup local only cluster (really easy to do)
      Download Cascading
      Optionally download Karmasphere if using Eclipse (http://www.karmasphere.com/)
      Build some simple tests/apps
      Running locally is almost the same as in the cluster
    • Elastic Map Reduce
      Amazon EC2-based Hadoop
      Define as many servers as you want
      Load the data and go
      60 CENTS per hour per machine for a decent size
    • So ask yourself
      What could I do with 100 machines in an hour?
    • Ask yourself again …
      What design/ architecture do I have because I didn’t have a good way to store the data?
      Or
      What have I shoved into an RDBMS because I had one?
    • Other Solutions
      Apache Pig: http://hadoop.apache.org/pig/
      More ‘sql-like’
      Not as easy to mix regular Java into processes
      More ‘ad hoc’ than Cascading
      Yahoo! Oozie: http://yahoo.github.com/oozie/
      Work coordination via configuration not code
      Allows integration of non-hadoop jobs into process
    • Resources
      Me: ccurtin@silverpop.com @ChrisCurtin
      Chris Wensel: @cwensel
      Web site: www.cascading.org, Mailing list off website
      Atlanta Hadoop Users Group: http://www.meetup.com/Atlanta-Hadoop-Users-Group/
      Cloud Computing Atlanta Meetup:
      http://www.meetup.com/acloud/
      O’Reilly Hadoop Book:
      http://oreilly.com/catalog/9780596521974/