Ajug april 2011


My presentation on MapReduce, Hadoop, and Cascading from the April 2011 Atlanta Java Users Group.


  • 1. Introduction to MapReduce
    Christopher Curtin
  • 2. About Me
    20+ years in Technology
    Background in Factory Automation, Warehouse Management and Food Safety system development before Silverpop
    CTO of Silverpop
    Silverpop is a leading marketing automation and email marketing company
  • 3. Contrived Example
  • 4. What is MapReduce
    “MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key.” (Dean & Ghemawat, OSDI 2004)
  • 5. Back to the example
    I need to know:
    # of each color M&M
    Average weight of each color
    Average width of each color
  • 6.
  • 7. Traditional approach
    Initialize data structure
    Read CSV
    Split each row into parts
    Find color in data structure
    Increment count, add width, weight
    Write final result
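The steps on this slide can be sketched as a single-pass, single-threaded Java program. This is a framework-free illustration, not code from the talk; the class, the `Stats` holder, and the `color,weight,width` CSV layout are assumptions.

```java
import java.util.HashMap;
import java.util.Map;

// One pass over CSV rows of the form color,weight,width,
// following the slide's steps in order.
public class TraditionalTally {

    public static class Stats {
        public long count;
        public double totalWeight;
        public double totalWidth;
    }

    public static Map<String, Stats> tally(Iterable<String> csvRows) {
        Map<String, Stats> byColor = new HashMap<>();       // initialize data structure
        for (String row : csvRows) {                        // read CSV
            String[] parts = row.split(",");                // split each row into parts
            Stats s = byColor.computeIfAbsent(parts[0], k -> new Stats()); // find color
            s.count++;                                      // increment count
            s.totalWeight += Double.parseDouble(parts[1]);  // add weight
            s.totalWidth  += Double.parseDouble(parts[2]);  // add width
        }
        return byColor;                                     // write final result
    }

    public static void main(String[] args) {
        Map<String, Stats> result = tally(java.util.List.of(
            "red,1.0,2.0", "red,1.2,2.2", "green,0.9,1.9"));
        Stats red = result.get("red");
        System.out.printf("red: count=%d avgWeight=%.2f%n",
            red.count, red.totalWeight / red.count);
    }
}
```

Fine at small scale; the next slides look at what happens at 5,000,000,000 pieces per shift.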
  • 8. ASSume with me
    Determining weight is a CPU intensive step
    8 core machine
    5,000,000,000 pieces per shift to process
    Files ‘rotated’ hourly
  • 9. Thread It!
    Write logic to start multiple threads, pass each one a row (or 1000 rows) to evaluate
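A rough sketch of that threaded version, assuming a thread pool and a shared concurrent map (names are illustrative). Note how much of it is coordination code rather than business logic, which is exactly the issue the next slide raises.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.LongAdder;

// Hand each worker a batch of rows; per-color counts live in a
// shared concurrent map that every thread contends on.
public class ThreadedTally {

    public static Map<String, LongAdder> tally(List<String> rows, int threads) {
        Map<String, LongAdder> counts = new ConcurrentHashMap<>(); // shared data structure
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        int batch = Math.max(1, rows.size() / threads);
        for (int start = 0; start < rows.size(); start += batch) {
            List<String> slice = rows.subList(start, Math.min(rows.size(), start + batch));
            pool.submit(() -> {
                for (String row : slice) {
                    String color = row.split(",")[0];
                    counts.computeIfAbsent(color, k -> new LongAdder()).increment();
                }
            });
        }
        pool.shutdown();                                  // coordination logic we had to write
        try {
            pool.awaitTermination(1, TimeUnit.MINUTES);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, LongAdder> c = tally(List.of("red,1,2", "red,1,2", "blue,1,2"), 4);
        System.out.println("red=" + c.get("red").sum());
    }
}
```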
  • 10. Issues with threading
    Have to write coordination logic
    Locking of the color data structure
    Disk/Network I/O becomes next bottleneck
    As volume increases, cost of CPUs/Disks isn’t linear
  • 11. Ideas to solve these problems?
    Put it in a database
    Multiple machines, each processes a file
  • 12. MapReduce
    Parse the data into name/value pairs
    Can be fast or expensive
    Collect the name/value pairs and perform function on each ‘name’
    Framework makes sure you get all the distinct ‘names’ and only one per invocation
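The two phases above can be simulated in a few lines of plain Java: map each input to a name/value pair, group by name, then call reduce once per distinct name with all of that name's values — the guarantee the framework provides. This is only a sketch of the model; Hadoop's real API looks quite different (Mapper/Reducer classes, Writables, a distributed shuffle).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.BiFunction;
import java.util.function.Function;

public class MiniMapReduce {

    public static <I, V, R> Map<String, R> run(
            List<I> inputs,
            Function<I, Map.Entry<String, V>> mapFn,
            BiFunction<String, List<V>, R> reduceFn) {
        // "Shuffle": collect every value under its distinct name
        Map<String, List<V>> grouped = new TreeMap<>();
        for (I input : inputs) {
            Map.Entry<String, V> pair = mapFn.apply(input);
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        // Reduce: exactly one invocation per distinct name
        Map<String, R> out = new TreeMap<>();
        grouped.forEach((name, values) -> out.put(name, reduceFn.apply(name, values)));
        return out;
    }

    // The M&M example: rows of color,weight -> average weight per color
    public static Map<String, Double> averageWeightByColor(List<String> csvRows) {
        return run(csvRows,
            row -> Map.entry(row.split(",")[0], Double.parseDouble(row.split(",")[1])),
            (color, weights) -> weights.stream().mapToDouble(w -> w).average().orElse(0));
    }

    public static void main(String[] args) {
        System.out.println(averageWeightByColor(
            List.of("red,1.0", "red,2.0", "green,0.5"))); // one reduce call per color
    }
}
```

In the real framework the map and reduce invocations run on different machines, which is where the next two slides come in.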
  • 13. Distributed File System
    System takes the files and makes copies across all the machines in the cluster
    Often files are broken apart and spread around
  • 14. Move processing to the data!
    Rather than copying files to the processes, push the application to the machine where the data lives!
    System pushes jar files and launches JVMs to process
  • 15. Runtime Distribution © Concurrent 2009
  • 16. Hadoop
    Apache’s MapReduce implementation
    Lots of third party support
    More vendors announcing support almost daily
  • 17. Example
  • 18. Issues with example
    /ajug/output can’t exist!
    What’s with all the ‘Writable’ classes?
    Data Structures have a lot of coding overhead
    What if I want to do multiple things off the source?
    What if I want to do something after the Reduce?
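On the first issue: Hadoop refuses to launch a job whose output directory already exists, so callers typically delete or version the path first. A stand-alone sketch of that guard using plain `java.nio` (the behavior mimics Hadoop's; the path and class are hypothetical):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class OutputDirGuard {

    // Hadoop throws in the same situation to avoid clobbering earlier results
    public static void requireAbsent(Path output) throws IOException {
        if (Files.exists(output)) {
            throw new IOException("Output directory " + output + " already exists");
        }
    }

    public static void main(String[] args) throws IOException {
        Path out = Files.createTempDirectory("ajug-output");
        try {
            requireAbsent(out);                  // throws: the directory exists
        } catch (IOException expected) {
            System.out.println(expected.getMessage());
        } finally {
            Files.delete(out);
        }
    }
}
```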
  • 19. Cascading
    Layer on top of Hadoop
    Introduces Pipes to abstract when mappers or reducers are needed
    Can easily string together logic steps
    No need to think about when to map, when to reduce
    No need for intermediate data structures
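Cascading's actual API builds Pipe assemblies (`Each`, `GroupBy`, `Every`) and lets its planner decide where map and reduce phases fall. As a loose, framework-free analogy for "stringing logic steps together" without intermediate data structures, each step here is just a function over the rows, composed in order:

```java
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;

public class PipeSketch {

    @SafeVarargs
    public static Function<List<String>, List<String>> pipe(
            Function<List<String>, List<String>>... steps) {
        Function<List<String>, List<String>> all = Function.identity();
        for (Function<List<String>, List<String>> step : steps) {
            all = all.andThen(step);             // string the logic steps together
        }
        return all;
    }

    public static void main(String[] args) {
        Function<List<String>, List<String>> flow = pipe(
            rows -> rows.stream().map(String::trim).collect(Collectors.toList()),
            rows -> rows.stream().filter(r -> !r.isEmpty()).collect(Collectors.toList()));
        System.out.println(flow.apply(List.of(" red ", "  ", "blue")));
    }
}
```

The point of the analogy: you declare the steps, not where mapping or reducing happens.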
  • 20. Sample Example in Cascading
  • 21. Multiple Output example in Cascading
  • 22. Unit testing
    Kind of hard without some upfront thought
    Separate business logic from hadoop/cascading specific parts
    Try to use domain objects or primitives in business logic, not Tuples or Hadoop structures
    Cascading has a nice testing framework to implement
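The separation advised above might look like this: a plain domain class holding the per-color math, using only primitives, so it can be unit tested without Tuples, Writables, or a cluster (names are illustrative).

```java
// Pure business logic: no Hadoop or Cascading types anywhere.
public class ColorStats {
    private long count;
    private double totalWeight;

    public void add(double weight) {
        count++;
        totalWeight += weight;
    }

    public long count() { return count; }

    public double averageWeight() {
        return count == 0 ? 0.0 : totalWeight / count;
    }

    // A Mapper or Cascading Operation would only unpack its Tuple/Writable
    // input and delegate here, keeping the framework-specific layer thin.
    public static void main(String[] args) {
        ColorStats red = new ColorStats();
        red.add(1.0);
        red.add(3.0);
        System.out.println(red.averageWeight()); // prints 2.0
    }
}
```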
  • 23. Other testing
    Known sets of data are critical at volume
  • 24. Common Use Cases
    Evaluation of large volumes of data at a regular frequency
    Algorithms that take a single pass through the data
    Sensor data, log files, web analytics, transactional data
    First pass ‘what is going on’ evaluation before building/paying for ‘real’ reports
  • 25. Things it is not good for
    Ad-hoc queries (though there are some tools on top of Hadoop to help)
    Fast/real-time evaluations
    Well-known analysis may be better off in a data warehouse
  • 26. Issues to watch out for
    Lots of small files
    Default scheduler is pretty poor
    Users need shell-level access?!?
  • 27. Getting started
    Download latest from Cloudera or Apache
    Setup local only cluster (really easy to do)
    Download Cascading
    Optionally download Karmasphere if using Eclipse
    Build some simple tests/apps
    Running locally is almost the same as in the cluster
  • 28. Elastic Map Reduce
    Amazon EC2-based Hadoop
    Define as many servers as you want
    Load the data and go
    60 cents per hour per machine for a decent-sized instance
  • 29. So ask yourself
    What could I do with 100 machines in an hour?
  • 30. Ask yourself again …
    What design/ architecture do I have because I didn’t have a good way to store the data?
    What have I shoved into an RDBMS because I had one?
  • 31. Other Solutions
    Apache Pig:
    More ‘SQL-like’
    Not as easy to mix regular Java into processes
    More ‘ad hoc’ than Cascading
    Yahoo! Oozie:
    Work coordination via configuration not code
    Allows integration of non-hadoop jobs into process
  • 32. Resources
    Me: @ChrisCurtin
    Chris Wensel: @cwensel
    Web site:
    Mailing list off the web site
    Atlanta Hadoop Users Group:
    Cloud Computing Atlanta Meetup:
    O’Reilly Hadoop Book: