Hadoop and Cascading At AJUG July 2009

Loading...

Flash Player 9 (or above) is needed to view presentations.
We have detected that you do not have it on your computer. To install it, go here.

0 comments

Post a comment

    Post a comment
    Embed Video
    Edit your comment Cancel

    4 Favorites

    Hadoop and Cascading At AJUG July 2009 - Presentation Transcript

    1. Clouds, Hadoop and Cascading
      Christopher Curtin
    2. About Me
      19+ years in Technology
      Background in Factory Automation, Warehouse Management and Food Safety system development before Silverpop
      CTO of Silverpop
      Silverpop is a leading marketing automation and email marketing company
    3. Cloud Computing
      What exactly is ‘cloud computing’?
      Beats me
      Most overused term since ‘dot com’
      Ask 10 people, get 11 answers
    4. What is Map/Reduce?
      Pioneered by Google
      Parallel processing of large data sets across many computers
      Highly fault tolerant
      Splits work into two steps
      Map
      Reduce
    5. Map
      Identifies what is in the input that you want to process
      Can be simple: occurrence of a word
      Can be difficult: evaluate each row and toss those older than 90 days or from IP Range 192.168.1.*
      Output is a list of name/value pairs
      Name and value do not have to be primitives
    6. Reduce
      Takes the name/value pairs from the Map step and does something useful with them
      Map/Reduce Framework determines which Reduce instance to call for which Map values so a Reduce only ‘sees’ one set of ‘Name’ values
      Output is the ‘answer’ to the question
      Example: bytes streamed by IP address from Apache logs
    7. Hadoop
      Apache’s Map/Reduce framework
      Apache License
      Yahoo! Uses a version and they release their enhancements back to the community
    8. HDFS
      Distributed File System WITHOUT NFS etc.
      Hadoop knows which parts of which files are on which machine (say that 5 times fast!)
      “Move the processing to the data” if possible
      Simple API to move files in and out of HDFS
    9. Runtime Distribution © Concurrent 2009
    10. Getting Started with Map/Reduce
      First challenge: real examples
      Second challenge: when to map and when to reduce?
      Third challenge: what if I need more than one of each? How to coordinate?
      Fourth challenge: non-trivial business logic
    11. Cascading
      Open Source
      Puts a wrapper on top of Hadoop
      And so much more …
    12. Main Concepts
      Tuple
      Tap
      Operations
      Pipes
      Flows
      Cascades
    13. Tuple
      A single ‘row’ of data being processed
      Each column is named
      Can access data by name or position
    14. TAP
      Abstraction on top of Hadoop files
      Allows you to define own parser for files
      Example:
      Input = new Hfs(new TextLine(), a_hdfsDirectory + "/" + name);
    15. Operations
      Define what to do on the data
      Each – for each “tuple” in data do this to it
      Group – similar to a ‘group by’ in SQL
      CoGroup – joins of tuple streams together
      Every – for every key in the Group or CoGroup do this
    16. Operations - advanced
      Each operations allow logic on the row, such a parsing dates, creating new attributes etc.
      Every operations allow you to iterate over the ‘group’ of rows to do non-trivial operations.
      Both allow multiple operations in same function, so no nested function calls!
    17. Pipes
      Pipes tie Operations together
      Pipes can be thought of as ‘tuple streams’
      Pipes can be split, allowing parallel execution of Operations
    18. Example Operation
      RowAggregator aggr = new RowAggregator(row);
      Fields groupBy = new Fields(MetricColumnDefinition.RECIPIENT_ID_NAME);
      Pipe formatPipe = new Each("reformat_“ new Fields("line"), a_sentFile);
      formatPipe = new GroupBy(formatPipe, groupBy);
      formatPipe = new Every(formatPipe, Fields.ALL, aggr);
    19. Flows
      Flows are reusable combinations of Taps, Pipes and Operations
      Allows you to build library of functions
      Flows are where the Cascading scheduler comes into play
    20. Cascades
      Cascades are groups of Flows to address a need
      Cascading scheduler looks for dependencies between Flows in a Cascade (and operations in a Flow)
      Determines what operations can be run in parallel and which need to be serialized
    21. Cascading Scheduler
      Once the Flows and Cascades are defined, looks for dependencies
      When executed, tells Hadoop what Map, Reduce or Shuffle steps to take based on what Operations were used
      Knows what can be executed in parallel
      Knows when a step completes what other steps can execute
    22. Dynamic Flow Creation
      Flows can be created at run time based on inputs.
      5 input files one week, 10 the next, Java code creates 10 Flows instead of 5
      Group and Every don’t care how many input Pipes
    23. Dynamic Tuple Definition
      Each operations on input Taps can parse text lines into different Fields
      So one source may have 5 inputs, another 10
      Each operations can used meta data to know how to parse
      Can write Each operations to output common Tuples
      Every operations can output new Tuples as well
    24. Example: Dynamic Fields
      splice = new Each(splice,
      new ExpressionFunction(new Fields("offset"), "(open_time - sent_time)/(60*60*1000)", parameters, classes),
      new Fields("mailing_id", "offset"));
    25. Example: Custom Every Operation
      RowAggregator.java
    26. Mixing non-Hadoop code
      Cascading allows you to mix regular java between Flows in a Cascade
      So you can call out to databases, write intermediates to a file etc.
      We use it to load meta data about the columns in the source files
    27. Features I don’t use
      Failure traps – allow you to write Tuples into ‘error’ files when something goes wrong
      Non-Java scripting
      Assert() statements
      Sampling – throw away % of rows being imported
    28. Quick HDFS/Local Disk
      Using a Path() object you can access an HDFS file directly in your Buffer/Aggregator derived classes
      So you can pass configuration information into these operations in bulk
      You can also access the local disk, but make sure you have NFS or something similar to access the files later – you have no idea where your job will run!
    29. Real Example
      For the hundreds of mailings sent last year
      To millions of recipients
      Show me who opened, how often
      Break it down by how long they have been a subscriber
      And their Gender
      And the number of times clicked on the offer
    30. RDBMS solution
      Lots of million + row joins
      Lots of million + row counts
      Temporary tables since we want multiple answers
      Lots of memory
      Lots of CPU and I/O
      $$ becomes bottleneck to adding more rows or more clients to same logic
    31. Cascading Solution
      Let Hadoop parse input files
      Let Hadoop group all inputs by recipient’s email
      Let Cascading call Every functions to look at all rows for a recipient and ‘flatten’ data
      Split ‘flattened’ data Pipes to process in parallel: time in list, gender, clicked on links
      Bandwidth to export data from RDBMS becomes bottleneck
    32. Pros and Cons
      Pros
      Mix java between map/reduce steps
      Don’t have to worry about when to map, when to reduce
      Don’t have to think about dependencies or how to process
      Data definition can change on the fly
      Cons
      Level above Hadoop – sometimes ‘black magic’
      Data must (should) be outside of database to get most concurrency
    33. Other Solutions
      Apache Pig: http://hadoop.apache.org/pig/
      More ‘sql-like’
      Not as easy to mix regular Java into processes
      More ‘ad hoc’ than Cascading
      Amazon Hadoop http://aws.amazon.com/elasticmapreduce/
      Runs on EC2
      Provide Map and Reduce functions
      Can use Cascading
      Pay as you go
    34. Resources
      Me: ccurtin@silverpop.com @ChrisCurtin
      Chris Wensel: @cwensel
      Web site: www.cascading.org
      Mailing list off website
      AWSomeAtlanta Group: http://www.meetup.com/awsomeatlanta/
      O’Reilly Hadoop Book:
      http://oreilly.com/catalog/9780596521974/
    SlideShare Zeitgeist 2009

    + Christopher CurtinChristopher Curtin Nominate

    custom

    908 views, 4 favs, 2 embeds more stats

    I presented at the Atlanta Java User's Group in Jul more

    More info about this document

    © All Rights Reserved

    Go to text version

    • Total Views 908
      • 882 on SlideShare
      • 26 from embeds
    • Comments 0
    • Favorites 4
    • Downloads 45
    Most viewed embeds
    • 25 views on http://www.johnmwillis.com
    • 1 views on http://static.slidesharecdn.com

    more

    All embeds
    • 25 views on http://www.johnmwillis.com
    • 1 views on http://static.slidesharecdn.com

    less

    Flagged as inappropriate Flag as inappropriate
    Flag as inappropriate

    Select your reason for flagging this presentation as inappropriate. If needed, use the feedback form to let us know more details.

    Cancel
    File a copyright complaint
    Having problems? Go to our helpdesk?

    Categories