Hadoop and Cascading At AJUG July 2009

Published on

I presented at the Atlanta Java User's Group in July 2009 about Hadoop and Cascading.

Transcript

  • 1. Clouds, Hadoop and Cascading
    Christopher Curtin
  • 2. About Me
    19+ years in Technology
    Background in Factory Automation, Warehouse Management and Food Safety system development before Silverpop
    CTO of Silverpop
    Silverpop is a leading marketing automation and email marketing company
  • 3. Cloud Computing
    What exactly is ‘cloud computing’?
    Beats me
    Most overused term since ‘dot com’
    Ask 10 people, get 11 answers
  • 4. What is Map/Reduce?
    Pioneered by Google
    Parallel processing of large data sets across many computers
    Highly fault tolerant
    Splits work into two steps
    Map
    Reduce
  • 5. Map
    Identifies what is in the input that you want to process
    Can be simple: occurrence of a word
    Can be difficult: evaluate each row and toss those older than 90 days or from IP range 192.168.1.*
    Output is a list of name/value pairs
    Name and value do not have to be primitives
  • 6. Reduce
    Takes the name/value pairs from the Map step and does something useful with them
    The Map/Reduce framework determines which Reduce instance to call for which Map values, so a Reduce only ‘sees’ one set of ‘Name’ values
    Output is the ‘answer’ to the question
    Example: bytes streamed by IP address from Apache logs
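The canonical illustration of these two steps is counting words. Below is a minimal sketch (mine, not from the deck) using the org.apache.hadoop.mapred API that Hadoop shipped with at the time; the class names are illustrative.

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

class WordCountMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    // Map: emit a name/value pair (word, 1) for every word in the input line
    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        for (String token : value.toString().split("\\s+")) {
            if (token.length() == 0) continue;
            word.set(token);
            output.collect(word, ONE);
        }
    }
}

class WordCountReducer extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {
    // Reduce: the framework hands us every value for one word; sum them up
    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next().get();
        }
        output.collect(key, new IntWritable(sum));
    }
}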
  • 7. Hadoop
    Apache’s Map/Reduce framework
    Apache License
    Yahoo! uses a version and releases their enhancements back to the community
  • 8. HDFS
    Distributed File System WITHOUT NFS etc.
    Hadoop knows which parts of which files are on which machine (say that 5 times fast!)
    “Move the processing to the data” if possible
    Simple API to move files in and out of HDFS
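A small sketch of that "simple API", written as a fragment in the style of the slides; the paths are made up for illustration.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

Configuration conf = new Configuration();   // picks up the cluster settings on the classpath
FileSystem fs = FileSystem.get(conf);

// copy a local file into HDFS ...
fs.copyFromLocalFile(new Path("/tmp/sent_2009_07.csv"), new Path("/data/sent/sent_2009_07.csv"));

// ... then see where it landed
for (FileStatus status : fs.listStatus(new Path("/data/sent"))) {
    System.out.println(status.getPath() + "  " + status.getLen() + " bytes");
}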
  • 9. Runtime Distribution © Concurrent 2009
  • 10. Getting Started with Map/Reduce
    First challenge: real examples
    Second challenge: when to map and when to reduce?
    Third challenge: what if I need more than one of each? How to coordinate?
    Fourth challenge: non-trivial business logic
  • 11. Cascading
    Open Source
    Puts a wrapper on top of Hadoop
    And so much more …
  • 12. Main Concepts
    Tuple
    Tap
    Operations
    Pipes
    Flows
    Cascades
  • 13. Tuple
    A single ‘row’ of data being processed
    Each column is named
    Can access data by name or position
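A quick sketch of both access styles with the Cascading 1.x tuple classes; the field names and values are made up.

import cascading.tuple.Fields;
import cascading.tuple.Tuple;
import cascading.tuple.TupleEntry;

Fields fields = new Fields("recipient_id", "open_count");
TupleEntry entry = new TupleEntry(fields, new Tuple("12345", 3));

String recipient = entry.getString("recipient_id");   // access by name ...
int opens = entry.getInteger("open_count");

Object first = entry.getTuple().get(0);                // ... or by position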
  • 14. Tap
    Abstraction on top of Hadoop files
    Allows you to define your own parser for files
    Example:
    Input = new Hfs(new TextLine(), a_hdfsDirectory + "/" + name);
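Expanding the slide's one-liner into a source and a sink, in the same Cascading 1.x style; the HDFS paths here are illustrative.

import cascading.scheme.TextLine;
import cascading.tap.Hfs;
import cascading.tap.SinkMode;
import cascading.tap.Tap;
import cascading.tuple.Fields;

// source: read raw lines out of an HDFS directory
Tap source = new Hfs(new TextLine(new Fields("offset", "line")), "/data/sent/2009-07");

// sink: write results back out, replacing anything already there
Tap sink = new Hfs(new TextLine(), "/output/opens-by-recipient", SinkMode.REPLACE);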
  • 15. Operations
    Define what to do on the data
    Each – for each “tuple” in the data, do this to it
    Group – similar to a ‘group by’ in SQL
    CoGroup – joins tuple streams together
    Every – for every key in the Group or CoGroup, do this
  • 16. Operations - advanced
    Each operations allow logic on the row, such as parsing dates, creating new attributes, etc.
    Every operations allow you to iterate over the ‘group’ of rows to do non-trivial operations.
    Both allow multiple operations in the same function, so no nested function calls!
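A word-count pipe assembly is the smallest example that uses Each, Group and Every together. This sketch (not from the deck) uses two built-in Cascading 1.x operations, RegexSplitGenerator and Count, instead of custom ones.

import cascading.operation.aggregator.Count;
import cascading.operation.regex.RegexSplitGenerator;
import cascading.pipe.Each;
import cascading.pipe.Every;
import cascading.pipe.GroupBy;
import cascading.pipe.Pipe;
import cascading.tuple.Fields;

Pipe wordCount = new Pipe("wordcount");

// Each: split every incoming line into one tuple per word
wordCount = new Each(wordCount, new Fields("line"),
        new RegexSplitGenerator(new Fields("word"), "\\s+"));

// Group: bring all tuples with the same word together
wordCount = new GroupBy(wordCount, new Fields("word"));

// Every: run an aggregator over each group
wordCount = new Every(wordCount, new Count(new Fields("count")));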
  • 17. Pipes
    Pipes tie Operations together
    Pipes can be thought of as ‘tuple streams’
    Pipes can be split, allowing parallel execution of Operations
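Splitting is just naming a new Pipe off an existing one. A tiny sketch, with branch names made up:

import cascading.pipe.Pipe;

Pipe flattened = new Pipe("flattened");            // the shared upstream tuple stream

// two branches off the same stream; each can have its own Operations and its own sink
Pipe byGender = new Pipe("by_gender", flattened);
Pipe byTenure = new Pipe("by_tenure", flattened);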
  • 18. Example Operation
    RowAggregator aggr = new RowAggregator(row);
    Fields groupBy = new Fields(MetricColumnDefinition.RECIPIENT_ID_NAME);
    Pipe formatPipe = new Each("reformat_", new Fields("line"), a_sentFile);
    formatPipe = new GroupBy(formatPipe, groupBy);
    formatPipe = new Every(formatPipe, Fields.ALL, aggr);
  • 19. Flows
    Flows are reusable combinations of Taps, Pipes and Operations
    Allows you to build a library of functions
    Flows are where the Cascading scheduler comes into play
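Wiring a Tap, Pipe and sink into a Flow looks roughly like this in Cascading 1.x; source, sink and wordCount are the sketches from above, and Main.class stands in for whatever class lives in your application jar.

import java.util.Properties;
import cascading.flow.Flow;
import cascading.flow.FlowConnector;

Properties properties = new Properties();
// tell Hadoop which jar (containing your Operations) to ship to the cluster
FlowConnector.setApplicationJarClass(properties, Main.class);

FlowConnector connector = new FlowConnector(properties);
Flow flow = connector.connect("word-count", source, sink, wordCount);
flow.complete();   // blocks until the underlying Hadoop jobs finish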
  • 20. Cascades
    Cascades are groups of Flows to address a need
    The Cascading scheduler looks for dependencies between Flows in a Cascade (and operations in a Flow)
    Determines which operations can be run in parallel and which need to be serialized
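Building a Cascade is one call; the flow names here are made up, and the scheduler works out the ordering from the Taps the Flows share.

import cascading.cascade.Cascade;
import cascading.cascade.CascadeConnector;

// importFlow, genderFlow and tenureFlow are illustrative Flows built with a FlowConnector
Cascade cascade = new CascadeConnector().connect(importFlow, genderFlow, tenureFlow);
cascade.complete();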
  • 21. Cascading Scheduler
    Once the Flows and Cascades are defined, looks for dependencies
    When executed, tells Hadoop what Map, Reduce or Shuffle steps to take based on what Operations were used
    Knows what can be executed in parallel
    Knows, when a step completes, what other steps can execute
  • 22. Dynamic Flow Creation
    Flows can be created at run time based on inputs.
    5 input files one week, 10 the next: the Java code creates 10 Flows instead of 5
    Group and Every don’t care how many input Pipes there are
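One way this can look, sketched under assumptions: the input directory is illustrative and buildPipeFor() is a hypothetical helper that assembles the Pipe for one file (imports as in the earlier sketches).

List<Flow> flows = new ArrayList<Flow>();
FlowConnector connector = new FlowConnector(properties);
FileSystem fs = FileSystem.get(new Configuration());

// one Flow per file discovered at run time
for (FileStatus status : fs.listStatus(new Path("/data/incoming"))) {
    String name = status.getPath().getName();
    Tap in = new Hfs(new TextLine(), status.getPath().toString());
    Tap out = new Hfs(new TextLine(), "/output/" + name, SinkMode.REPLACE);
    flows.add(connector.connect(name, in, out, buildPipeFor(name)));
}

// hand them all to one Cascade and let the scheduler sort out the ordering
new CascadeConnector().connect(flows.toArray(new Flow[flows.size()])).complete();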
  • 23. Dynamic Tuple Definition
    Each operations on input Taps can parse text lines into different Fields
    So one source may have 5 inputs, another 10
    Each operations can use meta data to know how to parse
    Can write Each operations to output common Tuples
    Every operations can output new Tuples as well
  • 24. Example: Dynamic Fields
    splice = new Each(splice,
        new ExpressionFunction(new Fields("offset"), "(open_time - sent_time)/(60*60*1000)", parameters, classes),
        new Fields("mailing_id", "offset"));
  • 25. Example: Custom Every Operation
    RowAggregator.java
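RowAggregator.java itself is not reproduced in the transcript. As a stand-in, here is a minimal sketch of the shape a custom Every operation takes in the Cascading 1.x API: an Aggregator that counts the rows in each group. The class and field names are assumptions, not Silverpop's code.

import cascading.flow.FlowProcess;
import cascading.operation.Aggregator;
import cascading.operation.AggregatorCall;
import cascading.operation.BaseOperation;
import cascading.tuple.Fields;
import cascading.tuple.Tuple;

class OpenCountAggregator extends BaseOperation<Integer>
        implements Aggregator<Integer> {

    public OpenCountAggregator() {
        super(new Fields("open_count"));          // fields this aggregator emits
    }

    public void start(FlowProcess flowProcess, AggregatorCall<Integer> call) {
        call.setContext(0);                       // reset the running total for a new group
    }

    public void aggregate(FlowProcess flowProcess, AggregatorCall<Integer> call) {
        call.setContext(call.getContext() + 1);   // one tuple seen for this group
    }

    public void complete(FlowProcess flowProcess, AggregatorCall<Integer> call) {
        call.getOutputCollector().add(new Tuple(call.getContext()));
    }
}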
  • 26. Mixing non-Hadoop code
    Cascading allows you to mix regular Java between Flows in a Cascade
    So you can call out to databases, write intermediates to a file, etc.
    We use it to load meta data about the columns in the source files
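A sketch of the idea: finish one Flow, use plain JDBC to load column meta data, then use it to declare the Fields for the next Flow's source Tap. The table, columns and connection details are invented for illustration (imports: java.sql.*, java.util.*, plus the Cascading classes used earlier).

importFlow.complete();

Connection conn = DriverManager.getConnection(jdbcUrl, dbUser, dbPassword);
ResultSet rs = conn.createStatement()
        .executeQuery("select column_name from source_file_columns where file_id = 42");
List<String> columnNames = new ArrayList<String>();
while (rs.next()) {
    columnNames.add(rs.getString("column_name"));
}
conn.close();

// the meta data becomes the Fields declaration for the next Flow's source Tap
Fields sourceFields = new Fields(columnNames.toArray(new String[columnNames.size()]));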
  • 27. Features I don’t use
    Failure traps – allow you to write Tuples into ‘error’ files when something goes wrong
    Non-Java scripting
    Assert() statements
    Sampling – throw away % of rows being imported
  • 28. Quick HDFS/Local Disk
    Using a Path() object you can access an HDFS file directly in your Buffer/Aggregator derived classes
    So you can pass configuration information into these operations in bulk
    You can also access the local disk, but make sure you have NFS or something similar to access the files later – you have no idea where your job will run!
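For example, inside a Buffer or Aggregator you can pull a small configuration file straight out of HDFS with the plain Hadoop FileSystem API; the path here is illustrative, and new Configuration() picks up the cluster settings on the task node (imports: java.io.*, org.apache.hadoop.conf.*, org.apache.hadoop.fs.*).

FileSystem fs = FileSystem.get(new Configuration());
BufferedReader reader = new BufferedReader(
        new InputStreamReader(fs.open(new Path("/config/column_definitions.txt"))));
try {
    String line;
    while ((line = reader.readLine()) != null) {
        // parse one configuration row here
    }
} finally {
    reader.close();
}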
  • 29. Real Example
    For the hundreds of mailings sent last year
    To millions of recipients
    Show me who opened, how often
    Break it down by how long they have been a subscriber
    And their gender
    And the number of times they clicked on the offer
  • 30. RDBMS solution
    Lots of million+ row joins
    Lots of million+ row counts
    Temporary tables, since we want multiple answers
    Lots of memory
    Lots of CPU and I/O
    $$ becomes the bottleneck to adding more rows or more clients to the same logic
  • 31. Cascading Solution
    Let Hadoop parse the input files
    Let Hadoop group all inputs by the recipient’s email
    Let Cascading call Every functions to look at all rows for a recipient and ‘flatten’ the data
    Split the ‘flattened’ data into Pipes to process in parallel: time in list, gender, clicked on links
    Bandwidth to export data from the RDBMS becomes the bottleneck
  • 32. Pros and Cons
    Pros
    Mix Java between map/reduce steps
    Don’t have to worry about when to map, when to reduce
    Don’t have to think about dependencies or how to process
    Data definition can change on the fly
    Cons
    Level above Hadoop – sometimes ‘black magic’
    Data must (should) be outside of the database to get the most concurrency
  • 33. Other Solutions
    Apache Pig: http://hadoop.apache.org/pig/
    More ‘SQL-like’
    Not as easy to mix regular Java into processes
    More ‘ad hoc’ than Cascading
    Amazon Elastic MapReduce: http://aws.amazon.com/elasticmapreduce/
    Runs on EC2
    You provide the Map and Reduce functions
    Can use Cascading
    Pay as you go
  • 34. Resources
    Me: ccurtin@silverpop.com @ChrisCurtin
    Chris Wensel: @cwensel
    Web site: www.cascading.org
    Mailing list off the website
    AWSomeAtlanta Group: http://www.meetup.com/awsomeatlanta/
    O’Reilly Hadoop Book: http://oreilly.com/catalog/9780596521974/
