Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.
NoSQL, Hadoop and Cascading<br />Christopher Curtin<br />
About Me<br />20+ years in Technology<br />Background in Factory Automation, Warehouse Management and Food Safety system d...
Topics<br />Why NoSQL?<br />What is NoSQL?<br />Hadoop<br />Cascading<br />Questions<br />
Why NoSQL?<br />Should be ‘not just SQL’<br />Not all problems are relational<br />Not all schemas are known when developi...
Cost<br />Obligatory slam on Oracle/IBM/Microsoft<br />CPU/Core costs <br />Disks for fast RDBMS performance<br />Clusteri...
NIH<br />Some of this is ‘not invented here’<br />But many of these solutions came from the Internet-scale businesses like...
Topics<br />Why NoSQL?<br />What is NoSQL?<br />Hadoop<br />Cascading<br />Questions<br />
What is NoSQL?<br />Lots of definitions, but  a few key ideas<br />Eventual Consistency vs. ACID (usually)<br />Distribute...
Graph Databases<br />For handling deeply associative data (graphs/networks)<br />Relational isn’t good at recursive struct...
Key/Value Stores<br />Don’t call them databases ;-)<br />No JOIN concept, instead duplicate data<br />Think of it as a hug...
Document Databases<br />Inspired by Lotus Notes. Really.<br />Collections of Key/Value stores<br />But with some additiona...
Topics<br />Why NoSQL?<br />What is NoSQL?<br />Hadoop<br />Cascading<br />Questions<br />
Map/Reduce<br />Made famous by Google for how the index the web<br />Parallel processing of large datasets across multiple...
Map<br />Identifies what is in the input that you want to process<br />Can be simple: occurrence of a word<br />Can be dif...
Reduce<br />Takes the name/value pairs from the Map step and does something useful with them<br />Map/Reduce Framework det...
HDFS<br />Distributed File System WITHOUT NFS etc.<br />Hadoop knows which parts of which files are on which machine (say ...
Runtime Distribution © Concurrent 2009<br />
Topics<br />Why NoSQL?<br />What is NoSQL?<br />Hadoop<br />Cascading<br />Questions<br />
Getting Started with Map/Reduce<br />First challenge: real examples<br />Second challenge: when to map and when to reduce?...
Cascading<br />Hadoop coding is non-trivial<br />Hadoop is looking for a class to do Map steps and a class to do Reduce st...
Tuple<br />A single ‘row’ of data being processed<br />Each column is named<br />Can access data by name or position<br />
TAP<br />Abstraction on top of Hadoop files<br />Allows you to define own parser for files<br />Example:<br />Input = new ...
Operations<br />Define what to do on the data<br />Each – for each “tuple” in data do this to it<br />Group – similar to a...
Pipes<br />Pipes tie Operations together<br />Pipes can be thought of as ‘tuple streams’<br />Pipes can be split, allowing...
Operations - advanced<br />Each operations allow logic on the row, such a parsing dates, creating new attributes etc.<br /...
PIPE Splitting<br />PIPEs define processing flows<br />A SPLIT of a PIPE will tell Cascading to pass the TUPLE to another ...
Flows<br />Flows are reusable combinations of Taps, Pipes and Operations<br />Allows you to build library of functions <br...
Cascading Scheduler<br />Once the Flows and Cascades are defined, looks for dependencies<br />When executed, tells Hadoop ...
Example Operation<br />RowAggregator aggr = new RowAggregator(row);<br />Fields groupBy = new Fields(ColumnDefinition.RECI...
Runtime Distribution © Concurrent 2009<br />
Dynamic Flow Creation<br />Flows can be created at run time based on inputs.<br />5 input files one week, 10 the next, Jav...
Dynamic Tuple Definition<br />Each operations on input Taps can parse text lines into different Fields<br />So one source ...
Mixing non-Hadoop code<br />Cascading allows you to mix regular java between Flows in a Cascade<br />So you can call out t...
External File Creation<br />An aggregator can write to the local file system<br />Sometimes you don’t need or want to push...
Real Example<br />For the hundreds of mailings sent last year<br />To millions of recipients<br />Show me who opened, how ...
RDBMS solution<br />Lots of million + row joins<br />Lots of million + row counts<br />Temporary tables since we want mult...
Cascading Solution<br />Let Hadoop parse input files<br />Let Cascading group all inputs by recipient’s email<br />Let Cas...
Pros and Cons<br />Pros<br />Mix java between map/reduce steps<br />Don’t have to worry about when to map, when to reduce<...
Other Solutions<br />Apache Pig: http://hadoop.apache.org/pig/<br />More ‘sql-like’ <br />Not as easy to mix regular Java ...
Resources<br />Me: ccurtin@silverpop.com @ChrisCurtin<br />Chris Wensel: @cwensel <br />Web site: www.cascading.org<br />M...
Upcoming SlideShare
Loading in …5
×

NoSQL, Hadoop, Cascading June 2010

4,408 views

Published on

Introduction to NoSQL and Hadoop/Cascading for the Atlanta IASA Chapter

Published in: Technology
  • Be the first to comment

NoSQL, Hadoop, Cascading June 2010

  1. 1. NoSQL, Hadoop and Cascading<br />Christopher Curtin<br />
  2. 2. About Me<br />20+ years in Technology<br />Background in Factory Automation, Warehouse Management and Food Safety system development before Silverpop<br />CTO of Silverpop<br />
  3. 3. Topics<br />Why NoSQL?<br />What is NoSQL?<br />Hadoop<br />Cascading<br />Questions<br />
  4. 4. Why NoSQL?<br />Should be ‘not just SQL’<br />Not all problems are relational<br />Not all schemas are known when developing a solution<br />Is LAMP the reason?<br />Difficult to assign a value to a lot of the data in an enterprise<br />
  5. 5. Cost<br />Obligatory slam on Oracle/IBM/Microsoft<br />CPU/Core costs <br />Disks for fast RDBMS performance<br />Clustering, Redundancy and DR<br />Obligatory slam on Accenture, IBM and Corp IT<br />Data warehouses<br />Report writers <br />Lead Time<br />
  6. 6. NIH<br />Some of this is ‘not invented here’<br />But many of these solutions came from the Internet-scale businesses like Facebook, Google, Twitter, Amazon<br />Problems most of us will never see or really understand<br />
  7. 7. Topics<br />Why NoSQL?<br />What is NoSQL?<br />Hadoop<br />Cascading<br />Questions<br />
  8. 8. What is NoSQL?<br />Lots of definitions, but a few key ideas<br />Eventual Consistency vs. ACID (usually)<br />Distributed on commodity hardware, easy to add/remove nodes<br />(Typically) Open Source<br />(Usually) non-SQL interface<br />A few examples<br />Graph databases<br />Key/Value stores<br />Document databases<br />
  9. 9. Graph Databases<br />For handling deeply associative data (graphs/networks)<br />Relational isn’t good at recursive structures<br />Data both at the leaves (nodes) and edges (relationships) <br />Very fast to navigate<br />“Whiteboard Friendly”<br />Neo4j (www.neo4j.org)<br />
  10. 10. Key/Value Stores<br />Don’t call them databases ;-)<br />No JOIN concept, instead duplicate data<br />Think of it as a huge Associated Array (Map)<br />Not all values have all attributes<br />Designed for fast reading<br />‘Ridiculous’ sized<br />SimpleDB(Amazon), BigTable (Google)<br />
  11. 11. Document Databases<br />Inspired by Lotus Notes. Really.<br />Collections of Key/Value stores<br />But with some additional visibility into the data of the stores <br />CouchDB (couchdb.apache.org)<br />Stores JSON documents<br />Can search in documents quickly<br />No schema so Domain changes are easier<br />
  12. 12. Topics<br />Why NoSQL?<br />What is NoSQL?<br />Hadoop<br />Cascading<br />Questions<br />
  13. 13. Map/Reduce<br />Made famous by Google for how the index the web<br />Parallel processing of large datasets across multiple commodity nodes<br />Highly fault tolerant<br />Hadoop is Apache’s implementation of Map/Reduce<br />
  14. 14. Map<br />Identifies what is in the input that you want to process<br />Can be simple: occurrence of a word<br />Can be difficult: evaluate each row and toss those older than 90 days or from IP Range 192.168.1.*<br />Output is a list of name/value pairs<br />Name and value do not have to be primitives<br />
  15. 15. Reduce<br />Takes the name/value pairs from the Map step and does something useful with them<br />Map/Reduce Framework determines which Reduce instance to call for which Map values so a Reduce only ‘sees’ one set of ‘Name’ values<br />Output is the ‘answer’ to the question<br />Example: bytes streamed by IP address from Apache logs<br />
  16. 16. HDFS<br />Distributed File System WITHOUT NFS etc.<br />Hadoop knows which parts of which files are on which machine (say that 5 times fast!)<br />“Move the processing to the data” if possible<br />Simple API to move files in and out of HDFS<br />3 Copies of the data in the cluster for redundancy AND performance<br />
  17. 17. Runtime Distribution © Concurrent 2009<br />
  18. 18. Topics<br />Why NoSQL?<br />What is NoSQL?<br />Hadoop<br />Cascading<br />Questions<br />
  19. 19. Getting Started with Map/Reduce<br />First challenge: real examples<br />Second challenge: when to map and when to reduce?<br />Third challenge: what if I need more than one of each? How to coordinate?<br />Fourth challenge: non-trivial business logic<br />
  20. 20. Cascading<br />Hadoop coding is non-trivial<br />Hadoop is looking for a class to do Map steps and a class to do Reduce step<br />What if you need multiple in your application? Who coordinates what can be run in parallel?<br />What if you need to do non-Hadoop logic between Hadoop steps?<br />
  21. 21. Tuple<br />A single ‘row’ of data being processed<br />Each column is named<br />Can access data by name or position<br />
  22. 22. TAP<br />Abstraction on top of Hadoop files<br />Allows you to define own parser for files<br />Example:<br />Input = new Hfs(new TextLine(), a_hdfsDirectory + "/" + name);<br />
  23. 23. Operations<br />Define what to do on the data<br />Each – for each “tuple” in data do this to it<br />Group – similar to a ‘group by’ in SQL<br />CoGroup – joins of tuple streams together<br />Every – for every key in the Group or CoGroup do this<br />
  24. 24. Pipes<br />Pipes tie Operations together<br />Pipes can be thought of as ‘tuple streams’<br />Pipes can be split, allowing parallel execution of Operations<br />
  25. 25. Operations - advanced<br />Each operations allow logic on the row, such a parsing dates, creating new attributes etc.<br />Every operations allow you to iterate over the ‘group’ of rows to do non-trivial operations.<br />Both allow multiple operations in same function, so no nested function calls!<br />
  26. 26. PIPE Splitting<br />PIPEs define processing flows<br />A SPLIT of a PIPE will tell Cascading to pass the TUPLE to another set of logic<br />Allows single parsing EACH to feed multiple EVERY steps<br />Or output from one EVERY to feed different EVERY steps<br />
  27. 27. Flows<br />Flows are reusable combinations of Taps, Pipes and Operations<br />Allows you to build library of functions <br />Groups of Flow are called Cascades<br />
  28. 28. Cascading Scheduler<br />Once the Flows and Cascades are defined, looks for dependencies<br />When executed, tells Hadoop what Map, Reduce or Shuffle steps to take based on what Operations were used<br />Knows what can be executed in parallel<br />Knows when a step completes what other steps can execute<br />
  29. 29. Example Operation<br />RowAggregator aggr = new RowAggregator(row);<br />Fields groupBy = new Fields(ColumnDefinition.RECIPIENT_ID_NAME);<br />Pipe formatPipe = new Each("reformat_“ new Fields("line"), a_sentFile);<br />formatPipe = new GroupBy(formatPipe, groupBy);<br />formatPipe = new Every(formatPipe, Fields.ALL, aggr);<br />
  30. 30. Runtime Distribution © Concurrent 2009<br />
  31. 31. Dynamic Flow Creation<br />Flows can be created at run time based on inputs.<br />5 input files one week, 10 the next, Java code creates 10 Flows instead of 5<br />Group and Every don’t care how many input Taps<br />
  32. 32. Dynamic Tuple Definition<br />Each operations on input Taps can parse text lines into different Fields<br />So one source may have 5 inputs, another 10<br />Each operations can used meta data to know how to parse<br />Can write Each operations to output common Tuples<br />Every operations can output new Tuples as well<br />Dynamically provide GroupBy fields so end users can build own ad-hoc logic<br />
  33. 33. Mixing non-Hadoop code<br />Cascading allows you to mix regular java between Flows in a Cascade<br />So you can call out to databases, write intermediates to a file etc.<br />
  34. 34. External File Creation<br />An aggregator can write to the local file system<br />Sometimes you don’t need or want to push back to HDFS <br />NFS mounts needed since you don’t know where the logic is going to execute<br />GROUP BY is critical to doing this right<br />
  35. 35. Real Example<br />For the hundreds of mailings sent last year<br />To millions of recipients<br />Show me who opened, how often<br />Break it down by how long they have been a subscriber<br />And their Gender<br />And the number of times clicked on the offer<br />
  36. 36. RDBMS solution<br />Lots of million + row joins<br />Lots of million + row counts<br />Temporary tables since we want multiple answers<br />Lots of memory<br />Lots of CPU and I/O<br />$$ becomes bottleneck to adding more rows or more clients to same logic<br />
  37. 37. Cascading Solution<br />Let Hadoop parse input files<br />Let Cascading group all inputs by recipient’s email<br />Let Cascading call Every functions to look at all rows for a recipient and ‘flatten’ data<br />Split ‘flattened’ data Pipes to process in parallel: time in list, gender, clicked on links<br />Bandwidth to export data from RDBMS becomes bottleneck<br />
  38. 38. Pros and Cons<br />Pros<br />Mix java between map/reduce steps<br />Don’t have to worry about when to map, when to reduce<br />Don’t have to think about dependencies or how to process<br />Data definition can change on the fly<br />Cons<br />Level above Hadoop – sometimes ‘black magic’<br />Data must (should) be outside of database to get most concurrency <br />
  39. 39. Other Solutions<br />Apache Pig: http://hadoop.apache.org/pig/<br />More ‘sql-like’ <br />Not as easy to mix regular Java into processes<br />More ‘ad hoc’ than Cascading<br />Amazon Hadoop http://aws.amazon.com/elasticmapreduce/<br />Runs on EC2<br />Provide Map and Reduce functions<br />Can use Cascading <br />Pay as you go<br />
  40. 40. Resources<br />Me: ccurtin@silverpop.com @ChrisCurtin<br />Chris Wensel: @cwensel <br />Web site: www.cascading.org<br />Mailing list off website<br />AWSome Atlanta Group: http://www.meetup.com/awsomeatlanta/<br />O’Reilly Hadoop Book: <br />http://oreilly.com/catalog/9780596521974/<br />

×