NoSQL, Hadoop, Cascading June 2010


Introduction to NoSQL and Hadoop/Cascading for the Atlanta IASA Chapter



  1. NoSQL, Hadoop and Cascading
       • Christopher Curtin
  2. About Me
       • 20+ years in Technology
       • Background in Factory Automation, Warehouse Management and Food Safety system development before Silverpop
       • CTO of Silverpop
  3. Topics
       • Why NoSQL?
       • What is NoSQL?
       • Hadoop
       • Cascading
       • Questions
  4. Why NoSQL?
       • Should be ‘not just SQL’
       • Not all problems are relational
       • Not all schemas are known when developing a solution
       • Is LAMP the reason?
       • Difficult to assign a value to a lot of the data in an enterprise
  5. Cost
       • Obligatory slam on Oracle/IBM/Microsoft
       • CPU/Core costs
       • Disks for fast RDBMS performance
       • Clustering, Redundancy and DR
       • Obligatory slam on Accenture, IBM and Corp IT
       • Data warehouses
       • Report writers
       • Lead Time
  6. NIH
       • Some of this is ‘not invented here’
       • But many of these solutions came from Internet-scale businesses like Facebook, Google, Twitter and Amazon
       • Problems most of us will never see or really understand
  7. Topics
       • Why NoSQL?
       • What is NoSQL?
       • Hadoop
       • Cascading
       • Questions
  8. What is NoSQL?
       • Lots of definitions, but a few key ideas:
       • Eventual Consistency vs. ACID (usually)
       • Distributed on commodity hardware; easy to add/remove nodes
       • (Typically) Open Source
       • (Usually) non-SQL interface
       • A few examples:
            • Graph databases
            • Key/Value stores
            • Document databases
  9. Graph Databases
       • For handling deeply associative data (graphs/networks)
       • Relational isn’t good at recursive structures
       • Data lives at both the nodes and the edges (relationships)
       • Very fast to navigate
       • “Whiteboard Friendly”
       • Example: Neo4j
 10. Key/Value Stores
       • Don’t call them databases ;-)
       • No JOIN concept; instead, duplicate data
       • Think of it as a huge associative array (Map)
       • Not all values have all attributes
       • Designed for fast reading
       • ‘Ridiculous’ sizes
       • SimpleDB (Amazon), BigTable (Google)
 11. Document Databases
       • Inspired by Lotus Notes. Really.
       • Collections of Key/Value stores
       • But with some additional visibility into the data of the stores
       • Example: CouchDB
            • Stores JSON documents
            • Can search within documents quickly
            • No schema, so Domain changes are easier
 12. Topics
       • Why NoSQL?
       • What is NoSQL?
       • Hadoop
       • Cascading
       • Questions
 13. Map/Reduce
       • Made famous by Google for how they index the web
       • Parallel processing of large datasets across multiple commodity nodes
       • Highly fault tolerant
       • Hadoop is Apache’s implementation of Map/Reduce
 14. Map
       • Identifies what is in the input that you want to process
       • Can be simple: occurrences of a word
       • Can be complex: evaluate each row and toss those older than 90 days or from IP range 192.168.1.*
       • Output is a list of name/value pairs
       • Name and value do not have to be primitives
 15. Reduce
       • Takes the name/value pairs from the Map step and does something useful with them
       • The Map/Reduce framework determines which Reduce instance to call for which Map values, so a Reduce only ‘sees’ one set of ‘Name’ values
       • Output is the ‘answer’ to the question
       • Example: bytes streamed by IP address from Apache logs
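The Map and Reduce steps above can be sketched in plain Java, with no Hadoop dependency. This is only an illustration of the pattern, using the slide’s bytes-by-IP example; the log format ("ip bytes") and the class name are invented here, and a real Hadoop job would implement Mapper and Reducer classes instead:

```java
import java.util.*;

public class MapReduceSketch {
    // Map step: emit a (name, value) pair per input line, here (ip, bytes).
    static List<Map.Entry<String, Long>> map(List<String> lines) {
        List<Map.Entry<String, Long>> pairs = new ArrayList<>();
        for (String line : lines) {
            String[] f = line.split(" ");  // assumed layout: "ip bytes"
            pairs.add(new AbstractMap.SimpleEntry<>(f[0], Long.parseLong(f[1])));
        }
        return pairs;
    }

    // Reduce step: the framework groups pairs by name; here we sum bytes per IP.
    static Map<String, Long> reduce(List<Map.Entry<String, Long>> pairs) {
        Map<String, Long> totals = new TreeMap<>();
        for (Map.Entry<String, Long> p : pairs)
            totals.merge(p.getKey(), p.getValue(), Long::sum);
        return totals;
    }

    public static void main(String[] args) {
        List<String> log = Arrays.asList("10.0.0.1 512", "10.0.0.2 100", "10.0.0.1 24");
        System.out.println(reduce(map(log))); // {10.0.0.1=536, 10.0.0.2=100}
    }
}
```

In real Hadoop the grouping between the two steps (the shuffle) is done by the framework across machines; the point here is only the shape of the two functions.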
 16. HDFS
       • Distributed file system WITHOUT NFS etc.
       • Hadoop knows which parts of which files are on which machine (say that 5 times fast!)
       • “Move the processing to the data” when possible
       • Simple API to move files in and out of HDFS
       • 3 copies of the data in the cluster, for redundancy AND performance
 17. Runtime Distribution (diagram) © Concurrent 2009
 18. Topics
       • Why NoSQL?
       • What is NoSQL?
       • Hadoop
       • Cascading
       • Questions
 19. Getting Started with Map/Reduce
       • First challenge: finding real examples
       • Second challenge: when to map and when to reduce?
       • Third challenge: what if I need more than one of each? How do I coordinate them?
       • Fourth challenge: non-trivial business logic
 20. Cascading
       • Hadoop coding is non-trivial
       • Hadoop expects one class for the Map step and one class for the Reduce step
       • What if you need multiple steps in your application? Who coordinates what can run in parallel?
       • What if you need to do non-Hadoop logic between Hadoop steps?
 21. Tuple
       • A single ‘row’ of data being processed
       • Each column is named
       • Can access data by name or position
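The idea can be shown with a tiny hypothetical class. This is not Cascading’s actual Tuple API, just a minimal sketch of “named columns, accessible by name or position”:

```java
import java.util.*;

public class TupleSketch {
    private final List<String> names;   // column names
    private final List<Object> values;  // column values, same order as names

    public TupleSketch(List<String> names, List<Object> values) {
        this.names = names;
        this.values = values;
    }

    // Access by position
    public Object get(int pos) { return values.get(pos); }

    // Access by column name
    public Object get(String name) { return values.get(names.indexOf(name)); }

    public static void main(String[] args) {
        TupleSketch row = new TupleSketch(
            Arrays.asList("recipient_id", "email"),
            Arrays.<Object>asList(42, "a@example.com"));
        System.out.println(row.get(0));        // 42
        System.out.println(row.get("email"));  // a@example.com
    }
}
```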
 22. TAP
       • Abstraction on top of Hadoop files
       • Allows you to define your own parser for files
       • Example:
         Input = new Hfs(new TextLine(), a_hdfsDirectory + "/" + name);
 23. Operations
       • Define what to do on the data
       • Each – for each ‘tuple’ in the data, do this to it
       • Group – similar to a GROUP BY in SQL
       • CoGroup – joins tuple streams together
       • Every – for every key in the Group or CoGroup, do this
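The Each → Group → Every sequence can be simulated on plain Java collections. This is a sketch of the semantics only, not Cascading code; the input records ("user action") are made up for illustration:

```java
import java.util.*;
import java.util.stream.*;

public class OperationsSketch {
    // Each -> GroupBy -> Every, simulated on in-memory collections.
    static Map<String, Integer> countByUser(List<String> lines) {
        // Each: apply a function to every incoming tuple (parse "user action")
        List<String[]> tuples = lines.stream()
            .map(l -> l.split(" "))
            .collect(Collectors.toList());

        // Group: collect tuples that share a key, like GROUP BY in SQL
        Map<String, List<String[]>> groups = tuples.stream()
            .collect(Collectors.groupingBy(t -> t[0]));

        // Every: run an aggregator over each group (here: count events per user)
        Map<String, Integer> counts = new TreeMap<>();
        groups.forEach((user, rows) -> counts.put(user, rows.size()));
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(countByUser(
            Arrays.asList("alice open", "bob click", "alice click")));
        // {alice=2, bob=1}
    }
}
```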
 24. Pipes
       • Pipes tie Operations together
       • Pipes can be thought of as ‘tuple streams’
       • Pipes can be split, allowing parallel execution of Operations
 25. Operations – advanced
       • Each operations allow logic on the row, such as parsing dates, creating new attributes, etc.
       • Every operations allow you to iterate over the ‘group’ of rows to do non-trivial operations
       • Both allow multiple operations in the same function, so no nested function calls!
 26. PIPE Splitting
       • PIPEs define processing flows
       • A SPLIT of a PIPE tells Cascading to pass the TUPLE to another set of logic
       • Allows a single parsing EACH to feed multiple EVERY steps
       • Or output from one EVERY to feed different EVERY steps
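A split can be pictured as one parse step whose output feeds two independent aggregations. Again a plain-Java sketch, not the Cascading API; the "gender clicked" records are invented, echoing the real example later in the deck:

```java
import java.util.*;
import java.util.stream.*;

public class SplitSketch {
    // One EVERY-style aggregation: count tuples by the value of one field.
    static Map<String, Long> countByField(List<String[]> tuples, int field) {
        return tuples.stream().collect(
            Collectors.groupingBy(t -> t[field], TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        // A single EACH-style parse of made-up "gender clicked" records
        List<String[]> tuples = Stream.of("F yes", "M no", "F no")
            .map(s -> s.split(" "))
            .collect(Collectors.toList());

        // The split: the same tuple stream feeds two different aggregations
        System.out.println(countByField(tuples, 0)); // by gender:  {F=2, M=1}
        System.out.println(countByField(tuples, 1)); // by clicked: {no=2, yes=1}
    }
}
```

In Cascading the two branches would be separate Pipes running in parallel; here they simply reuse the same in-memory list.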
 27. Flows
       • Flows are reusable combinations of Taps, Pipes and Operations
       • Allows you to build a library of functions
       • Groups of Flows are called Cascades
 28. Cascading Scheduler
       • Once the Flows and Cascades are defined, Cascading looks for dependencies
       • When executed, it tells Hadoop what Map, Reduce or Shuffle steps to take based on the Operations used
       • Knows what can be executed in parallel
       • Knows which steps can execute once a given step completes
 29. Example Operation
       RowAggregator aggr = new RowAggregator(row);
       Fields groupBy = new Fields(ColumnDefinition.RECIPIENT_ID_NAME);
       Pipe formatPipe = new Each("reformat_", new Fields("line"), a_sentFile);
       formatPipe = new GroupBy(formatPipe, groupBy);
       formatPipe = new Every(formatPipe, Fields.ALL, aggr);
 30. Runtime Distribution (diagram) © Concurrent 2009
 31. Dynamic Flow Creation
       • Flows can be created at run time based on inputs
       • 5 input files one week, 10 the next: the Java code creates 10 Flows instead of 5
       • Group and Every don’t care how many input Taps there are
 32. Dynamic Tuple Definition
       • Each operations on input Taps can parse text lines into different Fields
       • So one source may have 5 inputs, another 10
       • Each operations can use metadata to know how to parse
       • Can write Each operations to output common Tuples
       • Every operations can output new Tuples as well
       • Dynamically provide GroupBy fields so end users can build their own ad-hoc logic
 33. Mixing non-Hadoop Code
       • Cascading allows you to mix regular Java between Flows in a Cascade
       • So you can call out to databases, write intermediates to a file, etc.
 34. External File Creation
       • An aggregator can write to the local file system
       • Sometimes you don’t need or want to push back to HDFS
       • NFS mounts are needed since you don’t know where the logic is going to execute
       • GROUP BY is critical to doing this right
 35. Real Example
       • For the hundreds of mailings sent last year
       • To millions of recipients
       • Show me who opened, and how often
       • Break it down by how long they have been a subscriber
       • And their gender
       • And the number of times they clicked on the offer
 36. RDBMS Solution
       • Lots of million+ row joins
       • Lots of million+ row counts
       • Temporary tables, since we want multiple answers
       • Lots of memory
       • Lots of CPU and I/O
       • $$ becomes the bottleneck to adding more rows or more clients to the same logic
 37. Cascading Solution
       • Let Hadoop parse the input files
       • Let Cascading group all inputs by the recipient’s email
       • Let Cascading call Every functions to look at all rows for a recipient and ‘flatten’ the data
       • Split the ‘flattened’ data into Pipes to process in parallel: time on list, gender, clicked on links
       • Bandwidth to export data from the RDBMS becomes the bottleneck
 38. Pros and Cons
       • Pros
            • Mix Java between map/reduce steps
            • Don’t have to worry about when to map and when to reduce
            • Don’t have to think about dependencies or how to process
            • Data definition can change on the fly
       • Cons
            • A level above Hadoop – sometimes ‘black magic’
            • Data must (should) be outside of a database to get the most concurrency
 39. Other Solutions
       • Apache Pig:
            • More ‘SQL-like’
            • Not as easy to mix regular Java into processes
            • More ‘ad hoc’ than Cascading
       • Amazon Elastic MapReduce:
            • Runs Hadoop on EC2
            • You provide the Map and Reduce functions
            • Can use Cascading
            • Pay as you go
 40. Resources
       • Me: @ChrisCurtin
       • Chris Wensel: @cwensel
       • Web site:
       • Mailing list off the website
       • AWSome Atlanta Group:
       • O’Reilly Hadoop Book: