
Gluecon miller horizon


Published in: Technology, News & Politics


  1. NEARING THE EVENT HORIZON. HADOOP WAS PREDICTABLE, WHAT’S NEXT?
     May 23, 2012
     Mike Miller, mike@cloudant.com, @mlmilleratmit
  2. What I Am
     Cloudant Founder, Chief Scientist (we’re hiring at all positions)
     Affiliate Assistant Professor, Particle Physics (UW)
     Background: machine learning, analysis, big data, globally distributed systems
     Mike Miller, GlueCon May 2012
  3. What I Am
     A CDN for your Application Data
  4. What I Am Not
     didn’t see these coming:
     Superluminal neutrinos
     Red Sox epic collapse in September
     Red Wings losing in the first round
     ... But here I go anyway
  5. My First Postulate of Big-Data
     Google Matters
     What matters for Google...
     ... matters for the internet...
     ... and therefore matters for the enterprise...
     ... will therefore be re-architected by Apache...
     ... and therefore matters to you.
  6. Evidence
     Business Week, 12/24/2007
  9. The Old Canon
     • Google File System (the important one)
       http://labs.google.com/papers/gfs.html
     • MapReduce (the big one)
       http://labs.google.com/papers/mapreduce.html
     • BigTable (clone me!)
       http://labs.google.com/papers/bigtable.html
     • Dynamo (ok, AWS. but masterless quorum)
       http://s3.amazonaws.com/AllThingsDistributed/sosp/amazon-dynamo-sosp2007.pdf
     copy these. use these. print $$$
  10. MapReduce: The Awesome
     • Approachable interface
       “What do I do with a single piece of data?”
     • Data Parallel
       Developers can basically forget about scatter-gather
     • Fault Tolerant
       Failure at scale is the norm! Protects both user and system operator
     • IO Optimized
       Built for sequential IO
       commodity disks spinning forward at O(20 MB/sec) each
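The "approachable interface" point can be made concrete with a toy, single-process sketch. This is not Hadoop's or Google's API: `map_fn`, `reduce_fn`, and `run_mapreduce` are hypothetical names chosen for illustration, and the shuffle that a real framework spreads across machines is just a dictionary here. The user only answers "what do I do with a single piece of data?" (the mapper) and "how do I combine values for one key?" (the reducer).

```python
from collections import defaultdict
from itertools import chain


def map_fn(line):
    """Handle one record in isolation: emit a (word, 1) pair per word."""
    return [(word.lower(), 1) for word in line.split()]


def reduce_fn(key, values):
    """Combine every value emitted for a single key."""
    return key, sum(values)


def run_mapreduce(records, mapper, reducer):
    """Toy driver: map every record, group by key, then reduce each group."""
    groups = defaultdict(list)
    # The "shuffle": a real framework does this grouping across machines.
    for key, value in chain.from_iterable(mapper(r) for r in records):
        groups[key].append(value)
    return dict(reducer(k, vs) for k, vs in groups.items())


counts = run_mapreduce(["big data big graphs", "big queries"], map_fn, reduce_fn)
print(counts)
```

Note that the driver re-reads every record on every run, which is the non-incremental behavior the later slides criticize.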
  13. So... is that it?
     http://gigaom.com/cloud/democratizing-big-data-is-hadoop-our-only-hope/
     http://gigaom.com/cloud/what-it-really-means-when-someone-says-hadoop/
     http://mackiemathew.com/2012/02/25/the-problems-in-hadoop-when-does-it-fail-to-deliver/
  14. MapReduce: The not so Awesome
     • Hadoop doesn’t power big data applications
       Not a transactional datastore. Slosh back and forth via ETL
     • Processing latency
       Non-incremental, must re-slurp entire dataset every pass
     • Ad-Hoc queries
       Bare metal interface, data import
     • Graphs
       Only a handful of graph problems amenable to MR
       http://www.computer.org/portal/web/csdl/doi/10.1109/MCSE.2009.120
  15. To the Event Horizon
  16. Enter The New Canon
     • Percolator: incremental processing
       http://research.google.com/pubs/pub36726.html
     • Dremel: ad-hoc analysis queries
       http://research.google.com/pubs/pub36632.html
     • Pregel: big graphs
       http://dl.acm.org/citation.cfm?id=1807184
     Scalable, Fault Tolerant, Approachable
  17. Percolator
  18. Percolator: incremental processing
     • Replaced MapReduce as the tool to build search index
       “However, reprocessing the entire web discards the work done in earlier runs and makes latency proportional to the size of the repository, rather than the size of the update.”
     • BigTable alone can’t do it
       “BigTable scales...but doesn’t provide tools to help programmers maintain data invariants in the face of concurrent updates.”
     • Applicability
       Incrementally updating data
       Computational output can be broken down into small pieces
       Computation large in some dimension (data size, CPU, etc.)
     • Does it matter?
       “...Converting the indexing system to an incremental system ... reduced the average document processing latency by a factor of 100...”
  19. Percolator: incremental processing
     • BigTable plus...
       Multi-row ACID Transactions: snapshot isolation, lazy locks, up to 10s write latencies
       Timestamps: Start Timestamp (read), Commit Timestamp (write)
       Notifications: Do not maintain invariants
       Observer Framework: your code to be run upon notification of an update
  20. Percolator: incremental processing
     Near Linear Scaling to 15k Cores
  21. Percolator: incremental processing
     Latency lower than MapReduce by 100x
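The Percolator ingredients listed above (start and commit timestamps, snapshot isolation, observers run on update) can be sketched in a few dozen lines. This is a single-process toy under loud assumptions: `ToyTable`, `Transaction`, and `index_document` are hypothetical names, the counter stands in for a timestamp oracle, and conflict detection is a simple "first committer wins" check rather than Percolator's actual locking protocol on top of BigTable.

```python
import itertools


class ToyTable:
    """In-memory stand-in for a versioned table (not the BigTable API)."""

    def __init__(self):
        self._clock = itertools.count(1)  # assumed stand-in for a timestamp oracle
        self.versions = {}                # (row, col) -> list of (commit_ts, value)
        self.observers = {}               # col -> callbacks run after a commit

    def next_ts(self):
        return next(self._clock)

    def observe(self, col, callback):
        self.observers.setdefault(col, []).append(callback)

    def read(self, row, col, ts):
        """Return the newest value committed at or before ts (the snapshot)."""
        visible = [(t, v) for t, v in self.versions.get((row, col), []) if t <= ts]
        return max(visible, key=lambda tv: tv[0])[1] if visible else None

    def last_commit_ts(self, row, col):
        return max((t for t, _ in self.versions.get((row, col), [])), default=0)


class Transaction:
    def __init__(self, table):
        self.table = table
        self.start_ts = table.next_ts()   # reads see the snapshot at start_ts
        self.writes = {}                  # buffered until commit

    def get(self, row, col):
        if (row, col) in self.writes:
            return self.writes[(row, col)]
        return self.table.read(row, col, self.start_ts)

    def set(self, row, col, value):
        self.writes[(row, col)] = value

    def commit(self):
        # Abort if any written cell was committed by someone else after we started.
        for row, col in self.writes:
            if self.table.last_commit_ts(row, col) > self.start_ts:
                return False
        commit_ts = self.table.next_ts()  # writes become visible at commit_ts
        for cell, value in self.writes.items():
            self.table.versions.setdefault(cell, []).append((commit_ts, value))
        for row, col in self.writes:      # notify observers of the changed columns
            for callback in self.table.observers.get(col, []):
                callback(self.table, row)
        return True


def index_document(table, row):
    """Observer: reprocess only the document that changed."""
    txn = Transaction(table)
    txn.set(row, "word_count", len(txn.get(row, "raw").split()))
    txn.commit()


table = ToyTable()
table.observe("raw", index_document)

t1, t2 = Transaction(table), Transaction(table)
t1.set("doc1", "raw", "hello incremental world")
t2.set("doc1", "raw", "conflicting write")
ok1, ok2 = t1.commit(), t2.commit()       # t2 started before t1 committed

reader = Transaction(table)
print(ok1, ok2, reader.get("doc1", "word_count"))
```

The work done per update is proportional to the size of the update (one document), not the size of the repository, which is the contrast with MapReduce that the slides draw.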
  22. Dremel
  23. Dremel: ad-hoc Query
     • Scalable, interactive ad-hoc query system for read-only nested data
       “...capable of running aggregation queries over trillion-row tables in seconds.”
     • ... on nested data structures in situ
       Web and scientific data is often non-relational
       nested data (protobufs) underlies most structured data at Google
     • Usage
       DEFINE TABLE t AS /path/to/data/*
       SELECT TOP(signal1,100), COUNT(*) FROM t
     • Applicability
       Analysis of crawled documents
       Tracking of install data for apps on Android Market
       Crash reports
       Spam analysis...
     Dream BI Tool
  24. Dremel: ad-hoc Query
     • Ingredients
       In situ data
       SQL like interface
       Serving trees for query execution
       Column striped data (3-10x)
       Analysis Catalogs
  25. Dremel: ad-hoc Query
     Columns ~10x faster than Records
  26. Dremel: ad-hoc Query
     Benchmark Data
     MapReduce (via Sawzall)
     Dremel (via SQL)
  27. Dremel: ad-hoc Query
     Significant Optimization Possible
     Dremel ~100x Faster than Stock MR
  28. Dremel: ad-hoc Query
     Most Production Queries Executed in <10 seconds
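Why column striping helps an aggregation like `SELECT TOP(signal1,100), COUNT(*) FROM t` can be sketched with flat toy data. The assumptions are significant: the rows and field names other than `signal1` are invented, every row is assumed to have the same fields, and real Dremel stripes nested records, which this sketch does not attempt. `to_columns` and the two `top_values_*` helpers are hypothetical names, and the query semantics are approximated as "most frequent values with their counts".

```python
from collections import Counter

# Invented sample rows; only the field name signal1 comes from the slide.
rows = [
    {"url": "a.example", "signal1": 7, "body": "lorem"},
    {"url": "b.example", "signal1": 3, "body": "ipsum"},
    {"url": "c.example", "signal1": 7, "body": "dolor"},
    {"url": "d.example", "signal1": 7, "body": "sit"},
    {"url": "e.example", "signal1": 3, "body": "amet"},
    {"url": "f.example", "signal1": 9, "body": "elit"},
]


def to_columns(records):
    """Stripe record-oriented rows into one contiguous list per field."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


def top_values_from_rows(records, field, n):
    """Record layout: every whole record is touched to read one field."""
    return Counter(record[field] for record in records).most_common(n)


def top_values_from_columns(columns, field, n):
    """Column layout: only the one column the query names is scanned."""
    return Counter(columns[field]).most_common(n)


columns = to_columns(rows)
top = top_values_from_columns(columns, "signal1", 2)
print(top)
```

Both helpers return the same answer; the difference is how much data has to be read, since the column version never looks at `url` or `body`.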
  29. Pregel
  30. Pregel: Big Graphs
     • Massively parallel processing of big graphs
       billions of vertices, trillions of edges
     • Bulk synchronous parallel model
       sequence of vertex oriented iterations
       send/receive messages from other vertex computations
       read/modify state of vertex, outgoing edges, graph topology
     • Expressive, easy to program
       distribution details hidden behind abstract API
     • Iterative
       computation continues until each vertex votes to terminate
     • In production
       PageRank: 15 lines of code
  31. Pregel: Big Graphs
     • Master “Name” node
       connects processes for messaging
     • Message Passing
       no remote procedures, reads
     • Graph hashed across nodes
       vertex, outgoing edges stored in RAM
     • Aggregators
       global mechanism for aggregation
       all but final reduce computed on node local data
     • Checkpointing
       configurable, enables automatic recovery
  32. Pregel: Big Graphs
  33. Pregel: Big Graphs
     Near Linear Scaling to 1B nodes
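The bulk synchronous parallel model described above (vertex-oriented supersteps, message passing, vote to terminate) can be sketched as a single-process loop. This is an illustration under assumptions, not Pregel's API: `run_supersteps` is a hypothetical name, the graph and values are invented, and the vertex program is the simple "propagate the maximum value" computation rather than PageRank. A vertex that sees no change votes to halt, and an incoming message wakes it up again.

```python
def run_supersteps(graph, initial_values, max_supersteps=50):
    """graph: vertex -> list of out-neighbors (every neighbor must be a key)."""
    values = dict(initial_values)
    active = set(graph)                      # every vertex starts active
    inbox = {v: [] for v in graph}
    superstep = 0
    while (active or any(inbox.values())) and superstep < max_supersteps:
        outbox = {v: [] for v in graph}
        for v in graph:
            messages = inbox[v]
            if v not in active and not messages:
                continue                     # halted, and nothing woke it up
            active.add(v)                    # a message reactivates a halted vertex
            new_value = max([values[v]] + messages)
            if superstep == 0 or new_value > values[v]:
                values[v] = new_value
                for neighbor in graph[v]:    # message passing along outgoing edges
                    outbox[neighbor].append(new_value)
            else:
                active.discard(v)            # no change: vote to halt
        inbox = outbox                       # barrier: messages arrive next superstep
        superstep += 1
    return values, superstep


# Invented four-vertex chain with edges in both directions.
graph = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
values, supersteps = run_supersteps(graph, {"a": 3, "b": 6, "c": 2, "d": 1})
print(values, supersteps)
```

The loop ends only when every vertex has voted to halt and no messages are in flight, which mirrors the termination rule on the slide.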
  34. Learn More
     • Incremental Processing
       Incremental, in-database map/reduce in Cloudant’s BigCouch
       HBase 0.92 supports observers/coprocessors
       Stream processing via Storm, HStreaming, etc.
     • Ad Hoc Query
       Google BigQuery
       Column stores (Vertica, etc.)
       OpenDremel (stalled?)
       ?
     • Big Graphs
       Giraph on Hadoop (Apache Incubator)
       Golden Orb (stalled?)
  35. Lessons Learned
     • Hire Jeff Dean and Sanjay Ghemawat
     • GFS enables everything
     • There is massive opportunity on the horizon
