Horizon 20110928

Slides from Seattle Scalability talk at Amazon on 9/27/2011.

Published in: Technology, News & Politics

  1. 1. NEARING THE EVENT HORIZON. HADOOP WAS PREDICTABLE, WHAT’S NEXT?
        Mike Miller (UW), _mlmilleratmit, September 28, 2011
  2. 2. What I Am
        • Assistant Professor, Particle Physics (UW)
        • Cloudant Founder, Chief Scientist
        • Background: machine learning, analysis, big data, globally distributed systems
  3. 3. What I Am Not
        Didn’t see these coming:
        • Superluminal neutrinos
        • Red Sox blow a 9-game lead in September
        • Amazon Silk
        ... But here I go anyway
  4. 4. My First Postulate of Big-Data: Google Matters
        What matters for Google...
        ... matters for the internet...
        ... and therefore matters for the enterprise...
        ... will therefore be re-architected by Apache...
        ... and therefore matters to you.
  5. 5. Evidence: Business Week, 12/24/2007
  8. 8. The Old Canon
        • Google File System (the important one): http://labs.google.com/papers/gfs.html
        • MapReduce (the big one): http://labs.google.com/papers/mapreduce.html
        • BigTable (clone me!): http://labs.google.com/papers/bigtable.html
        • Dynamo (ok, AWS. but masterless quorum): http://s3.amazonaws.com/AllThingsDistributed/sosp/amazon-dynamo-sosp2007.pdf
        copy these. use these. print $$$
  9. 9. So... is that it? http://gigaom.com/cloud/democratizing-big-data-is-hadoop-our-only-hope/
  10. 10. What’s Painful about MapReduce?
        • Processing latency
          Non-incremental, must re-slurp entire dataset every pass
        • Ad-hoc queries
          Bare metal interface, data import
        • Graphs
          Only a handful of graph problems amenable to MR
          http://www.computer.org/portal/web/csdl/doi/10.1109/MCSE.2009.120
  11. 11. Enter The New Canon
        • Percolator: incremental processing
          http://research.google.com/pubs/pub36726.html
        • Dremel: ad-hoc analysis queries
          http://research.google.com/pubs/pub36632.html
        • Pregel: big graphs
          http://dl.acm.org/citation.cfm?id=1807184
        Scalable, Fault Tolerant, Approachable
  12. 12. Percolator: incremental processing
        • Replaced MapReduce as the tool to build the search index
          “However, reprocessing the entire web discards the work done in earlier runs and makes latency proportional to the size of the repository, rather than the size of the update.”
        • Bigtable alone can’t do it
          “BigTable scales...but doesn’t provide tools to help programmers maintain data invariants in the face of concurrent updates.”
        • Applicability
          Incrementally updating data
          Computational output can be broken down into small pieces
          Computation large in some dimension (data size, CPU, etc.)
        • Does it matter?
          “...Converting the indexing system to an incremental system ... reduced the average document processing latency by a factor of 100...”
  13. 13. Percolator: incremental processing
        • BigTable plus...
          Transactions: snapshot isolation, locks
          Timestamps
          Notifications
          Observers: your code to be run upon notification of an update
  14. 14. Dremel: ad-hoc Query
        • Scalable, interactive ad-hoc query system for read-only nested data
          “...capable of running aggregation queries over trillion-row tables in seconds.”
        • ... on nested data structures in situ
          Web and scientific data is often non-relational
          Nested data (protobufs) underlies most structured data at Google
        • Usage
          DEFINE TABLE t AS /path/to/data/*
          SELECT TOP(signal1,100), COUNT(*) FROM t
        • Applicability
          Analysis of crawled documents
          Tracking of install data for apps on Android Market
          Crash reports
          Spam analysis
          ... dream BI tool
  15. 15. Dremel: ad-hoc Query
        • Ingredients
          In situ data
          SQL-like interface
          Serving trees for query execution
          Column-striped data
  18. 18. Pregel: Big Graphs
        • Massively parallel processing of big graphs
          Billions of vertices, trillions of edges
        • Bulk synchronous parallel model
          Sequence of vertex-oriented iterations
          Send/receive messages from other vertex computations
          Read/modify state of vertex, outgoing edges, graph topology
        • Expressive, easy to program
          Distribution details hidden behind abstract API
        • Iterative computation continues until each vertex votes to terminate
        • In production
          PageRank: 15 lines of code
        Nothing like this exists in open source
  19. 19. Pregel: Big Graphs
        • Master
          “Name” node connects processes for messaging
        • Message passing
          No remote procedures or remote reads
        • Graph hashed across nodes
          Vertex and its outgoing edges stored in RAM
        • Aggregators
          Global mechanism for aggregation
          All but the final reduce computed on node-local data
        • Checkpointing
          Configurable, enables automatic recovery
  22. 22. Lessons Learned
        • Hire Jeff Dean and Sanjay Ghemawat
        • GFS enables everything
        • There is massive opportunity on the horizon
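
The Pregel slides above describe a bulk synchronous parallel model: each vertex reads the messages sent to it in the previous iteration, updates its own state, and sends messages along its outgoing edges, until every vertex votes to terminate. A minimal single-process sketch of PageRank in that style might look like the following. This is not Pregel's API: the function name `pregel_pagerank`, the dict-of-lists graph layout, and the parameters `supersteps` and `damping` are all assumptions made for illustration, and "voting to terminate" is simplified to every vertex stopping after a fixed number of iterations.

```python
def pregel_pagerank(graph, supersteps=30, damping=0.85):
    """Toy, single-process PageRank written in a vertex-centric style.

    graph maps each vertex to the list of vertices its outgoing edges
    point to. Every edge target must also appear as a key of graph.
    Illustrative only: names and defaults here are assumptions.
    """
    n = len(graph)
    rank = {v: 1.0 / n for v in graph}  # per-vertex state
    inbox = {}                          # messages delivered this superstep

    for step in range(supersteps):
        outbox = {}                     # messages for the next superstep
        for v, out_edges in graph.items():
            # From the second superstep on, recompute the vertex value
            # from the messages received in the previous superstep.
            if step > 0:
                incoming = sum(inbox.get(v, []))
                rank[v] = (1 - damping) / n + damping * incoming
            # Send an equal share of this vertex's rank along each
            # outgoing edge. A vertex with no outgoing edges sends
            # nothing, so its rank mass is simply dropped in this sketch.
            # In the last superstep nothing is sent: every vertex halts.
            if out_edges and step < supersteps - 1:
                share = rank[v] / len(out_edges)
                for target in out_edges:
                    outbox.setdefault(target, []).append(share)
        # Synchronization barrier: messages become visible only in the
        # next superstep, never within the one that sent them.
        inbox = outbox
    return rank


if __name__ == "__main__":
    ranks = pregel_pagerank({"a": ["b"], "b": ["a"], "c": ["a"]})
    for vertex, value in sorted(ranks.items()):
        print(vertex, round(value, 4))
```

In this sketch the vertex update and the message sends are the only per-vertex logic; the barrier between supersteps is what a real system would distribute across machines. Vertex `c` in the usage example has no incoming edges, so it settles at `(1 - damping) / n`.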