Horizon 20110928
1. NEARING THE EVENT HORIZON.
HADOOP WAS PREDICTABLE, WHAT’S NEXT?
Mike Miller (UW)
_mlmilleratmit
September 28, 2011
2. What I Am
Assistant Professor, Particle Physics
(UW)
Cloudant Founder, Chief Scientist
Background: machine learning, analysis,
big data, globally distributed systems
3. What I Am Not
didn’t see these coming
Superluminal neutrinos
Red Sox blow 9-game lead in September
Amazon Silk
...
But here I go anyway
4. My First Postulate of Big-Data
Google Matters
What matters for google...
... matters for the internet...
...and therefore matters for the enterprise...
... will therefore be re-architected by Apache...
... and therefore matters to you.
8. The Old Canon
• Google File System (the important one)
http://labs.google.com/papers/gfs.html
• MapReduce (the big one)
http://labs.google.com/papers/mapreduce.html
• BigTable (clone me!)
http://labs.google.com/papers/bigtable.html
• Dynamo (ok, AWS. but masterless quorum)
http://s3.amazonaws.com/AllThingsDistributed/sosp/amazon-dynamo-sosp2007.pdf
copy these. use these. print $$$
9. So... is that it?
http://gigaom.com/cloud/democratizing-big-data-is-hadoop-our-only-hope/
10. What’s Painful about MapReduce?
• Processing latency
Non-incremental, must re-slurp entire dataset every pass
• Ad-Hoc queries
Bare metal interface, data import
• Graphs
Only a handful of graph problems amenable to MR
http://www.computer.org/portal/web/csdl/doi/10.1109/MCSE.2009.120
11. Enter The New Canon
• Percolator
incremental processing
http://research.google.com/pubs/pub36726.html
• Dremel
ad-hoc analysis queries
http://research.google.com/pubs/pub36632.html
• Pregel
Big graphs
http://dl.acm.org/citation.cfm?id=1807184
Scalable, Fault Tolerant, Approachable
12. Percolator: incremental processing
• Replaced MapReduce as the tool to build search index
“However, reprocessing the entire web discards the work done in earlier
runs and makes latency proportional to the size of the repository, rather
than the size of the update.”
• Bigtable alone can’t do it
“BigTable scales...but doesn’t provide tools to help programmers maintain
data invariants in the face of concurrent updates.”
• Applicability
Incrementally updating data
Computational output can be broken down into small pieces
Computation large in some dimension (data size, cpu, etc)
• Does it matter?
“...Converting the indexing system to an incremental system ... reduced the
average document processing latency by a factor of 100...”
13. Percolator: incremental processing
• BigTable plus...
Transactions
snapshot isolation, locks
Timestamps
Notifications
Observers
your code, run upon notification of an update
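The notification/observer idea can be sketched in a few lines. This is a toy, single-process illustration, not Percolator's API: `ToyTable`, `observe`, `write`, `run`, and the column names `raw` and `word_count` are all hypothetical, and transactions, snapshot isolation, locks, and timestamps are left out entirely.

```python
from collections import defaultdict


class ToyTable:
    """In-memory stand-in for a Bigtable-like store with observers (hypothetical)."""

    def __init__(self):
        self.cells = {}                     # (row, column) -> value
        self.observers = defaultdict(list)  # column -> list of callbacks
        self.pending = []                   # queued notifications: (row, column)

    def observe(self, column, callback):
        # Register code to run when a cell in this column changes
        self.observers[column].append(callback)

    def write(self, row, column, value):
        # Only an actual change raises a notification
        if self.cells.get((row, column)) != value:
            self.cells[(row, column)] = value
            self.pending.append((row, column))

    def run(self):
        # Drain notifications; an observer's own writes may trigger further observers
        while self.pending:
            row, column = self.pending.pop(0)
            for callback in self.observers[column]:
                callback(self, row)


processed = []  # records which rows the observer touched


def index_document(table, row):
    # Observer: recompute derived data only for the document that changed
    processed.append(row)
    text = table.cells[(row, "raw")]
    table.write(row, "word_count", len(text.split()))


table = ToyTable()
table.observe("raw", index_document)
table.write("doc1", "raw", "hello big data")
table.write("doc2", "raw", "hadoop")
table.run()

# Updating one document re-runs the observer for that document only
table.write("doc1", "raw", "hello again big data world")
table.run()
```

The point of the sketch is that the second `run()` touches only `doc1`, so the work done tracks the size of the update, not the size of the repository.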
14. Dremel: ad-hoc Query
• Scalable, interactive ad-hoc query system for read-only nested data
“...capable of running aggregation queries over trillion-row tables in seconds.”
• ... on nested data structures in situ
Web and scientific data is often non-relational
nested data (Protocol Buffers) underlies most structured data at Google
• Usage
DEFINE TABLE t AS /path/to/data/*
SELECT TOP(signal1,100), COUNT(*) FROM t
• Applicability
Analysis of crawled documents
Tracking of install data for apps on Android Market
Crash reports
Spam analysis...
dream BI tool
15. Dremel: ad-hoc Query
• Ingredients
In situ data
SQL-like interface
Serving trees for query execution
Column-striped data
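A rough sketch of how two of these ingredients fit together, assuming flat records only: each leaf reads a single striped column from its shard, and partial counts are merged up a small serving tree. The function names, the field names, and the two-level tree are hypothetical. Dremel's handling of nested records is not modeled, and the top-N here is exact, which may differ from how a production `TOP` is computed.

```python
from collections import Counter


def stripe(rows):
    """Column-stripe flat records: one list of values per field."""
    columns = {}
    for row in rows:
        for field, value in row.items():
            columns.setdefault(field, []).append(value)
    return columns


def leaf_count(column_values):
    """Leaf server: partial counts over its local shard of one column."""
    return Counter(column_values)


def merge(partials):
    """Intermediate or root server: combine partial counts from children."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


def top_query(striped_shards, field, n):
    # Each leaf touches only the requested column, never the full records
    leaves = [leaf_count(shard[field]) for shard in striped_shards]
    # Intermediate level: merge leaves pairwise, then merge again at the root
    intermediate = [merge(leaves[i:i + 2]) for i in range(0, len(leaves), 2)]
    root = merge(intermediate)
    return root.most_common(n)


# Three shards of hypothetical records, striped once at load time
shards = [
    stripe([{"signal1": "a", "signal2": 1}, {"signal1": "b", "signal2": 2},
            {"signal1": "a", "signal2": 3}]),
    stripe([{"signal1": "b", "signal2": 4}, {"signal1": "a", "signal2": 5}]),
    stripe([{"signal1": "c", "signal2": 6}]),
]

result = top_query(shards, "signal1", 2)
```

Because the query names only `signal1`, the `signal2` column is never read, which is the main win of column striping for wide records.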
18. Pregel: Big Graphs
• Massively parallel processing of big graphs
billions of vertices, trillions of edges
• Bulk synchronous parallel model
sequence of vertex-oriented iterations (supersteps)
send/receive messages from other vertex computations
read/modify state of vertex, outgoing edges, graph topology
• Expressive, easy to program
distribution details hidden behind abstract API
• Iterative
computation continues until each vertex votes to terminate
• In production
PageRank: 15 lines of code
Nothing like this exists in open source
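To make the vertex-oriented model concrete, here is a single-process simulation of PageRank in that style: each active vertex reads its incoming messages, updates its value, sends messages along its outgoing edges, and eventually votes to halt. It is a sketch, not Pregel's API. The function name, the dict-based graph, and the parameter defaults are assumptions, and every vertex is assumed to have at least one outgoing edge.

```python
def pregel_pagerank(graph, supersteps=30, damping=0.85):
    """graph: dict mapping each vertex to its list of out-neighbours."""
    n = len(graph)
    value = {v: 1.0 / n for v in graph}
    inbox = {v: [] for v in graph}
    active = set(graph)
    step = 0

    while active:
        outbox = {v: [] for v in graph}
        for v in list(active):
            # Vertex compute: read messages delivered in the previous superstep
            if step >= 1:
                value[v] = (1 - damping) / n + damping * sum(inbox[v])
            if step < supersteps:
                # Send an equal share of this vertex's rank along each edge
                share = value[v] / len(graph[v])
                for target in graph[v]:
                    outbox[target].append(share)
            else:
                active.discard(v)  # vote to halt
        # Barrier: messages become visible only in the next superstep
        inbox = outbox
        # An incoming message reactivates a halted vertex
        active |= {v for v, messages in inbox.items() if messages}
        step += 1

    return value


ranks = pregel_pagerank({"a": ["b", "c"], "b": ["c"], "c": ["a"]})
```

The computation ends when every vertex has voted to halt and no messages are in flight, which is the termination rule described above.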
19. Pregel: Big Graphs
• Master “Name” node
connects processes for messaging
• Message Passing
no remote procedure calls or remote reads
• Graph hashed across nodes
vertex, outgoing edges stored in RAM
• Aggregators
global mechanism for aggregation
all but final reduce computed on node-local data
• Checkpointing
configurable, enables automatic recovery
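Two of these pieces, hashing the graph across nodes and aggregating on node-local data, can be illustrated as follows. This is a minimal sketch under assumptions: `worker_for`, `partition`, and `aggregate_edge_count` are hypothetical names, CRC32 is just one convenient stable hash, and the workers are plain dicts in one process with no messaging or checkpointing.

```python
import zlib


def worker_for(vertex_id, num_workers):
    # Stable hash so a given vertex always lands on the same worker
    return zlib.crc32(vertex_id.encode("utf-8")) % num_workers


def partition(graph, num_workers):
    """Split a graph (vertex -> outgoing edges) into per-worker dicts."""
    workers = [dict() for _ in range(num_workers)]
    for vertex, edges in graph.items():
        workers[worker_for(vertex, num_workers)][vertex] = edges
    return workers


def aggregate_edge_count(workers):
    # Each worker reduces over its own vertices first ...
    partials = [sum(len(edges) for edges in w.values()) for w in workers]
    # ... and only the final reduce combines the per-worker partials
    return sum(partials)


graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["a", "b", "c"]}
workers = partition(graph, 3)
total_edges = aggregate_edge_count(workers)
```

Keeping each vertex together with its outgoing edges on one worker means a vertex's compute step needs no remote reads, only messages.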
22. Lessons Learned
• Hire Jeff Dean and Sanjay Ghemawat
• GFS enables everything
• There is massive opportunity on the horizon