NEARING THE EVENT HORIZON.
HADOOP WAS PREDICTABLE, WHAT’S NEXT?




            May 23, 2012       Mike Miller
                              mike@cloudant.com
                                @mlmilleratmit
What I Am

    Cloudant Founder, Chief Scientist
    (we’re hiring at all positions)

    Affiliate Assistant Professor, Particle Physics(UW)

    Background: machine learning, analysis, big data,
    globally distributed systems




Mike Miller, GlueCon May 2012
What I Am




                                A CDN for your Application Data
What I Am Not


                                Didn’t see these coming:
                                Superluminal neutrinos
                                Red Sox epic collapse in September
                                Red Wings losing in the first round
                                ...

                                But here I go anyway




My First Postulate of Big-Data

                                     Google Matters

           What matters for Google...
           ... matters for the internet...
           ...and therefore matters for the enterprise...
           ... will therefore be re-architected by Apache...
           ... and therefore matters to you.




Evidence




               Business Week, 12/24/2007




The Old Canon
         • Google File System (the important one)
           http://labs.google.com/papers/gfs.html

         • MapReduce (the big one)
           http://labs.google.com/papers/mapreduce.html

         • BigTable (clone me!)
           http://labs.google.com/papers/bigtable.html

         • Dynamo (ok, AWS. but masterless quorum)
           http://s3.amazonaws.com/AllThingsDistributed/sosp/amazon-dynamo-sosp2007.pdf



                                copy these. use these. print $$$
MapReduce: The Awesome
         • Approachable interface
           “What do I do with a single piece of data?”

         • Data Parallel
           Developers can basically forget about scatter-gather

         • Fault Tolerant
           Failure at scale is the norm!
           Protects both user and system operator

         • IO Optimized
           Built for sequential IO
           commodity disks spinning forward at O(20 MB/sec) each
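
A minimal single-process sketch of the "approachable interface" point, assuming a word-count job. The names `map_word_count`, `reduce_sum` and `run_mapreduce` are hypothetical and are not Hadoop's API; the real system would also shard the records, shuffle across machines and retry failed tasks.

```python
from collections import defaultdict


def map_word_count(line):
    """Map step: what do I do with a single piece of data?"""
    for word in line.lower().split():
        yield word, 1


def reduce_sum(key, values):
    """Reduce step: combine every value emitted for one key."""
    return sum(values)


def run_mapreduce(records, mapper, reducer):
    """Toy driver: group mapper output by key, then reduce each group."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return {key: reducer(key, values) for key, values in groups.items()}


counts = run_mapreduce(["big data", "big graphs"], map_word_count, reduce_sum)
print(counts)
```

The developer writes only the two small functions; the scatter-gather lives in the driver.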




So... is that it?

   http://gigaom.com/cloud/democratizing-big-data-is-hadoop-our-only-hope/
   http://gigaom.com/cloud/what-it-really-means-when-someone-says-hadoop/
   http://mackiemathew.com/2012/02/25/the-problems-in-hadoop-when-does-it-fail-to-deliver/
MapReduce: The not so Awesome
         • Hadoop doesn’t power big data applications
           Not a transactional datastore. Slosh back and forth via ETL

         • Processing latency
           Non-incremental, must re-slurp entire dataset every pass

         • Ad-Hoc queries
           Bare metal interface, data import

         • Graphs
           Only a handful of graph problems amenable to MR
           http://www.computer.org/portal/web/csdl/doi/10.1109/MCSE.2009.120




To the Event Horizon




Enter The New Canon
         • Percolator
           incremental processing
           http://research.google.com/pubs/pub36726.html

         • Dremel
           ad-hoc analysis queries
           http://research.google.com/pubs/pub36632.html

         • Pregel
           Big graphs
           http://dl.acm.org/citation.cfm?id=1807184


                                Scalable, Fault Tolerant, Approachable

Percolator




Percolator: incremental processing
         • Replaced MapReduce as the tool to build search index
           “However, reprocessing the entire web discards the work done in earlier runs and makes latency
           proportional to the size of the repository, rather than the size of the update.”

         • Bigtable alone can’t do it
           “BigTable scales...but doesn’t provide tools to help programmers maintain data invariants in the
           face of concurrent updates.”

         • Applicability
           Incrementally updating data
           Computational output can be broken down into small pieces
           Computation large in some dimension (data size, cpu, etc)

         • Does it matter?
           “...Converting the indexing system to an incremental system ... reduced the average document
           processing latency by a factor of 100...”
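
To illustrate why latency should track the size of the update rather than the size of the repository, here is a toy inverted index with a batch rebuild and an incremental update. This is only a sketch under stated assumptions: `build_index` and `update_index` are hypothetical names, everything lives in memory, and none of Percolator's transactions or distribution is modeled.

```python
from collections import defaultdict


def build_index(docs):
    """Batch style: rescan every document, so cost grows with the repository."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.split():
            index[word].add(doc_id)
    return dict(index)


def update_index(index, docs, doc_id, new_text):
    """Incremental style: touch only the postings of the changed document."""
    old_words = set(docs.get(doc_id, "").split())
    new_words = set(new_text.split())
    for word in old_words - new_words:
        index[word].discard(doc_id)
        if not index[word]:
            del index[word]
    for word in new_words - old_words:
        index.setdefault(word, set()).add(doc_id)
    docs[doc_id] = new_text


docs = {"a": "hadoop mapreduce", "b": "hadoop bigtable"}
index = build_index(docs)
update_index(index, docs, "b", "percolator bigtable")
print(index == build_index(docs))
```

The incremental path reaches the same index as a full rebuild while reading only the one document that changed.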


Percolator: incremental processing
  • BigTable plus...
    Multi-row ACID Transactions
    snapshot isolation, lazy locks
    up to 10s write latencies

    Timestamps
    Start Timestamp (read)
    Commit Timestamp (write)

    Notifications
    Do not maintain invariants

    Observer Framework
    your code to be run upon notification of an update
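
The observer idea can be sketched in a few lines, assuming an in-memory table in place of BigTable. `ToyTable`, `observe`, `write` and `run_observers` are hypothetical names, not Percolator's API, and the sketch leaves out transactions, timestamps and locking entirely.

```python
from collections import deque


class ToyTable:
    """In-memory stand-in for a table whose columns can be observed."""

    def __init__(self):
        self.rows = {}          # row key -> {column: value}
        self.observers = {}     # column -> list of callbacks
        self.pending = deque()  # queued (row, column) notifications

    def observe(self, column, callback):
        """Register code to run when the given column is written."""
        self.observers.setdefault(column, []).append(callback)

    def write(self, row, column, value):
        """Store a value and queue a notification if the column is observed."""
        self.rows.setdefault(row, {})[column] = value
        if column in self.observers:
            self.pending.append((row, column))

    def run_observers(self):
        """Drain notifications; observers may write and trigger more work."""
        while self.pending:
            row, column = self.pending.popleft()
            for callback in self.observers[column]:
                callback(self, row)


def count_words(table, row):
    text = table.rows[row]["raw"]
    table.write(row, "word_count", str(len(text.split())))


def mark_indexed(table, row):
    table.write(row, "indexed", "yes")


table = ToyTable()
table.observe("raw", count_words)
table.observe("word_count", mark_indexed)
table.write("doc1", "raw", "hadoop was predictable")
table.run_observers()
print(table.rows["doc1"])
```

One write to `raw` cascades through both observers, which is the shape of an incremental pipeline: each step reacts only to the rows that changed.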


Percolator: incremental processing




                                Near Linear Scaling to 15k Cores
Percolator: incremental processing




                                Latency lower than MapReduce by 100x
Dremel




Dremel: ad-hoc Query
         • Scalable, interactive ad-hoc query system for read-only nested data
           “...capable of running aggregation queries over trillion-row tables in seconds.”

         • ... on nested data structures in situ
           Web and scientific data is often non-relational
           nested data (protobufs) underlies most structured data at Google

         • Usage
           DEFINE TABLE t AS /path/to/data/*
           SELECT TOP(signal1,100), COUNT(*) FROM t

         • Applicability
           Analysis of crawled documents
           Tracking of install data for apps on Android Market
           Crash reports
           Spam analysis...

                                                      Dream BI Tool
Dremel: ad-hoc Query
 • Ingredients
   In situ data
   SQL like interface
   Serving trees for query execution
   Column striped data (3-10x)
   Analysis Catalogs
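
To make the column-striping ingredient concrete, here is a minimal sketch that assumes flat records with identical fields. Dremel itself stripes nested records, which this toy does not attempt; `records` and `to_columns` are made-up names and the data is illustrative only.

```python
records = [
    {"url": "a.com", "bytes": 120, "lang": "en"},
    {"url": "b.org", "bytes": 300, "lang": "de"},
    {"url": "c.net", "bytes": 80, "lang": "en"},
]


def to_columns(rows):
    """Stripe row-oriented records into one list per field."""
    columns = {}
    for row in rows:
        for field, value in row.items():
            columns.setdefault(field, []).append(value)
    return columns


columns = to_columns(records)

# An aggregation over one field now reads a single column
# instead of every field of every record.
total_bytes = sum(columns["bytes"])
print(total_bytes)
```

A query that touches one field out of many reads only that column, which is where the speedup over record-oriented storage comes from.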




Dremel: ad-hoc Query




                                Columns ~10x faster than Records
Dremel: ad-hoc Query



                [Figures: Benchmark Data; MapReduce (via Sawzall); Dremel (via SQL)]

Dremel: ad-hoc Query



                                     Significant Optimization Possible


 Dremel ~100x Faster than Stock MR




Dremel: ad-hoc Query




                          Most Production Queries Executed in <10 seconds

Pregel




Pregel: Big Graphs
         • Massively parallel processing of big graphs
           billions of vertices, trillions of edges

         • Bulk synchronous parallel model
           sequence of vertex oriented iterations
           send/receive messages from other vertex computations
           read/modify state of vertex, outgoing edges, graph topology

         • Expressive, easy to program
           distribution details hidden behind abstract API

         • Iterative
           computation continues until each vertex votes to terminate

         • In production
           PageRank in 15 lines of code
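
A single-process sketch of the bulk synchronous parallel model, using PageRank as the example. It assumes every vertex has at least one outgoing edge and appears as a key in `edges`, and it stands in for "vote to terminate" with a fixed number of supersteps. `pagerank_bsp` is a hypothetical name and not Pregel's API.

```python
from collections import defaultdict


def pagerank_bsp(edges, supersteps=30, damping=0.85):
    """Vertex-centric PageRank: compute per vertex, exchange messages, barrier."""
    n = len(edges)
    rank = {vertex: 1.0 / n for vertex in edges}
    inbox = defaultdict(list)
    for step in range(supersteps):
        outbox = defaultdict(list)
        for vertex, out_edges in edges.items():
            # Per-vertex compute: fold in messages from the previous superstep.
            if step > 0:
                rank[vertex] = (1 - damping) / n + damping * sum(inbox[vertex])
            # Send this vertex's rank share along each outgoing edge.
            for target in out_edges:
                outbox[target].append(rank[vertex] / len(out_edges))
        # Barrier: messages sent now are only visible in the next superstep.
        inbox = outbox
    return rank


graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank_bsp(graph)
print(ranks)
```

Each vertex sees only its own state and its inbox, so the same loop body could run on any worker that holds that vertex.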


Pregel: Big Graphs
  • Master “Name” node
    connects processes for messaging

  • Message Passing
    no remote procedure calls or remote reads

  • Graph hashed across nodes
    vertex, outgoing edges stored in RAM

  • Aggregators
    global mechanism for aggregation
    all but final reduce computed on node local data

  • Checkpointing
    configurable, enables automatic recovery
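
The hashing and aggregator bullets can be sketched together, assuming integer vertex IDs and a global maximum as the aggregate. `partition` and `aggregate_max` are hypothetical helpers; a real system would run the partial reductions on separate machines.

```python
def partition(vertices, num_workers):
    """Assign each vertex to a worker by hashing its ID."""
    workers = [[] for _ in range(num_workers)]
    for vertex_id, value in vertices.items():
        workers[hash(vertex_id) % num_workers].append((vertex_id, value))
    return workers


def aggregate_max(workers):
    """Reduce on node-local data first, then combine the partial results."""
    partials = [max(value for _, value in local) for local in workers if local]
    return max(partials)


vertex_values = {0: 3, 1: 9, 2: 4, 3: 7, 4: 1}
workers = partition(vertex_values, num_workers=2)
global_max = aggregate_max(workers)
print(global_max)
```

Only one partial value per worker crosses the network for the final reduce, regardless of how many vertices each worker holds.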


Pregel: Big Graphs




                                Near Linear Scaling to 1B nodes
Learn More
         • Incremental Processing
           Incremental, in-database map/reduce in Cloudant’s BigCouch
           HBase 0.92 supports observers/coprocessors
           Stream processing via Storm, HStreaming, etc.

         • Ad Hoc Query
           Google BigQuery
           Column stores (Vertica, etc)
           OpenDremel (stalled?)
           ?

         • Big Graphs
           Giraph on Hadoop (Apache Incubator)
           Golden Orb (stalled?)


Lessons Learned


 • Hire Jeff Dean and Sanjay Ghemawat
 • GFS enables everything
 • There is massive opportunity on the horizon



