NEARING THE EVENT HORIZON.
HADOOP WAS PREDICTABLE, WHAT’S NEXT?




                       Mike Miller (UW)
                        _mlmilleratmit
                       September 28, 2011
What I Am

Assistant Professor, Particle Physics
(UW)

Cloudant Founder, Chief Scientist

Background: machine learning, analysis,
big data, globally distributed systems




  Mike Miller
                                          2
What I Am Not

               I didn’t see these coming:
               Superluminal neutrinos
               Red Sox blow 9-game lead in September
               Amazon Silk
               ...

               But here I go anyway




My First Postulate of Big-Data

               Google Matters

  What matters for Google...
  ... matters for the internet...
  ...and therefore matters for the enterprise...
  ... will therefore be re-architected by Apache...
  ... and therefore matters to you.



Evidence




Business Week, 12/24/2007




The Old Canon
• Google File System (the important one)
  http://labs.google.com/papers/gfs.html

• MapReduce (the big one)
  http://labs.google.com/papers/mapreduce.html

• BigTable (clone me!)
  http://labs.google.com/papers/bigtable.html

• Dynamo (OK, it's from Amazon, but: masterless quorum)
  http://s3.amazonaws.com/AllThingsDistributed/sosp/amazon-dynamo-sosp2007.pdf



          copy these. use these. print $$$
So... is that it?




 http://gigaom.com/cloud/democratizing-big-data-is-hadoop-our-only-hope/


What’s Painful about MapReduce?


• Processing latency
  Non-incremental, must re-slurp entire dataset every pass

• Ad-Hoc queries
  Bare metal interface, data import

• Graphs
  Only a handful of graph problems amenable to MR
  http://www.computer.org/portal/web/csdl/doi/10.1109/MCSE.2009.120




Enter The New Canon
• Percolator
  incremental processing
  http://research.google.com/pubs/pub36726.html

• Dremel
  ad-hoc analysis queries
  http://research.google.com/pubs/pub36632.html

• Pregel
  Big graphs
  http://dl.acm.org/citation.cfm?id=1807184

    Scalable, Fault Tolerant, Approachable
Percolator: incremental processing
 • Replaced MapReduce as the tool to build search index
  “However, reprocessing the entire web discards the work done in earlier
  runs and makes latency proportional to the size of the repository, rather
  than the size of the update.”

 • Bigtable alone can’t do it
  “BigTable scales...but doesn’t provide tools to help programmers maintain
  data invariants in the face of concurrent updates.”

 • Applicability
  Incrementally updating data
  Computational output can be broken down into small pieces
  Computation large in some dimension (data size, CPU, etc.)

 • Does it matter?
  “...Converting the indexing system to an incremental system ... reduced the
  average document processing latency by a factor of 100...”
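
A minimal sketch of the batch-versus-incremental contrast described above, using a toy inverted index. The function names and the in-memory dictionaries are assumptions made for illustration; this is not Percolator's design, only the idea that the work can scale with the size of the update rather than the size of the repository.

```python
from collections import defaultdict

def build_index(docs):
    """Batch style: rebuild the whole inverted index from every document."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in set(text.split()):
            index[word].add(doc_id)
    return index

def update_index(index, docs, doc_id, new_text):
    """Incremental style: touch only the postings of the one changed document.

    `index` is assumed to be the defaultdict(set) returned by build_index.
    """
    # Remove the old postings for this document.
    for word in set(docs.get(doc_id, "").split()):
        index[word].discard(doc_id)
        if not index[word]:
            del index[word]
    # Add postings for the new version of the document.
    for word in set(new_text.split()):
        index[word].add(doc_id)
    docs[doc_id] = new_text
    return index

docs = {"a": "big data", "b": "big graphs"}
index = build_index(docs)
update_index(index, docs, "b", "fast graphs")
```

The incremental path ends in the same state as a full rebuild, but it reads one document instead of all of them.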

Percolator: incremental processing

• BigTable plus...
 Transactions
 snapshot isolation, locks

 Timestamps

 Notifications

 Observers
 your code to be run upon notification
 of an update
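
The ingredients above can be sketched with a tiny in-memory table. Everything here (the `TinyTable` class, its method names, the counter standing in for a timestamp source) is hypothetical and for illustration only; locks and multi-row transactions are left out, so this shows only timestamps, notifications, and observers.

```python
import itertools

class TinyTable:
    """In-memory stand-in for a table with versioned cells and observers."""

    def __init__(self):
        self._clock = itertools.count(1)  # hypothetical timestamp source
        self._cells = {}                  # (row, column) -> [(ts, value), ...]
        self._observers = {}              # column -> [callback, ...]
        self._dirty = []                  # pending notifications

    def observe(self, column, callback):
        """Register code to run whenever `column` is written."""
        self._observers.setdefault(column, []).append(callback)

    def write(self, row, column, value):
        ts = next(self._clock)
        self._cells.setdefault((row, column), []).append((ts, value))
        if column in self._observers:
            self._dirty.append((row, column))  # notify observers later
        return ts

    def read(self, row, column, at_ts=None):
        """Return the newest value written at or before at_ts (snapshot read)."""
        for ts, value in reversed(self._cells.get((row, column), [])):
            if at_ts is None or ts <= at_ts:
                return value
        return None

    def run_observers(self):
        """Drain notifications; observers may write and trigger further work."""
        while self._dirty:
            row, column = self._dirty.pop(0)
            for callback in self._observers[column]:
                callback(self, row)

def count_words(table, row):
    # Observer: derive a small piece of output from the updated cell.
    table.write(row, "word_count", len(table.read(row, "raw").split()))

table = TinyTable()
table.observe("raw", count_words)
ts = table.write("doc1", "raw", "hello big data")
table.run_observers()
```

Writing a new version of `doc1` and calling `run_observers()` again recomputes only that row, while a read with `at_ts=ts` still returns the older value.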

Dremel: ad-hoc Query
•    Scalable, interactive ad-hoc query system for read-only nested
     data
     “...capable of running aggregation queries over trillion-row tables in seconds.”

•    ... on nested data structures in situ
     Web and scientific data is often non-relational
     Nested data (protobufs) underlies most structured data at Google

•    Usage
     DEFINE TABLE t AS /path/to/data/*
     SELECT TOP(signal1,100), COUNT(*) FROM t

•    Applicability
     Analysis of crawled documents
     Tracking of install data for apps on Android Market
     Crash reports
     Spam analysis...

                                dream BI tool
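
As a rough illustration of what the usage query computes, here is a single-machine Python equivalent. It assumes that TOP(field, k) combined with COUNT(*) means "the k most frequent values of the field, with their row counts"; the function name and the in-memory records are made up for the example, and nothing here reflects how Dremel executes the query.

```python
from collections import Counter

def top_and_count(records, field, k):
    """Return the k most frequent values of `field` with their row counts."""
    counts = Counter(r[field] for r in records if field in r)
    return counts.most_common(k)

records = [{"signal1": "a"}, {"signal1": "b"}, {"signal1": "a"}]
result = top_and_count(records, "signal1", 1)
```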
Dremel: ad-hoc Query
• Ingredients
 In situ data
 SQL-like interface
 Serving trees for query execution
 Column-striped data
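
Two of these ingredients, column striping and the serving tree, can be sketched together. This is a simplified sketch under stated assumptions: rows are flat dictionaries (real nested and repeated fields need extra bookkeeping that is omitted here), shards are plain Python lists, and all function names are hypothetical.

```python
from collections import Counter

def to_columns(rows):
    """Stripe flat row records into one list per field (column layout)."""
    columns = {}
    for row in rows:
        for field, value in row.items():
            columns.setdefault(field, []).append(value)
    return columns

def leaf_count(shard_columns, field):
    """Leaf server: scan only the requested column of its local shard."""
    return Counter(shard_columns.get(field, []))

def merge(partials):
    """Intermediate or root server: combine partial counts from children."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total

def serving_tree_count(shards, field, fanout=2):
    """Aggregate leaf results level by level until one result remains."""
    if fanout < 2:
        raise ValueError("fanout must be at least 2")
    level = [leaf_count(to_columns(shard), field) for shard in shards]
    while len(level) > 1:
        level = [merge(level[i:i + fanout])
                 for i in range(0, len(level), fanout)]
    return level[0] if level else Counter()

shards = [
    [{"app": "x", "os": "a"}],
    [{"app": "y", "os": "a"}, {"app": "x", "os": "b"}],
    [{"app": "x", "os": "a"}],
]
totals = serving_tree_count(shards, "app")
```

Each leaf touches one column only, and each level of the tree passes up partial results that are much smaller than the data scanned.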




Pregel: Big Graphs
• Massively parallel processing of big graphs
  billions of vertices, trillions of edges

• Bulk synchronous parallel model
  sequence of vertex-oriented iterations
  send/receive messages from other vertex computations
  read/modify state of vertex, outgoing edges, graph topology

• Expressive, easy to program
  distribution details hidden behind abstract API

• Iterative
  computation continues until each vertex votes to terminate

• In production
  PageRank in 15 lines of code
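
A single-process sketch of the vertex-oriented, bulk synchronous model described above. It propagates the maximum value through a graph rather than computing PageRank, because that makes the vote-to-terminate behaviour easy to follow. The `run_pregel` and `propagate_max` names and the function signatures are assumptions for illustration, not the Pregel API.

```python
def run_pregel(values, edges, compute, max_supersteps=50):
    """Run supersteps until every vertex has halted and no messages remain."""
    values = dict(values)
    active = set(values)                 # every vertex starts active
    inbox = {v: [] for v in values}
    superstep = 0
    while active and superstep < max_supersteps:
        next_inbox = {v: [] for v in values}
        halted = set()
        for v in active:
            values[v], outgoing, vote_halt = compute(
                v, values[v], edges.get(v, []), inbox[v], superstep)
            for target, message in outgoing:
                next_inbox[target].append(message)
            if vote_halt:
                halted.add(v)
        # Barrier: messages sent in this superstep are visible in the next.
        inbox = next_inbox
        # A halted vertex is reactivated when it receives a message.
        active = (active - halted) | {v for v, msgs in inbox.items() if msgs}
        superstep += 1
    return values, superstep

def propagate_max(vertex, value, out_edges, messages, superstep):
    """Adopt the largest value seen; tell neighbours only when it changes."""
    new_value = max([value] + messages)
    changed = superstep == 0 or new_value != value
    outgoing = [(t, new_value) for t in out_edges] if changed else []
    return new_value, outgoing, True     # always vote to halt

values = {"a": 3, "b": 6, "c": 2, "d": 1}
edges = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
final_values, supersteps = run_pregel(values, edges, propagate_max)
```

The vertex function sees only its own state and its incoming messages, which is what lets the framework hide distribution behind the API.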

     Nothing like this exists in open source
Pregel: Big Graphs
• Master “Name” node
 connects processes for messaging

• Message Passing
 pure message passing: no remote procedure calls, no remote reads

• Graph hashed across nodes
 vertex, outgoing edges stored in RAM

• Aggregators
 global mechanism for aggregation
 all but the final reduce is computed on node-local data

• Checkpointing
 configurable, enables automatic recovery
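
Two of the points above, hashing the graph across nodes and aggregators that reduce locally first, can be sketched as follows. The helper names are hypothetical, workers are modelled as plain dictionaries in one process, and CRC32 is just a convenient stable hash chosen for the example.

```python
import zlib

def worker_for(vertex_id, num_workers):
    """Assign a vertex to a worker by hashing its id (stable across runs)."""
    return zlib.crc32(vertex_id.encode("utf-8")) % num_workers

def partition(vertices, num_workers):
    """Split {vertex_id: value} into one dictionary per worker."""
    parts = [dict() for _ in range(num_workers)]
    for vertex_id, value in vertices.items():
        parts[worker_for(vertex_id, num_workers)][vertex_id] = value
    return parts

def aggregate_sum(parts):
    """Sum a value over all vertices with a two-stage reduce."""
    partials = [sum(part.values()) for part in parts]  # node-local reduce
    return sum(partials)                               # final reduce at master

vertices = {"a": 1, "b": 2, "c": 3, "d": 4}
parts = partition(vertices, 3)
total = aggregate_sum(parts)
```

Only one partial value per worker crosses the network, regardless of how many vertices each worker holds.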

Lessons Learned


• Hire Jeff Dean and Sanjay Ghemawat
• GFS enables everything
• There is massive opportunity on the horizon




Salesforce Community Group Quito, Salesforce 101
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptx
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 

Horizon 20110928

  • 1. NEARING THE EVENT HORIZON. HADOOP WAS PREDICTABLE, WHAT’S NEXT? Mike Miller (UW) _mlmilleratmit September 28, 2011
  • 2. What I Am: Assistant Professor, Particle Physics (UW); Cloudant Founder, Chief Scientist. Background: machine learning, analysis, big data, globally distributed systems. Mike Miller 2
  • 3. What I Am Not. Didn’t see these coming: superluminal neutrinos; Red Sox blow 9-game lead in September; Amazon Silk; ... But here I go anyway.
  • 4. My First Postulate of Big-Data: Google Matters. What matters for Google... ... matters for the internet... ... and therefore matters for the enterprise... ... will therefore be re-architected by Apache... ... and therefore matters to you.
  • 8. The Old Canon • Google File System (the important one) http://labs.google.com/papers/gfs.html • MapReduce (the big one) http://labs.google.com/papers/mapreduce.html • Bigtable (clone me!) http://labs.google.com/papers/bigtable.html • Dynamo (ok, AWS. but masterless quorum) http://s3.amazonaws.com/AllThingsDistributed/sosp/amazon-dynamo-sosp2007.pdf Copy these. Use these. Print $$$.
  • 9. So... is that it? http://gigaom.com/cloud/democratizing-big-data-is-hadoop-our-only-hope/
  • 10. What’s Painful about MapReduce? • Processing latency: non-incremental, must re-slurp entire dataset every pass • Ad-hoc queries: bare-metal interface, data import • Graphs: only a handful of graph problems amenable to MR http://www.computer.org/portal/web/csdl/doi/10.1109/MCSE.2009.120
  • 11. Enter The New Canon • Percolator: incremental processing http://research.google.com/pubs/pub36726.html • Dremel: ad-hoc analysis queries http://research.google.com/pubs/pub36632.html • Pregel: big graphs http://dl.acm.org/citation.cfm?id=1807184 Scalable, Fault Tolerant, Approachable
  • 12. Percolator: incremental processing • Replaced MapReduce as the tool to build the search index: “However, reprocessing the entire web discards the work done in earlier runs and makes latency proportional to the size of the repository, rather than the size of the update.” • Bigtable alone can’t do it: “Bigtable scales...but doesn’t provide tools to help programmers maintain data invariants in the face of concurrent updates.” • Applicability: incrementally updating data; computational output can be broken down into small pieces; computation large in some dimension (data size, CPU, etc.) • Does it matter? “...Converting the indexing system to an incremental system ... reduced the average document processing latency by a factor of 100...”
  • 13. Percolator: incremental processing • Bigtable plus... Transactions (snapshot isolation, locks); Timestamps; Notifications; Observers (your code, run upon notification of an update)
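The notification/observer idea on this slide can be illustrated with a toy, single-process sketch. Everything here is hypothetical (`ToyTable`, `observe`, `index_document`, the `"raw"` and `"postings"` column names); it is not Percolator's API. It assumes observers run synchronously inside the write and it omits locks and multi-row transactions entirely. It only shows timestamped cells, snapshot-style reads, and user code triggered by an update.

```python
from collections import defaultdict
from itertools import count


class ToyTable:
    """In-memory stand-in for a Bigtable-like store with versioned cells."""

    def __init__(self):
        self._clock = count(1)               # monotonically increasing timestamps
        self._cells = defaultdict(list)      # (row, column) -> [(ts, value), ...]
        self._observers = defaultdict(list)  # column -> [callback, ...]

    def observe(self, column, callback):
        """Register user code to run whenever `column` is written."""
        self._observers[column].append(callback)

    def write(self, row, column, value):
        """Store a new version of the cell, then notify the column's observers."""
        ts = next(self._clock)
        self._cells[(row, column)].append((ts, value))
        for callback in self._observers[column]:
            callback(self, row, value)
        return ts

    def read(self, row, column, at_ts=None):
        """Return the newest value with timestamp <= at_ts (latest if None)."""
        for ts, value in reversed(self._cells.get((row, column), [])):
            if at_ts is None or ts <= at_ts:
                return value
        return None


def index_document(table, row, text):
    """Observer: update only the postings touched by the changed document."""
    for word in set(text.split()):
        posting = table.read(word, "postings") or frozenset()
        table.write(word, "postings", posting | {row})


table = ToyTable()
table.observe("raw", index_document)
t1 = table.write("doc1", "raw", "big data")
t2 = table.write("doc2", "raw", "big graphs")
```

After the two writes, `table.read("big", "postings")` contains both documents, while a read with `at_ts=t2` still sees the index as it stood when `doc2` arrived. Only the postings for the words in the new document are rewritten, which is the sense in which the work is proportional to the update rather than to the repository.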
  • 14. Dremel: ad-hoc Query • Scalable, interactive ad-hoc query system for read-only nested data: “...capable of running aggregation queries over trillion-row tables in seconds.” • ... on nested data structures in situ: web and scientific data is often non-relational; nested data (protobufs) underlies most structured data at Google • Usage: DEFINE TABLE t AS /path/to/data/* SELECT TOP(signal1, 100), COUNT(*) FROM t • Applicability: analysis of crawled documents; tracking of install data for apps on Android Market; crash reports; spam analysis... dream BI tool
  • 15. Dremel: ad-hoc Query • Ingredients: in situ data; SQL-like interface; serving trees for query execution; column-striped data
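Two of the ingredients above, column striping and a serving tree, can be sketched in a few lines. This is a simplified illustration, not Dremel's format: the real system records repetition and definition levels so nested records can be reassembled, whereas this sketch discards record boundaries. The function names (`stripe`, `leaf_count`, `root_merge`) and the sample records are hypothetical.

```python
from collections import Counter, defaultdict


def stripe(records):
    """Flatten nested dict/list records into one value list per field path."""
    columns = defaultdict(list)

    def walk(node, path):
        if isinstance(node, dict):
            for key, child in node.items():
                walk(child, path + (key,))
        elif isinstance(node, list):
            for child in node:
                walk(child, path)
        else:
            columns[".".join(path)].append(node)

    for record in records:
        walk(record, ())
    return dict(columns)


def leaf_count(column_values):
    """Leaf server: partial aggregate over the column data it holds."""
    return Counter(column_values)


def root_merge(partials):
    """Root server: combine partial aggregates from the leaves."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


# Two hypothetical shards of the same nested table.
shard_a = [{"url": "a.com", "links": {"forward": ["b.com", "c.com"]}}]
shard_b = [
    {"url": "b.com", "links": {"forward": ["c.com"]}},
    {"url": "c.com", "links": {"forward": []}},
]

# A TOP/COUNT-style query reads only the one column it needs on each leaf.
partials = [leaf_count(stripe(s).get("links.forward", [])) for s in (shard_a, shard_b)]
totals = root_merge(partials)
top_link = totals.most_common(1)
```

The query never touches the `url` column, which is the point of striping: an aggregation over one field of a wide nested record scans only that field's values.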
  • 18. Pregel: Big Graphs • Massively parallel processing of big graphs: billions of vertices, trillions of edges • Bulk synchronous parallel model: sequence of vertex-oriented iterations; send/receive messages from other vertex computations; read/modify state of vertex, outgoing edges, graph topology • Expressive, easy to program: distribution details hidden behind abstract API • Iterative computation continues until each vertex votes to terminate • In production: PageRank in 15 lines of code. Nothing like this exists in open source.
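The vertex-oriented, bulk synchronous model described on this slide can be sketched as a single-process loop. This is a minimal illustration under stated assumptions, not Pregel's API: the `pregel` driver, the `compute` signature, and the three-vertex graph are hypothetical, the damping constants follow the usual PageRank formulation, and the example graph has no vertices without outgoing edges.

```python
def pregel(graph, compute, initial_value, max_supersteps=50):
    """Run `compute` on every active vertex, one superstep at a time."""
    values = {v: initial_value for v in graph}
    active = set(graph)
    inbox = {v: [] for v in graph}
    superstep = 0
    while (active or any(inbox.values())) and superstep < max_supersteps:
        outbox = {v: [] for v in graph}
        for v in graph:
            messages = inbox[v]
            if v not in active and not messages:
                continue          # halted and nothing received: stay asleep
            active.add(v)         # an incoming message reactivates a vertex
            values[v], outgoing, halt = compute(
                v, values[v], messages, graph[v], superstep, len(graph))
            for target, message in outgoing:
                outbox[target].append(message)
            if halt:
                active.discard(v)  # vote to halt
        inbox = outbox             # barrier: messages are delivered next superstep
        superstep += 1
    return values


def pagerank_compute(vertex, value, messages, out_edges, superstep,
                     num_vertices, iterations=30):
    """Per-vertex PageRank step: absorb incoming rank, send shares onward."""
    if superstep >= 1:
        value = 0.15 / num_vertices + 0.85 * sum(messages)
    if superstep < iterations:
        share = value / len(out_edges) if out_edges else 0.0
        return value, [(target, share) for target in out_edges], False
    return value, [], True         # done: vote to halt, send nothing


graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}   # vertex -> outgoing edges
ranks = pregel(graph, pagerank_compute, initial_value=1.0 / len(graph))
```

The user-supplied part is only `pagerank_compute`; scheduling, message delivery, and termination live in the driver, which is what the slide means by distribution details hidden behind an abstract API.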
  • 19. Pregel: Big Graphs • Master: “Name” node connects processes for messaging • Message passing: no remote procedure calls or remote reads • Graph hashed across nodes: vertex, outgoing edges stored in RAM • Aggregators: global mechanism for aggregation; all but final reduce computed on node-local data • Checkpointing: configurable, enables automatic recovery
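The hashing and aggregator bullets above can be sketched as follows. This is an assumption-laden toy: `owner`, `partition`, and `aggregate` are hypothetical names, CRC32 stands in for whatever hash function a real system uses, and the "workers" are just dictionaries in one process. It shows each worker reducing its own vertices first, so that only one partial result per worker reaches the final reduce.

```python
import zlib


def owner(vertex_id, num_workers):
    """Assign a vertex to a worker by hashing its id (deterministic across runs)."""
    return zlib.crc32(vertex_id.encode("utf-8")) % num_workers


def partition(values, num_workers):
    """Split vertex values into per-worker shards."""
    shards = [dict() for _ in range(num_workers)]
    for vertex_id, value in values.items():
        shards[owner(vertex_id, num_workers)][vertex_id] = value
    return shards


def aggregate(shards, reduce_fn, identity):
    """Reduce locally on each worker, then combine the partials once."""
    partials = []
    for shard in shards:
        partial = identity
        for value in shard.values():   # node-local data only
            partial = reduce_fn(partial, value)
        partials.append(partial)
    result = identity
    for partial in partials:           # final reduce over one value per worker
        result = reduce_fn(result, partial)
    return result


values = {"a": 3, "b": 5, "c": 1, "d": 4}
shards = partition(values, num_workers=3)
largest = aggregate(shards, max, float("-inf"))
total = aggregate(shards, lambda x, y: x + y, 0)
```

This only works for reductions that are associative and commutative (such as max, min, and sum), since the grouping of values depends on where the hash happens to place each vertex.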
  • 22. Lessons Learned • Hire Jeff Dean and Sanjay Ghemawat • GFS enables everything • There is massive opportunity on the horizon