Tools and Services for Data Intensive Research: An Elephant Through the Eye of a Needle
Roger Barga, Architect, eXtreme Computing Group, Microsoft Research
Select eXtreme Computing Group (XCG) Initiatives
 Cloud Computing Futures: ab initio R&D on cloud hardware/software infrastructure
 Multicore academic engagement: Universal Parallel Computing Research Centers (UPCRCs)
 Software incubations: multicore applications, power management, scheduling
 Quantum computing: topological quantum computing investigations
 Security and cryptography: theoretical explorations and software tools
 Worldwide government and academic research partnerships
 Inform next generation cloud computing infrastructure
Why Commercial Clouds are Important*
 Research: have good idea; write proposal; wait 6 months; if successful, wait 3 months; install computers; start work
 Science start-ups: have good idea; write business plan; ask VCs to fund; if successful, install computers; start work
 Cloud computing model: have good idea; grab nodes from cloud provider; start work; pay for what you used
 Also: scalability, cost, sustainability
 * Slide used with permission of Paul Watson, University of Newcastle (UK)
The Pull of Economics (follow the money)
 Moore’s “Law” favored consumer commodities
  Economics drove enormous improvements
  Specialized processors and mainframes faltered
  The commodity software industry was born
 Today’s economics
  Unprecedented economies of scale
  Enterprise moving to PaaS, SaaS, cloud computing
  Opportunities for Analysis as a Service, multi-disciplinary data sets, …
 This will drive changes in research computing and cloud infrastructure
  Just as did “killer micros” and inexpensive clusters
 [Figure: chip block diagrams combining LPIA x86 and OoO x86 cores, 1 MB caches, GPUs, DRAM and PCIe controllers, and a NoC]
Drinking from the Twitter Fire Hose
 On the “input” end
  Enrich each element with significantly more metadata, e.g. geolocation.
  Assume the Twitter user base is on the order of 10-50M users today; let’s crank this up to the 500M range.
  The average Twitter user generates a relatively low incoming message rate right now. Assume that a user’s devices (phone, car, PC) are enhanced to auto-generate periodic Twitter messages on their behalf, e.g. location ‘pings’ and messages that solve other problems twitterbots are emerging to address. So let’s say the input rate grows again, to 10x-100x what it was in the previous step.
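The two scaling steps above multiply, so it can help to see the combined range. The following is a back-of-envelope sketch only: the user counts and the 10x-100x factor come from the slide, while the assumption that message volume scales linearly with the number of users is ours, and the function name is hypothetical.

```python
# Back-of-envelope multipliers for the scaled-up fire hose.
# ASSUMPTION (not from the slide): message volume scales linearly with users.

CURRENT_USERS = (10e6, 50e6)   # "10-50M" users today
TARGET_USERS = 500e6           # "the 500M range"
DEVICE_FACTOR = (10, 100)      # "10x-100x" from device-generated messages


def combined_growth():
    """Return (low, high) multipliers on today's incoming message rate."""
    # The smallest user growth corresponds to the largest current base.
    user_low = TARGET_USERS / CURRENT_USERS[1]    # 500M / 50M = 10x
    user_high = TARGET_USERS / CURRENT_USERS[0]   # 500M / 10M = 50x
    return user_low * DEVICE_FACTOR[0], user_high * DEVICE_FACTOR[1]


low, high = combined_growth()
print(f"input rate grows roughly {low:.0f}x to {high:.0f}x")
```

Under these assumptions the input end has to absorb somewhere between two and nearly four orders of magnitude more traffic than today.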
Drinking from the Twitter Fire Hose
 On the “input” end
 On the “output” end: three different usage modalities
  Each user has one or more ‘agents’ that run on their behalf, monitoring this input stream. This might just be a client that displays a stream incoming from the @friends or #topics or the #interesting&@queries (user standing queries).
  A user can do more general queries from a search page. Such a query may have more unstructured search terms than the above, and it is expected to run not just against the incoming stream but against a much larger corpus of messages from the entire input stream that has been persisted for days, weeks, months, years…
  Finally, analytical tools or bots whose purpose is to do trend analysis on the knowledge popping out of the stream, in real time. Whether seeded with an interest (“let me know when a problem pops up with <product> that will damage my company’s reputation”) or just discovering a topic from the noise (“let me know when a new hot news item emerges”), both must be possible.
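Two of the three output modalities, standing queries and trend discovery, can be illustrated on a toy stream. This is a minimal single-process sketch, not a design from the talk: the names `standing_query` and `TrendDetector`, the token pattern, and the window and threshold values are all hypothetical, and the ad hoc search over a persisted corpus is not covered.

```python
import re
from collections import Counter, deque

TOKEN = re.compile(r"[@#]\w+")   # hypothetical pattern for @users and #topics


def standing_query(stream, watch):
    """Yield only the messages that mention a watched @user or #topic."""
    watch = {w.lower() for w in watch}
    for msg in stream:
        tags = {t.lower() for t in TOKEN.findall(msg)}
        if tags & watch:
            yield msg


class TrendDetector:
    """Report hashtags seen at least `threshold` times in the last `window` messages."""

    def __init__(self, window=100, threshold=3):
        self.recent = deque(maxlen=window)   # tag sets of the most recent messages
        self.counts = Counter()
        self.threshold = threshold

    def observe(self, msg):
        tags = {t.lower() for t in TOKEN.findall(msg) if t.startswith("#")}
        if len(self.recent) == self.recent.maxlen:
            # The oldest message is about to fall out of the window.
            for old in self.recent[0]:
                self.counts[old] -= 1
        self.recent.append(tags)
        self.counts.update(tags)
        return sorted(t for t in tags if self.counts[t] >= self.threshold)


messages = [
    "@alice lunch?",
    "#dryad talk starting",
    "nothing here",
    "#Dryad slides posted",
    "loving #dryad and #linq",
]
matched = list(standing_query(messages, ["@alice", "#linq"]))
detector = TrendDetector(window=10, threshold=3)
alerts = [detector.observe(m) for m in messages]
```

Here `matched` holds the first and last messages, and only the final call to `observe` reports `#dryad` as trending. A real system would have to run both over partitioned, distributed streams, which is where the difficulty lies.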
Pause for a Moment…
 Defining representative challenges or quests to focus group attention is an excellent way to proceed as a community
 Publishing a whitepaper articulating these challenges is a great way to allow others to contribute to a shared research agenda
 Make simulated and reference data sets available to ground such a distributed research effort
Drinking from the Twitter Fire Hose
 On the “input” end
 On the “output” end: three different usage modalities
 A combination of live data, including streaming, and historical data
 Lots of necessary technology, but no single technology is sufficient
 If this is going to be successful it must be accessible to the masses
 Simple to use and highly scalable, which is extremely difficult because in actuality it is not simple…
This Talk is About
 Effort to build & port tools for data intensive research in the cloud
 Able to handle torrential streams of live and historical data
 Intersection of four fundamental strategies
  Distribute data and perform parallel processing;
  Parallel operations to take advantage of multiple cores;
  Reduce the size of the data accessed: data compression, and data structures that limit the amount of data required for queries;
  Stream data processing to extract information before storage
Microsoft’s Dryad
 Continuously deployed since 2006
 Running on >> 10⁴ machines
 Sifting through > 10Pb data daily
 Runs on clusters > 3000 machines
 Handles jobs with > 10⁵ processes each
 Used by >> 100 developers
 Rich platform for data analysis
 Microsoft Research, Silicon Valley: Michael Isard, Mihai Budiu, Yuan Yu, Andrew Birrell, Dennis Fetterly
Pause for a Moment…
 Data-Intensive Computing Symposium, 2007
 Dryad is now freely available: http://research.microsoft.com/en-us/collaboration/tools/dryad.aspx
 Thanks to Geoffrey Fox (Indiana) and Magda Balazinska (UW) as early adopters
 Commitment by External Research (MSR) to support research community use
Simple Programming Model
Terasort, a well-known benchmark: time to sort 1 TB of data [J. Gray 1985]
DryadLINQ provides simple but powerful programming model
 Only a few lines of code needed to implement Terasort; benchmark from May 2008
DryadLINQ result: 349 seconds (5.8 min)
 Cluster of 240 AMD64 (quad) machines, 920 disks
 Code: 17 lines of LINQ
  DryadDataContext ddc = new DryadDataContext(fileDir);
  DryadTable<TeraRecord> records = ddc.GetPartitionedTable<TeraRecord>(file);
  var q = records.OrderBy(x => x);
  q.ToDryadPartitionedTable(output);
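The query above hides how a sort can be spread over many machines. One common technique, shown here as a single-process Python sketch, is to sample the data for split keys, bucket records by key range, and sort each bucket independently. The slide does not say that this is the plan DryadLINQ generates for `OrderBy`; the function name and all parameters are hypothetical.

```python
import bisect
import random


def range_partition_sort(records, num_partitions, sample_size=100, seed=0):
    """Sort by sampling split keys, bucketing by key range, then sorting buckets."""
    if not records:
        return []
    rng = random.Random(seed)
    sample = sorted(rng.sample(records, min(sample_size, len(records))))
    # Pick num_partitions - 1 split keys spaced evenly through the sample.
    step = len(sample) / num_partitions
    splits = [sample[int(i * step)] for i in range(1, num_partitions)]
    buckets = [[] for _ in range(num_partitions)]
    for rec in records:
        # bisect_right counts the split keys <= rec, which is the bucket index.
        buckets[bisect.bisect_right(splits, rec)].append(rec)
    # Each bucket could be sorted on a different machine; here we simply loop.
    out = []
    for bucket in buckets:
        out.extend(sorted(bucket))
    return out


data = random.Random(42).sample(range(10_000), 1_000)
result = range_partition_sort(data, num_partitions=8)
```

Because every record in bucket k is smaller than every record in bucket k+1, concatenating the locally sorted buckets gives a globally sorted result with no merge step.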
LINQ
 Microsoft’s Language INtegrated Query
  Available in Visual Studio 2008
 A set of operators to manipulate datasets in .NET
  Support traditional relational operators: Select, Join, GroupBy, Aggregate, etc.
 Data model
  Data elements are strongly typed .NET objects
  Much more expressive than SQL tables
 Extremely extensible
  Add new custom operators
  Add new execution providers
Dryad Generalizes Unix Pipes
 Unix Pipes: 1-D
  grep | sed | sort | awk | perl
 Dryad: 2-D, multi-machine, virtualized
  grep¹⁰⁰⁰ | sed⁵⁰⁰ | sort¹⁰⁰⁰ | awk⁵⁰⁰ | perl⁵⁰
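The second dimension can be read as the number of parallel instances of each stage, which is our interpretation of the superscripts. A toy sketch of that idea follows, using threads on one machine in place of a cluster; the helper names `run_stage` and `repartition` and the stage widths are hypothetical and bear no relation to Dryad's actual API.

```python
from concurrent.futures import ThreadPoolExecutor


def run_stage(func, partitions, width):
    """Apply func to every partition using up to `width` parallel workers."""
    with ThreadPoolExecutor(max_workers=width) as pool:
        return list(pool.map(func, partitions))


def repartition(partitions, n):
    """Redistribute all items round-robin into n partitions (stands in for channels)."""
    items = [x for part in partitions for x in part]
    return [items[i::n] for i in range(n)]


def grep(lines):
    return [line for line in lines if "error" in line]


def sed(lines):
    return [line.replace("error", "ERR") for line in lines]


def pipeline(lines):
    parts = repartition([lines], 4)            # grep runs as 4 instances
    parts = run_stage(grep, parts, width=4)
    parts = repartition(parts, 2)              # sed runs as 2 instances
    parts = run_stage(sed, parts, width=2)
    merged = repartition(parts, 1)             # a single final sort
    return run_stage(sorted, merged, width=1)[0]


log = ["error b", "ok", "error a", "warn", "error c"]
print(pipeline(log))
```

Each stage has its own width, and data is reshuffled between stages, which is what distinguishes this shape from a one-dimensional Unix pipe.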
Dryad Job Structure
 [Figure: job graph in which input files feed successive stages of vertices (processes) such as grep, sed, sort, awk, and perl, connected by channels, and ending in output files]
 A channel is a finite stream of items
  TCP pipes (inter-machine)
  Memory FIFOs (intra-machine)
Dryad Job Staging
 1. Build
 2. Send .exe
 3. Start JM
 4. Query cluster resources
 5. Generate graph
 6. Initialize vertices
 7. Serialize vertices
 8. Monitor vertex execution
 [Figure labels: vertex code, JM (job manager) code, cluster services]
Dryad Scheduler is a State Machine
 Static optimizer builds execution graph
  Vertex can run anywhere once all its inputs are ready.
 Dynamic optimizer mutates running graph
  Distributes code, routes data;
  Schedules processes on machines near data;
  Adjusts available compute resources at each stage;
 Automatically recovers computation, adjusts for overload
  If A’s inputs are gone, run upstream vertices again (recursively);
  If A is slow, run a copy elsewhere and use output from one that finishes first.
 Masks failures in cluster and network.
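The two recovery rules for a vertex A can be sketched in a few lines. This is an illustrative toy under our own assumptions, not Dryad's scheduler: `Vertex`, `materialize`, `run_speculatively`, the dictionary used as an output store, and the sleep-based stand-in for a slow machine are all hypothetical.

```python
import time
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait


class Vertex:
    def __init__(self, name, func, inputs=()):
        self.name, self.func, self.inputs = name, func, list(inputs)
        self.runs = 0   # how many times this vertex has executed


def materialize(vertex, store):
    """Return a vertex's output, re-running upstream vertices whose output is gone."""
    if vertex.name in store:
        return store[vertex.name]
    # Inputs missing from the store are rebuilt first, recursively.
    args = [materialize(v, store) for v in vertex.inputs]
    vertex.runs += 1
    store[vertex.name] = vertex.func(*args)
    return store[vertex.name]


def run_speculatively(func, delays):
    """Start one copy of func per delay and return whichever result arrives first."""
    def attempt(delay):
        time.sleep(delay)   # stands in for a slow or a healthy machine
        return func()

    with ThreadPoolExecutor(max_workers=len(delays)) as pool:
        futures = [pool.submit(attempt, d) for d in delays]
        done, _ = wait(futures, return_when=FIRST_COMPLETED)
        return next(iter(done)).result()


src = Vertex("src", lambda: [3, 1, 2])
srt = Vertex("sort", sorted, [src])
total = Vertex("sum", sum, [srt])

store = {}
first = materialize(total, store)
del store["sum"], store["sort"]        # simulate losing two intermediate outputs
second = materialize(total, store)     # re-runs sort and sum, but not src
answer = run_speculatively(lambda: 42, delays=[0.2, 0.0])
```

The recursion only re-executes what is actually missing, so `src` runs once while `sort` and `sum` run twice. The speculative helper relies on vertices being deterministic, so that either copy's output is acceptable.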
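Combining Query Providers
 [Figure: a .NET program (C#, VB, F#, etc.) on the local machine issues a query through the LINQ provider interface to one of several execution engines (DryadLINQ, PLINQ, LINQ-to-IMDB, LINQ-to-Objects, LINQ-to-CEP), arranged along a scalability axis labeled single-core, multi-core, cluster]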
LINQ == Tree of Operators
 A query is composed of a tree of operators
 As with a program AST, these trees can be analyzed and rewritten
 This is why PLINQ can safely introduce parallelism
  q = from x in A where p(x) select x³;
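To make the tree idea concrete, here is a small Python sketch of the example query as an operator tree, together with a rewrite that clones the tree once per data partition. It is an analogy only and does not reflect how PLINQ or DryadLINQ are implemented; the node classes and `rewrite_for_partitions` are hypothetical, and reading the projection as a cube of x is our interpretation of the slide.

```python
from dataclasses import dataclass, replace
from typing import Callable, List


@dataclass
class Source:
    items: List[int]


@dataclass
class Where:
    child: object
    pred: Callable


@dataclass
class Select:
    child: object
    func: Callable


def evaluate(node):
    """Walk the operator tree bottom-up and produce a list of results."""
    if isinstance(node, Source):
        return list(node.items)
    if isinstance(node, Where):
        return [x for x in evaluate(node.child) if node.pred(x)]
    if isinstance(node, Select):
        return [node.func(x) for x in evaluate(node.child)]
    raise TypeError(f"unknown operator: {node!r}")


def rewrite_for_partitions(node, n):
    """Return n copies of the tree, each reading one slice of the source."""
    if isinstance(node, Source):
        return [Source(node.items[i::n]) for i in range(n)]
    # Where and Select act on one element at a time, so cloning them is safe.
    return [replace(node, child=c) for c in rewrite_for_partitions(node.child, n)]


A = list(range(10))
q = Select(Where(Source(A), lambda x: x % 2 == 0), lambda x: x ** 3)

sequential = evaluate(q)
partitioned = [y for tree in rewrite_for_partitions(q, 3) for y in evaluate(tree)]
```

The partitioned trees could be evaluated on separate cores; the merged results contain the same elements as the sequential run, though not necessarily in the same order. The rewrite is safe here only because both operators are element-wise, which is the kind of fact a provider can establish by analyzing the tree.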

Smart TV Buyer Insights Survey 2024 by 91mobiles.pdfSmart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
 
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdfFIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
 
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
 
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 previewState of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
 
When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...
 
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
 
Accelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish CachingAccelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish Caching
 
Key Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfKey Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdf
 
UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3
 
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...
 
Search and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical FuturesSearch and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical Futures
 
"Impact of front-end architecture on development cost", Viktor Turskyi
"Impact of front-end architecture on development cost", Viktor Turskyi"Impact of front-end architecture on development cost", Viktor Turskyi
"Impact of front-end architecture on development cost", Viktor Turskyi
 
Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*
 
Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...
 
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdfFIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
 

Microsoft Dryad

  • 1. Tools and Services for Data Intensive Research: An Elephant Through the Eye of a Needle. Roger Barga, Architect, eXtreme Computing Group, Microsoft Research
  • 2.
  • 3. Worldwide government and academic research partnerships
  • 4.
  • 5. Why Commercial Clouds are Important*. Research: have good idea; write proposal; wait 6 months; if successful, wait 3 months; install computers; start work. Science start-ups: have good idea; write business plan; ask VCs to fund; if successful, install computers; start work. Cloud computing model: have good idea; grab nodes from cloud provider; start work; pay for what you used. Also scalability, cost, sustainability. * Slide used with permission of Paul Watson, University of Newcastle (UK)
  • 6. The Pull of Economics (follow the money). Moore’s “Law” favored consumer commodities: economics drove enormous improvements; specialized processors and mainframes faltered; the commodity software industry was born. Today’s economics: unprecedented economies of scale; enterprise moving to PaaS, SaaS, cloud computing; opportunities for Analysis as a Service, multi-disciplinary data sets, … This will drive changes in research computing and cloud infrastructure, just as did “killer micros” and inexpensive clusters. [Figure: multicore chip diagram with x86 and LPIA cores, 1 MB caches, GPUs, DRAM and PCIe controllers, and a NoC]
  • 7.
  • 8. Enrich each element with significantly more metadata, e.g. geolocation. Assume the order of magnitude of the Twitter user base is in the 10-50MM range; let’s crank this up to the 500M range. The average Twitter user is generating a relatively low incoming message rate right now; assume that a user’s devices (phone, car, PC) are enhanced to begin auto-generating periodic Twitter messages on their behalf, e.g. with location ‘pings’ and solving other problems that twitterbots are emerging to address. So let’s say the input rate grows again to 10x-100x what it was in the previous step.
  • 9. Drinking from the Twitter Fire Hose. On the “input” end; on the “output” end: three different usage modalities. (1) Each user has one or more ‘agents’ they run on their behalf, monitoring this input stream. This might just be a client that displays a stream that is incoming from the @friends or #topics or the #interesting&@queries (user standing queries). (2) A user can do more general queries from a search page. This query may have more unstructured search terms than the above, and it is expected to go not just against the incoming stream but against a much larger corpus of messages from the entire input stream that has been persisted for days, weeks, months, years… (3) Finally, analytical tools or bots whose purpose is to do trend analysis on the knowledge popping out of the stream, in real time. Whether seeded with an interest (“let me know when a problem pops up with <product> that will damage my company’s reputation”) or just discovering a topic from the noise (“let me know when a new hot news item emerges”), both must be possible.
  • 10. Pause for a Moment… Defining representative challenges or quests to focus group attention is an excellent way to proceed as a community. Publishing a whitepaper articulating these challenges is a great way to allow others to contribute to a shared research agenda. Make simulated and reference data sets available to ground such a distributed research effort.
  • 11. Drinking from the Twitter Fire Hose. On the “input” end; on the “output” end: three different usage modalities. A combination of live data, including streaming, and historical data. Lots of necessary technology, but no single technology is sufficient. If this is going to be successful it must be accessible to the masses: simple to use and highly scalable, which is extremely difficult because in actuality it is not simple…
  • 12.
  • 13. Microsoft’s Dryad. Continuously deployed since 2006. Running on >> 10^4 machines. Sifting through > 10 PB of data daily. Runs on clusters of > 3000 machines. Handles jobs with > 10^5 processes each. Used by >> 100 developers. Rich platform for data analysis. Microsoft Research, Silicon Valley: Michael Isard, Mihai Budiu, Yuan Yu, Andrew Birrell, Dennis Fetterly
  • 14. Pause for a Moment… Data-Intensive Computing Symposium, 2007. Dryad is now freely available: http://research.microsoft.com/en-us/collaboration/tools/dryad.aspx. Thanks to Geoffrey Fox (Indiana) and Magda Balazinska (UW) as early adopters. Commitment by External Research (MSR) to support research community use.
  • 15.
  • 16. DryadLINQ provides a simple but powerful programming model
  • 17. Only a few lines of code are needed to implement Terasort (benchmark, May 2008)
  • 18. DryadLINQ result: 349 seconds (5.8 min)
  • 19. Cluster of 240 AMD64 (quad) machines, 920 disks
  • 20. Code: 17 lines of LINQ. DryadDataContext ddc = new DryadDataContext(fileDir); DryadTable<TeraRecord> records = ddc.GetPartitionedTable<TeraRecord>(file); var q = records.OrderBy(x => x); q.ToDryadPartitionedTable(output);
  • 21. LINQ: Microsoft’s Language INtegrated Query. Available in Visual Studio 2008. A set of operators to manipulate datasets in .NET. Supports traditional relational operators: Select, Join, GroupBy, Aggregate, etc. Data model: data elements are strongly typed .NET objects; much more expressive than SQL tables. Extremely extensible: add new custom operators; add new execution providers.
  • 22. Dryad Generalizes Unix Pipes. Unix pipes (1-D): grep | sed | sort | awk | perl. Dryad (2-D, multi-machine, virtualized): grep^1000 | sed^500 | sort^1000 | awk^500 | perl^50
  • 23.
  • 24. TCP pipes (inter-machine)
  • 25.
  • 26. Dryad Job Staging: 1. Build; 2. Send .exe; 3. Start JM; 4. Query cluster resources; 5. Generate graph; 6. Initialize vertices; 7. Serialize vertices; 8. Monitor vertex execution. [Diagram labels: vertex code, JM code, cluster services]
  • 27.
  • 28. If A’s inputs are gone, run upstream vertices again (recursively);
  • 29. If A is slow, run a copy elsewhere and use the output from whichever finishes first. Masks failures in cluster and network.
  • 30. Combining Query Providers. [Diagram labels: Local Machine; .NET program (C#, VB, F#, etc.); Query; LINQ provider interface; Execution Engines: DryadLINQ, PLINQ, LINQ-to-IMDB, Objects, LINQ-to-CEP; Scalability: Cluster, Multi-core, Single-core]
  • 31.
  • 34. Nesting queries inside of others is common. PLINQ can fuse partitions: var q1 = from x in A select x*2; var q2 = q1.Sum();
  • 35. Combining with PLINQ Query DryadLINQ subquery PLINQ
  • 36. Combining with LINQ-to-IMDB Query DryadLINQ Subquery Subquery Subquery Subquery Historical Reference Data LINQ-to-IMDB
  • 37. Combining with LINQ-to-CEP Query DryadLINQ Subquery Subquery Subquery Subquery Subquery ‘Live’ Streaming Data LINQ-to-IMDB LINQ-to-CEP
  • 38. Cost of storing data – a few cents/month/MB. Cost of acquiring data – negligible. Extracting insight while acquiring data – priceless. Mining historical data for ways to extract insight – precious. CEDR CEP – the engine that makes it possible. Reference: “Consistent Streaming Through Time: A Vision for Event Stream Processing”, Roger S. Barga, Jonathan Goldstein, Mohamed H. Ali, Mingsheng Hong. In the proceedings of CIDR 2007.
  • 39. Complex Event Processing Complex Event Processing (CEP) is the continuous and incremental processing of event (data) streams from multiple sources based on declarative query and pattern specifications with near-zero latency.
  • 40.
  • 41.
  • 42. Separately specify desired disorder handling strategy
  • 43. Many interesting repercussions
  • 44. CEDR (Orinoco) Overview: currently processing over 400M events per day for an internal application (5000 events/sec)
  • 45.
  • 46.
  • 47.
  • 48.
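
The Terasort slides (17 to 20) and the "Dryad Generalizes Unix Pipes" slide (22) describe one query running as replicated stages across many machines. The sketch below illustrates that general idea in plain Python: sample the input to choose split keys, range-partition the records, sort each partition independently, and concatenate the results. It is a single-machine illustration under stated assumptions, not DryadLINQ's implementation; the function names (`sample_boundaries`, `range_partition`, `distributed_sort`) and the use of a thread pool in place of cluster vertices are hypothetical choices made for this example.

```python
import random
from bisect import bisect_right
from concurrent.futures import ThreadPoolExecutor


def sample_boundaries(records, num_partitions, sample_size=100, seed=0):
    """Pick num_partitions - 1 split keys from a random sample of the input."""
    rng = random.Random(seed)
    sample = sorted(rng.sample(records, min(sample_size, len(records))))
    step = len(sample) / num_partitions
    return [sample[int(i * step)] for i in range(1, num_partitions)]


def range_partition(records, boundaries):
    """Route each record to the partition whose key range contains it."""
    partitions = [[] for _ in range(len(boundaries) + 1)]
    for record in records:
        # bisect_right counts the boundaries <= record, which is the
        # index of the partition that owns this key range.
        partitions[bisect_right(boundaries, record)].append(record)
    return partitions


def distributed_sort(records, num_partitions=4):
    """Sort by range-partitioning, then sorting each partition in parallel."""
    if not records:
        return []
    boundaries = sample_boundaries(records, num_partitions)
    partitions = range_partition(records, boundaries)
    # Each partition stands in for one replicated "sort" vertex; a thread
    # pool is only a stand-in for separate machines (an assumption here).
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        sorted_parts = list(pool.map(sorted, partitions))
    # Partitions cover disjoint, increasing key ranges, so concatenating
    # them in order yields a globally sorted result.
    return [record for part in sorted_parts for record in part]
```

Calling `distributed_sort(data, num_partitions=8)` on a list of comparable keys returns the same list as `sorted(data)`. Sampling before partitioning is one common way to keep partitions roughly balanced; whether Dryad chooses its split keys this way is not stated in the slides.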

Editor's Notes

  1. Language Integrated Query is an extension of .NET that allows one to write declarative computations on collections.
  2. Dryad is a generalization of the Unix piping mechanism: instead of uni-dimensional (chain) pipelines, it provides two-dimensional pipelines. The unit is still a process connected by a point-to-point channel, but the processes are replicated.
  3. This is the basic Dryad terminology.
  4. The brain of a Dryad job is a centralized Job Manager (JM), which maintains the complete state of the job. The JM controls the processes running on a cluster, but never exchanges data with them. (The data plane is completely separated from the control plane.)
  5. Computation Staging