Microsoft Dryad

Tools and Services for Data Intensive Research: An Elephant Through the Eye of a Needle. Roger Barga, Architect, eXtreme Computing Group, Microsoft Research

Select eXtreme Computing Group (XCG) Initiatives
Cloud Computing Futures: ab initio R&D on cloud hardware/software infrastructure
Multicore academic engagement: Universal Parallel Computing Research Centers (UPCRCs)
Software incubations: multicore applications, power management, scheduling
Quantum computing: topological quantum computing investigations
Security and cryptography: theoretical explorations and software tools

Worldwide government and academic research partnerships

Inform next generation cloud computing infrastructure

Why Commercial Clouds are Important*
Research: Have good idea; write proposal; wait 6 months; if successful, wait 3 months; install computers; start work.
Science start-ups: Have good idea; write business plan; ask VCs to fund; if successful, install computers; start work.
Cloud computing model: Have good idea; grab nodes from cloud provider; start work; pay for what you used.
Also: scalability, cost, sustainability.
* Slide used with permission of Paul Watson, University of Newcastle (UK)

The Pull of Economics (follow the money)
Moore’s “Law” favored consumer commodities: economics drove enormous improvements; specialized processors and mainframes faltered; the commodity software industry was born.
Today’s economics: unprecedented economies of scale; enterprise moving to PaaS, SaaS, cloud computing; opportunities for Analysis as a Service, multi-disciplinary data sets, …
This will drive changes in research computing and cloud infrastructure, just as did “killer micros” and inexpensive clusters.
[Figure: heterogeneous multicore chip layout; labels include LPIA x86 and OoO x86 cores, 1 MB caches, GPUs, DRAM and PCIe controllers, and a NoC]

Drinking from the Twitter Fire Hose. On the “input” end:

Enrich each element with significantly more metadata, e.g. geolocation. Assume the order of magnitude of the Twitter user base is in the 10-50MM range; let’s crank this up to the 500M range. The average Twitter user generates a relatively low incoming message rate right now. Assume that a user’s devices (phone, car, PC) are enhanced to begin auto-generating periodic Twitter messages on their behalf, e.g. with location ‘pings’ and solving other problems that twitterbots are emerging to address. So let’s say the input rate grows again to 10x-100x what it was in the previous step.
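The scaling assumptions above can be turned into a back-of-envelope multiplier. This is only a sketch: it assumes the per-user 10x-100x factor multiplies the growth in the user base, and all variable names are illustrative rather than taken from the talk.

```python
# Back-of-envelope growth in aggregate message rate, using only the ranges
# quoted in the slide. Assumption: per-user growth multiplies user-base growth.
current_users_low, current_users_high = 10e6, 50e6  # "10-50MM range"
target_users = 500e6                                 # "500M range"
per_user_low, per_user_high = 10, 100                # "10x-100x" device-generated traffic

# Growth from the user base alone is smallest if today's base is already large.
user_growth_low = target_users / current_users_high
user_growth_high = target_users / current_users_low

total_low = user_growth_low * per_user_low
total_high = user_growth_high * per_user_high

print(f"user-base growth: {user_growth_low:.0f}x to {user_growth_high:.0f}x")
print(f"aggregate input-rate growth: {total_low:.0f}x to {total_high:.0f}x")
```

Under these assumptions the aggregate input rate ends up between two and almost four orders of magnitude above today's rate, before counting the extra metadata carried by each message.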

Drinking from the Twitter Fire Hose. On the “input” end. On the “output” end: three different usage modalities.
Each user has one or more ‘agents’ that run on their behalf, monitoring this input stream. This might just be a client that displays a stream incoming from the @friends or #topics or the #interesting&@queries (user standing queries).
A user can do more general queries from a search page. Such a query may have more unstructured search terms than the above, and it is expected to go not just against the incoming stream but against a much larger corpus of messages from the entire input stream that has been persisted for days, weeks, months, years…
Finally, analytical tools or bots whose purpose is to do trend analysis on the knowledge popping out of the stream, in real time. Whether seeded with an interest (“let me know when a problem pops up with <product> that will damage my company’s reputation”) or just discovering a topic from the noise (“let me know when a new hot news item emerges”), both must be possible.
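The first modality, a per-user agent evaluating a standing query over the live stream, might be sketched as follows. The message layout (`author`, `tags`, `text`) and the `StandingQuery` class are hypothetical and chosen only for illustration; the slides do not define a message schema.

```python
from dataclasses import dataclass, field


@dataclass
class StandingQuery:
    """Hypothetical per-user agent state: followed authors and hashtags."""
    friends: set = field(default_factory=set)
    topics: set = field(default_factory=set)

    def matches(self, message: dict) -> bool:
        # Match on a followed author or on any followed hashtag.
        return (message["author"] in self.friends
                or not self.topics.isdisjoint(message["tags"]))


def run_agent(query, stream):
    """Lazily filter an incoming stream; nothing is persisted here."""
    for message in stream:
        if query.matches(message):
            yield message


incoming = [
    {"author": "@alice", "tags": {"#dryad"}, "text": "sorting 1 TB"},
    {"author": "@bob", "tags": {"#lunch"}, "text": "sandwich"},
    {"author": "@carol", "tags": {"#cloud", "#dryad"}, "text": "cluster is up"},
    {"author": "@dave", "tags": set(), "text": "hello"},
]
query = StandingQuery(friends={"@dave"}, topics={"#dryad"})
matched = [m["author"] for m in run_agent(query, incoming)]
print(matched)
```

The other two modalities need more than a filter: search runs against the persisted corpus, and trend analysis needs state across messages, so neither is covered by this sketch.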

Pause for a Moment… Defining representative challenges or quests to focus group attention is an excellent way to proceed as a community. Publishing a whitepaper articulating these challenges is a great way to allow others to contribute to a shared research agenda. Make simulated and reference data sets available to ground such a distributed research effort.

Drinking from the Twitter Fire Hose. On the “input” end. On the “output” end: three different usage modalities. A combination of live data, including streaming, and historical data. Lots of necessary technology, but no single technology is sufficient. If this is going to be successful it must be accessible to the masses: simple to use and highly scalable, which is extremely difficult because in actuality it is not simple…

This Talk is About
Effort to build & port tools for data intensive research in the cloud.
Able to handle torrential streams of live and historical data.
Intersection of four fundamental strategies:
1. Distribute data and perform parallel processing;
2. Parallel operations to take advantage of multiple cores;
3. Reduce the size of the data accessed: data compression, and data structures that limit the amount of data required for queries;
4. Stream data processing to extract information before storage.

Microsoft’s Dryad
Continuously deployed since 2006
Running on >> 10^4 machines
Sifting through > 10Pb data daily
Runs on clusters > 3000 machines
Handles jobs with > 10^5 processes each
Used by >> 100 developers
Rich platform for data analysis
Microsoft Research, Silicon Valley: Michael Isard, Mihai Budiu, Yuan Yu, Andrew Birrell, Dennis Fetterly

Pause for a Moment… Data-Intensive Computing Symposium, 2007. Dryad is now freely available: http://research.microsoft.com/en-us/collaboration/tools/dryad.aspx. Thanks to Geoffrey Fox (Indiana) and Magda Balazinska (UW) as early adopters. Commitment by External Research (MSR) to support research community use.

Simple Programming Model. Terasort, a well-known benchmark: time to sort 1 TB of data [J. Gray 1985]

DryadLINQ provides simple but powerful programming model

Only a few lines of code needed to implement Terasort, benchmark May 2008

DryadLINQ result: 349 seconds (5.8 min)

Cluster of 240 AMD64 (quad) machines, 920 disks

Code: 17 lines of LINQ
DryadDataContext ddc = new DryadDataContext(fileDir);
DryadTable<TeraRecord> records = ddc.GetPartitionedTable<TeraRecord>(file);
var q = records.OrderBy(x => x);
q.ToDryadPartitionedTable(output);
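The single OrderBy call above hides the work of sorting data spread over many machines. One common way to do that is a sample-based range partition followed by independent per-partition sorts. The sketch below shows the technique in Python on one machine. The slides do not say that DryadLINQ compiles OrderBy exactly this way, and the function name and parameters are illustrative.

```python
import bisect
import random


def range_partition_sort(records, num_partitions, sample_size=100, seed=0):
    """Sort by range-partitioning, then sorting each partition independently."""
    if not records:
        return []
    rng = random.Random(seed)
    # Sample the input to estimate where the partition boundaries should fall.
    sample = sorted(rng.sample(records, min(sample_size, len(records))))
    splitters = [sample[(i * len(sample)) // num_partitions]
                 for i in range(1, num_partitions)]
    # Route each record to the partition whose key range contains it.
    partitions = [[] for _ in range(num_partitions)]
    for record in records:
        partitions[bisect.bisect_right(splitters, record)].append(record)
    # Each partition could be sorted on a different machine, in parallel.
    sorted_parts = [sorted(part) for part in partitions]
    # Partitions cover increasing key ranges, so concatenation is globally sorted.
    return [record for part in sorted_parts for record in part]


data = [random.Random(1).randrange(10_000) for _ in range(1)]
data = random.Random(1).sample(range(10_000), 1_000)
result = range_partition_sort(data, num_partitions=4)
print(result[:5], result[-5:])
```

Because every record in partition i is no larger than any record in partition i+1, no merge step is needed after the local sorts. The quality of the sample only affects how evenly the work is balanced, not correctness.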

LINQ: Microsoft’s Language INtegrated Query
Available in Visual Studio 2008
A set of operators to manipulate datasets in .NET: supports traditional relational operators (Select, Join, GroupBy, Aggregate, etc.)
Data model: data elements are strongly typed .NET objects; much more expressive than SQL tables
Extremely extensible: add new custom operators; add new execution providers

Dryad Generalizes Unix Pipes
Unix Pipes: 1-D
grep | sed | sort | awk | perl
Dryad: 2-D, multi-machine, virtualized
grep^1000 | sed^500 | sort^1000 | awk^500 | perl^50
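The "2-D" idea, where each pipeline stage has its own number of parallel instances, can be illustrated with a small single-machine sketch. Threads stand in for machines here, the round-robin redistribution between stages is an assumption made for simplicity, and the stage functions are hypothetical analogues of grep, sed and sort.

```python
from concurrent.futures import ThreadPoolExecutor


def run_stage(fn, partitions, width):
    """Run `fn` on `width` partitions in parallel after redistributing the input."""
    items = [item for part in partitions for item in part]
    # Round-robin redistribution to match the width of the next stage.
    new_parts = [items[i::width] for i in range(width)]
    with ThreadPoolExecutor(max_workers=width) as pool:
        return list(pool.map(fn, new_parts))


def grep_stage(lines):
    # Like `grep error`.
    return [line for line in lines if "error" in line]


def sed_stage(lines):
    # Like a simple `sed` substitution.
    return [line.replace("error", "ERROR") for line in lines]


def sort_stage(lines):
    return sorted(lines)


log = [f"{'error' if i % 3 == 0 else 'ok'} event {i:03d}" for i in range(30)]

parts = run_stage(grep_stage, [log], width=4)  # grep^4
parts = run_stage(sed_stage, parts, width=2)   # sed^2
parts = run_stage(sort_stage, parts, width=1)  # sort^1, so the sort is global
output = parts[0]
print(output[:3])
```

The final stage runs at width 1 so that the sort is global. A wider sort stage would need a range partition in front of it rather than round-robin distribution.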

Dryad Job Structure
[Figure: job graph from input files to output files; vertices (processes) such as grep, sed, sort, awk and perl are grouped into stages and connected by channels]
A channel is a finite stream of items
Memory FIFOs (intra-machine)

Dryad Job Staging
1. Build
2. Send .exe
3. Start JM
4. Query cluster resources
5. Generate graph
6. Initialize vertices
7. Serialize vertices
8. Monitor vertex execution
[Figure labels: vertex code, JM code, cluster services]

Dryad Scheduler is a State Machine. Static optimizer builds execution graph: a vertex can run anywhere once all its inputs are ready. Dynamic optimizer mutates running graph: distributes code, routes data; schedules processes on machines near data; adjusts available compute resources at each stage; automatically recovers computation, adjusts for overload.

If A’s inputs are gone, run upstream vertices again (recursively);

If A is slow, run a copy elsewhere and use output from the one that finishes first. Masks failures in cluster and network.
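The recovery rule for lost inputs (re-run upstream vertices, recursively) can be sketched as a small graph walk. This is an illustration only: the `Vertex` class, the `available` flag and the run counter are hypothetical, and the speculative-copy rule for slow vertices is not covered.

```python
class Vertex:
    """Hypothetical vertex: a function plus the upstream vertices it reads from."""

    def __init__(self, name, fn, inputs=()):
        self.name = name
        self.fn = fn
        self.inputs = list(inputs)
        self.output = None
        self.available = False  # False means never computed, or output lost
        self.runs = 0


def ensure_output(vertex):
    """Return the vertex's output, re-running it and any lost ancestors."""
    if not vertex.available:
        # Recursively recover every input before running this vertex again.
        args = [ensure_output(upstream) for upstream in vertex.inputs]
        vertex.output = vertex.fn(*args)
        vertex.available = True
        vertex.runs += 1
    return vertex.output


source = Vertex("source", lambda: [1, 2, 3])
double = Vertex("double", lambda xs: [2 * x for x in xs], [source])
total = Vertex("total", lambda xs: sum(xs), [double])

first = ensure_output(total)

# Simulate a machine failure that loses the outputs of `double` and `total`.
double.available = False
total.available = False
second = ensure_output(total)

print(first, second, source.runs, double.runs, total.runs)
```

Only the vertices whose outputs were lost run again. `source` still has its output, so it is not re-executed.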

Combining Query Providers
[Figure: a .NET program (C#, VB, F#, etc.) sends a query through the LINQ provider interface to one of several execution engines. Engine labels: DryadLINQ, PLINQ, LINQ-to-IMDB, LINQ-to-CEP, Objects. Scalability axis: single-core, multi-core, cluster. Other label: local machine.]

LINQ == Tree of Operators. A query is composed of a tree of operators. As with a program AST, these trees can be analyzed and rewritten. This is why PLINQ can safely introduce parallelism. q = from x in A where p(x) select x3;
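To make the operator-tree idea concrete, here is a minimal Python sketch of a query represented as a tree, an analysis pass that checks whether every operator works element by element, and a rewrite that evaluates the same tree over partitions. The node classes and function names are hypothetical, not the LINQ or PLINQ API, and the projection is a placeholder for the select expression.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class Source:
    data: list


@dataclass
class Where:
    child: Any
    pred: Callable


@dataclass
class Select:
    child: Any
    fn: Callable


def evaluate(node):
    """Sequential interpreter for the operator tree."""
    if isinstance(node, Source):
        return list(node.data)
    if isinstance(node, Where):
        return [x for x in evaluate(node.child) if node.pred(x)]
    if isinstance(node, Select):
        return [node.fn(x) for x in evaluate(node.child)]
    raise TypeError(f"unknown operator: {node!r}")


def find_source(node):
    """Analysis: return the Source if the tree is a chain of per-element operators."""
    while isinstance(node, (Where, Select)):
        node = node.child
    return node if isinstance(node, Source) else None


def rebase(node, new_source):
    """Rewrite: copy the tree with its Source replaced."""
    if isinstance(node, Source):
        return new_source
    if isinstance(node, Where):
        return Where(rebase(node.child, new_source), node.pred)
    return Select(rebase(node.child, new_source), node.fn)


def evaluate_partitioned(node, num_partitions):
    """Evaluate per partition when the analysis shows that this is safe."""
    source = find_source(node)
    if source is None:
        return evaluate(node)
    size = max(1, -(-len(source.data) // num_partitions))  # ceiling division
    chunks = [source.data[i:i + size] for i in range(0, len(source.data), size)]
    # Each rebased tree could run on its own core; chunk order preserves results.
    results = [evaluate(rebase(node, Source(chunk))) for chunk in chunks]
    return [x for part in results for x in part]


A = list(range(20))
q = Select(Where(Source(A), lambda x: x % 2 == 0), lambda x: x * x)
sequential = evaluate(q)
partitioned = evaluate_partitioned(q, num_partitions=4)
print(sequential == partitioned)
```

The rewrite is safe here because Where and Select look at one element at a time, so splitting the source into contiguous chunks cannot change the result. An operator such as a sort or an aggregate would fail that analysis and need a different rewrite.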
