Speaker notes:
  • Language Integrated Query is an extension of .Net which allows one to write declarative computations on collections
  • Dryad is a generalization of the Unix piping mechanism: instead of uni-dimensional (chain) pipelines, it provides two-dimensional pipelines. The unit is still a process connected by a point-to-point channel, but the processes are replicated.
  • This is the basic Dryad terminology.
  • The brain of a Dryad job is a centralized Job Manager, which maintains a complete state of the job. The JM controls the processes running on a cluster, but never exchanges data with them. (The data plane is completely separated from the control plane.)

Tools and Services for Data Intensive Research
An Elephant Through the Eye of a Needle
Roger Barga, Architect
eXtreme Computing Group, Microsoft Research
Select eXtreme Computing Group (XCG) Initiatives
  • Cloud Computing Futures: ab initio R&D on cloud hardware/software infrastructure
  • Multicore academic engagement: Universal Parallel Computing Research Centers (UPCRCs); software incubations (multicore applications, power management, scheduling)
  • Quantum computing: topological quantum computing investigations
  • Security and cryptography: theoretical explorations and software tools
  • Research cloud engagement: worldwide government and academic research partnerships, informing next-generation cloud computing infrastructure

Data Intensive Research
The nature of scientific computing is changing: it is about the data…
  • Hypothesis-driven research: "I have an idea, let me verify it…"
  • Exploratory research: "What correlations can I glean from everyone's data?"
Exploratory analysis requires different tools and techniques: it relies on data mining and visual analytics. "grep" is not a data mining tool, and neither is a DBMS…
Massive, multidisciplinary data is rising rapidly and at unprecedented scale.
Why Commercial Clouds are Important*
  • Research: have a good idea → write proposal → wait 6 months → if successful, wait 3 months → install computers → start work.
  • Science start-ups: have a good idea → write business plan → ask VCs to fund → if successful, install computers → start work.
  • Cloud computing model: have a good idea → grab nodes from a cloud provider → start work → pay for what you used.
Also: scalability, cost, sustainability.
* Slide used with permission of Paul Watson, University of Newcastle (UK)
The Pull of Economics (follow the money)
Moore's "Law" favored consumer commodities
  • Economics drove enormous improvements
  • Specialized processors and mainframes faltered
  • The commodity software industry was born
Today's economics
  • Unprecedented economies of scale
  • Enterprise moving to PaaS, SaaS, cloud computing
  • Opportunities for Analysis as a Service, multi-disciplinary data sets, …
This will drive changes in research computing and cloud infrastructure, just as "killer micros" and inexpensive clusters did.
[Diagram: a heterogeneous manycore chip with low-power (LPIA) x86 cores, out-of-order x86 cores, per-core 1 MB caches, GPUs, and DRAM and PCIe controllers connected by a network-on-chip (NoC).]
Drinking from the Twitter Fire Hose
On the "input" end
  • Start with the "twitter fire hose": messages flowing inbound at a specific rate.
  • Enrich each element with significantly more metadata, e.g. geolocation.
Assume the Twitter user base is on the order of 10-50 million; crank this up to the 500 million range.
The average Twitter user generates a relatively low incoming message rate right now. Assume a user's devices (phone, car, PC) are enhanced to auto-generate periodic Twitter messages on their behalf, e.g. location "pings", solving other problems that twitterbots are emerging to address. So the input rate grows again to 10x-100x what it was in the previous step.
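The scaling assumptions above can be turned into a back-of-envelope throughput estimate. A small Python sketch (the per-user baseline rate is an assumption; the deck does not fix one):

```python
# Back-of-envelope estimate of fire-hose throughput under the slide's
# scaling assumptions. The per-user baseline message rate is assumed.
users_scaled = 500_000_000      # the slide's scaled-up user base
msgs_per_user_per_day = 1       # assumed low baseline rate per user
device_multiplier = 100         # upper end of the 10x-100x auto-generation growth

msgs_per_day = users_scaled * msgs_per_user_per_day * device_multiplier
msgs_per_sec = msgs_per_day / 86_400  # seconds per day

print(f"{msgs_per_day:,} messages/day ~= {msgs_per_sec:,.0f} messages/sec")
```

Even under these rough assumptions the system must absorb hundreds of thousands of messages per second, which motivates the streaming and partitioning strategies later in the talk.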
Drinking from the Twitter Fire Hose
On the "output" end: three different usage modalities
  • Each user has one or more "agents" running on their behalf, monitoring the input stream. This might just be a client displaying the incoming stream from @friends, #topics, or standing user queries.
  • A user can issue more general queries from a search page. Such a query may contain more unstructured search terms, and is expected to run not just against the incoming stream but against the much larger corpus of messages from the entire input stream, persisted for days, weeks, months, years…
  • Finally, analytical tools or bots do trend analysis on the knowledge popping out of the stream, in real time. Whether seeded with an interest ("let me know when a problem pops up with <product> that will damage my company's reputation") or discovering a topic from the noise ("let me know when a new hot news item emerges"), both must be possible.
Pause for a Moment…
Defining representative challenges or quests to focus group attention is an excellent way to proceed as a community.
  • Publishing a whitepaper articulating these challenges is a great way to allow others to contribute to a shared research agenda.
  • Make simulated and reference data sets available to ground such a distributed research effort.
Drinking from the Twitter Fire Hose
On the "input" end and the "output" end, this requires:
  • A combination of live data, including streaming, and historical data
  • Lots of necessary technology, but no single technology is sufficient
If this is going to be successful it must be accessible to the masses: simple to use and highly scalable, which is extremely difficult because in actuality it is not simple…
This Talk is About
An effort to build and port tools for data intensive research in the cloud
  • None have run in the cloud to date, or at the scale we are targeting…
Able to handle torrential streams of live and historical data
  • Goal is simplicity and ease-of-use combined with scalability
The intersection of four fundamental strategies:
  • Distribute data and perform parallel processing, with parallel operations to take advantage of multiple cores
  • Reduce the size of the data accessed through data compression
  • Data structures that limit the amount of data required for queries
  • Stream data processing to extract information before storage
Microsoft's Dryad
  • Continuously deployed since 2006
  • Running on >> 10^4 machines
  • Sifting through > 10 PB of data daily
  • Runs on clusters of > 3,000 machines
  • Handles jobs with > 10^5 processes each
  • Used by >> 100 developers
  • Rich platform for data analysis
Microsoft Research, Silicon Valley: Michael Isard, Mihai Budiu, Yuan Yu, Andrew Birrell, Dennis Fetterly
Pause for a Moment…
Data-Intensive Computing Symposium, 2007
Dryad is now freely available: http://research.microsoft.com/en-us/collaboration/tools/dryad.aspx
Thanks to Geoffrey Fox (Indiana) and Magda Balazinska (UW) as early adopters
Commitment by External Research (MSR) to support research community use
Simple Programming Model
Terasort, the well-known benchmark: time to sort 1 TB of data [J. Gray 1985]
  • Sequential scan/disk = 4.6 hours
  • DryadLINQ provides a simple but powerful programming model
  • Only a few lines of code are needed to implement Terasort (benchmark, May 2008)
  • DryadLINQ result: 349 seconds (5.8 min)
  • Cluster of 240 AMD64 (quad-core) machines, 920 disks
  • Code: 17 lines of LINQ

DryadDataContext ddc = new DryadDataContext(fileDir);
DryadTable<TeraRecord> records =
    ddc.GetPartitionedTable<TeraRecord>(file);
var q = records.OrderBy(x => x);
q.ToDryadPartitionedTable(output);
LINQ
Microsoft's Language INtegrated Query, available in Visual Studio 2008
A set of operators to manipulate datasets in .NET
  • Supports traditional relational operators: Select, Join, GroupBy, Aggregate, etc.
Data model
  • Data elements are strongly typed .NET objects
  • Much more expressive than SQL tables
Extremely extensible
  • Add new custom operators
  • Add new execution providers
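The relational operators named above (Select, Where, GroupBy, Aggregate) compose over strongly typed collections. As a rough analogue only (the deck's own examples are C#/LINQ, not Python, and this is not the .NET API):

```python
# Rough Python analogue of LINQ-style Where/Select and GroupBy/Aggregate
# over strongly typed objects. Illustration only, not the .NET operators.
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class Order:
    customer: str
    amount: int

orders = [Order("ann", 10), Order("bob", 5), Order("ann", 7)]

# Where + Select: filter, then project (like orders.Where(...).Select(...))
big = [o.amount for o in orders if o.amount >= 7]

# GroupBy + Aggregate: total amount per customer
totals = defaultdict(int)
for o in orders:
    totals[o.customer] += o.amount

print(big)           # [10, 7]
print(dict(totals))  # {'ann': 17, 'bob': 5}
```

The point of LINQ is that these compositions are expressed declaratively against typed objects, so a provider (PLINQ, DryadLINQ, …) can choose how to execute them.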
Dryad Generalizes Unix Pipes
Unix pipes: 1-D
    grep | sed | sort | awk | perl
Dryad: 2-D, multi-machine, virtualized
    grep^1000 | sed^500 | sort^1000 | awk^500 | perl^50
(superscripts denote the number of replicated instances of each stage)
Dryad Job Structure
[Diagram: a job is a dataflow graph; input files feed replicated stages (grep, sed, sort, awk, perl) whose vertices (processes) are connected by channels, producing output files.]
A channel is a finite stream of items:
  • NTFS files (temporary)
  • TCP pipes (inter-machine)
  • Memory FIFOs (intra-machine)

Dryad System Architecture
[Diagram: the job manager (control plane) schedules vertices V onto cluster machines through per-machine process daemons (PD) and a name server (NS); the data plane (files, TCP, FIFO) moves data between vertices over the network.]
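The job structure above, vertices connected by point-to-point channels and run once their inputs are ready, can be sketched as a toy in Python (illustrative only; this is not Dryad's actual API, and the vertex names are made up):

```python
# Toy dataflow graph in the spirit of a Dryad job: vertices are functions,
# channels are in-memory lists, and a vertex runs once all inputs are ready.
from graphlib import TopologicalSorter

# vertex name -> (function over its input items, list of upstream vertices)
graph = {
    "read": (lambda _: ["b", "a", "b", "c"], []),
    "grep": (lambda xs: [x for x in xs if x != "c"], ["read"]),
    "sort": (lambda xs: sorted(xs), ["grep"]),
}

channels = {}
# Run vertices in dependency order, like the job manager scheduling stages.
deps = {name: set(up) for name, (_, up) in graph.items()}
for v in TopologicalSorter(deps).static_order():
    fn, upstream = graph[v]
    inputs = [item for u in upstream for item in channels[u]]
    channels[v] = fn(inputs)  # the vertex's output channel

print(channels["sort"])  # ['a', 'b', 'b']
```

In real Dryad the channels would be files, TCP pipes, or memory FIFOs as listed above, and each stage would be replicated across machines.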
Dryad Job Staging
1. Build
2. Send .exe
3. Start JM
4. Query cluster resources
5. Generate graph
6. Initialize vertices
7. Serialize vertices
8. Monitor vertex execution
(Vertex code and JM code are packaged into the job executable; cluster services stage and run them.)
Dryad Scheduler is a State Machine
Static optimizer builds the execution graph
  • A vertex can run anywhere once all its inputs are ready
Dynamic optimizer mutates the running graph
  • Distributes code, routes data
  • Schedules processes on machines near the data
  • Adjusts available compute resources at each stage
Automatically recovers computation and adjusts for overload
  • If A fails, run it again
  • If A's inputs are gone, run the upstream vertices again (recursively)
  • If A is slow, run a copy elsewhere and use the output from whichever finishes first
Masks failures in the cluster and network
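The recovery rules above amount to a small retry-and-race policy. A hypothetical sketch (not Dryad code; the simulated vertex and durations are invented for illustration):

```python
# Toy version of the scheduler's recovery rules: retry a failed vertex,
# and race a duplicate of a slow one, keeping whichever finishes first.

def run_vertex(attempt):
    # Simulated execution: the first attempt "fails", retries succeed.
    return None if attempt == 0 else f"output-from-attempt-{attempt}"

def run_with_retry(max_attempts=3):
    # "If A fails, run it again."
    for attempt in range(max_attempts):
        result = run_vertex(attempt)
        if result is not None:
            return result
    raise RuntimeError("vertex failed on every attempt")

def race_duplicates(durations):
    # "If A is slow, run a copy elsewhere": the copy with the
    # shortest (simulated) duration wins.
    return min(durations, key=durations.get)

print(run_with_retry())                                 # output-from-attempt-1
print(race_duplicates({"original": 9.0, "copy": 2.5}))  # copy
```

The "inputs are gone" rule is the recursive case: re-running an upstream vertex is just `run_with_retry` applied to each missing predecessor before retrying A itself.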
Combining Query Providers
[Diagram: a .NET program (C#, VB, F#, etc.) issues a query through the LINQ provider interface; DryadLINQ executes on a cluster, PLINQ on multi-core, LINQ-to-Objects on a single core, alongside LINQ-to-IMDB and LINQ-to-CEP, scaling from the local machine up through execution engines.]
LINQ == Tree of Operators
A query is comprised of a tree of operators
As with a program AST, these trees can be analyzed and rewritten
  • This is why PLINQ can safely introduce parallelism
q = from x in A where p(x) select x³;
  • Intra-operator parallelism
  • Inter-operator parallelism
  • Both composed
Nesting queries inside of others is common; PLINQ can fuse partitions
var q1 = from x in A select x*2;
var q2 = q1.Sum();
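Rewriting an operator tree before execution, as PLINQ does, can be illustrated on a toy plan (illustrative only; PLINQ's real rewrite rules are far richer than this single fusion):

```python
# Toy operator-tree rewrite: fuse two adjacent "select" (map) nodes into
# one composed map, the way an optimizer rewrites a query AST.

def fuse(ops):
    """Collapse consecutive ("select", f) nodes into a single composed map."""
    fused = []
    for kind, fn in ops:
        if fused and kind == "select" and fused[-1][0] == "select":
            prev = fused.pop()[1]
            # Bind fn/prev as defaults so each closure captures its own pair.
            fused.append(("select", lambda x, f=fn, g=prev: f(g(x))))
        else:
            fused.append((kind, fn))
    return fused

def run(ops, data):
    for kind, fn in ops:
        data = [x for x in data if fn(x)] if kind == "where" else [fn(x) for x in data]
    return data

plan = [("where", lambda x: x % 2 == 0),
        ("select", lambda x: x * 2),
        ("select", lambda x: x + 1)]
optimized = fuse(plan)
assert len(optimized) == 2              # the two selects fused into one
print(run(optimized, [1, 2, 3, 4]))     # [5, 9]
```

Because the tree is data, the rewrite happens before any element is processed, which is exactly what makes it safe to introduce parallelism or push subtrees to another provider.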
Combining with PLINQ
[Diagram: DryadLINQ executes the outer query; each subquery runs under PLINQ on a multi-core machine.]
Combining with LINQ-to-IMDB
[Diagram: DryadLINQ fans the query out into subqueries that run via LINQ-to-IMDB against historical reference data.]
Combining with LINQ-to-CEP
[Diagram: DryadLINQ fans the query out into subqueries that combine LINQ-to-IMDB over historical data with LINQ-to-CEP over 'live' streaming data.]
Cost of storing data: a few cents/month/MB
Cost of acquiring data: negligible
Extracting insight while acquiring data: priceless
Mining historical data for ways to extract insight: precious
CEDR CEP is the engine that makes it possible.
"Consistent Streaming Through Time: A Vision for Event Stream Processing", Roger S. Barga, Jonathan Goldstein, Mohamed H. Ali, Mingsheng Hong, in the proceedings of CIDR 2007.
Complex Event Processing
Complex Event Processing (CEP) is the continuous and incremental processing of event (data) streams from multiple sources, based on declarative query and pattern specifications, with near-zero latency.
The CEDR (Orinoco) Algebra
Leverages existing SQL understanding
  • Streaming extensions to relational algebra
  • Query integration with host languages (LINQ)
Semantics are independent of order of arrival
  • Specify a standing event query
  • Separately specify the desired disorder handling strategy
  • Many interesting repercussions
"Consistent Streaming Through Time: A Vision for Event Stream Processing", Roger S. Barga, Jonathan Goldstein, Mohamed H. Ali, Mingsheng Hong, in the proceedings of CIDR 2007.
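Order-of-arrival independence means the query's answer is defined over event timestamps, not arrival order. A minimal Python illustration (hypothetical, not the CEDR algebra itself), using buffer-and-reorder as the separately specified disorder handling strategy:

```python
# Minimal illustration of order-independent stream semantics: events carry
# an application timestamp, and a reorder buffer presents them in
# event-time order regardless of the order they arrived in.

def reorder(events):
    """Disorder handling strategy: buffer everything, sort by event time."""
    return sorted(events, key=lambda e: e["ts"])

def standing_query(events):
    """Standing query: running count of 'error' events, in event time."""
    count, out = 0, []
    for e in reorder(events):
        if e["kind"] == "error":
            count += 1
        out.append((e["ts"], count))
    return out

arrival_a = [{"ts": 1, "kind": "ok"},
             {"ts": 3, "kind": "error"},
             {"ts": 2, "kind": "error"}]
arrival_b = list(reversed(arrival_a))  # same events, different arrival order

# The answer depends only on event time, not on arrival order.
assert standing_query(arrival_a) == standing_query(arrival_b)
print(standing_query(arrival_a))  # [(1, 0), (2, 1), (3, 2)]
```

A real engine cannot buffer forever; the "many interesting repercussions" above include choosing when to emit speculatively and how to retract results if a late event changes an earlier answer.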
CEDR (Orinoco) Overview
Currently processing over 400M events per day for an internal application (5,000 events/sec)
Reference Data on Azure
Ocean science data on Azure SDS-relational
  • Two terabytes of coastal and model data
  • Collaboration with Bill Howe (Univ of Washington)
Computational finance data on Azure SDS-relational
  • BATS, daily tick data for stocks (10 years)
  • XBRL call reports for banks (10,000 banks)
Working with IRIS to store select seismic data on Azure. The IRIS consortium, based in Seattle (NSF), collects and distributes global seismological data.
  • Data sets requested by researchers worldwide
  • Includes HD videos, seismograms, images, and data from major seismic events

Summary
Data is growing exponentially: big data, with big implications…
  • Implications for research environments and cloud infrastructure
Building cloud analysis and storage tools for data intensive research
  • Implementing key services for science (PhyloD for HIV researchers)
  • Hosting select data sets for multidisciplinary data analysis
Ongoing discussions for research access to Azure
  • Many PB of storage and hundreds of thousands of core-hours
  • Internet2/ESnet connections, with service peering at high bandwidth
  • Driving negotiations with ISVs for pay-as-you-go licensing (MATLAB)
  • Academic access to Azure through our MSDN program
  • Technical engagement team to onboard research groups
  • Tools for data analysis, data storage services, and visual analytics
Questions
