SVR17: Data-Intensive Computing on Windows HPC Server with the ...



  1. Data-Intensive Computing on Windows HPC Server with the DryadLINQ Framework
     John Vert
     Architect
     Microsoft Corporation
     SVR17
  2. Moving Parts
     • Windows HPC Server 2008 – cluster management, job scheduling
     • Dryad – distributed execution engine, failure recovery, distribution, scalability across very large partitioned datasets
     • LINQ – .NET extensions for declarative query, easy expression of data parallelism, unified data model
     • PLINQ – multi-core parallelism across LINQ queries
     • DryadLINQ – brings LINQ ease of programming to Dryad
  3. Software Stack
     [Diagram: layered software stack, listed top to bottom]
     • Image Processing, Machine Learning, Graph Analysis, Data Mining, …
     • .NET Applications
     • DryadLINQ
     • Dryad
     • HPC Job Scheduler
     • Windows HPC Server 2008 (shown four times)
  4. Dryad
     • Provides a general, flexible distributed execution layer
     • Dataflow graph as the computation model
     • Can be modified by runtime optimizations
     • Higher language layer supplies graph, vertex code, serialization code, hints for data locality
     • Automatically handles distributed execution
     • Distributes code, routes data
     • Schedules processes on machines near data
     • Masks failures in cluster and network
  5. A Dryad Job: Directed acyclic graph (DAG)
     [Diagram labels: Outputs; Processing vertices; Channels (file, fifo, pipe); Inputs]
  6. 2-D Piping
     • Unix Pipes: 1-D

           grep | sed | sort | awk | perl

     • Dryad: 2-D

           grep^1000 | sed^500 | sort^1000 | awk^500 | perl^50
  7. LINQ: Language Integrated Query
     • Declarative extensions to C# and VB.NET for iterating over collections
     • In memory
     • Via data providers
     • SQL-like
     • Broadly adoptable by developers
     • Easy to use
     • Reduces written code
     • Predictable results
     • Scalable experience
     • Deep tooling support
  8. PLINQ: Parallel Language Integrated Query
     • Value proposition: enable LINQ developers to take advantage of parallel hardware, with a basic understanding of data parallelism
     • Declarative data parallelism (focus on the "what", not the "how")
     • Alternative to LINQ-to-Objects
     • Same set of query operators + some extras
     • Default is IEnumerable<T> based
     • Preview in Parallel Extensions to .NET Framework 3.5 CTP
     • Shipping in .NET Framework 4.0 Beta 2
  9. DryadLINQ: LINQ to clusters
     • Declarative programming style of LINQ for clusters
     • Automatic parallelization
     • Parallel query plan exploits multi-node parallelism
     • PLINQ underneath exploits multi-core parallelism
     • Integration with VS and .NET
     • Type safety, automatic serialization
     • Query plan optimizations
     • Static optimization rules to optimize locality
     • Dynamic run-time optimizations
  10. DryadLINQ: From LINQ to Dryad
      • Automatic query plan generation
      • Distributed query execution by Dryad
      [Diagram: LINQ query → Query plan → Dryad; the plan for the query below is logs → where → select]

          var logentries =
              from line in logs
              where !line.StartsWith("#")
              select new LogEntry(line);
  11. A Simple LINQ Query

          IEnumerable<BabyInfo> babies = ...;
          var results = from baby in babies
                        where baby.Name == queryName &&
                              baby.State == queryState &&
                              baby.Year >= yearStart &&
                              baby.Year <= yearEnd
                        orderby baby.Year ascending
                        select baby;
  12. A Simple PLINQ Query

          IEnumerable<BabyInfo> babies = ...;
          var results = from baby in babies.AsParallel()
                        where baby.Name == queryName &&
                              baby.State == queryState &&
                              baby.Year >= yearStart &&
                              baby.Year <= yearEnd
                        orderby baby.Year ascending
                        select baby;
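
The LINQ and PLINQ slides differ only in the `AsParallel()` call. The following is a minimal runnable sketch of both, using only standard `System.Linq`. The `BabyInfo` record and its `Count` field are assumptions made for illustration, since the deck never shows the class definition.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical record type; the slides do not show BabyInfo's definition.
public record BabyInfo(string Name, string State, int Year, int Count);

public static class BabyQueries
{
    // LINQ to Objects: the query runs on the calling thread.
    public static List<BabyInfo> Sequential(IEnumerable<BabyInfo> babies,
        string queryName, string queryState, int yearStart, int yearEnd)
    {
        var results = from baby in babies
                      where baby.Name == queryName &&
                            baby.State == queryState &&
                            baby.Year >= yearStart &&
                            baby.Year <= yearEnd
                      orderby baby.Year ascending
                      select baby;
        return results.ToList();
    }

    // PLINQ: AsParallel() lets the runtime spread the filter across cores.
    // The orderby clause still returns the results sorted by year.
    public static List<BabyInfo> WithPlinq(IEnumerable<BabyInfo> babies,
        string queryName, string queryState, int yearStart, int yearEnd)
    {
        var results = from baby in babies.AsParallel()
                      where baby.Name == queryName &&
                            baby.State == queryState &&
                            baby.Year >= yearStart &&
                            baby.Year <= yearEnd
                      orderby baby.Year ascending
                      select baby;
        return results.ToList();
    }
}
```

Both methods should return the same list for the same input; only the execution strategy differs.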
  13. A Simple DryadLINQ Query

          PartitionedTable<BabyInfo> babies =
              PartitionedTable.Get<BabyInfo>("");
          var results = from baby in babies
                        where baby.Name == queryName &&
                              baby.State == queryState &&
                              baby.Year >= yearStart &&
                              baby.Year <= yearEnd
                        orderby baby.Year ascending
                        select baby;
  14. PartitionedTable<T>: Core data structure for DryadLINQ
      • Scale-out, partitioned container for .NET objects
      • Derives from IQueryable<T>, IEnumerable<T>
      • ToPartitionedTable() extension methods
      • DryadLINQ operators consume and produce PartitionedTable<T>
      • DryadLINQ generates code to serialize/deserialize your .NET objects
      • Underlying storage can be partitioned file, partitioned SQL table, cluster filesystem
  15. Partitioned File: File-based container for PartitionedTable<T> metadata

          XCoutput520a0fcfPart
          20
          0,1855000,HPCMETAHN01
          1,1630000,HPCA1CN13
          2,1707500,HPCA1CN12
          3,1828820,HPCA1CN22
          4,1802140,HPCA1CN07
          5,1741000,HPCA1CN08
          6,1733980,HPCA1CN11
          7,1762620,HPCA1CN06
          8,1861300,HPCA1CN14
          9,1807460,HPCA1CN17
          10,1807560,HPCA1CN23
          11,1768120,HPCA1CN20
          12,1847220,HPCA1CN03
          13,1729160,HPCA1CN16
          14,1767500,HPCA1CN05
          15,1781520,HPCA1CN04
          16,1728480,HPCA1CN09
          17,1802580,HPCA1CN18
          18,1862380,HPCA1CN10
          19,1762540,HPCA1CN21

      Partition file shown on this slide:

          PCMETAHN01XCoutput520a0fcfPart.00000000
  16. Partitioned File: File-based container for PartitionedTable<T> metadata

          XCoutput520a0fcfPart
          20
          0,1855000,HPCMETAHN01
          1,1630000,HPCA1CN13
          2,1707500,HPCA1CN12
          3,1828820,HPCA1CN22
          4,1802140,HPCA1CN07
          5,1741000,HPCA1CN08
          6,1733980,HPCA1CN11
          7,1762620,HPCA1CN06
          8,1861300,HPCA1CN14
          9,1807460,HPCA1CN17
          10,1807560,HPCA1CN23
          11,1768120,HPCA1CN20
          12,1847220,HPCA1CN03
          13,1729160,HPCA1CN16
          14,1767500,HPCA1CN05
          15,1781520,HPCA1CN04
          16,1728480,HPCA1CN09
          17,1802580,HPCA1CN18
          18,1862380,HPCA1CN10
          19,1762540,HPCA1CN21

      Partition files, one per entry above:

          PCMETAHN01XCoutput520a0fcfPart.00000000
          PCA1CN13XCoutput520a0fcfPart.00000001
          PCA1CN12XCoutput520a0fcfPart.00000002
          PCA1CN22XCoutput520a0fcfPart.00000003
          PCA1CN07XCoutput520a0fcfPart.00000004
          PCA1CN08XCoutput520a0fcfPart.00000005
          PCA1CN11XCoutput520a0fcfPart.00000006
          PCA1CN06XCoutput520a0fcfPart.00000007
          PCA1CN14XCoutput520a0fcfPart.00000008
          PCA1CN17XCoutput520a0fcfPart.00000009
          PCA1CN23XCoutput520a0fcfPart.00000010
          PCA1CN20XCoutput520a0fcfPart.00000011
          PCA1CN03XCoutput520a0fcfPart.00000012
          PCA1CN16XCoutput520a0fcfPart.00000013
          PCA1CN05XCoutput520a0fcfPart.00000014
          PCA1CN04XCoutput520a0fcfPart.00000015
          PCA1CN09XCoutput520a0fcfPart.00000016
          PCA1CN18XCoutput520a0fcfPart.00000017
          PCA1CN10XCoutput520a0fcfPart.00000018
          PCA1CN21XCoutput520a0fcfPart.00000019
  17. A typical data-intensive query

          var logs = PartitionedTable.Get<string>("");
          var logentries =
              from line in logs
              where !line.StartsWith("#")
              select new LogEntry(line);
          var user =
              from access in logentries
              where access.user.EndsWith(@"jvert")
              select access;
          var accesses =
              from access in user
              group access by … into pages
              select new UserPageCount("jvert",
                  pages.Key, pages.Count());
          var htmAccesses =
              from access in accesses
              where … ".htm")
              orderby access.count descending
              select access;

      • logentries: go through logs and keep only lines that are not comments. Parse each line into a new LogEntry object.
      • user: go through logentries and keep only entries that are accesses by jvert.
      • accesses: group jvert's accesses according to the page they correspond to. For each page, count the occurrences.
      • htmAccesses: sort the pages jvert has accessed according to access frequency.
  18. Dryad Parallel DAG execution
      [Diagram: DAG stages in order: logs → logentries → user → accesses → htmAccesses → output]

          var logentries =
              from line in logs
              where !line.StartsWith("#")
              select new LogEntry(line);
          var user =
              from access in logentries
              where access.user.EndsWith(@"jvert")
              select access;
          var accesses =
              from access in user
              group access by … into pages
              select new UserPageCount("jvert",
                  pages.Key, pages.Count());
          var htmAccesses =
              from access in accesses
              where … ".htm")
              orderby access.count descending
              select access;
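
The four-stage log query can be tried locally against an in-memory sequence before it is pointed at a cluster. The sketch below uses plain LINQ to Objects, not DryadLINQ. The log line format (`<user> <page>`), the `page` field used as the grouping key, and the `UserPageCount` shape are all assumptions, because the transcript lost the grouping key and the deck does not show these classes.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical log format: "<user> <page>" per line; '#' starts a comment line.
public class LogEntry
{
    public readonly string user;
    public readonly string page;

    public LogEntry(string line)
    {
        var fields = line.Split(' ', StringSplitOptions.RemoveEmptyEntries);
        user = fields[0];
        page = fields[1];
    }
}

// Hypothetical result type standing in for the deck's UserPageCount.
public record UserPageCount(string User, string Page, int Count);

public static class LogDemo
{
    public static List<UserPageCount> HtmAccesses(IEnumerable<string> logs, string userSuffix)
    {
        // Stage 1: drop comment lines and parse the rest.
        var logentries = from line in logs
                         where !line.StartsWith("#")
                         select new LogEntry(line);
        // Stage 2: keep only accesses by the requested user.
        var user = from access in logentries
                   where access.user.EndsWith(userSuffix)
                   select access;
        // Stage 3: count accesses per page (grouping key is assumed).
        var accesses = from access in user
                       group access by access.page into pages
                       select new UserPageCount(userSuffix, pages.Key, pages.Count());
        // Stage 4: keep .htm pages, most frequently accessed first.
        var htmAccesses = from access in accesses
                          where access.Page.EndsWith(".htm")
                          orderby access.Count descending
                          select access;
        return htmAccesses.ToList();
    }
}
```

Each intermediate variable corresponds to one stage label in the DAG diagram; nothing executes until `ToList()` forces the final query.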
  19. Query plan generation
      • Separation of query from its execution context
        • Add all the loaded assemblies as resources
        • Eliminate references to local variables by partially evaluating all the expressions in the query
        • Distribute objects used by the query
        • Detect impure queries when possible
      • Automatic code generation
        • Object serialization code for Dryad channels
        • Managed code for Dryad vertices
      • Static query plan optimizations
        • Pipelining: composing multiple operators into one vertex
        • Minimize unnecessary data repartitions
        • Other standard DB optimizations
  20. DryadLINQ query plan

          Query 0 Output: file://
          DryadLinq0.dll was built successfully.
          Input:
            [PartitionedTable: file://]
          Super__1:
            Where(line => !(line.StartsWith(_)))
            Select(line => new logdemo.LogEntry(line))
            Where(access => access.user.EndsWith(_))
            DryadGroupBy(access => …, (k__0, pages) => new LinqToDryad.Pair<String,Int32>(k__0, pages.Count()))
            DryadHashPartition(e => e.Key, e => e.Key)
          Super__12:
            DryadMerge()
            DryadGroupBy(e => e.Key, e => e.Value, (k__0, g__1) => new LinqToDryad.Pair<String,Int32>(k__0, g__1.Sum()))
            Select(pages => new logdemo.UserPageCount(_, pages.Key, pages.Count()))
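
The plan above splits the group-and-count into two stages: `Super__1` counts within each input partition and hash-partitions the partial counts by key, and `Super__12` merges them and sums. A rough single-process sketch of that idea in plain LINQ follows. The `TwoPhaseCount` class, the use of nested sequences to stand in for cluster partitions, and the `reducers` parameter are all illustrative assumptions, not DryadLINQ APIs.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class TwoPhaseCount
{
    // partitions: each inner sequence stands in for the keys held on one node.
    // reducers: number of hash buckets used for the second stage (assumed).
    public static Dictionary<string, int> Count(
        IEnumerable<IEnumerable<string>> partitions, int reducers)
    {
        // Stage 1: count locally inside each partition (partial aggregation).
        var partials = partitions
            .Select(part => part
                .GroupBy(k => k)
                .Select(g => new KeyValuePair<string, int>(g.Key, g.Count()))
                .ToList())
            .ToList();

        // Hash-partition the partial counts so equal keys land in one bucket.
        var buckets = partials
            .SelectMany(p => p)
            .GroupBy(kv => (kv.Key.GetHashCode() & int.MaxValue) % reducers);

        // Stage 2: within each bucket, merge and sum the partial counts.
        return buckets
            .SelectMany(bucket => bucket
                .GroupBy(kv => kv.Key, kv => kv.Value)
                .Select(g => new KeyValuePair<string, int>(g.Key, g.Sum())))
            .ToDictionary(kv => kv.Key, kv => kv.Value);
    }
}
```

Counting before repartitioning means only one small record per key per partition crosses the network, which is presumably why the generated plan does it this way.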
  21. XML representation: Generated by DryadLINQ and passed to Dryad

          <Query>
            <DryadLinqVersion>1.0.1401.0</DryadLinqVersion>
            <ClusterName>hpcmetahn01</ClusterName>
            ...
            <Resources>
              <Resource>wrappernativeinfo.dll</Resource>
              <Resource>DryadLinq0.dll</Resource>
              <Resource>System.Threading.dll</Resource>
              <Resource>logdemo.exe</Resource>
              <Resource>LinqToDryad.dll</Resource>
            </Resources>
            <QueryPlan>
              <Vertex>
                <UniqueId>0</UniqueId>
                <Type>InputTable</Type>
                <Name></Name>
                ...
              </Vertex>
              <Vertex>
                <UniqueId>1</UniqueId>
                <Type>Super</Type>
                <Name>Super__1</Name>
                ...
                <Children>
                  <Child>
                    <UniqueId>0</UniqueId>
                  </Child>
                </Children>
              </Vertex>
              ...
            </QueryPlan>
          </Query>

      • Resources: list of files to be shipped to the cluster
      • QueryPlan: vertex definitions
  22. DryadLINQ generated code: Compiled at runtime, assembly passed to Dryad to implement vertices

          public sealed class DryadLinq__Vertex
          {
              public static int Super__1(string args)
              {
                  < . . . >
                  DryadVertexEnv denv = new DryadVertexEnv(args, dvertexparam);
                  var dwriter__2 = denv.MakeWriter(DryadLinq__Extension.FactoryType__0);
                  var dreader__3 = denv.MakeReader(DryadLinq__Extension.FactoryString);
                  var source__4 = DryadLinqVertex.DryadWhere(dreader__3, line => (!(line.StartsWith(@"#"))), true);
                  var source__5 = DryadLinqVertex.DryadSelect(source__4, line => new logdemo.LogEntry(line), true);
                  var source__6 = DryadLinqVertex.DryadWhere(source__5, access => access.user.EndsWith(@"jvert"), true);
                  var source__7 = DryadLinqVertex.DryadGroupBy(source__6, access => …, (k__0, pages) => new LinqToDryad.Pair<System.String,System.Int32>(k__0, pages.Count<logdemo.LogEntry>()), null, true, true, false);
                  DryadLinqVertex.DryadHashPartition(source__7, e => e.Key, null, dwriter__2);
                  DryadLinqLog.Add("Vertex Super__1 completed at {0}", DateTime.Now.ToString("MM/dd/yyyy HH:mm:ss.fff"));
                  return 0;
              }
              public static int Super__12(string args)
              {
                  < . . . >
              }
  23. DryadLINQ query operators
      • Almost all the useful LINQ operators: Where, Select, SelectMany, OrderBy, GroupBy, Join, GroupJoin, Distinct, Concat, Union, Intersect, Except, Count, Contains, Sum, Min, Max, Average, Any, All, Skip, Take, Aggregate
      • Operators introduced by DryadLINQ: HashPartition, RangePartition, Merge, Fork
      • Dryad Apply: operates on sequences rather than items
  24. MapReduce in DryadLINQ

          MapReduce(source,       // sequence of Ts
                    mapper,       // T -> Ms
                    keySelector,  // M -> K
                    reducer)      // (K, Ms) -> Rs
          {
              var map = source.SelectMany(mapper);
              var group = map.GroupBy(keySelector);
              var result = group.SelectMany(reducer);
              return result;      // sequence of Rs
          }
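
The MapReduce slide omits types. One way to make it compile is shown below as a sketch over `IEnumerable<T>` with standard LINQ operators; the generic signature, the `MapReduceSketch` class, and the word-count usage are assumptions added for illustration. The `(K, Ms)` reducer input is represented by `IGrouping<K, M>`, which carries both the key and its mapped values.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class MapReduceSketch
{
    public static IEnumerable<R> MapReduce<T, M, K, R>(
        this IEnumerable<T> source,                    // sequence of Ts
        Func<T, IEnumerable<M>> mapper,                // T -> Ms
        Func<M, K> keySelector,                        // M -> K
        Func<IGrouping<K, M>, IEnumerable<R>> reducer) // (K, Ms) -> Rs
    {
        var map = source.SelectMany(mapper);
        var group = map.GroupBy(keySelector);
        var result = group.SelectMany(reducer);
        return result;                                 // sequence of Rs
    }

    // Usage example: word count expressed as a map, a key, and a reduce.
    public static Dictionary<string, int> WordCount(IEnumerable<string> lines)
    {
        return lines
            .MapReduce<string, string, string, KeyValuePair<string, int>>(
                line => line.Split(' ', StringSplitOptions.RemoveEmptyEntries),
                word => word,
                g => new[] { new KeyValuePair<string, int>(g.Key, g.Count()) })
            .ToDictionary(kv => kv.Key, kv => kv.Value);
    }
}
```

The type arguments are spelled out at the call site to keep the example unambiguous; in practice the compiler can often infer them.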
  25. K-means in DryadLINQ

          public static Vector NearestCenter(Vector v, IEnumerable<Vector> centers) {
              return centers.Aggregate((r, c) => (r - v).Norm2() < (c - v).Norm2() ? r : c);
          }

          public static IQueryable<Vector> Step(IQueryable<Vector> vectors, IQueryable<Vector> centers) {
              return vectors.GroupBy(point => NearestCenter(point, centers))
                            .Select(group => group.Aggregate((x, y) => x + y) / group.Count());
          }

          var vectors = PartitionedTable.Get<Vector>("");
          IQueryable<Vector> centers = vectors.Take(100);
          for (int i = 0; i < 10; i++) {
              centers = Step(vectors, centers);
          }
          centers.ToPartitionedTable<Vector>("");

          public class Vector {
              public double[] entries;
              [Associative]
              public static Vector operator +(Vector v1, Vector v2) { … }
              public static Vector operator -(Vector v1, Vector v2) { … }
              public double Norm2() { … }
          }
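
The slide elides the `Vector` operator bodies and never shows the division used in `Step`. A self-contained LINQ to Objects sketch of one k-means step is given below; the operator implementations, the `/` operator, the constructor, and the use of `List<Vector>` instead of `IQueryable<Vector>` are assumptions filled in so the example runs locally. The DryadLINQ-specific `[Associative]` attribute is left out.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public class Vector
{
    public double[] entries;

    public Vector(params double[] e) { entries = e; }

    // Element-wise operators; these bodies are assumed, the slide elides them.
    public static Vector operator +(Vector a, Vector b) =>
        new Vector(a.entries.Zip(b.entries, (x, y) => x + y).ToArray());
    public static Vector operator -(Vector a, Vector b) =>
        new Vector(a.entries.Zip(b.entries, (x, y) => x - y).ToArray());
    public static Vector operator /(Vector a, int n) =>
        new Vector(a.entries.Select(x => x / n).ToArray());

    // Euclidean length.
    public double Norm2() => Math.Sqrt(entries.Sum(x => x * x));
}

public static class KMeans
{
    // Returns the center object closest to v.
    public static Vector NearestCenter(Vector v, IEnumerable<Vector> centers) =>
        centers.Aggregate((r, c) => (r - v).Norm2() < (c - v).Norm2() ? r : c);

    // One iteration: assign each point to its nearest center, then replace
    // each center with the mean of the points assigned to it.
    public static List<Vector> Step(IEnumerable<Vector> vectors, List<Vector> centers) =>
        vectors.GroupBy(point => NearestCenter(point, centers))
               .Select(group => group.Aggregate((x, y) => x + y) / group.Count())
               .ToList();
}
```

Grouping works by reference here because `NearestCenter` always returns one of the existing center objects, so no custom equality is needed for this local sketch.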
  26. Putting it all together: It's LINQ all the way down
      • Major League Baseball dataset
      • Pitch-by-pitch data for every MLB game since 2007
      • 47,909 pitch XML files (one for each pitcher appearance)
      • 6,127 player XML files (one for each player)
      • Hash partition the input data files to distribute the work
      • LINQ to XML to shred the data
      • DryadLINQ to analyze dataset
  27. Load the dataset and partition
      Define Pitch and Player classes

          void StagePitchData(string[] fileList, string PartitionedFile)
          {
              // partition the list of filenames across
              // 20 nodes of the cluster
              var pitches = fileList.ToPartitionedTable("filelist")
                  .HashPartition((x) => (x), 20)
                  .SelectMany((f) => XElement.Load(f).Elements("atbat"))
                  .SelectMany((a) => a.Elements("pitch")
                      .Select((p) => new Pitch((string)a.Attribute("pitcher"),
                                               (string)a.Attribute("batter"),
                                               p)));
              pitches.ToPartitionedTable(PartitionedFile);
          }

          void StagePlayerData(string[] fileList, string PartitionedFile)
          {
              var players = fileList.Select((p) => new Player(XElement.Load(p)));
              players.ToPartitionedTable(PartitionedFile);
          }
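
The shredding half of `StagePitchData` is ordinary LINQ to XML and can be exercised without a cluster. The sketch below parses XML strings instead of loading files, and it assumes a document shape in which `atbat` elements sit directly under the root and each `pitch` carries a `start_speed` attribute. That attribute name, the `Pitch` constructor body, and the `Shredder` class are assumptions; the deck shows only the constructor call.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

// Hypothetical Pitch class; only its constructor call appears on the slide.
public class Pitch
{
    public string Pitcher;
    public string Batter;
    public double StartSpeed;

    public Pitch(string pitcher, string batter, XElement p)
    {
        Pitcher = pitcher;
        Batter = batter;
        // "start_speed" is an assumed attribute name.
        StartSpeed = (double)p.Attribute("start_speed");
    }
}

public static class Shredder
{
    // Flattens documents -> at-bats -> pitches, copying the pitcher and
    // batter ids from the enclosing atbat element onto every pitch.
    public static List<Pitch> Shred(IEnumerable<string> xmlDocs) =>
        xmlDocs
            .SelectMany(doc => XElement.Parse(doc).Elements("atbat"))
            .SelectMany(a => a.Elements("pitch")
                .Select(p => new Pitch((string)a.Attribute("pitcher"),
                                       (string)a.Attribute("batter"),
                                       p)))
            .ToList();
}
```

The nested `Select` inside the second `SelectMany` is what keeps the at-bat element `a` in scope while each of its pitches is converted.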
  28. Analyze dataset with LINQ

          IQueryable<Pitch> FindFastest(IQueryable<Pitch> pitches, int count)
          {
              return pitches.OrderByDescending((p) => p.StartSpeed)
                            .Take(count);
          }
  29. Supports LINQ Joins

          IQueryable<string>
          FindFastestPitchers(IQueryable<Pitch> pitches,
                              IQueryable<Player> players,
                              int count)
          {
              return pitches.OrderByDescending((p) => p.StartSpeed)
                            .Take(count)
                            .Join(players,
                                  (o) => o.Pitcher,
                                  (i) => i.Id,
                                  (o, i) => i.FirstName + " " + i.LastName)
                            .Distinct();
          }
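
The join query can be checked locally with LINQ to Objects. In the sketch below, the `Pitch` and `Player` records are minimal stand-ins whose members are assumed from the property names the slide uses (`Pitcher`, `StartSpeed`, `Id`, `FirstName`, `LastName`), and `IEnumerable<T>` replaces `IQueryable<T>`.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Minimal stand-ins; the deck does not show these class definitions.
public record Pitch(string Pitcher, double StartSpeed);
public record Player(string Id, string FirstName, string LastName);

public static class PitchQueries
{
    // Takes the `count` fastest pitches, joins each to its pitcher's
    // player record, and returns the distinct pitcher names.
    public static List<string> FindFastestPitchers(
        IEnumerable<Pitch> pitches, IEnumerable<Player> players, int count)
    {
        return pitches.OrderByDescending(p => p.StartSpeed)
                      .Take(count)
                      .Join(players,
                            o => o.Pitcher,
                            i => i.Id,
                            (o, i) => i.FirstName + " " + i.LastName)
                      .Distinct()
                      .ToList();
    }
}
```

Because `Take` runs before `Join`, `count` limits the number of pitches considered, not the number of names returned; a pitcher with several of the fastest pitches appears once.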
  30. DryadLINQ on HPC Server
      • DryadLINQ program runs on client workstation
      • Develop, debug, run locally
      • When ToPartitionedTable() is called, the query expression is materialized (codegen, query plan, optimization) and a job is submitted to HPC Server
      • HPC Server allocates resources for the job and schedules the single task. This task is the Dryad Job Manager (JM)
      • The JM then schedules additional tasks to execute the vertices of the DryadLINQ query
      • When the job completes, the client program picks up the output result and continues
  31. Examples of DryadLINQ Applications
      • Data mining
        • Analysis of service logs for network security
        • Analysis of Windows Watson/SQM data
        • Cluster monitoring and performance analysis
      • Graph analysis
        • Accelerated Page-Rank computation
        • Road network shortest-path preprocessing
      • Image processing
        • Image indexing
        • Decision tree training
        • Epitome computation
      • Simulation
        • Light flow simulations for next-generation display research
        • Monte-Carlo simulations for mobile data
      • eScience
        • Machine learning platform for health solutions
        • Astrophysics simulation
  32. Ongoing Work
      • Advanced query optimizations
      • Combination of static analysis and annotations
      • Sampling execution of the query plan
      • Dynamic query optimization
      • Incremental computation
      • Real-time event processing
      • Global scheduling
      • Dynamically allocate cluster resources between multiple concurrent DryadLINQ applications
      • Scale-out partitioned storage
      • Pluggable storage providers
      • DryadLINQ on Azure
      • Better debugging, performance analysis, visualization, etc.
  33. Additional Resources
      • Dryad and DryadLINQ
        • DryadLINQ source, Dryad binaries, documentation, samples, blog, discussion group, etc.
      • PLINQ
        • Available in Parallel Extensions to .NET Framework 3.5 CTP
        • Available in .NET Framework 4.0 Beta 2
      • Windows HPC Server 2008
        • Download it, try it, we want your feedback!
  34. Questions?
  35. Your feedback is important to us!
      Please fill out session evaluation forms online at
  36. Learn More On Channel 9
      • Expand your PDC experience through Channel 9.
      • Explore videos, hands-on labs, sample code and demos through the new Channel 9 training courses.
      • Built by Developers for Developers…