SVR17: Data-Intensive Computing on Windows HPC Server with the ...
Transcript

  • 1. Data-Intensive Computing on Windows HPC Server with the DryadLINQ Framework
    John Vert
    Architect
    Microsoft Corporation
    SVR17
  • 2. Moving Parts
    Windows HPC Server 2008 – cluster management, job scheduling
    Dryad – distributed execution engine, failure recovery, distribution, scalability across very large partitioned datasets
    LINQ – .NET extensions for declarative query, easy expression of data parallelism, unified data model
    PLINQ – multi-core parallelism across LINQ queries.
    DryadLINQ – Bring LINQ ease of programming to Dryad
  • 3. Software Stack
    Image Processing, Machine Learning, Graph Analysis, Data Mining
    .NET Applications
    DryadLINQ
    Dryad
    HPC Job Scheduler
    Windows HPC Server 2008 (shown four times in the diagram)
  • 4. Dryad
    Provides a general, flexible distributed execution layer
    Dataflow graph as the computation model
    Can be modified by runtime optimizations
    Higher language layer supplies graph, vertex code, serialization code, hints for data locality
    Automatically handles distributed execution
    Distributes code, routes data
    Schedules processes on machines near data
    Masks failures in cluster and network
  • 5. A Dryad Job: Directed acyclic graph (DAG)
    Outputs
    Processing vertices
    Channels (file, fifo, pipe)
    Inputs
  • 6. 2-D Piping
    Unix Pipes: 1-D
    grep | sed | sort | awk | perl
    Dryad: 2-D
    grep^1000 | sed^500 | sort^1000 | awk^500 | perl^50
  • 7. LINQ: Language Integrated Query
    Declarative extensions to C# and VB.NET for iterating over collections
    In memory
    Via data providers
    SQL-Like
    Broadly adoptable by developers
    Easy to use
    Reduces written code
    Predictable results
    Scalable experience
    Deep tooling support
  • 8. PLINQ: Parallel Language Integrated Query
    Value Proposition:
    Enable LINQ developers to take advantage of parallel hardware—with basic understanding of data parallelism.
    Declarative data parallelism (focus on the “what” not the “how”)
    Alternative to LINQ-to-Objects
    Same set of query operators + some extras
    Default is IEnumerable<T> based
    Preview in Parallel Extensions to .NET Framework 3.5 CTP
    Shipping in .NET Framework 4.0 Beta 2
  • 9. DryadLINQ: LINQ to clusters
    Declarative programming style of LINQ for clusters
    Automatic parallelization
    Parallel query plan exploits multi-node parallelism
    PLINQ underneath exploits multi-core parallelism
    Integration with VS and .NET
    Type safety, automatic serialization
    Query plan optimizations
    Static optimization rules to optimize locality
    Dynamic run-time optimizations
  • 10. DryadLINQ: From LINQ to Dryad
    Automatic query plan generation
    Distributed query execution by Dryad
    LINQ query → Query plan → Dryad
    var logentries =
        from line in logs
        where !line.StartsWith("#")
        select new LogEntry(line);
    Query plan: logs → where → select
  • 11. A Simple LINQ Query
    IEnumerable<BabyInfo> babies = ...;
    var results = from baby in babies
                  where baby.Name == queryName &&
                        baby.State == queryState &&
                        baby.Year >= yearStart &&
                        baby.Year <= yearEnd
                  orderby baby.Year ascending
                  select baby;
  • 12. A Simple PLINQ Query
    IEnumerable<BabyInfo> babies = ...;
    var results = from baby in babies.AsParallel()
                  where baby.Name == queryName &&
                        baby.State == queryState &&
                        baby.Year >= yearStart &&
                        baby.Year <= yearEnd
                  orderby baby.Year ascending
                  select baby;
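The two slides above differ only in the `AsParallel()` call. As a minimal sketch, the following runs both forms over an in-memory list and checks that they return the same rows. The `BabyInfo` fields and the sample data are assumptions for illustration; the slides do not show the class definition.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical shape of BabyInfo; only the fields used by the query are modeled.
public class BabyInfo
{
    public string Name;
    public string State;
    public int Year;
}

public static class BabyQueryDemo
{
    // Runs the slide's query either sequentially (LINQ) or in parallel (PLINQ).
    public static List<BabyInfo> Query(IEnumerable<BabyInfo> babies, string queryName,
        string queryState, int yearStart, int yearEnd, bool parallel)
    {
        if (parallel)
        {
            return (from baby in babies.AsParallel()
                    where baby.Name == queryName &&
                          baby.State == queryState &&
                          baby.Year >= yearStart &&
                          baby.Year <= yearEnd
                    orderby baby.Year ascending
                    select baby).ToList();
        }
        return (from baby in babies
                where baby.Name == queryName &&
                      baby.State == queryState &&
                      baby.Year >= yearStart &&
                      baby.Year <= yearEnd
                orderby baby.Year ascending
                select baby).ToList();
    }

    public static void Main()
    {
        var babies = new List<BabyInfo>
        {
            new BabyInfo { Name = "Emma", State = "WA", Year = 2004 },
            new BabyInfo { Name = "Emma", State = "WA", Year = 2001 },
            new BabyInfo { Name = "Emma", State = "OR", Year = 2002 },
            new BabyInfo { Name = "Liam", State = "WA", Year = 2003 },
            new BabyInfo { Name = "Emma", State = "WA", Year = 1999 },
        };
        var sequential = Query(babies, "Emma", "WA", 2000, 2005, false);
        var parallel = Query(babies, "Emma", "WA", 2000, 2005, true);
        Console.WriteLine(string.Join(", ", sequential.Select(b => b.Year)));
        Console.WriteLine(sequential.Select(b => b.Year)
                                    .SequenceEqual(parallel.Select(b => b.Year)));
    }
}
```

Because the query ends in `orderby`, PLINQ returns the rows in year order even though the filtering may run on several cores.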
  • 13. A Simple DryadLINQ Query
    PartitionedTable<BabyInfo> babies =
        PartitionedTable.Get<BabyInfo>("BabyInfo.pt");
    var results = from baby in babies
                  where baby.Name == queryName &&
                        baby.State == queryState &&
                        baby.Year >= yearStart &&
                        baby.Year <= yearEnd
                  orderby baby.Year ascending
                  select baby;
  • 14. PartitionedTable<T>: Core data structure for DryadLINQ
    Scale-out, partitioned container for .NET objects
    Derives from IQueryable<T>, IEnumerable<T>
    ToPartitionedTable() extension methods
    DryadLINQ operators consume and produce PartitionedTable<T>
    DryadLINQ generates code to serialize/deserialize your .NET objects
    Underlying storage can be partitioned file, partitioned SQL table, cluster filesystem
  • 15. Partitioned File: File-based container for PartitionedTable<T> metadata
    XCoutput520a0fcfPart
    20
    0,1855000,HPCMETAHN01
    1,1630000,HPCA1CN13
    2,1707500,HPCA1CN12
    3,1828820,HPCA1CN22
    4,1802140,HPCA1CN07
    5,1741000,HPCA1CN08
    6,1733980,HPCA1CN11
    7,1762620,HPCA1CN06
    8,1861300,HPCA1CN14
    9,1807460,HPCA1CN17
    10,1807560,HPCA1CN23
    11,1768120,HPCA1CN20
    12,1847220,HPCA1CN03
    13,1729160,HPCA1CN16
    14,1767500,HPCA1CN05
    15,1781520,HPCA1CN04
    16,1728480,HPCA1CN09
    17,1802580,HPCA1CN18
    18,1862380,HPCA1CN10
    19,1762540,HPCA1CN21
    PCMETAHN01XCoutput520a0fcfPart.00000000
  • 16. Partitioned File: File-based container for PartitionedTable<T> metadata
    XCoutput520a0fcfPart
    20
    0,1855000,HPCMETAHN01
    1,1630000,HPCA1CN13
    2,1707500,HPCA1CN12
    3,1828820,HPCA1CN22
    4,1802140,HPCA1CN07
    5,1741000,HPCA1CN08
    6,1733980,HPCA1CN11
    7,1762620,HPCA1CN06
    8,1861300,HPCA1CN14
    9,1807460,HPCA1CN17
    10,1807560,HPCA1CN23
    11,1768120,HPCA1CN20
    12,1847220,HPCA1CN03
    13,1729160,HPCA1CN16
    14,1767500,HPCA1CN05
    15,1781520,HPCA1CN04
    16,1728480,HPCA1CN09
    17,1802580,HPCA1CN18
    18,1862380,HPCA1CN10
    19,1762540,HPCA1CN21
    PCMETAHN01XCoutput520a0fcfPart.00000000
    PCA1CN13XCoutput520a0fcfPart.00000001
    PCA1CN12XCoutput520a0fcfPart.00000002
    PCA1CN22XCoutput520a0fcfPart.00000003
    PCA1CN07XCoutput520a0fcfPart.00000004
    PCA1CN08XCoutput520a0fcfPart.00000005
    PCA1CN11XCoutput520a0fcfPart.00000006
    PCA1CN06XCoutput520a0fcfPart.00000007
    PCA1CN14XCoutput520a0fcfPart.00000008
    PCA1CN17XCoutput520a0fcfPart.00000009
    PCA1CN23XCoutput520a0fcfPart.00000010
    PCA1CN20XCoutput520a0fcfPart.00000011
    PCA1CN03XCoutput520a0fcfPart.00000012
    PCA1CN16XCoutput520a0fcfPart.00000013
    PCA1CN05XCoutput520a0fcfPart.00000014
    PCA1CN04XCoutput520a0fcfPart.00000015
    PCA1CN09XCoutput520a0fcfPart.00000016
    PCA1CN18XCoutput520a0fcfPart.00000017
    PCA1CN10XCoutput520a0fcfPart.00000018
    PCA1CN21XCoutput520a0fcfPart.00000019
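The metadata listing above looks like a small text format: a base path, a partition count, then one comma-separated row per partition. The sketch below parses that shape. The interpretation of the three columns as partition index, size, and host node is an assumption read off the listing, not something the slides state, and `PartitionMetadataDemo` is a hypothetical helper, not part of DryadLINQ.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class PartitionMetadataDemo
{
    public class Partition
    {
        public int Index;     // assumed: partition number
        public long Size;     // assumed: partition size (unit not given on the slide)
        public string Node;   // assumed: cluster node holding the partition
    }

    // Assumed layout: line 0 = base path, line 1 = partition count,
    // then one "index,size,node" row per partition.
    public static List<Partition> Parse(string[] lines)
    {
        int count = int.Parse(lines[1]);
        return lines.Skip(2).Take(count)
            .Select(line => line.Split(','))
            .Select(f => new Partition
            {
                Index = int.Parse(f[0]),
                Size = long.Parse(f[1]),
                Node = f[2]
            })
            .ToList();
    }

    public static void Main()
    {
        // First rows of the listing on the slide, with the count reduced to 3.
        var lines = new[]
        {
            "XCoutput520a0fcfPart",
            "3",
            "0,1855000,HPCMETAHN01",
            "1,1630000,HPCA1CN13",
            "2,1707500,HPCA1CN12",
        };
        foreach (var p in Parse(lines))
            Console.WriteLine(p.Index + " on " + p.Node + ": " + p.Size);
    }
}
```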
  • 17. A typical data-intensive query
    var logs = PartitionedTable.Get<string>("weblogs.pt");
    // Go through logs and keep only lines that are not comments.
    // Parse each line into a new LogEntry object.
    var logentries =
        from line in logs
        where !line.StartsWith("#")
        select new LogEntry(line);
    // Go through logentries and keep only entries that are accesses by jvert.
    var user =
        from access in logentries
        where access.user.EndsWith(@"jvert")
        select access;
    // Group jvert's accesses according to what page they correspond to.
    // For each page, count the occurrences.
    var accesses =
        from access in user
        group access by access.page into pages
        select new UserPageCount("jvert",
            pages.Key, pages.Count());
    // Sort the pages jvert has accessed according to access frequency.
    var htmAccesses =
        from access in accesses
        where access.page.EndsWith(".htm")
        orderby access.count descending
        select access;
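The same four-stage query can be tried locally with LINQ to Objects before it is pointed at a partitioned table. This is a sketch only: the `LogEntry` and `UserPageCount` classes and the `"<user> <page>"` line format are assumptions, since the slides show how the types are used but not how they are defined.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical log record; assumed line format is "<user> <page>".
public class LogEntry
{
    public string user;
    public string page;
    public LogEntry(string line)
    {
        var parts = line.Split(' ');
        user = parts[0];
        page = parts[1];
    }
}

// Hypothetical result record matching the constructor call on the slide.
public class UserPageCount
{
    public string user;
    public string page;
    public int count;
    public UserPageCount(string user, string page, int count)
    {
        this.user = user;
        this.page = page;
        this.count = count;
    }
}

public static class LogQueryDemo
{
    public static List<UserPageCount> HtmAccesses(IEnumerable<string> logs, string userName)
    {
        // Drop comment lines and parse the rest.
        var logentries = from line in logs
                         where !line.StartsWith("#")
                         select new LogEntry(line);
        // Keep one user's accesses.
        var user = from access in logentries
                   where access.user.EndsWith(userName)
                   select access;
        // Count accesses per page.
        var accesses = from access in user
                       group access by access.page into pages
                       select new UserPageCount(userName, pages.Key, pages.Count());
        // Keep .htm pages, most frequently accessed first.
        var htmAccesses = from access in accesses
                          where access.page.EndsWith(".htm")
                          orderby access.count descending
                          select access;
        return htmAccesses.ToList();
    }

    public static void Main()
    {
        var logs = new[]
        {
            "# comment line",
            "jvert /b.htm",
            "jvert /a.htm",
            "jvert /a.htm",
            "jvert /logo.png",
            "someone /a.htm",
        };
        foreach (var r in HtmAccesses(logs, "jvert"))
            Console.WriteLine(r.page + " " + r.count);
    }
}
```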
  • 18. Dryad Parallel DAG execution
    DAG stages: logs → logentries → user → accesses → htmAccesses → output
    var logentries =
        from line in logs
        where !line.StartsWith("#")
        select new LogEntry(line);
    var user =
        from access in logentries
        where access.user.EndsWith(@"jvert")
        select access;
    var accesses =
        from access in user
        group access by access.page into pages
        select new UserPageCount("jvert",
            pages.Key, pages.Count());
    var htmAccesses =
        from access in accesses
        where access.page.EndsWith(".htm")
        orderby access.count descending
        select access;
  • 19. Query plan generation
    Separation of query from its execution context
    Add all the loaded assemblies as resources
    Eliminate references to local variables by partially evaluating all the expressions in the query
    Distribute objects used by the query
    Detect impure queries when possible
    Automatic code generation
    Object serialization code for Dryad channels
    Managed code for Dryad Vertices
    Static query plan optimizations
    Pipelining: composing multiple operators into one vertex
    Minimize unnecessary data repartitions
    Other standard DB optimizations
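One step above, eliminating references to local variables by partial evaluation, can be illustrated with the standard `System.Linq.Expressions` API. The sketch below is not DryadLINQ's implementation; it only shows the general idea of replacing a captured local in an expression tree with its current value, so that the tree no longer refers to the client's stack. `PartialEvalDemo` and `ClosureInliner` are hypothetical names.

```csharp
using System;
using System.Linq.Expressions;
using System.Reflection;

public static class PartialEvalDemo
{
    // Rewrites reads of captured locals. The C# compiler stores a captured
    // local as a field on a closure object, which appears in the tree as a
    // MemberExpression over a ConstantExpression.
    class ClosureInliner : ExpressionVisitor
    {
        protected override Expression VisitMember(MemberExpression node)
        {
            if (node.Expression is ConstantExpression closure &&
                node.Member is FieldInfo field)
            {
                object value = field.GetValue(closure.Value);
                return Expression.Constant(value, node.Type);
            }
            return base.VisitMember(node);
        }
    }

    public static Expression<Func<string, bool>> Inline(
        Expression<Func<string, bool>> query)
    {
        return (Expression<Func<string, bool>>)new ClosureInliner().Visit(query);
    }

    public static void Main()
    {
        string prefix = "#";
        Expression<Func<string, bool>> query = line => !line.StartsWith(prefix);

        // Before: the tree refers to the compiler-generated closure object.
        Console.WriteLine(query);
        // After: the captured value is embedded as a constant.
        Console.WriteLine(Inline(query));
    }
}
```

After inlining, the expression depends only on its parameter and on constants, which is the property a query needs before it can be shipped to another machine.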
  • 20. DryadLINQ query plan
    Query 0 Output: file://pcmetahn01XCoutput7e651a4-38b7-490c-8399-f63eaba7f29a.pt
    DryadLinq0.dll was built successfully.
    Input:
    [PartitionedTable: file://weblogs.pt]
    Super__1:
    Where(line => !(line.StartsWith(_)))
    Select(line => new logdemo.LogEntry(line))
    Where(access => access.user.EndsWith(_))
    DryadGroupBy(access => access.page,(k__0, pages) => new LinqToDryad.Pair<String,Int32>(k__0, pages.Count()))
    DryadHashPartition(e => e.Key,e => e.Key)
    Super__12:
    DryadMerge()
    DryadGroupBy(e => e.Key,e => e.Value,(k__0, g__1) => new LinqToDryad.Pair<String,Int32>(k__0, g__1.Sum()))
    Select(pages => new logdemo.UserPageCount(_, pages.Key, pages.Count()))
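The plan above splits the group-and-count into two stages: a local `DryadGroupBy` that counts within each input partition, a hash partition on the key, then a merge and a second `DryadGroupBy` that sums the partial counts. The following sketch simulates that shape on in-memory lists with plain LINQ to Objects. The partition layout and the `TwoPhaseGroupByDemo` name are assumptions for illustration, and no Dryad API is used.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class TwoPhaseGroupByDemo
{
    public static Dictionary<string, int> CountPages(
        IEnumerable<IEnumerable<string>> inputPartitions, int outputPartitions)
    {
        // Stage 1: count pages locally inside each input partition.
        var partialCounts = inputPartitions
            .Select(partition => partition
                .GroupBy(page => page)
                .Select(g => new KeyValuePair<string, int>(g.Key, g.Count()))
                .ToList())
            .ToList();

        // Hash partition: route every partial count to a bucket chosen by its key,
        // so all partial counts for one page land in the same bucket.
        var buckets = partialCounts
            .SelectMany(p => p)
            .GroupBy(kv => (kv.Key.GetHashCode() & int.MaxValue) % outputPartitions);

        // Stage 2: within each bucket, merge and sum the partial counts per page.
        return buckets
            .SelectMany(bucket => bucket
                .GroupBy(kv => kv.Key)
                .Select(g => new KeyValuePair<string, int>(g.Key, g.Sum(kv => kv.Value))))
            .ToDictionary(kv => kv.Key, kv => kv.Value);
    }

    public static void Main()
    {
        var partitions = new[]
        {
            new[] { "/a.htm", "/b.htm", "/a.htm" },
            new[] { "/b.htm", "/a.htm" },
        };
        foreach (var kv in CountPages(partitions, 4))
            Console.WriteLine(kv.Key + " " + kv.Value);
    }
}
```

Counting locally first means only one pair per distinct page per partition crosses the network, rather than one record per access.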
  • 21. XML representation: Generated by DryadLINQ and passed to Dryad
    <Query>
    <DryadLinqVersion>1.0.1401.0</DryadLinqVersion>
    <ClusterName>hpcmetahn01</ClusterName>
    ...
    <Resources>
    <Resource>wrappernativeinfo.dll</Resource>
    <Resource>DryadLinq0.dll</Resource>
    <Resource>System.Threading.dll</Resource>
    <Resource>logdemo.exe</Resource>
    <Resource>LinqToDryad.dll</Resource>
    </Resources>
    <QueryPlan>
    <Vertex>
    <UniqueId>0</UniqueId>
    <Type>InputTable</Type>
    <Name>weblogs.pt</Name>
    ...
    </Vertex>
    <Vertex>
    <UniqueId>1</UniqueId>
    <Type>Super</Type>
    <Name>Super__1</Name>
    ...
    <Children>
    <Child>
    <UniqueId>0</UniqueId>
    </Child>
    </Children>
    </Vertex>
    ...
    </QueryPlan>
    </Query>
    Callouts: <Resources> is the list of files to be shipped to the cluster; <QueryPlan> holds the vertex definitions
  • 22. DryadLINQ generated code: Compiled at runtime, assembly passed to Dryad to implement vertices
    public sealed class DryadLinq__Vertex
    {
    public static int Super__1(string args)
    {
    < . . . >
    DryadVertexEnv denv = new DryadVertexEnv(args, dvertexparam);
    var dwriter__2 = denv.MakeWriter(DryadLinq__Extension.FactoryType__0);
    var dreader__3 = denv.MakeReader(DryadLinq__Extension.FactoryString);
    var source__4 = DryadLinqVertex.DryadWhere(dreader__3, line => (!(line.StartsWith(@"#"))), true);
    var source__5 = DryadLinqVertex.DryadSelect(source__4, line => new logdemo.LogEntry(line), true);
    var source__6 = DryadLinqVertex.DryadWhere(source__5, access => access.user.EndsWith(@"jvert"), true);
    var source__7 = DryadLinqVertex.DryadGroupBy(source__6, access => access.page, (k__0, pages) => new LinqToDryad.Pair<System.String,System.Int32>(k__0, pages.Count<logdemo.LogEntry>()), null, true, true, false);
    DryadLinqVertex.DryadHashPartition(source__7, e => e.Key, null, dwriter__2);
    DryadLinqLog.Add("Vertex Super__1 completed at {0}", DateTime.Now.ToString("MM/dd/yyyy HH:mm:ss.fff"));
    return 0;
    }
    public static int Super__12(string args)
    {
    < . . . >
    }
    }
  • 23. DryadLINQ query operators
    Almost all the useful LINQ operators
    Where, Select, SelectMany, OrderBy, GroupBy, Join, GroupJoin, Distinct, Concat, Union, Intersect, Except, Count, Contains, Sum, Min, Max, Average, Any, All, Skip, Take, Aggregate
    Operators introduced by DryadLINQ
    HashPartition, RangePartition, Merge, Fork
    Dryad Apply
    Operates on sequences rather than items
  • 24. MapReduce in DryadLINQ
    MapReduce(source, // sequence of Ts
    mapper, // T -> Ms
    keySelector, // M -> K
    reducer) // (K, Ms) -> Rs
    {
    var map = source.SelectMany(mapper);
    var group = map.GroupBy(keySelector);
    var result = group.SelectMany(reducer);
    return result; // sequence of Rs
    }
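The slide gives MapReduce as three LINQ operators with the types left as comments. Written against `IEnumerable<T>` with explicit generic parameters, the same composition runs as ordinary LINQ to Objects. This is a local sketch of the pattern, not the DryadLINQ signature; the word-count usage and the `MapReduceDemo` name are illustrative assumptions.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class MapReduceDemo
{
    public static IEnumerable<R> MapReduce<T, M, K, R>(
        this IEnumerable<T> source,                    // sequence of Ts
        Func<T, IEnumerable<M>> mapper,                // T -> Ms
        Func<M, K> keySelector,                        // M -> K
        Func<IGrouping<K, M>, IEnumerable<R>> reducer) // (K, Ms) -> Rs
    {
        var map = source.SelectMany(mapper);
        var group = map.GroupBy(keySelector);
        var result = group.SelectMany(reducer);
        return result;                                 // sequence of Rs
    }

    // Usage example: word count expressed with the helper above.
    public static Dictionary<string, int> WordCount(IEnumerable<string> lines)
    {
        return lines
            .MapReduce(
                line => line.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries),
                word => word,
                g => new[] { new KeyValuePair<string, int>(g.Key, g.Count()) })
            .ToDictionary(kv => kv.Key, kv => kv.Value);
    }

    public static void Main()
    {
        foreach (var kv in WordCount(new[] { "a b a", "b c" }))
            Console.WriteLine(kv.Key + " " + kv.Value);
    }
}
```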
  • 25. K-means in DryadLINQ
    public static Vector NearestCenter(Vector v, IEnumerable<Vector> centers) {
    return centers.Aggregate((r, c) => (r - v).Norm2() < (c - v).Norm2() ? r : c);
    }
    public static IQueryable<Vector> Step(IQueryable<Vector> vectors, IQueryable<Vector> centers) {
    return vectors.GroupBy(point => NearestCenter(point, centers))
    .Select(group => group.Aggregate((x,y) => x + y) / group.Count());
    }
    var vectors = PartitionedTable.Get<Vector>("vectors.pt");
    IQueryable<Vector> centers = vectors.Take(100);
    for (int i = 0; i < 10; i++) {
    centers = Step(vectors, centers);
    }
    centers.ToPartitionedTable<Vector>("centers.pt");
    public class Vector {
    public double[] entries;
    [Associative]
    public static Vector operator +(Vector v1, Vector v2) { … }
    public static Vector operator -(Vector v1, Vector v2) { … }
    public double Norm2() {…}
    }
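A single k-means step of the kind shown above can be checked locally on a handful of 2-D points. This sketch swaps the slide's `Vector` class for value tuples so that grouping by center uses value equality, and it computes each new center with `Average` instead of the slide's sum divided by count. Both are simplifications made for the example and are not taken from the talk.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class KMeansDemo
{
    // Squared Euclidean distance between two 2-D points.
    static double Dist2((double X, double Y) a, (double X, double Y) b)
    {
        return (a.X - b.X) * (a.X - b.X) + (a.Y - b.Y) * (a.Y - b.Y);
    }

    // Picks the center closest to v, mirroring the Aggregate call on the slide.
    public static (double X, double Y) NearestCenter(
        (double X, double Y) v, IReadOnlyList<(double X, double Y)> centers)
    {
        return centers.Aggregate((r, c) => Dist2(r, v) < Dist2(c, v) ? r : c);
    }

    // One iteration: assign each point to its nearest center,
    // then replace each center with the mean of its assigned points.
    public static List<(double X, double Y)> Step(
        IEnumerable<(double X, double Y)> points,
        IReadOnlyList<(double X, double Y)> centers)
    {
        return points
            .GroupBy(p => NearestCenter(p, centers))
            .Select(g => (g.Average(p => p.X), g.Average(p => p.Y)))
            .ToList();
    }

    public static void Main()
    {
        var points = new List<(double X, double Y)>
        {
            (0, 0), (0, 2), (10, 10), (10, 12),
        };
        var centers = new List<(double X, double Y)> { (0, 0), (10, 10) };
        for (int i = 0; i < 3; i++)
            centers = Step(points, centers);
        foreach (var c in centers)
            Console.WriteLine(c.X + ", " + c.Y);
    }
}
```

A center that attracts no points disappears from the output, which is also how the `GroupBy` formulation on the slide behaves.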
  • 26. Putting it all together: It’s LINQ all the way down
    Major League Baseball dataset
    Pitch-by-pitch data for every MLB game since 2007
    47,909 pitch XML files (one for each pitcher appearance)
    6,127 player XML files (one for each player)
    Hash partition the input data files to distribute the work
    LINQ to XML to shred the data
    DryadLINQ to analyze dataset
  • 27. Load the dataset and partition: Define Pitch and Player classes
    void StagePitchData(string[] fileList, string PartitionedFile)
    {
        // partition the list of filenames across
        // 20 nodes of the cluster
        var pitches = fileList.ToPartitionedTable("filelist")
            .HashPartition((x) => (x), 20)
            .SelectMany((f) => XElement.Load(f).Elements("atbat"))
            .SelectMany((a) => a.Elements("pitch")
                .Select((p) => new Pitch((string)a.Attribute("pitcher"),
                                         (string)a.Attribute("batter"),
                                         p)));
        pitches.ToPartitionedTable(PartitionedFile);
    }
    void StagePlayerData(string[] fileList, string PartitionedFile)
    {
        // one Player object per player XML file
        var players = fileList.Select((p) => new Player(XElement.Load(p)));
        players.ToPartitionedTable(PartitionedFile);
    }
  • 28. Analyze dataset with LINQ
    IQueryable<Pitch> FindFastest(IQueryable<Pitch> pitches, int count)
    {
    return pitches.OrderByDescending((p) => p.StartSpeed)
    .Take(count);
    }
  • 29. Supports LINQ Joins
    IQueryable<string>
    FindFastestPitchers(IQueryable<Pitch> pitches,
    IQueryable<Player> players,
    int count)
    {
    return pitches.OrderByDescending((p) => p.StartSpeed)
    .Take(count)
    .Join(players,
    (o) => o.Pitcher,
    (i) => i.Id,
    (o, i) => i.FirstName + " " + i.LastName)
    .Distinct();
    }
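The top-N and join queries above can be exercised on a few in-memory records with LINQ to Objects. The `Pitch` and `Player` classes below are assumed shapes containing only the members the queries touch; the real classes in the talk are built from XML and are not shown in the transcript.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical minimal record types for the example.
public class Pitch
{
    public int Pitcher;        // id of the pitcher who threw the pitch
    public double StartSpeed;
}

public class Player
{
    public int Id;
    public string FirstName;
    public string LastName;
}

public static class PitchQueryDemo
{
    public static List<string> FindFastestPitchers(
        IEnumerable<Pitch> pitches, IEnumerable<Player> players, int count)
    {
        return pitches.OrderByDescending(p => p.StartSpeed)
                      .Take(count)
                      .Join(players,
                            o => o.Pitcher,
                            i => i.Id,
                            (o, i) => i.FirstName + " " + i.LastName)
                      .Distinct()
                      .ToList();
    }

    public static void Main()
    {
        var pitches = new[]
        {
            new Pitch { Pitcher = 1, StartSpeed = 101.0 },
            new Pitch { Pitcher = 2, StartSpeed = 99.0 },
            new Pitch { Pitcher = 1, StartSpeed = 100.0 },
            new Pitch { Pitcher = 3, StartSpeed = 90.0 },
        };
        var players = new[]
        {
            new Player { Id = 1, FirstName = "Ann", LastName = "Able" },
            new Player { Id = 2, FirstName = "Bob", LastName = "Baker" },
            new Player { Id = 3, FirstName = "Cy", LastName = "Cole" },
        };
        foreach (var name in FindFastestPitchers(pitches, players, 3))
            Console.WriteLine(name);
    }
}
```

`Distinct()` matters here because one pitcher can own several of the fastest pitches, and the join would otherwise repeat the name once per pitch.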
  • 30. DryadLINQ on HPC Server
    DryadLINQ program runs on client workstation
    Develop, debug, run locally
    When ToPartitionedTable() is called, the query expression is materialized (codegen, query plan, optimization) and a job is submitted to HPC Server
    HPC Server allocates resources for the job and schedules the single task. This task is the Dryad Job Manager
    The JM then schedules additional tasks to execute the vertices of the DryadLINQ query
    When the job completes, the client program picks up the output result and continues.
  • 31. Examples of DryadLINQ Applications
    Data mining
    Analysis of service logs for network security
    Analysis of Windows Watson/SQM data
    Cluster monitoring and performance analysis
    Graph analysis
    Accelerated Page-Rank computation
    Road network shortest-path preprocessing
    Image processing
    Image indexing
    Decision tree training
    Epitome computation
    Simulation
    Light flow simulations for next-generation display research
    Monte-Carlo simulations for mobile data
    eScience
    Machine learning platform for health solutions
    Astrophysics simulation
  • 32. Ongoing Work
    Advanced query optimizations
    Combination of static analysis and annotations
    Sampling execution of the query plan
    Dynamic query optimization
    Incremental computation
    Real-time event processing
    Global scheduling
    Dynamically allocate cluster resources between multiple concurrent DryadLINQ applications
    Scale-out partitioned storage
    Pluggable storage providers
    DryadLINQ on Azure
    Better debugging, performance analysis, visualization, etc.
  • 33. Additional Resources
    Dryad and DryadLINQ
    http://connect.microsoft.com/DryadLINQ
    DryadLINQ source, Dryad binaries, documentation, samples, blog, discussion group, etc.
    PLINQ
    Available in Parallel Extensions to .NET Framework 3.5 CTP
    Available in .NET Framework 4.0 Beta 2
    http://msdn.microsoft.com/en-us/concurrency/default.aspx
    http://msdn.microsoft.com/en-us/magazine/cc163329.aspx
    Windows HPC Server 2008
    http://www.microsoft.com/hpc
    Download it, try it, we want your feedback!
  • 34. Questions?
  • 35. YOUR FEEDBACK IS IMPORTANT TO US!
    Please fill out session evaluation forms online at
    MicrosoftPDC.com
  • 36. Learn More On Channel 9
    Expand your PDC experience through Channel 9.
    Explore videos, hands-on labs, sample code and demos through the new Channel 9 training courses.
    channel9.msdn.com/learn
    Built by Developers for Developers….
