Large Data Volume
• 100s of TBs to 10s of PBs
New Economics
• Large-scale processing and analytics at unprecedentedly low cost (hardware and software)
New Technologies
• Distributed parallel processing frameworks
• Easy to scale on commodity hardware
• MapReduce-style programming models
Non-Traditional Data Types
• Unstructured
• Weak relational schema
• Text, images, videos, logs
New Data Sources
• Sensors
• Devices
• Traditional applications
• Web servers
• Public data
New Questions & New Insights
• How popular is my product?
• What is the best ad to serve?
• Is this a fraudulent transaction?
// Parse every non-comment line of the log into a LogEntry.
var logentries =
    from line in logs
    where !line.StartsWith("#")
    select new LogEntry(line);

// Keep only the accesses made by users whose name ends in "sen".
var user =
    from access in logentries
    where access.user.EndsWith(@"sen")
    select access;

// Count those accesses per page.
var accesses =
    from access in user
    group access by access.page into pages
    select new UserPageCount("sen", pages.Key, pages.Count());

// Keep the .htm pages, most frequently accessed first.
var htmAccesses =
    from access in accesses
    where access.page.EndsWith(".htm")
    orderby access.count descending
    select access;
LINQ query transformed into computation graph
[Figure: computation graph stages, in order: Input, Compute, Compute and resort, Compute and resort, Output]
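To make the four query stages above concrete, here is a minimal sketch that runs the same query locally with LINQ to Objects instead of on a cluster. The slide does not show the definitions of LogEntry and UserPageCount, so the versions below are hypothetical, as are the one-space "user page" line format, the LogQuery class, and the sample log lines.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical record types: the slide names them but does not define them.
public class LogEntry
{
    public string user;
    public string page;

    // Assumed line format: "<user> <page>", separated by a single space.
    public LogEntry(string line)
    {
        var fields = line.Split(' ');
        user = fields[0];
        page = fields[1];
    }
}

public class UserPageCount
{
    public string user;
    public string page;
    public int count;

    public UserPageCount(string user, string page, int count)
    {
        this.user = user;
        this.page = page;
        this.count = count;
    }
}

public static class LogQuery
{
    // Same four stages as the slide, evaluated over an in-memory sequence.
    public static List<UserPageCount> HtmAccessesFor(IEnumerable<string> logs, string userSuffix)
    {
        var logentries =
            from line in logs
            where !line.StartsWith("#")
            select new LogEntry(line);

        var user =
            from access in logentries
            where access.user.EndsWith(userSuffix)
            select access;

        var accesses =
            from access in user
            group access by access.page into pages
            select new UserPageCount(userSuffix, pages.Key, pages.Count());

        var htmAccesses =
            from access in accesses
            where access.page.EndsWith(".htm")
            orderby access.count descending
            select access;

        return htmAccesses.ToList();
    }

    public static void Main()
    {
        var logs = new[]
        {
            "#fields: user page",
            "sen /index.htm",
            "sen /about.htm",
            "sen /index.htm",
            "sen /logo.png",
            "guest /index.htm",
        };

        foreach (var row in HtmAccessesFor(logs, "sen"))
            Console.WriteLine("{0} {1} {2}", row.user, row.page, row.count);
    }
}
```

On the sample lines this prints "sen /index.htm 2" followed by "sen /about.htm 1". The group and orderby stages are the ones that need all matching records brought together, which is presumably why the graph shows two "Compute and resort" stages.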
[Figure: computation graph with inputs, processing vertices, edges (files), and outputs]
[Figure: the same graph, shown with free compute resources]
[Figure: application that calls LINQ to HPC APIs; HPC Head Node; DSC; HPC Compute Nodes running a Graph Manager and Vertex Hosts]
1. Submit LINQ to HPC job
2a. A LINQ to HPC job starts 1 basic task, assigning a node as the DGM (Graph Manager)
2b. The LINQ to HPC job also starts a set of parametric sweep tasks across the rest of the nodes as DVH (Vertex Host)
3a. Graph Manager starts/stops vertices
3b. LINQ to HPC vertices read and write files
[Figure: HPC Compute Nodes running a Graph Manager and Vertex Hosts]
3a. Graph Manager starts/stops Dryad vertices
3b. Vertices read and write files
Vertices in logical computation graph
• Graph manager starts vertices on Vertex Hosts
• Preferentially schedules vertices near input files
  When input is already on the cluster, local IO can be the common case
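The scheduling preference above can be sketched as a small host-selection function. This is an illustration only, not the Graph Manager's actual algorithm: the LocalityScheduler class, the PickHost method, and the node and file names are all hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class LocalityScheduler
{
    // Picks a host for one vertex: prefer a free host that already stores the
    // vertex's input file, otherwise fall back to the first free host.
    // Returns null when no host is free.
    public static string PickHost(
        string inputFile,
        IDictionary<string, ISet<string>> filesOnHost,
        ICollection<string> freeHosts)
    {
        var local = freeHosts.FirstOrDefault(
            h => filesOnHost.ContainsKey(h) && filesOnHost[h].Contains(inputFile));
        return local ?? freeHosts.FirstOrDefault();
    }

    public static void Main()
    {
        // Hypothetical cluster: each node stores one partition of the input.
        var filesOnHost = new Dictionary<string, ISet<string>>
        {
            { "node1", new HashSet<string> { "part0" } },
            { "node2", new HashSet<string> { "part1" } },
        };
        var freeHosts = new List<string> { "node1", "node2" };

        // Input is stored on node2, so the vertex can read it locally.
        Console.WriteLine(PickHost("part1", filesOnHost, freeHosts));
        // No host stores this file, so any free host is acceptable.
        Console.WriteLine(PickHost("part9", filesOnHost, freeHosts));
    }
}
```

The fallback branch matters because locality is a preference rather than a requirement: a vertex whose input is not local still runs, it just reads its file over the network.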
[Figure: application that calls LINQ to HPC APIs; HPC Head Node; DSC; HPC Compute Nodes running the LINQ to HPC Graph Manager and LINQ to HPC Vertex Hosts]
1. Publish to share:
   1. binaries for LINQ to HPC job
   2. XML description of LINQ to HPC graph
2a. A LINQ to HPC job starts 1 basic task, assigning a node as the DGM
2b. The LINQ to HPC job also starts a set of parametric sweep tasks across the rest of the nodes as DVH
• DGM reads XML description of graph from share, calls DSC to locate files referenced in XML
• DVH loads binaries for this LINQ to HPC job from share, executes them according to commands from DGM
DSC NODE ADD sen-cn1 /TEMPPATH:c:\DryadHpcTemp /DATAPATH:c:\DryadHpcData /SERVICE:sen-hn
using System;
using System.Linq;
using Microsoft.Hpc.Linq;

namespace MyProgram {
    class Program {
        static void Main(string[] args) {
            // Connect to the cluster through its head node.
            var config = new HpcLinqConfiguration("MyHpcClusterHeadNode");
            var context = new HpcLinqContext(config);

            // Read the "MyTextData" file set from DSC and project each line to its length.
            var lengths = context.FromDsc<LineRecord>("MyTextData")
                                 .Select(r => r.Line.Length);

            Console.WriteLine("The maximum line length is {0}", lengths.Max());
        }
    }
}
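For comparison, the same Select and Max pipeline can be tried without a cluster. The sketch below uses plain LINQ to Objects over an in-memory array; it is a local stand-in for the query's logic, not the LINQ to HPC API, and the LocalMaxLineLength class and sample lines are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class LocalMaxLineLength
{
    // Project each line to its length, then take the maximum.
    // Enumerable.Max throws InvalidOperationException on an empty sequence.
    public static int MaxLineLength(IEnumerable<string> lines)
    {
        return lines.Select(line => line.Length).Max();
    }

    public static void Main()
    {
        var lines = new[] { "alpha", "be", "gamma ray" };
        Console.WriteLine("The maximum line length is {0}", MaxLineLength(lines));
    }
}
```

With these three lines the program reports a maximum length of 9.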
[Figure: HPC software stack, top to bottom]
Programming models: LINQ to HPC (NEW)
Distributed runtimes: MPI, SOA, LINQ to HPC runtime
Cluster and cloud services: HPC provisioning, management, etc.
Platform: Windows Server, Azure*
DSC (Distributed Storage Catalog): binds individual NTFS shares together to support the LINQ to HPC distributed runtime
* Future support planned
Microsoft Big Data End-to-End
[Figure: end-to-end data flow. Labels: Sensors, Devices, Apps, Bots, Crawlers; ERP, CRM, LOB; Hadoop; HPC Server; Integration Services; SQL EDW; Data Marts; SSAS; SSRS; Data & Compute Intensive HPC App; Interactive Reports; Performance Scorecard; PowerPivot; Embedded BI Apps]
microsoft.com/learning/en/us/exam.aspx?ID=70-690
www.microsoft.com/teched www.microsoft.com/learning
http://microsoft.com/technet http://microsoft.com/msdn
http://northamerica.msteched.com
LINQ to HPC: Developing Big Data Applications on Windows HPC Server