Arvindsujeeth scaladays12

Arvind K. Sujeeth, HyoukJoong Lee, Kevin J. Brown,
Hassan Chafi, Michael Wu, Victoria Popic, Kunle Olukotun
Stanford University
Pervasive Parallelism Laboratory (PPL)

Tiark Rompf, Aleksandar Prokopec, Vojin Jovanovic,
Philipp Haller, Martin Odersky
Ecole Polytechnique Federale de Lausanne (EPFL)
Programming Methods Laboratory (LAMP)

DSLs can be used for
high performance, too

[Figure: application domains (Scientific Engineering, Virtual Worlds, Personal Robotics, Data Informatics) on one side, with DSLs placed between them and the programming models below.]

Programming models and the hardware each targets:
  Pthreads, OpenMP: Sun T2
  CUDA, OpenCL: Nvidia Fermi
  Verilog, VHDL: Altera FPGA
  MPI, PGAS: Cray Jaguar

Too many different programming models

n  Tiark Rompf’s talk yesterday

n  In case you missed it:
n  Techniques for rewriting high-level
programs to high-performance programs

n  Build an intermediate representation (IR)
of Scala programs at runtime

n  IR can be optimized and code generated

n  Introduction to existing Delite DSLs

n  Constructing your own Delite DSL

n  Not covered – under the covers:
n  Implementation details about the Delite
framework

n  See http://cgo2012.hyperdsls.org/

n  Syntax is legal Scala
A B A C
n  Staged
to build an IR * *
(metaprogramming) +

n  Optimized at a high level

n  Compiled
to different low-level target
architectures
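To make the staging step concrete, here is a minimal plain-Scala sketch of operators that record IR nodes instead of computing values. It does not use Delite or Lightweight Modular Staging; `Exp`, `Sym`, `Times`, `Plus`, and `countTimes` are hypothetical names chosen only for illustration.

```scala
// Hypothetical mini-IR; these names are illustrative and are not Delite or LMS classes.
sealed trait Exp {
  // "Staged" operators: each call builds an IR node rather than computing a result.
  def *(that: Exp): Exp = Times(this, that)
  def +(that: Exp): Exp = Plus(this, that)
}
final case class Sym(name: String) extends Exp
final case class Times(a: Exp, b: Exp) extends Exp
final case class Plus(a: Exp, b: Exp) extends Exp

object Staging {
  // A toy analysis over the IR: count the multiplication nodes.
  def countTimes(e: Exp): Int = e match {
    case Sym(_)      => 0
    case Times(a, b) => 1 + countTimes(a) + countTimes(b)
    case Plus(a, b)  => countTimes(a) + countTimes(b)
  }
}
```

Writing `Sym("A") * Sym("B") + Sym("A") * Sym("C")` is ordinary Scala syntax, yet the value it produces is a tree that later passes can inspect, optimize, and turn into code.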

n  OptiML (Machine Learning)
n  OptiQL (Data querying)
n  OptiGraph (Large-scale graph analysis)
n  OptiCollections (Scala collections)
n  OptiMesh (Mesh-based PDE solvers)

Coming soon:

n  OptiSDR (Software-defined radio)
n  OptiCVX (Convex optimization)

OptiML: An Implicitly Parallel Domain-Specific Language for
Machine Learning, ICML 2011

n  Provides a familiar (MATLAB-like) language and
API for writing ML applications
n  Ex. val
c
=
a
*
b
(a, b are Matrix[Double])

n  Implicitly parallel data structures
n  Base types: Vector[T], Matrix[T], Graph[V,E], Stream[T]
n  Subtypes: TrainingSet, IndexVector, Image, …

n  Implicitly parallel control structures
n  sum{…}, (0::end) {…}, gradient { … }, untilconverged { … }
n  Arguments to control structures are anonymous functions with
restricted semantics

untilconverged(mu, tol){ mu =>
  // calculate distances to current centroids

  // move each cluster centroid to the
  // mean of the points assigned to it
}


untilconverged(mu, tol){ mu =>
  // calculate distances to current centroids
  val c = (0::m){ i =>
    val allDistances = mu mapRows { centroid =>
      dist(x(i), centroid)
    }
    allDistances.minIndex    // slide annotation: fused
  }

  // move each cluster centroid to the
  // mean of the points assigned to it
  val newMu = (0::k,*){ i =>
    val (weightedpoints, points) = sum(0,m) { j =>
      // point j counts toward centroid i only if it was assigned to i
      if (c(j) == i) (x(j),1)
    }
    val d = if (points == 0) 1 else points
    weightedpoints / d
  }
  newMu
}
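For comparison, one iteration of the same k-means update can be sketched with ordinary Scala collections. This is a sequential illustration only, not OptiML code; `KMeansSketch`, `Point`, and `step` are hypothetical names, and `dist` is assumed here to be squared Euclidean distance, which the slide does not specify.

```scala
object KMeansSketch {
  type Point = Array[Double]

  // Assumption: squared Euclidean distance stands in for the slide's `dist`.
  def dist(a: Point, b: Point): Double =
    a.indices.map(i => (a(i) - b(i)) * (a(i) - b(i))).sum

  // One iteration: assign each point to its nearest centroid, then recompute the centroids.
  def step(x: Array[Point], mu: Array[Point]): Array[Point] = {
    // c(j) is the index of the centroid closest to point j
    val c = x.map { p =>
      val allDistances = mu.map(centroid => dist(p, centroid))
      allDistances.indexOf(allDistances.min)
    }
    mu.indices.map { i =>
      val members = x.indices.filter(j => c(j) == i).map(x(_))
      // An empty cluster yields a zero vector, avoiding a division by zero.
      if (members.isEmpty) Array.fill(mu(i).length)(0.0)
      else mu(i).indices.map(k => members.map(_(k)).sum / members.size).toArray
    }.toArray
  }
}
```

The OptiML version expresses the same two phases, but its `(0::m)`, `mapRows`, and `sum` constructs are implicitly parallel and can be fused by the compiler, which this plain-collections sketch does not attempt.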

n  Dataquerying of in-memory
collections
n  inspired by LINQ

n  SQL-like declarative language

n  Use
high-level semantic knowledge to
implement query optimizer

// lineItems: Iterable[LineItem]
// Similar to Q1 of the TPCH benchmark
val q = lineItems Where(_.l_shipdate <= Date("19981201")).   // slide annotation: hoisted
  GroupBy(l => (l.l_linestatus)).
  Select(g => new Result {
    val lineStatus = g.key
    val sumQty = g.Sum(_.l_quantity)
    val sumDiscountedPrice =
      g.Sum(r => r.l_extendedprice*(1.0-r.l_discount))       // slide annotation: fused
    val avgPrice = g.Average(_.l_extendedprice)
    val countOrder = g.Count
  })
  OrderBy(_.returnFlag) ThenBy(_.lineStatus)

n  A DSL for large-scale graph analysis based
on Green-Marl
Green-Marl: A DSL for Easy and Efficient Graph Analysis (Hong et. al.), ASPLOS ’12

n  Directed and undirected graphs, nodes,
edges

n  Collections for node/edge storage
n  Set, sequence, order

n  Deferred assignment and parallel reductions
with bulk synchronous consistency

for (t <- G.Nodes) {                          // implicitly parallel iteration
  val rank = ((1.0 - d) / N) +
    d * Sum(t.InNbrs){ w => PR(w) / w.OutDegree }
  PR <= (t, rank)                             // deferred assignment
  diff += Math.abs(rank - PR(t))              // scalar reduction
}

Writes become visible after the loop completes
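The deferred-assignment semantics can be illustrated in plain Scala by buffering the writes and handing them back only when the loop has finished. This is a sequential sketch under assumed data structures (adjacency given as arrays of node indices); `PageRankSketch` and `step` are hypothetical names, not part of OptiGraph.

```scala
object PageRankSketch {
  // inNbrs(t) lists the nodes with an edge into t; outDegree(w) counts w's outgoing edges.
  // Returns the new ranks and the accumulated absolute change (the `diff` reduction).
  def step(pr: Array[Double], inNbrs: Array[Array[Int]], outDegree: Array[Int],
           d: Double): (Array[Double], Double) = {
    val n = pr.length
    val next = new Array[Double](n)   // deferred writes are buffered here
    var diff = 0.0
    for (t <- 0 until n) {
      // Every iteration reads only the old ranks in `pr`.
      val rank = (1.0 - d) / n + d * inNbrs(t).map(w => pr(w) / outDegree(w)).sum
      next(t) = rank                  // plays the role of PR <= (t, rank)
      diff += math.abs(rank - pr(t))
    }
    (next, diff)                      // the caller adopts `next` once the loop is done
  }
}
```

Because no iteration observes another iteration's write, the loop body has no ordering dependence between nodes, which is what allows the DSL to run the iterations in parallel.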

n  A port of a subset of Scala collections to a
staged Delite DSL

n  Demonstrates the benefits of high-level
optimization and code generation

val sourcedests = pagelinks flatMap { l =>
  val sd = l.split(":")
  val source = Long.parseLong(sd(0))
  val dests = sd(1).trim.split(" ")
  dests.map(d => (Integer.parseInt(d), source))   // slide annotation: tuples encoded as longs in back-end
}
val inverted = sourcedests groupBy (x => x._1)

Reverse web-link benchmark in OptiCollections

Program at a high level
Get high performance

Scala:

  def apply(x388:Int,x423:Int,x389:Int,
            x419:Array[Double],x431:Int,
            x433:Array[Double]) {
    val x418 = x413 * x389
    val x912_zero = { 0 }
    val x912_zero_2 = {
      1.7976931348623157E308 }
    var x912 = x912_zero
    var x912_2 = x912_zero_2
    var x425 = 0
    while (x425 < x423) {
      val x430 = x425 * 1
      val x432 = x430 * x431
      val x916_zero = {
        0.0
      }
      . . .

CUDA:

  __device__ int
  dev_collect_x478_x478(int x423,int
      x389,DeliteArray<double> x419,int
      x431,DeliteArray<double> x433,int
      x413) {
    int x418 = x413 * x389;
    int x919 = 0;
    double x919_2 = 1.7976931348623157E308;
    int x425 = 0;
    while (x425 < x423) {
      int x430 = x425 * 1;
      int x432 = x430 * x431;
      double x923 = 0.0;
      int x450 = 0;
      . . .
[Figure: OptiML vs. C++, normalized execution time for k-means and Template Matching on 1, 2, 4, and 8 CPUs, with a GPU data point]

[Figure: OptiQL vs. LINQ, normalized execution time for TPCH-Q1 and TPCH-Q2 on 1 and 8 processors]

[Figure: OptiGraph vs. Green-Marl, normalized execution time for PageRank on graphs of 100k nodes x 800k edges and 8M nodes x 64M edges, 1 to 8 processors]

[Figure: OptiCollections vs. Scala Parallel Collections, normalized execution time for the reverse web-link benchmark on 75 MB and 463 MB inputs, 1 to 8 processors]

How do I build my own Delite DSL?

[Figure: the Delite stack, top to bottom]

Domain Specific Languages: Data Analytics (OptiQL), Physics (OptiMesh), Machine Learning (OptiML), Graph Analysis (OptiGraph)

Domain Embedding Language (Scala): Modular Staging

Delite: DSL Infrastructure
  Delite Compiler: Parallel Patterns, Static Optimizations, Heterogeneous Code Generation
  Delite Runtime: Walk-time Optimizations, Locality-aware Scheduling

Heterogeneous Hardware: SMP, GPU

1. Types
   - abstract, front-end

2. Operations
   - language operators and methods available on types; represented by IR nodes

3. Data Structures
   - platform-specific concrete implementation, back-end

4. Code Generators
   - Scala traits that define how to emit code as strings for various IR nodes and platforms

5. Analyses and Optimizations (Optional)
   - IR rewriting via pattern matching, traversals/transformations (e.g. fusion)

abstract class Vector[T] extends DeliteCollection[T]
abstract class Matrix[T] extends DeliteCollection[T]
abstract class Image[T] extends Matrix[T]

Placeholders for static type checking and method dispatch; not bound to any implementation

trait VectorOps {
  // add an infix + operator to Rep[Vector[A]]
  // (slide annotation: the same abstract Vector we defined earlier)
  def infix_+(lhs: Rep[Vector[A]], rhs: Rep[Vector[A]]) =
    vector_plus(lhs, rhs)

  // abstract, applications cannot inspect what happens
  // when methods are called
  def vector_length(lhs: Rep[Vector[A]]): Rep[Int]
  def vector_plus(lhs: Rep[Vector[A]],
                  rhs: Rep[Vector[A]]): Rep[Vector[A]]
}

trait VectorOpsExp extends VectorOps with Expressions {
  // a Delite parallel op IR node
  case class VectorPlus(inA: Exp[Vector[A]], inB: Exp[Vector[A]])
    extends DeliteOpZipWith[Vector[A], Vector[A], Vector[A]] {

    // number of elements in the input collections
    def size = inA.length
    // the output collection
    def alloc = Vector[A](inA.length)
    // the ZipWith function
    def func = (a,b) => a + b
  }

  // construct IR nodes
  def vector_plus(lhs: Exp[Vector[A]], rhs: Exp[Vector[A]])
    = VectorPlus(lhs, rhs)
}

// a concrete, back-end Scala data structure
// will be instantiated by generated code
class Vector[T](__length: Int) {
  var _length = __length
  var _data: Array[T] = new Array[T](_length)
}

// corresponding data structures for other back-ends
// (CUDA, OpenCL, etc.)
// . . .

trait ScalaGenVectorOps extends ScalaGen {
  val IR: VectorOpsExp
  import IR._

  override def emitNode(sym: Sym[Any], rhs: Def[Any])
    (implicit stream: PrintWriter) =
    // generate code for particular IR nodes
    rhs match {
      case v@VectorNew(length) =>
        emitValDef(sym, "new " + remap("Vector")+"(" +
          quote(length) + ")")
      case VectorLength(x) =>
        // slide annotation: the exact back-end field name we defined earlier
        emitValDef(sym, quote(x) + "._length")
      case _ => super.emitNode(sym, rhs)
    }
}
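The idea of emitting code as strings can be tried outside Delite with a few lines of plain Scala. The sketch below is loosely modelled on the `VectorNew` and `VectorLength` cases; `Node`, `Const`, `Ref`, `VecNew`, `VecLength`, and `ScalaEmitter` are hypothetical names, and the functions return strings rather than writing to a `PrintWriter`.

```scala
// Hypothetical IR nodes; illustrative stand-ins, not Delite classes.
sealed trait Node
final case class Const(value: Int) extends Node
final case class Ref(name: String) extends Node
final case class VecNew(length: Node) extends Node
final case class VecLength(vec: Node) extends Node

object ScalaEmitter {
  // Render an operand (a constant or an already-bound symbol) as source text.
  def quote(n: Node): String = n match {
    case Const(v)  => v.toString
    case Ref(name) => name
    case other     => sys.error("operand must be a constant or a symbol: " + other)
  }

  // Emit one `val` definition binding `sym` to the code generated for `rhs`.
  def emitNode(sym: String, rhs: Node): String = rhs match {
    case VecNew(len)    => "val " + sym + " = new Vector(" + quote(len) + ")"
    case VecLength(vec) => "val " + sym + " = " + quote(vec) + "._length"
    case other          => sys.error("no generator for " + other)
  }
}
```

A generator for another target would match the same nodes and produce different strings, which is why the back-end field name used in the emitted text has to agree with the concrete data structure for that target.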

override def matrix_plus[A:Manifest:Arith]
  (x: Exp[Matrix[A]], y: Exp[Matrix[A]]) =
  (x, y) match {
    // (AB + AD) == A(B + D)
    case (Def(MatrixTimes(a, b)),
          Def(MatrixTimes(c, d))) if (a == c) =>
      // return optimized version
      matrix_times(a, matrix_plus(b,d))
    // other rewrites
    // case . . .
    case _ => super.matrix_plus(x, y)
  }
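The same rewrite can be exercised in isolation with a small plain-Scala IR. This is a sketch of the pattern only; `MatExp`, `MatSym`, `MatTimes`, `MatPlus`, and `Rewrites` are hypothetical names, and the default case builds a node directly where the Delite version delegates to `super.matrix_plus`.

```scala
// Hypothetical matrix-expression IR; a plain-Scala stand-in for Exp/Def nodes.
sealed trait MatExp
final case class MatSym(name: String) extends MatExp
final case class MatTimes(a: MatExp, b: MatExp) extends MatExp
final case class MatPlus(a: MatExp, b: MatExp) extends MatExp

object Rewrites {
  def matrixTimes(a: MatExp, b: MatExp): MatExp = MatTimes(a, b)

  // Smart constructor: inspect the operands before deciding which node to build.
  def matrixPlus(x: MatExp, y: MatExp): MatExp = (x, y) match {
    // (AB + AD) == A(B + D): one multiplication instead of two
    case (MatTimes(a, b), MatTimes(c, d)) if a == c =>
      matrixTimes(a, matrixPlus(b, d))
    // no rewrite applies, so build the plain addition node
    case _ => MatPlus(x, y)
  }
}
```

Since the check runs when the node is constructed, the unoptimized form never enters the IR, and later passes only ever see the factored expression.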

trait OptiML extends OptiMLScalaOpsPkg with VectorOps
  with MatrixOps with ...

trait OptiMLExp extends OptiMLScalaOpsPkgExp
  with VectorOpsExp with MatrixOpsExp with ...

trait OptiMLCodeGenScala extends OptiMLScalaCodeGenPkg
  with ScalaGenVectorOps with ScalaGenMatrixOps with ...

trait OptiMLCodeGenCuda extends OptiMLCudaCodeGenPkg
  with CudaGenVectorOps with CudaGenMatrixOps with ...

n  Delite DSLs target high performance
architectures from Scala

n  Open source – use them to accelerate
your apps or build your own!
n  http://github.com/stanford-ppl/Delite

n  Mailing List:
n  http://groups.google.com/group/delite-devel

n  Thank you
