Arvind K. Sujeeth, HyoukJoong Lee, Kevin J. Brown,Hassan Chafi, Michael Wu, Victoria Popic, Kunle Olukotun                ...
Squeryl
DSLs can be used forhigh performance, too
Pthreads   SunOpenMP     T2CUDA       NvidiaOpenCL     FermiVerilog    AlteraVHDL       FPGAMPIPGAS         Cray          ...
Applications                 Pthreads   Sun   Scientific    OpenMP     T2  Engineering    Virtual      CUDA       Nvidia  ...
Applications                                                 Pthreads   Sun   Scientific                                  ...
n  Tiark    Rompf’s talk yesterdayn  In   case you missed it:   n    Techniques for rewriting high-level         progra...
n  Introduction     to existing Delite DSLsn  Constructing      your own Delite DSLn  Not    covered – under the covers...
n  Syntax   is legal Scala                                  A       B       A       Cn  Staged        to build an IR    ...
n  OptiML (Machine Learning)n  OptiQL (Data querying)n  OptiGraph (Large-scale graph analysis)n  OptiCollections (Scal...
OptiML: An Implicitly Parallel Domain-Specific Language for                        Machine Learning, ICML 2011n    Provid...
untilconverged(mu,	  tol){	  mu	  =>	  	  	  	  	  //	  calculate	  distances	  to	  current	  centroids	  	  	  	  	  //	...
untilconverged(mu,	  tol){	  mu	  =>	  	  	  	  	  //	  calculate	  distances	  to	  current	  centroids	  	  	  	  	  val...
untilconverged(mu,	  tol){	  mu	  =>	  	  	  	  	  //	  calculate	  distances	  to	  current	  centroids	  	  	  	  	  val...
n  Dataquerying of in-memory  collections  n    inspired by LINQn  SQL-like    declarative languagen  Use      high-le...
//	  lineItems:	  Iterable[LineItem]	  //	  Similar	  to	  Q1	  of	  the	  TPCH	  benchmark	          hoistedval	  q	  =	 ...
n    A DSL for large-scale graph analysis based      on Green-Marl      Green-Marl: A DSL for Easy and Efficient Graph An...
Implicitly parallel iterationfor(t	  <-­‐	  G.Nodes)	  {	  	  	  val	  rank	  =	  ((1.0	  d)/	  N)	  +	  	  	  	  	  	  	 ...
n    A port of a subset of Scala collections to a      staged Delite DSLn    Demonstrates the benefits of high-level    ...
Program at a high levelGet high performance
Scala                                               CUDAdef	  apply(x388:Int,x423:Int,x389:Int,	          __device__	  int...
1                                           1.60                                                                  k-means ...
1                                            1                                                               100k nodes x ...
How do I build my own Delite DSL?
Domain            Data           Physics              Machine            Graph Specific        Analytics                  ...
1.      Types      n    abstract, front-end2.      Operations      n    language operators and methods available on type...
abstract	  class	  Vector[T]	  extends	  DeliteCollection[T]	  abstract	  class	  Matrix[T]	  extends	  DeliteCollection[T...
The same abstracttrait	  VectorOps	  {	                                             Vector we defined earlier	  	  //	  ad...
trait	  VectorOpsExp	  extends	  VectorOps	  with	  Expressions	  {	  //	  a	  Delite	  parallel	  op	  IR	  node	  case	 ...
//	  a	  concrete,	  back-­‐end	  Scala	  data	  structure	  //	  will	  be	  instantiated	  by	  generated	  code	  class...
trait	  ScalaGenVectorOps	  extends	  ScalaGen	  {	    	  val	  IR:	  VectorOpsExp	    	  import	  IR._	     	  override	 ...
override	  def	  matrix_plus[A:Manifest:Arith]	    	  (x:	  Exp[Matrix[A]],	  y:	  Exp[Matrix[A]])	  =	        	  	  	  (x...
trait	  OptiML	  extends	  OptiMLScalaOpsPkg	  with	  VectorOps	  with	    MatrixOps	  	  with	  ...	  trait	  OptiMLExp	 ...
n    Delite DSLs target high performance      architectures from Scalan    Open source – use them to accelerate      you...
Upcoming SlideShare
Loading in …5
×

Arvindsujeeth scaladays12

806
-1

Published on

Published in: Technology, Education
0 Comments
0 Likes
Statistics
Notes
  • Be the first to comment

  • Be the first to like this

No Downloads
Views
Total Views
806
On Slideshare
0
From Embeds
0
Number of Embeds
2
Actions
Shares
0
Downloads
9
Comments
0
Likes
0
Embeds 0
No embeds

No notes for slide

Arvindsujeeth scaladays12

  1. 1. Arvind K. Sujeeth, HyoukJoong Lee, Kevin J. Brown,Hassan Chafi, Michael Wu, Victoria Popic, Kunle Olukotun Stanford University Pervasive Parallelism Laboratory (PPL) Tiark Rompf, Aleksandar Prokopec, Vojin Jovanovic, Philipp Haller, Martin Odersky Ecole Polytechnique Federale de Lausanne (EPFL) Programming Methods Laboratory (LAMP)
  2. 2. Squeryl
  3. 3. DSLs can be used forhigh performance, too
  4. 4. Pthreads SunOpenMP T2CUDA NvidiaOpenCL FermiVerilog AlteraVHDL FPGAMPIPGAS Cray Jaguar
  5. 5. Applications Pthreads Sun Scientific OpenMP T2 Engineering Virtual CUDA Nvidia Worlds OpenCL Fermi Personal Robotics Verilog Altera VHDL FPGA Data Informatics MPI PGAS Cray Jaguar
  6. 6. Applications Pthreads Sun Scientific OpenMP T2 Engineering Virtual DSLs CUDA Nvidia Worlds OpenCL Fermi Personal Robotics Verilog Altera VHDL FPGA Data Informatics MPI PGAS Cray Jaguar Too many different programming models
  7. 7. n  Tiark Rompf’s talk yesterdayn  In case you missed it: n  Techniques for rewriting high-level programs to high-performance programs n  Build an intermediate representation (IR) of Scala programs at runtime n  IR can be optimized and code generated
  8. 8. n  Introduction to existing Delite DSLsn  Constructing your own Delite DSLn  Not covered – under the covers: n  Implementation details about the Delite framework n  See http://cgo2012.hyperdsls.org/
  9. 9. n  Syntax is legal Scala A B A Cn  Staged to build an IR * * (metaprogramming) +n  Optimized at a high leveln  Compiled to different low-level target architectures
  10. 10. n  OptiML (Machine Learning)n  OptiQL (Data querying)n  OptiGraph (Large-scale graph analysis)n  OptiCollections (Scala collections)n  OptiMesh (Mesh-based PDE solvers)Coming soon:n  OptiSDR (Software-defined radio)n  OptiCVX (Convex optimization)
  11. 11. OptiML: An Implicitly Parallel Domain-Specific Language for Machine Learning, ICML 2011n  Provides a familiar (MATLAB-like) language and API for writing ML applications n  Ex. val  c  =  a  *  b  (a, b are Matrix[Double])n  Implicitly parallel data structures n  Base types: Vector[T], Matrix[T], Graph[V,E], Stream[T] n  Subtypes: TrainingSet, IndexVector, Image, …n  Implicitly parallel control structures n  sum{…}, (0::end) {…}, gradient { … }, untilconverged { … } n  Arguments to control structures are anonymous functions with restricted semantics
  12. 12. untilconverged(mu,  tol){  mu  =>          //  calculate  distances  to  current  centroids          //  move  each  cluster  centroid  to  the          //  mean  of  the  points  assigned  to  it  }  
  13. 13. untilconverged(mu,  tol){  mu  =>          //  calculate  distances  to  current  centroids          val  c  =  (0::m){i  =>                val  allDistances  =  mu  mapRows  {  centroid  =>                      dist(x(i),  centroid)                }                allDistances.minIndex          }          //  move  each  cluster  centroid  to  the          //  mean  of  the  points  assigned  to  it  }  
  14. 14. untilconverged(mu,  tol){  mu  =>          //  calculate  distances  to  current  centroids          val  c  =  (0::m){i  =>                val  allDistances  =  mu  mapRows  {  centroid  =>                      dist(x(i),  centroid)                              }   fused              allDistances.minIndex          }          //  move  each  cluster  centroid  to  the          //  mean  of  the  points  assigned  to  it          val  newMu  =  (0::k,*){  i  =>                val  (weightedpoints,  points)  =  sum(0,m)  {  j  =>                      if  (c(i)  ==  j)  (x(i),1)                }                val  d  =  if  (points  ==  0)  1  else  points                  weightedpoints  /  d          }          newMu  }  
  15. 15. n  Dataquerying of in-memory collections n  inspired by LINQn  SQL-like declarative languagen  Use high-level semantic knowledge to implement query optimizer
  16. 16. //  lineItems:  Iterable[LineItem]  //  Similar  to  Q1  of  the  TPCH  benchmark   hoistedval  q  =  lineItems  Where(_.l_shipdate  <=  Date(‘‘19981201’’)).      GroupBy(l  =>  (l.l_linestatus)).      Select(g  =>  new  Result  {          val  lineStatus  =  g.key          val  sumQty  =  g.Sum(_.l_quantity)          val  sumDiscountedPrice  =              g.Sum(r  =>  r.l_extendedprice*(1.0-­‐r.l_discount))   fused        val  avgPrice  =  g.Average(_.l_extendedprice)          val  countOrder  =  g.Count      })  OrderBy(_.returnFlag)  ThenBy(_.lineStatus)  
  17. 17. n  A DSL for large-scale graph analysis based on Green-Marl Green-Marl: A DSL for Easy and Efficient Graph Analysis (Hong et. al.), ASPLOS ’12n  Directed and undirected graphs, nodes, edgesn  Collections for node/edge storage n  Set, sequence, ordern  Deferred assignment and parallel reductions with bulk synchronous consistency
  18. 18. Implicitly parallel iterationfor(t  <-­‐  G.Nodes)  {      val  rank  =  ((1.0  d)/  N)  +                              d  *  Sum(t.InNbrs){w  =>  PR(w)  /  w.OutDegree}      PR  <=  (t,rank)      diff  +=  Math.abs(rank  -­‐  PR(t))  }   Deferred assignment and scalar reduction Writes become visible after the loop completes
  19. 19. n  A port of a subset of Scala collections to a staged Delite DSLn  Demonstrates the benefits of high-level optimization and code generation val  sourcedests  =  pagelinks  flatMap  {  l  =>      val  sd  =  l.split(":")      val  source  =  Long.parseLong(sd(0))   Tuples    val  dests  =  sd(1).trim.split("  ")   encoded    dests.map(d  =>  (Integer.parseInt(d),  source))   as longs }   in back- val  inverted  =  sourcedests  groupBy  (x  =>  x._1)   end Reverse web-link benchmark in OptiCollections
  20. 20. Program at a high levelGet high performance
  21. 21. Scala CUDAdef  apply(x388:Int,x423:Int,x389:Int,   __device__  int      x419:Array[Double],x431:Int,   dev_collect_x478_x478(int  x423,int      x433:Array[Double])  {   x389,DeliteArray<double>  x419,int   x431,DeliteArray<double>  x433,int  val  x418  =  x413  *  x389   x413)  {  val  x912_zero      =  {  0  }   int  x418  =  x413  *  x389;  val  x912_zero_2  =  {     int  x919  =  0;        1.7976931348623157E308  }   double  x919_2  =  1.7976931348623157E308;  var  x912      =  x912_zero   int  x425  =  0;  var  x912_2  =  x912_zero_2   while  (x425  <  x423)  {      var  x425  =  0      int  x430  =  x425  *  1;  while  (x425  <  x423)  {          int  x432  =  x430  *  x431;      val  x430  =  x425  *  1      double  x923  =  0.0;      val  x432  =  x430  *  x431      int  x450  =  0;      val  x916_zero  =  {   .  .  .      0.0      }  .  .  .  
  22. 22. 1 1.60 k-means Normalized Execution Time 1.40 Template 0.8 Matching OptiML 1.6 1.20 1.9 1.00 0.6 C++OptiML 0.80 0.4 3.6 0.60 5.1 0.40 10.6 0.2 0.20 0.00 0 1 CPU 2 CPU 4 CPU 8 CPU 1 CPU 2 CPU 4 CPU 8 CPU GPU 2 0.63 0.52 1.6 TPCH-Q1 TPCH-Q2 Normalized Execution Time 1.5 1.2 1.0 OptiQLOptiQL 1 1.0 1.2 0.8 LINQ 2.3 2.1 0.5 0.4 6.7 0 0 1P 8P 1P 8P
  23. 23. 1 1 100k nodes x 8M nodes x 800k edges 64M edges 0.8 0.8 1.3 Normalized Execution 1.7 1.7 1.7 OptiGraph 0.6 0.6 2.1 Green MarlOptiGraph Time 2.4 0.4 0.4 3.93.8 4.3 4.8 (PageRank) 0.2 0.2 0 0 1P 2P 4P 8P 1P 2P 4P 8P 4 1.8 75 MB 0.61 463 MB 3.5 0.30 1.6 3 1.4 OptiCollections 1.2 2.5 1.0OptiCollections 2 0.52 0.71 0.8 1 1.2 Scala Parallel Collections 1.5 0.82 0.6 (Reverse web- 1 1.0 1.3 2.2 2.1 0.4 3.8 3.4 2.0 link benchmark) 0.5 3.1 0.2 5.6 0 0 1P 2P 4P 8P 1P 2P 4P 8P
  24. 24. How do I build my own Delite DSL?
  25. 25. Domain Data Physics Machine Graph Specific Analytics Learning Analysis (OptiQL) (OptiMesh) (OptiML) (OptiGraph)Languages Domain Embedding Language (Scala) Modular Staging Delite Compiler Delite: DSL Parallel PatternsInfrastructure Static Optimizations Heterogeneous Code Generation Delite Runtime Walk-time Optimizations Locality-aware SchedulingHeterogeneous SMP GPU Hardware
  26. 26. 1.  Types n  abstract, front-end2.  Operations n  language operators and methods available on types; represented by IR nodes3.  Data Structures n  platform-specific concrete implementation, back-end4.  Code Generators n  Scala traits that define how to emit code as strings for various IR nodes and platforms5.  Analyses and Optimizations (Optional) n  IR rewriting via pattern matching, traversals/transformations (e.g. fusion)
  27. 27. abstract  class  Vector[T]  extends  DeliteCollection[T]  abstract  class  Matrix[T]  extends  DeliteCollection[T]  abstract  class  Image[T]  extends  Matrix[T]  placeholders for static typechecking and method dispatch;not bound to any implementation
  28. 28. The same abstracttrait  VectorOps  {   Vector we defined earlier    //  add  an  infix  +  operator  to  Rep[Vector[A]]      def  infix_+(lhs:  Rep[Vector[A]],  rhs:  Rep[Vector[A]])  =          vector_plus(lhs,  rhs)      //  abstract,  applications  cannot  inspect  what  happens        //  when  methods  are  called      def  vector_length(lhs:  Rep[Vector[A]]):  Rep[Int]      def  vector_plus(lhs:  Rep[Vector[A]],                                      rhs:  Rep[Vector[A]]):  Rep[Vector[A]]  }  
  29. 29. trait  VectorOpsExp  extends  VectorOps  with  Expressions  {  //  a  Delite  parallel  op  IR  node  case  class  VectorPlus(inA:  Exp[Vector[A]],  inB:  Exp[Vector[A]])   extends  DeliteOpZipWith[Vector[A],  Vector[A],  Vector[A]]  {    //  number  of  elements  in  the  input  collections    def  size  =  inA.length    //  the  output  collection    def  alloc  =  Vector[A](inA.length)    //  the  ZipWith  function    def  func  =  (a,b)  =>  a  +  b  }  //  construct  IR  nodes  def  vector_plus(lhs:  Exp[Vector[A]],  rhs:  Exp[Vector[A]])    =  VectorPlus(lhs,  rhs)  }  
  30. 30. //  a  concrete,  back-­‐end  Scala  data  structure  //  will  be  instantiated  by  generated  code  class  Vector[T](__length:  Int)  {    var  _length  =  __length    var  _data:  Array[T]  =  new  Array[T](_length)  }  //  corresponding  data  structures  for  other  back-­‐ends  //  (CUDA,  OpenCL,  etc.)  //  .  .  .  
  31. 31. trait  ScalaGenVectorOps  extends  ScalaGen  {    val  IR:  VectorOpsExp    import  IR._    override  def  emitNode(sym:  Sym[Any],  rhs:  Def[Any])    (implicit  stream:  PrintWriter)  =        //  generate  code  for  particular  IR  nodes        rhs  match  {   The exact          case  v@VectorNew(length)  =>   back-end field                  emitValDef(sym,  “new  "  +  remap("Vector")+"("  +                              quote(length)  +  ")")   name we          case  VectorLength(x)  =>     defined earlier              emitValDef(sym,  quote(x)  +  ".  _length")            case  _  =>  super.emitNode(sym,  rhs)          }  }  
  32. 32. override  def  matrix_plus[A:Manifest:Arith]    (x:  Exp[Matrix[A]],  y:  Exp[Matrix[A]])  =        (x,  y)  match  {                //  (AB  +  AD)  ==  A(B  +  D)                case  (Def(MatrixTimes(a,  b)),                            Def(MatrixTimes(c,  d)))  if  (a  ==  c)  =>                        //  return  optimized  version                    matrix_times(a,  matrix_plus(b,d))      //  other  rewrites      //  case  .  .  .            case  _  =>  super.matrix_plus(x,  y)        }  
  33. 33. trait  OptiML  extends  OptiMLScalaOpsPkg  with  VectorOps  with   MatrixOps    with  ...  trait  OptiMLExp  extends  OptiMLScalaOpsPkgExp  with   VectorOpsExp  with  MatrixOpsExp    with  ...  trait  OptiMLCodeGenScala  extends  OptiMLScalaCodeGenPkg  with   ScalaGenVectorOps  with  ScalaGenMatrixOps    with  ...  trait  OptiMLCodeGenCuda  extends  OptiMLCudaCodeGenPkg  with   CudaGenVectorOps  with  CudaGenMatrixOps    with  ...  
  34. 34. n  Delite DSLs target high performance architectures from Scalan  Open source – use them to accelerate your apps or build your own! n  http://github.com/stanford-ppl/Deliten  Mailing List: n  http://groups.google.com/group/delite-develn  Thank you

×