Programming Models for High-performance Computing (CCGrid13 talk, May 2013)

  1. Programming Systems for High-Performance Computing
     Marc Snir
  2. Preamble
  3. In The Beginning
     § Distributed memory parallel systems are replacing vector machines.
     § CM-5 (using SPARC) announced 1991
     – A 1024-node CM-5 tops the first TOP500 list in 1993
     § Problem: need to rewrite all (vector) code in a new programming model
  4. Some of the Attempts
     [Timeline figure, 1990–2010: HPF (HPF1, HPF2); MPI (MPI1, MPI2, MPI3); PGAS (UPC, CAF); shared memory (OpenMP V1, V2, V3); GPGPU (OpenCL, OpenACC). Legend: widely used / does not scale / too early to say / little to no public use.]
  5. Questions
     1. Message passing is claimed to be "a difficult way to program". If so, why is it the only widely used way to program HPC systems?
     2. What is needed for a programming system to succeed?
     3. What is needed in a good HPC parallel programming system?
  6. Definitions
     § Parallel programming model: a parallel programming pattern
     – Bulk synchronous, SIMD, message passing…
     § Parallel programming system: a language or library that implements one or multiple models
     – MPI+C, OpenMP…
     § MPI+C implements multiple models (message passing, bulk synchronous…)
     § MPI+OpenMP is a hybrid system
  7. Why Did MPI Succeed?
     Beyond the expectations of its designers
  8. Some Pragmatic Reasons for the Success of MPI
     § Clear need: each vendor had its own message-passing library; there were no major differences between them, and no significant competitive advantage. Users wanted portability.
     – The same was true of OpenMP
     § Quick implementation: MPI libraries were available as soon as the standard was ready
     – As distinct from HPF [rise & fall of HPF]
     § Cheap: a few person-years (at most) to port MPI to a new platform
  9. Some Technical Reasons for the Success of MPI
     § Good performance: HPC programming is about performance. An HPC programming system must expose the raw hardware performance to the user. MPI mostly did that.
     § Performance portability: the same optimizations (reduce communication volume, aggregate communication, use collectives, overlap computation and communication) are good for any platform
     – This was a major weakness of HPF [rise & fall of HPF]
     § Performance transparency: few surprises; a simple performance model (e.g., the postal model, a + b·m) predicts communication performance well most of the time (see the sketch below)
     – Another weakness of HPF
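To make the postal model concrete (my illustration, not from the slides): the time to send a message of m bytes is modeled as T(m) = a + b·m, where a is a fixed per-message cost (latency plus software overhead) and b is the per-byte cost. The constants below are assumed round numbers, not measurements; the point is that the model immediately explains why aggregating communication pays off.

```c
#include <stdio.h>

/* Postal model: T(m) = a + b*m.
 * Illustrative constants (assumptions, not measurements):
 * a = 1 microsecond per message, b = 0.1 ns per byte (~10 GB/s). */
static double postal_time(double m_bytes) {
    const double a = 1e-6;   /* per-message cost, seconds */
    const double b = 1e-10;  /* per-byte cost, seconds/byte */
    return a + b * m_bytes;
}

int main(void) {
    /* Aggregation: one 1 MB message vs. 1024 messages of 1 KB each. */
    printf("one 1 MB message: %g s\n", postal_time(1 << 20));
    printf("1024 x 1 KB:      %g s\n", 1024 * postal_time(1 << 10));
    return 0;
}
```

Under these assumed constants the aggregated transfer takes about 0.1 ms, while the 1024 small messages take over 1 ms, dominated by the per-message term a.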
  10. How About Productivity?
     § Coding productivity is much overrated: a relatively small fraction of programming effort is coding; debugging, tuning, testing & validating are where most of the time is spent. MPI is no worse than existing alternatives in these aspects of code development.
     § MPI code is a very small fraction of a large scientific framework
     § Programmer productivity depends on the entire tool chain (compiler, debugger, performance tools, etc.). It is easier to build a tool chain for a communication library than for a new language.
  11. Some Things MPI Got Right
     § Communicators (see the sketch below)
     – Support for time and space multiplexing
     § Collective operations
     § Orthogonal, object-oriented design
     [Diagram: an application layered over a library, and two coupled applications (App1, Coupler, App2), with each component scoped to its own communicator.]
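A minimal sketch (mine, not from the slides) of communicators and collectives together: MPI_Comm_split carves MPI_COMM_WORLD into two disjoint teams (space multiplexing), and a collective then runs independently within each team.

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Space multiplexing: split the world into two halves,
     * e.g., a simulation code and a coupled analysis code. */
    int color = (rank < size / 2) ? 0 : 1;
    MPI_Comm team;
    MPI_Comm_split(MPI_COMM_WORLD, color, rank, &team);

    /* A collective scoped to the sub-communicator: each team computes
     * its own sum of ranks without interfering with the other team. */
    int sum;
    MPI_Allreduce(&rank, &sum, 1, MPI_INT, MPI_SUM, team);
    printf("rank %d (team %d): team sum = %d\n", rank, color, sum);

    MPI_Comm_free(&team);
    MPI_Finalize();
    return 0;
}
```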
  12. MPI Forever?
     § MPI will continue to be used in the exascale time frame
     § MPI may increasingly become a performance bottleneck at exascale
     § It's not MPI; it is MPI+X, where X is used for node programming. The main problem is the lack of an adequate X.
  13. It is Hard to Kill MPI
     § By now there are tens of millions of lines of code that use MPI
     § MPI continues to improve (MPI3)
     § MPI performance can still be improved in various ways (e.g., JIT specialization of MPI calls)
     § MPI scalability can still be improved in a variety of ways (e.g., distributed data structures)
     § The number of nodes in future exascale systems will not be significantly higher than current numbers
  14. What's Needed for a New System to Succeed?
  15. Factors for Success
     § Compelling need
     – MPI runs out of steam (?)
     – OpenMP runs out of steam (!)
     – MPI+OpenMP+CUDA+… becomes too baroque
     – Fault tolerance and power management require new programming models
     § A solution to the backward-compatibility problem
     § Limited implementation cost & fast deployment
     § Community support (?)
  16. Potential Problems with MPI
     § Software overhead for communication
     – Strong scaling requires finer granularity
     § MPI processes with hundreds of threads – serial bottlenecks
     § Interface to node programming models: the basic design of MPI is for a single-threaded process; tasks are not MPI entities
     § Interface to heterogeneous nodes (NUMA, deep memory hierarchy, possibly non-coherent memory)
     § Fault tolerance
     – No good exception model
     § Good RDMA leverage (low-overhead one-sided communication); MPI-3's answer is sketched below
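For the last point, a minimal sketch (my illustration) of MPI-3 one-sided communication, the part of MPI designed to map onto RDMA hardware: a process exposes memory as a window, and another process writes into it with no matching receive.

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Each process exposes one double as an RMA window. */
    double local = 0.0;
    MPI_Win win;
    MPI_Win_create(&local, sizeof(double), sizeof(double),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    MPI_Win_fence(0, win);                 /* open an access epoch */
    if (rank == 0) {
        /* One-sided write into rank 1's window: rank 1 posts no receive. */
        double value = 3.14;
        MPI_Put(&value, 1, MPI_DOUBLE,
                1 /* target rank */, 0 /* displacement */,
                1, MPI_DOUBLE, win);
    }
    MPI_Win_fence(0, win);                 /* complete the transfer */

    if (rank == 1) printf("rank 1 received %g\n", local);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
```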
  17. Potential Problems with OpenMP
     § No support for locality – flat memory model (illustrated below)
     § Parallelism is most naturally expressed using fine-grain parallelism (parallel loops) and serializing constructs (locks, serial sections)
     § Support for heterogeneity is lacking:
     – OpenACC has a narrow view of heterogeneous architecture (two non-coherent memory spaces)
     – No support for NUMA (memory heterogeneity)
     (There is significant uncertainty about future node architecture)
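To illustrate the flat memory model (my example, not from the slides): an OpenMP loop expresses parallelism but says nothing about where the data lives, so on a NUMA node threads may end up streaming operands from a remote socket.

```c
#include <stdlib.h>

/* A textbook OpenMP loop. The directive distributes iterations, but
 * there is no way to say in which NUMA domain each chunk of a[] and
 * b[] should reside; placement is left to the OS (e.g., first touch). */
void scale(double *a, const double *b, double s, long n) {
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < n; i++)
        a[i] = s * b[i];
}

int main(void) {
    long n = 1L << 24;
    double *a = malloc(n * sizeof *a);
    double *b = calloc(n, sizeof *b);  /* touched (hence placed) by one thread */
    scale(a, b, 2.0, n);               /* may incur remote-memory traffic */
    free(a);
    free(b);
    return 0;
}
```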
  18. New Needs
     § Current mental picture of parallel computation: equal computation effort means equal computation time
     – Significant effort is invested to make this true: jitter avoidance
     § This mental picture is likely to be inaccurate in the future
     – Power management is local
     – Fault correction is local
     § Global synchronization is evil!
  19. What Do We Need in a Good HPC Programming System?
  20. Our Focus
     [Stack diagram: multiple domain-specific programming systems are mapped, via automated mapping, onto a common low-level parallel programming system, which in turn runs on multiple systems, each with a specialized (?) software stack.]
  21. Shared Memory Programming Model
     Basics:
     § Program expresses parallelism and dependencies (ordering constraints)
     § Run-time maps execution threads so as to achieve load balance
     § Global name space
     § Communication is implied by referencing
     § Data is migrated close to the executing thread by caching hardware
     § User reduces communication by ordering accesses so as to achieve good temporal, spatial, and thread locality
     Incidentals:
     § Shared memory programs suffer from races (see the sketch below)
     § Shared memory programs are nondeterministic
     § Lack of collective operations
     § Lack of parallel teams
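To make the race point concrete, a small example of my own (not from the slides): two OpenMP versions of the same sum, one with a data race and one repaired with a reduction.

```c
#include <stdio.h>

int main(void) {
    double racy = 0.0, correct = 0.0;

    /* Data race: all threads read-modify-write `racy` concurrently,
     * so the result varies from run to run (nondeterminism). */
    #pragma omp parallel for
    for (int i = 1; i <= 1000000; i++)
        racy += 1.0 / i;

    /* Repaired: reduction gives each thread a private partial sum
     * and combines them safely at the end of the loop. */
    #pragma omp parallel for reduction(+:correct)
    for (int i = 1; i <= 1000000; i++)
        correct += 1.0 / i;

    printf("racy = %.15f  correct = %.15f\n", racy, correct);
    return 0;
}
```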
  22. Distributed Memory Programming Model (for HPC)
     Basics:
     § Program distributes data, maps execution to data, and expresses dependencies
     § Communication is usually explicit, but can be implicit (PGAS)
     – Explicit communication is needed for performance
     – User explicitly optimizes communication (no caching)
     Incidentals:
     § Lack of a global name space
     § Use of send-receive (see the halo-exchange sketch below)
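As an illustration of explicit, user-optimized communication (my sketch, not from the slides): a 1-D halo exchange written with MPI_Sendrecv, where the programmer names every transfer and its target.

```c
#include <mpi.h>

/* 1-D halo exchange: each rank owns u[1..n]; ghost cells u[0] and
 * u[n+1] hold copies of the neighbors' boundary values. Every byte
 * moved is spelled out by the programmer. */
void halo_exchange(double *u, int n, MPI_Comm comm) {
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);
    int left  = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;
    int right = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;

    /* Ship my left boundary left; receive my right ghost from the right. */
    MPI_Sendrecv(&u[1],     1, MPI_DOUBLE, left,  0,
                 &u[n + 1], 1, MPI_DOUBLE, right, 0,
                 comm, MPI_STATUS_IGNORE);
    /* Ship my right boundary right; receive my left ghost from the left. */
    MPI_Sendrecv(&u[n], 1, MPI_DOUBLE, right, 1,
                 &u[0], 1, MPI_DOUBLE, left,  1,
                 comm, MPI_STATUS_IGNORE);
}
```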
  23. Hypothetical Future Architecture: (At Least) 3 Levels
     § Assume core heterogeneity is transparent to the user (the alternative is too depressing…)
     [Diagram: many nodes; each node holds several coherence domains; each coherence domain holds several cores.]
  24. Possible Direction
     § Need, for future nodes, a model intermediate between shared memory and distributed memory
     – From shared memory: global name space, implicit communication, and caching
     – From distributed memory: control of data location and communication aggregation
     – With very good support for latency hiding and fine-grain parallelism
     § Such a model could be used uniformly across the entire memory hierarchy!
     – With a different mix of hardware and software support
  25. Extending the Shared Memory Model Beyond Coherence Domains – Caching in Memory
     § Assume (to start with) a fixed partition of data to "homes" (locales)
     § Global name space: easy
     § Implicit communication (read/write): a performance problem
     – Need to access data in large, user-defined chunks ->
     • User-defined "logical cache lines" (a data structure, a tile; any data set that can be moved together)
     • User- (or compiler-) inserted "prefetch/flush" instructions (acquire/release access); a toy version is sketched below
     – Need latency hiding ->
     • Prefetch or "concurrent multitasking"
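What acquire/release on a user-defined "logical cache line" might look like. Everything here is hypothetical and invented for illustration; the toy runs in a single process and simulates the remote home with ordinary memory.

```c
#include <stdlib.h>
#include <string.h>

/* Toy model of "caching in memory": a logical cache line is a
 * user-defined chunk, copied from its home on acquire and written
 * back on release. All names are invented; nothing here is a real API. */
typedef struct {
    void  *home;    /* where the data permanently lives (its "home") */
    void  *local;   /* the local cached copy */
    size_t bytes;
    int    dirty;
} lc_line;

/* Acquire: fetch the chunk into a local buffer. In a real system this
 * would be a one-sided get from a remote locale, possibly split into
 * prefetch + wait to hide latency. */
void *lc_acquire(lc_line *l, void *home, size_t bytes, int for_write) {
    l->home = home; l->bytes = bytes; l->dirty = for_write;
    l->local = malloc(bytes);
    memcpy(l->local, home, bytes);              /* the "prefetch" */
    return l->local;
}

/* Release: flush modifications back to the home copy, drop the line. */
void lc_release(lc_line *l) {
    if (l->dirty) memcpy(l->home, l->local, l->bytes);  /* the "flush" */
    free(l->local);
}

int main(void) {
    double A[256] = {0};    /* pretend this array lives on a remote home */
    lc_line line;
    double *t = lc_acquire(&line, A, sizeof A, 1 /* write access */);
    for (int i = 0; i < 256; i++) t[i] = i;     /* compute on the local copy */
    lc_release(&line);      /* updates become visible at the home */
    return A[255] == 255.0 ? 0 : 1;
}
```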
  26. Living Without Coherence
     § Scientific codes are often organized in successive phases where a variable (or a family of variables) is either exclusively updated or shared during the entire phase. Examples include
     – Particle codes (Barnes-Hut, molecular dynamics…)
     – Iterative solvers (finite element, red-black schemes…)
     ☛ No need to track the status of each "cache line"; software can effect global changes at phase end
  27. How Does This Compare to PGAS Languages?
     § Added ability to cache and prefetch
     – As distinct from copying
     – Data changes location without changing its name
     § Added message-driven task scheduling
     – Necessary for hiding the latency of reads in irregular, dynamic codes
  28. Is Such a Model Useful?
     § Facilitates writing dynamic, irregular code (e.g., Barnes-Hut)
     § Can be used without performance loss by adding annotations that enable early binding
  29. Example: Caching
     § General case: a "logical cache line" defined as a user-provided class
     § Special case: code for simple, frequently used tile types is optimized & inlined (e.g., rectangular submatrices) [Chapel]
     § General case: needs dynamic space management and (possibly) an additional indirection (often avoided via pointer swizzling)
     § Special case: the same remote data (e.g., a halo) is repeatedly accessed. One can preallocate a local buffer and compile remote references into local references (at compile time for regular halos; after the data is read, for irregular halos). A sketch follows below.
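A sketch of the halo special case (my illustration; all names are invented): the ghost region is preallocated once next to the owned data, so after one aggregated refresh per sweep every "remote" reference compiles into a plain local array access.

```c
/* Owned data u[1..N] stored with a preallocated halo: u[0] and u[N+1]
 * are ghost cells holding the neighbors' boundary values. */
#define N 1024
static double u[N + 2];

/* Stand-in halo refresh: periodic boundary within one process. In a
 * real code this would be the MPI halo exchange sketched earlier. */
static void refresh_halo(double *v, int n) {
    v[0]     = v[n];
    v[n + 1] = v[1];
}

static void jacobi_sweep(double *out) {
    refresh_halo(u, N);          /* one aggregated transfer per sweep */
    for (int i = 1; i <= N; i++)
        /* u[i-1] and u[i+1] may be "remote" values, yet the accesses
         * are ordinary local loads into the preallocated buffer. */
        out[i] = 0.5 * (u[i - 1] + u[i + 1]);
}

int main(void) {
    static double out[N + 2];
    for (int i = 1; i <= N; i++) u[i] = i;
    jacobi_sweep(out);
    return out[1] > 0.0 ? 0 : 1;   /* keep the result observable */
}
```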
  30. There is a L o n g List of Additional Issues
     § How general are home partitions?
     – Use of the "owner computes" approach requires general partitions
     – Complex partitions lead to expensive address translation (see the block-distribution sketch below)
     § Is the number of homes (locales) fixed?
     § Can data homes migrate?
     – No migration is a problem for load balancing and failure recovery
     – The usual approach to migration (virtualization & overdecomposition) unnecessarily reduces granularity
     § Can we update a (parameterized) decomposition, rather than virtualize locations?
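To see what "address translation" costs (my example, not from the slides): for a simple block distribution the global-to-home mapping is two integer operations, while a general partition would need a table lookup or search at every remote reference.

```c
#include <assert.h>

/* Block distribution: element i lives on home i/block at offset
 * i%block. Translation is two integer ops; an irregular partition
 * would replace this with a (much costlier) lookup. */
typedef struct { int home; long offset; } location;

static location translate_block(long global_i, long block) {
    location loc = { (int)(global_i / block), global_i % block };
    return loc;
}

int main(void) {
    long block = 1000;                        /* elements per home */
    location loc = translate_block(2500, block);
    assert(loc.home == 2 && loc.offset == 500);
    return 0;
}
```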
  31. There is a L o n g List of Additional Issues (2)
     § Local view of control or global view? (Both are rendered in C below.)
     Forall i on A[i]
         A[i] = foo(i)
     – Transformation described in terms of the global data structure
     Foreach A.tile
         foo(tile)
     – Transformation described in terms of the local data structure (with local references)
     – There are strong, diverging opinions on which is better
     § Hierarchical data structures & hierarchical control
     § Exception model
     § …
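A rough rendering of the two views in plain C (my sketch; the deck's Forall/Foreach are pseudocode, and the helper names here are invented): the global view iterates over global indices and relies on translation to find the data, while the local view iterates only over the tile this locale owns.

```c
/* Global view: the loop ranges over global indices; an owner-computes
 * test (the "on A[i]" clause) picks out the iterations whose data is
 * local. Here foo(i) is just i, for concreteness. */
static void global_view(double *my_tile, long my_home,
                        long block, long n_global) {
    for (long i = 0; i < n_global; i++)
        if (i / block == my_home)            /* does A[i] live here? */
            my_tile[i % block] = (double)i;  /* A[i] = foo(i) */
}

/* Local view: the loop ranges over the locally owned tile, using
 * only local indices and local references. */
static void local_view(double *my_tile, long my_home, long block) {
    for (long j = 0; j < block; j++)
        my_tile[j] = (double)(my_home * block + j);
}
```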
  32. The End
