Keynote snir spaa

SPAA 2013 keynote presentation

Transcript

  • 1. Supercomputing: Technical Evolution & Programming Models. Marc Snir, Argonne National Laboratory & University of Illinois at Urbana-Champaign
  • 2. Introduction
  • 3. Theory of Punctuated Equilibrium (Eldredge, Gould, Mayr…)
    § Evolution consists of long periods of equilibrium, with little change, interspersed with short periods of rapid change.
      – Mutations are diluted in large populations in equilibrium; this homogenizing effect prevents the accumulation of multiple changes
      – Small, isolated populations under heavy natural-selection pressure evolve rapidly, and new species can appear
      – Major cataclysms can be a cause of rapid change
    § Punctuated equilibrium is a good model for technology evolution:
      – Revolutions are hard in large markets with network effects and technology that evolves gradually
      – Changes can be much faster when small, isolated product markets are created, or when the current technology hits a wall (a cataclysm)
    § (Not a new idea: e.g., Levinthal 1998, The Slow Pace of Rapid Technological Change: Gradualism and Punctuation in Technological Change)
  • 4. Why It Matters to SPAA (and PODC)
    § Periods of paradigm shift generate a rich set of new problems (new low-hanging fruit?)
      – It is a time when good theory can help
    § E.g., Internet, wireless, big data
      – Punctuated evolution due to the appearance of new markets
    § Hypothesis: HPC now and, ultimately, much of IT are entering a period of fast evolution. Please prepare.
  • 5. Where the Analogy with Biological Evolution Breaks Down
    § Technology evolution can be accelerated by genetic engineering
      – Technology developed in one market is exploited in another market
      – E.g., the Internet and wireless were enabled by cheap microprocessors, telephony technology, etc.
    § "Genetic engineering" has been essential for HPC in the last 25 years:
      – Progress enabled by reuse of technologies from other markets (micros, GPUs…)
  • 6. Past & Present
  • 7. Evidence of Punctuated Equilibrium in HPC
    [Figure: core count of the leading Top500 system over time, on a logarithmic scale from 1 to 10,000,000 cores, annotated with the attack of the killer micros, multicore, accelerators, and SPAA.]
  • 8. 1990: The Attack of the Killer Micros (Eugene Brooks, 1990)
    § Shift from ECL vector machines to clusters of MOS micros
      – Cataclysm: bipolar evolution reached its limits (nitrogen cooling, gallium arsenide…); MOS was on a fast evolution path
      – MOS had its niche markets: controllers, workstations, PCs
      – A classical example of a "good enough, cheaper technology" (Christensen, The Innovator's Dilemma)
  • 9. 2002: Multicore
    § Clock speed stopped increasing; very little return on added CPU complexity; chip density continued to increase
      – Technology push, not market pull
      – Still has limited success
  • 10. 2010: Accelerators
    § A new market (graphics) created an ecological niche
    § Technology transplanted into other markets (signal processing/vision, scientific computing)
      – Advantage of a better power/performance ratio (less logic)
    § Technology is still changing rapidly: integration with the CPU and an evolving ISA
  • 11. Were the (R)evolutions Successful in HPC?
    § Killer micros: Yes
      – Totally replaced vector machines
      – All HPC codes enabled for message passing (MPI)
      – Took > 10 years and > $1B of government investment (DARPA)
    § Multicore: Incomplete
      – Many codes still use one MPI process per core and use shared memory only for message passing
      – Use of two programming models (MPI+OpenMP) is burdensome (a minimal hybrid sketch follows below)
      – PGAS is not used, and does not (so far) provide a real advantage over MPI
      – Many open issues on scaling multithreading models (OpenMP, TBB, Cilk…) and combining them with message passing
      – (See the history of large-scale NUMA, which did not become a viable species)
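To make the two-model burden concrete, here is a minimal, hypothetical MPI+OpenMP hybrid sketch (my illustration, not code from the talk): OpenMP threads share a node's cores while MPI handles communication between ranks, so the programmer maintains two distinct parallelism models in one program.

```c
/* Minimal MPI+OpenMP hybrid sketch (illustrative only, not from the talk).
 * Each MPI rank owns a slice of the work; OpenMP threads share the
 * node-level loop; a single MPI_Allreduce combines the partial sums. */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int provided, rank, size;
    /* Request thread support so OpenMP and MPI can coexist safely. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const long n_per_rank = 1 << 20;
    double local = 0.0;

    /* Node-level parallelism: OpenMP threads share this rank's slice. */
    #pragma omp parallel for reduction(+:local)
    for (long i = 0; i < n_per_rank; i++)
        local += 1.0 / (1.0 + (double)(rank * n_per_rank + i));

    /* Inter-node parallelism: message passing combines per-rank results. */
    double global = 0.0;
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0)
        printf("sum = %f (ranks=%d, threads/rank=%d)\n",
               global, size, omp_get_max_threads());
    MPI_Finalize();
    return 0;
}
```

Typically one such process is launched per node or per socket, with the thread count set through OMP_NUM_THREADS; keeping the two models consistent (thread safety of the MPI calls, placement of ranks and threads) is exactly the burden the slide points to.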
  • 12. Were the (R)evolutions Successful? (2)
    § Accelerators: Just beginning
      – Few HPC codes have been converted to use GPUs
    § Obstacles:
      – Technology is still changing fast (integration of the GPU with the CPU, continued changes in the ISA)
      – No good non-proprietary programming systems are available, and their long-term viability is uncertain
  • 13. Key Obstacles
    § Scientific codes live much longer than computer systems (two decades or more); they need to be ported across successive HW generations
    § The amount of code to be ported continuously increases (major scientific codes each have > 1 MLOC)
    § Need very efficient, well-tuned codes (HPC platforms are expensive)
    § Need portability across platforms (HPC programmers are expensive)
    § Squaring the circle?
    § The lack of performant, portable programming models has become the major impediment to the evolution of HPC hardware
  • 14. Did Theory Help?
    § Killer micros: Helped by work on scalable algorithms and on interconnects
    § Multicore: Helped by work on communication complexity (efficient use of caches)
      – Very little use of work on coordination algorithms or transactional memory
    § Accelerators: Cannot think of relevant work
      – Interesting question: power of branching & power of indirection
      – Surprising result: the AKS sorting network
    § Too often, theory follows practice, rather than preceding it.
  • 15. Future
  • 16. The End of Moore's Law Is Coming
    § Moore's Law: the number of transistors per chip doubles every two years
    § Stein's Law: if something cannot go on forever, it will stop
    § The question is not whether but when Moore's Law will stop
      – It is difficult to make predictions, especially about the future (Yogi Berra)
  • 17. Current Obstacle: Current Leakage
    § Transistors do not shut off completely
    "While power consumption is an urgent challenge, its leakage or static component will become a major industry crisis in the long term, threatening the survival of CMOS technology itself, just as bipolar technology was threatened and eventually disposed of decades ago" (International Technology Roadmap for Semiconductors (ITRS), 2011)
    § The ITRS "long term" is the 2017-2024 timeframe.
    § No "good enough" technology waiting in the wings
  • 18. Longer-Term Obstacle
    § Quantum effects totally change the behavior of transistors as they shrink
      – A 7-5 nm feature size is predicted to be the lower limit for CMOS devices
      – The ITRS predicts 7.5 nm will be reached in 2024
  • 19. The 7nm Wall [figure: ANL-LBNL-ORNL-PNNL, courtesy S. Dosanjh]
  • 20. The Future Is Not What It Was [figure: ANL-LBNL-ORNL-PNNL, courtesy S. Dosanjh]
  • 21. Progress Does Not Stop
    § It becomes more expensive and slows down
      – New materials (e.g., III-V, germanium thin channels, nanowires, nanotubes or graphene)
      – New structures (e.g., 3D transistor structures)
      – Aggressive cooling
      – New packages
    § More invention at the architecture level
    § Seeking value from features other than speed ("More than Moore")
      – System on a chip: integration of analog and digital
      – MEMS…
    § Beyond Moore? (Quantum, biological…) Beyond my horizon
  • 22. Exascale
  • 23. Supercomputer Evolution
    § ×1,000 performance increase every 11 years
      – ×50 faster than Moore's Law
    § Extrapolation predicts exaflop/s (10^18 floating-point operations per second) before 2020 (a quick worked extrapolation follows below)
      – We are now at 50 Petaflop/s
    § Extrapolation may not work if Moore's Law slows down
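As a sanity check on that extrapolation (back-of-the-envelope arithmetic consistent with the slide, not taken from it): a factor of 1,000 every 11 years is roughly 1.87× per year, and the step from 50 Pflop/s to 1 Eflop/s is a factor of 20.

```latex
% Illustrative extrapolation, assuming the 2013 trend (x1000 per 11 years) holds.
\[
  r = 1000^{1/11} \approx 1.87 \ \text{per year}, \qquad
  t = \frac{\ln\!\bigl(10^{18}/(5\times 10^{16})\bigr)}{\ln r}
    = \frac{\ln 20}{\ln 1.87} \approx 4.8 \ \text{years},
\]
% i.e., a 50 Pflop/s baseline in 2013 reaches 1 Eflop/s around 2018, before 2020.
```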
  • 24. Do We Care?
    § "It's all about Big Data now; simulations are passé."
    § B***t
    § "All science is either physics or stamp collecting." (Ernest Rutherford)
      – In the physical sciences, experiments and observations exist to validate/refute/motivate theory. "Data mining" not driven by a scientific hypothesis is "stamp collecting".
    § Simulation is needed to go from a mathematical model to predictions about observations.
      – If the system is complex (e.g., climate), then simulation is expensive
      – Predictions are often statistical, complicating both simulation and data analysis
  • 25. Cosmology: Computation Meets Data, the Argonne View (courtesy Salman Habib)
    [Diagram: observations from survey instruments (LSST weak lensing), whose statistical error bars will "disappear" soon, are combined with a supercomputer simulation campaign through HACC+CCF (HACC = Hardware/Hybrid Accelerated Cosmology Code, CCF = Cosmic Calibration Framework), using an emulator based on Gaussian-process interpolation in high-dimensional spaces and Markov chain Monte Carlo as a "precision oracle" for "cosmic calibration"; domain science + CS + math + stats + machine learning.]
    Record-breaking application: 3.6 trillion particles, 14 Pflop/s
  • 26. Exascale Design Point: 202x with a cap of $200M and 20 MW
    (Each row: 2012 BG/Q computer → 2020-2024 target, with the difference between today and 2019 in parentheses.)
      – System peak: 20 Pflop/s → 1 Eflop/s (O(100))
      – Power: 8.6 MW → ~20 MW
      – System memory: 1.6 PB (16 × 96 × 1024) → 32-64 PB (O(10))
      – Node performance: 205 GF/s (16 × 1.6 GHz × 8) → 1.2 or 15 TF/s (O(10) – O(100))
      – Node memory BW: 42.6 GB/s → 2-4 TB/s (O(1000))
      – Node concurrency: 64 threads → O(1k) or 10k (O(100) – O(1000))
      – Total node interconnect BW: 20 GB/s → 200-400 GB/s (O(10))
      – System size (nodes): 98,304 (96 × 1024) → O(100,000) or O(1M) (O(100) – O(1000))
      – Total concurrency: 5.97 M → O(billion) (O(1,000))
      – MTTI: 4 days → O(<1 day) (-O(10))
    Both price and power envelopes may be too aggressive! (A quick energy-per-operation check follows below.)
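One way to see why the envelopes are aggressive (a back-of-the-envelope check, not from the slide): dividing the power cap by the target rate bounds the total energy available per floating-point operation, including all data movement.

```latex
% Total energy budget per operation at the exascale design point (illustrative).
\[
  \frac{20\ \mathrm{MW}}{10^{18}\ \mathrm{flop/s}}
  = 2\times 10^{-11}\ \mathrm{J/flop}
  = 20\ \mathrm{pJ/flop},
\]
% and that 20 pJ must cover arithmetic, memory traffic, and interconnect traffic together.
```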
  • 27. Identified Issues
    § Scale (a billion threads)
    § Power (tens of MWatts)
      – Communication: > 99% of power is consumed by moving operands across the memory hierarchy and across nodes
      – Reduced memory size (communication in time)
    § Resilience: something fails every hour; the machine is never "whole" (rough numbers below)
      – Trade-off between power and resilience
    § Asynchrony: equal work ≠ equal time
      – Power management
      – Error recovery
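Rough numbers behind the resilience bullet (my illustration with assumed figures, not the slide's): if per-node reliability stays fixed while the node count grows, the system-level mean time to interrupt shrinks in proportion.

```latex
% Assumes independent node failures and per-node reliability that does not improve.
\[
  \mathrm{MTTI}_{\mathrm{system}} \approx \frac{\mathrm{MTBF}_{\mathrm{node}}}{N_{\mathrm{nodes}}},
\]
% so moving from roughly 1e5 nodes with a 4-day MTTI to roughly 1e6 nodes gives on the
% order of 10 hours, and counting every component (memory, NICs, disks, power supplies)
% pushes the interrupt rate toward the "something fails every hour" regime.
```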
  • 28. Other Issues
    § Uncertainty about the underlying HW architecture
      – Fast evolution of architecture (accelerators, 3D memory and processing near memory, NVRAM)
      – Uncertainty about the market that will supply components to HPC
      – Possible divergence from commodity markets
    § Increased complexity of software
      – Simulations of complex systems + uncertainty quantification + optimization…
      – Software management of power and failure
      – Scale and tight coupling (the tail of the distribution matters!)
  • 29. Research Areas
  • 30. Scale
    § HPC algorithms are being designed for a 2-level hierarchy (node, global); can they be designed for a multi-level hierarchy? Can they be "hierarchy-oblivious"? (A toy sketch follows below.)
    § Can we have a programming model that abstracts the specific HW mechanisms at each level (message passing, shared memory) yet can leverage these mechanisms efficiently?
      – Global shared object space + caching + explicit communication
      – Multilevel programming (compilation with a human in the loop)
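A toy sketch of what "hierarchy-oblivious" could mean, borrowing the cache-oblivious idea (illustrative only, not a construct from the talk): recursive decomposition exposes locality and independent work at every scale without naming the number of levels.

```c
/* Toy "hierarchy-oblivious" reduction (illustrative sketch only).
 * Like cache-oblivious algorithms, the recursive halving exposes locality
 * and parallelism at every scale without hard-coding the number of levels
 * (core, socket, node, rack): each half can be mapped to whatever unit of
 * the hierarchy a runtime chooses. */
#include <stddef.h>

double hsum(const double *a, size_t n) {
    if (n <= 1024) {                 /* small base case fits any bottom level */
        double s = 0.0;
        for (size_t i = 0; i < n; i++) s += a[i];
        return s;
    }
    size_t half = n / 2;
    /* The two recursive calls are independent; a runtime could place them on
     * two threads, two sockets, or two nodes -- the algorithm does not care. */
    double left  = hsum(a, half);
    double right = hsum(a + half, n - half);
    return left + right;
}
```

The open question the slide raises is whether realistic HPC algorithms (not just reductions) admit such level-free formulations that a compiler or runtime can then map efficiently onto a concrete machine.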
  • 31. Communication
    § Communication-efficient algorithms
    § A better understanding of fundamental communication-computation tradeoffs for PDE solvers (getting away from DAG-based lower bounds; tradeoffs between communication and convergence rate)
    § Programming models, libraries and languages where communication is a first-class citizen (other than MPI)
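For context on the DAG-based bounds the slide wants to move beyond (a classical result, not a claim from the talk): for dense matrix multiplication on a processor with fast memory of size M, the number of words moved between fast and slow memory is bounded below as follows.

```latex
% Classical communication lower bound for dense n x n matrix multiplication
% (Hong & Kung 1981; Irony, Toledo & Tiskin 2004).
\[
  W \;=\; \Omega\!\left(\frac{n^{3}}{\sqrt{M}}\right)
\]
% Iterative PDE solvers are harder to pin down this way, because communication
% can be traded against convergence rate and the computation DAG is not fixed.
```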
  • 32. Resilient Distributed Systems
    § E.g., a parallel file system with 768 I/O nodes and >50K disks
      – Systems are built to tolerate disk and node failures
      – However, most failures in the field are due to "performance bugs": e.g., time-outs due to thrashing
    § How do we build feedback mechanisms that ensure stability? (control theory for large-scale, discrete systems)
    § How do we provide quality of service?
    § What is a quantitative theory of resilience? (e.g., the impact of failure rate on overall performance)
      – Focus on systems where failures are not exceptional
  • 33. Resilient Parallel Algorithms: Overcoming Silent Data Corruptions
    § SDCs may be unavoidable in future large systems (due to flips in computation logic)
    § Intuition: an SDC can either
      – Type 1: grossly violate the computation model (e.g., a jump to a wrong address, a message sent to the wrong node), or
      – Type 2: introduce noise in the data (a bit flip in a large array)
    § Many iterative algorithms can tolerate infrequent type 2 errors (a small sketch follows below)
    § Type 1 errors are often catastrophic and easy to detect in software
    § Can we build systems that avoid or correct easy-to-detect (type 1) errors and tolerate hard-to-detect (type 2) errors?
    § What is the general theory of fault-tolerant numerical algorithms?
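A small sketch of the type 2 intuition (illustrative, not from the talk): a Jacobi iteration on a diagonally dominant system is a contraction, so a one-off perturbation of the iterate, standing in for a silent bit flip, is damped over subsequent iterations, while a cheap residual check in software would also flag gross corruptions.

```c
/* Illustrative sketch (not from the talk): Jacobi iteration on a strictly
 * diagonally dominant tridiagonal system keeps converging even when a
 * "type 2" error perturbs one entry of the iterate mid-run, because the
 * iteration is a contraction that damps data noise. */
#include <stdio.h>
#include <math.h>

#define N 64

int main(void) {
    double x[N] = {0}, xn[N], b[N];
    for (int i = 0; i < N; i++) b[i] = 1.0;   /* A: 4 on diagonal, -1 off */

    for (int it = 0; it < 200; it++) {
        for (int i = 0; i < N; i++) {          /* one Jacobi sweep */
            double off = 0.0;
            if (i > 0)     off -= x[i - 1];
            if (i < N - 1) off -= x[i + 1];
            xn[i] = (b[i] - off) / 4.0;
        }
        for (int i = 0; i < N; i++) x[i] = xn[i];

        if (it == 50) x[N / 2] += 1e3;         /* inject a bit-flip-like error */

        double r2 = 0.0;                        /* residual norm ||b - Ax|| */
        for (int i = 0; i < N; i++) {
            double ax = 4.0 * x[i];
            if (i > 0)     ax -= x[i - 1];
            if (i < N - 1) ax -= x[i + 1];
            r2 += (b[i] - ax) * (b[i] - ax);
        }
        if (it % 25 == 0 || it == 51)
            printf("iter %3d  residual %.3e\n", it, sqrt(r2));
    }
    return 0;
}
```

The residual spikes right after the injection and then decays geometrically; the same monitor, applied to a type 1 event such as a wild jump or lost message, would typically show a failure rather than a recoverable blip.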
  • 34. Asynchrony
    § What is a measure of asynchrony tolerance? (See the illustration below.)
      – Moving away from the qualitative (e.g., wait-free) to the quantitative:
      – How much do intermittently slow processes slow down the entire computation, on average?
    § What are the trade-offs between synchronicity and computation work?
    § Load balancing, driven not by uncertainty about the computation but by uncertainty about the computer
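A standard illustration of why intermittent slowness hurts at scale (a textbook fact, not from the slides): a bulk-synchronous step ends only when the slowest of p participants finishes, and for i.i.d. exponential delays with mean τ the expected wait grows with p.

```latex
% Expected duration of a barrier step with p i.i.d. Exp(1/tau) per-process delays.
\[
  \mathbb{E}\Bigl[\max_{1\le i\le p} T_i\Bigr] \;=\; \tau\, H_p \;\approx\; \tau \ln p ,
\]
% so at p on the order of 10^9 threads, even rare and small delays inflate every
% synchronized step -- the kind of effect a quantitative measure of asynchrony
% tolerance would have to capture.
```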
  • 35. Architecture-Specific Algorithms
    § GPUs/accelerators
    § Hybrid Memory Cube / near-memory computing
    § NVRAM (e.g., flash memory)
  • 36. Portable Performance
    § Can we redefine compilation so that:
      – It supports well a human in the loop (manual high-level decisions vs. automated low-level transformations)
      – It integrates auto-tuning and profile-guided compilation
      – It preserves high-level code semantics
      – It preserves high-level code "performance semantics"
    [Diagram] Principle: high-level code is "compiled" into low-level, platform-specific codes. Practice: Code A, Code B, Code C, kept in sync by manual conversion and "ifdef" spaghetti (a small example of that spaghetti follows below).
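A hypothetical fragment of the "ifdef spaghetti" the slide refers to (illustrative; axpy_cuda_launch is an assumed, made-up entry point, not a real API): one small kernel ends up with three hand-maintained, platform-specific bodies that the programmer must keep semantically in sync.

```c
/* Hypothetical example of "ifdef" spaghetti: the same vector update carries
 * three platform-specific bodies, selected at build time and kept in sync
 * only by programmer discipline. */
#if defined(USE_CUDA)
/* assumed GPU launcher, implemented elsewhere (made up for illustration) */
void axpy_cuda_launch(int n, double a, const double *x, double *y);
#endif

void axpy(int n, double a, const double *x, double *y) {
#if defined(USE_CUDA)
    axpy_cuda_launch(n, a, x, y);        /* GPU path */
#elif defined(USE_OPENMP)
    #pragma omp parallel for             /* multicore path */
    for (int i = 0; i < n; i++)
        y[i] += a * x[i];
#else
    for (int i = 0; i < n; i++)          /* plain serial fallback */
        y[i] += a * x[i];
#endif
}
```

The slide's "principle" column would replace this with a single high-level source whose platform-specific variants are derived by compilation, auto-tuning, and profile guidance, with the human making only the high-level decisions.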
  • 37. Conclusion
    § Moore's Law is slowing down; the slow-down has many fundamental consequences, only a few of them explored in this talk
    § HPC is the "canary in the mine":
      – Issues appear earlier because of size and tight coupling
    § Optimistic view of the next decades: a frenzy of innovation to continue pushing the current ecosystem, followed by a frenzy of innovation to use totally different compute technologies
    § Pessimistic view: the end is coming