Resilience at exascale
 

    Presentation Transcript

    • Resilience at Exascale
      Marc Snir, Director, Mathematics and Computer Science Division, Argonne National Laboratory; Professor, Dept. of Computer Science, UIUC
    • The Problem
      • "Resilience is the black swan of the exascale program"
      • DOE & DoD commissioned several reports
        – Inter-Agency Workshop on HPC Resilience at Extreme Scale, http://institute.lanl.gov/resilience/docs/Inter-AgencyResilienceReport.pdf (Feb 2012)
        – U.S. Department of Energy Fault Management Workshop, http://shadow.dyndns.info/publications/geist12department.pdf (June 2012)
        – ICIS workshop on "addressing failures in exascale computing" (report forthcoming)
      • This talk:
        – Discuss the little we understand
        – Discuss how we could learn more
        – Seek your feedback on the report's content
    • WHERE WE ARE TODAY
    • Failures Today
      • Latest large systematic study (Schroeder & Gibson) published in 2005
        – quite obsolete
      • Failures per node per year: 2-20
      • Root cause of failures
        – Hardware (30%-60%), SW (5%-25%), Unknown (20%-30%)
      • Mean Time to Repair: 3-24 hours
      • Current anecdotal numbers:
        – Application crash: > 1 day
        – Global system crash: > 1 week
        – Application checkpoint/restart: 15-20 minutes (checkpoint often "free")
        – System restart: > 1 hour
        – Soft HW errors suspected
    • Recommendation
      • Study in a systematic manner current failure types and rates
        – Using error logs at major DOE labs
        – Pushing for a common ontology
      • Spend time hunting for soft HW errors
      • Study causes of errors (?)
        – Most HW faults (>99%) have no effect (processors are hugely inefficient?)
        – The common effect of soft HW bit flips is SW failure
        – HW faults can (often?) be due to design bugs
        – Vendors are cagey
      • Time for a consortium effort?
    • Current Error-Handling
      • Application: global checkpoint & global restart
      • System:
        – Repair persistent state (file system, databases)
        – Clean-slate restart for everything else
      • Quick analysis:
        – Assume failures have a Poisson distribution; checkpoint time C = 1; recover + restart time = R; MTBF = M
        – Optimal checkpoint interval τ satisfies e^(-τ/M) = (M - τ + 1)/M
        – System utilization U is U = (M - R - τ + 1)/M
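
      A minimal numerical sketch of the model above (not from the slides): solve the optimality condition for τ by bisection and plug it into the utilization formula, with all times measured in units of the checkpoint time C.

```python
# Minimal numerical sketch (not from the slides) of the checkpoint model above.
# All times are in units of the checkpoint time C (C = 1).
import math

def optimal_tau(M):
    """Solve exp(-tau/M) = (M - tau + 1)/M for tau by bisection on (0, M)."""
    f = lambda tau: math.exp(-tau / M) - (M - tau + 1) / M
    lo, hi = 1e-9, float(M)
    for _ in range(100):
        mid = (lo + hi) / 2
        if f(lo) * f(mid) <= 0:
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2

def utilization(M, R):
    """U = (M - R - tau + 1)/M at the optimal checkpoint interval tau."""
    return (M - R - optimal_tau(M) + 1) / M

# Parameters used on the next two slides: C = R = 15 min (= 1), MTBF = 25 h (= 100).
print(utilization(M=100, R=1))   # ~0.86 (slide quotes U ~ 85%)
print(utilization(M=100, R=4))   # ~0.83 (slide quotes U ~ 80% for R = 1 hour)
```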
    • Utilization, Assuming R = 1
      • E.g.
        – C = R = 15 mins (= 1)
        – MTBF = 25 hours (= 100)
        – U ≈ 85%
    • Utilization as a Function of MTBF and R
      • E.g.
        – C = 15 mins (= 1)
        – R = 1 hour (= 4)
        – MTBF = 25 hours (= 100)
      • U ≈ 80%
    • Projecting Ahead
      • "Comfort zone:"
        – Checkpoint time < 1% of MTBF
        – Repair time < 5% of MTBF
      • Assume MTBF = 1 hour
        – Checkpoint time ≈ 0.5 minute
        – Repair time ≈ 3 minutes
      • Is this doable?
        – Yes, if done in memory
        – E.g., using "RAID" techniques
    • Exascale Design Point
      Columns: BG/Q today (2012), projected 2020-2024 systems, and the difference factor (today vs. ~2019).
        System peak:                 20 Pflop/s               →  1 Eflop/s             O(100)
        Power:                       8.6 MW                   →  ~20 MW
        System memory:               1.6 PB (16*96*1024)      →  32-64 PB              O(10)
        Node performance:            205 GF/s (16*1.6GHz*8)   →  1.2 or 15 TF/s        O(10)-O(100)
        Node memory BW:              42.6 GB/s                →  2-4 TB/s              O(1000)
        Node concurrency:            64 threads               →  O(1k) or 10k          O(100)-O(1000)
        Total node interconnect BW:  20 GB/s                  →  200-400 GB/s          O(10)
        System size (nodes):         98,304 (96*1024)         →  O(100,000) or O(1M)   O(100)-O(1000)
        Total concurrency:           5.97 M                   →  O(billion)            O(1,000)
        MTTI:                        4 days                   →  O(<1 day)             -O(10)
      Both price and power envelopes may be too aggressive!
    • Time to Checkpoint
      • Assume "RAID 5" across memories, checkpoint size ~50% of memory size
        – Checkpoint time ≈ time to transfer 50% of memory to another node: a few seconds!
        – Memory overhead ~50%
        – Energy overhead small with NVRAM
        – (But need many write cycles)
        – We are in the comfort zone (re checkpoint)
      • How about recovery?
      • Time to restart the application is the same as the time to checkpoint
      • Problem: time to recover if the system failed
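
      Below is an illustrative sketch (an assumption for illustration, not the slides' actual mechanism) of the "RAID 5 across memories" idea: every node's in-memory checkpoint contributes to an XOR parity buffer held elsewhere, so the checkpoint of any single failed node can be rebuilt from the survivors.

```python
# Illustrative sketch of RAID-5-style parity over in-memory checkpoints
# (an assumption for illustration, not the slides' implementation).
import numpy as np

def make_parity(checkpoints):
    """XOR of all nodes' checkpoint buffers (equal-size uint8 arrays)."""
    parity = np.zeros_like(checkpoints[0])
    for c in checkpoints:
        parity ^= c
    return parity

def rebuild(lost, checkpoints, parity):
    """Reconstruct the checkpoint of a single failed node from the rest + parity."""
    out = parity.copy()
    for i, c in enumerate(checkpoints):
        if i != lost:
            out ^= c
    return out

# Toy usage: 4 nodes, 1 MiB checkpoint each (a real checkpoint would be ~50% of node memory).
ckpts = [np.random.randint(0, 256, 1 << 20, dtype=np.uint8) for _ in range(4)]
parity = make_parity(ckpts)
assert np.array_equal(rebuild(2, ckpts, parity), ckpts[2])   # node 2 "failed"
```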
    • Key Assumptions
      1. Errors are detected (before the erroneous checkpoint is committed)
      2. System failures are rare (much less than one per day) or recovery from them is very fast (minutes)
    • Hardware Will Fail More Frequently
      • More, smaller transistors
      • Near-threshold logic
      ☛ More frequent bit upsets due to radiation
      ☛ More frequent multiple-bit upsets due to radiation
      ☛ Larger manufacturing variation
      ☛ Faster aging
    • Hardware Error Detection: Assumptions
      Summary of assumptions on the components of a 45nm node and estimates of scaling to 11nm (45nm / 11nm):
        Cores:                                  8 / 128
        Scattered latches per core:             200,000 / 200,000
        Uncore scattered latches (vs. cores):   1.25/√ncores = 0.44 / 0.11
        FIT per latch:                          10^-1 / 10^-1
        Arrays per core (MB):                   1 / 1
        FIT per SRAM cell:                      10^-4 / 10^-4
        Logic FIT / latch FIT:                  0.1-0.5 / 0.1-0.5
        DRAM FIT (per node):                    50 / 50
    • Hardware Error Detection: Analysis
      Array interleaving and SECDED (baseline), FIT per node (45nm / 11nm):
        Arrays:               DCE 5000 / 100000       DUE 50 / 20000           UE 1 / 1000
        Scattered latches:    DCE 200 / 4000          DUE N/A                  UE 20 / 400
        Combinational logic:  DCE 20 / 400            DUE N/A                  UE 0 / 4
        DRAM:                 DCE 50 / 50             DUE 0.5 / 0.5            UE 0.005 / 0.005
        Total:                DCE 1000-5000 / 100000  DUE 10-100 / 5000-20000  UE 10-50 / 500-5000
      Array interleaving and >SECDED (11nm overhead: ~1% area and <5% power), FIT per node (45nm / 11nm):
        Arrays:               DCE 5000 / 100000       DUE 50 / 1000            UE 1 / 5
        Scattered latches:    DCE 200 / 4000          DUE N/A                  UE 20 / 400
        Combinational logic:  DCE 20 / 400            DUE N/A                  UE 0.2 / 5
        DRAM:                 DCE 50 / 50             DUE 0.5 / 0.5            UE 0.005 / 0.005
        Total:                DCE 1500-6500 / 100000  DUE 10-50 / 500-5000     UE 10-50 / 100-500
      Array interleaving and >SECDED + latch parity (45nm overhead ~10%; 11nm overhead: ~20% area and ~25% power), FIT per node (45nm / 11nm):
        Arrays:               DCE 5000 / 100000       DUE 50 / 1000            UE 1 / 5
        Scattered latches:    DCE 200 / 4000          DUE 20 / 400             UE 0.01 / 0.5
        Combinational logic:  DCE 20 / 400            DUE N/A                  UE 0.2 / 5
        DRAM:                 DCE 0 / 0               DUE 0.1 / 0.0            UE 0.100 / 0.001
        Total:                DCE 1500-6500 / 100000  DUE 25-100 / 2000-10000  UE 1 / 5-20
    • Summary of (Rough) Analysis
      • If no new technology is deployed, we can have up to one undetected error per hour
      • With additional circuitry we could get down to one undetected error per 100-1,000 hours (weeks to months)
        – The cost could be as high as 20% additional circuits and 25% additional power
        – The main problems are in combinational logic and latches
        – The price could be significantly higher (small market for high-availability servers)
      • Need software error detection as an option!
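
      A back-of-the-envelope check of the rates above, assuming a 100,000-node system and per-node 11nm UE FIT values taken from the high end of the ranges on the analysis slide (FIT = failures per 10^9 hours); the node count and the specific FIT values chosen are assumptions for illustration.

```python
# Back-of-the-envelope conversion: per-node FIT -> time between undetected errors
# for the whole machine. Assumes 100,000 nodes; the FIT values are the high ends
# of the 11nm UE ranges from the analysis slide (illustrative choices only).
def system_mtbue_hours(fit_per_node, nodes=100_000):
    system_fit = fit_per_node * nodes    # failure rates add across nodes
    return 1e9 / system_fit              # FIT is failures per 10^9 hours

for label, fit in [("baseline SECDED", 5000), ("stronger ECC", 500), ("+ latch parity", 20)]:
    print(f"{label}: ~{system_mtbue_hours(fit):.0f} hours between undetected errors")
# baseline SECDED: ~2 h; stronger ECC: ~20 h; + latch parity: ~500 h,
# the same order of magnitude as the "one per hour" vs. "one per 100-1,000 hours" claim.
```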
    • Application-Level Data Error Detection
      • Check the checkpoint for correctness before it is committed
        – Look for outliers (assume smooth fields) – handles high-order bit errors
        – Use robust solvers – handles low-order bit errors
      • May not work for discrete problems, particle simulations, discontinuous phenomena, etc.
      • May not work if the outlier is rapidly smoothed (error propagation)
        – Check for global invariants (e.g., energy conservation)
      • Ignore the error
      • Duplicate computation of critical variables
      • Can we reduce checkpoint size (or avoid checkpoints altogether)?
        – Tradeoff: how much is saved vs. how often data is checked (big/small transactions)
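
      A minimal sketch of the first two ideas (an illustration, not a production detector): flag isolated outliers in a field that is expected to be smooth, and check a conserved global quantity, before committing the checkpoint. The "energy" (sum of squares) is a stand-in invariant chosen only for the example.

```python
# Minimal sketch: sanity-check a field before committing its checkpoint.
# Illustrative only; the outlier test and the "energy" invariant are stand-ins.
import numpy as np

def has_outliers(field, n_sigma=10.0):
    """Flag points far from the local neighbor average in a supposedly smooth 1-D field."""
    neighbor_avg = (np.roll(field, 1) + np.roll(field, -1)) / 2
    residual = field - neighbor_avg
    mad = np.median(np.abs(residual - np.median(residual))) * 1.4826   # robust sigma
    floor = 1e-12 * (np.abs(field).max() + 1e-300)                     # avoid zero scale
    return bool(np.any(np.abs(residual) > n_sigma * max(mad, floor)))

def invariant_violated(field, expected_energy, rel_tol=1e-6):
    """Check a conserved quantity (here: sum of squares) against its expected value."""
    return abs(float((field ** 2).sum()) - expected_energy) > rel_tol * abs(expected_energy)

def safe_to_commit(field, expected_energy):
    return not has_outliers(field) and not invariant_violated(field, expected_energy)
```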
    • How About Control Errors?
      • Code is corrupted, jump to a wrong address, etc.
        – Rarer (there is more data state than control state)
        – More likely to cause a fatal exception
        – Easier to protect against
    • How About System Failures (Due to Hardware)?
      • Kernel data corrupted
        – Page tables, routing tables, file system metadata, etc.
      • Hard to understand and diagnose
        – Breaks abstraction layers
      • It makes sense to avoid such failures
        – Duplicate system computations that affect kernel state (the price can be tolerated for HPC; a sketch follows this slide)
        – Use transactional methods
        – Harden kernel data
          • Highly reliable DRAM
          • Remotely accessible NVRAM
        – Predict and avoid failures
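
      One hedged sketch of the "duplicate system computations" bullet: run a critical state update twice and commit only when the two runs agree, retrying otherwise. The update function and the dict-shaped state are hypothetical placeholders, not an actual kernel interface.

```python
# Hedged sketch of duplicating a critical (kernel-state) computation before commit.
# `update_fn` and the dict-shaped state are hypothetical placeholders.
def committed_update(state, update_fn, retries=3):
    for _ in range(retries):
        a = update_fn(dict(state))   # two independent runs on copies of the state
        b = update_fn(dict(state))
        if a == b:                   # agreement: accept the result (commit)
            return a
    raise RuntimeError("duplicated runs kept disagreeing; refusing to commit")

# e.g. routing_table = committed_update(routing_table, recompute_routes)
# (recompute_routes is a hypothetical deterministic update the system would apply)
```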
    • Failure Prediction from Event Logs
      • Use a combination of signal analysis (to identify outliers) and data mining (to find correlations)
      • Gainaru, Cappello, Snir, Kramer (SC12)
      [Figure: methodology overview of the hybrid approach]
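
      A toy sketch of the "signal analysis" half only (an illustration, not the ELSA implementation from the cited paper): treat a node's per-hour error-event counts as a signal and flag hours whose counts are outliers against that node's own history; flagged hours would then feed the correlation (data-mining) step.

```python
# Toy sketch of the signal-analysis step (illustrative; not the ELSA implementation).
import numpy as np

def anomalous_hours(event_counts, n_sigma=4.0):
    """Return indices of hours whose event count is an outlier for this node."""
    counts = np.asarray(event_counts, dtype=float)
    med = np.median(counts)
    mad = np.median(np.abs(counts - med)) * 1.4826 + 1e-9   # robust spread estimate
    return np.flatnonzero(np.abs(counts - med) > n_sigma * mad)

# Hours flagged here would then be correlated with subsequent failures (data mining).
```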
    • Can Hardware Failure Be Predicted?
      Prediction method   Precision   Recall   Seq used      Pred failures
      ELSA hybrid         91.2%       45.8%    62 (96.8%)    603
      ELSA signal         88.1%       40.5%    117 (92.8%)   534
      Data mining         91.9%       15.7%    39 (95.1%)    207

      The metrics used for evaluating prediction performance are precision and recall:
      • Precision is the fraction of failure predictions that turn out to be correct.
      • Recall is the fraction of failures that are predicted.

      Percentage waste improvement in checkpointing strategies:
      C       Precision   Recall   MTTF for the whole system   Waste gain
      1 min   92          20       one day                     9.13%
      1 min   92          36       one day                     17.33%
      10 s    92          36       one day                     12.09%
      10 s    92          45       one day                     15.63%
      1 min   92          50       5 h                         21.74%
      10 s    92          65       5 h                         24.78%

      • Migrating processes when node failure is predicted can significantly improve utilization
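
      For clarity, the two metrics in the tables above, written as code (definitions only; the numbers in the tables come from the cited study, not from this snippet).

```python
# Precision and recall as used in the prediction tables above (definitions only).
def precision_recall(predicted, actual):
    """predicted, actual: sets of failure events, e.g. (node, time_window) pairs."""
    hits = len(predicted & actual)
    precision = hits / len(predicted) if predicted else 1.0   # correct predictions / all predictions
    recall = hits / len(actual) if actual else 1.0            # predicted failures / all failures
    return precision, recall

# Example: 50 predictions, 40 of them real, 100 actual failures -> precision 0.80, recall 0.40.
```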
    • How About Software Bugs?
      • Parallel code (transformational, mostly deterministic) vs. concurrent code (event-driven, very nondeterministic)
        – Hard to understand concurrent code (one cannot comprehend more than ~10 concurrent actors)
        – Hard to avoid performance bugs (e.g., overcommitted resources causing time-outs)
        – Hard to test for performance bugs
      • Concurrency performance bugs (e.g., in parallel file systems) are a major source of failures on current supercomputers
        – The problem will worsen as we scale up – performance bugs become more frequent
      • Need to become better at avoiding performance bugs (learn from control theory and real-time systems)
        – Make system code more deterministic
    • System Recovery
      • Local failures (e.g., a node kernel crash) are not a major problem: the failed node can be replaced
      • Global failures (e.g., a global file system crash) are the hardest problem: they need to be avoided
      • Issue: fault containment
        – A local hardware failure corrupts global state
        – A localized hardware error causes global performance bugs
    • Quis Custodiet Ipsos Custodes?
      • Who watches the watchmen?
      • Need a robust (scalable, fault-tolerant) infrastructure for error reporting and recovery orchestration
        – The current approach of out-of-band monitoring and control is too restricted
    • Summary
      • Resilience is a major problem in the exascale era
      • Major to-dos:
        – Need to understand much better the failures in current systems and future trends
        – Need to develop a good application error detection methodology; error correction is desirable, but less critical
        – Need to significantly reduce system errors and reduce system recovery time
        – Need to develop a robust infrastructure for error handling
    • The End