MVAPICH2 and MVAPICH2-X Projects: Latest Developments and Future Plans
In this video from the 2014 HPC Advisory Council Europe Conference, DK Panda from Ohio State University presents: MVAPICH2 and MVAPICH2-X Projects: Latest Developments and Future Plans.

This talk will focus on the latest developments and future plans for the MVAPICH2 and MVAPICH2-X projects. For the MVAPICH2 project, we will focus on scalable and highly-optimized designs for point-to-point communication (two-sided and one-sided MPI-3 RMA), collective communication (blocking and MPI-3 non-blocking), support for GPGPUs and Intel MIC, support for the MPI-T interface, and schemes for fault-tolerance/fault-resilience. For the MVAPICH2-X project, we will focus on efficient support for the hybrid MPI and PGAS (UPC and OpenSHMEM) programming model with a unified runtime.

Watch the video presentation: http://wp.me/p3RLHQ-coF

MVAPICH2 and MVAPICH2-X Projects: Latest Developments and Future Plans Presentation Transcript

  • 1. MVAPICH2 and MVAPICH2-X Projects: Latest Developments and Future Plans
    Dhabaleswar K. (DK) Panda, The Ohio State University
    E-mail: panda@cse.ohio-state.edu, http://www.cse.ohio-state.edu/~panda
    Talk at the HPC Advisory Council European Conference (2014)
  • 2. High-End Computing (HEC): PetaFlop to ExaFlop
    – 100 PFlops expected in 2015; 1 EFlops in 2018?
    – Expected to have an ExaFlop system in 2020-2022!
  • 3. Trends for Commodity Computing Clusters in the Top500 List (http://www.top500.org)
    [Chart: percentage of clusters and number of clusters in the Top500 over time.]
  • 4. Drivers of Modern HPC Cluster Architectures
    – Multi-core processors are ubiquitous
    – InfiniBand is very popular in HPC clusters
    – Accelerators/coprocessors are becoming common in high-end systems
    – Pushing the envelope for Exascale computing
    Multi-core processors; high-performance interconnects (InfiniBand: <1 usec latency, >100 Gbps bandwidth); accelerators/coprocessors (high compute density, high performance/watt, >1 TFlop DP on a chip). Example systems: Tianhe-2 (1), Titan (2), Stampede (6), Tianhe-1A (10).
  • 5. Large-Scale InfiniBand Installations
    – 207 IB clusters (41%) in the November 2013 Top500 list (http://www.top500.org)
    – Installations in the Top 40 (19 systems): 519,640 cores (Stampede) at TACC (7th); 147,456 cores (SuperMUC) in Germany (10th); 74,358 cores (Tsubame 2.5) at Japan/GSIC (11th); 194,616 cores (Cascade) at PNNL (13th); 110,400 cores (Pangea) at France/Total (14th); 96,192 cores (Pleiades) at NASA/Ames (16th); 73,584 cores (Spirit) at USA/Air Force (18th); 77,184 cores (Curie thin nodes) at France/CEA (20th); 120,640 cores (Nebulae) at China/NSCS (21st); 72,288 cores (Yellowstone) at NCAR (22nd); 70,560 cores (Helios) at Japan/IFERC (24th); 138,368 cores (Tera-100) at France/CEA (29th); 60,000-core iDataPlex DX360M4 at Germany/Max-Planck (31st); 53,504 cores (PRIMERGY) at Australia/NCI (32nd); 77,520 cores (Conte) at Purdue University (33rd); 48,896 cores (MareNostrum) at Spain/BSC (34th); 222,072 cores (PRIMERGY) at Japan/Kyushu (36th); 78,660 cores (Lomonosov) in Russia (37th); 137,200 cores (Sunway Blue Light) in China (40th); and many more!
  • 6. Parallel Programming Models Overview
    – Shared Memory Model: SHMEM, DSM
    – Distributed Memory Model: MPI (Message Passing Interface)
    – Partitioned Global Address Space (PGAS): Global Arrays, UPC, Chapel, X10, CAF, ...
    – Programming models provide abstract machine models
    – Models can be mapped on different types of systems: Distributed Shared Memory (DSM), MPI within a node, etc.; logical shared memory across nodes (PGAS)
  • 7. Partitioned Global Address Space (PGAS) Models
    – Key features: simple shared-memory abstractions; lightweight one-sided communication; easier to express irregular communication
    – Different approaches to PGAS:
      Languages: Unified Parallel C (UPC), Co-Array Fortran (CAF), X10
      Libraries: OpenSHMEM, Global Arrays, Chapel
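The one-sided style these models emphasize can be made concrete with a small OpenSHMEM sketch (mine, not from the slides): PE 0 writes directly into a symmetric variable on the last PE with a single put, no matching receive required. It assumes a reasonably recent OpenSHMEM library that provides shmem_init/shmem_my_pe (OpenSHMEM 1.0 spells these start_pes(0) and _my_pe()) and is typically built with the library's oshcc wrapper.

    #include <shmem.h>
    #include <stdio.h>

    int main(void) {
        shmem_init();                        /* OpenSHMEM 1.0 spells this start_pes(0) */
        int me   = shmem_my_pe();
        int npes = shmem_n_pes();

        static int target = -1;              /* symmetric: remotely accessible on every PE */
        int value = me;

        /* One-sided: PE 0 writes its rank into 'target' on the last PE; no receive needed */
        if (me == 0 && npes > 1)
            shmem_int_put(&target, &value, 1, npes - 1);

        shmem_barrier_all();                 /* completes outstanding puts and synchronizes */

        if (me == npes - 1)
            printf("PE %d received %d\n", me, target);

        shmem_finalize();
        return 0;
    }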
  • 8. MPI+PGAS for Exascale Architectures and Applications
    – Hierarchical architectures with multiple address spaces
    – (MPI + PGAS) model: MPI across address spaces, PGAS within an address space
    – MPI is good at moving data between address spaces
    – Within an address space, MPI can interoperate with other shared-memory programming models
    – Can co-exist with OpenMP for offloading computation
    – Applications can have kernels with different communication patterns and can benefit from different models
    – Re-writing complete applications can be a huge effort; port critical kernels to the desired model instead
  • 9. Hybrid (MPI+PGAS) Programming
    – Application sub-kernels can be re-written in MPI/PGAS based on communication characteristics
    – Benefits: best of the distributed computing model and best of the shared memory computing model
    – Exascale Roadmap*: "Hybrid Programming is a practical way to program exascale systems"
    [Diagram: an HPC application whose kernels use either MPI or PGAS.]
    * The International Exascale Software Roadmap, Dongarra, J., Beckman, P., et al., International Journal of High Performance Computer Applications, Volume 25, Number 1, 2011, ISSN 1094-3420
  • 10. Designing Software Libraries for Multi-Petaflop and Exaflop Systems: Challenges
    – Application kernels/applications
    – Programming models: MPI, PGAS (UPC, Global Arrays, OpenSHMEM), CUDA, OpenACC, Cilk, Hadoop, MapReduce, etc.
    – Middleware: a communication library or runtime for programming models, providing point-to-point communication (two-sided and one-sided), collective communication, synchronization and locks, I/O and file systems, and fault tolerance
    – Networking technologies (InfiniBand, 40/100GigE, Aries, BlueGene), multi-/many-core architectures, and accelerators (NVIDIA and MIC)
    – Co-design opportunities and challenges across the various layers: performance, scalability, fault-resilience
  • 11. Challenges in Designing (MPI+X) at Exascale
    – Scalability for million to billion processors: support for highly-efficient inter-node and intra-node communication (both two-sided and one-sided); minimal memory footprint
    – Balancing intra-node and inter-node communication for next-generation multi-core nodes (128-1024 cores/node): multiple end-points per node
    – Support for efficient multi-threading
    – Support for GPGPUs and accelerators
    – Scalable collective communication: offload, non-blocking, topology-aware, power-aware
    – Fault-tolerance/resiliency
    – QoS support for communication and I/O
    – Support for hybrid MPI+PGAS programming (MPI + OpenMP, MPI + UPC, MPI + OpenSHMEM, ...)
  • 12. MVAPICH2/MVAPICH2-X Software
    – High-performance open-source MPI library for InfiniBand, 10GigE/iWARP, and RDMA over Converged Enhanced Ethernet (RoCE)
    – MVAPICH (MPI-1) and MVAPICH2 (MPI-2.2 and MPI-3.0), available since 2002
    – MVAPICH2-X (MPI + PGAS), available since 2012
    – Support for GPGPUs and MIC
    – Used by more than 2,150 organizations in 72 countries; more than 214,000 downloads from the OSU site directly
    – Empowering many Top500 clusters: 7th-ranked 519,640-core cluster (Stampede) at TACC; 11th-ranked 74,358-core cluster (Tsubame 2.5) at Tokyo Institute of Technology; 16th-ranked 96,192-core cluster (Pleiades) at NASA; 75th-ranked 16,896-core cluster (Keeneland) at GaTech; and many others
    – Available with the software stacks of many IB, HSE, and server vendors, including Linux distros (RedHat and SuSE)
    – http://mvapich.cse.ohio-state.edu
    – Partner in the U.S. NSF-TACC Stampede system
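For readers new to the library, the sketch below is an ordinary MPI program; nothing in it is MVAPICH2-specific, which is the point: existing MPI codes run unchanged on MVAPICH2/MVAPICH2-X. The build and launch commands in the comments (mpicc, mpirun_rsh) reflect the usual MVAPICH2 workflow but are site-specific assumptions, not part of the slides.

    #include <mpi.h>
    #include <stdio.h>

    /* Plain MPI "hello world"; runs unchanged on MVAPICH2/MVAPICH2-X.
     * Typical workflow (site-specific, shown only as an illustration):
     *   mpicc hello.c -o hello
     *   mpirun_rsh -np 4 -hostfile hosts ./hello
     */
    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);

        int rank, size, len;
        char name[MPI_MAX_PROCESSOR_NAME];
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        MPI_Get_processor_name(name, &len);

        printf("rank %d of %d on %s\n", rank, size, name);

        MPI_Finalize();
        return 0;
    }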
  • 13. MVAPICH2 2.0 GA and MVAPICH2-X 2.0 GA
    – Released on 06/20/14
    – Major features and enhancements: based on MPICH-3.1; MPI-3 RMA support; support for non-blocking collectives; MPI-T support; CMA support is the default for intra-node communication; optimization of collectives with CMA support; large-message transfer support; reduced memory footprint; improved job start-up time; optimization and tuning of collectives for multiple platforms; updated hwloc to version 1.9
    – MVAPICH2-X 2.0 GA supports hybrid MPI + PGAS (UPC and OpenSHMEM) programming models: based on MVAPICH2 2.0 GA including MPI-3 features; compliant with UPC 2.18.0 and OpenSHMEM v1.0f
  • 14. Overview of a Few Challenges Being Addressed by MVAPICH2/MVAPICH2-X for Exascale
    – Scalability for million to billion processors: support for highly-efficient inter-node and intra-node communication (both two-sided and one-sided); minimal memory footprint; support for the Dynamically Connected (DC) transport
    – Collective communication: hardware-multicast-based; offload and non-blocking; topology-aware; power-aware
    – Support for GPGPUs
    – Support for Intel MICs
    – Fault-tolerance
    – Hybrid MPI+PGAS programming (MPI + OpenSHMEM, MPI + UPC, ...) with a unified runtime
  • 15. One-Way Latency: MPI over IB with MVAPICH2
    [Charts: small- and large-message one-way latency vs. message size for Qlogic-DDR, Qlogic-QDR, ConnectX-DDR, ConnectX2-PCIe2-QDR, ConnectX3-PCIe3-FDR, Sandy-ConnectIB-DualFDR, and Ivy-ConnectIB-DualFDR; small-message latencies range from 0.99 us to 1.82 us.]
    Platforms: DDR, QDR - 2.4 GHz quad-core (Westmere) Intel, PCI Gen2, with IB switch; FDR - 2.6 GHz octa-core (SandyBridge) Intel, PCI Gen3, with IB switch; ConnectIB dual FDR - 2.6 GHz octa-core (SandyBridge) and 2.8 GHz deca-core (IvyBridge) Intel, PCI Gen3, with IB switch
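As background on how one-way latency numbers like these are commonly obtained, here is a minimal ping-pong sketch (half of the averaged round-trip time for small messages). It is an illustration of the measurement pattern only, not the OSU micro-benchmark code used for the slide.

    #include <mpi.h>
    #include <stdio.h>

    /* Minimal ping-pong: one-way latency is approximated as half the averaged
     * round-trip time between two ranks. Message size and iteration count are
     * illustrative values. */
    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);

        int rank, nprocs;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
        if (nprocs < 2) MPI_Abort(MPI_COMM_WORLD, 1);   /* needs two active ranks */

        const int iters = 10000, msg_bytes = 8;
        char buf[8] = {0};

        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (int i = 0; i < iters; i++) {
            if (rank == 0) {
                MPI_Send(buf, msg_bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, msg_bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(buf, msg_bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                MPI_Send(buf, msg_bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        double t1 = MPI_Wtime();

        if (rank == 0)
            printf("one-way latency: %.2f us\n", (t1 - t0) * 1e6 / (2.0 * iters));

        MPI_Finalize();
        return 0;
    }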
  • 16. Bandwidth: MPI over IB with MVAPICH2
    [Charts: unidirectional and bidirectional bandwidth vs. message size for the same set of adapters; peak unidirectional bandwidths range from about 1,706 MB/s to 12,810 MB/s, and peak bidirectional bandwidths from about 3,341 MB/s to 24,727 MB/s.]
    Platforms: same as on the previous slide (Westmere/SandyBridge/IvyBridge hosts with DDR, QDR, FDR, and dual-port Connect-IB FDR adapters).
  • 17. MVAPICH2 Two-Sided Intra-Node Performance (Shared-Memory and Kernel-Based Zero-Copy Support: LiMIC and CMA)
    Latest MVAPICH2 2.0 on Intel Ivy Bridge.
    [Charts: intra-node latency (0.18 us intra-socket, 0.45 us inter-socket) and intra-/inter-socket bandwidth for the CMA, shared-memory, and LiMIC channels, with peak bandwidths of 13,749-14,250 MB/s.]
  • 18. MPI-3 RMA Get/Put with Flush Performance
    Latest MVAPICH2 2.0 on Intel Sandy Bridge with Connect-IB (single port).
    [Charts: inter-node Get/Put small-message latencies of 2.04 us and 1.56 us, with bandwidths up to about 6,881/6,876 MB/s; intra-socket Get/Put latency down to 0.08 us, with bandwidths up to about 14,926/15,364 MB/s.]
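The Get/Put-with-flush pattern measured above can be sketched in standard MPI-3 RMA as follows (my illustration, not MVAPICH2 source): a passive-target epoch, an MPI_Put, and an MPI_Win_flush that guarantees remote completion before the epoch ends. Buffer contents and ranks are illustrative.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        double *buf;
        MPI_Win win;
        /* One double of window memory on every process */
        MPI_Win_allocate(sizeof(double), sizeof(double), MPI_INFO_NULL,
                         MPI_COMM_WORLD, &buf, &win);
        buf[0] = -1.0;

        MPI_Win_lock_all(0, win);                 /* passive-target epoch on all ranks */
        if (rank == 0 && size > 1) {
            double val = 3.14;
            MPI_Put(&val, 1, MPI_DOUBLE, 1, 0, 1, MPI_DOUBLE, win);
            MPI_Win_flush(1, win);                /* put is remotely complete here */
        }
        MPI_Win_unlock_all(win);

        /* After the epoch plus a barrier, the target can read the value
         * (memory-model details aside, this suffices for allocated windows) */
        MPI_Barrier(MPI_COMM_WORLD);
        if (rank == 1) printf("rank 1 sees %f\n", buf[0]);

        MPI_Win_free(&win);
        MPI_Finalize();
        return 0;
    }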
  • 19. Minimizing Memory Footprint with XRC and Hybrid Mode
    – Memory usage for 32K processes with 8 cores per node can be 54 MB/process (for connections); NAMD performance improves when there is frequent communication to many peers
    – Both UD and RC/XRC have benefits; the hybrid mode gives the best of both
    – Available since MVAPICH2 1.7 as an integrated interface
    – Runtime parameters: RC is the default; UD: MV2_USE_ONLY_UD=1; Hybrid: MV2_HYBRID_ENABLE_THRESHOLD=1
    [Charts: connection memory per process for MVAPICH2-RC vs. MVAPICH2-XRC; normalized NAMD time on 1024 cores for the apoa1, er-gre, f1atpase, and jac datasets; and UD/Hybrid/RC time vs. process count, with 26-40% improvements highlighted.]
    M. Koop, J. Sridhar and D. K. Panda, "Scalable MPI Design over InfiniBand using eXtended Reliable Connection," Cluster '08
  • 20. Dynamic Connected (DC) Transport in MVAPICH2
    – Constant connection cost (one QP for any peer); full feature set (RDMA, atomics, etc.)
    – Separate objects for send (DC Initiator) and receive (DC Target); a DC Target is identified by its "DCT number"; messages are routed with (DCT number, LID); the same "DC key" is required to enable communication
    – DCT support is available publicly in Mellanox OFED 2.2.0
    – Initial study done in MVAPICH2; more details in Research Papers 06, Wednesday June 25, 12:00-12:30 PM
    [Charts: connection memory for Alltoall (RC, DC-Pool, UD, XRC at 80-640 processes) and normalized NAMD apoa1 execution time (RC, DC-Pool, UD, XRC at 160-620 processes).]
  • 21. Overview of a Few Challenges Being Addressed by MVAPICH2/MVAPICH2-X for Exascale (agenda recap; same outline as slide 14)
  • 22. Hardware-Multicast-Aware MPI_Bcast on Stampede
    ConnectX-3 FDR (54 Gbps): 2.7 GHz dual octa-core (SandyBridge) Intel, PCI Gen3, with a Mellanox IB FDR switch.
    [Charts: default vs. multicast broadcast latency for small and large messages at 102,400 cores, and for 16-byte and 32 KByte messages across node counts.]
  • 23. Application Benefits with Non-Blocking Collectives Based on CX-3 Collective Offload
    – Modified P3DFFT with Offload-Alltoall does up to 17% better than the default version (128 processes)
    – Modified HPL with Offload-Bcast does up to 4.5% better than the default version (512 processes)
    – Modified preconditioned conjugate gradient (PCG) solver with Offload-Allreduce does up to 21.8% better than the default version
    References: K. Kandalla et al., "High-Performance and Scalable Non-Blocking All-to-All with Collective Offload on InfiniBand Clusters: A Study with Parallel 3D FFT," ISC 2011; K. Kandalla et al., "Designing Non-Blocking Broadcast with Collective Offload on InfiniBand Clusters: A Case Study with HPL," HotI 2011; K. Kandalla et al., "Designing Non-Blocking Allreduce with Collective Offload on InfiniBand Clusters: A Case Study with Conjugate Gradient Solvers," IPDPS '12; K. Kandalla, A. Buluc, H. Subramoni, K. Tomko, J. Vienne, L. Oliker, and D. K. Panda, "Can Network-Offload Based Non-Blocking Neighborhood MPI Collectives Improve Communication Overheads of Irregular Graph Algorithms?," IWPAPS '12
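The application pattern these offload designs accelerate is overlap of a non-blocking collective with independent computation. A minimal MPI-3 sketch of that pattern (mine, not the modified P3DFFT/HPL code) looks like this; counts and buffers are illustrative.

    #include <mpi.h>
    #include <stdlib.h>

    /* Overlap an all-to-all exchange with computation that does not depend on it */
    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int size;
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        const int n = 1024;                              /* elements per peer (example) */
        double *sendbuf = malloc((size_t)n * size * sizeof(double));
        double *recvbuf = malloc((size_t)n * size * sizeof(double));
        for (int i = 0; i < n * size; i++) sendbuf[i] = i;

        MPI_Request req;
        MPI_Ialltoall(sendbuf, n, MPI_DOUBLE, recvbuf, n, MPI_DOUBLE,
                      MPI_COMM_WORLD, &req);             /* returns immediately */

        /* ... computation independent of recvbuf proceeds here, overlapping the exchange ... */

        MPI_Wait(&req, MPI_STATUS_IGNORE);               /* exchange complete */

        free(sendbuf); free(recvbuf);
        MPI_Finalize();
        return 0;
    }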
  • 24. Network-Topology-Aware Placement of Processes
    – Can we design a highly scalable network-topology detection service for IB?
    – How do we design the MPI communication library in a network-topology-aware manner to efficiently leverage the topology information generated by our service?
    – What are the potential benefits of using a network-topology-aware MPI library on the performance of parallel scientific applications?
    – Reduces network topology discovery time from O(N^2) to O(N) in the number of hosts
    – 15% improvement in MILC execution time @ 2048 cores; 15% improvement in Hypre execution time @ 1024 cores
    [Charts: overall performance and split-up of physical communication for MILC on Ranger; default vs. topology-aware placement for a 2048-core run across system sizes.]
    H. Subramoni, S. Potluri, K. Kandalla, B. Barth, J. Vienne, J. Keasler, K. Tomko, K. Schulz, A. Moody, and D. K. Panda, "Design of a Scalable InfiniBand Topology Service to Enable Network-Topology-Aware Placement of Processes," SC'12. Best Paper and Best Student Paper finalist
  • 25. Power and Energy Savings with Power-Aware Collectives
    [Charts: performance and power comparison for MPI_Alltoall with 64 processes on 8 nodes, and CPMD application performance and power savings, with 5% and 32% deltas highlighted.]
    K. Kandalla, E. P. Mancini, S. Sur and D. K. Panda, "Designing Power Aware Collective Communication Algorithms for InfiniBand Clusters," ICPP '10
  • 26. Overview of a Few Challenges Being Addressed by MVAPICH2/MVAPICH2-X for Exascale (agenda recap; same outline as slide 14)
  • 27. MVAPICH2-GPU: CUDA-Aware MPI
    – Before CUDA 4: additional copies; low performance and low productivity
    – After CUDA 4: host-based pipeline using Unified Virtual Addressing, pipelining CUDA copies with IB transfers; high performance and high productivity
    – After CUDA 5.5: GPUDirect-RDMA support for GPU-to-GPU direct transfers, bypassing host memory, with a hybrid design to avoid PCIe bottlenecks
  • 28. MVAPICH2 1.8, 1.9, 2.0 Releases
    – Support for MPI communication from NVIDIA GPU device memory
    – High-performance RDMA-based inter-node point-to-point communication (GPU-GPU, GPU-Host, and Host-GPU)
    – High-performance intra-node point-to-point communication for multi-GPU-adapter nodes (GPU-GPU, GPU-Host, and Host-GPU)
    – Takes advantage of CUDA IPC (available since CUDA 4.1) for intra-node communication with multiple GPU adapters per node
    – Optimized and tuned collectives for GPU device buffers
    – MPI datatype support for point-to-point and collective communication from GPU device buffers
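From the application's point of view, "MPI communication from GPU device memory" means a device pointer can be handed straight to MPI_Send/MPI_Recv. The sketch below illustrates that usage under the assumption of a CUDA-aware MPI build; the buffer size, tag, and the MV2_USE_CUDA note in the comment are assumptions to check against your installation's user guide.

    #include <mpi.h>
    #include <cuda_runtime.h>

    /* CUDA-aware MPI sketch: device pointers passed directly to MPI calls.
     * Assumes an MPI library built with CUDA support; MVAPICH2 typically also
     * expects MV2_USE_CUDA=1 to be set at run time (verify for your version). */
    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        const int n = 1 << 20;
        double *d_buf;
        cudaMalloc((void **)&d_buf, n * sizeof(double));    /* GPU device memory */

        if (size > 1) {
            if (rank == 0) {
                cudaMemset(d_buf, 0, n * sizeof(double));
                MPI_Send(d_buf, n, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);   /* device pointer */
            } else if (rank == 1) {
                MPI_Recv(d_buf, n, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            }
        }

        cudaFree(d_buf);
        MPI_Finalize();
        return 0;
    }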
  • 29. GPUDirect RDMA (GDR) with CUDA
    – Hybrid design using GPUDirect RDMA together with host-based pipelining; alleviates P2P bandwidth bottlenecks on SandyBridge and IvyBridge (SNB E5-2670: P2P write 5.2 GB/s, P2P read < 1.0 GB/s; IVB E5-2680 v2: P2P write 6.4 GB/s, P2P read 3.5 GB/s)
    – Support for communication using multi-rail
    – Support for Mellanox Connect-IB and ConnectX VPI adapters
    – Support for RoCE with Mellanox ConnectX VPI adapters
    S. Potluri, K. Hamidouche, A. Venkatesh, D. Bureddy and D. K. Panda, "Efficient Inter-node MPI Communication using GPUDirect RDMA for InfiniBand Clusters with NVIDIA GPUs," Int'l Conference on Parallel Processing (ICPP '13)
  • 30. Performance of MVAPICH2 with GPUDirect-RDMA: Latency
    GPU-GPU inter-node MPI latency, based on MVAPICH2-2.0b. Intel Ivy Bridge (E5-2680 v2) node with 20 cores, NVIDIA Tesla K40c GPU, Mellanox Connect-IB dual-FDR HCA, CUDA 5.5, Mellanox OFED 2.0 with the GPUDirect-RDMA plug-in.
    [Charts: 1-rail, 2-rail, 1-rail-GDR, and 2-rail-GDR latency; GDR improves small-message latency by up to 67% (down to 5.49 usec) and large-message latency by about 10%.]
  • 31. Performance of MVAPICH2 with GPUDirect-RDMA: Bandwidth
    GPU-GPU inter-node MPI unidirectional bandwidth, same platform as above.
    [Charts: GDR delivers up to 5x higher small-message bandwidth and reaches 9.8 GB/s for large messages.]
  • 32. Performance of MVAPICH2 with GPUDirect-RDMA: Bi-Bandwidth
    GPU-GPU inter-node MPI bidirectional bandwidth, same platform as above.
    [Charts: GDR delivers up to 4.3x higher small-message bi-bandwidth and a 19% gain for large messages, reaching 19 GB/s.]
  • 33. Using the MVAPICH2-GPUDirect Version
    – MVAPICH2-2.0b with GDR support can be downloaded from https://mvapich.cse.ohio-state.edu/download/mvapich2gdr/
    – System software requirements: Mellanox OFED 2.1 or later; NVIDIA driver 331.20 or later; NVIDIA CUDA Toolkit 5.5 or later; plugin for GPUDirect RDMA (http://www.mellanox.com/page/products_dyn?product_family=116)
    – Has optimized designs for point-to-point communication using GDR
    – Contact the MVAPICH help list with any questions related to the package: mvapich-help@cse.ohio-state.edu
  • 34. New Design of MVAPICH2 with GPUDirect-RDMA
    GPU-GPU inter-node MPI latency. Intel Ivy Bridge (E5-2630 v2) node with 12 cores, NVIDIA Tesla K20c GPU, Mellanox Connect-IB dual-FDR HCA, CUDA 6.0, Mellanox OFED 2.1 with the GPUDirect-RDMA plug-in.
    [Chart: small-message latency drops from 7.3 usec (MV2-GDR 2.0b) to 5.0 usec with the new loopback design and 3.5 usec with the new fast-copy design, a 52% improvement.]
  • 35. Application-Level Evaluation (HOOMD-blue)
    – Platform: Wilkes (Intel Ivy Bridge + NVIDIA Tesla K20c + Mellanox Connect-IB)
    – Strong scaling (fixed 64K particles): Loopback and Fastcopy get up to 45% and 48% improvement for 32 GPUs
    – Weak scaling (fixed 2K particles per GPU): Loopback and Fastcopy get up to 54% and 56% improvement for 16 GPUs
    [Charts: average TPS vs. number of GPU nodes for MV2-2.0b-GDR, MV2-NewGDR-Loopback, and MV2-NewGDR-Fastcopy.]
  • 36. MPI-3 RMA Support with GPUDirect RDMA
    – MPI-3 RMA provides flexible synchronization and completion primitives
    – Intel Ivy Bridge (E5-2630 v2) node with 12 cores, NVIDIA Tesla K20c GPU, Mellanox Connect-IB HCA, CUDA 6.0, Mellanox OFED 2.1 with the GPUDirect-RDMA plugin
    [Charts: small-message latency and message rate for MV2-GDR 2.0b, MV2-GDR-New (Send/Recv), and MV2-GDR-New (Put+Flush); Put+Flush reaches 2.8 usec, a 39% improvement.]
    – A new release with the advanced designs is coming soon
  • 37. Overview of a Few Challenges Being Addressed by MVAPICH2/MVAPICH2-X for Exascale (agenda recap; same outline as slide 14)
  • 38. MPI Applications on MIC Clusters
    – Flexibility in launching MPI jobs on clusters with Xeon Phi
    – Modes range from multi-core-centric (Xeon) to many-core-centric (Xeon Phi): Host-only; Offload (/reverse offload), where the host MPI program offloads computation to the Xeon Phi; Symmetric, with MPI processes on both the host and the Xeon Phi; Coprocessor-only
  • 39. Data Movement on Intel Xeon Phi Clusters
    – MICs are connected as PCIe devices: flexibility, but complexity
    – Critical for runtimes to optimize data movement while hiding the complexity
    – Communication paths include: 1. intra-socket; 2. inter-socket; 3. inter-node; 4. intra-MIC; 5. intra-socket MIC-MIC; 6. inter-socket MIC-MIC; 7. inter-node MIC-MIC; 8. intra-socket MIC-Host; 9. inter-socket MIC-Host; 10. inter-node MIC-Host; 11. inter-node MIC-MIC with the IB adapter on a remote socket; and more
  • 40. MVAPICH2-MIC Design for Clusters with IB and MIC
    – Intra-node communication: offload mode, coprocessor-only mode, symmetric mode
    – Inter-node communication: coprocessor-only mode, symmetric mode
    – Multi-MIC node configurations
  • 41. Direct-IB Communication
    – IB-verbs-based communication from the Xeon Phi; no porting effort for MPI libraries with IB support
    – Data movement between the IB HCA and the Xeon Phi is implemented as peer-to-peer transfers
    [Table: intra-socket and inter-socket IB writes to and IB reads from the Xeon Phi on SandyBridge (E5-2670) and IvyBridge (E5-2680 v2) hosts with an IB FDR HCA (MT4099); achieved bandwidths range from 247 MB/s (4%) to 6,396 MB/s (100%) of the 6,397 MB/s peak IB FDR bandwidth, with P2P reads far below P2P writes.]
  • 42. MIC-Remote-MIC P2P Communication
    [Charts: intra-socket and inter-socket MIC-to-remote-MIC latency (large messages) and bandwidth; peak bandwidths of 5,236 MB/s (intra-socket) and 5,594 MB/s (inter-socket).]
  • 43. Inter-Node Symmetric Mode: P3DFFT Performance
    – Application performance in symmetric mode is closely tied to the load balancing of computation between the host and the Xeon Phi
    – Try different decompositions and threading levels to identify the best settings for the target application
    [Charts: a 3D-stencil communication kernel (8 nodes) and P3DFFT 512x512x4K (32 nodes); MV2-INTRA-OPT improves over MV2-MIC by up to 50.2% and 16.2% in the two cases.]
    S. Potluri, D. Bureddy, K. Hamidouche, A. Venkatesh, K. Kandalla, H. Subramoni and D. K. Panda, "MVAPICH-PRISM: A Proxy-based Communication Framework using InfiniBand and SCIF for Intel MIC Clusters," Int'l Conference on Supercomputing (SC '13), November 2013
  • 44. Optimized MPI Collectives for MIC Clusters (Allgather & Alltoall)
    [Charts: 32-node Allgather (16H + 16M) small-message latency, 32-node Allgather (8H + 8M) large-message latency, and 32-node Alltoall (8H + 8M) large-message latency for MV2-MIC vs. MV2-MIC-Opt, with improvements of up to 76%, 58%, and 55% highlighted; P3DFFT communication and computation time on 32 nodes (8H + 8M, size 2Kx2Kx1K).]
    A. Venkatesh, S. Potluri, R. Rajachandrasekar, M. Luo, K. Hamidouche and D. K. Panda, "High Performance Alltoall and Allgather Designs for InfiniBand MIC Clusters," IPDPS'14, May 2014
  • 45. Overview of a Few Challenges Being Addressed by MVAPICH2/MVAPICH2-X for Exascale (agenda recap; same outline as slide 14)
  • 46. Fault Tolerance
    – Component failures are common in large-scale clusters and impose the need for reliability and fault tolerance
    – Multiple challenges: Checkpoint-Restart vs. process migration; benefits of Scalable Checkpoint-Restart (SCR) support, both application-guided and application-transparent
  • 47. Checkpoint-Restart vs. Process Migration
    – Job-wide checkpoint/restart is not scalable
    – A job-pause and process-migration framework can deliver proactive fault tolerance, using low-overhead failure prediction with IPMI
    – Also allows cluster-wide load balancing by means of job compaction
    [Charts: time to migrate 8 MPI processes (annotations 10.7X, 2.3X, 4.3X) and LU Class C benchmark execution time at 64 processes for migration without RDMA vs. checkpoint-restart to ext3 and PVFS, broken into job stall, checkpoint/migration, resume, and restart phases (annotations 2.03x and 4.49x).]
    X. Ouyang, R. Rajachandrasekar, X. Besseron, D. K. Panda, "High Performance Pipelined Process Migration with RDMA," CCGrid 2011
    X. Ouyang, S. Marcarelli, R. Rajachandrasekar and D. K. Panda, "RDMA-Based Job Migration Framework for MPI over InfiniBand," Cluster 2010
  • 48. Multi-Level Checkpointing with ScalableCR (SCR)
    – LLNL's Scalable Checkpoint/Restart library
    – Can be used for application-guided and application-transparent checkpointing
    – Effective utilization of the storage hierarchy, with checkpoint cost and resiliency increasing from local storage to the parallel file system:
      Local: store checkpoint data on the node's local storage, e.g., local disk or ramdisk
      Partner: write to local storage and to a partner node
      XOR: write the file to local storage while small sets of nodes collectively compute and store parity redundancy data (RAID-5)
      Stable storage: write to the parallel file system
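For the application-guided case, the usual shape of an SCR-enabled code is a short wrapper around the application's own checkpoint writes. The sketch below follows the classic SCR C API (SCR_Init, SCR_Need_checkpoint, SCR_Route_file, SCR_Complete_checkpoint); the file name, payload, and loop structure are illustrative assumptions, not code from the slides.

    #include <mpi.h>
    #include <stdio.h>
    #include "scr.h"

    /* Application-guided checkpointing with SCR: SCR decides when to checkpoint
     * and where (local/partner/XOR/stable storage); the application just writes
     * its own file to the path SCR routes it to. */
    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        SCR_Init();                               /* after MPI_Init */

        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        for (int step = 0; step < 100; step++) {
            /* ... application time step ... */

            int need = 0;
            SCR_Need_checkpoint(&need);           /* let SCR decide the frequency */
            if (!need) continue;

            SCR_Start_checkpoint();
            char name[256], path[SCR_MAX_FILENAME];   /* SCR_MAX_FILENAME from scr.h */
            snprintf(name, sizeof(name), "ckpt.%d.step%d", rank, step);
            SCR_Route_file(name, path);           /* SCR picks the storage level */

            FILE *f = fopen(path, "w");
            int valid = (f != NULL);
            if (f) { fwrite(&step, sizeof(step), 1, f); fclose(f); }
            SCR_Complete_checkpoint(valid);
        }

        SCR_Finalize();                           /* before MPI_Finalize */
        MPI_Finalize();
        return 0;
    }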
  • 49. Application-Guided Multi-Level Checkpointing
    – Checkpoint-writing phase times of a representative SCR-enabled MPI application
    – 512 MPI processes (8 procs/node), approx. 51 GB checkpoints
    [Chart: checkpoint writing time for the parallel file system vs. MVAPICH2+SCR with the Local, Partner, and XOR schemes.]
  • 50. Transparent Multi-Level Checkpointing
    – ENZO cosmology application, radiation-transport workload, using MVAPICH2's CR protocol instead of the application's built-in CR mechanism
    – 512 MPI processes running on the Stampede supercomputer at TACC
    – Approx. 12.8 GB checkpoints: ~2.5X reduction in checkpoint I/O costs
  • 51. Overview of a Few Challenges Being Addressed by MVAPICH2/MVAPICH2-X for Exascale (agenda recap; same outline as slide 14)
  • 52. MVAPICH2-X for Hybrid MPI + PGAS Applications
    – Unified communication runtime for MPI, UPC, and OpenSHMEM, available from MVAPICH2-X 1.9 onwards (http://mvapich.cse.ohio-state.edu)
    – Feature highlights: supports MPI(+OpenMP), OpenSHMEM, UPC, MPI(+OpenMP) + OpenSHMEM, and MPI(+OpenMP) + UPC; MPI-3 compliant, OpenSHMEM v1.0 standard compliant, UPC v1.2 standard compliant; scalable inter-node and intra-node communication, both point-to-point and collectives
    [Diagram: MPI, OpenSHMEM, UPC, and hybrid (MPI + PGAS) applications issue MPI, OpenSHMEM, and UPC calls into the unified MVAPICH2-X runtime, which runs over InfiniBand, RoCE, and iWARP.]
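The kind of program a unified runtime is meant to support mixes MPI collectives and OpenSHMEM one-sided operations in the same processes. Here is a minimal hybrid sketch of that pattern (mine, not from the slides); initialization order and MPI/OpenSHMEM interoperability are implementation-specific, so treat it as an illustration rather than a portable recipe, and note the shmem_init vs. start_pes caveat from the earlier example.

    #include <mpi.h>
    #include <shmem.h>
    #include <stdio.h>

    /* Hybrid MPI + OpenSHMEM: one-sided puts and an MPI collective in one job */
    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        shmem_init();                        /* both runtimes active in the same processes */

        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        int npes = shmem_n_pes();

        static long flag = 0;                /* symmetric (remotely accessible) variable */
        long one = 1;
        if (rank != 0)
            shmem_long_put(&flag, &one, 1, 0);   /* one-sided put to PE 0 */
        shmem_barrier_all();

        int sum = 0;                         /* two-sided/collective part handled by MPI */
        MPI_Allreduce(&rank, &sum, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);

        if (rank == 0)
            printf("flag=%ld, rank sum=%d over %d PEs\n", flag, sum, npes);

        shmem_finalize();
        MPI_Finalize();
        return 0;
    }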
  • 53. OpenSHMEM Collective Communication Performance
    [Charts: Reduce, Broadcast, and Collect latency at 1,024 processes and Barrier latency vs. process count for MV2X-SHMEM, Scalable-SHMEM, and OMPI-SHMEM; MV2X-SHMEM shows improvements of up to 180X, 18X, 12X, and 3X across the four charts.]
  • 54. OpenSHMEM Application Evaluation
    – Heat Image and Daxpy kernels at 256 and 512 processes; improved performance for OMPI-SHMEM and Scalable-SHMEM with FCA
    – Execution time for 2DHeat Image at 512 processes (sec): UH-SHMEM 523, OMPI-SHMEM 214, Scalable-SHMEM 193, MV2X-SHMEM 169
    – Execution time for DAXPY at 512 processes (sec): UH-SHMEM 57, OMPI-SHMEM 56, Scalable-SHMEM 9.2, MV2X-SHMEM 5.2
    J. Jose, J. Zhang, A. Venkatesh, S. Potluri, and D. K. Panda, "A Comprehensive Performance Evaluation of OpenSHMEM Libraries on InfiniBand Clusters," OpenSHMEM Workshop (OpenSHMEM'14), March 2014
  • 55. Hybrid MPI+OpenSHMEM Graph500 Design
    – Performance of the hybrid (MPI+OpenSHMEM) Graph500 design, measured as execution time, strong scalability, and weak scalability (traversed edges per second, TEPS)
    – 8,192 processes: 2.4X improvement over MPI-CSR, 7.6X over MPI-Simple
    – 16,384 processes: 1.5X improvement over MPI-CSR, 13X over MPI-Simple
    J. Jose, S. Potluri, K. Tomko and D. K. Panda, "Designing Scalable Graph500 Benchmark with Hybrid MPI+OpenSHMEM Programming Models," International Supercomputing Conference (ISC'13), June 2013
    J. Jose, K. Kandalla, M. Luo and D. K. Panda, "Supporting Hybrid MPI and OpenSHMEM over InfiniBand: Design and Performance Evaluation," Int'l Conference on Parallel Processing (ICPP '12), September 2012
  • 56. Hybrid MPI+OpenSHMEM Sort Application
    – Execution time: with a 4 TB input at 4,096 cores, MPI takes 2,408 seconds vs. 1,172 seconds for the hybrid version, a 51% improvement over the MPI-based design
    – Strong scalability (constant input size of 500 GB): at 4,096 cores, MPI sorts at 0.16 TB/min vs. 0.36 TB/min for the hybrid version, a 55% improvement
  • 57. MVAPICH2/MVAPICH2-X: Plans for Exascale
    – Performance and memory scalability toward 900K-1M cores: enhanced design with the Dynamically Connected Transport (DCT) service and Connect-IB
    – Enhanced optimization for GPGPU and coprocessor support: extending the GPGPU support (GPUDirect RDMA) with CUDA 6.5 and beyond; support for Intel MIC (Knights Landing)
    – Taking advantage of the collective offload framework, including support for non-blocking collectives (MPI 3.0)
    – RMA support (as in MPI 3.0)
    – Extended topology-aware collectives
    – Power-aware collectives
    – Support for the MPI Tools interface (as in MPI 3.0)
    – Checkpoint-restart and migration support with in-memory checkpointing
    – Hybrid MPI+PGAS programming support with GPGPUs and accelerators
  • 58. Concluding Remarks
    – InfiniBand with its RDMA feature is gaining momentum in HPC systems, delivering the best performance and seeing growing usage
    – As the HPC community moves to Exascale, new solutions are needed in the MPI and hybrid MPI+PGAS stacks to support GPUs and accelerators
    – Demonstrated how such solutions can be designed with MVAPICH2/MVAPICH2-X, and their performance benefits
    – Such designs will allow application scientists and engineers to take advantage of upcoming Exascale systems
  • 59. Funding Acknowledgments
    Funding support and equipment support from multiple agencies and vendors (shown as logos on the slide).
  • 60. Personnel Acknowledgments
    – Current Students: S. Chakraborthy (Ph.D.), N. Islam (Ph.D.), J. Jose (Ph.D.), M. Li (Ph.D.), M. Rahman (Ph.D.), R. Rajachandrasekhar (Ph.D.), D. Shankar (Ph.D.), R. Shir (Ph.D.), A. Venkatesh (Ph.D.), J. Zhang (Ph.D.)
    – Past Students: P. Balaji (Ph.D.), D. Buntinas (Ph.D.), S. Bhagvat (M.S.), L. Chai (Ph.D.), B. Chandrasekharan (M.S.), N. Dandapanthula (M.S.), V. Dhanraj (M.S.), T. Gangadharappa (M.S.), K. Gopalakrishnan (M.S.), W. Huang (Ph.D.), W. Jiang (M.S.), S. Kini (M.S.), M. Koop (Ph.D.), R. Kumar (M.S.), S. Krishnamoorthy (M.S.), K. Kandalla (Ph.D.), P. Lai (M.S.), J. Liu (Ph.D.), M. Luo (Ph.D.), A. Mamidala (Ph.D.), G. Marsh (M.S.), V. Meshram (M.S.), S. Naravula (Ph.D.), R. Noronha (Ph.D.), X. Ouyang (Ph.D.), S. Pai (M.S.), S. Potluri (Ph.D.), G. Santhanaraman (Ph.D.), A. Singh (Ph.D.), J. Sridhar (M.S.), S. Sur (Ph.D.), H. Subramoni (Ph.D.), K. Vaidyanathan (Ph.D.), A. Vishnu (Ph.D.), J. Wu (Ph.D.), W. Yu (Ph.D.)
    – Past Research Scientist: S. Sur
    – Current Post-Docs: X. Lu, K. Hamidouche
    – Past Post-Docs: H. Wang, X. Besseron, H.-W. Jin, M. Luo, E. Mancini, S. Marcarelli, J. Vienne
    – Current Programmers: M. Arnold, J. Perkins
    – Past Programmers: D. Bureddy
    – Current Senior Research Associate: H. Subramoni
  • 61. Upcoming 2nd Annual MVAPICH User Group (MUG) Meeting
    – August 25-27, 2014; Columbus, Ohio, USA
    – Keynote talks, invited talks, contributed presentations
    – Half-day tutorial on MVAPICH2, MVAPICH2-X, and MVAPICH2-GDR optimization and tuning
    – Speakers (confirmed so far): Keynote: Dan Stanzione (TACC); Keynote: Darren Kerbyson (PNNL); Sadaf Alam (CSCS, Switzerland); Onur Celebioglu (Dell); Jens Glaser (Univ. of Michigan); Jeff Hammond (Intel); Adam Moody (LLNL); David Race (Cray); Davide Rossetti (NVIDIA); Gilad Shainer (Mellanox); Sayantan Sur (Intel); Mahidhar Tatineni (SDSC); Jerome Vienne (TACC)
    – Call for contributed presentations (deadline: July 1, 2014)
    – More details at: http://mug.mvapich.cse.ohio-state.edu
  • 62. Multiple Positions Available in My Group
    – Looking for bright and enthusiastic personnel to join as: visiting scientists, post-doctoral researchers, PhD students, MPI programmer/software engineer, Hadoop/Big Data programmer/software engineer
    – If interested, please contact me at this conference and/or send an e-mail to panda@cse.ohio-state.edu
  • 63. Pointers
    http://mvapich.cse.ohio-state.edu
    http://hadoop-rdma.cse.ohio-state.edu
    http://nowlab.cse.ohio-state.edu
    http://www.cse.ohio-state.edu/~panda
    panda@cse.ohio-state.edu