
Penguin Computing Designing and Deploying End to End HPC and AI Solutions


Learn what’s working for organizations building AI systems based on lessons learned from Penguin Computing, NVIDIA’s HPC 2017 Partner of the Year.

Published in: Technology


  1. Best Practices in Designing and Deploying End-to-End HPC and AI Solutions. Kevin Tubbs, Ph.D., Sr. Director, Advanced Solutions Group. GTC 2018. © 2018 Penguin Computing. All rights reserved.
  2. About Penguin Computing
     • 20-year-old, U.S.-based global provider of high-performance computing (HPC) and data center solutions with more than 2,500 customers in 40 countries, across 8 major vertical markets
     • Comprehensive portfolio of Linux servers and software; integrated, turn-key clusters; enterprise-grade storage; and bare-metal HPC on cloud, as well as expert HPC and AI services, financing, and top-rated support
  3. Growth Has Resulted in Gaps in AI Design
     The explosion of data has driven huge growth in AI systems, but many designs fail to address the most critical challenges:
     • Balancing researchers' workflow and technical compute needs
     • Sizing AI infrastructure to optimize utilization
     • Balancing data workflows and performance
     • Deploying AI infrastructure efficiently
  4. Ways to Address Gaps
     • Flexible hardware and software are required to meet research and workload needs
     • Well-designed, balanced systems are required to ensure data and compute meet overall production needs
     • Fast, optimized storage is necessary to deliver the large amounts of data required to improve accuracy
     • Deploying AI infrastructure efficiently requires balancing power, cooling, and performance
  5. Essentials on How to Achieve These Objectives
     Designs and products must take into account these essential components of AI system design:
     • Managing the software ecosystem, focusing on orchestration and workload portability
     • Accurately sizing AI infrastructure
     • Balancing overall data workflows and performance
     • Deploying and scaling AI infrastructure efficiently
  6. Managing the AI Software Ecosystem
     Managing the AI software ecosystem for infrastructure designs must balance researchers' workflow and technical compute needs by focusing on:
     • Managing system and application orchestration and workload portability
     • Curating deep learning frameworks such as TensorFlow, PyTorch, Caffe2, and MXNet
     • Providing libraries and SDKs optimized for the hardware
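The workload-portability point above can be sketched in a few lines. This is an illustrative pattern, not Penguin Computing's actual tooling: the `DL_BACKEND` variable name, the supported set, and the default are all assumptions standing in for whatever contract an orchestration layer would define.

```python
import os

# Sketch of a workload-portability pattern: the same training entry point
# selects its deep learning backend from the environment, so one job script
# can move between schedulers and container runtimes unchanged.
# DL_BACKEND and the 'pytorch' default are illustrative assumptions.
SUPPORTED_BACKENDS = {"tensorflow", "pytorch", "caffe2", "mxnet"}

def select_backend(env=None):
    """Return the backend requested via DL_BACKEND, defaulting to 'pytorch'."""
    env = os.environ if env is None else env
    backend = env.get("DL_BACKEND", "pytorch").lower()
    if backend not in SUPPORTED_BACKENDS:
        raise ValueError(f"unsupported backend: {backend!r}")
    return backend

print(select_backend({"DL_BACKEND": "TensorFlow"}))  # → tensorflow
```

Failing fast on an unknown backend keeps a misconfigured job from silently running the wrong framework on expensive GPU nodes.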
  7. Accurately Sizing AI Infrastructure
     Ensuring optimal use of compute resources can be challenging, particularly in heterogeneous architectures:
     • CPU-based architectures (x86, Power, Arm64)
     • Specialty compute architectures (GPUs, FPGAs, TPUs, and emerging ASICs)
     Select the proper heterogeneous compute options based on time to market/production, TCO, and performance: NVIDIA GPUs.
     Optimizing training and inference workloads on NVIDIA GPUs:
     • Intranode communication: GPU topology and NVLink
     • Internode communication: InfiniBand and Ethernet networks
  8. Balancing Overall Data Workflows and Performance
     The need to effectively manage large amounts of data is at the center of a modern deep learning solution:
     • Large amounts of data have varying ingest, processing, and data life cycle requirements
     • Storage performance requirements are dictated by data type, size, workflow stage, etc.
     • Delivering storage to GPU clusters at high IOPS, high bandwidth, and low latency is crucial
     Fast, optimized storage is critical to deliver the large amounts of data required to improve accuracy.
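The bandwidth side of the storage requirement above is simple to estimate. A minimal sketch, with all input figures illustrative assumptions; it gives a rough lower bound on aggregate read bandwidth and ignores caching, prefetch, and augmentation cost.

```python
def required_read_bandwidth_gb_s(num_gpus, images_per_sec_per_gpu,
                                 bytes_per_image):
    """Aggregate storage read bandwidth (GB/s) needed to keep GPUs fed.

    A rough lower bound: real deployments also need IOPS and latency
    headroom, which this simple product does not capture.
    """
    total_bytes_per_sec = num_gpus * images_per_sec_per_gpu * bytes_per_image
    return total_bytes_per_sec / 1e9

# Illustrative: 8 GPUs, each consuming 1,400 images/s of ~150 KB files.
print(required_read_bandwidth_gb_s(8, 1_400, 150_000))  # → 1.68
```

Running the same arithmetic per workflow stage (ingest, preprocessing, training) is one way to see why the slide says requirements vary across the data life cycle.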
  9. Deploying and Scaling AI Infrastructure
     Designing, deploying, and scaling AI infrastructure efficiently involves optimizing at rack, data center, and multi-data-center scales.
     • Optimizing AI infrastructure at the rack and data center level requires balancing power, cooling, and performance
     • Optimizing AI infrastructure across multiple data centers presents challenges and may include a shift from on-premise to hybrid to private/public cloud
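The rack-level power/cooling balance above can also be framed as a quick budget check. This sketch is a simplification under stated assumptions: the flat cooling-overhead fraction and the example wattages are illustrative, and real overhead differs between the air-cooled and liquid-cooled designs mentioned later in the deck.

```python
def nodes_per_rack(rack_power_budget_w, node_power_w, cooling_overhead=0.10):
    """How many nodes fit a rack's power budget after cooling overhead.

    The fixed cooling-overhead fraction is an assumption; actual overhead
    depends on air vs. liquid cooling and facility design.
    """
    usable_w = rack_power_budget_w / (1 + cooling_overhead)
    return int(usable_w // node_power_w)

# Illustrative: a 30 kW rack budget and dense ~3 kW GPU nodes.
print(nodes_per_rack(30_000, 3_000))  # → 9
```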
  10. Value of Moving Toward Open Technologies
     Open architectures for artificial intelligence/deep learning enable a scalable, software-defined AI/deep learning environment:
     • Flexible compute architectures (x86, Power, ARM64, NVIDIA GPUs, FPGAs, ASICs, …)
     • Flexible rack-scale platforms (EIA, OCP)
     • Flexible software-defined networking solutions
     • Flexible software-defined storage solutions
  11. Proven AI Deployments Based on Essentials: Unnamed AI Cluster
     Client deploying clustered, dense GPU nodes at scale, including:
     • NVIDIA DGX-1 at scale
     • Penguin Computing Relion 4118GTS at scale
     • Strong storage component
     • Top500 and Green500
  12. Proven AI Deployments Based on Essentials: Unnamed Hybrid AI and HPC Cluster
     Client deploying clustered, OCP-compliant Relion X1904GTS servers:
     • Tundra® Extreme Scale HPC system
     • Air cooled
     • Liquid cooled
  13. Summary of Best Practices
     • Manage the AI software ecosystem to balance workflow and compute needs by focusing on orchestration and workload portability
     • Select the proper heterogeneous compute options based on time to market/production, TCO, and performance: NVIDIA GPUs
     • Fast, optimized storage is required to deliver the large amounts of data needed to improve accuracy
     • Deploying and scaling AI infrastructure efficiently involves optimizing power and cooling at rack, data center, and multi-data-center scales
     • Consider open technologies for a scalable, software-defined artificial intelligence/deep learning environment
  14. Question & Answer
     For more information, visit www.penguincomputing.com/ai
