
HPC DAY 2017 | The network part in accelerating Machine-Learning and Big-Data

HPC DAY 2017 - http://www.hpcday.eu/

The network part in accelerating Machine-Learning and Big-Data

Boris Neiman | Sr. System Engineer at Mellanox



  1. Boris Neiman, Sr. System Engineer, October 2017: The network part in accelerating Machine-Learning and Big-Data
  2. Mellanox Confidential - © 2017 Mellanox Technologies. Mellanox Overview: company headquarters in Yokneam, Israel and Sunnyvale, California, with worldwide offices and ~2,900 employees worldwide. Ticker: MLNX.
  3. Higher Data Speeds, Faster Data Processing, Better Data Security. Product lines: Adapters, Switches, Cables & Transceivers, System on a Chip, SmartNIC. Exponential Data Growth Everywhere.
  4. Enabling the Future of Machine Learning Applications. HPC and Machine Learning Share the Same Interconnect Needs: Storage, High Performance Computing, Financial, Embedded Appliances, Database, Hyperscale, Machine Learning, IoT, Healthcare, Manufacturing, Retail, Self-Driving Vehicles.
  5. Highest Performance 100 and 200Gb/s Interconnect Solutions. Cables & transceivers: active optical and copper cables (10/25/40/50/56/100/200Gb/s). InfiniBand switch: 40 HDR (200Gb/s) ports or 80 HDR100 (100Gb/s) ports, 16Tb/s throughput, 15.6 billion messages/sec. Adapters: 200Gb/s, 0.6us latency, 200 million messages per second (10/25/40/50/56/100/200Gb/s). Ethernet switch: 32 100GbE ports or 64 25/50GbE ports (10/25/40/50/100GbE), 6.4Tb/s throughput. Today's Datacenters Need the Most Intelligent Interconnect.
  6. Leading Supplier of End-to-End Interconnect Solutions across X86, OpenPOWER, GPU, ARM and FPGA platforms. Server/compute, switch/gateway, and storage front/back-end connectivity: 56/100/200G InfiniBand, 10/25/40/50/100/200GbE, Virtual Protocol Interconnect. Smart Interconnect to Unleash the Power of All Compute Architectures.
  7. What is Deep Learning? Also known as a Deep Neural Network (DNN), a subset of Artificial Neural Networks (ANN). "Deep Learning is a subfield of machine learning concerned with algorithms inspired by the structure and function of the brain called artificial neural networks." Source: http://machinelearningmastery.com/what-is-deep-learning/
  8. Training & Inference: The Two Phases of Deep Learning. Training turns an untrained model into a trained model using PBs of training data (images, video, text, speech, tabular, time series) and billions of TFLOPS; it needs a large scale-out cluster for faster training and fast access to big data storage. Inference applies the trained network to new data at billions of FLOPS; it is highly transactional, supports many users, and must respond instantly.
  9. Deep Learning Use Cases Across Industries. Finance, Fraud & Insurance: fraud detection, credit/risk analysis, high-frequency trading. Automotive & Transportation: self driving, image/facial recognition, logistics & mapping. Medicine & Genomics: drug discovery, diagnostic assistance, cancer cell detection. Oil & Gas: seismic imaging, reservoir characterization, subsurface fault detection. Cloud, Web, Mobile, Retail: image tagging, speech recognition, sentiment analysis. Security & Safety: surveillance, image analysis, facial recognition and detection.
  10. An Intelligent Network Unlocks the Power of Data with AI: 60% higher return on investment, up to 50% savings on capital and operating expenses. World's highest performance, scalability and productivity for AI. Mellanox networking unlocks the power of AI with RDMA (supported frameworks shown include Chainer and Cognitive Toolkit).
  11. What Is RDMA? Remote Direct Memory Access (RDMA) is an advanced transport protocol (at the same layer as TCP and UDP). Main features: remote memory read/write semantics in addition to send/receive; kernel bypass with direct user-space access; full hardware offload; secure, channel-based I/O. Application advantages: low latency, high bandwidth, low CPU consumption. RoCE (RDMA over Converged Ethernet) is available for all Ethernet speeds from 10 to 100G. Verbs is the RDMA software interface (the equivalent of sockets). (Diagram: an RDMA application talks to the adapter directly from user space, while a TCP/UDP application's data crosses the kernel's transport, network and link layers with extra buffer copies along the way.)
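The defining feature above, one-sided remote memory read/write that bypasses the remote CPU, can be illustrated on a single host with Python's shared memory. This is a hypothetical analogy only, not real RDMA (which requires the verbs API and an RDMA-capable NIC); it only shows the programming model.

```python
# Single-host analogy for RDMA's one-sided read/write semantics.
# NOT real RDMA: it only illustrates the model, in which a registered
# memory region is accessed directly by a peer, with no per-transfer
# involvement of the owning process's CPU.
from multiprocessing import shared_memory

# Owner "registers" a memory region (loosely analogous to ibv_reg_mr).
region = shared_memory.SharedMemory(create=True, size=64)
region.buf[:5] = b"hello"

# Peer attaches via the shared name (loosely analogous to exchanging a key).
peer = shared_memory.SharedMemory(name=region.name)
data = bytes(peer.buf[:5])   # one-sided READ analogue
peer.buf[:5] = b"world"      # one-sided WRITE analogue

after = bytes(region.buf[:5])  # owner observes the peer's write
peer.close()
region.close()
region.unlink()
print(data, after)  # b'hello' b'world'
```

The point mirrors the slide: data moves between buffers without a send/receive round-trip through the owner's software stack.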
  12. Mellanox is Leading Artificial Intelligence (AI): health care, business integrity, business intelligence, knowledge discovery, security, customer support and more. Advancing technology to affect science, business, and society by enabling critical and timely decision making. More data + better models + faster interconnect (GPUs, CPUs, FPGAs, storage). More Data → Faster Interconnect → Better Insight → Competitive Advantage.
  13. Enabling the Most Efficient Machine Learning Platforms (examples). Highest Performance, Scalability and Productivity for Deep Learning.
  14. Mellanox Accelerates Machine Learning and Big Data: world's first PCIe Gen 4 public cloud server for cognitive computing; smart network for the Azure cloud server designed for big data analytics & AI; set the TeraSort 2016 benchmark record, 5x faster and 3x more energy efficient than the 2015 record. http://sortbenchmark.org/TencentSort2016.pdf
  15. Mellanox Accelerates Machine Learning and Big Data. Facebook's Big Sur & Big Basin open-source AI hardware platforms: only one network of choice, Mellanox. Powering self-driving cars. 2x faster training with PaddlePaddle: "…We rely on fast interconnect technologies and RDMA." (Andrew Ng, Chief Scientist, Baidu). Real-time fraud detection: 14 million transactions per day, 4 billion database inserts. Image recognition with ~90% prediction accuracy. RDMA in TensorFlow, Caffe and Caffe2. https://code.facebook.com/posts/1687861518126048/facebook-to-open-source-ai-hardware-design https://github.com/caffe2/caffe2/tree/master/caffe2/contrib/fbcollective/vendor/fbcollective
  16. Exponential Data Growth: The Need for Intelligent and Faster Interconnect. CPU-centric (onload) designs must wait for the data, creating performance bottlenecks; data-centric (offload) designs analyze data as it moves. Faster data speeds and in-network computing enable higher performance and scale.
  17. Data-Centric Architecture to Overcome Latency Bottlenecks. CPU-centric (onload): HPC / machine learning communication latencies of 30-40us. Data-centric (offload) with in-network computing: latencies of 3-4us. Intelligent Interconnect Paves the Road to Exascale Performance.
  18. Mellanox Technology Accelerations for Machine Learning: RDMA, GPUDirect, NVMe over Fabrics, SHARP, security. In-Network Computing is Key to the Highest Return on Investment.
  19. In-Network Computing Enables Deep Learning Frameworks. Stack: deep learning frameworks on top of optional middleware (MPI, gRPC) and CUDA, accelerated by Mellanox interconnect solutions: SHARP, GPUDirect RDMA, NVMe over Fabrics, rCUDA.
  20. Mellanox SHARP for Gradient Computation. The CPU in a parameter server quickly becomes the bottleneck (at roughly 4 nodes), and TCP adds a lot of overhead while the traffic pattern is bursty. SHARP performs the gradient averaging in the network, removing the need for a physical parameter server and all of its overhead. SHARP Provides Better Scalability and Reduced Network Traffic.
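The operation SHARP moves into the switches is the reduction itself. A minimal sketch of that math in plain Python (illustrative only; the slide's point is where this runs, not what it computes):

```python
# The reduction SHARP offloads: element-wise averaging of gradients from
# N workers. In a parameter-server design one CPU performs this sum over
# every worker's tensor; SHARP computes it in the switch ASICs as the
# packets flow through, so no single host becomes the bottleneck.

def average_gradients(worker_grads):
    """Element-wise mean across the workers' gradient vectors."""
    n = len(worker_grads)
    return [sum(vals) / n for vals in zip(*worker_grads)]

# Four workers, each holding a small gradient vector for the same layer.
grads = [
    [1.0, 2.0, 3.0],
    [3.0, 2.0, 1.0],
    [2.0, 2.0, 2.0],
    [2.0, 2.0, 2.0],
]
avg = average_gradients(grads)
print(avg)  # [2.0, 2.0, 2.0] -- every worker receives the same averaged update
```

With more than a handful of workers, the cost of this sum (and the resulting incast traffic toward one server) is exactly what the slide identifies as the scaling limit.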
  21. TensorFlow with Mellanox RDMA: Unmatched Linear Scalability at No Additional Cost. Up to 76% efficiency and 50% better performance versus TCP. A reference deployment guide is available.
  22. Accelerating Big Data Storage & Data Ingestion
  23. Data Ingestion. Data ingestion is the process of acquiring and preparing the input, a preprocessing stage before the data reaches the machine learning framework. Examples: converting file/image formats, combining multiple data sources, cleaning noise and enhancing the input. It is relevant for both training and inference, and typically includes access to storage (local, distributed, or network storage) and pre-processing in a big data framework such as Hadoop or Spark. Accelerating data ingest is critical for machine learning performance.
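A minimal sketch of such an ingest step, with made-up field names (a production pipeline would express the same steps as Hadoop or Spark transformations): combine two sources, drop noisy records, and normalize a feature before it reaches the training framework.

```python
# Hypothetical data-ingestion step: combine two sources, drop records
# with missing labels (noise), and scale a feature into [0, 1].
# Field names ("value", "label") are invented for illustration.

def ingest(source_a, source_b):
    combined = source_a + source_b                              # combine sources
    cleaned = [r for r in combined if r["label"] is not None]   # clean noise
    values = [r["value"] for r in cleaned]
    lo, hi = min(values), max(values)
    for r in cleaned:                                           # normalize feature
        r["value"] = (r["value"] - lo) / (hi - lo)
    return cleaned

a = [{"value": 10.0, "label": 1}, {"value": 30.0, "label": None}]
b = [{"value": 20.0, "label": 0}, {"value": 40.0, "label": 1}]
dataset = ingest(a, b)
print([r["value"] for r in dataset])  # [0.0, 0.3333333333333333, 1.0]
```

Every step here is I/O- and network-bound at scale (reading shards, joining sources), which is why the slide ties ingest speed to interconnect speed.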
  24. Mellanox Accelerates Big Data, Enabling Real-Time Decisions. TeraSort benchmark (execution time in seconds, Ethernet network): Mellanox 40Gb/s runs ~3x faster than Intel 10Gb/s. Cassandra stress benchmark: ~2x faster runtime, with 25G of bandwidth for the database. Fraud detection benchmark (total transaction time in ms, split between CPU + storage + network and the fraud detection algorithm): Aerospike with Mellanox + Samsung NVMe leaves ~2x more time for running the fraud detection algorithm than the existing solution. Mellanox is Certified by Leading Big Data Partners.
  25. Spark Map-GroupBy: Extensive Data Exchange. Mappers read data shards; merge-sort reducers consume the mapped output and write results back with HDFS replication. The shuffle between mappers and reducers is an all-to-all traffic pattern, and RDMA significantly improves this data exchange process.
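A toy map → shuffle → reduce in Python makes the all-to-all pattern concrete: every mapper holds keys destined for every reducer, so each mapper-reducer pair must exchange data across the network. This is an illustration of the pattern only, not Spark's API.

```python
# Toy map -> shuffle -> reduce for a groupBy-style word count, showing why
# the shuffle is an all-to-all exchange: each mapper's output is hash-
# partitioned across all reducers.
from collections import defaultdict

def map_phase(shard):
    return [(word, 1) for word in shard]

def shuffle(mapped, num_reducers):
    # Partition by hash(key): this fan-out from every mapper to every
    # reducer is the all-to-all traffic pattern on the slide.
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for pairs in mapped:
        for key, val in pairs:
            partitions[hash(key) % num_reducers][key].append(val)
    return partitions

def reduce_phase(partition):
    return {key: sum(vals) for key, vals in partition.items()}

shards = [["spark", "rdma"], ["rdma", "rdma"], ["spark", "hdfs"]]
mapped = [map_phase(s) for s in shards]
results = {}
for part in shuffle(mapped, num_reducers=2):
    results.update(reduce_phase(part))
print(results)  # {'spark': 2, 'rdma': 3, 'hdfs': 1} (key order may vary)
```

In a real cluster each `partitions[...]` append is a network transfer; that is the step the RDMA-based shuffle accelerates.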
  26. Apache Spark with Mellanox RDMA (runtime in seconds, lower is better; improvement of RDMA over TCP):

     Runtime sample   | Input Size | Nodes | Cores per node | RAM per node | Improvement
     Customer App #1  | 5GB        | 14    | 24             | 85GB         | 17%
     Customer App #2  | 540GB      | 14    | 24             | 85GB         | 23%
     HiBench TeraSort | 600GB      | 30    | 28             | 256GB        | 31%
     GroupBy          | 48M Keys   | 15    | 28             | 256GB        | 28%
  27. Big Data and Deep Learning Solutions with HPE: HPE Mellanox adapter & switch options (e.g. SN2100). HPE reference architectures: Cloudera, Hortonworks, SAP HANA Vora & Spark, Elastic Platform for BDA.
  28. Proven Advantages. RDMA delivers a 2x performance advantage over traditional TCP. Machine learning and HPC platforms share the same interconnect needs: scalable, flexible, high-performance, high-bandwidth, end-to-end connectivity. Standards-based and supported by the largest ecosystem. Supports all compute architectures: x86, Power, ARM, GPU, FPGA, etc. Native offloading architecture: RDMA, GPUDirect, SHARP and other core accelerations. Backward and future compatible. Scalable Machine Learning Depends on Mellanox.
  29. Thank You
  30. Purpose-Built for Acceleration of Deep Learning: PeerDirect™, GPUDirect® RDMA and ASYNC.
  31. What is GPUDirect™? It provides a significant decrease in communication latency for acceleration devices and is natively supported by Mellanox OFED. It enables peer-to-peer communication between Mellanox adapters and third-party devices, with no unnecessary system memory copies or CPU overhead, and underpins GPUDirect™ RDMA, GPUDirect™ ASYNC, ROCm and others, on both InfiniBand and Ethernet. Designed for Deep Learning Acceleration.
  32. GPUDirect™ RDMA and GPUDirect ASYNC™: Direct GPU-Interconnect Connectivity.
  33. GPUDirect™ RDMA Performance (source: Prof. DK Panda). GPU-GPU internode latency (lower is better): 9.3x better latency, down to 2.18 usec. GPU-GPU internode bandwidth (higher is better): 10x better throughput.
  34. NVIDIA® NCCL 2.0: Near-Linear Scalability. An optimized collective communication library (Allreduce, Reduce, Broadcast, Reduce-scatter, Allgather) with inter-node communication over InfiniBand verbs and GPUDirect™ RDMA, multi-rail support, and topology detection. 50% performance improvement with NVIDIA® DGX-1™ across 32 NVIDIA Tesla® V100 GPUs. NVIDIA Accelerates Scalable Deep Learning with Mellanox.
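The collectives above can be sketched with the ring allreduce that NCCL-style libraries implement: each of N ranks ends up with the element-wise sum of all buffers while moving only 2*(N-1) chunks per rank, which is what makes the scaling near-linear. The following is a sequential single-process simulation for illustration, not NCCL's actual API or schedule.

```python
# Simplified ring allreduce, simulated sequentially in one process.
# Each rank's buffer is split into one chunk per rank; a reduce-scatter
# phase leaves each rank with one fully reduced chunk, and an allgather
# phase circulates those chunks so every rank ends with the full sum.

def ring_allreduce(bufs):
    n = len(bufs)
    chunks = [list(b) for b in bufs]  # chunk c of rank r is chunks[r][c]
    # Reduce-scatter: at each step, rank r sends chunk (r - step) % n to
    # rank (r + 1) % n, which adds it into its own copy. Sends are
    # collected first to model the simultaneous exchange.
    for step in range(n - 1):
        sends = [(r, (r - step) % n, chunks[r][(r - step) % n]) for r in range(n)]
        for r, c, val in sends:
            chunks[(r + 1) % n][c] += val
    # Rank r now holds the fully reduced chunk (r + 1) % n.
    # Allgather: circulate the reduced chunks around the same ring,
    # overwriting instead of adding.
    for step in range(n - 1):
        sends = [(r, (r + 1 - step) % n, chunks[r][(r + 1 - step) % n]) for r in range(n)]
        for r, c, val in sends:
            chunks[(r + 1) % n][c] = val
    return chunks

out = ring_allreduce([[1, 1, 1], [2, 2, 2], [3, 3, 3]])
print(out)  # [[6, 6, 6], [6, 6, 6], [6, 6, 6]]
```

Because each rank only ever talks to its ring neighbor, per-rank traffic stays constant as ranks are added; the slide's near-linear scaling claim rests on this property plus fast inter-node links.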
  35. Performance and Scalability Examples
  36. Accelerating TensorFlow™ with gRPC over RDMA. TensorFlow is open-source machine learning software from Google; distributed training uses the gRPC framework, Google's optimized RPC for distributed networks. RDMA acceleration comes via UCX (Unified Communication X), integrated with upstream TensorFlow, delivering >2x higher performance than TCP (lower runtime is better). ~2x Acceleration for TensorFlow with RDMA.
  37. TensorFlow™ over RDMA in an Apache® Spark™ Environment. Yahoo enhanced the TensorFlow C++ layer to enable RDMA over InfiniBand; InfiniBand provides faster connectivity and supports accelerated offload capability. InfiniBand provides near-linear scalability for Inception model training. Source: http://yahoohadoop.tumblr.com/post/157196317141/open-sourcing-tensorflowonspark-distributed-deep
  38. 2x Acceleration for Baidu. PaddlePaddle is machine learning software from Baidu, used for word prediction, translation, and image processing. RDMA (GPUDirect) speeds training: it lowers latency, increases throughput, and frees more cores for training, with even better results from optimized RDMA. ~2x Acceleration for Paddle Training with RDMA.
  39. ChainerMN Depends on InfiniBand. ChainerMN uses MPI for inter-node communication, with the NVIDIA® NCCL library for intra-node communication between GPUs. Leveraging InfiniBand yields near-linear performance, and Mellanox InfiniBand allows ChainerMN to reach ~72% accuracy. Source: http://chainer.org/general/2017/02/08/Performance-of-Distributed-Deep-Learning-Using-ChainerMN.html
  40. Machine Learning Performance Comparison. DeepBench measures the performance of basic operations involved in training deep neural networks (measured at 8, 16 and 32 accelerators; lower is better). InfiniBand delivers 60.3% better performance with 2x less infrastructure.
  41. Scalable Deep Learning Depends on Mellanox: A Few Solution Examples
  42. HPE Apollo 6500: Purpose-Built for AI. The HPE ProLiant XL270d accelerator tray offers 8 GP-GPUs @ 350W, 4:1 or 8:1 GPU:CPU ratios, and the latest network fabrics in a 2U HPE Apollo d6500 chassis, delivering efficiency and tailor-made configurability. HPE Mellanox adapter options: ConnectX-4 and ConnectX-5.
  43. NVIDIA® DGX-1™ Deep Learning Server: a deep learning supercomputer in a box. 8x NVIDIA® Tesla® P100 GPUs (5.3 TFlops each, 16nm FinFET, NVLINK) plus 4x ConnectX®-4 EDR 100G InfiniBand adapters. NVIDIA's "SaturnV" machine learning supercomputer, built from 124 DGX-1 nodes, delivers 3.3Pf and ranks #28 on the Top500 and #1 on the Green500.
  44. NVIDIA® DGX-1™: the world's first purpose-built system for deep learning (SaturnV is #28 on the Top500 at 3.3Pf with 124 nodes, and #1 on the Green500). Fully integrated hardware: 8x Tesla™ P100 (Pascal) with 16GB per GPU, 28,672 CUDA® cores, 4x ConnectX-4 EDR 100Gb/s HCAs. Fully integrated software stack: major deep learning frameworks, drivers, NVIDIA CUDA, the NVIDIA Deep Learning SDK, and GPUDirect™ RDMA.
  45. Mellanox: Powering IBM Solutions. IBM offers Mellanox InfiniBand and Ethernet solutions: 32 ports of 100GbE, at 10G / 25G / 40G / 50G / 100G speeds. Founding members.
  46. Big Sur: An Open Artificial Intelligence Platform (Facebook). An OCP-based GPU artificial intelligence platform with a flexible architecture supporting up to 8 GPUs (NVIDIA®, Intel®, AMD) and Mellanox Ethernet adapters with RDMA (RoCE) support. Accelerates artificial intelligence workloads: text processing, language modeling, computer vision. https://code.facebook.com/posts/1687861518126048/facebook-to-open-source-ai-hardware-design/ Mellanox Intelligent Network for Intelligent Platform.
  47. End-to-End Interconnect Solutions for All Platforms. Highest performance and scalability for X86, Power, GPU, ARM and FPGA-based compute and storage platforms. Smart Interconnect to Unleash the Power of All Compute Architectures.
