Yuval Degani, LinkedIn
Dr. Jithin Jose, Microsoft Azure
Tackling Network Bottlenecks with Hardware Accelerations: Cloud vs. On-Premise
#UnifiedAnalytics #SparkAISummit
Intro
• An infinite loop of removing performance roadblocks
• With faster storage devices (DRAM, NVMe, SSD) and stronger-than-ever processing power (CPU, GPU, ASIC), a traditional network just can’t keep up with the I/O flow
• Upgrading to higher wire speeds will rarely do the trick
• This is where co-designed hardware acceleration can be used to truly utilize the power of a compute cluster
Previous talks
Spark Summit Europe 2017
First open-source, stand-alone RDMA-accelerated shuffle plugin for Spark (SparkRDMA)
Spark+AI Summit North America 2018
First preview of SparkRDMA on Azure HPC nodes, demonstrating a 2.6x job speed-up on cloud VMs
Network Bottlenecks in the Wild
Network Bottlenecks in the Wild
• Not always caused by lack of bandwidth
• Network I/O imposes overhead in many system components:
– Memory management
– Memory copy
– Garbage Collection
– Serialization/Compression/Encryption
• Overhead = CPU cycles, i.e. cycles that are not available for the actual job at hand
• Hardware acceleration can reduce overhead and allow better
utilization of compute and network resources
Network Bottlenecks: Shuffle
• Most expensive non-storage network I/O in compute clusters
• Blocking, massive movement of transient data
• Acceleration opportunities (see the configuration sketch below):
– Efficient serving with reduced server-side logic
– Serialization/Compression/Encryption
– Reduced I/O overhead and latency by employing modern transport protocols
[Chart: HiBench TeraSort on Spark – time breakdown: Shuffle Read 57%, Output 28%, Input 11%, Partitioning 4%]
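These opportunities map directly onto ordinary Spark configuration. A minimal PySpark sketch of the serialization/compression knobs involved (the conf keys are standard Spark settings; the codec choice and app name are illustrative, and none of this is the RDMA plugin itself, which appears later in the deck):

```python
from pyspark.sql import SparkSession

# Shuffle-related settings that trade CPU cycles for network/disk volume.
spark = (
    SparkSession.builder
    .appName("shuffle-tuning-sketch")
    # Kryo is generally cheaper than Java serialization for shuffle records.
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    # Compress shuffle output; the codec choice affects CPU overhead.
    .config("spark.shuffle.compress", "true")
    .config("spark.io.compression.codec", "lz4")
    # Encrypting shuffle I/O adds further CPU cost on top of the network path.
    .config("spark.io.encryption.enabled", "false")
    .getOrCreate()
)
```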
Network Bottlenecks: Distributed Training
• Model updates create massive
network traffic
• Model update frequency rises
as GPUs get faster
• Acceleration opportunities:
– Inter-GPU RDMA communication
– Lower latency network transport
– Collectives offloads
[Chart: ResNet-269 training on K80, M60, and V100 GPUs – Total Time vs. GPU Active Time*]
* “Parameter Hub: High Performance Parameter Servers for Efficient Distributed Deep Neural Network Training” by Luo et al.
Network Bottlenecks: Storage
• Massive data movement
• Premium devices (DRAM, Flash) provide storage
access speeds that were never seen before
• Acceleration opportunities:
– Higher bandwidth
– Reduced transport overhead
– OS/CPU bypass – direct storage access from network
devices
Major Hardware Acceleration Technologies
Speeds
• 1, 10, 25, 40, 100, 200 Gbps
• A faster network doesn’t necessarily mean a faster runtime
• Many workloads consist of relatively short bursts rather than sustained throughput: higher bandwidth may not have any effect (see the rough arithmetic below)
[Chart: effect of network speed (1GbE, 10GbE, 40GbE) on workload runtime for Flink TeraSort, Flink PageRank, PowerGraph PageRank, and Timely PageRank*]
* “On The [Ir]relevance of Network Performance for Data Processing” by Trivedi et al.
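A rough back-of-the-envelope illustration of why bandwidth alone may not help bursty workloads (the burst size and compute time below are hypothetical, not taken from the cited study):

```python
# Time to move a short burst at different wire speeds (hypothetical numbers).
burst_bytes = 10 * 1024 ** 2          # a 10 MB shuffle burst
compute_ms = 50.0                     # per-task compute time it overlaps with

for gbps in (1, 10, 25, 40, 100):
    transfer_ms = burst_bytes * 8 / (gbps * 1e9) * 1e3
    print(f"{gbps:>3} Gbps: burst transfer {transfer_ms:6.2f} ms "
          f"vs. {compute_ms:.0f} ms of compute")
# Beyond roughly 10-25 Gbps the transfer time is already small relative to
# the compute time, so extra bandwidth alone barely moves end-to-end runtime.
```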
InfiniBand
• De-facto standard in the HPC world
• FDR: 56Gbps, EDR: 100Gbps, HDR: 200Gbps
• Sub-microsecond latency
• Native support for RDMA
• HW-accelerated transport layer
• True SDN: standard fabric components are developed as open source and are cross-platform
• Native support for switch collectives offloads
[Chart: TOP500 Supercomputers interconnect performance share* – InfiniBand 38%, Custom 28%, Ethernet 23%, Omnipath 10%, Proprietary 1%]
* www.top500.org
RDMA
• Remote Direct Memory Access
– Read/write from/to remote memory locations
• Zero-copy
• Direct hardware interface – bypasses the kernel and TCP/IP in the I/O path
• Flow control and reliability are offloaded to hardware
• Supported on almost all mid-range/high-end network adapters, both InfiniBand and Ethernet (see the one-sided communication sketch below)
[Diagram: traditional socket path (Java app buffer → OS sockets → TCP/IP → driver → network adapter, with context switches) vs. the RDMA path, which bypasses the kernel]
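Native RDMA is programmed through verbs (libibverbs), which is too low-level to sketch usefully here; the same one-sided “write into a peer’s registered memory without involving its CPU” model is exposed by MPI RMA windows, which UCX-backed MPI implementations can map onto RDMA hardware. A minimal mpi4py sketch of that model, offered as an analogy rather than the verbs API:

```python
from mpi4py import MPI
import numpy as np

# Run with exactly two ranks: mpiexec -n 2 python rma_sketch.py
comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# Each rank exposes a registered memory window that peers can target directly.
local_buf = np.zeros(4, dtype="i")
win = MPI.Win.Create(local_buf, comm=comm)

win.Fence()
if rank == 0:
    payload = np.arange(4, dtype="i")
    # One-sided put: rank 0 writes into rank 1's window without rank 1
    # posting a matching receive (the RDMA-style communication pattern).
    win.Put(payload, 1)
win.Fence()

if rank == 1:
    print("rank 1 window now holds:", local_buf)
win.Free()
```

Rank 1 never posts a receive, yet its buffer is filled after the second fence; that one-sided semantic is what the hardware offload above makes cheap.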
NVIDIA GPUDirect
• Direct DMA over PCIe
• RDMA devices can write/read
directly to/from GPU memory
over the network
• No CPU overhead
• Zero-copy
[Diagram: GPUDirect vs. non-GPUDirect data paths between NIC, GPU, and CPU]
“Smart NIC” – FPGA/ASIC Offloads
• FPGA – tailor-made accelerations
• ASIC – less flexibility, better performance
• Common use cases:
– I/O: Serialization, compression, encryption offloads
– Data: Aggregation, sorting, group-by, reduce
• Deployment options:
– Pipeline
– Look-aside
– Bump-on-the-wire
“Smart Switch”
• In-network processing
– Data reduction during movement
– Wire-speed
• Generic: MPI Switch Collectives Offloads (e.g.
Mellanox SHArP)
• Per-workload: Programmable switches (e.g.
Barefoot Tofino)
– Example: Network-Accelerated Query Processing
NVMeOF
• Network protocol for NVM Express (NVMe) disks (PCIe)
• Uses RDMA to provide direct NIC<->disk access
• Completely bypasses the host CPU
• Minimal latency difference between local and remote access
[Diagram: NVMeOF vs. traditional remote storage access paths through the NIC and CPU]
Azure Network Acceleration Offering
Offer ‘Bare Metal’ Experience – Azure HPC Solution
• Eliminate Jitter
– Host holdback is a start, but must completely isolate guest from host
– Minroot & CPU Groups; separated host and guest VM sandboxes
• Full Network Experience
– Enable customers to use Mellanox or OFED drivers
– Supports all MPI types and versions
– Leverage hardware offload to the Mellanox InfiniBand ASIC
• Transparent Exposure of Hardware
– Core N in the guest VM should = Core N in silicon
– 1:1 mapping between physical pNUMA topology and vNUMA topology
Latest Azure HPC Offerings – HB/HC
• HB Series (AMD EPYC) vs. HC Series (Intel Xeon Platinum)
– Workload Targets: Bandwidth Intensive (HB) / Compute Intensive (HC)
– Core Count: 60 (HB) / 44 (HC)
– System Memory: 240 GB (HB) / 352 GB (HC)
– Network (both): 100 Gbps EDR InfiniBand, 40 Gbps Ethernet
– Storage Support (both): Standard / Premium Azure Storage, and 700 GB local SSD
– OS Support for RDMA (both): CentOS/RHEL, Ubuntu, SLES 12, Windows
– MPI Support (both): OpenMPI, HPC-X, MVAPICH2, MPICH, Intel MPI, Platform MPI, Microsoft MPI
– Hardware Collectives (both): Enabled
– Access Model (both): Azure CLI, ARM template, Azure CycleCloud, Azure Batch, Partner Platform
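The Access Model row above includes the Azure CLI; a hedged sketch of provisioning a single HC node that way, driven from Python (the resource group, VM name, and image URN are placeholders, and the Standard_HC44rs size string should be checked against current Azure documentation):

```python
import subprocess

# Placeholder names -- substitute your own resource group, VM name, and image.
cmd = [
    "az", "vm", "create",
    "--resource-group", "my-hpc-rg",
    "--name", "hc-node-01",
    "--size", "Standard_HC44rs",                    # HC-series SKU name (verify)
    "--image", "OpenLogic:CentOS-HPC:7.6:latest",   # CentOS-HPC image URN (verify)
    "--admin-username", "azureuser",
    "--generate-ssh-keys",
]
subprocess.run(cmd, check=True)
```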
Other Azure HPC Highlights
• SR-IOV going broad
– All HPC SKUs will support SR-IOV
– Driver/SKU performance optimizations
• GPUs
– Latest NDv2 Series
• 8 NVIDIA Tesla V100 GPUs, NVLink-interconnected
• Intel Skylake, 672 GB memory
• Excellent platform for HPC and AI workloads
• Azure FPGA
– Based on Project Brainwave
– Deploy a model to Azure FPGA; reconfigure for different models
– Supports ResNet-50, ResNet-152, DenseNet-121, and VGG-16
Accelerate Your Framework
MPI Microbenchmarks
• Experiments on HC cluster
• OSU Benchmarks 5.6.1
• OpenMPI (4.0.0) + UCX (1.5.0)
• MPI ranks pinned to cores closest to the HCA
• MPI latency (4 B) – 1.77 us, getting even better later this year
• MPI bandwidth (4 MB) – 12.06 GB/s (see the ping-pong sketch below)
[Chart: MPI bandwidth (MB/s) vs. message size, 1 B–4 MB – Ethernet (40 Gbps), IPoIB (100 Gbps), RDMA (100 Gbps)]
[Chart: MPI latency (us) vs. message size, 0 B–2 KB – Ethernet (40 Gbps), IPoIB (100 Gbps), RDMA (100 Gbps)]
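The numbers above come from the OSU microbenchmarks, which are written in C; for illustration, a rough mpi4py ping-pong in the same spirit (Python adds per-call overhead, so it will not reproduce the 1.77 us figure):

```python
from mpi4py import MPI
import numpy as np

# Run with exactly two ranks: mpiexec -n 2 python pingpong_sketch.py
comm = MPI.COMM_WORLD
rank = comm.Get_rank()

msg = np.zeros(4, dtype="b")      # 4-byte message, matching the slide's data point
iters, warmup = 10000, 1000

comm.Barrier()
t0 = MPI.Wtime()
for i in range(iters + warmup):
    if i == warmup:
        t0 = MPI.Wtime()          # restart the clock after warm-up iterations
    if rank == 0:
        comm.Send(msg, dest=1)
        comm.Recv(msg, source=1)
    elif rank == 1:
        comm.Recv(msg, source=0)
        comm.Send(msg, dest=0)
elapsed = MPI.Wtime() - t0

if rank == 0:
    # One-way latency = half the average round-trip time.
    print(f"avg one-way latency: {elapsed / iters / 2 * 1e6:.2f} us")
```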
SparkRDMA
• RDMA-powered ShuffleManager plugin for Apache Spark (enablement sketch below)
• Similarly specced 8-node clusters:
– On-prem: 100GbE RoCE
– Cloud: Azure “h16mr” instances with 56Gbps InfiniBand
• https://github.com/Mellanox/SparkRDMA
[Chart: runtimes for TeraSort 320GB and PageRank 19GB – on-prem non-RDMA 100GbE, on-prem RDMA 100GbE, Azure IPoIB 56Gbps, Azure RDMA 56Gbps]
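Enabling the plugin is a configuration change rather than a code change. A sketch of the relevant settings in PySpark (the jar path is a placeholder, and the shuffle-manager class name follows the SparkRDMA README; verify both against the release you deploy):

```python
from pyspark.sql import SparkSession

# Path to the SparkRDMA jar on every node -- placeholder, adjust to your install.
SPARK_RDMA_JAR = "/opt/sparkrdma/spark-rdma-3.1-for-spark-2.4.0.jar"

spark = (
    SparkSession.builder
    .appName("sparkrdma-sketch")
    # Make the plugin visible to the driver and all executors.
    .config("spark.driver.extraClassPath", SPARK_RDMA_JAR)
    .config("spark.executor.extraClassPath", SPARK_RDMA_JAR)
    # Swap the default shuffle manager for the RDMA-backed one
    # (class name as documented in the SparkRDMA repository).
    .config("spark.shuffle.manager", "org.apache.spark.shuffle.rdma.RdmaShuffleManager")
    .getOrCreate()
)
```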
SparkRDMA on Azure
• Azure HC cluster:
– 100 Gbps InfiniBand
– 16 Spark Workers/HDFS DataNodes
– Separate NameNode
– Data folder hosted on SSD
– HiBench Benchmarks (gigantic)
• Spark 2.4.0, Hadoop 2.7.7, SparkRDMA 3.1
[Chart: execution time (s) for TeraSort 320 GB and PageRank 19 GB – IPoIB (100 Gbps) vs. RDMA (100 Gbps)]
HDFS-RDMA on Azure
• OSU HDFS RDMA 0.9.1
• Based on Hadoop 3.0.0
• http://hibd.cse.ohio-state.edu/#hadoop3
• HDFS on HC cluster
• 1 NameNode
• 16 DataNodes
• Data folder hosted on SSD
• Packet Size: 128KB
• Containers per Node: 32
[Chart: TestDFSIO (Write) execution time (s) for 512 GB–1 TB data sets – Ethernet (40 Gbps), IPoIB (100 Gbps), RDMA (100 Gbps)]
Memcached-RDMA on Azure
• OSU Memcached RDMA 0.9.6
• Based on Memcached 1.5.3 and
libmemcached 1.0.18
• http://hibd.cse.ohio-state.edu/#memcached
• Experiment run on HC Nodes
• Memcached GET (8 B) latency – 5.5 us
• Memcached SET (8 B) latency – 6.45 us (client probe sketch below)
[Charts: Memcached GET and SET latency (us) vs. message size, 1 B–4 KB – Ethernet (40 Gbps), IPoIB (100 Gbps), RDMA (100 Gbps)]
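For comparison, a simple client-side latency probe using the pymemcache client (server address, key, and iteration count are placeholders; the slide’s figures come from the OSU RDMA-enabled Memcached build and its native benchmark, not from a Python client):

```python
import time
from pymemcache.client.base import Client

client = Client(("memcached-server", 11211))   # placeholder host and port
payload = b"x" * 8                              # 8-byte value, as in the slide
iters = 10000

client.set("bench-key", payload)                # seed the key once

start = time.perf_counter()
for _ in range(iters):
    client.get("bench-key")                     # measure repeated GETs
elapsed = time.perf_counter() - start
print(f"avg GET latency: {elapsed / iters * 1e6:.2f} us")
```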
Kafka-RDMA on Azure
• OSU Kafka RDMA 0.9.1
• Based on Apache Kafka 1.0.0
• http://hibd.cse.ohio-state.edu/#kafka
• HC cluster
• Broker with 100 GB Ramdisk
• Record Size – 100 bytes
• Number of Records – 500,000 (see the producer sketch below)
[Charts: Kafka producer latency (s) and producer bandwidth (MB/s) – IPoIB (100 Gbps) vs. RDMA (100 Gbps)]
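A small kafka-python sketch that mirrors the producer test parameters above (broker address and topic name are placeholders; the slide’s results were measured with the OSU RDMA-enabled Kafka build, not this client):

```python
import time
from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers="broker:9092")  # placeholder broker
record = b"x" * 100                                        # 100-byte records
num_records = 500_000

start = time.perf_counter()
for _ in range(num_records):
    producer.send("bench-topic", record)
producer.flush()                     # wait until all records are acknowledged
elapsed = time.perf_counter() - start

mb_sent = num_records * len(record) / 1e6
print(f"produced {num_records} records in {elapsed:.1f}s "
      f"({mb_sent / elapsed:.1f} MB/s)")
```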
Horovod on Azure
• TensorFlow 1.13
– ResNet-50 training
– Partial ImageNet data
– Batch size = 64 per worker
– 2 workers per node
– 100 batches in total
– CPU-only version
• HC cluster
– OpenMPI 4.0 + UCX 1.5
– Singularity container
• ~97% scaling efficiency (Horovod wiring sketch below)
[Chart: ResNet-50 training throughput (images/s) and scaling efficiency at 2, 4, 8, and 16 nodes – IPoIB (100 Gbps) vs. RDMA (100 Gbps); scaling efficiency: IPoIB 100 / 96.78 / 95.58 / 94.93%, RDMA 100 / 98.86 / 98.37 / 96.94%]
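The Horovod wiring the experiment relies on is only a few lines around an ordinary TensorFlow 1.x training loop. A minimal, runnable sketch with a toy one-variable model standing in for ResNet-50 (model, learning rate, and step count are illustrative):

```python
import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()  # one process per worker, launched via horovodrun/mpirun

# Tiny stand-in model: a single variable driven toward zero, so the Horovod
# wiring is runnable without the full ResNet-50 input pipeline from the slide.
w = tf.get_variable("w", shape=[], initializer=tf.ones_initializer())
loss = tf.square(w)

# Scale the learning rate by the number of workers, per Horovod convention.
opt = tf.train.GradientDescentOptimizer(0.01 * hvd.size())

# DistributedOptimizer averages gradients across workers with allreduce --
# the step that rides on the MPI/RDMA transport measured above.
opt = hvd.DistributedOptimizer(opt)
train_op = opt.minimize(loss)

hooks = [hvd.BroadcastGlobalVariablesHook(0)]  # sync initial state from rank 0
with tf.train.MonitoredTrainingSession(hooks=hooks) as sess:
    for _ in range(100):          # "100 batches in total", as in the experiment
        sess.run(train_op)
```

Launch it with horovodrun -np <workers> python train_sketch.py (or the equivalent mpirun command); the allreduce inside DistributedOptimizer is the traffic that benefits from the RDMA transport.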
Wrapping up
What’s available on major clouds?
• Network speeds: Azure 100Gbps / AWS 100Gbps / GCP 20Gbps?
• InfiniBand: Azure ✔ / AWS ✘ / GCP ✘
• RDMA: Azure ✔ / AWS (limited) / GCP ✘
• GPUDirect: Azure ✘ / AWS (single host) / GCP ✘
• Smart NIC: Azure ✘ / AWS ✘ / GCP ✘
• Smart Switch: Azure ✘ / AWS ✘ / GCP ✘
• NVMeOF: Azure ✘ / AWS ✘ / GCP ✘
Take-aways
• Accelerated Frameworks:
– SparkRDMA on GitHub
– High Performance Big Data (From OSU)
– Horovod
• Azure instances
– Azure HPC HB/HC
– Azure NDv2 GPUs
– Azure FPGA
Questions?
DON’T FORGET TO RATE
AND REVIEW THE SESSIONS
SEARCH SPARK + AI SUMMIT