Lustre, RoCE and MAN
Łukasz Flis, Marek Magryś
Dominika Kałafut, Patryk Lasoń, Adrian Marszalik, Maciej Pawlik
Academic Computer Centre Cyfronet AGH
● The biggest Polish Academic Computer Centre
○ Over 45 years of experience in IT provision
○ Centre of excellence in HPC and Grid Computing
○ Home for Prometheus and Zeus supercomputers
● Legal status: an autonomous unit within AGH University of Science and Technology
● Staff: > 160, ca. 60 in R&D
● Leader of PLGrid: Polish Grid and Cloud Infrastructure for Science
● NGI Coordination in EGI e-Infrastructure
Network backbone
● 4 main links to achieve maximum reliability
● Each link with 7x 10 Gbps capacity
● Additional 2x 100 Gbps dedicated links
● Direct connection with the GEANT scientific network
● Over 40 switches
● Security
● Monitoring
Academic Computer Centre Cyfronet AGH
Prometheus
● 2.4 PFLOPS
● 53 604 cores
● 1st HPC system in Poland
(174th on Top500, 38th in 2015)
Zeus
● 374 TFLOPS
● 25 468 cores
● 1st HPC system in Poland
(from 2009 to 2015, highest
rank on Top500 – 81st in 2011)
Computing portals and
frameworks
● OneData
● PLG-Data
● DataNet
● Rimrock
● InSilicoLab
Data Centres
● 3 independent data centres
● dedicated backbone links
Research & Development
● distributed computing environments
● computing acceleration
● machine learning
● software development & optimization
Storage
● 48 PB
● hierarchical data management
Computational Cloud
● based on OpenStack
HPC@Cyfronet
● Prometheus and Zeus clusters
○ 6475 active users (at the end of 2018)
○ 350+ computational grants
○ 8+ million jobs in 2018
○ 371+ million CPU hours used in 2018
○ Biggest jobs in 2018
■ 27 648 cores
■ 261 152 CPU hours in one job
○ 900+ (Prometheus) and 600+ (Zeus) software modules
○ Custom user helper tools developed in-house
The fastest supercomputer in Poland:
Prometheus
● Installed in Q2 2015 (upgraded in Q4 2015)
● CentOS 7 + SLURM
● HP Apollo 8000, direct warm-water cooled system, PUE 1.06
○ 20 racks (4 CDU, 16 compute)
● 2235 nodes, 53 604 CPU cores (Haswell, Xeon E5-2680v3 12C 2.5GHz), 282 TB RAM
○ 2160 regular nodes (2 CPUs, 128 GB RAM)
○ 72 nodes with GPGPUs (2x NVIDIA Tesla K40 XL)
○ 4 islands
● Main storage based on Lustre
○ Scratch: 5 PB, 120 GB/s, 4x DDN SFA12kx
○ Archive: 5 PB, 60 GB/s, 2x DDN SFA12kx
● 2.4 PFLOPS total performance (Rpeak)
● < 850 kW power (including cooling)
● TOP500: current 174th position, highest: 38th (Nov 2015)
Project background
● Industrial partner
● Areas:
○ Data storage
■ POSIX
■ 10s of PBs
■ Incremental growth
○ HPC
○ Networking
○ Consulting
● PoC in 2017
● Infrastructure tests and design in 2018
● Production in Q1 2019
Challenges
● How to separate industrial and academic workloads?
○ Isolated storage platform
○ Dedicated network + dedicated IB partition
○ Custom compute OS image
○ Scheduler (SLURM) setup
○ Do not mix funding sources
● Which hardware platform to use?
○ ZFS JBOD vs RAID
○ InfiniBand vs Ethernet
○ Capacity/performance ratio
○ Single vs partitioned namespace
Location
Storage-to-compute distance: 14 km over fibre (81 µs), see the latency sketch below
[Map: DC Nawojki and DC Pychowice connected by dark fibre, with a MAN backup link; map: openstreetmap.org]
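As a plausibility check on the quoted latency, below is a minimal Python sketch of the pure propagation delay over 14 km of fibre. It assumes a typical single-mode refractive index of ~1.47 and treats the 81 µs as a one-way figure; both are assumptions, not statements from the slide, and the gap to 81 µs presumably comes from switching, serialization and transceivers.

```python
# Rough estimate of one-way propagation delay over the 14 km dark fibre.
# Assumes a typical fibre refractive index of ~1.47 (not stated on the slide);
# the remainder of the quoted 81 us presumably comes from switching,
# serialization and transceiver latency.

C_VACUUM_KM_PER_S = 299_792.458          # speed of light in vacuum [km/s]
FIBRE_REFRACTIVE_INDEX = 1.47            # typical single-mode fibre (assumption)
DISTANCE_KM = 14                         # storage-to-compute distance from the slide

v_fibre = C_VACUUM_KM_PER_S / FIBRE_REFRACTIVE_INDEX   # ~204,000 km/s
one_way_us = DISTANCE_KM / v_fibre * 1e6               # seconds -> microseconds

print(f"propagation only: {one_way_us:.1f} us one-way")   # ~68.6 us
print("quoted end-to-end: 81 us")
```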
Infrastructure overview
Solution
● DDN SFA200NV for Lustre MDT
○ 10x 1.5 TB NVMe + 1 spare
● DDN ES7990 building block for OST
○ > 4 PiB usable space
○ ~ 20 GB/s performance
○ 450x 14 TB NL SAS
○ 4x 100 Gb/s Ethernet
○ Embedded Exascaler
● Juniper QFX10008
○ Deep buffers (100 ms)
● Vertiv DCM racks
○ 48 U, custom depth: 130 cm
○ 1500 kg static load
Network: RDMA over Converged Ethernet
RoCE v1:
● L2 - Ethernet Link Layer Protocol (Ethertype 0x8915)
● requires link level flow control for lossless Ethernet
(PAUSE frames or Priority Flow Control)
● not routable
RoCE v2:
● L3 - uses UDP/IP packets, port 4791
● link level flow control optional
● can use ECN (Explicit Congestion Notification) for
controlling flows on lossy networks
● routable
Mellanox ConnectX HCAs implement hardware offload for
RoCE protocols
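To make the v1/v2 distinction concrete, here is a small Python sketch that classifies a parsed frame header using only the identifiers mentioned above (Ethertype 0x8915 for v1, UDP destination port 4791 for v2). The dict-based frame representation and the IPv4-only check (Ethertype 0x0800) are illustrative simplifications, not part of the original material.

```python
# Minimal sketch: classify a frame as RoCE v1 or RoCE v2 based on the fields
# given on the slide (Ethertype 0x8915 for v1, UDP destination port 4791 for v2).

ROCE_V1_ETHERTYPE = 0x8915
ROCE_V2_UDP_PORT = 4791

def classify(frame: dict) -> str:
    """Return 'RoCE v1', 'RoCE v2' or 'other' for a parsed frame header."""
    if frame.get("ethertype") == ROCE_V1_ETHERTYPE:
        return "RoCE v1"                      # L2 only, not routable
    if frame.get("ethertype") == 0x0800 and frame.get("udp_dport") == ROCE_V2_UDP_PORT:
        return "RoCE v2"                      # UDP/IP encapsulated, routable
    return "other"

print(classify({"ethertype": 0x8915}))                       # RoCE v1
print(classify({"ethertype": 0x0800, "udp_dport": 4791}))    # RoCE v2
```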
LNET: TCP vs RoCE v2
LNET selftest, default tuning for ksocknald and ko2iblnd, Lustre: 2.10.5, ConnectX-4 Adapters, 100 GbE, congestion free env., MTU 9216
(RoCE itself is limited to a 4k MTU)
Local: MAX TCP: 4114.7 MiB/s @ 4 RPCs vs MAX RoCE v2: 10874.4 MiB/s @ 16 RPCs
Remote: MAX TCP: 3662.2 MiB/s @ 4 RPCs vs MAX RoCE v2: 6805.7 MiB/s @ 32 RPCs
Theoretical Max: 11682 MiB/s (12250 MB/s)
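A quick sanity check (a sketch, not part of the original measurement): the saturation percentages quoted on the next slide follow directly from these numbers and the 11682 MiB/s theoretical maximum.

```python
# Quick check of the saturation figures quoted on the next slide, using the
# theoretical maximum from this one (11682 MiB/s for 100 GbE after overheads).

THEORETICAL_MAX_MIB_S = 11682.0

results_mib_s = {
    "local  RoCE v2 (16 RPCs)": 10874.4,   # -> ~93% (short range)
    "remote RoCE v2 (32 RPCs)": 6805.7,    # -> ~58% (14 km, default tuning)
    "local  TCP     (4 RPCs) ": 4114.7,
    "remote TCP     (4 RPCs) ": 3662.2,
}

for name, bw in results_mib_s.items():
    print(f"{name}: {bw:8.1f} MiB/s = {bw / THEORETICAL_MAX_MIB_S:5.1%} of link")
```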
LNET: TCP vs RoCE v2
Short summary: TCP vs RoCE v2, point-to-point (no congestion)
Short range test:
● RoCE v2 out-of-box LNET bandwidth 2.6x better than TCP
● link saturation 93%
Long range test (14km):
● out-of-box LNET: RoCE v2 1.85x better than TCP
● link saturation: 58% (default settings)
● tuning required: ko2iblnd concurrent_sends=4, peer_credits=64
gives 11332.66 MiB/s (97% saturation)
HW-offloaded RoCE allows for full link utilization and low CPU usage.
A single LNET router is easily able to saturate a 100 Gb/s link.
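The long-range results are consistent with a simple bandwidth-delay-product argument: the 14 km link needs roughly its BDP worth of data in flight just to cover propagation, and more once router queuing and RPC turnaround are included, which is what the higher peer_credits setting provides. A back-of-the-envelope sketch follows; the 1 MiB transfer size and the use of 2 x 81 µs as the round trip are assumptions, not values from the slides.

```python
# Back-of-the-envelope bandwidth-delay product for the 14 km link.
# Assumes 1 MiB LNET transfers; the 81 us one-way latency is taken from the
# "Location" slide, everything else is a rough estimate.

LINK_MB_S = 12_250          # ~100 GbE payload rate from the previous slide [MB/s]
ONE_WAY_S = 81e-6           # one-way latency over the 14 km path
RTT_S = 2 * ONE_WAY_S

bdp_bytes = LINK_MB_S * 1e6 * RTT_S             # data needed "in flight"
transfers_in_flight = bdp_bytes / 2**20         # in 1 MiB transfers (assumption)

print(f"BDP: {bdp_bytes / 1e6:.2f} MB  (~{transfers_in_flight:.1f} x 1 MiB in flight)")
```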
Explicit Congestion Notification
● RoCEv2 can be used over lossy links
● Packet drops == retransmissions == bandwidth hiccups
● Enabling ECN effectively reduces packet drops on
congested ports
● ECN must be enabled on all devices over the path
● If the HCA sees an ECN mark on a received packet:
○ 1. a CNP (Congestion Notification Packet) is sent back to the sender
○ 2. the sender reduces its transmission rate in reaction to the CNP
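The two-step reaction above can be pictured as a simple feedback loop. The toy Python sketch below only illustrates that loop: the receiver reflects ECN marks as CNPs, and the sender halves its rate on each CNP and probes back up otherwise. The cut and recovery factors are arbitrary and are not the DCQCN parameters actually implemented in the HCAs.

```python
# Toy illustration of the feedback loop described above: the receiver turns
# ECN marks into CNPs, and the sender cuts its rate on each CNP and slowly
# recovers otherwise. The factors are arbitrary, not real DCQCN parameters.

LINE_RATE = 100.0   # Gb/s

def sender_step(rate: float, cnp_received: bool) -> float:
    if cnp_received:
        return rate * 0.5                      # back off on congestion feedback
    return min(LINE_RATE, rate + 5.0)          # gradually probe back up

rate = LINE_RATE
for step, congested in enumerate([False, False, True, True, False, False, False]):
    cnp = congested                            # receiver reflects ECN marks as CNPs
    rate = sender_step(rate, cnp)
    print(f"step {step}: {'CNP' if cnp else '   '}  rate = {rate:6.1f} Gb/s")
```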
ECN how-to
1. Use ECN-capable switches
2. Use RoCE-capable host adapters (CX4 and CX5 were tested)
3. Use the DSCP field in the IP header to tag RDMA and CNP packets
on the host (cma_roce_tos), see the ToS sketch below
4. Enable ECN for RoCE traffic on the switches
5. Prioritize CNP packets to ensure proper congestion signaling
6. Enjoy stable transfers and significantly reduced frame drops
7. Optionally use L3 and OSPF or BGP to handle backup routes
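A small helper sketch for step 3: cma_roce_tos takes a ToS byte, which packs the 6-bit DSCP value together with the 2-bit ECN field. The DSCP choices shown (26 for RDMA data, 48 for CNPs) are values commonly seen in vendor examples, not values taken from the slides.

```python
# Helper sketch for step 3: pack a 6-bit DSCP value and the 2-bit ECN field
# into the IP ToS byte used by cma_roce_tos. The DSCP values below (26 for
# RDMA data, 48 for CNP) are common vendor-example choices, NOT from the slide.

ECT0 = 0b10   # "ECN Capable Transport" codepoint, so switches may mark instead of drop

def tos_byte(dscp: int, ecn: int = ECT0) -> int:
    """Pack DSCP (upper 6 bits) and ECN (lower 2 bits) into the IP ToS byte."""
    assert 0 <= dscp < 64 and 0 <= ecn < 4
    return (dscp << 2) | ecn

print("RDMA data: DSCP 26 -> cma_roce_tos =", tos_byte(26))   # 106
print("CNP:       DSCP 48 -> ToS          =", tos_byte(48))   # 194
```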
LNET: congested long link
Lustre 2.10.5, DC1 to DC2 2x100 GbE, test: write 4:2
Congestion appears on the DC1 to DC2 link due to 4:2 link reduction
RoCEv2 no FC: 12818.9 MiB/s 54.86%
TCP no FC: 15368.3 MiB/s 65.78%
RoCEv2 ECN: 19426.8 MiB/s 83.14%
RoCEv2: ECN vs no ECN
Effects of disabling ECN
Real life test
2x DDN ES7990 (4 OSS), 4 LNET routers (RoCE <-> IB FDR), 14 km
Bandwidth: IOR 112 tasks @ 28 client nodes
Max Write: 29872.21 MiB/sec (31323.28 MB/sec)
Max Read: 34368.27 MiB/sec (36037.74 MB/sec)
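As a consistency check on these numbers, the sketch below converts MiB/s to MB/s and estimates the per-router share, assuming the traffic spreads evenly over the 4 LNET routers; that even spread and the reuse of the per-link maximum from the LNET slide are assumptions, not measurements.

```python
# Consistency check on the IOR numbers (MiB/s vs MB/s) and a rough per-router
# share, assuming traffic spreads evenly over the 4 LNET routers (assumption).

MIB = 2**20
LINK_MAX_MIB_S = 11_682          # per 100 GbE link, from the LNET slide

for name, mib_s in {"write": 29872.21, "read": 34368.27}.items():
    mb_s = mib_s * MIB / 1e6
    per_router = mib_s / 4
    print(f"{name}: {mib_s:.2f} MiB/s = {mb_s:.2f} MB/s, "
          f"~{per_router:.0f} MiB/s per router ({per_router / LINK_MAX_MIB_S:.0%} of a link)")
```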
Conclusions
● For bandwidth-oriented workloads, latency over MAN distances is
not an issue
● ECN for RoCE needs to be enabled to significantly reduce packet
drops during congestion
● Link aggregation (LACP + Adaptive Load Balancing, or ECMP for L3)
allows bandwidth to scale linearly by evenly utilizing the available
links (see the hashing sketch below)
● RoCE allows more flexibility in transport links compared to IB,
e.g. backup routing, cheaper and more scalable infrastructure
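The hashing sketch referenced above: with LACP or ECMP, each flow's 5-tuple hashes onto one member link, so many concurrent LNET flows fill the links roughly evenly. Real switches use their own hardware hash functions; zlib.crc32 and the example addresses below are purely illustrative.

```python
# Minimal sketch of hash-based spreading (as in LACP or ECMP): each flow's
# 5-tuple hashes onto one link, so many concurrent flows load links evenly.
# zlib.crc32 stands in for the switch's hardware hash and is illustrative only.

import zlib
from collections import Counter

N_LINKS = 2   # e.g. the 2x 100 GbE DC1-to-DC2 links

def pick_link(src_ip: str, dst_ip: str, src_port: int, dst_port: int, proto: str) -> int:
    key = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}|{proto}".encode()
    return zlib.crc32(key) % N_LINKS

flows = [("10.0.0.%d" % i, "10.1.0.1", 50000 + i, 4791, "udp") for i in range(32)]
usage = Counter(pick_link(*f) for f in flows)
print(dict(usage))   # with enough flows the links end up roughly evenly loaded
```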
Acknowledgements
Thanks for the test infrastructure and support
Visit us at booth H-710!
(and taste some krówka)
Thank you!
