SlideShare a Scribd company logo
1 of 15
Download to read offline
15th ANNUAL WORKSHOP 2019
HPC NETWORKING IN THE REAL WORLD
Jesse Martinez
Los Alamos National Laboratory
March 19th, 2019
[ LOGO HERE ]
LA-UR-19-22146
ABSTRACT
▪Introduction to LANL High Speed Networking
▪Production Needs/Challenges
▪Administration and Monitoring
▪Future
2 OpenFabrics Alliance Workshop 2019
INTRODUCTION INTO
LANL HPC HIGH SPEED NETWORKING
3 OpenFabrics Alliance Workshop 2019
LANL HPC SUPERCOMPUTERS
▪ LANL HPC has very diverse networks and technologies for its
users
• Ranges from old legacy hardware, new commodity systems, to new advanced
technology systems (ATS-1) such as Trinity
▪ Workloads ranging from MPI/Parallel based codes to GPU/on-
node computations
▪ User datasets into the TB storage range (and growing quickly to
PB!)
• Need to be able to move data quickly and reliably to ensure continued production use
and on-going research
▪ Storage solutions evolving quickly to handle not just data
sizes/capacities but parallel workloads and metadata access
• ATS-1 (Trinity)
– Advanced Technology Systems
– ~19,000 compute nodes
– 2 PB RAM
– 42 PFlops
– Cray Aries Network
• CTS-1 Systems
– Commodity Technology Systems
– ~60-1500 compute node
clusters
– ~30-200 TB RAM
– ~.5-2 PFlops
– Intel Omni-Path (OPA) /
Mellanox EDR
• Shares Storage Systems
• Home/Projects (NFS)
• >TB (Ranges)
• Scratch (Lustre)
• ~5PB-72PB
• Campaign Storage (MarFS)
• 30PB
• Archive (HPSS)
• >PB (Ranges)
4 OpenFabrics Alliance Workshop 2019
DIVERSE FABRICS
▪ Mellanox InfiniBand Networks
• Cluster High Speed Network
• Needed for MPI/UCX low latency communication
• SAN Network
• Bandwidth capabilities but most importantly metadata access (dedicated systems for metadata, such as Lustre
MDS)
• Dedicated Subnet Manager systems
• Used for Monitoring purposes
• Topologies ranging from full fat trees to Dragonfly topologies
• SAN made up of “Damselfly Topology” for Lustre Fine Grain Routing
▪ Intel Omni-Path Networks
• Cluster High Speed Network
• MPI/PSM2 low latency communication on Intel hardware/fabrics
• Various fat-tree solutions (full fat tree, 2:1 oversubscription)
• Dedicated Fabric Mangers
▪ Cray Aries Networks
• Used in advanced technology systems
5 OpenFabrics Alliance Workshop 2019
PRODUCTION DESIGN
6 OpenFabrics Alliance Workshop 2019
▪ All super computers have shared file system
resources (except for high end, which have dedicated)
• Need to provide capability for moving data between shared storage
tiers (on system memory -> burst buffers -> scratch -> archive)
• Data movers need to be able to communicate between most of these
tiers to help in data movement for user’s workloads
▪ SAN needs to be able to handle bandwidth needs
• High Speed Network (HSN) needed for bandwidth and latency
(metadata)
• Fabric needs to be designed in a way to allow for expansion and at the
same time, being cost effective
• Chassis solution would simplify everything, but not cheap
• Different types of storage systems will live and be accessed on SAN in
different ways (account for dynamic usage)
At the same time, trying to make sure all data is intact and secure!
STORAGE AREA NETWORK (SAN) DESIGN
7 OpenFabrics Alliance Workshop 2019
Storage Tiers
Memory
Burst Buffers (Flash)
Scratch (Disk)
Campaign (Disk)
Archive (Disk/Tape)
CHALLENGES
▪ SAN technology require different network technologies
• NFS/Archive today on Ethernet
• Scratch/Campaign on IB
• Cluster Network on OPA/IB/Aries
• What will be the next system?
▪ How can we simplify data movement between storage tiers?
• Data Transfer Nodes
• Need to bridge or route between all the storage tiers/types of networks
• Simplify enough for users to be able to move data between the tiers easily
• Requires tools/resources and systems capable of handling data movement
▪ Redundancy/Failure options
• Lustre Fine Grain Routing (FGR) allows for OSS failure modes and reduce need for full fat-tree fabrics (isolates traffic for data
movement)
• Multiple IO/LNET routers provide multiple paths and access to FGR groups from all systems
• Ethernet / HSN routers used to bridge the main to types of networks we have today
• Used to move data between external resources and HPC systems as well as FTA data movement between storage tiers
8 OpenFabrics Alliance Workshop 2019
FTA
NETWORK INTEGRATION
Cluster1 Cluster2
FE FE
HPC
Ethernet
Damselfly
IB
a
Eth/HSN
Routers
Sonexion
Lustre GarciaLNET
Routers
NFS
FTA Campaign
Ethernet Network
HSN Network
Lustre1 Lustre2
OpenFabrics Alliance Workshop 2019
IO/
LNET
IO/
LNET
Cluster3
LNET
Routers
Campus
Archive
ADMINISTRATION AND MONITORING
10 OpenFabrics Alliance Workshop 2019
FABRIC MONITORING AND ADMINISTRATION
▪ Multiple Networks to manage and monitor
• Separate subnet fabric manager systems allow for integrating monitoring software
▪ Monitoring
• IBMon - LANL
• HSNMon - LANL
• Dead Gateway Detection (DGD) - LANL
• Lightweight Distributed Metric Service (LDMS) - Sandia
▪ Alerts network administrations and 24/7 tech support staff of
hardware issues to resolve
▪ Overall data collections allow for real time usage of fabrics
▪ Performance analysis allows correlation between user job
submissions and fabric health counters
• Splunk and Grafana used to detail analysis of network performance in conjunction with
user jobs
• Mapping of routes through fabric used to identify bottlenecks/congestion
• IBMon
– Monitors error and performance of
IB fabrics
– Gathers counters from PerfMgr
in OpenSM
– Detects errors on fabric
• HSNMon
– Monitors error and performance of
OPA fabrics
– Gathers counters from FM /
opareport tools
– Queries fast fabric tools for
hardware information
• Dead Gateway Detection (DGD)
– Monitors communication between
compute clusters and their IO
devices
– Verifies connectivity between IO
devices and SAN networks (IB,
Ethernet)
– Updates routes on computer
clusters/SAN when IO non-
operational
– ECMP Routes
– LNET Routing
– OSPF Routes
11 OpenFabrics Alliance Workshop 2019
PREPARING FOR THE FUTURE
12 OpenFabrics Alliance Workshop 2019
NEXT HIGH SPEED TECHNOLOGIES?
▪ Storage solutions getting faster and more robust
• Need to design SANs to handle existing infrastructure but at the same time make room for other technologies
needed
• Ultra-Scale Research Center identifying ways to make use of different types of network technologies and storage
solutions to get the performance and scales that we need, leveraging existing technologies today and expanding on
use
▪ Integrating Network Technologies
• LANL efforts to have network infrastructure support computing systems for future design
• EMC3 (Efficient Mission-Centric Computing Consortium) working with vendors to look at integrating network
components into helping the growth of computing technologies
▪ New systems looking at existing capabilities
• Existing clusters meant for Hadoop/Container research only have Ethernet internal (look at soft RoCE?)
• Storage leverage RDMA capabilities more efficiently
▪ Can we merge many of these systems together?
• Separate routers and FTAs have to bridge many networks
• How can we easily design a network to encompass all of these needs?
13 OpenFabrics Alliance Workshop 2019
• LANL HPC is ever growing and expanding
fast!
• New technologies provide new
opportunities/challenges
• Difficult to stay ahead of the curve of new systems, but
designing networks efficiently allows to stay on top of the
need for bandwidth and reliability
• Efficient modeling techniques allow us to
map usage of the networks and plan for
future designs of user needs
• LANL making strides into providing more monitoring
capabilities for administrators
It’s never the network! Except for when it is…
CONCLUSION
14 OpenFabrics Alliance Workshop 2019
15th ANNUAL WORKSHOP 2019
THANK YOU
Jesse Martinez
Los Alamos National Laboratory
[ LOGO HERE ]

More Related Content

Similar to HPC Networking in the Real World

Webinar: OpenEBS - Still Free and now FASTEST Kubernetes storage
Webinar: OpenEBS - Still Free and now FASTEST Kubernetes storageWebinar: OpenEBS - Still Free and now FASTEST Kubernetes storage
Webinar: OpenEBS - Still Free and now FASTEST Kubernetes storageMayaData Inc
 
Computer_Clustering_Technologies
Computer_Clustering_TechnologiesComputer_Clustering_Technologies
Computer_Clustering_TechnologiesManish Chopra
 
Cluster Computers
Cluster ComputersCluster Computers
Cluster Computersshopnil786
 
Clustering by AKASHMSHAH
Clustering by AKASHMSHAHClustering by AKASHMSHAH
Clustering by AKASHMSHAHAkash M Shah
 
CC LECTURE NOTES (1).pdf
CC LECTURE NOTES (1).pdfCC LECTURE NOTES (1).pdf
CC LECTURE NOTES (1).pdfHasanAfwaaz1
 
OpenPOWER Acceleration of HPCC Systems
OpenPOWER Acceleration of HPCC SystemsOpenPOWER Acceleration of HPCC Systems
OpenPOWER Acceleration of HPCC SystemsHPCC Systems
 
Accelerate Big Data Processing with High-Performance Computing Technologies
Accelerate Big Data Processing with High-Performance Computing TechnologiesAccelerate Big Data Processing with High-Performance Computing Technologies
Accelerate Big Data Processing with High-Performance Computing TechnologiesIntel® Software
 
HPC and cloud distributed computing, as a journey
HPC and cloud distributed computing, as a journeyHPC and cloud distributed computing, as a journey
HPC and cloud distributed computing, as a journeyPeter Clapham
 
Cluster Computing
Cluster ComputingCluster Computing
Cluster ComputingNIKHIL NAIR
 
Communication Frameworks for HPC and Big Data
Communication Frameworks for HPC and Big DataCommunication Frameworks for HPC and Big Data
Communication Frameworks for HPC and Big Datainside-BigData.com
 
GEN-Z: An Overview and Use Cases
GEN-Z: An Overview and Use CasesGEN-Z: An Overview and Use Cases
GEN-Z: An Overview and Use Casesinside-BigData.com
 
NECOS - Concertation Meeting EUBrasilCloudFORUM
NECOS -  Concertation Meeting EUBrasilCloudFORUMNECOS -  Concertation Meeting EUBrasilCloudFORUM
NECOS - Concertation Meeting EUBrasilCloudFORUMEUBrasilCloudFORUM .
 
IEEE Paper - A Study Of Cloud Computing Environments For High Performance App...
IEEE Paper - A Study Of Cloud Computing Environments For High Performance App...IEEE Paper - A Study Of Cloud Computing Environments For High Performance App...
IEEE Paper - A Study Of Cloud Computing Environments For High Performance App...Angela Williams
 
Creating a Climate for Innovation on Internet2 - Eric Boyd Senior Director, S...
Creating a Climate for Innovation on Internet2 - Eric Boyd Senior Director, S...Creating a Climate for Innovation on Internet2 - Eric Boyd Senior Director, S...
Creating a Climate for Innovation on Internet2 - Eric Boyd Senior Director, S...Ed Dodds
 
Pacemaker+DRBD
Pacemaker+DRBDPacemaker+DRBD
Pacemaker+DRBDDan Frincu
 

Similar to HPC Networking in the Real World (20)

Webinar: OpenEBS - Still Free and now FASTEST Kubernetes storage
Webinar: OpenEBS - Still Free and now FASTEST Kubernetes storageWebinar: OpenEBS - Still Free and now FASTEST Kubernetes storage
Webinar: OpenEBS - Still Free and now FASTEST Kubernetes storage
 
Computer_Clustering_Technologies
Computer_Clustering_TechnologiesComputer_Clustering_Technologies
Computer_Clustering_Technologies
 
Cluster Computers
Cluster ComputersCluster Computers
Cluster Computers
 
Clustering by AKASHMSHAH
Clustering by AKASHMSHAHClustering by AKASHMSHAH
Clustering by AKASHMSHAH
 
CC LECTURE NOTES (1).pdf
CC LECTURE NOTES (1).pdfCC LECTURE NOTES (1).pdf
CC LECTURE NOTES (1).pdf
 
ODP Presentation LinuxCon NA 2014
ODP Presentation LinuxCon NA 2014ODP Presentation LinuxCon NA 2014
ODP Presentation LinuxCon NA 2014
 
OpenPOWER Acceleration of HPCC Systems
OpenPOWER Acceleration of HPCC SystemsOpenPOWER Acceleration of HPCC Systems
OpenPOWER Acceleration of HPCC Systems
 
Accelerate Big Data Processing with High-Performance Computing Technologies
Accelerate Big Data Processing with High-Performance Computing TechnologiesAccelerate Big Data Processing with High-Performance Computing Technologies
Accelerate Big Data Processing with High-Performance Computing Technologies
 
HPC and cloud distributed computing, as a journey
HPC and cloud distributed computing, as a journeyHPC and cloud distributed computing, as a journey
HPC and cloud distributed computing, as a journey
 
Cluster cmputing
Cluster cmputingCluster cmputing
Cluster cmputing
 
Cluster computing
Cluster computingCluster computing
Cluster computing
 
Cluster Computing
Cluster ComputingCluster Computing
Cluster Computing
 
Communication Frameworks for HPC and Big Data
Communication Frameworks for HPC and Big DataCommunication Frameworks for HPC and Big Data
Communication Frameworks for HPC and Big Data
 
B9 cmis
B9 cmisB9 cmis
B9 cmis
 
GEN-Z: An Overview and Use Cases
GEN-Z: An Overview and Use CasesGEN-Z: An Overview and Use Cases
GEN-Z: An Overview and Use Cases
 
NECOS - Concertation Meeting EUBrasilCloudFORUM
NECOS -  Concertation Meeting EUBrasilCloudFORUMNECOS -  Concertation Meeting EUBrasilCloudFORUM
NECOS - Concertation Meeting EUBrasilCloudFORUM
 
IEEE Paper - A Study Of Cloud Computing Environments For High Performance App...
IEEE Paper - A Study Of Cloud Computing Environments For High Performance App...IEEE Paper - A Study Of Cloud Computing Environments For High Performance App...
IEEE Paper - A Study Of Cloud Computing Environments For High Performance App...
 
Creating a Climate for Innovation on Internet2 - Eric Boyd Senior Director, S...
Creating a Climate for Innovation on Internet2 - Eric Boyd Senior Director, S...Creating a Climate for Innovation on Internet2 - Eric Boyd Senior Director, S...
Creating a Climate for Innovation on Internet2 - Eric Boyd Senior Director, S...
 
Pacemaker+DRBD
Pacemaker+DRBDPacemaker+DRBD
Pacemaker+DRBD
 
Clustering
ClusteringClustering
Clustering
 

More from inside-BigData.com

Preparing to program Aurora at Exascale - Early experiences and future direct...
Preparing to program Aurora at Exascale - Early experiences and future direct...Preparing to program Aurora at Exascale - Early experiences and future direct...
Preparing to program Aurora at Exascale - Early experiences and future direct...inside-BigData.com
 
Transforming Private 5G Networks
Transforming Private 5G NetworksTransforming Private 5G Networks
Transforming Private 5G Networksinside-BigData.com
 
The Incorporation of Machine Learning into Scientific Simulations at Lawrence...
The Incorporation of Machine Learning into Scientific Simulations at Lawrence...The Incorporation of Machine Learning into Scientific Simulations at Lawrence...
The Incorporation of Machine Learning into Scientific Simulations at Lawrence...inside-BigData.com
 
How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...
How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...
How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...inside-BigData.com
 
Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...
Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...
Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...inside-BigData.com
 
HPC Impact: EDA Telemetry Neural Networks
HPC Impact: EDA Telemetry Neural NetworksHPC Impact: EDA Telemetry Neural Networks
HPC Impact: EDA Telemetry Neural Networksinside-BigData.com
 
Biohybrid Robotic Jellyfish for Future Applications in Ocean Monitoring
Biohybrid Robotic Jellyfish for Future Applications in Ocean MonitoringBiohybrid Robotic Jellyfish for Future Applications in Ocean Monitoring
Biohybrid Robotic Jellyfish for Future Applications in Ocean Monitoringinside-BigData.com
 
Machine Learning for Weather Forecasts
Machine Learning for Weather ForecastsMachine Learning for Weather Forecasts
Machine Learning for Weather Forecastsinside-BigData.com
 
HPC AI Advisory Council Update
HPC AI Advisory Council UpdateHPC AI Advisory Council Update
HPC AI Advisory Council Updateinside-BigData.com
 
Fugaku Supercomputer joins fight against COVID-19
Fugaku Supercomputer joins fight against COVID-19Fugaku Supercomputer joins fight against COVID-19
Fugaku Supercomputer joins fight against COVID-19inside-BigData.com
 
Energy Efficient Computing using Dynamic Tuning
Energy Efficient Computing using Dynamic TuningEnergy Efficient Computing using Dynamic Tuning
Energy Efficient Computing using Dynamic Tuninginside-BigData.com
 
HPC at Scale Enabled by DDN A3i and NVIDIA SuperPOD
HPC at Scale Enabled by DDN A3i and NVIDIA SuperPODHPC at Scale Enabled by DDN A3i and NVIDIA SuperPOD
HPC at Scale Enabled by DDN A3i and NVIDIA SuperPODinside-BigData.com
 
Versal Premium ACAP for Network and Cloud Acceleration
Versal Premium ACAP for Network and Cloud AccelerationVersal Premium ACAP for Network and Cloud Acceleration
Versal Premium ACAP for Network and Cloud Accelerationinside-BigData.com
 
Zettar: Moving Massive Amounts of Data across Any Distance Efficiently
Zettar: Moving Massive Amounts of Data across Any Distance EfficientlyZettar: Moving Massive Amounts of Data across Any Distance Efficiently
Zettar: Moving Massive Amounts of Data across Any Distance Efficientlyinside-BigData.com
 
Scaling TCO in a Post Moore's Era
Scaling TCO in a Post Moore's EraScaling TCO in a Post Moore's Era
Scaling TCO in a Post Moore's Erainside-BigData.com
 
CUDA-Python and RAPIDS for blazing fast scientific computing
CUDA-Python and RAPIDS for blazing fast scientific computingCUDA-Python and RAPIDS for blazing fast scientific computing
CUDA-Python and RAPIDS for blazing fast scientific computinginside-BigData.com
 
Introducing HPC with a Raspberry Pi Cluster
Introducing HPC with a Raspberry Pi ClusterIntroducing HPC with a Raspberry Pi Cluster
Introducing HPC with a Raspberry Pi Clusterinside-BigData.com
 

More from inside-BigData.com (20)

Major Market Shifts in IT
Major Market Shifts in ITMajor Market Shifts in IT
Major Market Shifts in IT
 
Preparing to program Aurora at Exascale - Early experiences and future direct...
Preparing to program Aurora at Exascale - Early experiences and future direct...Preparing to program Aurora at Exascale - Early experiences and future direct...
Preparing to program Aurora at Exascale - Early experiences and future direct...
 
Transforming Private 5G Networks
Transforming Private 5G NetworksTransforming Private 5G Networks
Transforming Private 5G Networks
 
The Incorporation of Machine Learning into Scientific Simulations at Lawrence...
The Incorporation of Machine Learning into Scientific Simulations at Lawrence...The Incorporation of Machine Learning into Scientific Simulations at Lawrence...
The Incorporation of Machine Learning into Scientific Simulations at Lawrence...
 
How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...
How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...
How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...
 
Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...
Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...
Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...
 
HPC Impact: EDA Telemetry Neural Networks
HPC Impact: EDA Telemetry Neural NetworksHPC Impact: EDA Telemetry Neural Networks
HPC Impact: EDA Telemetry Neural Networks
 
Biohybrid Robotic Jellyfish for Future Applications in Ocean Monitoring
Biohybrid Robotic Jellyfish for Future Applications in Ocean MonitoringBiohybrid Robotic Jellyfish for Future Applications in Ocean Monitoring
Biohybrid Robotic Jellyfish for Future Applications in Ocean Monitoring
 
Machine Learning for Weather Forecasts
Machine Learning for Weather ForecastsMachine Learning for Weather Forecasts
Machine Learning for Weather Forecasts
 
HPC AI Advisory Council Update
HPC AI Advisory Council UpdateHPC AI Advisory Council Update
HPC AI Advisory Council Update
 
Fugaku Supercomputer joins fight against COVID-19
Fugaku Supercomputer joins fight against COVID-19Fugaku Supercomputer joins fight against COVID-19
Fugaku Supercomputer joins fight against COVID-19
 
Energy Efficient Computing using Dynamic Tuning
Energy Efficient Computing using Dynamic TuningEnergy Efficient Computing using Dynamic Tuning
Energy Efficient Computing using Dynamic Tuning
 
HPC at Scale Enabled by DDN A3i and NVIDIA SuperPOD
HPC at Scale Enabled by DDN A3i and NVIDIA SuperPODHPC at Scale Enabled by DDN A3i and NVIDIA SuperPOD
HPC at Scale Enabled by DDN A3i and NVIDIA SuperPOD
 
State of ARM-based HPC
State of ARM-based HPCState of ARM-based HPC
State of ARM-based HPC
 
Versal Premium ACAP for Network and Cloud Acceleration
Versal Premium ACAP for Network and Cloud AccelerationVersal Premium ACAP for Network and Cloud Acceleration
Versal Premium ACAP for Network and Cloud Acceleration
 
Zettar: Moving Massive Amounts of Data across Any Distance Efficiently
Zettar: Moving Massive Amounts of Data across Any Distance EfficientlyZettar: Moving Massive Amounts of Data across Any Distance Efficiently
Zettar: Moving Massive Amounts of Data across Any Distance Efficiently
 
Scaling TCO in a Post Moore's Era
Scaling TCO in a Post Moore's EraScaling TCO in a Post Moore's Era
Scaling TCO in a Post Moore's Era
 
CUDA-Python and RAPIDS for blazing fast scientific computing
CUDA-Python and RAPIDS for blazing fast scientific computingCUDA-Python and RAPIDS for blazing fast scientific computing
CUDA-Python and RAPIDS for blazing fast scientific computing
 
Introducing HPC with a Raspberry Pi Cluster
Introducing HPC with a Raspberry Pi ClusterIntroducing HPC with a Raspberry Pi Cluster
Introducing HPC with a Raspberry Pi Cluster
 
Overview of HPC Interconnects
Overview of HPC InterconnectsOverview of HPC Interconnects
Overview of HPC Interconnects
 

Recently uploaded

TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProduct Anonymous
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...DianaGray10
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobeapidays
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?Antenna Manufacturer Coco
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherRemote DBA Services
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoffsammart93
 
Developing An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilDeveloping An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilV3cube
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
Tech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfTech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfhans926745
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
HTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesHTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesBoston Institute of Analytics
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century educationjfdjdjcjdnsjd
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 

Recently uploaded (20)

TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Developing An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilDeveloping An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of Brazil
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Tech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfTech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdf
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
HTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesHTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation Strategies
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 

HPC Networking in the Real World

  • 1. 15th ANNUAL WORKSHOP 2019 HPC NETWORKING IN THE REAL WORLD Jesse Martinez Los Alamos National Laboratory March 19th, 2019 [ LOGO HERE ] LA-UR-19-22146
  • 2. ABSTRACT ▪Introduction to LANL High Speed Networking ▪Production Needs/Challenges ▪Administration and Monitoring ▪Future 2 OpenFabrics Alliance Workshop 2019
  • 3. INTRODUCTION INTO LANL HPC HIGH SPEED NETWORKING 3 OpenFabrics Alliance Workshop 2019
  • 4. LANL HPC SUPERCOMPUTERS ▪ LANL HPC has very diverse networks and technologies for its users • Ranges from old legacy hardware, new commodity systems, to new advanced technology systems (ATS-1) such as Trinity ▪ Workloads ranging from MPI/Parallel based codes to GPU/on- node computations ▪ User datasets into the TB storage range (and growing quickly to PB!) • Need to be able to move data quickly and reliably to ensure continued production use and on-going research ▪ Storage solutions evolving quickly to handle not just data sizes/capacities but parallel workloads and metadata access • ATS-1 (Trinity) – Advanced Technology Systems – ~19,000 compute nodes – 2 PB RAM – 42 PFlops – Cray Aries Network • CTS-1 Systems – Commodity Technology Systems – ~60-1500 compute node clusters – ~30-200 TB RAM – ~.5-2 PFlops – Intel Omni-Path (OPA) / Mellanox EDR • Shares Storage Systems • Home/Projects (NFS) • >TB (Ranges) • Scratch (Lustre) • ~5PB-72PB • Campaign Storage (MarFS) • 30PB • Archive (HPSS) • >PB (Ranges) 4 OpenFabrics Alliance Workshop 2019
  • 5. DIVERSE FABRICS ▪ Mellanox InfiniBand Networks • Cluster High Speed Network • Needed for MPI/UCX low latency communication • SAN Network • Bandwidth capabilities but most importantly metadata access (dedicated systems for metadata, such as Lustre MDS) • Dedicated Subnet Manager systems • Used for Monitoring purposes • Topologies ranging from full fat trees to Dragonfly topologies • SAN made up of “Damselfly Topology” for Lustre Fine Grain Routing ▪ Intel Omni-Path Networks • Cluster High Speed Network • MPI/PSM2 low latency communication on Intel hardware/fabrics • Various fat-tree solutions (full fat tree, 2:1 oversubscription) • Dedicated Fabric Mangers ▪ Cray Aries Networks • Used in advanced technology systems 5 OpenFabrics Alliance Workshop 2019
  • 6. PRODUCTION DESIGN 6 OpenFabrics Alliance Workshop 2019
  • 7. ▪ All super computers have shared file system resources (except for high end, which have dedicated) • Need to provide capability for moving data between shared storage tiers (on system memory -> burst buffers -> scratch -> archive) • Data movers need to be able to communicate between most of these tiers to help in data movement for user’s workloads ▪ SAN needs to be able to handle bandwidth needs • High Speed Network (HSN) needed for bandwidth and latency (metadata) • Fabric needs to be designed in a way to allow for expansion and at the same time, being cost effective • Chassis solution would simplify everything, but not cheap • Different types of storage systems will live and be accessed on SAN in different ways (account for dynamic usage) At the same time, trying to make sure all data is intact and secure! STORAGE AREA NETWORK (SAN) DESIGN 7 OpenFabrics Alliance Workshop 2019 Storage Tiers Memory Burst Buffers (Flash) Scratch (Disk) Campaign (Disk) Archive (Disk/Tape)
  • 8. CHALLENGES ▪ SAN technology require different network technologies • NFS/Archive today on Ethernet • Scratch/Campaign on IB • Cluster Network on OPA/IB/Aries • What will be the next system? ▪ How can we simplify data movement between storage tiers? • Data Transfer Nodes • Need to bridge or route between all the storage tiers/types of networks • Simplify enough for users to be able to move data between the tiers easily • Requires tools/resources and systems capable of handling data movement ▪ Redundancy/Failure options • Lustre Fine Grain Routing (FGR) allows for OSS failure modes and reduce need for full fat-tree fabrics (isolates traffic for data movement) • Multiple IO/LNET routers provide multiple paths and access to FGR groups from all systems • Ethernet / HSN routers used to bridge the main to types of networks we have today • Used to move data between external resources and HPC systems as well as FTA data movement between storage tiers 8 OpenFabrics Alliance Workshop 2019
  • 9. FTA NETWORK INTEGRATION Cluster1 Cluster2 FE FE HPC Ethernet Damselfly IB a Eth/HSN Routers Sonexion Lustre GarciaLNET Routers NFS FTA Campaign Ethernet Network HSN Network Lustre1 Lustre2 OpenFabrics Alliance Workshop 2019 IO/ LNET IO/ LNET Cluster3 LNET Routers Campus Archive
  • 10. ADMINISTRATION AND MONITORING 10 OpenFabrics Alliance Workshop 2019
  • 11. FABRIC MONITORING AND ADMINISTRATION ▪ Multiple Networks to manage and monitor • Separate subnet fabric manager systems allow for integrating monitoring software ▪ Monitoring • IBMon - LANL • HSNMon - LANL • Dead Gateway Detection (DGD) - LANL • Lightweight Distributed Metric Service (LDMS) - Sandia ▪ Alerts network administrations and 24/7 tech support staff of hardware issues to resolve ▪ Overall data collections allow for real time usage of fabrics ▪ Performance analysis allows correlation between user job submissions and fabric health counters • Splunk and Grafana used to detail analysis of network performance in conjunction with user jobs • Mapping of routes through fabric used to identify bottlenecks/congestion • IBMon – Monitors error and performance of IB fabrics – Gathers counters from PerfMgr in OpenSM – Detects errors on fabric • HSNMon – Monitors error and performance of OPA fabrics – Gathers counters from FM / opareport tools – Queries fast fabric tools for hardware information • Dead Gateway Detection (DGD) – Monitors communication between compute clusters and their IO devices – Verifies connectivity between IO devices and SAN networks (IB, Ethernet) – Updates routes on computer clusters/SAN when IO non- operational – ECMP Routes – LNET Routing – OSPF Routes 11 OpenFabrics Alliance Workshop 2019
  • 12. PREPARING FOR THE FUTURE 12 OpenFabrics Alliance Workshop 2019
  • 13. NEXT HIGH SPEED TECHNOLOGIES? ▪ Storage solutions getting faster and more robust • Need to design SANs to handle existing infrastructure but at the same time make room for other technologies needed • Ultra-Scale Research Center identifying ways to make use of different types of network technologies and storage solutions to get the performance and scales that we need, leveraging existing technologies today and expanding on use ▪ Integrating Network Technologies • LANL efforts to have network infrastructure support computing systems for future design • EMC3 (Efficient Mission-Centric Computing Consortium) working with vendors to look at integrating network components into helping the growth of computing technologies ▪ New systems looking at existing capabilities • Existing clusters meant for Hadoop/Container research only have Ethernet internal (look at soft RoCE?) • Storage leverage RDMA capabilities more efficiently ▪ Can we merge many of these systems together? • Separate routers and FTAs have to bridge many networks • How can we easily design a network to encompass all of these needs? 13 OpenFabrics Alliance Workshop 2019
  • 14. • LANL HPC is ever growing and expanding fast! • New technologies provide new opportunities/challenges • Difficult to stay ahead of the curve of new systems, but designing networks efficiently allows to stay on top of the need for bandwidth and reliability • Efficient modeling techniques allow us to map usage of the networks and plan for future designs of user needs • LANL making strides into providing more monitoring capabilities for administrators It’s never the network! Except for when it is… CONCLUSION 14 OpenFabrics Alliance Workshop 2019
  • 15. 15th ANNUAL WORKSHOP 2019 THANK YOU Jesse Martinez Los Alamos National Laboratory [ LOGO HERE ]