SlideShare a Scribd company logo
technology consulting
Arno Kolster, Principal & Co-Founder
(Un)Obscured By Clouds
Applying Cloud Techniques To Address Complexity
In High Performance System Integrations
What Prompted This Talk?
3
Integrations between HPC and Enterprise have been coming for quite some time.
HPC in the cloud is now a reality - easier and cheaper to use than ever before.
But cloud still a huge learning curve and paradigm shift for traditional HPC.
FOMO on latest technical trends, technology itself and desire to increase ROI.
Majority of our work has been leveraging our hyperscale/cloud experience in
HPC environments - time to talk about ’eating our own dog food’.
As it is for enterprise embracing HPC tech – i.e. GPUs for ML/DL/AI.
Use Case #1
Leadership Class Supercomputer Operations Integration
“As we build faster and faster supercomputers, their energy efficiency becomes more
and more important,” said Jim Rogers, computing and facilities director for ORNL’s
National Center for Computational Sciences. “We wanted to couple Summit’s
mechanical cooling system with its computational workload to optimize efficiency,
which can translate to significant cost savings for a system of this size.”
Drivers For Summit Cooling Intelligence
6
Summit is the largest water cooled supercomputer in the world.
Minor fluctuations in temperature across 4,608 nodes can have a dramatic
positive or negative impact on Summit’s operational efficiency.
No mechanism exists to monitor the temperature in real time so that systems
operators can make adjustments as appropriate.
Can take upwards of 3 mins for change in water temp to reach furthest node
in the grid.
Systems-level intelligence can inform decisions and impact costs, efficiency
and operational excellence.
SCI Design Goals
7
Architect, develop and implement a scalable platform to:
Ingest line-rate telemetry from Summit & supporting systems in real time.
Add data sets easily to enhance analytics and insight.
Employ cloud-style ‘reliable messaging’ as a common interface to provide
consistency and coherency for ‘in-flight’ data.
Scale data at orders-of-magnitude without proportional staff increases.
Obey the axiom: No single failure results in a single point-of-failure.
Currently 4 Source-Data Metrics
8
TOTAL ~460,000 metrics per second available for real-time analytics
1. IBM OpenBMC framework (99 metrics * 4608 nodes/second = 456,192/sec)
2. IBM LSF jobs data for running applications (~10sec)
3. NOAA weather/wet bulb for Oak Ridge area (continuous)
4. Programmable Logic Circuit (PLC) water flow (continuous)
Instrumentation and monitoring data for developed applications (ad hoc)
Summarized and collated metrics (continuous)
Instrumented Ascent (dev cluster) at the same time using same platform
SCI Architecture Created
9
Event Message Bus
Prometheus
Operational
Etcd/Consensus
Transient Config
ElasticSearch, Log
Unstructured Data
Structured Data
Relational DBs
Streaming
Analytics
Neural Networks
TensorFlow, Caffe, etc
openBMC NOAA WeatherLSF job data PLC
Other
Data
Machine
Learning
Data Sources
Persistence Learning
PLC
Recommendations
User Dashboarding
Monitoring
Reporting
Research/Search
Technologies Used & Integrated
10
High-availability microservices designed for scalable, small footprint
Kafka message bus for ubiquitous message delivery
Docker containers for ease of deployment and elastic scalability
Openshift/Kubernetes for container orchestration and instrumentation
Spark Streaming for on-the-wire data analytics
Prometheus for service discovery metrics collection and time-series db
Jupyter Notebooks for models and data science exploration (Python/Spark)
Grafana, Seaborn for visualization
Breakthroughs
11
Created a new paradigm for analyzing data beyond data-at-rest or data at the sensor.
Data-in-flight is agnostic and can be captured prior to being translated into
specific software formats, some of which can be proprietary.
Overlapping data sets from disassociated sources can now be displayed on an
identical scale (time), which is something new.
What Can Be Done Now
12
• Core temps for CPU, GPU, memory DIMMs, HBMs
• Temps by CPU core, GPU core
• Power to the node, fans and to other individual components
Capture 99 node metrics per second for streaming analytics; for example,
Visualize per-second node temperature+power and temperature+jobs (currently).
Optimize operations by combining metrics from Summit, control systems & weather.
Add more real-time data streams from varying sources to augment analytics.
Analyze behavior from both a systems level and node level.
Impact
13
• Efficiencies in Summit ecosystem, job throughput and individual jobs
• Micro-managed cooling leads to running at higher temps with less risk
• Greater efficiency leads to Summit optimization, health and lower costs
Real-time analytics helps operators detect anomalies quickly and in a manner that is
more responsive than querying a data warehouse. This approach is especially valuable
with such a large investment in a critical asset.
This platform captures all the data needed to conduct predictive analytics, thus
ensuring new levels of insight, alerting and monitoring.
Real-time data can ultimately help micro-manage such a large system in ways that
are far more efficient and cost effective than human-only intervention.
Real-Time Visualization
Summit Power/Jobs Every Second
Summit Temperature/Jobs Every Second
This Was Not Without Its Challenges…
17
Long ramp up time due to contract, legal & solidifying SOW requirements.
Resource constraints on both sides due to scheduling and unforeseen delays.
Had to work within the confines of the Summit security model – no direct access -
containers saved the day.
Limited computer resources due to h/w budget. Built everything shown on 3 nodes.
But The Effort Was Worth It.
18
Overlay additional disparate data sources in a production environment that is both
scalable and elastic.
Provide predictive analytics for water temperature control based on scheduled jobs in
the queue that can provide input to the PLCs.
Develop predictive algorithms that intelligently affect power/temp across multiple
dimensions: CPU, GPU, memory, power.
Determine how this platoform can be utilized in Exascale systems.
Provide interoperability between expert systems and components to create self-healing,
scaling and operational control.
Research how specific codes/algorithms affect job performance.
Use Case #2
Aerospace Defense Contractor
Problem Statement
21
Client wanted understanding on cloud technologies and how to move
certain workloads into the cloud
They had minimal cloud experience
Wanted on-site, hands-on training
Comparative analysis between major cloud providers
Alleviate concerns around security, data transfer, multi-tenancy, account
management
Solution Provided
22
Designed a custom two day ‘Cloud 101’ course tailored to the project the client
desired to migrate into the cloud
Engagement model included weekly consultation for ongoing feedback,
technical direction and open dialog (“Are we doing the right thing?”)
Showcased best practices that can be used on all cloud platforms
Presented how to think about cloud services as simplified tasks vs single
monolithic HPC jobs
CLIENT PROPOSED ARCHITECTURE
Client Solution Is…
24
Overly complex, both from a workflow and operational perspective.
Cannot meet availability requirements as structured.
Extremely difficult to scale, deploy and integrate new components.
Using mix of legacy, client/server Linux, HPC and standalone Windows.
On-prem / hybrid / cloud delineation is blurred.
25
PW CLOUD BASED ARCHITECTURE
26
Cloud Based Solutions Applied
27
Now have availability zones across multiple regions for potential worldwide site
deployments
Simple containerized deployment can be pushed to any cloud provider
Message bus allows faster and easier future integrations from disparate sites
just by pushing container
Applied a cloud-centric model leveraging modern architectures including
containerized microservices
On-prem / cloud integration delineated, but seamless
In Summary
28
HPC Integrations CAN Be Simplified
29
Leverage cloud based architectures that maximize efficiencies of scale.
Middleware is important - Use an ubiquitous transport system to deliver data,
payloads, logs and control messages. Two use cases, two completely different
reasons for bringing messages to bear.
Break large tasks into smaller ones to decrease complexity and introduce
parallelism (where it makes sense).
Microservices can then be scaled independently as the needs arise.
Thank you.
arno.kolster@providentiaworldwide.com
@providentia_ww
30
www.providentiaworldwide.com

More Related Content

What's hot

High performance computing
High performance computingHigh performance computing
High performance computing
punjab engineering college, chandigarh
 
High performance computing
High performance computingHigh performance computing
High performance computing
Guy Tel-Zur
 
High Performance Computing
High Performance ComputingHigh Performance Computing
High Performance Computing
Divyen Patel
 
A Review on Scheduling in Cloud Computing
A Review on Scheduling in Cloud ComputingA Review on Scheduling in Cloud Computing
A Review on Scheduling in Cloud Computing
ijujournal
 
High–Performance Computing
High–Performance ComputingHigh–Performance Computing
High–Performance Computing
BRAC University Computer Club
 
A Review on Scheduling in Cloud Computing
A Review on Scheduling in Cloud ComputingA Review on Scheduling in Cloud Computing
A Review on Scheduling in Cloud Computing
ijujournal
 
Mod05lec24(resource mgmt i)
Mod05lec24(resource mgmt i)Mod05lec24(resource mgmt i)
Mod05lec24(resource mgmt i)
Ankit Gupta
 
High performance computing for research
High performance computing for researchHigh performance computing for research
High performance computing for research
Esteban Hernandez
 
High performance computing
High performance computingHigh performance computing
High performance computing
Maher Alshammari
 
HYPPO - NECSTTechTalk 23/04/2020
HYPPO - NECSTTechTalk 23/04/2020HYPPO - NECSTTechTalk 23/04/2020
HYPPO - NECSTTechTalk 23/04/2020
NECST Lab @ Politecnico di Milano
 
Task scheduling Survey in Cloud Computing
Task scheduling Survey in Cloud ComputingTask scheduling Survey in Cloud Computing
Task scheduling Survey in Cloud Computing
Ramandeep Kaur
 
cloud schedualing
cloud schedualingcloud schedualing
cloud schedualing
twomarkopolo
 
Mod05lec23(map reduce tutorial)
Mod05lec23(map reduce tutorial)Mod05lec23(map reduce tutorial)
Mod05lec23(map reduce tutorial)
Ankit Gupta
 
Dynamic resource allocation using virtual machines for cloud computing enviro...
Dynamic resource allocation using virtual machines for cloud computing enviro...Dynamic resource allocation using virtual machines for cloud computing enviro...
Dynamic resource allocation using virtual machines for cloud computing enviro...
IEEEFINALYEARPROJECTS
 
Scheduling in cloud
Scheduling in cloudScheduling in cloud
Scheduling in cloud
Dr.Manjunath Kotari
 
CloudLightning Simulator
CloudLightning SimulatorCloudLightning Simulator
CloudLightning Simulator
CloudLightning
 
Job sequence scheduling for cloud computing
Job sequence scheduling for cloud computingJob sequence scheduling for cloud computing
Job sequence scheduling for cloud computing
Samruddhi Gaikwad
 
A survey of various scheduling algorithm in cloud computing environment
A survey of various scheduling algorithm in cloud computing environmentA survey of various scheduling algorithm in cloud computing environment
A survey of various scheduling algorithm in cloud computing environment
eSAT Publishing House
 
High Performance Computing in the Cloud?
High Performance Computing in the Cloud?High Performance Computing in the Cloud?
High Performance Computing in the Cloud?
Ian Lumb
 
Welcome to International Journal of Engineering Research and Development (IJERD)
Welcome to International Journal of Engineering Research and Development (IJERD)Welcome to International Journal of Engineering Research and Development (IJERD)
Welcome to International Journal of Engineering Research and Development (IJERD)
IJERD Editor
 

What's hot (20)

High performance computing
High performance computingHigh performance computing
High performance computing
 
High performance computing
High performance computingHigh performance computing
High performance computing
 
High Performance Computing
High Performance ComputingHigh Performance Computing
High Performance Computing
 
A Review on Scheduling in Cloud Computing
A Review on Scheduling in Cloud ComputingA Review on Scheduling in Cloud Computing
A Review on Scheduling in Cloud Computing
 
High–Performance Computing
High–Performance ComputingHigh–Performance Computing
High–Performance Computing
 
A Review on Scheduling in Cloud Computing
A Review on Scheduling in Cloud ComputingA Review on Scheduling in Cloud Computing
A Review on Scheduling in Cloud Computing
 
Mod05lec24(resource mgmt i)
Mod05lec24(resource mgmt i)Mod05lec24(resource mgmt i)
Mod05lec24(resource mgmt i)
 
High performance computing for research
High performance computing for researchHigh performance computing for research
High performance computing for research
 
High performance computing
High performance computingHigh performance computing
High performance computing
 
HYPPO - NECSTTechTalk 23/04/2020
HYPPO - NECSTTechTalk 23/04/2020HYPPO - NECSTTechTalk 23/04/2020
HYPPO - NECSTTechTalk 23/04/2020
 
Task scheduling Survey in Cloud Computing
Task scheduling Survey in Cloud ComputingTask scheduling Survey in Cloud Computing
Task scheduling Survey in Cloud Computing
 
cloud schedualing
cloud schedualingcloud schedualing
cloud schedualing
 
Mod05lec23(map reduce tutorial)
Mod05lec23(map reduce tutorial)Mod05lec23(map reduce tutorial)
Mod05lec23(map reduce tutorial)
 
Dynamic resource allocation using virtual machines for cloud computing enviro...
Dynamic resource allocation using virtual machines for cloud computing enviro...Dynamic resource allocation using virtual machines for cloud computing enviro...
Dynamic resource allocation using virtual machines for cloud computing enviro...
 
Scheduling in cloud
Scheduling in cloudScheduling in cloud
Scheduling in cloud
 
CloudLightning Simulator
CloudLightning SimulatorCloudLightning Simulator
CloudLightning Simulator
 
Job sequence scheduling for cloud computing
Job sequence scheduling for cloud computingJob sequence scheduling for cloud computing
Job sequence scheduling for cloud computing
 
A survey of various scheduling algorithm in cloud computing environment
A survey of various scheduling algorithm in cloud computing environmentA survey of various scheduling algorithm in cloud computing environment
A survey of various scheduling algorithm in cloud computing environment
 
High Performance Computing in the Cloud?
High Performance Computing in the Cloud?High Performance Computing in the Cloud?
High Performance Computing in the Cloud?
 
Welcome to International Journal of Engineering Research and Development (IJERD)
Welcome to International Journal of Engineering Research and Development (IJERD)Welcome to International Journal of Engineering Research and Development (IJERD)
Welcome to International Journal of Engineering Research and Development (IJERD)
 

Similar to Applying Cloud Techniques to Address Complexity in HPC System Integrations

High Performance Computing
High Performance ComputingHigh Performance Computing
High Performance Computing
Nous Infosystems
 
2023comp90024_Spartan.pdf
2023comp90024_Spartan.pdf2023comp90024_Spartan.pdf
2023comp90024_Spartan.pdf
LevLafayette1
 
CC LECTURE NOTES (1).pdf
CC LECTURE NOTES (1).pdfCC LECTURE NOTES (1).pdf
CC LECTURE NOTES (1).pdf
HasanAfwaaz1
 
Workload Automation for Cloud Migration and Machine Learning Platform
Workload Automation for Cloud Migration and Machine Learning PlatformWorkload Automation for Cloud Migration and Machine Learning Platform
Workload Automation for Cloud Migration and Machine Learning Platform
Activeeon
 
Energy-Efficient Task Scheduling in Cloud Environment
Energy-Efficient Task Scheduling in Cloud EnvironmentEnergy-Efficient Task Scheduling in Cloud Environment
Energy-Efficient Task Scheduling in Cloud Environment
IRJET Journal
 
Wp a-break-in-the-clouds
Wp a-break-in-the-cloudsWp a-break-in-the-clouds
Wp a-break-in-the-clouds
Mohsen Tayefeh
 
Webinar: Cutting Time, Complexity and Cost from Data Science to Production
Webinar: Cutting Time, Complexity and Cost from Data Science to ProductionWebinar: Cutting Time, Complexity and Cost from Data Science to Production
Webinar: Cutting Time, Complexity and Cost from Data Science to Production
iguazio
 
MIG 5th Data Centre Summit 2016 PTS Presentation v1
MIG 5th Data Centre Summit 2016 PTS Presentation v1MIG 5th Data Centre Summit 2016 PTS Presentation v1
MIG 5th Data Centre Summit 2016 PTS Presentation v1
blewington
 
BUILDING A PRIVATE HPC CLOUD FOR COMPUTE AND DATA-INTENSIVE APPLICATIONS
BUILDING A PRIVATE HPC CLOUD FOR COMPUTE AND DATA-INTENSIVE APPLICATIONSBUILDING A PRIVATE HPC CLOUD FOR COMPUTE AND DATA-INTENSIVE APPLICATIONS
BUILDING A PRIVATE HPC CLOUD FOR COMPUTE AND DATA-INTENSIVE APPLICATIONS
ijccsa
 
IEEE Paper - A Study Of Cloud Computing Environments For High Performance App...
IEEE Paper - A Study Of Cloud Computing Environments For High Performance App...IEEE Paper - A Study Of Cloud Computing Environments For High Performance App...
IEEE Paper - A Study Of Cloud Computing Environments For High Performance App...
Angela Williams
 
Cloud Roundtable at Microsoft Switzerland
Cloud Roundtable at Microsoft Switzerland Cloud Roundtable at Microsoft Switzerland
Cloud Roundtable at Microsoft Switzerland
mictc
 
Priorities Shift In IC Design
Priorities Shift In IC DesignPriorities Shift In IC Design
Priorities Shift In IC Design
Abacus Technologies
 
Parallex - The Supercomputer
Parallex - The SupercomputerParallex - The Supercomputer
Parallex - The Supercomputer
Ankit Singh
 
Gupta_Keynote_VTDC-3
Gupta_Keynote_VTDC-3Gupta_Keynote_VTDC-3
Gupta_Keynote_VTDC-3
Abhishek Gupta
 
IBM InterConnect 2013 Expert Integrated Systems Keynote: Sotiropoulos & Wieck
IBM InterConnect 2013 Expert Integrated Systems Keynote: Sotiropoulos & WieckIBM InterConnect 2013 Expert Integrated Systems Keynote: Sotiropoulos & Wieck
IBM InterConnect 2013 Expert Integrated Systems Keynote: Sotiropoulos & Wieck
IBM Events
 
Procesamiento multinúcleo óptimo para aplicaciones críticas de seguridad
 Procesamiento multinúcleo óptimo para aplicaciones críticas de seguridad Procesamiento multinúcleo óptimo para aplicaciones críticas de seguridad
Procesamiento multinúcleo óptimo para aplicaciones críticas de seguridad
Marketing Donalba
 
Dynamic resource allocation using virtual machines for cloud computing enviro...
Dynamic resource allocation using virtual machines for cloud computing enviro...Dynamic resource allocation using virtual machines for cloud computing enviro...
Dynamic resource allocation using virtual machines for cloud computing enviro...
IEEEFINALYEARPROJECTS
 
JAVA 2013 IEEE PARALLELDISTRIBUTION PROJECT Dynamic resource allocation using...
JAVA 2013 IEEE PARALLELDISTRIBUTION PROJECT Dynamic resource allocation using...JAVA 2013 IEEE PARALLELDISTRIBUTION PROJECT Dynamic resource allocation using...
JAVA 2013 IEEE PARALLELDISTRIBUTION PROJECT Dynamic resource allocation using...
IEEEGLOBALSOFTTECHNOLOGIES
 
Privacy preserving public auditing for secured cloud storage
Privacy preserving public auditing for secured cloud storagePrivacy preserving public auditing for secured cloud storage
Privacy preserving public auditing for secured cloud storage
dbpublications
 
CloudComputing_UNIT5.pdf
CloudComputing_UNIT5.pdfCloudComputing_UNIT5.pdf
CloudComputing_UNIT5.pdf
khan593595
 

Similar to Applying Cloud Techniques to Address Complexity in HPC System Integrations (20)

High Performance Computing
High Performance ComputingHigh Performance Computing
High Performance Computing
 
2023comp90024_Spartan.pdf
2023comp90024_Spartan.pdf2023comp90024_Spartan.pdf
2023comp90024_Spartan.pdf
 
CC LECTURE NOTES (1).pdf
CC LECTURE NOTES (1).pdfCC LECTURE NOTES (1).pdf
CC LECTURE NOTES (1).pdf
 
Workload Automation for Cloud Migration and Machine Learning Platform
Workload Automation for Cloud Migration and Machine Learning PlatformWorkload Automation for Cloud Migration and Machine Learning Platform
Workload Automation for Cloud Migration and Machine Learning Platform
 
Energy-Efficient Task Scheduling in Cloud Environment
Energy-Efficient Task Scheduling in Cloud EnvironmentEnergy-Efficient Task Scheduling in Cloud Environment
Energy-Efficient Task Scheduling in Cloud Environment
 
Wp a-break-in-the-clouds
Wp a-break-in-the-cloudsWp a-break-in-the-clouds
Wp a-break-in-the-clouds
 
Webinar: Cutting Time, Complexity and Cost from Data Science to Production
Webinar: Cutting Time, Complexity and Cost from Data Science to ProductionWebinar: Cutting Time, Complexity and Cost from Data Science to Production
Webinar: Cutting Time, Complexity and Cost from Data Science to Production
 
MIG 5th Data Centre Summit 2016 PTS Presentation v1
MIG 5th Data Centre Summit 2016 PTS Presentation v1MIG 5th Data Centre Summit 2016 PTS Presentation v1
MIG 5th Data Centre Summit 2016 PTS Presentation v1
 
BUILDING A PRIVATE HPC CLOUD FOR COMPUTE AND DATA-INTENSIVE APPLICATIONS
BUILDING A PRIVATE HPC CLOUD FOR COMPUTE AND DATA-INTENSIVE APPLICATIONSBUILDING A PRIVATE HPC CLOUD FOR COMPUTE AND DATA-INTENSIVE APPLICATIONS
BUILDING A PRIVATE HPC CLOUD FOR COMPUTE AND DATA-INTENSIVE APPLICATIONS
 
IEEE Paper - A Study Of Cloud Computing Environments For High Performance App...
IEEE Paper - A Study Of Cloud Computing Environments For High Performance App...IEEE Paper - A Study Of Cloud Computing Environments For High Performance App...
IEEE Paper - A Study Of Cloud Computing Environments For High Performance App...
 
Cloud Roundtable at Microsoft Switzerland
Cloud Roundtable at Microsoft Switzerland Cloud Roundtable at Microsoft Switzerland
Cloud Roundtable at Microsoft Switzerland
 
Priorities Shift In IC Design
Priorities Shift In IC DesignPriorities Shift In IC Design
Priorities Shift In IC Design
 
Parallex - The Supercomputer
Parallex - The SupercomputerParallex - The Supercomputer
Parallex - The Supercomputer
 
Gupta_Keynote_VTDC-3
Gupta_Keynote_VTDC-3Gupta_Keynote_VTDC-3
Gupta_Keynote_VTDC-3
 
IBM InterConnect 2013 Expert Integrated Systems Keynote: Sotiropoulos & Wieck
IBM InterConnect 2013 Expert Integrated Systems Keynote: Sotiropoulos & WieckIBM InterConnect 2013 Expert Integrated Systems Keynote: Sotiropoulos & Wieck
IBM InterConnect 2013 Expert Integrated Systems Keynote: Sotiropoulos & Wieck
 
Procesamiento multinúcleo óptimo para aplicaciones críticas de seguridad
 Procesamiento multinúcleo óptimo para aplicaciones críticas de seguridad Procesamiento multinúcleo óptimo para aplicaciones críticas de seguridad
Procesamiento multinúcleo óptimo para aplicaciones críticas de seguridad
 
Dynamic resource allocation using virtual machines for cloud computing enviro...
Dynamic resource allocation using virtual machines for cloud computing enviro...Dynamic resource allocation using virtual machines for cloud computing enviro...
Dynamic resource allocation using virtual machines for cloud computing enviro...
 
JAVA 2013 IEEE PARALLELDISTRIBUTION PROJECT Dynamic resource allocation using...
JAVA 2013 IEEE PARALLELDISTRIBUTION PROJECT Dynamic resource allocation using...JAVA 2013 IEEE PARALLELDISTRIBUTION PROJECT Dynamic resource allocation using...
JAVA 2013 IEEE PARALLELDISTRIBUTION PROJECT Dynamic resource allocation using...
 
Privacy preserving public auditing for secured cloud storage
Privacy preserving public auditing for secured cloud storagePrivacy preserving public auditing for secured cloud storage
Privacy preserving public auditing for secured cloud storage
 
CloudComputing_UNIT5.pdf
CloudComputing_UNIT5.pdfCloudComputing_UNIT5.pdf
CloudComputing_UNIT5.pdf
 

More from inside-BigData.com

Major Market Shifts in IT
Major Market Shifts in ITMajor Market Shifts in IT
Major Market Shifts in IT
inside-BigData.com
 
Preparing to program Aurora at Exascale - Early experiences and future direct...
Preparing to program Aurora at Exascale - Early experiences and future direct...Preparing to program Aurora at Exascale - Early experiences and future direct...
Preparing to program Aurora at Exascale - Early experiences and future direct...
inside-BigData.com
 
Transforming Private 5G Networks
Transforming Private 5G NetworksTransforming Private 5G Networks
Transforming Private 5G Networks
inside-BigData.com
 
The Incorporation of Machine Learning into Scientific Simulations at Lawrence...
The Incorporation of Machine Learning into Scientific Simulations at Lawrence...The Incorporation of Machine Learning into Scientific Simulations at Lawrence...
The Incorporation of Machine Learning into Scientific Simulations at Lawrence...
inside-BigData.com
 
How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...
How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...
How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...
inside-BigData.com
 
Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...
Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...
Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...
inside-BigData.com
 
HPC Impact: EDA Telemetry Neural Networks
HPC Impact: EDA Telemetry Neural NetworksHPC Impact: EDA Telemetry Neural Networks
HPC Impact: EDA Telemetry Neural Networks
inside-BigData.com
 
Biohybrid Robotic Jellyfish for Future Applications in Ocean Monitoring
Biohybrid Robotic Jellyfish for Future Applications in Ocean MonitoringBiohybrid Robotic Jellyfish for Future Applications in Ocean Monitoring
Biohybrid Robotic Jellyfish for Future Applications in Ocean Monitoring
inside-BigData.com
 
Machine Learning for Weather Forecasts
Machine Learning for Weather ForecastsMachine Learning for Weather Forecasts
Machine Learning for Weather Forecasts
inside-BigData.com
 
HPC AI Advisory Council Update
HPC AI Advisory Council UpdateHPC AI Advisory Council Update
HPC AI Advisory Council Update
inside-BigData.com
 
Fugaku Supercomputer joins fight against COVID-19
Fugaku Supercomputer joins fight against COVID-19Fugaku Supercomputer joins fight against COVID-19
Fugaku Supercomputer joins fight against COVID-19
inside-BigData.com
 
Energy Efficient Computing using Dynamic Tuning
Energy Efficient Computing using Dynamic TuningEnergy Efficient Computing using Dynamic Tuning
Energy Efficient Computing using Dynamic Tuning
inside-BigData.com
 
HPC at Scale Enabled by DDN A3i and NVIDIA SuperPOD
HPC at Scale Enabled by DDN A3i and NVIDIA SuperPODHPC at Scale Enabled by DDN A3i and NVIDIA SuperPOD
HPC at Scale Enabled by DDN A3i and NVIDIA SuperPOD
inside-BigData.com
 
State of ARM-based HPC
State of ARM-based HPCState of ARM-based HPC
State of ARM-based HPC
inside-BigData.com
 
Versal Premium ACAP for Network and Cloud Acceleration
Versal Premium ACAP for Network and Cloud AccelerationVersal Premium ACAP for Network and Cloud Acceleration
Versal Premium ACAP for Network and Cloud Acceleration
inside-BigData.com
 
Zettar: Moving Massive Amounts of Data across Any Distance Efficiently
Zettar: Moving Massive Amounts of Data across Any Distance EfficientlyZettar: Moving Massive Amounts of Data across Any Distance Efficiently
Zettar: Moving Massive Amounts of Data across Any Distance Efficiently
inside-BigData.com
 
Scaling TCO in a Post Moore's Era
Scaling TCO in a Post Moore's EraScaling TCO in a Post Moore's Era
Scaling TCO in a Post Moore's Era
inside-BigData.com
 
CUDA-Python and RAPIDS for blazing fast scientific computing
CUDA-Python and RAPIDS for blazing fast scientific computingCUDA-Python and RAPIDS for blazing fast scientific computing
CUDA-Python and RAPIDS for blazing fast scientific computing
inside-BigData.com
 
Introducing HPC with a Raspberry Pi Cluster
Introducing HPC with a Raspberry Pi ClusterIntroducing HPC with a Raspberry Pi Cluster
Introducing HPC with a Raspberry Pi Cluster
inside-BigData.com
 
Overview of HPC Interconnects
Overview of HPC InterconnectsOverview of HPC Interconnects
Overview of HPC Interconnects
inside-BigData.com
 

More from inside-BigData.com (20)

Major Market Shifts in IT
Major Market Shifts in ITMajor Market Shifts in IT
Major Market Shifts in IT
 
Preparing to program Aurora at Exascale - Early experiences and future direct...
Preparing to program Aurora at Exascale - Early experiences and future direct...Preparing to program Aurora at Exascale - Early experiences and future direct...
Preparing to program Aurora at Exascale - Early experiences and future direct...
 
Transforming Private 5G Networks
Transforming Private 5G NetworksTransforming Private 5G Networks
Transforming Private 5G Networks
 
The Incorporation of Machine Learning into Scientific Simulations at Lawrence...
The Incorporation of Machine Learning into Scientific Simulations at Lawrence...The Incorporation of Machine Learning into Scientific Simulations at Lawrence...
The Incorporation of Machine Learning into Scientific Simulations at Lawrence...
 
How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...
How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...
How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...
 
Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...
Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...
Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...
 
HPC Impact: EDA Telemetry Neural Networks
HPC Impact: EDA Telemetry Neural NetworksHPC Impact: EDA Telemetry Neural Networks
HPC Impact: EDA Telemetry Neural Networks
 
Biohybrid Robotic Jellyfish for Future Applications in Ocean Monitoring
Biohybrid Robotic Jellyfish for Future Applications in Ocean MonitoringBiohybrid Robotic Jellyfish for Future Applications in Ocean Monitoring
Biohybrid Robotic Jellyfish for Future Applications in Ocean Monitoring
 
Machine Learning for Weather Forecasts
Machine Learning for Weather ForecastsMachine Learning for Weather Forecasts
Machine Learning for Weather Forecasts
 
HPC AI Advisory Council Update
HPC AI Advisory Council UpdateHPC AI Advisory Council Update
HPC AI Advisory Council Update
 
Fugaku Supercomputer joins fight against COVID-19
Fugaku Supercomputer joins fight against COVID-19Fugaku Supercomputer joins fight against COVID-19
Fugaku Supercomputer joins fight against COVID-19
 
Energy Efficient Computing using Dynamic Tuning
Energy Efficient Computing using Dynamic TuningEnergy Efficient Computing using Dynamic Tuning
Energy Efficient Computing using Dynamic Tuning
 
HPC at Scale Enabled by DDN A3i and NVIDIA SuperPOD
HPC at Scale Enabled by DDN A3i and NVIDIA SuperPODHPC at Scale Enabled by DDN A3i and NVIDIA SuperPOD
HPC at Scale Enabled by DDN A3i and NVIDIA SuperPOD
 
State of ARM-based HPC
State of ARM-based HPCState of ARM-based HPC
State of ARM-based HPC
 
Versal Premium ACAP for Network and Cloud Acceleration
Versal Premium ACAP for Network and Cloud AccelerationVersal Premium ACAP for Network and Cloud Acceleration
Versal Premium ACAP for Network and Cloud Acceleration
 
Zettar: Moving Massive Amounts of Data across Any Distance Efficiently
Zettar: Moving Massive Amounts of Data across Any Distance EfficientlyZettar: Moving Massive Amounts of Data across Any Distance Efficiently
Zettar: Moving Massive Amounts of Data across Any Distance Efficiently
 
Scaling TCO in a Post Moore's Era
Scaling TCO in a Post Moore's EraScaling TCO in a Post Moore's Era
Scaling TCO in a Post Moore's Era
 
CUDA-Python and RAPIDS for blazing fast scientific computing
CUDA-Python and RAPIDS for blazing fast scientific computingCUDA-Python and RAPIDS for blazing fast scientific computing
CUDA-Python and RAPIDS for blazing fast scientific computing
 
Introducing HPC with a Raspberry Pi Cluster
Introducing HPC with a Raspberry Pi ClusterIntroducing HPC with a Raspberry Pi Cluster
Introducing HPC with a Raspberry Pi Cluster
 
Overview of HPC Interconnects
Overview of HPC InterconnectsOverview of HPC Interconnects
Overview of HPC Interconnects
 

Recently uploaded

Skybuffer AI: Advanced Conversational and Generative AI Solution on SAP Busin...
Skybuffer AI: Advanced Conversational and Generative AI Solution on SAP Busin...Skybuffer AI: Advanced Conversational and Generative AI Solution on SAP Busin...
Skybuffer AI: Advanced Conversational and Generative AI Solution on SAP Busin...
Tatiana Kojar
 
Nordic Marketo Engage User Group_June 13_ 2024.pptx
Nordic Marketo Engage User Group_June 13_ 2024.pptxNordic Marketo Engage User Group_June 13_ 2024.pptx
Nordic Marketo Engage User Group_June 13_ 2024.pptx
MichaelKnudsen27
 
Columbus Data & Analytics Wednesdays - June 2024
Columbus Data & Analytics Wednesdays - June 2024Columbus Data & Analytics Wednesdays - June 2024
Columbus Data & Analytics Wednesdays - June 2024
Jason Packer
 
Presentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of GermanyPresentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of Germany
innovationoecd
 
Fueling AI with Great Data with Airbyte Webinar
Fueling AI with Great Data with Airbyte WebinarFueling AI with Great Data with Airbyte Webinar
Fueling AI with Great Data with Airbyte Webinar
Zilliz
 
HCL Notes and Domino License Cost Reduction in the World of DLAU
HCL Notes and Domino License Cost Reduction in the World of DLAUHCL Notes and Domino License Cost Reduction in the World of DLAU
HCL Notes and Domino License Cost Reduction in the World of DLAU
panagenda
 
Digital Marketing Trends in 2024 | Guide for Staying Ahead
Digital Marketing Trends in 2024 | Guide for Staying AheadDigital Marketing Trends in 2024 | Guide for Staying Ahead
Digital Marketing Trends in 2024 | Guide for Staying Ahead
Wask
 
Energy Efficient Video Encoding for Cloud and Edge Computing Instances
Energy Efficient Video Encoding for Cloud and Edge Computing InstancesEnergy Efficient Video Encoding for Cloud and Edge Computing Instances
Energy Efficient Video Encoding for Cloud and Edge Computing Instances
Alpen-Adria-Universität
 
Letter and Document Automation for Bonterra Impact Management (fka Social Sol...
Letter and Document Automation for Bonterra Impact Management (fka Social Sol...Letter and Document Automation for Bonterra Impact Management (fka Social Sol...
Letter and Document Automation for Bonterra Impact Management (fka Social Sol...
Jeffrey Haguewood
 
Trusted Execution Environment for Decentralized Process Mining
Trusted Execution Environment for Decentralized Process MiningTrusted Execution Environment for Decentralized Process Mining
Trusted Execution Environment for Decentralized Process Mining
LucaBarbaro3
 
dbms calicut university B. sc Cs 4th sem.pdf
dbms  calicut university B. sc Cs 4th sem.pdfdbms  calicut university B. sc Cs 4th sem.pdf
dbms calicut university B. sc Cs 4th sem.pdf
Shinana2
 
Finale of the Year: Apply for Next One!
Finale of the Year: Apply for Next One!Finale of the Year: Apply for Next One!
Finale of the Year: Apply for Next One!
GDSC PJATK
 
Choosing The Best AWS Service For Your Website + API.pptx
Choosing The Best AWS Service For Your Website + API.pptxChoosing The Best AWS Service For Your Website + API.pptx
Choosing The Best AWS Service For Your Website + API.pptx
Brandon Minnick, MBA
 
GenAI Pilot Implementation in the organizations
GenAI Pilot Implementation in the organizationsGenAI Pilot Implementation in the organizations
GenAI Pilot Implementation in the organizations
kumardaparthi1024
 
Main news related to the CCS TSI 2023 (2023/1695)
Main news related to the CCS TSI 2023 (2023/1695)Main news related to the CCS TSI 2023 (2023/1695)
Main news related to the CCS TSI 2023 (2023/1695)
Jakub Marek
 
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdfUnlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Malak Abu Hammad
 
Generating privacy-protected synthetic data using Secludy and Milvus
Generating privacy-protected synthetic data using Secludy and MilvusGenerating privacy-protected synthetic data using Secludy and Milvus
Generating privacy-protected synthetic data using Secludy and Milvus
Zilliz
 
GraphRAG for Life Science to increase LLM accuracy
GraphRAG for Life Science to increase LLM accuracyGraphRAG for Life Science to increase LLM accuracy
GraphRAG for Life Science to increase LLM accuracy
Tomaz Bratanic
 
Serial Arm Control in Real Time Presentation
Serial Arm Control in Real Time PresentationSerial Arm Control in Real Time Presentation
Serial Arm Control in Real Time Presentation
tolgahangng
 
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdf
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdfHow to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdf
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdf
Chart Kalyan
 

Recently uploaded (20)

Skybuffer AI: Advanced Conversational and Generative AI Solution on SAP Busin...
Skybuffer AI: Advanced Conversational and Generative AI Solution on SAP Busin...Skybuffer AI: Advanced Conversational and Generative AI Solution on SAP Busin...
Skybuffer AI: Advanced Conversational and Generative AI Solution on SAP Busin...
 
Nordic Marketo Engage User Group_June 13_ 2024.pptx
Nordic Marketo Engage User Group_June 13_ 2024.pptxNordic Marketo Engage User Group_June 13_ 2024.pptx
Nordic Marketo Engage User Group_June 13_ 2024.pptx
 
Columbus Data & Analytics Wednesdays - June 2024
Columbus Data & Analytics Wednesdays - June 2024Columbus Data & Analytics Wednesdays - June 2024
Columbus Data & Analytics Wednesdays - June 2024
 
Presentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of GermanyPresentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of Germany
 
Fueling AI with Great Data with Airbyte Webinar
Fueling AI with Great Data with Airbyte WebinarFueling AI with Great Data with Airbyte Webinar
Fueling AI with Great Data with Airbyte Webinar
 
HCL Notes and Domino License Cost Reduction in the World of DLAU
HCL Notes and Domino License Cost Reduction in the World of DLAUHCL Notes and Domino License Cost Reduction in the World of DLAU
HCL Notes and Domino License Cost Reduction in the World of DLAU
 
Digital Marketing Trends in 2024 | Guide for Staying Ahead
Digital Marketing Trends in 2024 | Guide for Staying AheadDigital Marketing Trends in 2024 | Guide for Staying Ahead
Digital Marketing Trends in 2024 | Guide for Staying Ahead
 
Energy Efficient Video Encoding for Cloud and Edge Computing Instances
Energy Efficient Video Encoding for Cloud and Edge Computing InstancesEnergy Efficient Video Encoding for Cloud and Edge Computing Instances
Energy Efficient Video Encoding for Cloud and Edge Computing Instances
 
Letter and Document Automation for Bonterra Impact Management (fka Social Sol...
Letter and Document Automation for Bonterra Impact Management (fka Social Sol...Letter and Document Automation for Bonterra Impact Management (fka Social Sol...
Letter and Document Automation for Bonterra Impact Management (fka Social Sol...
 
Trusted Execution Environment for Decentralized Process Mining
Trusted Execution Environment for Decentralized Process MiningTrusted Execution Environment for Decentralized Process Mining
Trusted Execution Environment for Decentralized Process Mining
 
dbms calicut university B. sc Cs 4th sem.pdf
dbms  calicut university B. sc Cs 4th sem.pdfdbms  calicut university B. sc Cs 4th sem.pdf
dbms calicut university B. sc Cs 4th sem.pdf
 
Finale of the Year: Apply for Next One!
Finale of the Year: Apply for Next One!Finale of the Year: Apply for Next One!
Finale of the Year: Apply for Next One!
 
Choosing The Best AWS Service For Your Website + API.pptx
Choosing The Best AWS Service For Your Website + API.pptxChoosing The Best AWS Service For Your Website + API.pptx
Choosing The Best AWS Service For Your Website + API.pptx
 
GenAI Pilot Implementation in the organizations
GenAI Pilot Implementation in the organizationsGenAI Pilot Implementation in the organizations
GenAI Pilot Implementation in the organizations
 
Main news related to the CCS TSI 2023 (2023/1695)
Main news related to the CCS TSI 2023 (2023/1695)Main news related to the CCS TSI 2023 (2023/1695)
Main news related to the CCS TSI 2023 (2023/1695)
 
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdfUnlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
 
Generating privacy-protected synthetic data using Secludy and Milvus
Generating privacy-protected synthetic data using Secludy and MilvusGenerating privacy-protected synthetic data using Secludy and Milvus
Generating privacy-protected synthetic data using Secludy and Milvus
 
GraphRAG for Life Science to increase LLM accuracy
GraphRAG for Life Science to increase LLM accuracyGraphRAG for Life Science to increase LLM accuracy
GraphRAG for Life Science to increase LLM accuracy
 
Serial Arm Control in Real Time Presentation
Serial Arm Control in Real Time PresentationSerial Arm Control in Real Time Presentation
Serial Arm Control in Real Time Presentation
 
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdf
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdfHow to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdf
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdf
 

Applying Cloud Techniques to Address Complexity in HPC System Integrations

  • 2. Arno Kolster, Principal & Co-Founder (Un)Obscured By Clouds Applying Cloud Techniques To Address Complexity In High Performance System Integrations
  • 3. What Prompted This Talk? 3 Integrations between HPC and Enterprise have been coming for quite some time. HPC in the cloud is now a reality - easier and cheaper to use than ever before. But cloud still a huge learning curve and paradigm shift for traditional HPC. FOMO on latest technical trends, technology itself and desire to increase ROI. Majority of our work has been leveraging our hyperscale/cloud experience in HPC environments - time to talk about ’eating our own dog food’. As it is for enterprise embracing HPC tech – i.e. GPUs for ML/DL/AI.
  • 4. Use Case #1 Leadership Class Supercomputer Operations Integration
  • 5. “As we build faster and faster supercomputers, their energy efficiency becomes more and more important,” said Jim Rogers, computing and facilities director for ORNL’s National Center for Computational Sciences. “We wanted to couple Summit’s mechanical cooling system with its computational workload to optimize efficiency, which can translate to significant cost savings for a system of this size.”
  • 6. Drivers For Summit Cooling Intelligence 6 Summit is the largest water cooled supercomputer in the world. Minor fluctuations in temperature across 4,608 nodes can have a dramatic positive or negative impact on Summit’s operational efficiency. No mechanism exists to monitor the temperature in real time so that systems operators can make adjustments as appropriate. Can take upwards of 3 mins for change in water temp to reach furthest node in the grid. Systems-level intelligence can inform decisions and impact costs, efficiency and operational excellence.
  • 7. SCI Design Goals 7 Architect, develop and implement a scalable platform to: Ingest line-rate telemetry from Summit & supporting systems in real time. Add data sets easily to enhance analytics and insight. Employ cloud-style ‘reliable messaging’ as a common interface to provide consistency and coherency for ‘in-flight’ data. Scale data at orders-of-magnitude without proportional staff increases. Obey the axiom: No single failure results in a single point-of-failure.
  • 8. Currently 4 Source-Data Metrics 8 TOTAL ~460,000 metrics per second available for real-time analytics 1. IBM OpenBMC framework (99 metrics * 4608 nodes/second = 456,192/sec) 2. IBM LSF jobs data for running applications (~10sec) 3. NOAA weather/wet bulb for Oak Ridge area (continuous) 4. Programmable Logic Circuit (PLC) water flow (continuous) Instrumentation and monitoring data for developed applications (ad hoc) Summarized and collated metrics (continuous) Instrumented Ascent (dev cluster) at the same time using same platform
  • 9. SCI Architecture Created 9 Event Message Bus Prometheus Operational Etcd/Consensus Transient Config ElasticSearch, Log Unstructured Data Structured Data Relational DBs Streaming Analytics Neural Networks TensorFlow, Caffe, etc openBMC NOAA WeatherLSF job data PLC Other Data Machine Learning Data Sources Persistence Learning PLC Recommendations User Dashboarding Monitoring Reporting Research/Search
  • 10. Technologies Used & Integrated 10 High-availability microservices designed for scalable, small footprint Kafka message bus for ubiquitous message delivery Docker containers for ease of deployment and elastic scalability Openshift/Kubernetes for container orchestration and instrumentation Spark Streaming for on-the-wire data analytics Prometheus for service discovery metrics collection and time-series db Jupyter Notebooks for models and data science exploration (Python/Spark) Grafana, Seaborn for visualization
  • 11. Breakthroughs 11 Created a new paradigm for analyzing data beyond data-at-rest or data at the sensor. Data-in-flight is agnostic and can be captured prior to being translated into specific software formats, some of which can be proprietary. Overlapping data sets from disassociated sources can now be displayed on an identical scale (time), which is something new.
  • 12. What Can Be Done Now 12 • Core temps for CPU, GPU, memory DIMMs, HBMs • Temps by CPU core, GPU core • Power to the node, fans and to other individual components Capture 99 node metrics per second for streaming analytics; for example, Visualize per-second node temperature+power and temperature+jobs (currently). Optimize operations by combining metrics from Summit, control systems & weather. Add more real-time data streams from varying sources to augment analytics. Analyze behavior from both a systems level and node level.
  • 13. Impact 13 • Efficiencies in Summit ecosystem, job throughput and individual jobs • Micro-managed cooling leads to running at higher temps with less risk • Greater efficiency leads to Summit optimization, health and lower costs Real-time analytics helps operators detect anomalies quickly and in a manner that is more responsive than querying a data warehouse. This approach is especially valuable with such a large investment in a critical asset. This platform captures all the data needed to conduct predictive analytics, thus ensuring new levels of insight, alerting and monitoring. Real-time data can ultimately help micro-manage such a large system in ways that are far more efficient and cost effective than human-only intervention.
  • 17. This Was Not Without Its Challenges… 17 Long ramp up time due to contract, legal & solidifying SOW requirements. Resource constraints on both sides due to scheduling and unforeseen delays. Had to work within the confines of the Summit security model – no direct access - containers saved the day. Limited computer resources due to h/w budget. Built everything shown on 3 nodes.
  • 18. But The Effort Was Worth It. 18 Overlay additional disparate data sources in a production environment that is both scalable and elastic. Provide predictive analytics for water temperature control based on scheduled jobs in the queue that can provide input to the PLCs. Develop predictive algorithms that intelligently affect power/temp across multiple dimensions: CPU, GPU, memory, power. Determine how this platoform can be utilized in Exascale systems. Provide interoperability between expert systems and components to create self-healing, scaling and operational control. Research how specific codes/algorithms affect job performance.
  • 19.
  • 20. Use Case #2 Aerospace Defense Contractor
  • 21. Problem Statement 21 Client wanted understanding on cloud technologies and how to move certain workloads into the cloud They had minimal cloud experience Wanted on-site, hands-on training Comparative analysis between major cloud providers Alleviate concerns around security, data transfer, multi-tenancy, account management
  • 22. Solution Provided 22 Designed a custom two day ‘Cloud 101’ course tailored to the project the client desired to migrate into the cloud Engagement model included weekly consultation for ongoing feedback, technical direction and open dialog (“Are we doing the right thing?”) Showcased best practices that can be used on all cloud platforms Presented how to think about cloud services as simplified tasks vs single monolithic HPC jobs
  • 24. Client Solution Is… 24 Overly complex, both from a workflow and operational perspective. Cannot meet availability requirements as structured. Extremely difficult to scale, deploy and integrate new components. Using mix of legacy, client/server Linux, HPC and standalone Windows. On-prem / hybrid / cloud delineation is blurred.
  • 25. 25 PW CLOUD BASED ARCHITECTURE
  • 26. 26
  • 27. Cloud Based Solutions Applied 27 Now have availability zones across multiple regions for potential worldwide site deployments Simple containerized deployment can be pushed to any cloud provider Message bus allows faster and easier future integrations from disparate sites just by pushing container Applied a cloud-centric model leveraging modern architectures including containerized microservices On-prem / cloud integration delineated, but seamless
  • 29. HPC Integrations CAN Be Simplified 29 Leverage cloud based architectures that maximize efficiencies of scale. Middleware is important - Use an ubiquitous transport system to deliver data, payloads, logs and control messages. Two use cases, two completely different reasons for bringing messages to bear. Break large tasks into smaller ones to decrease complexity and introduce parallelism (where it makes sense). Microservices can then be scaled independently as the needs arise.