Building on-premise HPC is a challenging but sometimes necessary task for companies. Having specialised hardware for sophisticated tasks such as simulation, deep learning or optimisation can be important for financial, strategic or pure availability reasons. This slide deck gives a general overview of how this can be approached.
4. Hardware for HPC
HPC starts when you feel that you need a bigger laptop!
http://www.advancedclustering.com/hpc-compute-blocks-built-to-order-for-intensive-workloads/
5. Good to know: Scaling up with Hardware
Understand what speed is possible.
http://www.moorinsightsstrategy.com/wp-content/uploads/2015/04/unnamed.png
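To get a feel for what speed is possible on a given box, it helps to measure rather than guess. The sketch below (my addition, not from the deck) estimates sustained main-memory bandwidth with a large NumPy array copy; the default buffer size is an arbitrary choice.

```python
import time
import numpy as np

def measure_copy_bandwidth(n_bytes: int = 128 * 1024 * 1024,
                           repeats: int = 5) -> float:
    """Estimate sustained memory bandwidth (GB/s) via a large array copy."""
    src = np.ones(n_bytes // 8, dtype=np.float64)
    dst = np.empty_like(src)
    best = float("inf")
    for _ in range(repeats):
        t0 = time.perf_counter()
        np.copyto(dst, src)  # streams ~2 * n_bytes through memory (read + write)
        best = min(best, time.perf_counter() - t0)
    return 2 * n_bytes / best / 1e9

print(f"~{measure_copy_bandwidth():.1f} GB/s sustained copy bandwidth")
```

Comparing such a number against the hardware's theoretical peak (as in the linked chart) shows how much headroom is left.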
6. Scaling up with Hardware
You want to optimize how your data flows!
https://www.microway.com/product/octoputer-4u-tesla-8-gpu-server-nvlink/blockdiagram-sys-4028gr-tvxrt-teslav100/
7. Python tools to help
Recommended talk: https://youtu.be/HKjM3peINtwe
tbb4py: a Python C-extension, activated by monkey-patching Python pools, that enables Intel's TBB (Threading Building Blocks) library underneath MKL (Math Kernel Library). I saw a ~20% speedup on some tests.
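The situation tbb4py targets is nested parallelism: an outer Python pool whose workers each call multi-threaded NumPy/MKL routines, which can oversubscribe the cores. A minimal sketch of that pattern (my example, assuming the `tbb4py` package is installed so the script can also be launched as `python -m tbb script.py` to let TBB arbitrate both thread levels):

```python
from multiprocessing.pool import ThreadPool
import numpy as np

def heavy_task(seed: int) -> float:
    # Inner level: the BLAS backend (MKL/OpenBLAS) threads internally.
    rng = np.random.default_rng(seed)
    a = rng.random((500, 500))
    return float(np.linalg.norm(a @ a.T))

# Outer level: a Python thread pool fans out independent tasks.
with ThreadPool(4) as pool:
    results = pool.map(heavy_task, range(8))
print(len(results))
```

Run normally, the two levels compete for cores; run under `python -m tbb`, TBB coordinates them, which is where the speedup mentioned above came from.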
8. Goal: Fast R&D turnover
Time is costly
Hardware is costly
Speed matters
Scaling is important
Data Science has a 1:n compute requirement
Cloud: large differences in offers. Costs per compute unit ...
9. … the story of a model that could not be trained in sufficient time for production ...
10. Product Strategy → ML Strategy → Data Strategy + Compute Strategy → HPC Case
16. Define your compute strategy
Why HPC? Why Cloud? When Hybrid?
Answer this as early as possible → large buy-in risks.
Cloud: Good if you have no owned infrastructure or manpower and want to get ready fast. Good for scale and resilience.
Hybrid: Best option if you have the manpower and the use case. Gives you the option to pick the best from both worlds.
On Premise: Good if you can manage the hardware. Good if you want to be highly optimised and know your case.
17. What is your compute strategy?
Why HPC? Why Cloud? When Hybrid?
Be careful with case studies!
What is your priority? Fast results, scalability, resilience, cost efficiency.
Data locality?
Utilization will drive your cost structure.
R&D turnover
Talent available?
Business case? E.g. IoT, embedded, special hardware, consumer electronics.
19. A possible HPC setup for research
Slurm network, NFS (nearline), master node (weak), compute nodes.
Users log in on the master node and send jobs to the compute nodes.
SLURM compute node: N CPUs (specialized), X RAM (main memory).
Generic resources:
- Enhanced network
- Fast internal flash storage
- GPU / Phi coprocessors
Same user ID and permissions on all system components!
Mounted network file system.
Focus:
- Connect: central model repo
- Connect: central data repo
- Use templates that can be ported to your cloud platform
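A job in this setup would be submitted to Slurm from the login/master node. The batch script below is a hypothetical illustration; job name, resource sizes and the NFS paths are placeholders, not taken from the deck.

```shell
#!/bin/bash
# Hypothetical Slurm batch script for the setup above.
#SBATCH --job-name=train-model
#SBATCH --nodes=1
#SBATCH --cpus-per-task=8      # N CPUs on the compute node
#SBATCH --mem=64G              # X RAM (main memory)
#SBATCH --gres=gpu:1           # generic resource: GPU
#SBATCH --output=%x-%j.out

# Data and models live on the mounted NFS, visible under the
# same user ID on every node.
srun python train.py --data /nfs/data --model-out /nfs/models
```

Keeping such templates generic makes it easier to port the same workflow to a cloud batch platform later.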
20. Best Practices
Educate your researchers on how to best use the system.
Develop standards and best practices for the ML dev cycle (e.g. model versioning and testing).
Develop standards for transitions, e.g. between on-premise and cloud.
Check your product use case:
- e.g. requirements for training
- many products vs. a single one (see the deepl.com example)
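Model versioning, mentioned as a best practice above, can start very small. A minimal sketch of what such a standard might look like; the file layout and metadata fields are illustrative assumptions, not a scheme from the deck:

```python
# Save each trained artifact together with a metadata record
# (version, content hash, metrics) in a central model repo.
import hashlib
import json
import pathlib

def save_versioned(model_bytes: bytes, version: str, metrics: dict,
                   repo: pathlib.Path) -> pathlib.Path:
    out = repo / version
    out.mkdir(parents=True, exist_ok=True)
    (out / "model.bin").write_bytes(model_bytes)
    meta = {
        "version": version,
        "sha256": hashlib.sha256(model_bytes).hexdigest(),
        "metrics": metrics,
    }
    (out / "meta.json").write_text(json.dumps(meta, indent=2))
    return out

repo = pathlib.Path("model_repo")
path = save_versioned(b"fake-weights", "v0.1.0", {"val_acc": 0.91}, repo)
print(path / "meta.json")
```

The content hash lets you verify later that the artifact deployed on-premise and the one uploaded to the cloud are identical.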