Serverless for HPC
Luciano Mammino
fourTheorem
@loige
👋 Hello, I am Luciano
Senior architect
nodejsdesignpatterns.com
Let’s connect:
🌎 loige.co
🐦 @loige
🎥 loige
🧳 lucianomammino
Middy Framework
SLIC Starter - Serverless Accelerator
SLIC Watch - Observability Plugin
Business focused technologists.
Accelerated Serverless | AI as a Service | Platform Modernisation
We host a podcast about AWS and Cloud computing
🔗 awsbites.com
🎬 YouTube Channel
🎙 Podcast
📅 Episodes every week
@loige
#CLOUDDAY2022
Get the slides: fth.link/cd22
Agenda
● The 6 Rs of Cloud Migration
● A serverless case study
○ The problem space and types of workflows
○ Original on-premises implementation
○ The PoC
○ The final production version
○ The components of a serverless job scheduler
○ Challenges & Limits
The 6 Rs of Cloud Migrations
🗑 🕸 🚚
Retire Retain Rehost
🏗 📐 💰
Replatform Refactor Repurchase
A case study
Case study on AWS blog:
fth.link/awshpc
The workloads - Risk Rollup
🏦 Financial modeling to understand the portfolio of risk
🧠 Internal, custom-built risk model on all reinsurance deals
⚙ HPC (High-Performance Computing) workload
🗄 ~45TB data processed
⏱ 2-3 rollups per day (6-8 hours each!)
The workloads - Deal Analytics
⚡ Near real-time deal pricing using the same risk model
🗃 Lower data volumes
🔁 High frequency of execution – up to 1,000 per day
Original on-prem implementation
Challenges
🐢 Long execution times, constraining business agility
🥊 Competing workloads
📈 Limits our ability to support portfolio growth
😩 Can’t deliver new features
🧾 Very high total cost of ownership
Thinking Big
💭 Imagine a solution that would …
1. Offer a dramatic increase in performance
2. Provide consistent run times
3. Support more executions, more often
4. Support future portfolio growth and new
capabilities – 15x data volumes
The Goal ⚽
Run a Risk Rollup in 1 hour!
Architecture Options for Compute/Orchestration
AWS Lambda
Amazon SQS AWS Step Functions
AWS Fargate
Com t om :
Red he b to si l ,
s a l , ev -d i n
co n s
PoC Architecture
AWS Batch
S3
Step Functions
Lambda
SQS
Measure Everything! 📏
⏱ Built metrics in from the start
AWS metrics we wish existed out of the box:
- Number of running containers
- Success/failure counts
🎨 Custom metrics:
- Scheduler overhead
- Detailed timings (job duration, I/O time, algorithm steps)
🛠 Using CloudWatch, EMF
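The custom metrics above can be emitted as CloudWatch Embedded Metric Format (EMF) records: structured JSON log lines that CloudWatch turns into metrics. The following is a minimal sketch, not the implementation from the case study. The namespace, dimension, and metric names are hypothetical, chosen only to mirror the bullets above.

```python
import json
import time


def emf_record(namespace, dimensions, metrics, timestamp_ms=None):
    """Build one EMF log record.

    `dimensions` maps dimension names to values.
    `metrics` maps a metric name to a (value, unit) pair.
    """
    if timestamp_ms is None:
        timestamp_ms = int(time.time() * 1000)
    record = {
        "_aws": {
            "Timestamp": timestamp_ms,
            "CloudWatchMetrics": [
                {
                    "Namespace": namespace,
                    # One dimension set containing every dimension name.
                    "Dimensions": [sorted(dimensions)],
                    "Metrics": [
                        {"Name": name, "Unit": unit}
                        for name, (_, unit) in metrics.items()
                    ],
                }
            ],
        }
    }
    # Dimension values and metric values live at the top level of the record.
    record.update(dimensions)
    record.update({name: value for name, (value, _) in metrics.items()})
    return record


# Hypothetical names, for illustration only.
record = emf_record(
    namespace="RiskRollup/Demo",
    dimensions={"JobType": "modelling"},
    metrics={
        "JobDuration": (12.5, "Seconds"),
        "IoTime": (3.2, "Seconds"),
        "SchedulerOverhead": (150, "Milliseconds"),
    },
    timestamp_ms=1_650_000_000_000,
)

# In Lambda, printing this line to stdout is enough for CloudWatch Logs
# to pick it up and extract the metrics.
print(json.dumps(record))
```

Because the metrics travel inside ordinary log lines, the worker code does not need to make a separate metrics API call for each data point.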
Measure Everything! 📏
👍 Rollup in 1 hour
☁ Running on AWS Batch
👎 Cluster utilisation was <50%
✅ Goal success
🤔 Understanding of what needs to
be addressed next!
Beyond the PoC
Production: optimise for unique workload characteristics
Job Plan
In reality, not all jobs are alike!
Horizontal scaling 🚀
1000’s of jobs
Duration: 1 second – 45 minutes
Scaling horizontally = splitting jobs
Jobs split according to their
complexity/duration
Resulting in >1 million jobs
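One way to picture "splitting jobs according to their duration" is the sketch below. It assumes each job carries an estimated duration and that the goal is to keep every sub-job under a target length; the `Job` type, the 60-second target, and the even split are illustrative assumptions, not details from the case study.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    job_id: str
    estimated_seconds: float


def split_job(job, target_seconds):
    """Split a job into equal sub-jobs, none longer than target_seconds."""
    parts = max(1, math.ceil(job.estimated_seconds / target_seconds))
    per_part = job.estimated_seconds / parts
    return [Job(f"{job.job_id}/{i}", per_part) for i in range(parts)]


# Three hypothetical jobs spanning the 1 second to 45 minutes range.
jobs = [Job("a", 1), Job("b", 45 * 60), Job("c", 90)]
sub_jobs = [sub for job in jobs for sub in split_job(job, target_seconds=60)]

print(len(sub_jobs))  # 1 + 45 + 2 = 48 sub-jobs
```

Short jobs pass through untouched, while long ones fan out into many similar-sized pieces, which is what lets thousands of workers stay busy for the whole run.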
Moving to production 🚢
Scope
Actual End to End overview
Modelling Worker
Compute Services
AWS Fargate:
Scales to 1000’s of tasks (containers)
Little management overhead
Up to 4 vCPUs and 30GB Memory
Up to 200GB ephemeral storage
AWS Lambda:
Scales to 1000’s of function containers (in seconds!)
Very little management overhead
Up to 6 vCPUs and 10GB Memory
Up to 10GB ephemeral storage
It wasn’t always this way!
Store all the things in S3!
The source of truth for:
● Input Data (JSON, Parquet)
● Intermediate Data (Parquet)
● Results (Parquet)
● Aggregates (Parquet)
Input data: 20GB
Output data: ~1 TB
Reads and writes: 10,000s of objects per second.
Scheduling and Orchestration
✅ We have our cluster (Fargate or Lambda)
✅ We have a plan! (list of jobs, parameters and
dependencies)
🤔 How do we feed this plan to the cluster?!
🤨 Existing schedulers use traditional clusters – there
is no serverless job scheduler for workloads like this!
Lifecycle of a Job
A new job
gets queued
here 👇
A worker
picks up the
job and
executes it
The worker
emits the
job state
(success or
failure)
Event-Driven Scheduler
Job states are pulled
from a Kinesis Data
Stream
Redis stores:
- Job states
- Dependencies
This scheduler checks
new job states against
the state in Redis and
figures out if there are
new jobs that can be
scheduled next
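The scheduling logic described above can be sketched as follows. This is a simplified, in-memory illustration: plain Python dictionaries stand in for Redis, and direct method calls stand in for records read from the Kinesis Data Stream. The class name, state names, and job names are all hypothetical.

```python
class InMemoryScheduler:
    """Tracks job states and dependencies, and releases jobs that are ready."""

    def __init__(self, dependencies):
        # job_id -> set of job_ids that must succeed before it can run
        self.dependencies = {job: set(deps) for job, deps in dependencies.items()}
        self.states = {job: "PENDING" for job in dependencies}

    def _schedule_ready(self):
        """Mark every pending job whose dependencies all succeeded as QUEUED."""
        ready = sorted(
            job
            for job, state in self.states.items()
            if state == "PENDING"
            and all(self.states[dep] == "SUCCEEDED" for dep in self.dependencies[job])
        )
        for job in ready:
            self.states[job] = "QUEUED"  # a real scheduler would enqueue it here
        return ready

    def start(self):
        """Release the jobs that have no dependencies."""
        return self._schedule_ready()

    def on_job_state(self, job_id, state):
        """Handle one job state event and return newly schedulable jobs."""
        self.states[job_id] = state
        if state != "SUCCEEDED":
            return []  # failure handling (retries, alerts) is out of scope here
        return self._schedule_ready()


scheduler = InMemoryScheduler(
    {
        "load": set(),
        "model-1": {"load"},
        "model-2": {"load"},
        "aggregate": {"model-1", "model-2"},
    }
)

print(scheduler.start())                                # ['load']
print(scheduler.on_job_state("load", "SUCCEEDED"))      # ['model-1', 'model-2']
print(scheduler.on_job_state("model-1", "SUCCEEDED"))   # []
print(scheduler.on_job_state("model-2", "SUCCEEDED"))   # ['aggregate']
```

Marking a job as queued at the moment it is released keeps a repeated or out-of-order event from scheduling the same job twice.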
Dynamic Runtime
Handling
We also need to handle
system failures!
Outcomes 🙌
Business
● Rollup in 1 hour
● Removed limits on number of runs
● Faster, more consistent deal analytics
● Business spending more time on
revenue-generating activities
● Support portfolio growth and deliver new
capabilities
Technology
● Brought serverless to HPC financial
modeling
● Reduced codebase by ~70%
● Lowered total cost of ownership
● Increased dev team agility
● Reduced carbon footprint
Hitting the limits 😰
S3 Throughput
S3 Partitioning
S3 cleverly detects high-throughput prefixes and creates partitions
…normally
If this does not happen…
🚨 Please reduce your request rate;
Status Code: 503; Error Code: SlowDown
The Solution
Explicit Partitioning:
○ Figure out how many partitions you need
○ Update code to create keys uniformly distributed over all partitions
/part/0…
/part/1…
/part/2…
/part/3…
…
/part/f…
1. Talk (a lot) to AWS SAs, Support, Account
Manager for special requirements like this!
2. Think ahead if you have multiple accounts
for different environments!
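Creating keys that are uniformly distributed over all partitions can be sketched by hashing the object name and using the hash to pick a prefix. The example below assumes 16 partitions named with one hexadecimal digit, as in the /part/0… to /part/f… list above. The key layout after the prefix and the object names are hypothetical.

```python
import hashlib
from collections import Counter


def partitioned_key(object_name, partitions=16):
    """Return an S3 key whose prefix is chosen by hashing the object name."""
    digest = hashlib.sha256(object_name.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % partitions
    # The same object name always maps to the same partition prefix.
    return f"part/{index:x}/{object_name}"


# Hypothetical object names, used only to show the spread across prefixes.
keys = [partitioned_key(f"results/job-{i}.parquet") for i in range(10_000)]
spread = Counter(key.split("/")[1] for key in keys)

print(keys[0])
print(sorted(spread.items()))
```

Because the prefix is derived from the name itself, readers can recompute the full key without a lookup table, and sequentially numbered jobs no longer pile onto a single prefix.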
Fargate Scaling
● We want to run 3000 containers ASAP
● This took > 1 hour!
● We built a custom Fargate scaler
○ Using the RunTask API (no ECS Service)
○ Hidden quota increases
○ Step Function + Lambda
● 3000 containers in ~20 minutes
The AWS ECS team has since made lots of
improvements, making it possible to scale to
3,000 containers in under 5 minutes
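To illustrate the RunTask approach: the ECS RunTask API starts at most 10 tasks per call, so reaching thousands of containers means issuing many calls. The sketch below shows only that batching loop. It is not the scaler from the case study, which used Step Functions and Lambda. The `run_task` argument is a hypothetical callable that would wrap the real API call, and a real scaler would also need to handle throttling and failed launches.

```python
def launch_tasks(run_task, total, batch_size=10):
    """Request `total` tasks by calling run_task(count=n) in batches.

    batch_size defaults to 10, the maximum that RunTask accepts per call.
    """
    launched = 0
    while launched < total:
        count = min(batch_size, total - launched)
        run_task(count=count)
        launched += count
    return launched


# A stand-in that records each call instead of contacting AWS.
# With boto3, run_task would wrap something like:
#   ecs.run_task(cluster=..., taskDefinition=..., launchType="FARGATE",
#                count=count, networkConfiguration=...)
calls = []


def fake_run_task(count):
    calls.append(count)


requested = launch_tasks(fake_run_task, total=3000)
print(requested, len(calls))  # 3000 tasks requested in 300 calls
```

The number of calls needed is why API rate limits and quotas, rather than the containers themselves, tend to decide how quickly a large fleet comes up.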
How high can we go today?
🚀 10,000 concurrent Lambda functions in seconds
🎢 10,000 Fargate containers in 10 minutes
💸 No additional cost
vladionescu.me/posts/scaling-containers-on-aws-in-2022
Wrapping up 🎁
● "Serverless supercomputer" lets you do HPC with
commodity AWS compute
● Plenty of challenges, but it's doable!
● Agility and innovation benefits are massive
● Customer is now serverless-first and expert in AWS
Other interesting case studies:
☁ AWS HTC Grid - 🧬 COVID genome research
Special thanks to @eoins and @cmthorne10
fth.link/cd22
