2. Who am I?
• Experienced principal solutions architect, lead developer
and head of practice for Inawisdom.
• All 12 AWS certifications, including SA Pro, DevOps Pro,
the Data Analytics Specialty, and the Machine Learning Specialty.
• Over 6 years of AWS experience; I have been
responsible for running production workloads of over 200
containers in a high-performance system that responded to
18,000 requests per second.
• Visionary in MLOps; I have produced production workloads of
ML models at scale, including 1,500 inferences per minute,
with active monitoring and alerting.
• I have developed in Python, NodeJS and J2EE.
• One of the Ipswich AWS User Group leaders; I contribute
to the AWS community by speaking at several
summits, community days and meet-ups.
• Regular blogger, open-source contributor, and SME on
Machine Learning, MLOps, DevOps, Containers and
Serverless.
• I work for Inawisdom (an AWS Partner) as a principal
solutions architect and head of practice. I am Inawisdom’s
AWS APN Ambassador and evangelist.
Phil Basford
phil@inawisdom.com
@philipbasford
#1 EMEA
4.
ML LIFE CYCLE
Define the Problem and Value
Data Exploration: SageMaker Ground Truth, AWS Data Exchange, AWS ‘Lake House’, Open Data Sets
Experiment: SageMaker Notebooks, SageMaker Autopilot, ML Marketplace
Testing and Evaluation: SageMaker Debugger, SageMaker Experiments
Refinement: SageMaker Hyperparameter Tuning, SageMaker Notebooks
Inference: SageMaker Endpoints, SageMaker Batch Transform
Operationalize: SageMaker Model Monitor, AWS Step Functions Data Science SDK, SageMaker Pipelines
6.
Operational Excellence: monitoring, observability and alerting using CloudWatch and X-Ray; Infrastructure as Code with SAM and CloudFormation.
Security: least privilege, data encryption at rest, and data encryption in transit using IAM policies, resource policies, KMS, Secrets Manager, VPCs and security groups.
Performance: elastic scaling based on demand, meeting response times using Auto Scaling, Serverless, and per-request managed services.
Cost Optimisation: Serverless and fully managed services to lower TCO. Tag every resource possible for cost analysis. Right-size instance types for model hosting.
Reliability: fault tolerance and auto-healing to meet a target availability using Auto Scaling, Multi-AZ, Multi-Region, read replicas and snapshots.
https://d1.awsstatic.com/whitepapers/architecture/wellarchitected-Machine-Learning-Lens.pdf
7.
SERVERLESS
AWS Lambda: AWS’s native, fully managed service for running application code without the need to run servers.
API Gateway: the endpoint for your API; it provides extensive security measures, logging, and API definition using OpenAPI (Swagger).
DynamoDB: a fully managed NoSQL service from AWS. For machine learning it is typically used for reference data.
S3: highly durable object storage used for many things, including data lakes. For machine learning it is used to store training data sets and model artefacts.
SNS: pub/sub
SQS: queues
Fargate: containers
Step Functions: workflows
…and more
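The serverless pieces above commonly combine as API Gateway → Lambda → SageMaker endpoint. A minimal sketch of such a Lambda handler is below; the endpoint name is a hypothetical example, and CSV is assumed as the content type (the default for many built-in SageMaker algorithms).

```python
import json


def build_payload(features):
    """Serialise a feature vector as CSV, the format expected by many
    built-in SageMaker algorithms such as XGBoost."""
    return ",".join(str(f) for f in features)


def handler(event, context):
    """Lambda handler forwarding an API Gateway request to a SageMaker
    endpoint. 'credit-risk-prod' is a hypothetical endpoint name."""
    import boto3  # available by default in the Lambda Python runtime
    runtime = boto3.client("sagemaker-runtime")
    body = json.loads(event["body"])
    response = runtime.invoke_endpoint(
        EndpointName="credit-risk-prod",
        ContentType="text/csv",
        Body=build_payload(body["features"]),
    )
    prediction = response["Body"].read().decode("utf-8")
    return {"statusCode": 200, "body": json.dumps({"prediction": prediction})}
```

The handler stays thin on purpose: serialisation is separated out so it can be unit tested without AWS access.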
9.
Remember to always apply least privilege and other AWS security best practices; be very protective of your data.
SECURITY
AWS KMS: encrypt everything! If your data is PII or covered by PCI-DSS, consider using a dedicated customer managed key in KMS to do this. This allows you tighter control by limiting the ability to decrypt data, providing another layer of security over S3.
AWS IAM: SageMaker, like EC2, is granted access to other AWS services using IAM roles, and you need to make sure your policies are locked down to only the Actions and Resources you need.
Amazon S3: SageMaker can use a range of data stores, but S3 is the most popular. Please make sure you enable encryption, resource policies, logging and versioning on your buckets.
Amazon VPC: SageMaker can run outside a VPC and access data over the public internet (hopefully using HTTPS). This runs contrary to most corporate information security policies, so please deploy in a VPC with private links for extra security.
Data: most importantly, only use the data you need. If the data contains PII or PCI-DSS values that you do not need, remove or sanitise them.
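The KMS and VPC recommendations above translate into a handful of arguments on a SageMaker training job. The sketch below builds those security-related arguments for boto3's `create_training_job`; the bucket, key and network identifiers are hypothetical placeholders.

```python
def secure_training_job_config(role_arn, kms_key_id, subnets, security_group_ids):
    """Build the security-related arguments for a SageMaker
    create_training_job call: KMS encryption for the attached volume and
    the output artefacts, plus VPC isolation. All IDs are hypothetical."""
    return {
        "RoleArn": role_arn,  # lock this role down to only needed Actions/Resources
        "ResourceConfig": {
            "InstanceType": "ml.m5.xlarge",
            "InstanceCount": 1,
            "VolumeSizeInGB": 50,
            "VolumeKmsKeyId": kms_key_id,  # encrypt the training volume
        },
        "OutputDataConfig": {
            "S3OutputPath": "s3://my-ml-bucket/output/",  # hypothetical bucket
            "KmsKeyId": kms_key_id,  # encrypt model artefacts at rest
        },
        "VpcConfig": {  # keep training traffic off the public internet
            "Subnets": subnets,
            "SecurityGroupIds": security_group_ids,
        },
        "EnableNetworkIsolation": True,
    }
```

These keys would be merged with the algorithm and data-channel settings before calling `boto3.client("sagemaker").create_training_job(**config)`.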
11.
Dev Ops in Machine Learning
ML OPS
Data Updates / Drift Detection
Trigger: new data available.
ETL (structured, semi-structured and unstructured data): Spark, EMR, Glue, Matillion.
Data Pre-Processing (including validation of the data): Spark, scikit-learn, containers, SageMaker Processing.
Training (ML algorithms and frameworks): SageMaker training jobs.
Verification (accuracy checks, Golden Data Set testing, model debugging): SageMaker Debugger.
Inference (batch or real-time): SageMaker Endpoints, SageMaker Batch Transform, ECS, Docker and functions.
Monitoring (baselining / sampling predictions, model drift detection, model selection automation): SageMaker Model Monitor, CloudWatch.
12.
Dev Ops in Machine Learning
ML OPS
New Data Features / DS Changes (script mode)
Components: ETL, Data Pre-processing, Training, Verification, Inference, Monitoring.
Roles involved: Data Scientist, ML Engineer, DevOps, Source Control.
Trigger: verified data is available; the data set used to train previously is reused.
CI/CD is used to build the model code from source control.
Technology: SageMaker Experiments and hyperparameter tuning jobs.
Recommended additions and potential changes are fed back into the pipeline.
14.
Optimising training to meet the business needs
TRAINING
Trade-offs: cost, effort, speed/time, complexity.
Distributed Training
Split large amounts of data into chunks, train the chunks across many instances, then combine the outputs at the end.
Multi Job Training
Used when a generalised model does not represent the characteristics of the data, or when different hyperparameters are needed, e.g. per location or product group. This involves running multiple training processes for different data sets at the same time.
Data Parallelism
Using many cores or instances to train algorithms, such as GPT-3, that have billions of parameters.
Model Parallelism
Splitting up training for a model that uses a deep learning algorithm with dense and/or a large number of layers, because a single GPU cannot handle it.
Pipe vs File
Improving training times by streaming data incrementally into models during training, instead of requiring a large amount of data to be downloaded before training can start.
Common Issues
Ø Training takes too long! We need it to take hours, not days.
Ø Training is costing lots of money and we are not sure if all the resources are being fully utilised.
Ø Our data set is too big and uses a lot of memory and network IO to process.
Ø We need to train hundreds of models at the same time.
Ø Client teams have limited experience in orchestration of training at scale.
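The distributed-training idea above, splitting the data into chunks across instances, is what SageMaker's ShardedByS3Key distribution does with the S3 objects of a training channel. A minimal sketch of that sharding logic, assuming one list of object keys and a fixed instance count:

```python
def shard_keys(keys, num_instances):
    """Sketch of ShardedByS3Key-style distribution: each training
    instance receives an interleaved subset of the input objects, so
    no object is read by more than one instance."""
    return [keys[i::num_instances] for i in range(num_instances)]
```

With `FullyReplicated` (the default) every instance would instead receive the whole list; sharding trades completeness per instance for throughput, which is why the per-chunk outputs must be combined at the end.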
16. Inference types
ML OPS – INFERENCE TYPES
Real Time
➤ Business critical; common uses are chat bots, classifiers, recommenders or linear regressors, e.g. credit risk, journey times.
➤ Hundreds or thousands of individual predictions per second.
➤ API driven with low latency, typically below 135 ms at the 90th percentile.
Near Real Time
➤ Commonly used for image classification or file analysis.
➤ Hundreds of individual predictions per minute, and processing needs to be done within seconds.
➤ Event or message queue based; predictions are sent back or stored.
Occasional
➤ Examples are simple classifiers, like tax codes.
➤ Only a few predictions a month, and processing needs to be completed within minutes.
➤ API, event or message queue based; predictions are sent back or stored.
Batch
➤ End of month reporting, invoice generation, warranty plan management.
➤ Runs at daily / monthly / set times.
➤ The data set is typically millions or tens of millions of rows at once.
Micro Batch
➤ Anomaly detection, invoice approval and image processing.
➤ Executed regularly, every X minutes or Y number of events; triggered by file upload or data ingestion.
➤ The data set is typically hundreds or thousands of rows at once.
Edge
➤ Used for computer vision, fault detection in manufacturing.
➤ Runs on mobile phone apps and low-power devices; uses sensors (e.g. video, location, or heat).
➤ Model output is normally sent back to the Cloud at regular intervals for analysis.
17.
Endpoint
Docker containers host the inference engines; inference engines can be written in any language, and endpoints can use more than one container. The primary container needs to implement a simple REST API.
Common Engines:
➤ 685385470294.dkr.ecr.eu-west-1.amazonaws.com/xgboost:1
➤ 520713654638.dkr.ecr.eu-west-1.amazonaws.com/sagemaker-tensorflow:1.11-cpu-py2
➤ 520713654638.dkr.ecr.eu-west-1.amazonaws.com/sagemaker-tensorflow:1.11-gpu-py2
➤ 763104351884.dkr.ecr.eu-west-1.amazonaws.com/tensorflow-inference:1.13-gpu
➤ 520713654638.dkr.ecr.eu-west-1.amazonaws.com/sagemaker-tensorflow-serving:1.11-cpu
AMAZON SAGEMAKER – INFERENCE ENGINES
Dockerfile:
FROM tensorflow/serving:latest
RUN apt-get update && apt-get install -y --no-install-recommends nginx git
RUN mkdir -p /opt/ml/model
COPY nginx.conf /etc/nginx/nginx.conf
ENTRYPOINT service nginx start | tensorflow_model_server --rest_api_port=8501 --model_config_file=/opt/ml/model/models.config
Diagram: Amazon SageMaker extracts model.tar.gz into the primary container at /opt/ml/model; the container serves http://localhost:8080/invocations and http://localhost:8080/ping via Nginx, Gunicorn and the model runtime, with custom metadata passed in the X-Amzn-SageMaker-Custom-Attributes header.
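The "simple REST API" the primary container must implement is just two routes: GET /ping for health checks and POST /invocations for predictions. A minimal stdlib-only sketch (the `predict` body is a placeholder; a real container would load the model from /opt/ml/model):

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def predict(payload):
    """Placeholder model: returns the number of features received."""
    return {"prediction": len(payload.get("features", []))}


class InferenceHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # SageMaker calls /ping to health-check the container
        self.send_response(200 if self.path == "/ping" else 404)
        self.end_headers()

    def do_POST(self):
        if self.path != "/invocations":
            self.send_response(404)
            self.end_headers()
            return
        length = int(self.headers["Content-Length"])
        payload = json.loads(self.rfile.read(length))
        body = json.dumps(predict(payload)).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)


def serve(port=8080):
    """SageMaker expects the container to listen on port 8080."""
    HTTPServer(("0.0.0.0", port), InferenceHandler).serve_forever()
```

In practice the ready-made engines above already provide this contract; writing your own handler is only needed for fully custom containers.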
18.
Logical components of an endpoint within Amazon SageMaker
AMAZON SAGEMAKER – REAL TIME INFERENCE
All components are immutable; any configuration change requires new models and endpoint configurations. However, there is a specific SageMaker API to update instance count and variant weight.
Diagram: clients call the endpoint by name using the SDKs or SigV4-signed REST requests. The Endpoint references an Endpoint Configuration, which defines one or more Production Variants. Each Production Variant specifies a model (an inference engine with a primary container plus optional additional containers), an instance type, an initial instance count and a weight, and is associated with a VPC, S3 and KMS + IAM.
19.
The following shows the same experiment with M5 instances and autoscaling enabled:
M5 INSTANCES WITH AUTOSCALING
The autoscaling group was set between 2 and 4 instances, with the scaling policy set to 100k requests. The number of invocations continued to rise and CPU never went above 100%. A scaling event happened at 08:45 and took 5 minutes to warm up. No instances crashed, and up to 4 instances were used.
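Endpoint autoscaling of this kind is configured through Application Auto Scaling. The sketch below builds the two requests that reproduce the experiment's setup (2 to 4 instances, target tracking on invocations per instance); the endpoint and variant names are hypothetical.

```python
def variant_scaling_policy(endpoint_name, variant_name,
                           min_capacity=2, max_capacity=4,
                           invocations_per_instance=100_000):
    """Build the Application Auto Scaling requests for a SageMaker
    production variant: a scalable target (instance range) and a
    target-tracking policy on invocations per instance."""
    resource_id = f"endpoint/{endpoint_name}/variant/{variant_name}"
    target = {
        "ServiceNamespace": "sagemaker",
        "ResourceId": resource_id,
        "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
        "MinCapacity": min_capacity,
        "MaxCapacity": max_capacity,
    }
    policy = {
        "PolicyName": f"{variant_name}-invocations-scaling",
        "ServiceNamespace": "sagemaker",
        "ResourceId": resource_id,
        "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingScalingPolicyConfiguration": {
            "TargetValue": float(invocations_per_instance),
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
            },
        },
    }
    return target, policy
```

These dictionaries would be passed to `boto3.client("application-autoscaling")` via `register_scalable_target(**target)` and `put_scaling_policy(**policy)`. The 5-minute warm-up observed above is why a headroom of spare capacity matters for spiky traffic.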
20.
The following chart compares the two M5-based experiments:
WHY IS CPU USAGE THAT IMPORTANT?
Latency (red) increased when the CPU went over 100%. This is due to invocations having to wait within SageMaker to be processed.
Zzzzz, Phil does sleep! The two M5 experiments had a cost of $42.96. SageMaker Studio was used instead of a SageMaker notebook instance.
21.
The following are the four ways to deploy new versions of models in Amazon SageMaker:
DEV OPS WITH SAGEMAKER
Rolling: the default option. SageMaker will start new instances and, once they are healthy, stop the old ones.
Canary: done using two variants (a weighted canary variant and a full variant) in the Endpoint Configuration, performed over two CloudFormation updates.
Blue/Green: requires two CloudFormation stacks, then changing the endpoint name in the AWS Lambda using an environment variable.
Linear: uses two variants (a new variant and an old variant) in the Endpoint Configuration, with an AWS Step Function and AWS Lambda calling the UpdateEndpointWeightsAndCapacities API to shift the weight.
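For the linear option, each step of the shift is one UpdateEndpointWeightsAndCapacities call. A sketch of the request body for a single step (endpoint and variant names are hypothetical):

```python
def traffic_shift_request(endpoint_name, old_variant, new_variant, new_weight):
    """Build one step of a linear deployment: move a share of traffic to
    the new variant by updating the variant weights in place. No new
    endpoint configuration is needed for a weight change."""
    return {
        "EndpointName": endpoint_name,
        "DesiredWeightsAndCapacities": [
            {"VariantName": old_variant,
             "DesiredWeight": round(1.0 - new_weight, 3)},
            {"VariantName": new_variant,
             "DesiredWeight": new_weight},
        ],
    }
```

A Step Functions loop would call `boto3.client("sagemaker").update_endpoint_weights_and_capacities(**traffic_shift_request(...))` with an increasing weight (e.g. 0.1, 0.25, 0.5, 1.0), checking alarms between steps.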
23. Cost optimisation for training and inference
ML OPS – A 360°
Cost levers: change in instance size, change in instance type. Note there are no RIs or Savings Plans for ML.
Chart: cost split from Feb 20 to Jan 21 (daily, monthly and yearly views): Inference 57%, Training 15%, Notebooks 28%.
Top Tips
➤ Spot instances (surplus capacity from cloud providers) are cheaper for workloads that can handle being rerun, like batch or training. For longer execution times, consider using spot instances with model checkpointing.
➤ Models that require GPUs for training justify additional consideration due to the use of more expensive instance types.
➤ For GPUs, analyse the utilisation of the GPU cores and memory. However, CPU and network IO all need looking at too: make sure you feed the GPUs enough data without bottlenecking.
➤ Multi-model support allows more than one model to be hosted on the same instance. This is very efficient for hosting many small models (e.g. a model per city), as hosting one per instance would give poor resource utilisation.
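With a multi-model endpoint, the caller picks the model per request via the TargetModel field of InvokeEndpoint. A sketch of the request for the model-per-city layout mentioned above (endpoint name and artefact naming are hypothetical):

```python
def multi_model_request(endpoint_name, city, payload):
    """Invoke one of many models hosted on a single multi-model
    endpoint: TargetModel selects an artefact, relative to the S3
    prefix the endpoint was created with, e.g. 'london.tar.gz'."""
    return {
        "EndpointName": endpoint_name,
        "ContentType": "text/csv",
        "TargetModel": f"{city}.tar.gz",
        "Body": payload,
    }
```

The request would be sent with `boto3.client("sagemaker-runtime").invoke_endpoint(**multi_model_request(...))`; SageMaker lazily loads the requested artefact onto the instance, which is why rarely used models may see a cold-start on first invocation.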
24.
Business Performance and KPIs
KPIS AND MODEL MONITORING
➤ The most important measure of a model is whether it is accomplishing what it set out to achieve.
➤ This is judged by setting clear KPIs and measuring how the model affects them.
➤ This can be done in a number of ways, but one of the simplest and most impactful is constructing a dashboard in a BI tool like QuickSight.
Model Performance
➤ SageMaker Model Monitor can be used to baseline a model and detect drift.
➤ Another important aspect to monitor is that predictions are within known boundaries.
➤ Performance monitoring of the model can trigger retraining when issues arise.
25. An AWS CloudWatch dashboard providing complete oversight of the inference process
PERFORMANCE MONITORING
API error and
success rates
API Gateway
response times
using percentiles
Lambda
executions
Availability
recorded from
health checker
API Usage data
for Usage Plan
26.
X-Ray traces can help you spot bottlenecks and costly areas of the code, including inside your models.
OBSERVING INFERENCE
Trace map: APIGWUrl → Inference Function → nested calls (Function A through Function H, Function 1 and Function 2), with subsegments for the model and a SQL call (db_url).
27.
Amazon SageMaker exposes metrics to AWS CloudWatch.
MONITORING SAGEMAKER

| Name                       | Unit         | Statistic | Threshold        | Time Period    | Missing                  |
| Endpoint model latency     | Milliseconds | Average   | > 100            | For 5 minutes  | ignore                   |
| Endpoint model invocations | Count        | Sum       | > 10000 / < 1000 | For 15 minutes | notBreaching / breaching |
| Endpoint disk usage        | %            | Average   | > 90% / > 80%    | For 15 minutes | ignore                   |
| Endpoint CPU usage         | %            | Average   | > 90% / > 80%    | For 15 minutes | ignore                   |
| Endpoint memory usage      | %            | Average   | > 90% / > 80%    | For 15 minutes | ignore                   |
| Endpoint 5XX errors        | Count        | Sum       | > 10             | For 5 minutes  | notBreaching             |
| Endpoint 4XX errors        | Count        | Sum       | > 50             | For 5 minutes  |                          |

The metrics in AWS CloudWatch can then be used for alarms:
➤ Always pay attention to how to handle missing data
➤ Always test your alarms
➤ Look to level your alarms
➤ Make your alarms complement each other
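The first row of the table maps directly onto a CloudWatch alarm definition. A sketch of that alarm as a `put_metric_alarm` request (endpoint and variant names are hypothetical; note that SageMaker reports ModelLatency in microseconds, hence the conversion from the table's milliseconds):

```python
def model_latency_alarm(endpoint_name, variant_name, threshold_ms=100):
    """Build a CloudWatch alarm for average model latency above 100 ms
    sustained for 5 minutes, with missing data ignored."""
    return {
        "AlarmName": f"{endpoint_name}-model-latency",
        "Namespace": "AWS/SageMaker",
        "MetricName": "ModelLatency",
        "Dimensions": [
            {"Name": "EndpointName", "Value": endpoint_name},
            {"Name": "VariantName", "Value": variant_name},
        ],
        "Statistic": "Average",
        "Threshold": threshold_ms * 1000.0,  # ms -> microseconds
        "ComparisonOperator": "GreaterThanThreshold",
        "Period": 60,
        "EvaluationPeriods": 5,  # five 1-minute periods = "for 5 minutes"
        "TreatMissingData": "ignore",
    }
```

The request would be sent with `boto3.client("cloudwatch").put_metric_alarm(**model_latency_alarm(...))`; the other table rows follow the same pattern with different metric names, statistics and TreatMissingData values.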
29. Using automation and tools to deploy models and to maintain consistency
AUTOMATION AND PIPELINES
Layers: Infrastructure; Data Foundation; Governance and Control; environments running from Experiments through Development and Pre-Production to Production.
Foundations:
➤ A solid data lake/warehouse with good sources of data is required for long-term scaling of ML usage.
➤ Running models operationally also means considering availability, fault tolerance and scaling of instances.
➤ Having a robust security posture using multiple layers with auditability is essential.
➤ Consistent architecture, development approaches and deployments aid maintainability.
Scaling and refinement:
➤ Did your models improve, or do they still meet, the outcomes and KPIs that you set out to affect?
➤ Have innovations in technology meant that complexity in development or deployment can be simplified, allowing more focus to be put on other uses of ML?
➤ Are your models running on the latest and most optimal hardware?
➤ Do you need a feature store to improve collaboration and sharing of features?
➤ Do you need a model registry for control and governance?
30.
AWS Step Functions Data Science Software Development Kit
MODEL RETRAINING
AWS Glue: used for raw data ingress, cleaning that data and then transforming it into a training data set.
Deployments to Amazon SageMaker endpoints: the ability to perform deployments from the pipeline, including blue/green, linear and canary style updates.
AWS Lambda: used to stitch elements together and perform any additional logic.
AWS ECS/Fargate: there are situations where you may need to run very long-running processes over the data to prepare it for training. Lambda is not suitable for this due to its maximum execution time and memory limits, so Fargate is preferred in these situations.
Amazon SageMaker training jobs: the ability to run training on the data that the pipeline has prepared for you.
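Stitched together, these elements form a state machine; the Data Science SDK generates an Amazon States Language definition of roughly this shape from Python code. A hand-written sketch of the retraining flow (the Glue job and Lambda function names are hypothetical):

```python
def retraining_state_machine(glue_job_name):
    """Amazon States Language sketch of the retraining pipeline:
    a Glue ETL step, then a SageMaker training job, then a Lambda
    that performs the endpoint deployment."""
    return {
        "StartAt": "PrepareData",
        "States": {
            "PrepareData": {
                "Type": "Task",
                # .sync = wait for the Glue job to finish before moving on
                "Resource": "arn:aws:states:::glue:startJobRun.sync",
                "Parameters": {"JobName": glue_job_name},
                "Next": "TrainModel",
            },
            "TrainModel": {
                "Type": "Task",
                "Resource": "arn:aws:states:::sagemaker:createTrainingJob.sync",
                # use the execution name to keep training job names unique
                "Parameters": {"TrainingJobName.$": "$$.Execution.Name"},
                "Next": "DeployModel",
            },
            "DeployModel": {
                "Type": "Task",
                "Resource": "arn:aws:states:::lambda:invoke",
                "Parameters": {"FunctionName": "deploy-model"},  # hypothetical
                "End": True,
            },
        },
    }
```

The `.sync` service integrations let Step Functions wait for long-running Glue and training jobs without any polling code, which is exactly the gap Lambda's execution limits leave.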