15 APRIL 2021
Machine Learning Operations
On AWS
Who am I?
Phil Basford
phil@inawisdom.com
@philipbasford
• Experienced principal solutions architect, lead developer and head of practice
at Inawisdom (an AWS Partner). Inawisdom’s AWS APN Ambassador and evangelist.
• Holds all 12 AWS certifications, including Solutions Architect Professional,
DevOps Engineer Professional, and the Data Analytics and Machine Learning
specialities.
• Over 6 years of AWS experience, including responsibility for production
workloads of over 200 containers in a high-performance system that responded
to 18,000 requests per second.
• Visionary in MLOps: has produced production workloads of ML models at scale,
including 1,500 inferences per minute, with active monitoring and alerting.
• Has developed in Python, NodeJS and J2EE.
• One of the Ipswich AWS User Group leaders; contributes to the AWS community
by speaking at several summits, community days and meet-ups.
• Regular blogger, open-source contributor, and SME on machine learning, MLOps,
DevOps, containers and serverless.
#1 EMEA
© 2020 Amazon Web Services, Inc. or its affiliates. All rights reserved |
The AWS ML Stack
Broadest and most complete set of machine learning capabilities.

AI SERVICES (vision, speech, text, search, chatbots, personalization,
forecasting, fraud, development, contact centers):
Amazon Rekognition, Amazon Polly, Amazon Transcribe (+Medical), Amazon
Comprehend (+Medical), Amazon Translate, Amazon Lex, Amazon Personalize,
Amazon Forecast, Amazon Fraud Detector, Amazon CodeGuru, Amazon Textract,
Amazon Kendra, and Contact Lens for Amazon Connect.

ML SERVICES (Amazon SageMaker, with the SageMaker Studio IDE):
Ground Truth, ML Marketplace, Neo, Augmented AI, built-in algorithms,
Notebooks, Experiments, model training & tuning, Debugger, Autopilot,
model hosting, and Model Monitor.

ML FRAMEWORKS & INFRASTRUCTURE:
Deep Learning AMIs & Containers, GPUs & CPUs, Elastic Inference, Inferentia,
FPGAs, and the Deep Graph Library.
ML LIFE CYCLE
Define the Problem and Value
Data Exploration: SageMaker Ground Truth, AWS Data Exchange, AWS ‘Lake House’, Open Data Sets
Experiment: SageMaker Notebooks, SageMaker Autopilot, ML Marketplace
Testing and Evaluation: SageMaker Debugger, SageMaker Experiments
Refinement: SageMaker Hyperparameter Tuning, SageMaker Notebooks
Inference: SageMaker Endpoints, SageMaker Batch Transform
Operationalize: SageMaker Model Monitor, AWS Step Functions Data Science SDK, SageMaker Pipelines
ARCHITECTURE
The Well-Architected pillars applied to machine learning workloads:

Operational Excellence: monitoring, observing and alerting using CloudWatch
and X-Ray. Infrastructure as code with SAM and CloudFormation.

Security: least privilege, data encryption at rest, and data encryption in
transit, using IAM policies, resource policies, KMS, Secrets Manager, VPCs
and security groups.

Performance: elastic scaling based on demand, meeting response times using
Auto Scaling, serverless, and per-request managed services.

Cost Optimisation: serverless and fully managed services to lower TCO.
Resource-tag everything possible for cost analysis. Right-size instance
types for model hosting.

Reliability: fault tolerance and auto healing to meet a target availability
using Auto Scaling, Multi-AZ, Multi-Region, read replicas and snapshots.

https://d1.awsstatic.com/whitepapers/architecture/wellarchitected-Machine-Learning-Lens.pdf
SERVERLESS
AWS Lambda: AWS’s native, fully managed service for running application code
without the need to run servers.

Amazon API Gateway: the endpoint for your API; it has extensive security
measures, logging, and API definition using OpenAPI or Swagger.

Amazon DynamoDB: a fully managed NoSQL service from AWS. For machine learning
it is typically used for reference data.

Amazon S3: highly durable object storage used for many things, including data
lakes. For machine learning it is used to store training data sets and model
artefacts.

Also: SNS (pub/sub), SQS (queues), Fargate (containers), Step Functions
(workflows), and more.
THE SOLUTION AND ARCHITECTURE
SECURITY
Remember to always apply least privilege and other AWS security best
practices, and be very protective of your data.

AWS KMS: encrypt everything! However, if your data is PII or in PCI-DSS scope,
consider using a dedicated customer-managed key in KMS to do this. This gives
you tighter control by limiting the ability to decrypt data, providing another
layer of security over S3.

AWS IAM: SageMaker, like EC2, is granted access to other AWS services using
IAM roles, and you need to make sure your policies are locked down to only
the Actions and Resources you need.

Amazon S3: SageMaker can use a range of data stores, but S3 is the most
popular. Please make sure you enable encryption, resource policies, logging
and versioning on your buckets.

Amazon VPC: SageMaker can run outside a VPC and access data over the public
internet (hopefully using HTTPS). This runs contrary to most corporate
information security policies, so please deploy in a VPC with private links
for extra security.

Data: most importantly, only use the data you need. If the data contains PII
or PCI-DSS values that you do not need, remove or sanitise them.
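The last point above can be sketched in a few lines: strip PII fields from
records before the data set is written out for training. This is a minimal
illustration; the field names are hypothetical and would be replaced by your
own schema.

```python
# Hypothetical PII field names - adjust to your own schema.
PII_FIELDS = {"name", "email", "phone"}

def sanitise(record: dict, pii_fields: set = PII_FIELDS) -> dict:
    """Return a copy of the record with PII fields removed."""
    return {k: v for k, v in record.items() if k not in pii_fields}

rows = [
    {"name": "A. Person", "email": "a@example.com", "amount": 42.0},
    {"name": "B. Person", "phone": "01234 567890", "amount": 17.5},
]
clean = [sanitise(r) for r in rows]
# clean == [{"amount": 42.0}, {"amount": 17.5}]
```

In practice this step would run in your ETL job (Glue, Spark or SageMaker
Processing) before the data ever lands in the training bucket.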
ML OPS PROCESSES
Dev Ops in Machine Learning
ML OPS: Data Updates / Drift Detection
Trigger: new data becomes available (structured, semi-structured or
unstructured).

Stage | Component | Technology
ETL | Data pre-processing, including validation of data | Spark, EMR, Glue, Matillion; Spark, scikit-learn, containers, SageMaker Processing
Training | ML algorithms and frameworks | SageMaker training jobs
Verification | Accuracy checks, golden data set testing, model debugging | SageMaker Debugger
Inference | Batch or real-time | SageMaker Endpoints, SageMaker Batch Transform, ECS Docker and Functions
Monitoring | Baselining / sampling predictions, model drift detection, model selection automation | SageMaker Model Monitor, CloudWatch
Dev Ops in Machine Learning
ML OPS: New Data Features / DS Changes (script mode)
Stages: ETL, Training, Verification, Inference, Monitoring.
Trigger: verified data is available (the data set used to train previously),
and CI/CD is used to build the model code.
Roles involved: Data Scientist, ML Engineer, Source Control, DevOps.
Recommended additions / potential changes: SageMaker Experiments and
hyperparameter tuning jobs.
TRAINING
Optimising training to meet the business needs is a trade-off between cost,
effort, speed/time and complexity.

Distributed Training: split large amounts of data into chunks, train the
chunks across many instances, then combine the outputs at the end.

Multi-Job Training: used when a generalised model does not represent the
characteristics of the data, or different hyperparameters are needed, e.g.
per location or product group. This involves running multiple training
processes for different data sets at the same time.

Data Parallelism: using many cores or instances to train algorithms, like
GPT-3, that have billions of parameters.

Model Parallelism: splitting up training for a model that uses a deep
learning algorithm with dense and/or a large number of layers, because a
single GPU cannot handle it.

Pipe vs File: improving training times by loading data incrementally into
models during training, instead of requiring a large amount of data to be
downloaded before training can start.

Common Issues:
➤ Training takes too long! We need it to take hours, not days.
➤ Training is costing lots of money and we are not sure if all the
resources are being fully utilised.
➤ Our data set is too big and uses a lot of memory and network IO to process.
➤ We need to train hundreds of models at the same time.
➤ Client teams have limited experience in orchestration of training at scale.
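The distributed-training idea above (split the data into chunks, train each
chunk on its own instance, combine the outputs) can be sketched in pure
Python. The round-robin sharding and the averaging combiner are illustrative
stand-ins for what a real framework would do.

```python
def shard(items, num_workers):
    """Split a data set into num_workers round-robin chunks, one per instance."""
    return [items[i::num_workers] for i in range(num_workers)]

def combine(partials):
    """Combine per-worker outputs; here a simple average of per-chunk metrics."""
    return sum(partials) / len(partials)

data = list(range(10))
chunks = shard(data, 3)             # [[0, 3, 6, 9], [1, 4, 7], [2, 5, 8]]
metrics = [len(c) for c in chunks]  # stand-in for each worker's training output
overall = combine(metrics)
```

In real data parallelism the "combine" step is a gradient all-reduce rather
than a simple average of metrics, but the shape of the problem is the same.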
INFERENCE
ML OPS – INFERENCE TYPES

Real Time
➤ Business critical; common uses are chat bots, classifiers, recommenders
or linear regressors, e.g. credit risk, journey times.
➤ Hundreds or thousands of individual predictions per second.
➤ API driven with low latency, typically below 135ms at the 90th percentile.

Near Real Time
➤ Commonly used for image classification or file analysis.
➤ Hundreds of individual predictions per minute, and processing needs to be
done within seconds.
➤ Event or message queue based; predictions are sent back or stored.

Occasional
➤ Examples are simple classifiers, like tax codes.
➤ Only a few predictions a month, and processing needs to be completed
within minutes.
➤ API, event or message queue based; predictions are sent back or stored.

Batch
➤ End of month reporting, invoice generation, warranty plan management.
➤ Runs daily, monthly or at set times.
➤ The data set is typically millions or tens of millions of rows at once.

Micro Batch
➤ Anomaly detection, invoice approval and image processing.
➤ Executed regularly: every X minutes or every Y events; triggered by file
upload or data ingestion.
➤ The data set is typically hundreds or thousands of rows at once.

Edge
➤ Used for computer vision, fault detection in manufacturing.
➤ Runs on mobile phone apps and low power devices; uses sensors (e.g. video,
location, or heat).
➤ Model output is normally sent back to the Cloud at regular intervals for
analysis.
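The micro-batch pattern above can be sketched as a small accumulator: buffer
incoming events and run one prediction call per batch. This is a toy sketch;
a real system would also flush on a timer, and the doubling "model" is just
a placeholder.

```python
class MicroBatcher:
    """Accumulate incoming events and flush a prediction batch every
    max_size events (a time-based trigger would be added in practice)."""

    def __init__(self, max_size, predict_fn):
        self.max_size = max_size
        self.predict_fn = predict_fn   # batch -> list of predictions
        self.buffer = []
        self.results = []

    def add(self, event):
        self.buffer.append(event)
        if len(self.buffer) >= self.max_size:
            self.flush()

    def flush(self):
        if self.buffer:
            self.results.extend(self.predict_fn(self.buffer))
            self.buffer = []

b = MicroBatcher(3, lambda batch: [x * 2 for x in batch])
for event in [1, 2, 3, 4]:
    b.add(event)
# After the loop: b.results == [2, 4, 6] and event 4 is still buffered.
```

Batching amortises per-call overhead (model load, network round trip) across
many rows, which is why it suits "hundreds or thousands of rows at once".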
AMAZON SAGEMAKER – INFERENCE ENGINES
Docker containers host the inference engines; inference engines can be
written in any language, and endpoints can use more than one container.
The primary container needs to implement a simple REST API:
http://localhost:8080/invocations
http://localhost:8080/ping

Common Engines:
➤ 685385470294.dkr.ecr.eu-west-1.amazonaws.com/xgboost:1
➤ 520713654638.dkr.ecr.eu-west-1.amazonaws.com/sagemaker-tensorflow:1.11-cpu-py2
➤ 520713654638.dkr.ecr.eu-west-1.amazonaws.com/sagemaker-tensorflow:1.11-gpu-py2
➤ 763104351884.dkr.ecr.eu-west-1.amazonaws.com/tensorflow-inference:1.13-gpu
➤ 520713654638.dkr.ecr.eu-west-1.amazonaws.com/sagemaker-tensorflow-serving:1.11-cpu

Dockerfile:
FROM tensorflow/serving:latest
RUN apt-get update && apt-get install -y --no-install-recommends nginx git
RUN mkdir -p /opt/ml/model
COPY nginx.conf /etc/nginx/nginx.conf
ENTRYPOINT service nginx start | tensorflow_model_server --rest_api_port=8501 --model_config_file=/opt/ml/model/models.config

Inside the primary container the stack is Nginx, Gunicorn and the model
runtime. Amazon SageMaker downloads model.tar.gz from S3 and links it to
/opt/ml/model; custom metadata is passed via the
X-Amzn-SageMaker-Custom-Attributes header.
AMAZON SAGEMAKER – REAL TIME INFERENCE
Logical components of an endpoint within Amazon SageMaker:
➤ Endpoint: the named, invocable resource, called via the SDKs or REST
requests signed with SigV4.
➤ Endpoint Configuration: references one or more Production Variants.
➤ Production Variant: a Model plus an instance type, initial count and weight.
➤ Model: the inference engine and model artefact; a primary container plus
optional additional containers, with VPC, S3, and KMS + IAM settings.
All components are immutable: any configuration change requires new models
and endpoint configurations. However, there is a specific SageMaker API to
update instance count and variant weight.
M5 INSTANCES WITH AUTOSCALING
The following shows the same experiment with M5 instances and autoscaling
enabled. The autoscaling group was set between 2 and 4 instances, with the
scaling policy set to 100k requests. The number of invocations continued to
rise and CPU never went above 100%. A scaling event happened at 08:45 and
took 5 minutes to warm up. No instances crashed, and up to 4 instances were
used.
WHY IS CPU USAGE THAT IMPORTANT?
The following chart compares the two M5-based experiments. Latency (red)
increased when the CPU went over 100%. This is due to invocations having to
wait within SageMaker to be processed.
Zzzzz, Phil does sleep! The two M5 experiments had a cost of $42.96.
SageMaker Studio was used instead of a SageMaker notebook instance.
DEV OPS WITH SAGEMAKER
The following are the four ways to deploy new versions of models in Amazon
SageMaker:

Rolling: the default option. SageMaker will start new instances and then,
once they are healthy, stop the old ones.

Canary: canary deployments are done using two variants (a canary variant and
a full variant) in the Endpoint Configuration, performed over two
CloudFormation updates.

Blue/Green: requires two CloudFormation stacks, then changing the endpoint
name in the AWS Lambda using an environment variable.

Linear: uses two weighted variants (a new variant and an old variant) in the
Endpoint Configuration, with an AWS Step Function and AWS Lambda calling the
UpdateEndpointWeightsAndCapacities API.
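For the linear option, the Step Function's Lambda shifts traffic in equal
increments. A sketch of the weight schedule it would apply, with each
(old, new) pair passed to the UpdateEndpointWeightsAndCapacities API:

```python
def linear_shift(steps):
    """Traffic weights for a linear deployment: each step moves an equal
    share of traffic from the old variant to the new one."""
    return [(round(1 - i / steps, 4), round(i / steps, 4))
            for i in range(1, steps + 1)]

schedule = linear_shift(4)
# [(0.75, 0.25), (0.5, 0.5), (0.25, 0.75), (0.0, 1.0)]
```

Between steps the Step Function would wait and check CloudWatch metrics,
rolling back if error rates or latency regress.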
MONITORING
ML OPS – A 360° VIEW
Cost optimisation for training and inference. Review costs daily, monthly and
yearly; in this example the spend from Feb 2020 to Jan 2021 split as
Inference 57%, Training 15%, Notebooks 28%. Watch for the effect of changes
in instance size and instance type, and note there are no RIs or Savings
Plans for ML.

Top Tips:
➤ Spot instances (surplus capacity from cloud providers) are cheaper for
workloads that can handle being rerun, like batch or training. For longer
execution times, consider using spot instances with model checkpointing.
➤ Models that require GPUs for training justify additional consideration
due to the use of more expensive instance types.
➤ For GPUs, analyse the utilisation of the GPU cores and memory; CPU and
network IO also need looking at. Make sure you feed the GPUs enough data
without bottlenecking.
➤ Multi-model support allows more than one model to be hosted on the same
instance. This is very efficient for hosting many small models (e.g. a model
per city), as hosting one per instance would give poor resource utilisation.
KPIS AND MODEL MONITORING
Business Performance and KPIs
➤ The most important measure of a model is whether it accomplishes what it
set out to achieve.
➤ This is judged by setting clear KPIs and assessing how the model affects
them.
➤ This can be done a number of ways, but one of the simplest and most
impactful is constructing a dashboard in a BI tool like QuickSight.

Model Performance
➤ SageMaker Model Monitor can be used to baseline a model and detect drift
from that baseline.
➤ Another important aspect to monitor is that predictions are within known
boundaries.
➤ Performance monitoring of the model can trigger retraining when issues
arise.
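The boundary check above is cheap to implement yourself as a complement to
Model Monitor's statistical checks. A minimal sketch, with illustrative
bounds:

```python
def out_of_bounds(predictions, lower, upper):
    """Return the predictions that fall outside the known baseline range."""
    return [p for p in predictions if not (lower <= p <= upper)]

flagged = out_of_bounds([0.2, 0.9, 1.4, -0.1], lower=0.0, upper=1.0)
# flagged == [1.4, -0.1]
```

Pushing len(flagged) to CloudWatch as a custom metric lets the same alarm
machinery described below trigger investigation or retraining.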
PERFORMANCE MONITORING
An AWS CloudWatch dashboard provides complete oversight of the inference
process:
➤ API error and success rates
➤ API Gateway response times, using percentiles
➤ Lambda executions
➤ Availability recorded from a health checker
➤ API usage data per usage plan
OBSERVING INFERENCE
X-Ray traces can help you spot bottlenecks and costly areas of the code,
including inside your models.
[Trace map: the API Gateway URL fans out to the inference functions, then to
downstream functions, the model itself, and SQL calls, each shown as a timed
segment.]
MONITORING SAGEMAKER
Amazon SageMaker exposes metrics to AWS CloudWatch.

Name | Dimension | Statistic | Threshold | Time Period | Missing
Endpoint model latency | Milliseconds | Average | > 100 | For 5 minutes | ignore
Endpoint model invocations | Count | Sum | > 10000 | For 15 minutes | notBreaching
Endpoint model invocations | Count | Sum | < 1000 | For 15 minutes | breaching
Endpoint disk usage | % | Average | > 90% (warn > 80%) | For 15 minutes | ignore
Endpoint CPU usage | % | Average | > 90% (warn > 80%) | For 15 minutes | ignore
Endpoint memory usage | % | Average | > 90% (warn > 80%) | For 15 minutes | ignore
Endpoint 5XX errors | Count | Sum | > 10 | For 5 minutes | notBreaching
Endpoint 4XX errors | Count | Sum | > 50 | For 5 minutes |
The metrics in AWS CloudWatch can then be used for alarms:
➤ Always pay attention to how you handle missing data
➤ Always test your alarms
➤ Look to level your alarms (e.g. warning and critical thresholds)
➤ Make your alarms complement each other
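One row of the table above, expressed as parameters for CloudWatch's real
`put_metric_alarm` API. The alarm and endpoint names are hypothetical, and
note that SageMaker reports the ModelLatency metric in microseconds, so a
100ms threshold becomes 100,000:

```python
# In real use: boto3.client("cloudwatch").put_metric_alarm(**alarm)
alarm = {
    "AlarmName": "my-endpoint-model-latency",       # hypothetical name
    "Namespace": "AWS/SageMaker",
    "MetricName": "ModelLatency",
    "Dimensions": [{"Name": "EndpointName", "Value": "my-endpoint"},
                   {"Name": "VariantName", "Value": "AllTraffic"}],
    "Statistic": "Average",
    "Threshold": 100_000,                           # 100 ms in microseconds
    "ComparisonOperator": "GreaterThanThreshold",
    "Period": 60,
    "EvaluationPeriods": 5,                         # i.e. for 5 minutes
    "TreatMissingData": "ignore",                   # the "Missing" column
}
```

TreatMissingData is where the "how to handle missing data" tip lands: the
valid values are ignore, notBreaching, breaching and missing.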
AUTOMATION
AUTOMATION AND PIPELINES
Using automation and tools to deploy models and to maintain consistency. The
pipeline runs from Experiments through Development and Pre-Production to
Production, built on Infrastructure, a Data Foundation, and Governance and
Control.

Foundations:
➤ A solid data lake/warehouse with good sources of data is required for
long-term scaling of ML usage.
➤ Running models operationally also means considering availability, fault
tolerance and scaling of instances.
➤ Having a robust security posture, using multiple layers with auditability,
is essential.
➤ Consistent architecture, development approaches and deployments aid
maintainability.

Scaling and refinement:
➤ Did your models improve, or do they still meet, the outcomes and KPIs that
you set out to affect?
➤ Have innovations in technology meant that complexity in development or
deployment can be simplified, allowing more focus to be put on other uses
of ML?
➤ Are your models running on the latest and most optimal hardware?
➤ Do you need a feature store to improve collaboration and sharing of
features?
➤ Do you need a model registry for control and governance?
MODEL RETRAINING
AWS Step Functions Data Science Software Development Kit

AWS Glue: used for raw data ingress, cleaning that data and then transforming
it into a training data set.

Amazon SageMaker training jobs: the ability to run training on the data that
the pipeline has prepared for you.

Deployments to Amazon SageMaker endpoints: the ability to perform deployments
from the pipeline, including blue/green, linear and canary style updates.

AWS Lambda: used to stitch elements together and perform any additional logic.

AWS ECS/Fargate: there are situations where you may need to run very
long-running processes over the data to prepare it for training. Lambda is
not suitable for this due to its maximum execution time and memory limits,
so Fargate is preferred in these situations.
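The retraining flow above (ETL, then training, then deployment) is a state
machine passing context from step to step. A toy orchestration sketch, with
lambdas standing in for the Glue job, training job and endpoint update:

```python
def run_pipeline(steps, context):
    """Run named steps in order; each step takes and returns the shared
    context, like states passing output in a Step Functions state machine."""
    for name, step in steps:
        context = step(context)
        context.setdefault("completed", []).append(name)
    return context

steps = [
    ("etl",    lambda c: {**c, "dataset": "s3://bucket/train.csv"}),  # Glue stand-in
    ("train",  lambda c: {**c, "model": "model.tar.gz"}),             # training job stand-in
    ("deploy", lambda c: {**c, "endpoint": "my-endpoint"}),           # endpoint update stand-in
]
result = run_pipeline(steps, {})
```

The Step Functions Data Science SDK builds the real equivalent of this from
Python, with retries, error handling and waits that a toy loop lacks.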
REFERENCES
re:Invent and Webinar:
➤ https://pages.awscloud.com/GLOBAL-PTNR-OE-IPC-AIML-Inawisdom-Oct-2019-reg-event.html
➤ https://www.youtube.com/watch?v=lx9fP_4yi2s
My blogs:
➤ https://www.inawisdom.com/machine-learning/amazon-sagemaker-endpoints-inference/
➤ https://www.inawisdom.com/machine-learning/machine-learning-performance-more-than-skin-deep/
➤ https://www.inawisdom.com/machine-learning/a-model-is-for-life-not-just-for-christmas/
Other:
➤ https://docs.aws.amazon.com/sagemaker/latest/dg/your-algorithms.html
➤ https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/AlarmThatSendsEmail.html#alarms-and-missing-data
➤ https://aws-step-functions-data-science-sdk.readthedocs.io/en/latest/readmelink.html#getting-started-with-sample-jupyter-notebooks
QUESTIONS
020 3575 1337
info@inawisdom.com
Columba House,
Adastral Park, Martlesham Heath
Ipswich, Suffolk, IP5 3RE
www.inawisdom.com
@philipbasford

C04 Driving understanding from Documents and unstructured data sources final.pdfPhilipBasford
 
Palringo AWS London Summit 2017
Palringo AWS London Summit 2017Palringo AWS London Summit 2017
Palringo AWS London Summit 2017PhilipBasford
 
Palringo : a startup's journey from a data center to the cloud
Palringo : a startup's journey from a data center to the cloudPalringo : a startup's journey from a data center to the cloud
Palringo : a startup's journey from a data center to the cloudPhilipBasford
 
Machine learning at scale with aws sage maker
Machine learning at scale with aws sage makerMachine learning at scale with aws sage maker
Machine learning at scale with aws sage makerPhilipBasford
 

More from PhilipBasford (16)

re:cap Generative AI journey with Bedrock
re:cap Generative AI journey  with Bedrockre:cap Generative AI journey  with Bedrock
re:cap Generative AI journey with Bedrock
 
AIM102-S_Cognizant_CognizantCognitive
AIM102-S_Cognizant_CognizantCognitiveAIM102-S_Cognizant_CognizantCognitive
AIM102-S_Cognizant_CognizantCognitive
 
Inawisdom IDP
Inawisdom IDPInawisdom IDP
Inawisdom IDP
 
Inawisdom MLOPS
Inawisdom MLOPSInawisdom MLOPS
Inawisdom MLOPS
 
Inawisdom Quick Sight
Inawisdom Quick SightInawisdom Quick Sight
Inawisdom Quick Sight
 
Inawsidom - Data Journey
Inawsidom - Data JourneyInawsidom - Data Journey
Inawsidom - Data Journey
 
Realizing_the_real_business_impact_of_gen_AI_white_paper.pdf
Realizing_the_real_business_impact_of_gen_AI_white_paper.pdfRealizing_the_real_business_impact_of_gen_AI_white_paper.pdf
Realizing_the_real_business_impact_of_gen_AI_white_paper.pdf
 
Gen AI Cognizant & AWS event presentation_12 Oct.pdf
Gen AI Cognizant & AWS event presentation_12 Oct.pdfGen AI Cognizant & AWS event presentation_12 Oct.pdf
Gen AI Cognizant & AWS event presentation_12 Oct.pdf
 
Inawisdom Overview - construction.pdf
Inawisdom Overview - construction.pdfInawisdom Overview - construction.pdf
Inawisdom Overview - construction.pdf
 
D3 IDP Slides.pdf
D3 IDP Slides.pdfD3 IDP Slides.pdf
D3 IDP Slides.pdf
 
C04 Driving understanding from Documents and unstructured data sources final.pdf
C04 Driving understanding from Documents and unstructured data sources final.pdfC04 Driving understanding from Documents and unstructured data sources final.pdf
C04 Driving understanding from Documents and unstructured data sources final.pdf
 
Fish Cam.pptx
Fish Cam.pptxFish Cam.pptx
Fish Cam.pptx
 
Ml 3 ways
Ml 3 waysMl 3 ways
Ml 3 ways
 
Palringo AWS London Summit 2017
Palringo AWS London Summit 2017Palringo AWS London Summit 2017
Palringo AWS London Summit 2017
 
Palringo : a startup's journey from a data center to the cloud
Palringo : a startup's journey from a data center to the cloudPalringo : a startup's journey from a data center to the cloud
Palringo : a startup's journey from a data center to the cloud
 
Machine learning at scale with aws sage maker
Machine learning at scale with aws sage makerMachine learning at scale with aws sage maker
Machine learning at scale with aws sage maker
 

Recently uploaded

Artificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraArtificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraDeakin University
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitecturePixlogix Infotech
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions
 
Science&tech:THE INFORMATION AGE STS.pdf
Science&tech:THE INFORMATION AGE STS.pdfScience&tech:THE INFORMATION AGE STS.pdf
Science&tech:THE INFORMATION AGE STS.pdfjimielynbastida
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsMiki Katsuragi
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024Scott Keck-Warren
 
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024BookNet Canada
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
costume and set research powerpoint presentation
costume and set research powerpoint presentationcostume and set research powerpoint presentation
costume and set research powerpoint presentationphoebematthew05
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Wonjun Hwang
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 3652toLead Limited
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 

Recently uploaded (20)

Artificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraArtificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning era
 
Hot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort Service
Hot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort ServiceHot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort Service
Hot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort Service
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC Architecture
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food Manufacturing
 
Science&tech:THE INFORMATION AGE STS.pdf
Science&tech:THE INFORMATION AGE STS.pdfScience&tech:THE INFORMATION AGE STS.pdf
Science&tech:THE INFORMATION AGE STS.pdf
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024
 
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
costume and set research powerpoint presentation
costume and set research powerpoint presentationcostume and set research powerpoint presentation
costume and set research powerpoint presentation
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
 
Vulnerability_Management_GRC_by Sohang Sengupta.pptx
Vulnerability_Management_GRC_by Sohang Sengupta.pptxVulnerability_Management_GRC_by Sohang Sengupta.pptx
Vulnerability_Management_GRC_by Sohang Sengupta.pptx
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 

ML Ops on AWS

  • 1. 15 APRIL 2021 Machine Learning Operations On AWS
  • 2. Who am I? Phil Basford, phil@inawisdom.com, @philipbasford. • Experienced principal solutions architect, lead developer, and head of practice at Inawisdom (#1 EMEA). • Holder of all 12 AWS certifications, including SA Pro, DevOps, and the Data Analytics and Machine Learning specialisms. • Over 6 years of AWS experience; I have been responsible for running production workloads of over 200 containers in a performance-critical system that responded to 18,000 requests per second. • A visionary in ML Ops: I have produced production workloads of ML models at scale, including 1,500 inferences per minute, with active monitoring and alerting. • I have developed in Python, NodeJS, and J2EE. • I am one of the Ipswich AWS User Group leaders and contribute to the AWS community by speaking at several summits, community days, and meet-ups. • Regular blogger, open-source contributor, and SME on machine learning, MLOps, DevOps, containers, and serverless. • I work for Inawisdom (an AWS Partner) as a principal solutions architect and head of practice, and I am Inawisdom's AWS APN Ambassador and evangelist.
  • 3. The AWS ML Stack: the broadest and most complete set of machine learning capabilities (vision, speech, text, search, chatbots, personalization, forecasting, fraud, development, contact centers). AI SERVICES: Amazon Rekognition, Amazon Polly, Amazon Transcribe +Medical, Amazon Comprehend +Medical, Amazon Translate, Amazon Lex, Amazon Personalize, Amazon Forecast, Amazon Fraud Detector, Amazon CodeGuru, Amazon Textract, Amazon Kendra, Contact Lens for Amazon Connect. ML SERVICES: SageMaker Studio IDE, Ground Truth, ML Marketplace, Neo, Augmented AI, built-in algorithms, Notebooks, Experiments, model training and tuning, Debugger, Autopilot, model hosting, Model Monitor. ML FRAMEWORKS AND INFRASTRUCTURE: Deep Learning AMIs and Containers, GPUs and CPUs, Elastic Inference, Inferentia, FPGA, Amazon SageMaker, Deep Graph Library. © 2020 Amazon Web Services, Inc. or its affiliates. All rights reserved.
  • 4. ML LIFE CYCLE. Define the problem and value. Data exploration: SageMaker Ground Truth, AWS Data Exchange, AWS 'Lake House', open data sets. Experiment: SageMaker Notebooks, SageMaker Autopilot, ML Marketplace. Testing and evaluation: SageMaker Debugger, SageMaker Experiments. Refinement: SageMaker Hyperparameter Tuning, SageMaker Notebooks. Inference: SageMaker Endpoints, SageMaker Batch Transform. Operationalize: SageMaker Model Monitor, AWS Step Functions Data Science SDK, SageMaker Pipelines.
  • 6. Operational Excellence: monitoring, observing, and alerting using CloudWatch and X-Ray; infrastructure as code with SAM and CloudFormation. Security: least privilege, data encryption at rest, and data encryption in transit using IAM policies, resource policies, KMS, Secrets Manager, VPCs, and security groups. Performance: elastic scaling based on demand and meeting response times using Auto Scaling, serverless, and per-request managed services. Cost Optimisation: serverless and fully managed services to lower TCO; tag every resource possible for cost analysis; right-size instance types for model hosting. Reliability: fault tolerance and auto-healing to meet a target availability using Auto Scaling, Multi-AZ, multi-Region, read replicas, and snapshots. https://d1.awsstatic.com/whitepapers/architecture/wellarchitected-Machine-Learning-Lens.pdf
  • 7. SERVERLESS. AWS Lambda is AWS's native, fully managed service for running application code without the need to run servers. API Gateway is the endpoint for your API; it has extensive security measures, logging, and API definition using OpenAPI/Swagger. DynamoDB is a fully managed NoSQL service from AWS; for machine learning it is typically used for reference data. S3 is highly durable object storage used for many things, including data lakes; for machine learning it stores training data sets and model artefacts. Also: SNS (pub/sub), SQS (queues), Fargate (containers), Step Functions (workflows), and more.
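As a rough sketch of how these serverless pieces fit together for inference, here is a hypothetical Lambda handler behind API Gateway that prepares an invoke_endpoint call to a SageMaker endpoint. The endpoint name, CSV payload shape, and event format are illustrative assumptions, and the actual boto3 call is left commented out so the routing logic stands alone:

```python
import json

def build_invoke_args(endpoint_name, features):
    """Build the keyword arguments for sagemaker-runtime's invoke_endpoint.

    The endpoint name and CSV payload shape are illustrative assumptions,
    not something prescribed by the deck.
    """
    return {
        "EndpointName": endpoint_name,
        "ContentType": "text/csv",
        "Body": ",".join(str(f) for f in features),
    }

def lambda_handler(event, context=None):
    """API Gateway proxy handler: extract features and prepare the model call.

    In a real Lambda you would create the client once, outside the handler:
        runtime = boto3.client("sagemaker-runtime")
        prediction = runtime.invoke_endpoint(**args)["Body"].read()
    The call is omitted here so the handler can be exercised offline.
    """
    body = json.loads(event["body"])
    args = build_invoke_args("credit-risk-endpoint", body["features"])
    return {"statusCode": 200, "body": json.dumps({"invoked": args["EndpointName"]})}
```

The handler stays thin on purpose: all model logic lives behind the endpoint, so the Lambda only validates input and shapes the request.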
  • 8. THE SOLUTION AND ARCHITECTURE
  • 9. SECURITY. Remember to always apply least privilege and other AWS security best practices, and be very protective of your data. AWS KMS: encrypt everything! If your data is PII or PCI-DSS, consider using a dedicated customer-managed key in KMS; this gives you tighter control by limiting the ability to decrypt data, providing another layer of security over S3. AWS IAM: SageMaker, like EC2, is granted access to other AWS services using IAM roles, so make sure your policies are locked down to only the actions and resources you need. Amazon S3: SageMaker can use a range of data stores, but S3 is the most popular; make sure you enable encryption, resource policies, logging, and versioning on your buckets. Amazon VPC: SageMaker can run outside a VPC and access data over the public internet (hopefully using HTTPS), which runs contrary to most corporate information security policies; deploy in a VPC with private links for extra security. Data: most importantly, only use the data you need. If the data contains PII or PCI-DSS values you do not need, remove or sanitise them.
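A sketch of how these controls appear in practice, as a fragment of a CreateTrainingJob request with the KMS, VPC, and IAM knobs from the slide filled in. All ARNs, IDs, and bucket names are placeholders, and required fields such as AlgorithmSpecification are omitted for brevity:

```python
def secure_training_job_fragment(job_name, role_arn, kms_key_id,
                                 subnets, security_group_ids):
    """Security-relevant fields of a SageMaker CreateTrainingJob request.

    This is a fragment, not a complete request; values are placeholders.
    """
    return {
        "TrainingJobName": job_name,
        "RoleArn": role_arn,                  # least-privilege IAM role
        "OutputDataConfig": {
            "S3OutputPath": "s3://my-bucket/models/",
            "KmsKeyId": kms_key_id,           # encrypt model artefacts at rest
        },
        "ResourceConfig": {
            "InstanceType": "ml.m5.xlarge",
            "InstanceCount": 1,
            "VolumeSizeInGB": 30,
            "VolumeKmsKeyId": kms_key_id,     # encrypt the training volume too
        },
        "VpcConfig": {                        # keep traffic off the public internet
            "Subnets": subnets,
            "SecurityGroupIds": security_group_ids,
        },
        "EnableNetworkIsolation": True,       # container gets no outbound network
    }
```

The same pattern (role, KMS key, VpcConfig) applies to processing jobs and endpoint models.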
  • 11. ML OPS: DevOps in machine learning. Components and technology considerations. New data available: structured, semi-structured, or unstructured; data updates / drift detection. ETL: Spark, EMR, Glue, Matillion. Data pre-processing: Spark, scikit-learn, containers, SageMaker Processing, including validation of data. Training: ML algorithms and frameworks, SageMaker training jobs. Verification: accuracy checks, golden data set testing, model debugging. Inference: batch or real time, via SageMaker Endpoints, SageMaker Batch Transform, ECS/Docker and functions, SageMaker Debugger. Monitoring: baselining / sampling predictions, model drift detection, model selection automation, using SageMaker Model Monitor and CloudWatch.
  • 12. ML OPS: DevOps in machine learning, recommended additions and potential changes. Triggers: new data features / data science changes (script mode); verified data available; data set used to train previously; CI/CD is used to build model code. Components: ETL, data pre-processing, training, verification, inference, monitoring. Roles and tools: data scientist, ML engineer, DevOps, source control, SageMaker Experiments and hyperparameter tuning jobs.
  • 14. TRAINING: optimising training to meet business needs (trade-offs: cost, effort, speed/time, complexity). Common issues: Ø Training takes too long! We need it to take hours, not days. Ø Training is costing lots of money and we are not sure all the resources are being fully utilised. Ø Our data set is too big and uses a lot of memory and network IO to process. Ø We need to train hundreds of models at the same time. Ø Client teams have limited experience in orchestration of training at scale. Techniques: Distributed training: split large amounts of data into chunks, train the chunks across many instances, then combine the outputs at the end. Multi-job training: used when a generalised model does not represent the characteristics of the data, or different hyperparameters are needed (e.g. per location or product group); this involves running multiple training processes for different data sets at the same time. Data parallelism: using many cores or instances to train algorithms like GPT-3 that have billions of parameters. Model parallelism: splitting up training for a model that uses a deep learning algorithm with dense and/or a large number of layers, because a single GPU cannot handle it. Pipe vs File mode: improving training times by loading data incrementally into models during training, instead of requiring a large amount of data to be downloaded before training can start.
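For illustration, Pipe mode and sharded data parallelism are both switched on through CreateTrainingJob fields. A minimal sketch, with the image URI, bucket, and instance choices as assumptions:

```python
def pipe_mode_training_fragment(channel_name, s3_uri, instance_count):
    """CreateTrainingJob fields enabling Pipe mode plus data sharding.

    Image URI, bucket, and instance type are illustrative placeholders;
    required fields such as RoleArn are omitted for brevity.
    """
    return {
        "AlgorithmSpecification": {
            "TrainingImage": "<algorithm-image-uri>",
            "TrainingInputMode": "Pipe",   # stream data instead of downloading first
        },
        "InputDataConfig": [{
            "ChannelName": channel_name,
            "DataSource": {"S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": s3_uri,
                # Each instance gets a shard of the objects: data parallelism
                "S3DataDistributionType": "ShardedByS3Key",
            }},
        }],
        "ResourceConfig": {
            "InstanceType": "ml.p3.2xlarge",
            "InstanceCount": instance_count,  # distributed training across instances
            "VolumeSizeInGB": 50,
        },
    }
```

With `FullyReplicated` instead of `ShardedByS3Key`, every instance would see the whole data set, which suits model parallelism rather than data parallelism.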
  • 16. ML OPS: INFERENCE TYPES. Real time ➤ business critical; common uses are chat bots, classifiers, recommenders, or linear regressors (credit risk, journey times, etc.) ➤ hundreds or thousands of individual predictions per second ➤ API driven with low latency, typically below 135 ms at the 90th percentile. Near real time ➤ commonly used for image classification or file analysis ➤ hundreds of individual predictions per minute, with processing completed within seconds ➤ event or message queue based; predictions are sent back or stored. Occasional ➤ examples are simple classifiers like tax codes ➤ only a few predictions a month, with processing completed within minutes ➤ API, event, or message queue based; predictions are sent back or stored. Batch ➤ end-of-month reporting, invoice generation, warranty plan management ➤ runs daily / monthly / at set times ➤ the data set is typically millions or tens of millions of rows at once. Micro batch ➤ anomaly detection, invoice approval, and image processing ➤ executed regularly (every X minutes or Y events), triggered by file upload or data ingestion ➤ the data set is typically hundreds or thousands of rows at once. Edge ➤ used for computer vision and fault detection in manufacturing ➤ runs on mobile phone apps and low-power devices, using sensors (video, location, heat) ➤ model output is normally sent back to the cloud at regular intervals for analysis.
  • 17. AMAZON SAGEMAKER: INFERENCE ENGINES. Docker containers host the inference engines; inference engines can be written in any language, and endpoints can use more than one container. The primary container needs to implement a simple REST API (http://localhost:8080/ping and http://localhost:8080/invocations), typically stacking Nginx, Gunicorn, and the model runtime; the model artefact model.tar.gz is unpacked to /opt/ml/model, and custom metadata travels in the X-Amzn-SageMaker-Custom-Attributes header. Common engines:
➤ 685385470294.dkr.ecr.eu-west-1.amazonaws.com/xgboost:1
➤ 520713654638.dkr.ecr.eu-west-1.amazonaws.com/sagemaker-tensorflow:1.11-cpu-py2
➤ 520713654638.dkr.ecr.eu-west-1.amazonaws.com/sagemaker-tensorflow:1.11-gpu-py2
➤ 763104351884.dkr.ecr.eu-west-1.amazonaws.com/tensorflow-inference:1.13-gpu
➤ 520713654638.dkr.ecr.eu-west-1.amazonaws.com/sagemaker-tensorflow-serving:1.11-cpu
Example Dockerfile:
FROM tensorflow/serving:latest
RUN apt-get update && apt-get install -y --no-install-recommends nginx git
RUN mkdir -p /opt/ml/model
COPY nginx.conf /etc/nginx/nginx.conf
ENTRYPOINT service nginx start | tensorflow_model_server --rest_api_port=8501 --model_config_file=/opt/ml/model/models.config
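The /ping and /invocations contract a primary container must expose on port 8080 can be reduced, for illustration, to a pure routing function; the "model" here is a stand-in average, not a real inference engine:

```python
import json

def handle_request(method, path, body=None):
    """Minimal sketch of the REST contract of a SageMaker primary container:
    GET /ping for health checks, POST /invocations for predictions.

    Returns (status_code, response_body). The averaging is a placeholder
    for a real model.predict() call.
    """
    if method == "GET" and path == "/ping":
        return 200, ""                          # healthy: any 2xx will do
    if method == "POST" and path == "/invocations":
        features = json.loads(body)
        score = sum(features) / len(features)   # placeholder inference
        return 200, json.dumps({"score": score})
    return 404, ""
```

In a real container this function would sit behind Nginx/Gunicorn (or TensorFlow Serving), but the routing contract is exactly this small.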
  • 18. AMAZON SAGEMAKER: REAL TIME INFERENCE. The logical components of an endpoint within Amazon SageMaker: an endpoint (reached via the SDKs using SigV4-signed REST requests) points at an endpoint configuration, which holds one or more production variants. Each production variant names a model, an instance type, an initial instance count, and a weight; each model wraps an inference engine (a primary container plus optional additional containers) together with its VPC, S3, KMS, and IAM settings. All components are immutable: any configuration change requires new models and endpoint configurations. However, there is a specific SageMaker API to update instance count and variant weight.
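A hedged sketch of an endpoint configuration with two weighted production variants, as described above; the names, instance types, and the 90/10 split are assumptions:

```python
def endpoint_config_with_variants(config_name, model_a, model_b):
    """CreateEndpointConfig request with two production variants, e.g. for
    an A/B or canary split. All names and sizes are illustrative."""
    return {
        "EndpointConfigName": config_name,
        "ProductionVariants": [
            {"VariantName": "variant-a", "ModelName": model_a,
             "InstanceType": "ml.m5.large", "InitialInstanceCount": 2,
             "InitialVariantWeight": 0.9},   # 90% of traffic
            {"VariantName": "variant-b", "ModelName": model_b,
             "InstanceType": "ml.m5.large", "InitialInstanceCount": 1,
             "InitialVariantWeight": 0.1},   # 10% canary
        ],
    }
```

Because endpoint configurations are immutable, rolling out a new model means creating a new configuration like this and calling UpdateEndpoint with it.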
  • 19. M5 INSTANCES WITH AUTOSCALING. The following shows the same experiment with M5 instances and autoscaling enabled. The autoscaling group was set to between 2 and 4 instances, with the scaling policy set to 100k requests. The number of invocations continued to rise and CPU never went above 100%. A scaling event happened at 08:45 and took 5 minutes to warm up. No instances crashed, and up to 4 instances were used.
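The experiment's scaling setup corresponds to two Application Auto Scaling calls, RegisterScalableTarget and PutScalingPolicy. A sketch of the request bodies using the slide's values (2 to 4 instances, a 100k invocations target); the endpoint and variant names are placeholders:

```python
def endpoint_autoscaling_requests(endpoint_name, variant_name,
                                  min_capacity=2, max_capacity=4,
                                  invocations_per_instance=100_000):
    """Build the RegisterScalableTarget and PutScalingPolicy request bodies
    for a SageMaker variant. Pass each to boto3's 'application-autoscaling'
    client in a real deployment."""
    resource_id = f"endpoint/{endpoint_name}/variant/{variant_name}"
    register = {
        "ServiceNamespace": "sagemaker",
        "ResourceId": resource_id,
        "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
        "MinCapacity": min_capacity,
        "MaxCapacity": max_capacity,
    }
    policy = {
        "PolicyName": f"{endpoint_name}-invocations-target",
        "ServiceNamespace": "sagemaker",
        "ResourceId": resource_id,
        "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingScalingPolicyConfiguration": {
            "TargetValue": invocations_per_instance,
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance",
            },
        },
    }
    return register, policy
```

Note the slide's observation about the 5-minute warm-up: target-tracking reacts to the invocation metric, so size the minimum capacity to absorb spikes that arrive faster than new instances can boot.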
  • 20. WHY IS CPU USAGE THAT IMPORTANT? The following chart compares the two M5-based experiments. Latency (red) increased when the CPU went over 100%; this is due to invocations having to wait within SageMaker to be processed. (Zzzzz, Phil does sleep!) The two M5 experiments had a cost of $42.96. SageMaker Studio was used instead of a SageMaker notebook instance.
  • 21. DEVOPS WITH SAGEMAKER. The following are the four ways to deploy new versions of models in Amazon SageMaker. Rolling: the default option; SageMaker starts new instances and, once they are healthy, stops the old ones. Canary: done using two variants (a canary variant and a full variant) in the endpoint configuration, performed over two CloudFormation updates. Blue/green: requires two CloudFormation stacks, then changing the endpoint name in the AWS Lambda using an environment variable. Linear: uses two variants (new and old) in the endpoint configuration, with an AWS Step Function and AWS Lambda calling the UpdateEndpointWeightsAndCapacities API to shift the weight.
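The linear and canary strategies hinge on UpdateEndpointWeightsAndCapacities. A minimal sketch of the request body, with the variant names as assumptions:

```python
def shift_canary_weight(endpoint_name, canary_weight):
    """Request body for SageMaker's UpdateEndpointWeightsAndCapacities,
    moving traffic toward the new variant without redeploying anything.
    Variant names are illustrative placeholders."""
    return {
        "EndpointName": endpoint_name,
        "DesiredWeightsAndCapacities": [
            {"VariantName": "new-variant", "DesiredWeight": canary_weight},
            {"VariantName": "old-variant", "DesiredWeight": 1.0 - canary_weight},
        ],
    }

# A linear rollout is then just a loop over increasing weights, typically
# driven by a Step Function with validation between steps:
#   for w in (0.1, 0.25, 0.5, 1.0): shift, validate metrics, continue or roll back.
```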
  • 23. ML OPS: A 360° view of cost optimisation for training and inference. Cost levers: change in instance size; change in instance type; note there are no RIs or Savings Plans for ML instances. (Cost breakdown chart, Feb 2020 to Jan 2021, daily/monthly/yearly: inference 57%, training 15%, notebooks 28%.) Top tips: ➤ Spot instances (surplus capacity from cloud providers) are cheaper for workloads that can handle being rerun, like batch or training; for longer execution times, consider spot instances with model checkpointing. ➤ Models that require GPUs for training justify additional consideration due to the use of more expensive instance types. ➤ For GPUs, analyse the utilisation of the GPU cores and memory; CPU and network IO all need looking at too. Make sure you feed the GPUs enough data without bottlenecking. ➤ Multi-model support allows more than one model to be hosted on the same instance. This is very efficient for hosting many small models (e.g. a model per city), as hosting one per instance would give poor resource utilisation.
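The spot-with-checkpointing tip maps to three CreateTrainingJob fields. A sketch with illustrative durations and a placeholder checkpoint path:

```python
def spot_training_fields(max_run_seconds=3600, max_wait_seconds=7200):
    """CreateTrainingJob fields enabling managed spot training with
    checkpointing. Durations and the S3 path are illustrative."""
    assert max_wait_seconds >= max_run_seconds, "wait time must cover the run"
    return {
        "EnableManagedSpotTraining": True,
        "StoppingCondition": {
            "MaxRuntimeInSeconds": max_run_seconds,
            # Caps total time spent waiting for spot capacity plus running.
            "MaxWaitTimeInSeconds": max_wait_seconds,
        },
        # Checkpoints let an interrupted spot job resume instead of restarting.
        "CheckpointConfig": {"S3Uri": "s3://my-bucket/checkpoints/"},
    }
```

The training script itself must write and reload checkpoints from the mounted checkpoint path for the resume to actually work.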
  • 24. KPIS AND MODEL MONITORING. Business performance and KPIs ➤ the most important measure of a model is whether it accomplishes what it set out to achieve ➤ this is judged by setting clear KPIs and measuring how the model affects them ➤ this can be done in a number of ways, but one of the simplest and most impactful is constructing a dashboard in a BI tool like QuickSight. Model performance ➤ SageMaker Model Monitor can be used to baseline a model and detect drift ➤ another important aspect to monitor is that predictions are within known boundaries ➤ performance monitoring of the model can trigger retraining when issues arise.
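The point about keeping predictions "within known boundaries" can be sketched as a tiny guardrail that complements Model Monitor; the thresholds here are illustrative assumptions, not values from the deck:

```python
def within_known_bounds(prediction, lower, upper):
    """True if a single prediction falls inside the range observed during
    training/validation. Bounds are assumed to come from offline analysis."""
    return lower <= prediction <= upper

def boundary_violation_rate(predictions, lower, upper):
    """Fraction of sampled predictions outside the known range. A rising
    rate is a cheap drift signal that can trigger an alert or retraining."""
    if not predictions:
        return 0.0
    bad = sum(1 for p in predictions if not within_known_bounds(p, lower, upper))
    return bad / len(predictions)
```

Published as a custom CloudWatch metric, the violation rate plugs straight into the alarm patterns shown later in the deck.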
  • 25. PERFORMANCE MONITORING: an AWS CloudWatch dashboard providing complete oversight of the inference process, covering API error and success rates, API Gateway response times using percentiles, Lambda executions, availability recorded from a health checker, and API usage data per usage plan.
  • 26. OBSERVING INFERENCE: X-Ray traces can help you spot bottlenecks and costly areas of the code, including inside your models. (Trace map: API Gateway URL to the inference function, fanning out to downstream functions, SQL calls via the database URL, and the model itself.)
  • 27. MONITORING SAGEMAKER. Amazon SageMaker exposes metrics to AWS CloudWatch, which can then be used for alarms:
Endpoint model latency (milliseconds): Average > 100 for 5 minutes; missing data: ignore.
Endpoint model invocations (count): Sum > 10,000 for 15 minutes (missing data: notBreaching); Sum < 1,000 (missing data: breaching).
Endpoint disk usage (%): Average > 90% for 15 minutes, warning at > 80%; missing data: ignore.
Endpoint CPU usage (%): Average > 90% for 15 minutes, warning at > 80%; missing data: ignore.
Endpoint memory usage (%): Average > 90% for 15 minutes, warning at > 80%; missing data: ignore.
Endpoint 5XX errors (count): Sum > 10 for 5 minutes; missing data: notBreaching.
Endpoint 4XX errors (count): Sum > 50 for 5 minutes.
Tips: ➤ always pay attention to how to handle missing data ➤ always test your alarms ➤ look to level your alarms ➤ make your alarms complement each other.
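The first row of the table translates roughly into the following PutMetricAlarm request. Note that CloudWatch reports SageMaker's ModelLatency in microseconds, so 100 ms becomes 100,000; the endpoint and variant names are placeholders:

```python
def model_latency_alarm(endpoint_name, variant_name):
    """CloudWatch PutMetricAlarm request for average ModelLatency > 100 ms
    over 5 minutes, with missing data ignored (per the table)."""
    return {
        "AlarmName": f"{endpoint_name}-model-latency",
        "Namespace": "AWS/SageMaker",
        "MetricName": "ModelLatency",
        "Dimensions": [
            {"Name": "EndpointName", "Value": endpoint_name},
            {"Name": "VariantName", "Value": variant_name},
        ],
        "Statistic": "Average",
        "Period": 300,                    # one 5-minute evaluation window
        "EvaluationPeriods": 1,
        "Threshold": 100_000,             # 100 ms expressed in microseconds
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "ignore",     # the 'missing data' column of the table
    }
```

The other rows differ only in metric name, statistic, threshold, and TreatMissingData value, so one builder per row (or a small table of parameters) covers the whole alarm set.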
  • 29. AUTOMATION AND PIPELINES: using automation and tools to deploy models and to maintain consistency across environments (experiments, development, pre-production, production), on top of a data foundation and governance and control. Infrastructure foundations: ➤ a solid data lake/warehouse with good sources of data is required for long-term scaling of ML usage ➤ running models operationally also means considering availability, fault tolerance, and scaling of instances ➤ having a robust security posture using multiple layers with auditability is essential ➤ consistent architecture, development approaches, and deployments aid maintainability. Scaling and refinement: ➤ did your models improve, or do they still meet, the outcomes and KPIs that you set out to affect? ➤ have innovations in technology meant that complexity in development or deployment can be simplified, allowing more focus on other uses of ML? ➤ are your models running on the latest and most optimal hardware? ➤ do you need a feature store to improve collaboration and sharing of features? ➤ do you need a model registry for control and governance?
  • 30. MODEL RETRAINING with the AWS Step Functions Data Science SDK. AWS Glue: used for raw data ingress, cleaning that data, and then transforming it into a training data set. AWS Lambda: used to stitch elements together and perform any additional logic. AWS ECS/Fargate: there are situations where you may need to run very long processes over the data to prepare it for training; Lambda is not suitable for this due to its maximum execution time and memory limits, so Fargate is preferred in these situations. Amazon SageMaker training jobs: the ability to run training on the data that the pipeline has prepared for you. Deployments to Amazon SageMaker endpoints: the ability to perform deployments from the pipeline, including blue/green, linear, and canary style updates.
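A pipeline like this can also be expressed directly in Amazon States Language. A skeleton sketch of a Glue-then-train state machine using the synchronous service integrations; the job names, ARNs, bucket, and instance sizes are placeholders:

```python
def retraining_state_machine(training_image, role_arn):
    """ASL definition (as a dict, ready for json.dumps) for a minimal
    retraining pipeline: a Glue ETL job, then a SageMaker training job,
    both run synchronously via Step Functions service integrations."""
    return {
        "StartAt": "PrepareData",
        "States": {
            "PrepareData": {
                "Type": "Task",
                "Resource": "arn:aws:states:::glue:startJobRun.sync",
                "Parameters": {"JobName": "prep-training-data"},
                "Next": "TrainModel",
            },
            "TrainModel": {
                "Type": "Task",
                "Resource": "arn:aws:states:::sagemaker:createTrainingJob.sync",
                "Parameters": {
                    # Use the execution name so each run gets a unique job name.
                    "TrainingJobName.$": "$$.Execution.Name",
                    "RoleArn": role_arn,
                    "AlgorithmSpecification": {
                        "TrainingImage": training_image,
                        "TrainingInputMode": "File",
                    },
                    "OutputDataConfig": {"S3OutputPath": "s3://my-bucket/models/"},
                    "ResourceConfig": {"InstanceType": "ml.m5.xlarge",
                                       "InstanceCount": 1, "VolumeSizeInGB": 30},
                    "StoppingCondition": {"MaxRuntimeInSeconds": 3600},
                },
                "End": True,
            },
        },
    }
```

The Data Science SDK mentioned in the slide generates definitions of this shape from Python objects; deployment and endpoint-update steps slot in after TrainModel in the same way.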
  • 31. REFERENCES. re:Invent and webinar:
➤ https://pages.awscloud.com/GLOBAL-PTNR-OE-IPC-AIML-Inawisdom-Oct-2019-reg-event.html
➤ https://www.youtube.com/watch?v=lx9fP_4yi2s
My blogs:
➤ https://www.inawisdom.com/machine-learning/amazon-sagemaker-endpoints-inference/
➤ https://www.inawisdom.com/machine-learning/machine-learning-performance-more-than-skin-deep/
➤ https://www.inawisdom.com/machine-learning/a-model-is-for-life-not-just-for-christmas/
Other:
➤ https://docs.aws.amazon.com/sagemaker/latest/dg/your-algorithms.html
➤ https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/AlarmThatSendsEmail.html#alarms-and-missing-data
➤ https://aws-step-functions-data-science-sdk.readthedocs.io/en/latest/readmelink.html#getting-started-with-sample-jupyter-notebooks
  • 33. 020 3575 1337 info@inawisdom.com Columba House, Adastral Park, Martlesham Heath Ipswich, Suffolk, IP5 3RE www.inawisdom.com @philipbasford