PHIL BASFORD – HEAD OF SOLUTION ENGINEERING
Wednesday, 15 July 2020
PRODUCTIONISING MACHINE
LEARNING WITH SAGEMAKER
@philipbasford
ML OPS
Featuring Aramex
© 2023 Cognizant | Private
November 2023
2
Cognizant’s UK&I specialist AWS Data & AI Team
INAWISDOM
Founded in 2016, an AWS Partner since 2017 and Premier
Partner since 2019. Inawisdom was acquired by Cognizant in
2020 and is part of Cognizant’s UK&I Consulting.
Inawisdom lives and breathes AWS, holding over 180
AWS certifications and accreditations. Inawisdom maintains a
close relationship with the AWS team, supporting and staying
up to date with all the latest developments.
Inawisdom has been awarded in the following
areas:
► ML Partner of the Year 2020
► Differentiation Partner of the Year 2019
► Global Launch Partner – CCI
► Launch Partner – AWS UAE Region
Inawisdom holds 9 competencies and service designations, reflecting business-
wide expertise in key areas:
Our Qualifications
All of our consultants hold at least one
AWS certification, and some
consultants hold all certifications
Our CTO has been ranked #1 AWS
Ambassador in EMEA in 2021 and
2022
Your Data, AI and Machine Learning partner
WHY COGNIZANT
We offer a rapid,
proven path to
Machine Learning
excellence
We help customers in
a broad range of
industries achieve
their ML goals
We are recognised,
all-in AWS experts,
focused on customer
success
We offer full-stack
services, including AI /
ML, Data & Analytics,
BI and MLOps
Full-stack capability
OUR SERVICES
Business
Differentiation / Value
Data Driven
Business Decisions
Cloud Transformation
Adoption and Scale
Digital
Enablement
AI and Machine Learning
Data & Analytics
Data Foundations
Cloud Infrastructure
Landing Zone, Control Tower, migration
Discover. Deliver. Productionise. Scale.
ACCELERATE AI/ML ADOPTION ON AWS
[Diagram phases: Scaled AI, Productionise, Operationalise, Discovery]
[Diagram steps, as labelled: Deploy, Differentiate, Innovate, Prove Value, Business Case, Deploy, Embed, Automate, Enable]
Data Science
AI/ML
Data
Engineering
& Platform
DevOps &
MLOps
Cloud/Data
Architecture
6
INTRODUCTION
Aramex is an international express, mail delivery and
logistics services company based in Dubai, United Arab
Emirates, shipping internationally and regionally.
Aramex’s current focus is on Digital Transformation to
improve their Last Mile and customer experience. Big data,
analytics and Machine Learning are at the heart of this
journey.
7
Aramex are undertaking a Digital Transformation to improve their user experience,
using Machine Learning and powered by a rich Data Lake.
THE VISION TO IMPROVE THE LAST MILE
Address Prediction
In the Middle East there is a
lack of defined addresses.
Aramex is therefore using
Machine Learning to identify a
location from descriptive text
Transit Time
Aramex is using Machine
Learning to predict the amount
of time both international and
domestic shipments take
Consignee Profiling
Aramex is using Machine
Learning to better rate the
likelihood of successful delivery
at a location at certain times of
day. This is helping to reduce
costs by lowering the number of
delivery attempts
8
Aramex faced the following issues before using AWS and engaging with Inawisdom
THE CHALLENGE
Data Access: Aramex has a number of relational OLTP databases at the
heart of their business, which are hosted on a series of on-premises Microsoft
SQL Servers. Using these databases for insights, analytics and data science
was problematic, as any additional load on the databases degraded the
service to their operations business.
Model Training: Aramex’s data centres were built for their e-commerce and
operations business. They were not built for data science: they do not have
servers with GPUs and they cannot distribute training over tens or
hundreds of servers depending on need.
Impaired Innovation: Due to the constraints Aramex data centres imposed on
their business, they could not readily experiment with different approaches,
evaluate them, and evolve the architectures underneath to adapt to changing
needs.
SAGEMAKER ECOSYSTEM
10
Define use cases and drive value through outcome-
focused delivery
ML AND DATA VALUE FLYWHEEL
Realise
Maintain
Evolve & Scale
Data Sources
Embed within
business &
visualise
Structured, Semi-
Structured and
Unstructured data from
Internal, External, and
other sources
Get stakeholder
commitment, build a
roadmap around value and
start the first flywheel for the
highest impacting but
deliverable use case
Discover
Business Case
Creation, Exploratory
Data Analysis, &
Target Opportunity
Definition
Use Cases
Prioritise & Value
Business + Data Strategy,
Ideation for descriptive to
predictive to prescriptive use
cases, and Opportunity Scoring
Prove
Experiment
and show
potential value
Improve
model(s),
refine data
products &
create MVP
Deliver value to
the business
Maintain value
to the
business
Data & MLOps,
maintain data &
models with
automation and
pipelines
24/7 monitoring,
Incident Response,
& Cost Optimisation
Respond to changes
and detect drift
Scale up and refine
capabilities to accelerate
the delivery of value with
more & faster flywheels
Improve reuse and
collaboration using
tooling such as a
model registry and
a Business Data
Catalogue
Standardise
approaches to
common problems,
provide governance
Change business processes
and refine operating model
to be data-driven with easy
access to insights
POV with initial,
data products,
features creation &
model selection
Measure &
Iterate
Measure each iteration of the
flywheel against CSFs / KPIs
and only invest in further
iterations as needed
Roadmap
Value
11
Operational Excellence: Monitoring, observing and alerting using CloudWatch and X-Ray.
Infrastructure as Code with SAM and CloudFormation.
Security: Least privilege, data encryption at rest, and data encryption in transit using
IAM policies, resource policies, KMS, Secrets Manager, VPC and Security Groups.
Performance: Elastic scaling based on demand and meeting response times using Auto Scaling,
serverless, and per-request managed services.
Cost Optimisation: Serverless and fully managed services to lower TCO. Resource-tag
everything possible for cost analysis. Right-size instance types for model hosting.
Reliability: Fault tolerance and auto-healing to meet a target availability using Auto Scaling,
Multi-AZ, Multi-Region, read replicas and snapshots.
https://d1.awsstatic.com/whitepapers/architecture/wellarchitected-Machine-Learning-Lens.pdf
12
SERVERLESS
Lambda: AWS Lambda is AWS’s native, fully managed cloud service for running
application code without the need to run servers.
API Gateway: API Gateway is the endpoint for your API. It has extensive security
measures, logging, and API definition using OpenAPI or Swagger.
DynamoDB: DynamoDB is a fully managed NoSQL database service from AWS. For machine
learning it is typically used for reference data.
S3: Highly durable object storage used for many things, including data lakes. For
machine learning it is used to store training data sets and model artefacts.
...and more:
SNS: Pub/Sub
SQS: Queues
Fargate: Containers
Step Functions: Workflows
13
THE SOLUTION AND ARCHITECTURE
AMAZON SAGEMAKER
REAL TIME INFERENCE (HOSTING)
15
Logical components of an endpoint within Amazon SageMaker
AMAZON SAGEMAKER – REAL TIME INFERENCE
All components are immutable: any configuration change requires a new model and endpoint configuration.
However, there is a specific SageMaker API to update the instance count and variant weight.
Endpoint
Configuration
Endpoint
Inference Engine + Model
Primary Container
Container
Container
VPC
S3
KMS + IAM
Inference Engine + Model
Primary Container
Container
Container
VPC
S3
KMS + IAM
Production Variant
Production Variant
Model
Initial
Count + Weight
Instance Type
SDKs
REST
SigV4
Requests
Name
16
Endpoint
Docker containers host the inference engines. Inference engines can be written in any language, and endpoints can use
more than one container. The primary container needs to implement a simple REST API.
Common Engines:
➤ 685385470294.dkr.ecr.eu-west-1.amazonaws.com/xgboost:1
➤ 520713654638.dkr.ecr.eu-west-1.amazonaws.com/sagemaker-tensorflow:1.11-cpu-py2
➤ 520713654638.dkr.ecr.eu-west-1.amazonaws.com/sagemaker-tensorflow:1.11-gpu-py2
➤ 763104351884.dkr.ecr.eu-west-1.amazonaws.com/tensorflow-inference:1.13-gpu
➤ 520713654638.dkr.ecr.eu-west-1.amazonaws.com/sagemaker-tensorflow-serving:1.11-cpu
AMAZON SAGEMAKER – INFERENCE ENGINES
Dockerfile:
FROM tensorflow/serving:latest
RUN apt-get update && apt-get install -y --no-install-recommends nginx git
RUN mkdir -p /opt/ml/model
COPY nginx.conf /etc/nginx/nginx.conf
ENTRYPOINT service nginx start | tensorflow_model_server --rest_api_port=8501 --model_config_file=/opt/ml/model/models.config
Container
http://localhost:8080/invocations
http://localhost:8080/ping
Amazon
SageMaker model.tar.gz
Primary Container
Nginx Gunicorn Model
Runtime
link
/opt/ml/model
X-Amzn-SageMaker-Custom-Attributes
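The REST contract that the primary container has to satisfy can be sketched with the Python standard library alone. This is a minimal illustration rather than the server used in the deck, which fronts the model with Nginx and Gunicorn: it answers GET /ping for health checks and POST /invocations for predictions on port 8080. The `predict` function and the `instances` / `predictions` JSON shape are assumptions standing in for a real model loaded from /opt/ml/model.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def predict(rows):
    """Hypothetical stand-in for a real model: returns the sum of each row."""
    return [sum(row) for row in rows]


def route(method, path, body=b""):
    """Map a request to (status_code, response_bytes) for the two SageMaker routes."""
    if method == "GET" and path == "/ping":
        # Health check: an empty 200 response tells SageMaker the container is up.
        return 200, b""
    if method == "POST" and path == "/invocations":
        try:
            rows = json.loads(body)["instances"]
            return 200, json.dumps({"predictions": predict(rows)}).encode()
        except (ValueError, KeyError, TypeError):
            return 400, b'{"error": "expected JSON with an instances list"}'
    return 404, b""


class Handler(BaseHTTPRequestHandler):
    """Thin HTTP wrapper around route()."""

    def _reply(self, method):
        length = int(self.headers.get("Content-Length") or 0)
        status, payload = route(method, self.path, self.rfile.read(length))
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)

    def do_GET(self):
        self._reply("GET")

    def do_POST(self):
        self._reply("POST")


def serve(port=8080):
    """Run the server; SageMaker sends traffic to port 8080 inside the container."""
    HTTPServer(("0.0.0.0", port), Handler).serve_forever()
```

Keeping the routing in a plain function makes the contract easy to unit test before the image is ever built.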
17
Using Docker immediately raises the following questions:
➤ How many Docker containers are run on a single underlying EC2 instance?
➤ Is Kubernetes or ECS used? And do I have to become a Docker expert?
➤ How quickly are instances started and stopped?
➤ How do instances reside within the VPC and use network resources? For example, can
the number of instances exhaust the network addresses of a VPC?
➤ How isolated are my models, given that Docker uses soft CPU and memory units?
➤ Will I suffer issues if containers are bin-packed or redistributed?
IMPLICATIONS OF USING DOCKER
18
In order to answer these questions a series of experiments were carried out
THE EXPERIMENT
Available addresses per AZ:
AZ          After VPC Creation   After Notebook Instance Creation   After Endpoint Creation
EU-West-1a  4091                 4090                               4090
EU-West-1b  4091                 4091                               4090
EU-West-1c  4091                 4091                               4090
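The figures above can be checked with simple arithmetic. A count of 4091 is what an empty /20 subnet would show, because AWS reserves five addresses in every subnet; the /20 size is inferred from the numbers and is not stated on the slides. In this experiment each notebook or endpoint instance placed in a subnet then took one address. The sketch below also shows how the same counts could be read from the `AvailableIpAddressCount` field returned by EC2's `describe_subnets`. The helper names and the example CIDR are illustrative.

```python
import ipaddress

AWS_RESERVED_PER_SUBNET = 5  # AWS keeps the first four addresses and the last one


def usable_addresses(cidr):
    """Addresses available in an empty AWS subnet with the given CIDR block."""
    return ipaddress.ip_network(cidr).num_addresses - AWS_RESERVED_PER_SUBNET


def available_by_az(subnets):
    """Sum AvailableIpAddressCount per AZ from describe_subnets()['Subnets']."""
    totals = {}
    for subnet in subnets:
        az = subnet["AvailabilityZone"]
        totals[az] = totals.get(az, 0) + subnet["AvailableIpAddressCount"]
    return totals


def fetch_available_by_az(vpc_id):
    """Query EC2 for the live counts (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the helpers above work without it

    ec2 = boto3.client("ec2")
    response = ec2.describe_subnets(Filters=[{"Name": "vpc-id", "Values": [vpc_id]}])
    return available_by_az(response["Subnets"])
```

Running `fetch_available_by_az` before and after each creation step would reproduce the kind of comparison shown in the table.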
Endpoint Creation:
import boto3

sm_client = boto3.client("sagemaker")

# `role` (an IAM execution role ARN) and `endpoint_config_name` are assumed
# to be defined earlier; the security group and subnet IDs are placeholders.
primary_container = {
    "Image": "685385470294.dkr.ecr.eu-west-1.amazonaws.com/xgboost:1",
    "ModelDataUrl": "s3://mybucket/mymodel/output/model.tar.gz",
}
create_model_response = sm_client.create_model(
    ModelName="load-test",
    ExecutionRoleArn=role,
    PrimaryContainer=primary_container,
    VpcConfig={
        "SecurityGroupIds": ["My SecurityGroupId"],
        "Subnets": ["Subnet Id 1b", "Subnet Id 1c"],
    },
)
create_endpoint_config_response = sm_client.create_endpoint_config(
    EndpointConfigName=endpoint_config_name,
    ProductionVariants=[{
        "InstanceType": "ml.t2.medium",
        "InitialInstanceCount": 2,
        "InitialVariantWeight": 1,
        "ModelName": "load-test",
        "VariantName": "AllTraffic",
    }],
)
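The snippet for this slide stops at the endpoint configuration. A minimal sketch of the remaining steps (creating the endpoint, waiting for it to come into service, and sending a request) might look like the following. The CSV payload format is an assumption, chosen because the built-in XGBoost container accepts text/csv. The clients are passed in as arguments so the functions can be exercised without AWS access.

```python
def create_and_wait(sm_client, endpoint_name, endpoint_config_name):
    """Create an endpoint from an existing configuration and block until it is InService."""
    sm_client.create_endpoint(
        EndpointName=endpoint_name,
        EndpointConfigName=endpoint_config_name,
    )
    waiter = sm_client.get_waiter("endpoint_in_service")
    waiter.wait(EndpointName=endpoint_name)


def invoke_csv(runtime_client, endpoint_name, features):
    """Send one CSV row to the endpoint and return the raw response body as text."""
    response = runtime_client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="text/csv",
        Body=",".join(str(value) for value in features),
    )
    return response["Body"].read().decode("utf-8")
```

With real credentials the two clients would be `boto3.client("sagemaker")` and `boto3.client("sagemaker-runtime")` respectively.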
20
CPU usage in AWS CloudWatch during a load test experiment
RESULTS
At 13:20 we saw the start of a drop
in the CPU usage and at 13:40 it
stopped at 100%. Why was this?
From the load script I configured,
we know that this is when
serverless-artillery entered the second
phase of sustained load.
There was a slow ramp-up for the
first 15 minutes until we hit around the
200% CPU usage mark. The 200%
CPU usage means we are using
more than the capacity of a single
endpoint instance.
We then saw a return to 200%
CPU usage 10 minutes later. At
14:40 we saw a complete stop in
load, and this is when the
serverless-artillery job completed.
Support Email
25
The following shows the results of the same experiment run on M5 instances:
M5 INSTANCES
Again we hit around the 200%
CPU usage mark. The 200% CPU
usage means we are using more
than the capacity of a single
endpoint instance.
This time no instance crashed and
the same two instances were used
during the entire experiment
There is a strong relationship
between invocations and CPU for
this XGBoost model.
27
The following shows the same experiment with M5 instances and autoscaling enabled:
M5 INSTANCES WITH AUTOSCALING
Autoscaling was set to between
2 and 4 instances, with the
scaling policy at 100k requests.
The number of invocations
continued to rise and CPU never
went above 100%.
A scaling event happened at 08:45
and took 5 minutes to warm up.
Again no instance crashed and up
to 4 instances were used.
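An autoscaling setup like this one is configured through Application Auto Scaling rather than through SageMaker itself. The sketch below builds the two requests involved. The deck does not say which metric the policy tracked, so the predefined `SageMakerVariantInvocationsPerInstance` metric is an assumption here, as are the policy name and the helper names.

```python
def scaling_requests(endpoint_name, variant_name, min_instances=2,
                     max_instances=4, invocations_per_instance=100_000):
    """Build kwargs for register_scalable_target and put_scaling_policy."""
    common = {
        "ServiceNamespace": "sagemaker",
        "ResourceId": f"endpoint/{endpoint_name}/variant/{variant_name}",
        "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
    }
    target = dict(common, MinCapacity=min_instances, MaxCapacity=max_instances)
    policy = dict(
        common,
        PolicyName=f"{endpoint_name}-invocations",  # hypothetical naming scheme
        PolicyType="TargetTrackingScaling",
        TargetTrackingScalingPolicyConfiguration={
            "TargetValue": float(invocations_per_instance),
            "PredefinedMetricSpecification": {
                # Assumed metric: invocations per instance, per minute.
                "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance",
            },
        },
    )
    return target, policy


def apply_scaling(autoscaling_client, endpoint_name, variant_name):
    """Register the variant as a scalable target and attach the policy."""
    target, policy = scaling_requests(endpoint_name, variant_name)
    autoscaling_client.register_scalable_target(**target)
    autoscaling_client.put_scaling_policy(**policy)
```

With real credentials this would be called as `apply_scaling(boto3.client("application-autoscaling"), "load-test", "AllTraffic")`.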
28
The following chart compares the two M5 based experiments:
WHY IS CPU USAGE THAT IMPORTANT?
Latency (red) increased when the
CPU went over 100%. This is due
to invocations having to wait
within SageMaker to be processed
Zzzzz, Phil does sleep!
The two M5 experiments had a
cost of $42.96
SageMaker Studio was used
instead of a SageMaker notebook
instance.
29
It is important to right-size your ML workload to make sure you pay only for what you need. Also be
very careful with GPUs
COST OPTIMISATION
Change in
Instance Size
Change in
Instance Type
No RI or Savings
Plans for ML
AMAZON SAGEMAKER
ML OPS
31
AWS Step Functions Data Science Software Development Kit
ML OPS: MODEL RETRAINING AND DEPLOYMENT
AWS Glue: Used for raw data ingress, cleaning that
data and then transforming that data into a training
data set
Deployments to Amazon SageMaker
endpoints: The ability to perform deployments from
the pipeline, including blue/green, linear and canary
style updates.
AWS Lambda: Used to stitch elements together and
perform any additional logic
AWS ECS/Fargate: There are situations where you
may need to run very long-running processes over
the data to prepare it for training. Lambda is not
suitable for this due to its maximum execution time
and memory limits, so Fargate is preferred in
these situations.
Amazon SageMaker training jobs: The ability to
run training on the data that the pipeline has got
ready for you.
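To make the shape of such a pipeline concrete, here is a rough sketch of a retraining workflow. The slide names the AWS Step Functions Data Science SDK; this example instead builds the underlying Amazon States Language definition as a plain Python dictionary, so it needs no extra library. The state names, the function name and the three-step layout (Glue, then training, then a Lambda that deploys) are assumptions, and the caller supplies the full CreateTrainingJob parameters.

```python
import json


def retraining_definition(glue_job, training_params, deploy_lambda):
    """Amazon States Language definition: Glue ETL -> SageMaker training -> Lambda."""
    return {
        "StartAt": "PrepareData",
        "States": {
            "PrepareData": {
                "Type": "Task",
                # ".sync" makes Step Functions wait for the job to finish.
                "Resource": "arn:aws:states:::glue:startJobRun.sync",
                "Parameters": {"JobName": glue_job},
                "Next": "TrainModel",
            },
            "TrainModel": {
                "Type": "Task",
                "Resource": "arn:aws:states:::sagemaker:createTrainingJob.sync",
                "Parameters": training_params,
                "Next": "DeployModel",
            },
            "DeployModel": {
                "Type": "Task",
                "Resource": "arn:aws:states:::lambda:invoke",
                "Parameters": {"FunctionName": deploy_lambda, "Payload.$": "$"},
                "End": True,
            },
        },
    }


def to_json(definition):
    """Serialise the definition for use when creating the state machine."""
    return json.dumps(definition, indent=2)
```

A Fargate step for long-running data preparation could be slotted in ahead of the training state in the same way.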
32
The following are the four ways to deploy new versions of models in Amazon SageMaker
DEV OPS WITH SAGEMAKER
[Diagrams: an Endpoint Configuration for each deployment style, showing Canary Variant, Full Variant, New Variant and Old Variant, with a weight shifting traffic between variants]
Rolling: The default option. SageMaker starts new instances and, once they are healthy,
stops the old ones.
Canary: Canary deployments are done using two Variants in the Endpoint Configuration and
are performed over two CloudFormation updates.
Blue/Green: Requires two CloudFormation stacks and then changing the endpoint name in the
AWS Lambda function using an environment variable.
Linear: Linear uses two Variants in the Endpoint Configuration, with an AWS Step Function
and AWS Lambda calling the UpdateEndpointWeightsAndCapacities API.
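The linear style can be sketched in a few lines of boto3. This is an illustration under assumptions, not the deck's implementation: the step count, the pause and the function names are made up, and a plain sleep stands in for the wait that a Step Functions workflow would provide. A real implementation would also wait for the endpoint to return to InService between updates.

```python
import time


def linear_weights(steps):
    """Yield (old_weight, new_weight) pairs that move traffic in equal increments."""
    for i in range(1, steps + 1):
        new_weight = i / steps
        yield 1 - new_weight, new_weight


def shift_traffic(sm_client, endpoint_name, old_variant, new_variant,
                  steps=4, pause_seconds=300):
    """Move traffic from old_variant to new_variant one increment at a time."""
    for old_weight, new_weight in linear_weights(steps):
        sm_client.update_endpoint_weights_and_capacities(
            EndpointName=endpoint_name,
            DesiredWeightsAndCapacities=[
                {"VariantName": old_variant, "DesiredWeight": old_weight},
                {"VariantName": new_variant, "DesiredWeight": new_weight},
            ],
        )
        # Placeholder wait; a Step Functions Wait state could do this instead.
        time.sleep(pause_seconds)
```

The same call with a single small weight on the new variant gives a canary-style split without redeploying the endpoint.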
33
Amazon SageMaker exposes metrics to AWS CloudWatch
MONITORING SAGEMAKER
Name | Dimension | Statistic | Threshold | Time Period | Missing
Endpoint model latency | Milliseconds | Average | > 100 | For 5 minutes | ignore
Endpoint model invocations | Count | Sum | > 10000 | For 15 minutes | notBreaching
Endpoint model invocations | Count | Sum | < 1000 | For 15 minutes | breaching
Endpoint disk usage | % | Average | > 90% and > 80% | For 15 minutes | ignore
Endpoint CPU usage | % | Average | > 90% and > 80% | For 15 minutes | ignore
Endpoint memory usage | % | Average | > 90% and > 80% | For 15 minutes | ignore
Endpoint 5XX errors | Count | Sum | > 10 | For 5 minutes | notBreaching
Endpoint 4XX errors | Count | Sum | > 50 | For 5 minutes |
The metrics in AWS CloudWatch
can then be used for alarms:
➤ Always pay attention to how to
handle missing data
➤ Always test your alarms
➤ Look to level your alarms
➤ Make your alarms complement
each other
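As a sketch of how one row of the table could become an alarm, the helper below builds the arguments for CloudWatch's `put_metric_alarm` for model latency. CloudWatch reports the SageMaker `ModelLatency` metric in microseconds, so the 100 ms threshold is converted. The alarm name and the choice of five one-minute evaluation periods are assumptions.

```python
def latency_alarm(endpoint_name, variant_name, threshold_ms=100, minutes=5):
    """Build put_metric_alarm kwargs for average model latency on one variant."""
    return {
        "AlarmName": f"{endpoint_name}-model-latency",  # hypothetical naming scheme
        "Namespace": "AWS/SageMaker",
        "MetricName": "ModelLatency",
        "Dimensions": [
            {"Name": "EndpointName", "Value": endpoint_name},
            {"Name": "VariantName", "Value": variant_name},
        ],
        "Statistic": "Average",
        "Period": 60,  # seconds per datapoint
        "EvaluationPeriods": minutes,  # assumed: breach must last this many minutes
        # CloudWatch reports ModelLatency in microseconds, so convert from ms.
        "Threshold": threshold_ms * 1000,
        "ComparisonOperator": "GreaterThanThreshold",
        # Matches the "Missing" column: missing datapoints are ignored.
        "TreatMissingData": "ignore",
    }
```

With real credentials the alarm would be created with `boto3.client("cloudwatch").put_metric_alarm(**latency_alarm("load-test", "AllTraffic"))`.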
34
X-Ray traces can help you spot bottlenecks and costly areas of the code, including inside your models.
OBSERVING SAGEMAKER
Inference Function
Inference Function
Function A
Function B
Function C
Function C
Function D
Function E
Function F
Function G
Function H
APIGWUrl
Model
Function 1
Function 2
SQL: db_url
Model
35
Remember to always apply least privilege and other AWS security best practices, and be very protective of your data
SECURITY AND SAGEMAKER
AWS KMS: Encrypt everything! If your data is PII or PCI-DSS, consider
using a dedicated custom key in KMS to do this. This gives you tighter control by
limiting the ability to decrypt data, providing another layer of security over S3.
AWS IAM: SageMaker, like EC2, is granted access to other AWS services using IAM
roles, and you need to make sure your policies are locked down to only the Actions
and Resources you need.
Amazon S3: SageMaker can use a range of data stores, but S3 is the most
popular. Please make sure you enable encryption, resource policies,
logging and versioning on your buckets.
Amazon VPC: SageMaker can run outside a VPC and access data over the public
internet (hopefully using HTTPS). This runs contrary to most corporate Information
Security Policies. Therefore please deploy in a VPC with PrivateLink for extra security.
Data: Most importantly, only use the data you need. If the data contains PII or
PCI-DSS values that you do not need, remove or sanitise them.
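To illustrate what "locked down to only the Actions and Resources you need" can mean in practice, here is a minimal sketch of a policy document for an execution role that may read one S3 prefix and decrypt with one KMS key. The function name and its parameters are hypothetical, and a real role would usually need further statements (for example for logging and ECR).

```python
import json


def execution_role_policy(bucket, prefix, kms_key_arn):
    """Least-privilege policy document: read one S3 prefix, decrypt with one key."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                # Only objects under the given prefix, not the whole bucket.
                "Resource": [f"arn:aws:s3:::{bucket}/{prefix}/*"],
            },
            {
                "Effect": "Allow",
                "Action": ["kms:Decrypt"],
                # Only the dedicated key, never a wildcard.
                "Resource": [kms_key_arn],
            },
        ],
    }


def to_policy_json(policy):
    """Serialise the document in the form IAM expects."""
    return json.dumps(policy, indent=2)
```

The point of the design is that neither statement uses a wildcard action or a wildcard resource ARN.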
36
Aramex achieved the following results with Inawisdom, AWS and Matillion
THE OUTCOME
Business Results:
Aramex has seen a 74 percent increase
in the accuracy of their transit time
predictions because of the machine
learning models developed on AWS with
Inawisdom.
Aramex has improved its contact center
efficiency with the Inawisdom solution,
eliminating 40 percent of inbound
customer calls related to shipments.
Aramex got workloads live in 8 weeks
with Inawisdom where previously they had
struggled for 5 years.
Technical Results:
A data pipeline that ingests 1.2 million
updates every 15 minutes, with 70
orchestration jobs. 50 of these are for
the incremental load from SQL Server
to Redshift.
Storing over 7.2 TB of data,
comprising 3 months in hot storage
and 7.5 years queryable from long-
term storage using Redshift Spectrum.
Over 20 million predictions per month,
averaging 600 requests per minute, with
daily peaks of 800 requests per minute,
achieving response times of 135ms at the
90th percentile vs originally 2500ms.
37
REFERENCES
re:Invent and Webinar:
➤ https://pages.awscloud.com/GLOBAL-PTNR-OE-IPC-AIML-Inawisdom-Oct-2019-reg-event.html
➤ https://www.youtube.com/watch?v=lx9fP_4yi2s
My blogs:
➤ https://www.inawisdom.com/machine-learning/amazon-sagemaker-endpoints-inference/
➤ https://www.inawisdom.com/machine-learning/machine-learning-performance-more-than-skin-deep/
➤ https://www.inawisdom.com/machine-learning/a-model-is-for-life-not-just-for-christmas/
Other:
➤ https://docs.aws.amazon.com/sagemaker/latest/dg/your-algorithms.html
➤ https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/AlarmThatSendsEmail.html#alarms-and-missing-data
➤ https://aws-step-functions-data-science-sdk.readthedocs.io/en/latest/readmelink.html#getting-started-with-sample-jupyter-notebooks
020 3575 1337
info@inawisdom.com
Columba House,
Adastral Park, Martlesham Heath
Ipswich, Suffolk, IP5 3RE
www.inawisdom.com
@inawisdom Inawisdom
phil@Inawisdom.com
+44 20 8133 8349
Thank you
Phil Basford
Senior Director – Consulting / Inawisdom CTO
Philip.basford@cognizant.com
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...HostedbyConfluent
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersThousandEyes
 
Artificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraArtificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraDeakin University
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptxLBM Solutions
 

Recently uploaded (20)

Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Snow Chain-Integrated Tire for a Safe Drive on Winter Roads
Snow Chain-Integrated Tire for a Safe Drive on Winter RoadsSnow Chain-Integrated Tire for a Safe Drive on Winter Roads
Snow Chain-Integrated Tire for a Safe Drive on Winter Roads
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
 
Artificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraArtificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning era
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping Elbows
 
Vulnerability_Management_GRC_by Sohang Sengupta.pptx
Vulnerability_Management_GRC_by Sohang Sengupta.pptxVulnerability_Management_GRC_by Sohang Sengupta.pptx
Vulnerability_Management_GRC_by Sohang Sengupta.pptx
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptx
 

Inawisdom MLOPS

  • 1. PHIL BASFORD – HEAD OF SOLUTION ENGINEERING Wednesday, 15 July 2020 PRODUCTIONISING MACHINE LEARNING WITH SAGEMAKER @philipbasford ML OPS Featuring Aramex © 2023 Cognizant | Private November 2023
  • 2. Cognizant’s UK&I specialist AWS Data & AI Team INAWISDOM Founded in 2016, an AWS Partner since 2017 and Premier Partner since 2019. Inawisdom was acquired by Cognizant in 2020 and is part of Cognizant’s UK&I Consulting. Inawisdom lives and breathes AWS, holding over 180 AWS certifications and accreditations. Inawisdom maintains a close relationship with the AWS team, supporting and staying up to date with all the latest developments. Inawisdom has been awarded in the following areas: ► ML Partner of the Year 2020 ► Differentiation Partner of the Year 2019 ► Global Launch Partner – CCI ► Launch Partner – AWS UAE Region Inawisdom holds 9 competencies and service designations, reflecting business-wide expertise in key areas: Our Qualifications All of our consultants hold at least 1 AWS certification, and some hold all certifications. Our CTO has been ranked #1 AWS Ambassador in EMEA in 2021 and 2022
  • 3. Your Data, AI and Machine Learning partner WHY COGNIZANT We offer a rapid, proven path to Machine Learning excellence We help customers in a broad range of industries achieve their ML goals We are recognised, all-in AWS experts, focused on customer success We offer full-stack services, including AI / ML, Data & Analytics, BI and MLOps
  • 4. Full-stack capability OUR SERVICES Business Differentiation / Value Data Driven Business Decisions Cloud Transformation Adoption and Scale Digital Enablement AI and Machine Learning Data & Analytics Data Foundations Cloud Infrastructure Landing Zone, Control Tower, migration
  • 5. Discover. Deliver. Productionise. Scale. ACCELERATE AI/ML ADOPTION ON AWS Deploy Scaled AI Productionise Operationalise Discovery Differentiate Innovate Prove Value Business Case Deploy Embed Automate Enable Data Science AI/ML Data Engineering & Platform DevOps & MLOps Cloud/Data Architecture
  • 6. INTRODUCTION Aramex is an international express, mail delivery and logistics services company based in Dubai, United Arab Emirates, shipping internationally and regionally. Aramex’s current focus is on Digital Transformation to improve their Last Mile and customer experience. Big data, analytics and Machine Learning are at the heart of this journey.
  • 7. THE VISION TO IMPROVE THE LAST MILE Aramex are undertaking a Digital Transformation to improve their user experience, using Machine Learning and powered by a rich Data Lake. Address Prediction: In the Middle East there is a lack of defined addresses. Aramex is therefore using Machine Learning to identify a location from descriptive text. Transit Time: Aramex is using Machine Learning to predict the amount of time both international and domestic shipments take. Consignee Profiling: Aramex is using Machine Learning to better rate the likelihood of successful delivery at a location at certain times of day. This is helping to reduce costs by lowering delivery attempts.
  • 8. THE CHALLENGE Aramex faced the following issues before using AWS and engaging with Inawisdom. Data Access: Aramex has a number of relational OLTP databases at the heart of their business, hosted on a series of on-premises Microsoft SQL Servers. Using these databases for insights, analytics and data science was problematic, as any additional load on the databases degraded the service to their operations business. Model Training: Aramex’s data centres were built for their e-commerce and operations business. They were not built for data science: they do not have servers with GPUs and they are not able to distribute training over tens or hundreds of servers depending on need. Impaired Innovation: Due to the constraints Aramex’s data centres imposed on their business, they could not readily experiment with different approaches, evaluate them, and evolve the underlying architectures to adapt to changing needs.
  • 10. 10 Define use cases and drive value by outcome focused delivery ML AND DATA VALUE FLYWHEEL Realise Maintain Evolve & Scale Data Sources Embed within business & visualise Structured, Semi- Structured and Unstructured data from Internal, External, and other sources Get stake holder commitment, build a roadmap around value and start the first flywheel for the highest impacting but deliverable use case Discover Business Case Creation, Exploratory Data Analysis, & Target Opportunity Definition Use Cases Prioritise & Value Business + Data Strategy, Ideation for descriptive to predictive to prescriptive use cases, and Opportunity Scoring Prove Experiment and show potential value Improve model(s), refine data products & create MVP Deliver value to the business Maintain value to the business Data & MLOps, maintain data & models with automation and pipelines 24/7 monitoring, Incident Response, & Cost Optimisation Respond to changes and detect drift Scale up and refine capabilities to accelerate the delivery of value with more & faster flywheels Improve reuse and collaboration using tooling such as a model registry and a Business Data Catalogue Standardise approaches to common problems, provide governance Change business processes and refine operating model to be data-driven with easy access to insights POV with initial, data products, features creation & model selection Measure & Iterate Measure each iteration of the flywheel against CSFs / KPIs and only invest in further iterations as needed Roadmap Value
  • 11. Operational Excellence: Monitoring, observing and alerting using CloudWatch and X-Ray. Infrastructure as Code with SAM and CloudFormation. Security: Least privilege, data encryption at rest, and data encryption in transit using IAM policies, resource policies, KMS, Secrets Manager, VPC and Security Groups. Performance: Elastic scaling based on demand and meeting response times using Auto Scaling, serverless, and per-request managed services. Cost Optimisation: Serverless and fully managed services to lower TCO. Tag every resource possible for cost analysis. Right-size instance types for model hosting. Reliability: Fault tolerance and auto healing to meet a target availability using Auto Scaling, Multi-AZ, Multi-Region, Read Replicas and Snapshots. https://d1.awsstatic.com/whitepapers/architecture/wellarchitected-Machine-Learning-Lens.pdf
  • 12. SERVERLESS Lambda: AWS Lambda is AWS’s native and fully managed cloud service for running application code without the need to run servers. API Gateway: API Gateway is the endpoint for your API; it has extensive security measures, logging, and API definition using OpenAPI or Swagger. DynamoDB: A fully managed NoSQL cloud service from AWS. For machine learning it is typically used for reference data. S3: Highly durable object storage used for many things including data lakes. For machine learning it is used to store training data sets and model artefacts. SNS: Pub + Sub. SQS: Queues. Fargate: Containers. Step Functions: Workflows. ...and more
  • 13. 13 THE SOLUTION AND ARCHITECTURE
  • 14. AMAZON SAGEMAKER REAL TIME INFERENCE (HOSTING)
  • 15. AMAZON SAGEMAKER – REAL TIME INFERENCE Logical components of an endpoint within Amazon SageMaker. All components are immutable: any configuration change requires a new model and a new endpoint configuration. However, there is a specific SageMaker API to update instance count and variant weight. Endpoint Configuration Endpoint Inference Engine + Model Primary Container Container Container VPC S3 KMS + IAM Inference Engine + Model Primary Container Container Container VPC S3 KMS + IAM Production Variant Production Variant Model Initial Count + Weight Instance Type SDKs REST SignV4 Requests Name
  • 16. AMAZON SAGEMAKER – INFERENCE ENGINES Endpoint Docker containers host the inference engines. Inference engines can be written in any language, and endpoints can use more than one container. The primary container needs to implement a simple REST API. Common Engines: ➤ 685385470294.dkr.ecr.eu-west-1.amazonaws.com/xgboost:1 ➤ 520713654638.dkr.ecr.eu-west-1.amazonaws.com/sagemaker-tensorflow:1.11-cpu-py2 ➤ 520713654638.dkr.ecr.eu-west-1.amazonaws.com/sagemaker-tensorflow:1.11-gpu-py2 ➤ 763104351884.dkr.ecr.eu-west-1.amazonaws.com/tensorflow-inference:1.13-gpu ➤ 520713654638.dkr.ecr.eu-west-1.amazonaws.com/sagemaker-tensorflow-serving:1.11-cpu Dockerfile:

      FROM tensorflow/serving:latest
      RUN apt-get update && apt-get install -y --no-install-recommends nginx git
      RUN mkdir -p /opt/ml/model
      COPY nginx.conf /etc/nginx/nginx.conf
      ENTRYPOINT service nginx start | tensorflow_model_server --rest_api_port=8501 --model_config_file=/opt/ml/model/models.config

    Container http://localhost:8080/invocations http://localhost:8080/ping Amazon SageMaker model.tar.gz Primary Container Nginx Gunicorn Model Runtime link /opt/ml/model X-Amzn-SageMaker-Custom-Attributes
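The "simple REST API" the primary container has to implement can be sketched with nothing but the Python standard library. This is only an illustration of the two routes named on the slide (/ping and /invocations on port 8080); the predict function is a hypothetical stand-in for a real model, and a production container would normally sit behind Nginx and Gunicorn as the slide's diagram suggests.

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


def predict(payload: bytes) -> dict:
    """Hypothetical stand-in for a real model: counts the CSV fields it receives."""
    fields = payload.decode("utf-8").strip().split(",")
    return {"n_features": len(fields)}


class InferenceHandler(BaseHTTPRequestHandler):
    def _reply(self, status: int, body: bytes = b"") -> None:
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def do_GET(self):
        # Health check route: answer 200 on /ping, 404 on anything else.
        self._reply(200 if self.path == "/ping" else 404)

    def do_POST(self):
        # Inference route: read the request body and return the prediction as JSON.
        if self.path != "/invocations":
            self._reply(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        result = predict(self.rfile.read(length))
        self._reply(200, json.dumps(result).encode("utf-8"))

    def log_message(self, *args):
        # Keep the example quiet; a real container would log requests.
        pass


def serve(port: int = 8080) -> HTTPServer:
    """Start the server in a background thread and return it."""
    server = HTTPServer(("0.0.0.0", port), InferenceHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

Calling serve() and then posting a CSV row to http://localhost:8080/invocations exercises the same contract that the hosting layer relies on.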
  • 17. IMPLICATIONS OF USING DOCKER Using Docker immediately raises the following questions: ➤ How many Docker containers are run on a single underlying EC2 instance? ➤ Is Kubernetes or ECS used? And do I have to become a Docker expert? ➤ How quickly are instances started and stopped? ➤ How do instances reside within the VPC and use network resources? For example, can the number of instances exhaust the network addresses of a VPC? ➤ How isolated are my models, given that Docker uses soft CPU and memory units? ➤ Will I suffer issues if containers are bin packed or re-distributed?
  • 18. THE EXPERIMENT In order to answer these questions a series of experiments were carried out. Available addresses per AZ:

      AZ          After VPC creation  After notebook instance creation  After endpoint creation
      EU-West-1a  4091                4090                              4090
      EU-West-1b  4091                4091                              4090
      EU-West-1c  4091                4091                              4090

    Endpoint Creation:

      import boto3

      sm_client = boto3.client("sagemaker")
      # role (the execution role ARN) and endpoint_config_name are assumed to be defined earlier.

      primary_container = {
          "Image": "685385470294.dkr.ecr.eu-west-1.amazonaws.com/xgboost:1",
          "ModelDataUrl": "s3://mybucket/mymodel/output/model.tar.gz",
      }

      # Register the model and place it in two subnets of the VPC.
      create_model_response = sm_client.create_model(
          ModelName="load-test",
          ExecutionRoleArn=role,
          PrimaryContainer=primary_container,
          VpcConfig={
              "SecurityGroupIds": ["My SecurityGroupId"],
              "Subnets": ["Subnet Id 1b", "Subnet Id 1c"],
          },
      )

      # One production variant with two instances receiving all traffic.
      create_endpoint_config_response = sm_client.create_endpoint_config(
          EndpointConfigName=endpoint_config_name,
          ProductionVariants=[{
              "InstanceType": "ml.t2.medium",
              "InitialInstanceCount": 2,
              "InitialVariantWeight": 1,
              "ModelName": "load-test",
              "VariantName": "AllTraffic",
          }],
      )
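The address counts in the experiment can be reproduced with a little arithmetic. The sketch below assumes each subnet is a /20 (the CIDR 10.0.0.0/20 is a hypothetical example, chosen because it is consistent with the 4091 figure) and relies on the fact that AWS reserves five addresses in every subnet; each network interface attached afterwards then consumes one more.

```python
import ipaddress

# AWS reserves the first four addresses and the last address of every subnet.
AWS_RESERVED_PER_SUBNET = 5


def available_addresses(cidr: str, interfaces_in_use: int = 0) -> int:
    """Addresses still free in a subnet after AWS reservations and attached interfaces."""
    subnet = ipaddress.ip_network(cidr)
    return subnet.num_addresses - AWS_RESERVED_PER_SUBNET - interfaces_in_use


# Hypothetical /20 subnet: 4096 addresses in total.
print(available_addresses("10.0.0.0/20"))                       # freshly created subnet
print(available_addresses("10.0.0.0/20", interfaces_in_use=1))  # after one interface is attached
```

Under these assumptions the drop from 4091 to 4090 in a subnet corresponds to a single interface being attached there, which is the pattern the experiment observed.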
  • 19. RESULTS The CPU usage in AWS CloudWatch for a load test experiment. At 13:20 we saw the start of a drop in the CPU usage and at 13:40 it stopped at 100%. Why was this? From the load script I configured we know that this is when serverless-artillery entered the 2nd phase of sustained load. There was a slow ramp up for the first 15 minutes until we hit around the 200% CPU usage mark. The 200% CPU usage means we are using more than the capacity of a single endpoint instance. We then saw a return to 200% CPU usage 10 minutes later. At 14:40 we saw a complete stop in load, and this is when the serverless-artillery job completed.
  • 21. M5 INSTANCES The following shows the results of the same experiment run on M5 instances. Again we hit around the 200% CPU usage mark. The 200% CPU usage means we are using more than the capacity of a single endpoint instance. This time no instance crashed and the same two instances were used during the entire experiment. There is a strong relationship between invocations and CPU for this XGBoost model.
  • 22. M5 INSTANCES WITH AUTOSCALING The following shows the same experiment with M5 instances and autoscaling enabled. The autoscaling group was set between 2-4 instances and the scaling policy to 100k requests. The number of invocations continued to rise and CPU never went above 100%. A scaling event happened at 08:45 and took 5 minutes to warm up. Again no instance crashed and up to 4 instances were used.
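A configuration like the one described can be expressed through the Application Auto Scaling API. The sketch below only builds the request arguments; reading "100k requests" as the target value of the predefined SageMakerVariantInvocationsPerInstance metric is an assumption, as are the endpoint, variant and policy names.

```python
def scaling_requests(endpoint_name, variant_name,
                     min_instances=2, max_instances=4,
                     target_invocations=100_000.0):
    """Build arguments for register_scalable_target and put_scaling_policy."""
    common = {
        "ServiceNamespace": "sagemaker",
        "ResourceId": f"endpoint/{endpoint_name}/variant/{variant_name}",
        "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
    }
    target = {**common, "MinCapacity": min_instances, "MaxCapacity": max_instances}
    policy = {
        **common,
        "PolicyName": f"{variant_name}-invocations-target",  # hypothetical name
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingScalingPolicyConfiguration": {
            # Assumed interpretation of the slide's "100k requests".
            "TargetValue": target_invocations,
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
            },
        },
    }
    return target, policy


def apply_scaling(client, target, policy):
    """Send both requests; client is boto3.client("application-autoscaling")."""
    client.register_scalable_target(**target)
    client.put_scaling_policy(**policy)
```

Keeping the request construction separate from the API calls makes it possible to review or unit test the scaling limits before anything is applied to a live endpoint.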
  • 23. WHY IS CPU USAGE THAT IMPORTANT? The following chart compares the two M5 based experiments. Latency (red) increased when the CPU went over 100%. This is due to invocations having to wait within SageMaker to be processed. Zzzzz, Phil does sleep! The two M5 experiments had a cost of $42.96. SageMaker Studio was used instead of a SageMaker notebook instance.
  • 24. COST OPTIMISATION It is important to right-size your ML workload to make sure you pay for only what you need. Also be very careful with GPUs. Change in Instance Size Change in Instance Type No RI or Savings Plans for ML
  • 26. ML OPS: MODEL RETRAINING AND DEPLOYMENT AWS Step Functions Data Science Software Development Kit. AWS Glue: Used for raw data ingress, cleaning that data and then transforming it into a training data set. AWS ECS/Fargate: There are situations where you may need to run very long-running processes over the data to prepare it for training. Lambda is not suitable for this due to its maximum execution time and memory limits, therefore Fargate is preferred in these situations. Amazon SageMaker training jobs: The ability to run training on the data that the pipeline has prepared. Deployments to Amazon SageMaker endpoints: The ability to perform deployments from the pipeline, including blue/green, linear and canary style updates. AWS Lambda: Used to stitch elements together and perform any additional logic.
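The shape of such a retraining pipeline can be sketched as an Amazon States Language definition written as a plain Python dictionary. This is not the Data Science SDK itself, only an illustration of how the Glue, SageMaker training and Lambda steps chain together; the state names are hypothetical, and the training job parameters are left for the caller to supply.

```python
import json


def retraining_state_machine(glue_job_name, training_job_params, deploy_lambda_arn):
    """Return a three-step state machine: prepare data, train, then deploy."""
    return {
        "Comment": "Retrain and deploy a model (illustrative sketch)",
        "StartAt": "PrepareData",
        "States": {
            "PrepareData": {
                "Type": "Task",
                # Run the Glue job and wait for it to finish.
                "Resource": "arn:aws:states:::glue:startJobRun.sync",
                "Parameters": {"JobName": glue_job_name},
                "Next": "TrainModel",
            },
            "TrainModel": {
                "Type": "Task",
                # Start a SageMaker training job and wait for completion.
                "Resource": "arn:aws:states:::sagemaker:createTrainingJob.sync",
                "Parameters": training_job_params,
                "Next": "DeployModel",
            },
            "DeployModel": {
                "Type": "Task",
                # A Lambda function performs the endpoint update.
                "Resource": "arn:aws:states:::lambda:invoke",
                "Parameters": {"FunctionName": deploy_lambda_arn, "Payload.$": "$"},
                "End": True,
            },
        },
    }


def state_order(definition):
    """Walk the Next pointers from StartAt and return the state names in order."""
    order, name = [], definition["StartAt"]
    while name is not None:
        order.append(name)
        name = definition["States"][name].get("Next")
    return order
```

Serialising the result with json.dumps gives a definition that could be passed to Step Functions once real job names, parameters and ARNs are filled in.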
  • 27. DEV OPS WITH SAGEMAKER The following are the four ways to deploy new versions of models in Amazon SageMaker. Rolling: The default option. SageMaker will start new instances and then, once they are healthy, stop the old ones. Canary: Canary deployments are done using two Variants in the Endpoint Configuration and performed over two CloudFormation updates. Blue/Green: Requires two CloudFormation stacks and then changing the endpoint name in the AWS Lambda using an Environment Variable. Linear: Linear uses two Variants in the Endpoint Configuration, with an AWS Step Function and AWS Lambda calling the UpdateEndpointWeightsAndCapacities API. Endpoint Configuration Canary Variant Full Variant Endpoint Configuration Full Variant Endpoint Configuration Full Variant Endpoint Configuration Full Variant Endpoint Configuration New Variant Old Variant weight
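The linear strategy amounts to calling UpdateEndpointWeightsAndCapacities repeatedly with a gradually shifting pair of weights. A minimal sketch of that schedule follows; the variant names, the step count and the fixed instance count are assumptions, and in practice each step would be triggered by the Step Function with a health check in between.

```python
def linear_weight_schedule(old_variant, new_variant, steps, instance_count):
    """Return one DesiredWeightsAndCapacities list per step of a linear shift."""
    if steps < 1:
        raise ValueError("steps must be at least 1")
    schedule = []
    for i in range(1, steps + 1):
        new_weight = i / steps
        schedule.append([
            {"VariantName": old_variant,
             "DesiredWeight": 1.0 - new_weight,
             "DesiredInstanceCount": instance_count},
            {"VariantName": new_variant,
             "DesiredWeight": new_weight,
             "DesiredInstanceCount": instance_count},
        ])
    return schedule


def shift_traffic(sm_client, endpoint_name, desired):
    """Apply one step; sm_client is boto3.client("sagemaker")."""
    sm_client.update_endpoint_weights_and_capacities(
        EndpointName=endpoint_name,
        DesiredWeightsAndCapacities=desired,
    )
```

With four steps the new variant receives 25%, 50%, 75% and finally 100% of the traffic, and the last step leaves the old variant with a weight of zero so it can be removed in a later endpoint configuration.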
  • 28. MONITORING SAGEMAKER Amazon SageMaker exposes metrics to AWS CloudWatch. The metrics in AWS CloudWatch can then be used for alarms:

      Name                        Dimension     Statistic  Threshold  Time Period     Missing
      Endpoint model latency      Milliseconds  Average    > 100      For 5 minutes   ignore
      Endpoint model invocations  Count         Sum        > 10000    For 15 minutes  notBreaching
                                                           < 1000                     breaching
      Endpoint disk usage         %             Average    > 90%      For 15 minutes  ignore
                                                           > 80%
      Endpoint CPU usage          %             Average    > 90%      For 15 minutes  ignore
                                                           > 80%
      Endpoint memory usage       %             Average    > 90%      For 15 minutes  ignore
                                                           > 80%
      Endpoint 5XX errors         Count         Sum        > 10       For 5 minutes   notBreaching
      Endpoint 4XX errors         Count         Sum        > 50       For 5 minutes

    ➤ Always pay attention to how to handle missing data ➤ Always test your alarms ➤ Look to level your alarms ➤ Make your alarms complement each other
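As an illustration, the 5XX error alarm from the table could be created through the CloudWatch PutMetricAlarm API. The sketch below builds the request arguments for that one row; the alarm name and the SNS topic used as the alarm action are hypothetical, and the other rows would follow the same pattern with their own metric, threshold and missing-data setting.

```python
def five_xx_alarm(endpoint_name, variant_name, sns_topic_arn):
    """Build put_metric_alarm arguments for more than 10 5XX errors in 5 minutes."""
    return {
        "AlarmName": f"{endpoint_name}-5xx-errors",  # hypothetical naming scheme
        "Namespace": "AWS/SageMaker",
        "MetricName": "Invocation5XXErrors",
        "Dimensions": [
            {"Name": "EndpointName", "Value": endpoint_name},
            {"Name": "VariantName", "Value": variant_name},
        ],
        "Statistic": "Sum",
        "Period": 300,               # one 5 minute window
        "EvaluationPeriods": 1,
        "Threshold": 10,
        "ComparisonOperator": "GreaterThanThreshold",
        # No data means no invocations failed, so do not alarm on gaps.
        "TreatMissingData": "notBreaching",
        "AlarmActions": [sns_topic_arn],
    }


def create_alarm(cloudwatch_client, alarm):
    """Create the alarm; cloudwatch_client is boto3.client("cloudwatch")."""
    cloudwatch_client.put_metric_alarm(**alarm)
```

The TreatMissingData value is the part most worth reviewing per alarm, since an idle endpoint emits no error data points at all.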
  • 29. OBSERVING SAGEMAKER X-Ray traces can help you spot bottlenecks and costly areas of the code, including inside your models. Inference Function Inference Function Function A Function B Function C Function C Function D Function E Function F Function G Function H APIGWUrl Model Function 1 Function 2 SQL: db_url Model
  • 30. SECURITY AND SAGEMAKER Remember to always apply least privilege and other AWS security best practice, and be very protective of your data. AWS KMS: Encrypt everything! However, if your data is PII or PCI-DSS then consider using a dedicated custom key in KMS to do this. This allows you tighter control by limiting the ability to decrypt data, providing another layer of security over S3. AWS IAM: SageMaker, like EC2, is granted access to other AWS services using IAM roles, and you need to make sure your policies are locked down to only the Actions and Resources you need. Amazon S3: SageMaker can use a range of data stores, but S3 is the most popular. Please make sure you enable encryption, resource policies, logging and versioning on your buckets. Amazon VPC: SageMaker can run outside a VPC and access data over the public internet (hopefully using HTTPS). This runs contrary to most corporate Information Security Policies. Therefore please deploy in a VPC with PrivateLink for extra security. Data: Most importantly, only use the data you need. If the data contains PII or PCI-DSS values that you do not need, then remove or sanitise them.
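Locking a role down to only the Actions and Resources it needs can be sketched as an IAM policy document built in Python. The bucket, prefix and key ARN below are hypothetical, and a real execution role would usually need a few more statements (for logging and image pulls, for example), so treat this as the shape of a least-privilege policy rather than a complete one.

```python
import json


def execution_policy(bucket, prefix, kms_key_arn):
    """Allow reading one S3 prefix and decrypting with one dedicated KMS key."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "ReadModelArtefacts",
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                # Only objects under the given prefix, not the whole bucket.
                "Resource": f"arn:aws:s3:::{bucket}/{prefix}/*",
            },
            {
                "Sid": "DecryptWithDedicatedKey",
                "Effect": "Allow",
                "Action": ["kms:Decrypt"],
                # Only the dedicated key, never a wildcard.
                "Resource": kms_key_arn,
            },
        ],
    }


policy = execution_policy(
    "mybucket", "mymodel/output",
    "arn:aws:kms:eu-west-1:123456789012:key/example-key-id",  # hypothetical key
)
print(json.dumps(policy, indent=2))
```

Generating the document from parameters keeps wildcards out of both the Action and Resource fields by construction.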
  • 31. THE OUTCOME Aramex achieved the following results with Inawisdom, AWS and Matillion. Business Results: Aramex has seen a 74 percent increase in the accuracy of their transit time predictions because of the machine learning models developed on AWS with Inawisdom. Aramex has improved its contact center efficiency with the Inawisdom solution, eliminating 40 percent of inbound customer calls related to shipments. Aramex got workloads live in 8 weeks with Inawisdom where previously they had struggled for 5 years. Technical Results: A data pipeline that ingests 1.2 million updates every 15 minutes. 70 orchestration jobs, 50 of which are for the incremental load from SQL Server to Redshift. Storing over 7.2 TB of data, comprising 3 months in hot storage and 7.5 years queryable from long-term storage using Redshift Spectrum. Over 20 million predictions per month, averaging 600 requests per minute, with daily peaks of 800 requests per minute. Achieving response times of 135ms at the 90th percentile vs originally 2500ms.
  • 32. REFERENCES re:Invent and Webinar: ➤ https://pages.awscloud.com/GLOBAL-PTNR-OE-IPC-AIML-Inawisdom-Oct-2019-reg-event.html ➤ https://www.youtube.com/watch?v=lx9fP_4yi2s My blogs: ➤ https://www.inawisdom.com/machine-learning/amazon-sagemaker-endpoints-inference/ ➤ https://www.inawisdom.com/machine-learning/machine-learning-performance-more-than-skin-deep/ ➤ https://www.inawisdom.com/machine-learning/a-model-is-for-life-not-just-for-christmas/ Other: ➤ https://docs.aws.amazon.com/sagemaker/latest/dg/your-algorithms.html ➤ https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/AlarmThatSendsEmail.html#alarms-and-missing-data ➤ https://aws-step-functions-data-science-sdk.readthedocs.io/en/latest/readmelink.html#getting-started-with-sample-jupyter-notebooks
  • 33. 020 3575 1337 info@inawisdom.com Columba House, Adastral Park, Martlesham Heath Ipswich, Suffolk, IP5 3RE www.inawisdom.com @inawisdom Inawisdom phil@Inawisdom.com +44 20 8133 8349 Thank you © 2023 Cognizant | Private Phil Basford Senior Director – Consulting / Inawisdom CTO Philip.basford@cognizant.com 23