2.
Cognizant’s UK&I specialist AWS Data & AI Team
INAWISDOM
Founded in 2016, an AWS Partner since 2017 and a Premier
Partner since 2019. Inawisdom was acquired by Cognizant in
2020 and is part of Cognizant's UK&I Consulting.
Inawisdom lives and breathes AWS, holding over 180
AWS certifications and accreditations. Inawisdom maintains a
close relationship with the AWS team, supporting and staying
up to date with all the latest developments.
Inawisdom has been awarded in the following
areas:
► ML Partner of the Year 2020
► Differentiation Partner of the Year 2019
► Global Launch Partner – CCI
► Launch Partner – AWS UAE Region
Inawisdom holds 9 competencies and service designations, reflecting business-wide expertise in key areas:
Our Qualifications
All of our consultants hold at least
one AWS certification, and some
consultants hold every certification
Our CTO has been ranked #1 AWS
Ambassador in EMEA in 2021 and
2022
3. Your Data, AI and Machine Learning partner
WHY COGNIZANT
We offer a rapid,
proven path to
Machine Learning
excellence
We help customers in
a broad range of
industries achieve
their ML goals
We are recognised,
all-in AWS experts,
focused on customer
success
We offer full-stack
services, including AI /
ML, Data & Analytics,
BI and MLOps
4. Full-stack capability
OUR SERVICES
Business
Differentiation / Value
Data Driven
Business Decisions
Cloud Transformation
Adoption and Scale
Digital
Enablement
AI and Machine Learning
Data & Analytics
Data Foundations
Cloud Infrastructure
Landing Zone, Control Tower, migration
5. Discover. Deliver. Productionise. Scale.
ACCELERATE AI/ML ADOPTION ON AWS
[Diagram: Discovery → Prove Value / Business Case → Deploy → Productionise / Operationalise → Scaled AI, supported by the themes Enable, Innovate, Differentiate, Embed and Automate]
Data Science
AI/ML
Data
Engineering
& Platform
DevOps &
MLOps
Cloud/Data
Architecture
6.
INTRODUCTION
Aramex is an international express, mail delivery and
logistics services company based in Dubai, United Arab
Emirates, shipping internationally and regionally.
Aramex’s current focus is on Digital Transformation to
improve their Last Mile and customer experience. Big data,
analytics and Machine Learning are at the heart of this
journey.
7.
Aramex are undertaking a Digital Transformation to improve their user experience,
using Machine Learning and powered by a rich Data Lake.
THE VISION TO IMPROVE THE LAST MILE
Address Prediction
In the Middle East there is a
lack of defined addresses.
Aramex is therefore using
Machine Learning to identify a
location from descriptive text
Transit Time
Aramex is using Machine
Learning to predict the amount
of time both international and
domestic shipments take
Consignee Profiling
Aramex is using Machine
Learning to better rate the
likelihood of successful delivery
at a location at certain times of
day. This is helping to reduce
costs by lowering the number of
delivery attempts
8. 8
Aramex faced the following issues before using AWS and engaging with Inawisdom
THE CHALLENGE
Data Access: Aramex has a number of relational OLTP databases at the
heart of their business, hosted on a series of on-premises Microsoft
SQL Servers. Using these databases for insights, analytics and data science
was problematic, as any additional load on the databases degraded the
service to their operations business.
Model Training: Aramex's data centres were built for their e-commerce and
operations business. They were not built for data science: they do not have
servers with GPUs, and they cannot distribute training over tens or
hundreds of servers depending on need.
Impaired Innovation: Due to the constraints Aramex's data centres imposed on
their business, they could not readily experiment with different approaches,
evaluate them, and evolve the underlying architectures to adapt to changing
needs.
10.
Define use cases and drive value through outcome-focused
delivery
ML AND DATA VALUE FLYWHEEL
Realise
Maintain
Evolve & Scale
Data Sources
Embed within
business &
visualise
Structured, Semi-Structured and
Unstructured data from
Internal, External, and
other sources
Get stakeholder
commitment, build a
roadmap around value, and
start the first flywheel for the
highest-impact but
deliverable use case
Discover
Business Case
Creation, Exploratory
Data Analysis, &
Target Opportunity
Definition
Use Cases
Prioritise & Value
Business + Data Strategy,
Ideation for descriptive to
predictive to prescriptive use
cases, and Opportunity Scoring
Prove
Experiment
and show
potential value
Improve
model(s),
refine data
products &
create MVP
Deliver value to
the business
Maintain value
to the
business
Data & MLOps,
maintain data &
models with
automation and
pipelines
24/7 monitoring,
Incident Response,
& Cost Optimisation
Respond to changes
and detect drift
Scale up and refine
capabilities to accelerate
the delivery of value with
more & faster flywheels
Improve reuse and
collaboration using
tooling such as a
model registry and
a Business Data
Catalogue
Standardise
approaches to
common problems,
provide governance
Change business processes
and refine operating model
to be data-driven with easy
access to insights
PoV with initial
data products,
feature creation &
model selection
Measure &
Iterate
Measure each iteration of the
flywheel against CSFs / KPIs
and only invest in further
iterations as needed
Roadmap
Value
11.
Monitoring, observing
and alerting using
CloudWatch and X-Ray.
Infrastructure as
Code with SAM and
CloudFormation.
Operational Excellence
Least privilege, Data
Encryption at Rest,
and Data Encryption
in Transit using IAM
Policies, Resource
Policies, KMS, Secrets
Manager, VPCs and
Security Groups.
Security
Elastic scaling based
on demand and
meeting response
times using Auto
Scaling, Serverless,
and Per Request
managed services.
Performance
Serverless and fully
managed services to
lower TCO. Resource
Tag everything
possible for cost
analysis. Right sizing
instance types for
model hosting.
Cost Optimisation
Fault tolerance and
auto-healing to meet a
target availability
using Auto Scaling,
Multi-AZ, Multi-Region,
Read Replicas and
Snapshots.
Reliability
https://d1.awsstatic.com/whitepapers/architecture/wellarchitected-Machine-Learning-Lens.pdf
12.
SERVERLESS
Lambda API Gateway
DynamoDB is a fully
managed NoSQL
database service from
AWS. For machine
learning it is typically
used for reference
data.
DynamoDB
S3
SNS: Pub/Sub
SQS: Queues
Fargate: Containers
Step Functions: Workflows
…and more
Highly durable object
storage used for many
things, including data
lakes. For machine
learning it is used to
store training data sets
and model artefacts.
API Gateway is the
endpoint for your API;
it provides extensive
security measures,
logging, and API
definition using
OpenAPI (Swagger).
AWS Lambda is
AWS’s native and fully
managed cloud
service for running
application code
without the need to
run servers.
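A minimal sketch of how these serverless pieces compose: a Lambda handler behind API Gateway forwarding a request to a SageMaker endpoint. The endpoint name and CSV payload format are assumptions for illustration; in a real function the client would be `boto3.client("sagemaker-runtime")`, injected here so it can be stubbed.

```python
import json

def make_handler(sm_runtime, endpoint_name):
    """Build a Lambda-style handler that forwards the API Gateway request
    body to a SageMaker endpoint. sm_runtime is normally
    boto3.client("sagemaker-runtime")."""
    def handler(event, context=None):
        response = sm_runtime.invoke_endpoint(
            EndpointName=endpoint_name,
            ContentType="text/csv",  # assumed payload format
            Body=event["body"],
        )
        prediction = response["Body"].read().decode("utf-8")
        return {"statusCode": 200, "body": json.dumps({"prediction": prediction})}
    return handler
```

Wiring this handler to API Gateway then gives callers a plain HTTPS endpoint without any servers to manage.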
15.
Logical components of an endpoint within Amazon SageMaker
AMAZON SAGEMAKER – REAL TIME INFERENCE
All components are immutable: any configuration change requires new models and endpoint configurations.
However, there is a specific SageMaker API to update instance count and variant weight.
Endpoint
Configuration
Endpoint
Inference Engine + Model
Primary Container
Container
Container
VPC
S3
KMS + IAM
Inference Engine + Model
Primary Container
Container
Container
VPC
S3
KMS + IAM
Production Variant
Production Variant
Model
Initial
Count + Weight
Instance Type
SDKs
REST
SignV4
Requests
Name
16.
Endpoint
Docker containers host the inference engines. Inference engines can be written in any language, and endpoints can use
more than one container; the primary container needs to implement a simple REST API.
Common Engines:
➤ 685385470294.dkr.ecr.eu-west-1.amazonaws.com/xgboost:1
➤ 520713654638.dkr.ecr.eu-west-1.amazonaws.com/sagemaker-tensorflow:1.11-cpu-py2
➤ 520713654638.dkr.ecr.eu-west-1.amazonaws.com/sagemaker-tensorflow:1.11-gpu-py2
➤ 763104351884.dkr.ecr.eu-west-1.amazonaws.com/tensorflow-inference:1.13-gpu
➤ 520713654638.dkr.ecr.eu-west-1.amazonaws.com/sagemaker-tensorflow-serving:1.11-cpu
AMAZON SAGEMAKER – INFERENCE ENGINES
Dockerfile:
FROM tensorflow/serving:latest
RUN apt-get update && apt-get install -y --no-install-recommends nginx git
RUN mkdir -p /opt/ml/model
COPY nginx.conf /etc/nginx/nginx.conf
ENTRYPOINT service nginx start | tensorflow_model_server --rest_api_port=8501 --model_config_file=/opt/ml/model/models.config
Container
http://localhost:8080/invocations
http://localhost:8080/ping
Amazon
SageMaker model.tar.gz
Primary Container
Nginx Gunicorn Model
Runtime
link
/opt/ml/model
X-Amzn-SageMaker-Custom-Attributes
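The REST contract above can be sketched with nothing but the Python standard library: a hypothetical primary container answering GET /ping and POST /invocations. The prediction logic is a stand-in that sums a JSON list of features, not a real inference engine.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

class InferenceHandler(BaseHTTPRequestHandler):
    """Minimal sketch of the REST contract a SageMaker primary container
    must implement: GET /ping for health checks and POST /invocations
    for inference. The model here is a stand-in."""

    def do_GET(self):
        # SageMaker calls /ping to decide whether the container is healthy
        status = 200 if self.path == "/ping" else 404
        self.send_response(status)
        self.end_headers()

    def do_POST(self):
        if self.path != "/invocations":
            self.send_response(404)
            self.end_headers()
            return
        length = int(self.headers.get("Content-Length", 0))
        features = json.loads(self.rfile.read(length))
        body = json.dumps({"prediction": sum(features)}).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the sketch quiet

# Inside the container, SageMaker routes traffic to port 8080:
# HTTPServer(("0.0.0.0", 8080), InferenceHandler).serve_forever()
```

In a production container this handler sits behind Nginx/Gunicorn as shown in the diagram, with the model loaded from /opt/ml/model.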
17.
Using Docker immediately raises the following questions:
➤ How many Docker containers are run on a single underlying EC2 instance?
➤ Is Kubernetes or ECS used? And do I have to become a Docker expert?
➤ How quickly are instances started and stopped?
➤ How do instances reside within the VPC and use network resources? For example, can
the number of instances exhaust the network addresses of a VPC?
➤ How isolated are my models, given that Docker uses soft CPU and memory units?
➤ Will I suffer issues if containers are bin packed or re-distributed?
IMPLICATIONS OF USING DOCKER
18.
To answer these questions, a series of experiments was carried out.
THE EXPERIMENT
After VPC creation:
AZ           Available addresses
eu-west-1a   4091
eu-west-1b   4091
eu-west-1c   4091

After notebook instance creation:
AZ           Available addresses
eu-west-1a   4090
eu-west-1b   4091
eu-west-1c   4091

After endpoint creation:
AZ           Available addresses
eu-west-1a   4090
eu-west-1b   4090
eu-west-1c   4090
primary_container = {
    "Image": "685385470294.dkr.ecr.eu-west-1.amazonaws.com/xgboost:1",
    "ModelDataUrl": "s3://mybucket/mymodel/output/model.tar.gz",
}
create_model_response = sm_client.create_model(
    ModelName="load-test",
    ExecutionRoleArn=role,
    PrimaryContainer=primary_container,
    VpcConfig={
        "SecurityGroupIds": ["My SecurityGroupId"],
        "Subnets": ["Subnet Id 1b", "Subnet Id 1c"],
    },
)
create_endpoint_config_response = sm_client.create_endpoint_config(
    EndpointConfigName=endpoint_config_name,
    ProductionVariants=[{
        "InstanceType": "ml.t2.medium",
        "InitialInstanceCount": 2,
        "InitialVariantWeight": 1,
        "ModelName": "load-test",
        "VariantName": "AllTraffic",
    }],
)
Endpoint Creation:
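The final step, not shown above, is creating the endpoint itself from the endpoint configuration. A small sketch, assuming the same `sm_client` and names as the snippet above; the helper only builds the request body, and the commented waiter call is the standard boto3 way to block until the endpoint is in service.

```python
def create_endpoint_params(endpoint_name, endpoint_config_name):
    # Request body for sagemaker.create_endpoint: an endpoint is just a
    # name bound to an endpoint configuration.
    return {
        "EndpointName": endpoint_name,
        "EndpointConfigName": endpoint_config_name,
    }

# sm_client.create_endpoint(**create_endpoint_params("load-test", endpoint_config_name))
# sm_client.get_waiter("endpoint_in_service").wait(EndpointName="load-test")
```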
19.
The CPU usage in AWS CloudWatch for a load-test experiment
RESULTS
At 13:20 we saw the start of a drop
in CPU usage, and at 13:40 it
settled at 100%. Why was this?
From the load script I configured,
we know that this is when
serverless-artillery entered the
second phase of sustained load.
There was a slow ramp-up for the
first 15 minutes until we hit around
the 200% CPU usage mark. 200%
CPU usage means we were using
more than the capacity of a single
endpoint instance.
We then saw a return to 200%
CPU usage 10 minutes later. At
14:40 we saw a complete stop in
load; this is when the
serverless-artillery job completed.
21.
The following shows the results of the same experiment run on M5 instances:
M5 INSTANCES
Again we hit around the 200%
CPU usage mark. The 200% CPU
usage means we are using more
than the capacity of a single
endpoint instance.
This time no instance crashed and
the same two instances were used
during the entire experiment
There is a strong relationship
between invocations and CPU for
this XGBoost model.
22.
The following shows the same experiment with M5 instances and autoscaling enabled:
M5 INSTANCES WITH AUTOSCALING
The autoscaling group was set
between 2 and 4 instances, with
the scaling policy targeting 100k requests.
The number of invocations
continued to rise and CPU never
went above 100%.
A scaling event happened at 08:45
and took 5 minutes to warm up.
Again no instance crashed, and up
to 4 instances were used.
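A sketch of how this autoscaling setup can be expressed with Application Auto Scaling. The helper only builds the two request bodies (register the variant as a scalable target, then attach a target-tracking policy on invocations per instance); the endpoint and variant names are illustrative, and the commented calls show how the requests would be sent with boto3.

```python
def scaling_policy_params(endpoint_name, variant_name, target_invocations):
    """Build the two Application Auto Scaling requests mirroring the
    experiment above: a scalable target of 2-4 instances and a
    target-tracking policy on invocations per instance."""
    resource_id = f"endpoint/{endpoint_name}/variant/{variant_name}"
    register = {
        "ServiceNamespace": "sagemaker",
        "ResourceId": resource_id,
        "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
        "MinCapacity": 2,
        "MaxCapacity": 4,
    }
    policy = {
        "PolicyName": f"{endpoint_name}-invocations-target",
        "ServiceNamespace": "sagemaker",
        "ResourceId": resource_id,
        "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingScalingPolicyConfiguration": {
            "TargetValue": float(target_invocations),
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
            },
        },
    }
    return register, policy

# aas = boto3.client("application-autoscaling")
# register, policy = scaling_policy_params("load-test", "AllTraffic", 100000)
# aas.register_scalable_target(**register)
# aas.put_scaling_policy(**policy)
```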
23.
The following chart compares the two M5-based experiments:
WHY IS CPU USAGE THAT IMPORTANT?
Latency (red) increased when the
CPU went over 100%. This is due
to invocations having to wait
within SageMaker to be processed.
Zzzzz, Phil does sleep!
The two M5 experiments had a
cost of $42.96
SageMaker Studio was used
instead of a SageMaker notebook
instance.
24.
It is important to right-size your ML workload so that you pay only for what you need. Also be
very careful with GPUs.
COST OPTIMISATION
Change in
Instance Size
Change in
Instance Type
No RIs or Savings
Plans for ML
26.
AWS Step Functions Data Science Software Development Kit
ML OPS: MODEL RETRAINING AND DEPLOYMENT
AWS Glue: Used for raw data ingress, cleaning the
data and then transforming it into a training
data set
Deployments to Amazon SageMaker
endpoints: The ability to perform deployments from
the pipeline, including blue/green, linear and canary
style updates.
AWS Lambda: Used to stitch elements together and
perform any additional logic
AWS ECS/Fargate: There are situations where you
may need to run very long-running processes over
the data to prepare it for training. Lambda is not
suitable for this due to its maximum execution time
and memory limits, so Fargate is preferred in
these situations.
Amazon SageMaker training jobs: The ability to
run training on the data that the pipeline has
prepared for you.
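The pipeline above can be sketched as an Amazon States Language definition: Glue prepares the training set, SageMaker trains a model, and a Lambda performs the deployment. The job and function names are placeholders, and the Step Functions Data Science SDK can generate an equivalent definition from Python step objects rather than hand-written JSON.

```python
def retraining_pipeline_definition(glue_job, deploy_lambda):
    """ASL sketch of the retraining pipeline: Glue -> training -> deploy.
    The .sync resource ARNs make Step Functions wait for each job to finish."""
    return {
        "StartAt": "PrepareData",
        "States": {
            "PrepareData": {
                "Type": "Task",
                "Resource": "arn:aws:states:::glue:startJobRun.sync",
                "Parameters": {"JobName": glue_job},
                "Next": "TrainModel",
            },
            "TrainModel": {
                "Type": "Task",
                "Resource": "arn:aws:states:::sagemaker:createTrainingJob.sync",
                "Parameters": {"TrainingJobName.$": "$$.Execution.Name"},
                "Next": "DeployModel",
            },
            "DeployModel": {
                "Type": "Task",
                "Resource": "arn:aws:states:::lambda:invoke",
                "Parameters": {"FunctionName": deploy_lambda},
                "End": True,
            },
        },
    }
```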
27.
The following are the four ways to deploy new versions of models in Amazon SageMaker:
Rolling:
DEV OPS WITH SAGEMAKER
Endpoint
Configuration
Canary Variant
Full Variant
Endpoint
Configuration
Full Variant
Endpoint
Configuration
Full Variant
Endpoint
Configuration
Full Variant
Endpoint
Configuration
New Variant
Old Variant
Canary: Blue/Green: Linear:
weight
The default option: SageMaker
will start new instances and then,
once they are healthy, stop the
old ones.
Canary deployments are done
using two Variants in the
Endpoint Configuration and
performed over two
CloudFormation updates.
Requires two CloudFormation
stacks and then changing the
endpoint name in the AWS
Lambda using an Environment
Variable
Linear uses two Variants in the
Endpoint Configuration, with an
AWS Step Function and AWS
Lambda calling the
UpdateEndpointWeightsAndCapacities
API.
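The linear strategy above can be sketched as repeated calls to UpdateEndpointWeightsAndCapacities. The helper only builds the request body; the variant names and ramp schedule are illustrative, and the commented loop shows how a Step Function iteration would shift traffic with boto3.

```python
def shift_traffic_params(endpoint_name, new_variant, old_variant, new_weight):
    """Request body for sagemaker.update_endpoint_weights_and_capacities.
    Weights are relative; this pair always sums to 1.0."""
    return {
        "EndpointName": endpoint_name,
        "DesiredWeightsAndCapacities": [
            {"VariantName": new_variant, "DesiredWeight": float(new_weight)},
            {"VariantName": old_variant, "DesiredWeight": round(1.0 - new_weight, 3)},
        ],
    }

# sm_client = boto3.client("sagemaker")
# for weight in (0.1, 0.25, 0.5, 0.75, 1.0):  # one step per iteration
#     sm_client.update_endpoint_weights_and_capacities(
#         **shift_traffic_params("load-test", "NewVariant", "OldVariant", weight))
```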
28.
Amazon SageMaker exposes metrics to AWS CloudWatch
MONITORING SAGEMAKER
Name                         Dimension     Statistic  Threshold         Time period     Missing data
Endpoint model latency       Milliseconds  Average    > 100             For 5 minutes   ignore
Endpoint model invocations   Count         Sum        > 10000 / < 1000  For 15 minutes  notBreaching / breaching
Endpoint disk usage          %             Average    > 90% / > 80%     For 15 minutes  ignore
Endpoint CPU usage           %             Average    > 90% / > 80%     For 15 minutes  ignore
Endpoint memory usage        %             Average    > 90% / > 80%     For 15 minutes  ignore
Endpoint 5XX errors          Count         Sum        > 10              For 5 minutes   notBreaching
Endpoint 4XX errors          Count         Sum        > 50              For 5 minutes
The metrics in AWS CloudWatch
can then be used for alarms:
➤ Always pay attention to how to
handle missing data
➤ Always test your alarms
➤ Look to level your alarms
➤ Make your alarms complement
each other
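The first row of the table can be sketched as a CloudWatch alarm request. The helper only builds the body for `cloudwatch.put_metric_alarm`; the alarm name and 60-second period are choices made for this sketch, not part of the table, and note that SageMaker reports ModelLatency in microseconds, so the 100 ms threshold becomes 100,000.

```python
def latency_alarm_params(endpoint_name, variant_name):
    """Request body for cloudwatch.put_metric_alarm implementing:
    average model latency above 100 ms for 5 minutes, missing data ignored."""
    return {
        "AlarmName": f"{endpoint_name}-model-latency",
        "Namespace": "AWS/SageMaker",
        "MetricName": "ModelLatency",
        "Dimensions": [
            {"Name": "EndpointName", "Value": endpoint_name},
            {"Name": "VariantName", "Value": variant_name},
        ],
        "Statistic": "Average",
        "Period": 60,
        "EvaluationPeriods": 5,
        "Threshold": 100000.0,  # 100 ms expressed in microseconds
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "ignore",
    }
```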
29.
X-Ray traces can help you spot bottlenecks and costly areas of the code, including inside your models.
OBSERVING SAGEMAKER
[Trace map: APIGWUrl → Inference Function → Functions A-H → Model → Functions 1-2 → SQL: db_url]
30.
Remember to always apply least privilege and other AWS security best practices, and be very protective of your data.
SECURITY AND SAGEMAKER
AWS KMS: Encrypt everything! However, if your data contains PII or falls under
PCI-DSS, consider using a dedicated customer-managed key in KMS. This gives you
tighter control by limiting the ability to decrypt data, providing another layer of security over S3.
AWS IAM: SageMaker, like EC2, is granted access to other AWS services using IAM
roles, and you need to make sure your policies are locked down to only the Actions
and Resources you need.
Amazon S3: SageMaker can use a range of data stores, but S3 is the most
popular. Please make sure you enable encryption, resource policies,
logging and versioning on your buckets.
Amazon VPC: SageMaker can run outside a VPC and access data over the public
internet (hopefully using HTTPS). This runs contrary to most corporate information
security policies, so please deploy in a VPC with private links for extra security.
Data: Most importantly, only use the data you need. If the data contains PII or
PCI-DSS values that you do not need, remove or sanitise them.
31.
Aramex achieved the following results with Inawisdom, AWS and Matillion:
THE OUTCOME
Business results:
► Aramex has seen a 74 percent increase in the accuracy of their transit time predictions because of the machine learning models developed on AWS with Inawisdom.
► Aramex has improved its contact centre efficiency with the Inawisdom solution, eliminating 40 percent of inbound customer calls related to shipments.
► Aramex got workloads live in 8 weeks with Inawisdom, where previously they had struggled for 5 years.
Technical results:
► A data pipeline that ingests 1.2 million updates every 15 minutes, with 70 orchestration jobs; 50 of these handle the incremental load from SQL Server to Redshift.
► Storing over 7.2 TB of data, comprising 3 months in hot storage and 7.5 years queryable from long-term storage using Redshift Spectrum.
► Over 20 million predictions per month, averaging 600 requests per minute with daily peaks of 800 requests per minute, achieving response times of 135 ms at the 90th percentile versus the original 2,500 ms.