Apply MLOps At Scale
Keven(Qi) Wang
Linkedin: https://www.linkedin.com/in/kevenqiwang/
Medium: https://medium.com/@kevenwang_33862
Lead AI Architect @ H&M
Agenda
AI journey @H&M
Quick facts and use cases
Reference Architecture gen1
ML process and ML training
Reference Architecture gen2
MLOps and Operationalize AI
AI journey @H&M
Quick facts and use cases
General Information
74 markets
5,000+ stores
177,000 employees
Sales including VAT: SEK 210 billion (2018)
E-commerce in 51 markets
Our Journey
2016 – Exploration: run initial PoCs; test AA appetite & applicability
2017 – Initiation: industrialize early use cases; define organization and capability needs; establish the IT / data environment
2018 – Establish AA & AI function: roll out & hand over successful pilots; establish AA-WoW, team, governance
2019 – AA Leader: increasingly data- & algo-driven retail business; analytical support across the entire value chain; strong internal AA teams; engage in partnerships with strong AI players
2022 – AI Leader of the Fashion Industry: lead the frontier of AI at scale in delivering customer value; global leader in developing talent pools and supporting AI hubs and networks; AI-powered tools and capabilities supporting core processes and business decisions in all functions; world-leading ecosystem of cutting-edge AI partners
Today: algo library, IT platform, business impact
H&M use cases
Analytics and Data Platform
Business areas: Design / Buying, Production, Logistics, Sales, Marketing
Assortment quantification
Fashion Forecast
Allocation
Markdown Online
Markdown Store
Personalized Promotions, Recommendations & Journeys
Movebox
Knowledge & Best Practice
AI exploration and Research
Rapid Dev enablement
AI platform
AI @ H&M quick facts
100+ co-located FTEs and a growing number of colleagues
30+ different nationalities
New ways of working: combined teams, sprints, standups, product mgmt., epics, algo, cloud
Consultants
HAAL
Azure Databricks
Reference Architecture gen1
ML process and ML training
Starting point – fragmented architecture
ML Process and Tooling
Model training: data acquisition → data preparation → feature engineering → model training → model repository
Model deployment: unseen data acquisition → data preparation → transform data into features → model prediction → results
Supporting capabilities: training orchestration, deployment orchestration, data storage (Data Lake Store), model and data versioning, automated e2e feedback loop, e2e monitoring
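To make the training flow above concrete, here is a minimal sketch of it as composable Python steps; the CSV input, the column names and the scikit-learn model are illustrative assumptions, not H&M's actual implementation.

```python
# Minimal sketch of the training flow: acquisition -> preparation ->
# feature engineering -> training. Paths, the "target" column and the
# LinearRegression choice are assumptions for illustration only.
from typing import Tuple

import pandas as pd
from sklearn.linear_model import LinearRegression


def acquire_data(path: str) -> pd.DataFrame:
    """Data acquisition: read raw data from the data lake (here: a CSV)."""
    return pd.read_csv(path)


def prepare_data(df: pd.DataFrame) -> pd.DataFrame:
    """Data preparation: basic cleaning, e.g. drop rows with missing values."""
    return df.dropna()


def engineer_features(df: pd.DataFrame) -> Tuple[pd.DataFrame, pd.Series]:
    """Feature engineering: split into a feature matrix and a target column."""
    return df.drop(columns=["target"]), df["target"]


def train_model(features: pd.DataFrame, target: pd.Series) -> LinearRegression:
    """Model training: fit a simple model on the engineered features."""
    return LinearRegression().fit(features, target)


def training_pipeline(path: str) -> LinearRegression:
    """End-to-end training, mirroring the diagram's four steps."""
    df = prepare_data(acquire_data(path))
    features, target = engineer_features(df)
    return train_model(features, target)
```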
Interactive model development
Toolchain: PyCharm, Azure Databricks, CI orchestrator, container registry, Kubernetes, model repository.
1. Code commit
2. Code static check, unit tests, packaging
3.1 Push to DBFS; 3.2 trigger pipeline
4.1 Job execution; 4.2 log model info; 4.3 commit model
5.1 Fetch model; 5.2 build container image
6. Push image
7. Auto deploy
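Steps 4.2 and 4.3 above (log model info, commit model) map naturally onto an experiment-tracking API. A minimal sketch, assuming MLflow as the model repository; the slide only names a generic model repository on Databricks, and the parameters and metric below are illustrative.

```python
# Hedged sketch of steps 4.2/4.3: log model metadata and commit the trained
# model to a model repository. MLflow is an assumption; the deck only shows a
# generic "model repository" next to Azure Databricks.
import mlflow
import mlflow.sklearn
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy training data, purely illustrative
X = np.random.rand(100, 3)
y = X @ np.array([1.0, 2.0, 3.0]) + 0.1 * np.random.rand(100)

with mlflow.start_run(run_name="interactive-dev"):
    model = LinearRegression().fit(X, y)

    # 4.2: log model info (parameters, metrics) for traceability
    mlflow.log_param("model_type", "LinearRegression")
    mlflow.log_metric("train_r2", model.score(X, y))

    # 4.3: commit the model artifact to the model repository
    mlflow.sklearn.log_model(model, artifact_path="model")
```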
Automated model training pipeline 1
A scenario is defined by a geo location, a product type and a time window (Scenario 1: l1, p1, t1; Scenario 2: l2, p2, t2; Scenario 3: l3, p3, t3; …; Scenario i: li, pi, ti), and all scenarios together form the scenario set.
Each scenario runs its own pipeline: source data → prep data → feature engineering → train → optimize.
Execution targets in gen1: Databricks clusters, VMs and containers.
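A small sketch of the scenario-set idea: each (geo location, product type, time) combination becomes one training run. The dataclass and the dispatch loop are illustrative; in gen1 each scenario ran as its own pipeline on a Databricks cluster, VM or container.

```python
# Sketch of the "scenario set": one training run per
# (geo location, product type, time) combination. Names and values are
# illustrative, not the actual H&M scenario definitions.
from dataclasses import dataclass
from itertools import product
from typing import List


@dataclass(frozen=True)
class Scenario:
    geo_location: str
    product_type: str
    time_window: str


def build_scenario_set(locations: List[str],
                       product_types: List[str],
                       time_windows: List[str]) -> List[Scenario]:
    """Cartesian product of the dimensions = the scenario set on the slide."""
    return [Scenario(l, p, t)
            for l, p, t in product(locations, product_types, time_windows)]


def run_scenario(scenario: Scenario) -> None:
    """Placeholder for source data -> prep data -> feature eng. -> train -> optimize."""
    print(f"training scenario: {scenario}")


if __name__ == "__main__":
    for scenario in build_scenario_set(["l1", "l2"], ["p1", "p2"], ["t1"]):
        run_scenario(scenario)
```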
Automated model training pipeline 2
The scenario set is expressed as an Airflow DAG: one task chain per scenario (source data → prep data → feature engineering → train → optimize), with the heavy work executed on Databricks clusters.
Airflow itself runs on Azure Kubernetes Service: the webserver and scheduler run in Kubernetes pods, DAGs and logs sit on a persistent volume backed by an Azure File share, images come from the container registry, and Airflow state lives in its metadata DB.
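A minimal sketch of how such a DAG can fan out one task chain per scenario, assuming Airflow 2.x import paths and plain PythonOperator tasks; in practice each step could instead submit a Databricks job. Scenario values and step bodies are placeholders, not the actual H&M DAG.

```python
# Sketch of the orchestration idea: an Airflow DAG that fans out one
# source -> prep -> feature -> train -> optimize chain per scenario.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

# Illustrative scenario set and step list (see the slides above)
SCENARIOS = [
    {"geo": "l1", "product": "p1", "time": "t1"},
    {"geo": "l2", "product": "p2", "time": "t2"},
]
STEPS = ["source_data", "prep_data", "feature_engineering", "train", "optimize"]


def run_step(step: str, scenario: dict) -> None:
    """Placeholder body; a real task could submit a Databricks run here."""
    print(f"{step} for scenario {scenario}")


with DAG(
    dag_id="scenario_training",
    start_date=datetime(2020, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    for i, scenario in enumerate(SCENARIOS):
        previous = None
        for step in STEPS:
            task = PythonOperator(
                task_id=f"{step}_scenario_{i}",
                python_callable=run_step,
                op_kwargs={"step": step, "scenario": scenario},
            )
            if previous is not None:
                previous >> task  # chain the steps within one scenario
            previous = task
```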
Trick for the Airflow dependency challenge
The actual Python method lives in its own module; the little trick is in how python_callable is set up, so the function can be called without importing its module when the DAG is parsed.
For more detail, check this blog post:
https://medium.com/@kevenwang_33862/machine-learning-in-production-2-large-scale-ml-training-889cde94f26d
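The exact mechanism is described in the linked post; one common form of the trick is sketched below, where the python_callable resolves and imports the target module only at task run time, so the Airflow scheduler does not need the task's heavy dependencies at DAG parse time. Module and function names are illustrative.

```python
# One common form of the trick: keep the DAG file import-light and let the
# python_callable import the heavy module only when the task actually runs.
import importlib


def lazy_call(module_path: str, function_name: str, **kwargs):
    """Import the module at run time and call the actual Python method."""
    module = importlib.import_module(module_path)
    return getattr(module, function_name)(**kwargs)


# In the DAG file the target is referenced by name only, e.g.:
# PythonOperator(
#     task_id="train",
#     python_callable=lazy_call,
#     op_kwargs={"module_path": "training.pipeline", "function_name": "train"},
# )
```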
Evolve to scale and industrialize across H&M
Make AI available for product teams across H&M Group
Facilitate scalability and specialization
Continue to build world-class AI products, engines and core components
The value has been proven use case by use case
Now: to reach the next level, we need to industrialize and scale AI across H&M
Reference Architecture gen2
MLOps and Operationalize AI
MLOps
Version compatibility, reproducibility, approval process, model format, experiment strategy, feedback loop, model traceability, model metadata, deployment strategy, scalability
MLOps tech stack
Model development – Interactive vs. Automated
▪ AI product lifecycle
▪ Notebooks and Python modules
▪ Containers as first-class citizens
▪ Airflow vs. Kubeflow
Model serving – deployment strategy
Release strategies:
▪ Single model: router → Model 1.1
▪ Canary: router splits live traffic between Model 1.1 and Model 1.2
▪ Shadow: router serves Model 1.1 and mirrors requests to Model 1.2
Experiment strategies:
▪ A/B test: router splits traffic across Model A1, Model A2 and Model A3
▪ Multi-armed bandit: router routes across Model A1, Model A2 and Model A3, with a reward system feeding outcomes back into the routing decision
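To make the multi-armed bandit strategy concrete, here is a minimal epsilon-greedy router sketch with a reward callback; the model names, the epsilon value and the reward signal are illustrative assumptions.

```python
# Epsilon-greedy bandit router: send most traffic to the best-performing
# model, keep exploring the others, and let a reward system update the stats.
import random
from collections import defaultdict


class EpsilonGreedyRouter:
    def __init__(self, model_ids, epsilon=0.1):
        self.model_ids = list(model_ids)
        self.epsilon = epsilon
        self.counts = defaultdict(int)
        self.total_reward = defaultdict(float)

    def route(self) -> str:
        """Pick a model: explore with probability epsilon, otherwise exploit."""
        if random.random() < self.epsilon or not any(self.counts.values()):
            return random.choice(self.model_ids)
        return max(
            self.model_ids,
            key=lambda m: self.total_reward[m] / max(self.counts[m], 1),
        )

    def record_reward(self, model_id: str, reward: float) -> None:
        """Reward-system callback: update statistics for the chosen model."""
        self.counts[model_id] += 1
        self.total_reward[model_id] += reward


router = EpsilonGreedyRouter(["model_a1", "model_a2", "model_a3"])
chosen = router.route()
router.record_reward(chosen, reward=1.0)  # e.g. click / conversion signal
```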
Model serving – Inference Graph
The serving graph composes an input transformer, Router 1 (multi-armed bandit), Router 2 (A/B test), Models A1/A2/A3 and B1/B2, and an output transformer into a single inference graph.
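A plain-Python sketch of one plausible wiring of the graph above (the slide shows the components, not the exact edges): an input transformer, a bandit router choosing between the A/B-tested A-models and the B-models, and an output transformer. In a real deployment this would typically be expressed in a serving framework rather than hand-rolled.

```python
# Illustrative composition of the inference-graph components; the routing
# logic is deliberately trivial (random choices) to keep the wiring visible.
import random


def input_transformer(payload: dict) -> dict:
    return {"features": payload["raw_features"]}


def model(name: str):
    # Each "model" is a callable returning a dummy prediction
    return lambda request: {"model": name, "score": random.random()}


def ab_router(models):
    """Router 2: uniform split across the given models (A/B test stand-in)."""
    return lambda request: random.choice(models)(request)


def bandit_router(arm_a, arm_b):
    """Router 1: placeholder bandit, here just a random arm choice."""
    return lambda request: random.choice([arm_a, arm_b])(request)


def output_transformer(prediction: dict) -> dict:
    return {"served_by": prediction["model"], "score": round(prediction["score"], 3)}


graph = bandit_router(
    arm_a=ab_router([model("A1"), model("A2"), model("A3")]),
    arm_b=ab_router([model("B1"), model("B2")]),
)

response = output_transformer(graph(input_transformer({"raw_features": [1, 2, 3]})))
print(response)
```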
Model management and lifecycle
Lifecycle stages: Model Development → Back Test → Model Approval → Staging → Production
CI/CD pipelines: PR pipeline, back-test pipeline, training CI pipeline, CD staging pipeline, CD prod pipeline
Developers work on feature branches and raise pull requests; infrastructure as code is applied per environment (#dev, #stage, #prod).
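A minimal sketch of the back-test / approval gate implied by the lifecycle above: a candidate model is promoted only if its back-test metric improves on the current production model. The metric (MAPE) and the margin are illustrative assumptions.

```python
# Sketch of a back-test / approval gate: promote the candidate from staging
# to production only if it beats the production model by a margin.
from dataclasses import dataclass


@dataclass
class BackTestResult:
    model_version: str
    mape: float  # mean absolute percentage error on the back-test period


def approve_for_production(candidate: BackTestResult,
                           current: BackTestResult,
                           min_improvement: float = 0.01) -> bool:
    """Approval step: lower error is better; require a minimum improvement."""
    return candidate.mape <= current.mape - min_improvement


if __name__ == "__main__":
    prod = BackTestResult("1.4", mape=0.182)
    candidate = BackTestResult("1.5", mape=0.165)
    if approve_for_production(candidate, prod):
        print(f"promote model {candidate.model_version} to production")
    else:
        print(f"keep model {prod.model_version} in production")
```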
Take away
▪ Problem, Process and Architecture
▪ Platform approach
▪ Leverage cloud-native services
Feedback
Your feedback is important to us.
Don’t forget to rate
and review the sessions.