Apply MLOps At Scale
Keven(Qi) Wang
Linkedin: https://www.linkedin.com/in/kevenqiwang/
Medium: https://medium.com/@kevenwang_33862
Lead AI Architect @ H&M
Agenda
AI journey @H&M
Quick facts and use cases
Reference Architecture gen1
ML process and ML training
Reference Architecture gen2
MLOps and Operationalize AI
AI journey @H&M
Quick facts and use cases
General Information
74 markets
5,000+ stores
177,000 employees
Sales including VAT: SEK 210 billion (2018)
E-commerce in 51 markets
Our Journey
2016 – Exploration: run initial PoCs; test AA appetite & applicability
2017 – Initiation: industrialize early use cases; define organization and capability needs; establish the IT / data environment
2018 – Establish AA & AI function: roll out & hand over successful pilots; establish AA-WoW, team, governance
2019 – AA Leader: increasingly data- & algo-driven retail business; analytical support across the entire value chain; strong internal AA teams; engage in partnerships with strong AI players
2022 – AI Leader of the Fashion Industry: lead the frontier of AI at scale in delivering customer value; global leader in developing talent pools and supporting AI hubs and networks; AI-powered tools and capabilities supporting core processes and business decisions in all functions; world-leading ecosystem of cutting-edge AI partners
Today: algo library, IT platform, business impact
H&M use cases
Analytics and Data Platform
Business areas: Design / Buying, Production, Logistics, Sales, Marketing
Assortment quantification
Fashion Forecast
Allocation
Markdown Online
Markdown Store
Personalized Promotions, Recommendations & Journeys
Movebox
Knowledge & Best Practice
AI exploration and Research
Rapid Dev enablement
AI platform
AI @ H&M quick facts
100+ co-located FTEs and a growing number of colleagues
30+ different nationalities
New ways of working: combined teams, sprints, standups, product mgmt., epics, algo, cloud
Consultants
HAAL
Azure Databricks
Reference Architecture gen1
ML process and ML training
Starting point – fragmented architecture
ML Process and Tooling
Model training: data acquisition → data preparation → feature engineering → model training → model repository
Model deployment: unseen data acquisition → data preparation → transform data into features → model prediction → results
Supporting capabilities: training orchestration, deployment orchestration, data storage (Data Lake Store), model and data versioning, automated e2e feedback loop, e2e monitoring
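To make the training flow above concrete, here is a minimal sketch of it as composable Python steps; the CSV input, the column names and the scikit-learn model are illustrative assumptions, not H&M's actual implementation.

```python
# Minimal sketch of the training flow: acquisition -> preparation ->
# feature engineering -> training. Paths, the "target" column and the
# LinearRegression choice are assumptions for illustration only.
from typing import Tuple

import pandas as pd
from sklearn.linear_model import LinearRegression


def acquire_data(path: str) -> pd.DataFrame:
    """Data acquisition: read raw data from the data lake (here: a CSV)."""
    return pd.read_csv(path)


def prepare_data(df: pd.DataFrame) -> pd.DataFrame:
    """Data preparation: basic cleaning, e.g. drop rows with missing values."""
    return df.dropna()


def engineer_features(df: pd.DataFrame) -> Tuple[pd.DataFrame, pd.Series]:
    """Feature engineering: split into a feature matrix and a target column."""
    return df.drop(columns=["target"]), df["target"]


def train_model(features: pd.DataFrame, target: pd.Series) -> LinearRegression:
    """Model training: fit a simple model on the engineered features."""
    return LinearRegression().fit(features, target)


def training_pipeline(path: str) -> LinearRegression:
    """End-to-end training, mirroring the diagram's four steps."""
    df = prepare_data(acquire_data(path))
    features, target = engineer_features(df)
    return train_model(features, target)
```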
Interactive model development
Toolchain: PyCharm, Azure Databricks, CI orchestrator, container registry, Kubernetes, model repository.
1. Code commit
2. Code static check, unit tests, packaging
3.1 Push to DBFS; 3.2 trigger pipeline
4.1 Job execution; 4.2 log model info; 4.3 commit model
5.1 Fetch model; 5.2 build container image
6. Push image
7. Auto deploy
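Steps 4.2 and 4.3 above (log model info, commit model) map naturally onto an experiment-tracking API. A minimal sketch, assuming MLflow as the model repository; the slide only names a generic model repository on Databricks, and the parameters and metric below are illustrative.

```python
# Hedged sketch of steps 4.2/4.3: log model metadata and commit the trained
# model to a model repository. MLflow is an assumption; the deck only shows a
# generic "model repository" next to Azure Databricks.
import mlflow
import mlflow.sklearn
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy training data, purely illustrative
X = np.random.rand(100, 3)
y = X @ np.array([1.0, 2.0, 3.0]) + 0.1 * np.random.rand(100)

with mlflow.start_run(run_name="interactive-dev"):
    model = LinearRegression().fit(X, y)

    # 4.2: log model info (parameters, metrics) for traceability
    mlflow.log_param("model_type", "LinearRegression")
    mlflow.log_metric("train_r2", model.score(X, y))

    # 4.3: commit the model artifact to the model repository
    mlflow.sklearn.log_model(model, artifact_path="model")
```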
Automated model training pipeline 1
A scenario is defined by a geo location, a product type and a time window (Scenario 1: l1, p1, t1; Scenario 2: l2, p2, t2; Scenario 3: l3, p3, t3; …; Scenario i: li, pi, ti), and all scenarios together form the scenario set.
Each scenario runs its own pipeline: source data → prep data → feature engineering → train → optimize.
Execution targets in gen1: Databricks clusters, VMs and containers.
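A small sketch of the scenario-set idea: each (geo location, product type, time) combination becomes one training run. The dataclass and the dispatch loop are illustrative; in gen1 each scenario ran as its own pipeline on a Databricks cluster, VM or container.

```python
# Sketch of the "scenario set": one training run per
# (geo location, product type, time) combination. Names and values are
# illustrative, not the actual H&M scenario definitions.
from dataclasses import dataclass
from itertools import product
from typing import List


@dataclass(frozen=True)
class Scenario:
    geo_location: str
    product_type: str
    time_window: str


def build_scenario_set(locations: List[str],
                       product_types: List[str],
                       time_windows: List[str]) -> List[Scenario]:
    """Cartesian product of the dimensions = the scenario set on the slide."""
    return [Scenario(l, p, t)
            for l, p, t in product(locations, product_types, time_windows)]


def run_scenario(scenario: Scenario) -> None:
    """Placeholder for source data -> prep data -> feature eng. -> train -> optimize."""
    print(f"training scenario: {scenario}")


if __name__ == "__main__":
    for scenario in build_scenario_set(["l1", "l2"], ["p1", "p2"], ["t1"]):
        run_scenario(scenario)
```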
Automated model training pipeline 2
The scenario set is expressed as an Airflow DAG: one task chain per scenario (source data → prep data → feature engineering → train → optimize), with the heavy work executed on Databricks clusters.
Airflow itself runs on Azure Kubernetes Service: the webserver and scheduler run in Kubernetes pods, DAGs and logs sit on a persistent volume backed by an Azure File share, images come from the container registry, and Airflow state lives in its metadata DB.
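A minimal sketch of how such a DAG can fan out one task chain per scenario, assuming Airflow 2.x import paths and plain PythonOperator tasks; in practice each step could instead submit a Databricks job. Scenario values and step bodies are placeholders, not the actual H&M DAG.

```python
# Sketch of the orchestration idea: an Airflow DAG that fans out one
# source -> prep -> feature -> train -> optimize chain per scenario.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

# Illustrative scenario set and step list (see the slides above)
SCENARIOS = [
    {"geo": "l1", "product": "p1", "time": "t1"},
    {"geo": "l2", "product": "p2", "time": "t2"},
]
STEPS = ["source_data", "prep_data", "feature_engineering", "train", "optimize"]


def run_step(step: str, scenario: dict) -> None:
    """Placeholder body; a real task could submit a Databricks run here."""
    print(f"{step} for scenario {scenario}")


with DAG(
    dag_id="scenario_training",
    start_date=datetime(2020, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    for i, scenario in enumerate(SCENARIOS):
        previous = None
        for step in STEPS:
            task = PythonOperator(
                task_id=f"{step}_scenario_{i}",
                python_callable=run_step,
                op_kwargs={"step": step, "scenario": scenario},
            )
            if previous is not None:
                previous >> task  # chain the steps within one scenario
            previous = task
```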
Trick for the Airflow dependency challenge
The actual Python method lives in its own module; the little trick is in how python_callable is set up, so the function can be called without importing its module when the DAG is parsed.
For more detail, check this blog post:
https://medium.com/@kevenwang_33862/machine-learning-in-production-2-large-scale-ml-training-889cde94f26d
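The exact mechanism is described in the linked post; one common form of the trick is sketched below, where the python_callable resolves and imports the target module only at task run time, so the Airflow scheduler does not need the task's heavy dependencies at DAG parse time. Module and function names are illustrative.

```python
# One common form of the trick: keep the DAG file import-light and let the
# python_callable import the heavy module only when the task actually runs.
import importlib


def lazy_call(module_path: str, function_name: str, **kwargs):
    """Import the module at run time and call the actual Python method."""
    module = importlib.import_module(module_path)
    return getattr(module, function_name)(**kwargs)


# In the DAG file the target is referenced by name only, e.g.:
# PythonOperator(
#     task_id="train",
#     python_callable=lazy_call,
#     op_kwargs={"module_path": "training.pipeline", "function_name": "train"},
# )
```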
Evolve to scale and industrialize across H&M
Make AI available for product teams across H&M Group
Facilitate scalability and specialization
Continue to build world-class AI products, engines and core components
The value has been proven use case by use case
Now: to reach the next level, we need to industrialize and scale AI across H&M
Reference Architecture gen2
MLOps and Operationalize AI
MLOps
Version compatibility, reproducibility, approval process, model format, experiment strategy, feedback loop, model traceability, model metadata, deployment strategy, scalability
MLOps tech stack
Model development – Interactive vs. Automated
▪ AI product lifecycle
▪ Notebooks and Python modules
▪ Containers as first-class citizens
▪ Airflow vs. Kubeflow
Model serving – deployment strategy
Release strategies:
▪ Single model: router → Model 1.1
▪ Canary: router splits live traffic between Model 1.1 and Model 1.2
▪ Shadow: router serves Model 1.1 and mirrors requests to Model 1.2
Experiment strategies:
▪ A/B test: router splits traffic across Model A1, Model A2 and Model A3
▪ Multi-armed bandit: router routes across Model A1, Model A2 and Model A3, with a reward system feeding outcomes back into the routing decision
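To make the multi-armed bandit strategy concrete, here is a minimal epsilon-greedy router sketch with a reward callback; the model names, the epsilon value and the reward signal are illustrative assumptions.

```python
# Epsilon-greedy bandit router: send most traffic to the best-performing
# model, keep exploring the others, and let a reward system update the stats.
import random
from collections import defaultdict


class EpsilonGreedyRouter:
    def __init__(self, model_ids, epsilon=0.1):
        self.model_ids = list(model_ids)
        self.epsilon = epsilon
        self.counts = defaultdict(int)
        self.total_reward = defaultdict(float)

    def route(self) -> str:
        """Pick a model: explore with probability epsilon, otherwise exploit."""
        if random.random() < self.epsilon or not any(self.counts.values()):
            return random.choice(self.model_ids)
        return max(
            self.model_ids,
            key=lambda m: self.total_reward[m] / max(self.counts[m], 1),
        )

    def record_reward(self, model_id: str, reward: float) -> None:
        """Reward-system callback: update statistics for the chosen model."""
        self.counts[model_id] += 1
        self.total_reward[model_id] += reward


router = EpsilonGreedyRouter(["model_a1", "model_a2", "model_a3"])
chosen = router.route()
router.record_reward(chosen, reward=1.0)  # e.g. click / conversion signal
```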
Model serving – Inference Graph
The serving graph composes an input transformer, Router 1 (multi-armed bandit), Router 2 (A/B test), Models A1/A2/A3 and B1/B2, and an output transformer into a single inference graph.
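A plain-Python sketch of one plausible wiring of the graph above (the slide shows the components, not the exact edges): an input transformer, a bandit router choosing between the A/B-tested A-models and the B-models, and an output transformer. In a real deployment this would typically be expressed in a serving framework rather than hand-rolled.

```python
# Illustrative composition of the inference-graph components; the routing
# logic is deliberately trivial (random choices) to keep the wiring visible.
import random


def input_transformer(payload: dict) -> dict:
    return {"features": payload["raw_features"]}


def model(name: str):
    # Each "model" is a callable returning a dummy prediction
    return lambda request: {"model": name, "score": random.random()}


def ab_router(models):
    """Router 2: uniform split across the given models (A/B test stand-in)."""
    return lambda request: random.choice(models)(request)


def bandit_router(arm_a, arm_b):
    """Router 1: placeholder bandit, here just a random arm choice."""
    return lambda request: random.choice([arm_a, arm_b])(request)


def output_transformer(prediction: dict) -> dict:
    return {"served_by": prediction["model"], "score": round(prediction["score"], 3)}


graph = bandit_router(
    arm_a=ab_router([model("A1"), model("A2"), model("A3")]),
    arm_b=ab_router([model("B1"), model("B2")]),
)

response = output_transformer(graph(input_transformer({"raw_features": [1, 2, 3]})))
print(response)
```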
Model management and lifecycle
Lifecycle stages: Model Development → Back Test → Model Approval → Staging → Production
CI/CD pipelines: PR pipeline, back-test pipeline, training CI pipeline, CD staging pipeline, CD prod pipeline
Developers work on feature branches and raise pull requests; infrastructure as code is applied per environment (#dev, #stage, #prod).
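A minimal sketch of the back-test / approval gate implied by the lifecycle above: a candidate model is promoted only if its back-test metric improves on the current production model. The metric (MAPE) and the margin are illustrative assumptions.

```python
# Sketch of a back-test / approval gate: promote the candidate from staging
# to production only if it beats the production model by a margin.
from dataclasses import dataclass


@dataclass
class BackTestResult:
    model_version: str
    mape: float  # mean absolute percentage error on the back-test period


def approve_for_production(candidate: BackTestResult,
                           current: BackTestResult,
                           min_improvement: float = 0.01) -> bool:
    """Approval step: lower error is better; require a minimum improvement."""
    return candidate.mape <= current.mape - min_improvement


if __name__ == "__main__":
    prod = BackTestResult("1.4", mape=0.182)
    candidate = BackTestResult("1.5", mape=0.165)
    if approve_for_production(candidate, prod):
        print(f"promote model {candidate.model_version} to production")
    else:
        print(f"keep model {prod.model_version} in production")
```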
Take away
▪ Problem, Process and Architecture
▪ Platform approach
▪ Leverage cloud-native services
Feedback
Your feedback is important to us.
Don’t forget to rate
and review the sessions.