
Automated Production Ready ML at Scale



In this session you will learn how H&M has created a reference architecture for deploying its machine learning models on Azure using Databricks, following DevOps principles. The architecture is currently used in production and has been iterated on multiple times to solve the pain points discovered along the way. The presenting team is responsible for ensuring that best practices are implemented across all H&M use cases, covering hundreds of models across the entire H&M group.

This architecture lets data scientists use notebooks for exploration and modeling, and it also gives engineers a way to build robust, production-grade code for deployment. The session also covers lifecycle management, traceability, automation, scalability, and version control.


Automated Production Ready ML at Scale

  1. WIFI SSID: Spark+AISummit | Password: UnifiedDataAnalytics
  2. AUTOMATED PRODUCTION READY ML @ SCALE • Errol Koolmeister, H&M • Keven Wang, H&M • #UnifiedDataAnalytics #SparkAISummit
  3. Agenda • AI journey @ H&M • Machine learning blueprint • Automated ML development process • ML orchestration for scale
  4. Agenda • AI journey @ H&M • Machine learning blueprint • Automated ML development process • ML orchestration for scale
  5. (no text)
  6. Our journey (Algo library, IT platform, Business impact) • 2016 Exploration: run initial PoCs; test AA appetite & applicability • 2017 Initiation: industrialize early use cases; defining organization and capability needs; establishing the IT / data environment • 2018 Establish AA & AI function: roll-out & hand over of successful pilots; establishing AA-WoW, team, governance • 2019 AA Leader: increasingly data & algo-driven retail business; analytical support across entire value chain; strong internal AA teams; engage in partnership with strong AI players • 2022 AI Leader of the Fashion Industry: lead the frontier of AI at scale in delivering customer value; global leader in developing talent pools and supporting AI hubs and networks; AI-powered tools and capabilities supporting core processes and business decisions in all functions; world leading ecosystem of cutting edge AI partners • "Today" marker on the timeline
  7. AI @ H&M quick facts • Growing # of colleagues: 100+ co-located FTEs • Several nationalities: 30+ different nationalities • Combined teams Sprints Standups Product mgmt. Epics Algo Cloud New ways of working Consultants HAAL Azure Databricks
  8. H&M use cases: H&M Advanced Analytics Landscape • Logistics • Production • Sales • Marketing • Design / Buying • Common components, e.g. algos & tech • Assortment quantification • Fashion Forecast • Allocation • Markdown Online • Markdown Store • Personalized Promotions, Recommendations & Journeys
  9. Agenda • AI journey @ H&M • Machine learning blueprint • Automated ML development process • ML orchestration for scale
  10. Fragmented solution landscape
  11. Model development & usage process • Model development: Training data ingestion, Data Preparation, Feature Engineering, Model development, Persisted Model • Model usage: Unseen data ingestion, Data Preparation, Transform data into features, Model prediction, Results • Data Storage • Model & data versioning • Deployment orchestration
  12. Generic AI development process • Model exploration: Data exploration, Feature engineering, Model exploration, Try out different libs • Model implementation: Data onboarding / ETL, Model implementation, Set up model training pipeline, Implement model serving, set up container, Unit test • Model training: Execute pipeline, Performance evaluation, Build model cross validation, Output model • Model tuning: Hyper parameter tuning, Model Assembling, Data augmentation • Build model env: Build model serving container • Offline model prediction: Offline prediction, Output result • Online model serving: A/B deployment of model serving, Rolling upgrade, A/B deployment • Model monitoring: Performance monitoring, Monitoring non functional
  13. Development process – tool mapping • Phases: Model development, Prediction • Stages: Model exploration, Model implementation, Model training, Model tuning, Build model applying env, Offline model prediction, Online model serving, Model monitoring • Tools, in slide order: Azure Databricks, VS Code, PyCharm, Data Lake Store, Data Lake Store, Kubernetes, Container Registry, PyCharm, Azure Databricks, Azure Databricks, Airflow
  14. Architecture Principles • Separation of concern • Automated • Stateless • Cloud native • Serverless
  15. Unifying architecture for speed & scale
  16. Agenda • AI journey @ H&M • Machine learning blueprint • Automated ML development process • ML orchestration for scale
  17. AUTOMATED ML DEVELOPMENT • Components: PyCharm, Triggering CI, Orchestrator, Azure Databricks, Model repository, Container Registry, Kubernetes • Steps: 1 Code commit • 2 Code static check, unit test, package • 3.1 Push to DBFS • 3.2 Trigger pipeline • 4.1 Job execution • 4.2 Log model info • 4.3 Commit model • 5.1 Fetch model • 5.2 Build container image • 6 Push container image • 7 Auto deploy new container
  18. Connect the dots • Stages: Exploration, Implementation, Build and packaging, Training and prediction, Monitoring • Shared vs. dedicated cluster • Notebook vs. Python modules • Library management • Training on worker nodes • Logging with MLflow
  19. Agenda • AI journey @ H&M • Machine learning blueprint • Automated ML development process • ML orchestration for scale
  20. Pipeline stages: Source data, Prepare data, Feature engineering, Training / hyper-param tuning (Train, Test, Val), Optimization (GLPK) • Workload labels, in slide order: Large size, parallel process • Large size, parallel process • Medium size, parallel process • Medium size, iterative/parallel process • Medium size, iterative process
  21. Internal information • Distributed computation: Spark task 1, Spark task 2 • Single machine computation: Python task 1, Python task 2, Python task 3
  22. Scenario 1 (geo location l1, product type p1, time t1), Scenario 2 (l2, p2, t2), Scenario 3 (l3, p3, t3), …, Scenario m (lm, pm, tm) • All scenarios run on one Databricks Cluster • Each scenario runs Source data, Prep data, Feature engine…, Train, Optimize, as Spark task 1, Spark task 2, Python task 1, Python task 2, Python task 3 • Durations shown: 30 mins, 60 mins, 5 mins, ? mins
  23. Scenario set: Scenario 1 (geo location l1, product type p1, time t1), Scenario 2 (l2, p2, t2), Scenario 3 (l3, p3, t3), Scenario i (li, pi, ti) • Each scenario runs Source data, Prep data, Feature engine…, Train, Optimize • Compute targets shown: Databricks Cluster, Databricks Cluster, Databricks Cluster, VM, VM, Container
  24. What we are looking for • An ML orchestrator to train models for different scenarios (a scenario set) • The scenario set can be parameterized • Leverage different computation patterns, such as Spark and Docker • Parallelize the scenarios as much as possible • Optimize both resource utilization and total lead time
  25. ML orchestrator - Airflow • Feature: Workflow scheduler by Airbnb • Implement pipeline/DAG in Python • Integration with different sources & sinks • Challenge: Multiple sources of failure • Lack of elasticity (scaling up/down) • Coupling of app dependencies with infrastructure • Diagram: How Apache Airflow Distributes Jobs on Celery workers
  26. DAG • Scenario set: Scenario 1, Scenario 2, Scenario 3, Scenario i, each running Source data, Prep data, Feature engine…, Train, Optimize • Further scenario sets made up of scenario tasks with the same stages • Components: Airflow Webserver, Airflow Scheduler, Airflow MetaDB, Airflow Logs, Airflow dags, Persistent Volume, Azure File share, Kubernetes Pod, AKS, Container Registry, Databricks Cluster, Databricks Cluster, Databricks Cluster
  27. DAG at a glance
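The requirements on slide 24 (a parameterized scenario set, scenarios run in parallel, each scenario running the same stages in order) can be illustrated with a minimal, standard-library-only sketch. It is not the orchestrator from the talk, which uses Airflow, Databricks and Kubernetes. The names `Scenario`, `build_scenario_set`, `run_scenario` and `run_scenario_set` are hypothetical, the stage bodies are placeholders, and expanding the parameters as a full cross product is an assumption; the slides only show scenarios indexed 1 to m.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from itertools import product

# Stage names follow the slides; what each stage does is not shown there.
STAGES = ("source_data", "prep_data", "feature_engineering", "train", "optimize")


@dataclass(frozen=True)
class Scenario:
    """One scenario, identified by the three parameters named on the slides."""
    geo_location: str
    product_type: str
    time: str


def build_scenario_set(geo_locations, product_types, times):
    """Expand parameter lists into scenarios (cross product: an assumption)."""
    return [Scenario(g, p, t) for g, p, t in product(geo_locations, product_types, times)]


def run_scenario(scenario):
    """Run the stages of one scenario in order; return the stages completed."""
    completed = []
    for stage in STAGES:
        # Placeholder: a real pipeline would submit a Spark job or start a
        # container for this stage and wait for it to finish.
        completed.append(stage)
    return scenario, completed


def run_scenario_set(scenarios, max_workers=4):
    """Run independent scenarios in parallel; map each scenario to its stages."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(run_scenario, scenarios))


if __name__ == "__main__":
    scenario_set = build_scenario_set(["l1", "l2"], ["p1", "p2"], ["t1"])
    results = run_scenario_set(scenario_set)
    print(f"{len(results)} scenarios finished")
```

Threads are enough here because the placeholder stages do no work; in a setup like the one described in the talk, each stage would be handed off to an external compute target, so the orchestrating process mostly waits.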
