Scaling Ride-Hailing with Machine Learning on MLflow

"GOJEK, the Southeast Asian super-app, has seen explosive growth in both users and data over the past three years. Today the technology startup uses machine learning powered by big data to inform decision-making in its ride-hailing, lifestyle, logistics, food delivery, and payment products: from selecting the right driver to dispatch, to dynamically setting prices, to serving food recommendations, to forecasting real-world events. Hundreds of millions of orders per month, across 18 products, are all driven by machine learning.

Building production-grade machine learning systems at GOJEK wasn't always easy. Data processing and machine learning pipelines were brittle, long-running, and hard to reproduce. Models and experiments were difficult to track, which led to downstream problems in production during serving and model evaluation. In this talk we will cover these and other challenges that we faced while trying to scale end-to-end machine learning systems at GOJEK. We will then introduce MLflow and explore the key features that make it useful as part of an ML platform. Finally, we will show how introducing MLflow into the ML lifecycle has helped to solve many of the problems we faced while scaling machine learning at GOJEK.
"

Published in: Data & Analytics

  1. 1. Scaling ride-hailing with Machine Learning on MLflow. Md Jawad, Data Scientist, GOJEK
  2. 2. Our Scale: operating in 4 countries and more than 70 cities. 80m app downloads, +250k merchants, 4 countries, 1m+ drivers, 100m+ monthly bookings. Indonesia, Singapore, Thailand, Vietnam
  3. 3. #JUSTGOJEKIT
  4. 4. Mobility Data Science Team
  5. 5. Mobility Data Science Team ■ Matchmaking ■ Surge pricing
  6. 6. Industry challenge
  7. 7. Agenda: 1. Matchmaking model (a. Background, b. Challenges, c. Desired state) 2. MLflow 3. Solution
  8. 8. Choosing best driver for the job. Customer, Selected driver. Criteria: High rating, Heading to home area, Lowest ETA
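Slide 8 names three signals for picking a driver: rating, whether the driver is heading to their home area, and ETA. The sketch below shows one way such signals could be combined into a single score. The linear form, the weights, the `max_eta` cap, and all field names are assumptions for illustration only; the slides do not describe how GOJEK's matchmaking model actually weighs them.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Driver:
    driver_id: str
    rating: float         # average rating on a 1-5 scale (assumed)
    eta_minutes: float    # estimated time to reach the customer
    heading_home: bool    # True if the trip takes the driver towards home


def score(driver: Driver, max_eta: float = 15.0) -> float:
    """Combine the three signals into one number; higher is better.

    The weights 0.3 / 0.5 / 0.2 are hypothetical placeholders.
    """
    rating_term = driver.rating / 5.0
    eta_term = 1.0 - min(driver.eta_minutes, max_eta) / max_eta
    home_term = 1.0 if driver.heading_home else 0.0
    return 0.3 * rating_term + 0.5 * eta_term + 0.2 * home_term


def select_driver(candidates: Sequence[Driver]) -> Driver:
    """Return the highest-scoring candidate."""
    return max(candidates, key=score)


candidates = [
    Driver("d1", rating=4.9, eta_minutes=9.0, heading_home=False),
    Driver("d2", rating=4.6, eta_minutes=3.0, heading_home=False),
    Driver("d3", rating=4.7, eta_minutes=4.0, heading_home=True),
]
best = select_driver(candidates)
```

With these made-up weights, a nearby driver who is also heading home outranks both a closer driver and a higher-rated one. A learned model would replace the hand-set weights.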
  9. 9. Matchmaking: First Cut Raw Data Prod Serving How can we get models into production asap?
  10. 10. Matchmaking: First Cut Raw Data Process Data Airflow Airflow DAG
  11. 11. Matchmaking: First Cut Prod Serving Deploy GitLab for CI/CD
  12. 12. Matchmaking: First Cut Raw Data Prod Serving How are we going to train models? Deploy Process Data Airflow
  13. 13. Matchmaking: First Cut Raw Data Prod Serving Build, Test, Deploy Application Process Data, Train Model Airflow Trigger: API Call Trigger: Daily Schedule Helm deploy to Kubernetes
  14. 14. Matchmaking: The Monolith Airflow Raw Data Prod Serving Process data + Train models + Deploy
  15. 15. Challenges with this approach ● Inefficient ○ Need to wait hours for pipeline to run before deploying models ○ Can’t deploy serving without trigger from Airflow
  16. 16. Challenges with this approach ● Inefficient ● Hard to experiment ○ Do we fork the codebase for each small change? ○ Do we fan-in and fan-out a single pipeline? ○ Tracking model performance over time
  17. 17. Challenges with this approach ● Inefficient ● Hard to experiment ● Versioning is broken Model tracking by timestamp?
  18. 18. Challenges with this approach ● Inefficient ● Hard to experiment ● Versioning is broken ● Low reproducibility ○ Pipelines have non-deterministic side inputs (API calls, fetching data, reading configuration) ○ No standardized way to track artifacts or processes
  19. 19. Challenges with this approach ● Inefficient ● Hard to experiment ● Versioning is broken ● Low reproducibility ● No visibility Features? Models? Parameters? Metrics?
  20. 20. Challenges with this approach ● Inefficient ● Hard to experiment ● Versioning is broken ● Low reproducibility ● Low visibility ● Hard to scale ○ How do we scale to 1000s of models and new markets? ○ Airflow trains model, triggers new deploy through GitLab ○ Hardcoded deployment targets
  21. 21. Challenges with this approach ● Inefficient ● Hard to experiment ● Versioning is broken ● Low reproducibility ● Low visibility ● Hard to scale ● No separation of roles Raw Data Prod Serving Process data + Train models + Deploy Responsibility of Data Engineers, Software Engineers, Data Scientists
  22. 22. Desired state ● Easy to experiment ● Easy to reproduce results ● Easy to deploy models ● Easy to evaluate performance of features and models ● Capable of scaling to 1000s of models in many regions
  23. 23. MLflow: An open source platform for the machine learning lifecycle. [Diagram labels: Raw Data, Data Prep, Training, Deploy, Tuning (μ λ θ), Scale, Model Exchange, Governance, Delta]
  24. 24. MLflow Components ● Tracking: record and query experiments (code, data, config, results) ● Projects: packaging format for reproducible runs on any platform ● Models: general model format that supports diverse deployment tools
  25. 25. Key Concepts in Tracking • Parameters: key-value inputs to your code • Metrics: numeric values (can update over time) • Artifacts: arbitrary files, including models • Source: which version of code ran?
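The four tracking concepts on slide 25 can be pictured as fields of a single run record. The class below is a minimal standard-library sketch of that data model, not MLflow's implementation. The `Run` class and its storage layout are hypothetical, and the method names only echo the `log_param` / `log_metric` calls shown on slide 32.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Run:
    """Hypothetical in-memory record of one training run."""
    source_version: str                                   # which version of the code ran
    params: Dict[str, str] = field(default_factory=dict)  # key-value inputs
    metrics: Dict[str, List[Tuple[int, float]]] = field(default_factory=dict)
    artifacts: Dict[str, bytes] = field(default_factory=dict)  # arbitrary files

    def log_param(self, key: str, value: object) -> None:
        # A parameter is written once and stored as a string.
        self.params[key] = str(value)

    def log_metric(self, key: str, value: float, step: int = 0) -> None:
        # A metric keeps its history, so it can be updated over time.
        self.metrics.setdefault(key, []).append((step, float(value)))

    def log_artifact(self, name: str, payload: bytes) -> None:
        self.artifacts[name] = payload

    def latest_metric(self, key: str) -> float:
        return self.metrics[key][-1][1]


run = Run(source_version="abc1234")       # placeholder commit id
run.log_param("alpha", 0.5)
run.log_metric("rmse", 0.83, step=0)
run.log_metric("rmse", 0.79, step=1)
run.log_artifact("model.bin", b"serialized-model-bytes")
```

Keeping metrics as a list of `(step, value)` pairs is what lets a run show how a value such as `rmse` changes during training, while parameters stay fixed for the whole run.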
  26. 26. Legacy ML workflow Airflow Raw Data Prod Serving Process data + Train models + Deploy
  27. 27. Approach 1. Decouple based on concerns Raw Data Prod Serving Deploy Airflow Process Data ??? Train Models ???
  28. 28. 1. Decouple based on concerns 2. Implement ML pipeline solution Raw Data Prod Serving Deploy Airflow Process Data ??? Train Models ??? Approach
  29. 29. 1. Decouple based on concerns 2. Implement ML pipeline solution and Continuous Delivery solution Raw Data Prod Serving Deploy Airflow Process Data ??? Train Models ??? Approach
  30. 30. 1. Decouple based on concerns 2. Implement ML pipeline solution and Continuous Delivery solution 3. Add an artifact store between stages for features (Feast) Feature Store Raw Data Prod Serving Deploy Airflow Process Data *GOJEK/ Feast Train Models ??? *http://github.com/gojek/feast Approach
  31. 31. 1. Decouple based on concerns 2. Implement ML pipeline solution and Continuous Delivery solution 3. Add an artifact store between stages for features (Feast) and models (MLflow) Model Store Feature Store Raw Data Prod Serving Airflow Process Data GOJEK/ Feast Train Models Deploy Approach
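The decoupling on slide 31 means each stage reads from and writes to a store instead of calling the next stage directly. The sketch below illustrates that idea with two plain dictionaries standing in for the feature store and the model store. It does not use the Feast or MLflow APIs, and the function names, version labels, and the trivial "model" (a mean ETA) are all hypothetical.

```python
from statistics import mean

feature_store = {}   # stand-in for the feature store (Feast on the slide)
model_store = {}     # stand-in for the model store (MLflow on the slide)


def process_data(raw_etas, feature_version):
    """Data processing stage: writes features, knows nothing about training."""
    feature_store[feature_version] = {"mean_eta": mean(raw_etas)}


def train_model(feature_version, model_version):
    """Training stage: reads features by version, writes a model by version."""
    features = feature_store[feature_version]
    model_store[model_version] = {
        "predicted_eta": features["mean_eta"],
        "trained_on": feature_version,
    }


def deploy(model_version):
    """Deployment stage: reads a published model, knows nothing about training."""
    return model_store[model_version]


process_data([4.0, 6.0, 8.0], feature_version="f1")
train_model("f1", model_version="m1")
train_model("f1", model_version="m2")   # second experiment reuses f1 as-is
serving_model = deploy("m1")
```

Because training only depends on a feature version, a second experiment can run without re-processing raw data, and deployment can pick any published model without waiting for a pipeline run.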
  32. 32. Advantages: Asynchronous Experimentation Raw Data Process Data Prod Serving Feature Store Train Models Deploy Time based, Instance based, Artifact based (1, 2, 3)
          with mlflow.start_run():
              # train model...
              mlflow.log_param("alpha", alpha)
              mlflow.log_param("l1_ratio", l1_ratio)
              mlflow.log_metric("rmse", rmse)
              mlflow.log_metric("r2", r2)
              mlflow.sklearn.log_model(lr, "model")
  33. 33. Advantages: Reproducible & Traceable Raw Data Process Data Prod Serving Feature Store Train Models Deploy ● Track artifacts used to produce features ○ data sources ○ jobs ○ parameters ● Track artifacts used to train models ○ features ○ pipeline version (git+SHA) ○ other pipeline variables ● Track artifacts used to deploy ML systems ○ docker image ○ configuration ○ model version ○ feature data
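The three artifact lists on slide 33 chain together: features feed training, and both feed deployment. One way to make that chain traceable is to give each stage a manifest and identify it by a hash of its contents rather than by timestamp. The sketch below assumes that approach. The `fingerprint` helper, the manifest keys, and every value in them are hypothetical placeholders, not GOJEK's actual schema.

```python
import hashlib
import json


def fingerprint(record: dict) -> str:
    """Content-based id: the same inputs always give the same id."""
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


# Artifacts used to produce features.
feature_manifest = {
    "data_sources": ["bookings", "driver_locations"],
    "job": "build_driver_features",
    "parameters": {"window_days": 7},
}

# Artifacts used to train models; refers to the features by fingerprint.
training_manifest = {
    "features": fingerprint(feature_manifest),
    "pipeline_version": "git+<SHA>",          # placeholder, as on the slide
    "variables": {"alpha": 0.5},
}

# Artifacts used to deploy; refers to both upstream stages.
deployment_manifest = {
    "docker_image": "registry.example/driver-allocation:1.0",
    "configuration": {"replicas": 2},
    "model_version": fingerprint(training_manifest),
    "feature_data": fingerprint(feature_manifest),
}
```

Changing any upstream input, such as `window_days`, changes the feature fingerprint and therefore every id derived from it, so a deployed system can be traced back to the exact inputs that produced it.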
  34. 34. Advantages: Governance & Evaluation Prod Serving Feature Store Train Models Deploy ● training run parameters ● deployment configuration ● model performance ● feature performance
  35. 35. Advantages: Role Separation Raw Data Process Data Prod Serving Feature Store Train Models Deploy Data Scientist, Software Engineer, Data Engineer
  36. 36. Advantages: Scalability. Driver Allocation System: (3 environments) x (4 markets) x (5 model types) x (10+ live experiments) = 600+ simultaneous deployments. CD Pipeline (pull based): Configuration, Helm Charts, Docker Images. 1. New model is published 2. Monitors all artifacts for new versions 3. Test and deploy changes to relevant clusters. Clusters: gke-PROD-SG-T1-EXP2323, gke-PROD-TH-T2-EXP1006, gke-PROD-ID-T3-EXP3423, gke-PROD-VN-T4-EXP1800
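The deployment count on slide 36 is a Cartesian product, and the pull-based CD step amounts to comparing what is published with what is deployed. The sketch below works through both under stated assumptions: only `PROD` and the four market codes appear on the slide, so the other environment names, the reading of `T1`..`T5` as model types, the experiment ids, and the `targets_to_update` helper are hypothetical.

```python
from itertools import product

environments = ["DEV", "STAGING", "PROD"]             # only PROD is on the slide
markets = ["ID", "SG", "TH", "VN"]                    # from the cluster names
model_types = [f"T{i}" for i in range(1, 6)]          # assumed meaning of T1..T5
experiments = [f"EXP{1000 + i}" for i in range(10)]   # made-up experiment ids

# One deployment target per combination, named after the slide's pattern.
targets = [
    f"gke-{env}-{market}-{model}-{exp}"
    for env, market, model, exp in product(
        environments, markets, model_types, experiments
    )
]


def targets_to_update(published: dict, deployed: dict) -> list:
    """Pull-based check: targets whose deployed version differs from published."""
    return sorted(
        target
        for target, version in published.items()
        if deployed.get(target) != version
    )


published = {"gke-PROD-SG-T1-EXP1000": "v2", "gke-PROD-TH-T2-EXP1001": "v1"}
deployed = {"gke-PROD-SG-T1-EXP1000": "v1", "gke-PROD-TH-T2-EXP1001": "v1"}
stale = targets_to_update(published, deployed)
```

With exactly 10 experiments the product gives 3 x 4 x 5 x 10 = 600 targets, matching the slide's lower bound. Only the target whose published version moved ahead is returned for redeployment, which is why a pull-based pipeline avoids hardcoded deployment targets.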
  37. 37. Thank you