SlideShare a Scribd company logo
1 of 63
Download to read offline
Michelangelo
Jeremy Hermann, Machine Learning Platform @ Uber
MISSION
Enable engineers and data scientists across the
company to easily build and deploy machine learning
solutions at scale.
AGENDA
○ ML at Uber
○ Why Build an ML Platform?
○ Key Platform Components
○ System Architecture
○ ML as Software Engineering
○ What’s Next?
ML at Uber
ML at Uber
○ Uber Eats
○ ETAs
○ Autonomous Cars
○ Customer Support
○ Dispatch
○ Personalization
○ Demand Modeling
○ Dynamic Pricing
○ Forecasting
○ Maps
○ Fraud
○ Destination Predictions
○ Anomaly Detection
○ Capacity Planning
○ And many more...
ML at Uber - ETAs
○ ETAs are core to customer experience and used
by myriad internal systems
○ ETA are generated by route-based algorithm
called Garafu
○ Garafu is often incorrect - but it’s incorrect in
predictable ways
○ ML model predicts the Garafu error
○ Use the predicted error to correct the ETA
○ ETAs now dramatically more accurate
ML at Uber - Eats
○ Models used for
○ Ranking of restaurants and
dishes
○ Delivery times
○ Search ranking
○ 100s of ML models called to render
Eats homepage
ML at Uber - Autonomous Cars
ML at Uber - Dispatch
○ Optimize matching of rider and
driver
○ Predict if open rider app will
make trip request
ML at Uber - Map Making
ML at Uber - Map Making
ML at Uber - Map Making
ML at Uber - Destination Prediction
ML at Uber - Spatiotemporal Forecasting
Supply
○ Available Drivers
Demand
○ Open Apps
Other
○ Request Times
○ Arrival Times
○ Airport Demand
ML at Uber - Customer Support
○ 5 customer-agent communication
channels
○ Hundreds of thousands of tickets
surfacing daily on the platform across
400+ cities
○ NLP models classify tickets and
suggest response templates
○ Reduce ticket resolution time by 10%+
with same or higher CSAT
Why build an ML platform?
Early challenges with machine learning
○ Limited scale with Python and R
○ Pipelines not reliable or reproducible
○ Many one-off production systems for serving
Goals of platform
○ Standardize workflows and tools
○ Provide scalable support for end-to-end ML workflow
○ Democratize and accelerate machine learning through ease of use
Motivation for Platform
Same basic ML workflow & system requirements for
○ Traditional ML & deep learning
○ Supervised, unsupervised, & semi-supervised
learning
○ Online or continuous learning
○ Batch, online, & mobile deployments
○ Time-series forecasting
Machine Learning Workflow
MANAGE DATA
TRAIN MODELS
EVALUATE MODELS
DEPLOY MODELS
MAKE PREDICTIONS
MONITOR PREDICTIONS
Key Platform Components
Key Components:
Feature Store & Feature Engineering
Problem
○ Hardest part of ML is finding good features
○ Same features are often used by different models built by different teams
Solution
○ Centralized feature store for collecting and sharing features
○ Platform team curates core set of widely applicable features
○ Modellers contribute more features as part of ongoing model building
○ Meta-data for each feature to track ownership, how computed, where used, etc
○ Modellers select features by name & join key. Offline & online pipelines
auto-configured
Feature Store (aka Palette)
DSL for Feature Engineering
Batch Training Job
(Spark)
Training
Algo
DSL
Batch Prediction Job
(Spark)
Trained
Model
DSL
Pure function expressions for
○ Feature selection
○ Feature transformations (for derived & composite features)
Standard set of accessor functions for
○ Feature store
○ Basis features
○ Column stats (min, max, mean, std-dev, etc)
Standard transformation functions + UDFs
Examples
○ @palette:store:orders:prep_time_avg_1week:rs_uuid
○ nFill(@basis:distance, mean(@basis:distance))
Pipeline for Offline Training with Feature Store
SPARK or SQL FEATURE DSL TRAINING ALGO
RAW DATA
BASIS
FEATURES
TRANSFORMED
FEATURES
MODEL
HIVE
FEATURE STORE
FEATURE STORE
FEATURES
HIVE
DATA LAKE
Pipeline for Online Serving with Feature Store
FEATURE DSL
SERVABLE
MODEL
BASIS
FEATURES
TRANSFORMED
FEATURES
CASSANDRA
FEATURE STORE
FEATURE STORE
FEATURES
CLIENT
SERVICE
Key Components:
Scalable Model Training
Large-scale distributed training (billions of samples)
○ Decision trees
○ Linear and logistic models
○ Unsupervised learning
○ Time series forecasting
○ Hyperparameter search for all model types
Smart pipeline management to balance speed and reliability
○ Fuse operators into single job for speed
○ Break operators into separate jobs to reliability
Distributed Training of Non-DL Models
○ Data-parallelism works best
when model is small enough to
fit on each GPU
○ Ring-allreduce is more efficient
than parameter servers for
averaging weights
○ Faster training and better GPU
utilization
○ Much simpler training scripts
○ More details at
http://eng.uber.com/horovod
Distributed Training of Deep Learning Models with Horovod
Key Components:
Partitioned Models
Problem
○ Often want to train a model per city or per product
○ Hard to train and deploy 100s or 1000s of individual models
Solution
○ Let users define hierarchical partitioning scheme
○ Automatically train model per partition
○ Manage and deploy as single logical model
Partitioned Models
GLOBAL
COUNTRY
CITY
Define partition scheme1
GLOBAL
COUNTRY
CITY
Make train / test split2
GLOBAL
COUNTRY
CITY
Keep same split and partition for each level3
M
M
M M M
M
M M M
GLOBAL
COUNTRY
CITY
Train model for every node4
M
M
M M
M
M M M
GLOBAL
COUNTRY
CITY
Prune bad models5
M
M
M M
M
M M M
GLOBAL
COUNTRY
CITY
At serving time, route to best model for each node6
Key Components:
Model Visualization
Evaluate Models
Problem
○ It takes many iterations to produce a good model
○ Keeping track of how a model was built is important
○ Evaluating and comparing models is hard
With every trained model, we capture standard metadata and reports
○ Full model configuration, including train and test datasets
○ Training job metrics
○ Model accuracy metrics
○ Performance of model after deployment
Model Visualization - Regression Model
Model Visualization - Classification Model
Model Visualization - Feature Report
Model Visualization - Decision Tree
Key Components:
Sharded Deployment and Serving
Prediction Service
○ Thrift service container for one or more models
○ Scale out in Docker on Mesos
○ Single or multi-tenant deployments
○ Connection management and batched / parallelized queries to Cassandra
○ Monitoring & alerting
Deployment
○ Model & DSL packaged as JAR file
○ One click deploy across DCs via standard Uber deployment infrastructure
○ Health checks and rollback
Online Prediction Service
Realtime Predict Service
Deployed ModelDeployed Model
Client
Service
Deployed Model
Model
Cassandra Feature StoreRouting
Infra
DSLModel Manager1
2
3
4
5
Online Prediction Service
Online Prediction Service
Typical p95 latency from client service
○ ~5ms when all features from client service
○ ~10ms when joining pre-computed features from Cassandra
Peak prediction volume across current online deployments
○ 600k+ QPS
Problem
○ Prediction service can serve as many models as will fit into memory
○ Easy to run out of memory with large deployments of complex models
Solution
○ Organize serving cluster into number of physical shards
○ Introduce client facing concept of ‘virtual shard’ that is specified at deploy time
○ Virtual shards are mapped by system to physical shards
○ Models are loaded by service instances in the correct physical shard(s)
○ Gateway service routes to correct physical shard based on request header
Sharded Deployment
Client
Service
Predict Service
Predict Service
Predict Service
Predict Service
Predict Service
A B C
D E F
G H I
Unsharded Deployment
Client
Service
Gateway
Routing Table
Predict Service
Predict Service
Predict Service
Predict Service
Predict Service
(Shard 2)
E F
G H I
Predict Service
Predict Service
Predict Service
Predict Service
Predict Service
(Shard 1)
A B C
D
Sharded Deployment
Key Components:
Deployment Labels
Problem
○ Multiple models per container (entirely different or multiple versions of same)
○ Support experimentation
○ Support automated retrain / redeploy
○ Cumbersome to have client service manage routing
Solution
○ Models deployed to 'label'
○ Labels can be used for experimentation or different use cases
○ Predict service routes request to most recent model w/ specified label
○ Labels have schema so deploys won't break
Deployment Labels
Key Components:
Live Model Performance Monitoring
Monitor Predictions
Problem
○ Models trained and evaluated against
historical data
○ Need to ensure deployed model is
making good predictions going forward
Solution
○ Log predictions & join to actual
outcomes
○ Publish metrics feature and prediction
distributions over time
○ Dashboards and alerts
System Architecture
Management
Monitor
API
Python / Java
What’s Next?
What’s Next?
○ Python ML for ease of use and broader algorithm support
○ Notebook-centered model building workflow
○ Online / continuous learning
○ AutoML to automate more of the modelling work
Thank you!
eng.uber.com/michelangelo
eng.uber.com/horovod
uber.com/careers
Proprietary and confidential © 2017 Uber Technologies, Inc. All rights reserved. No part of this
document may be reproduced or utilized in any form or by any means, electronic or mechanical,
including photocopying, recording, or by any information storage or retrieval systems, without
permission in writing from Uber. This document is intended only for the use of the individual or entity
to whom it is addressed and contains information that is privileged, confidential or otherwise exempt
from disclosure under applicable law. All recipients of this document are notified that the information
contained herein includes proprietary and confidential information of Uber, and recipient may not
make use of, disseminate, or in any way disclose this document or any of the enclosed information to
any person other than employees of addressee to the extent necessary for consultations with
authorized personnel of Uber.

More Related Content

What's hot

Introducing Databricks Delta
Introducing Databricks DeltaIntroducing Databricks Delta
Introducing Databricks DeltaDatabricks
 
Advanced MLflow: Multi-Step Workflows, Hyperparameter Tuning and Integrating ...
Advanced MLflow: Multi-Step Workflows, Hyperparameter Tuning and Integrating ...Advanced MLflow: Multi-Step Workflows, Hyperparameter Tuning and Integrating ...
Advanced MLflow: Multi-Step Workflows, Hyperparameter Tuning and Integrating ...Databricks
 
Data as a Product by Wayne Eckerson
Data as a Product by Wayne EckersonData as a Product by Wayne Eckerson
Data as a Product by Wayne EckersonZoomdata
 
Getting the Most Out of EPM: A deep dive into Account Reconciliation Manager
Getting the Most Out of EPM: A deep dive into Account Reconciliation ManagerGetting the Most Out of EPM: A deep dive into Account Reconciliation Manager
Getting the Most Out of EPM: A deep dive into Account Reconciliation Managerfinitsolutions
 
data-mesh-101.pptx
data-mesh-101.pptxdata-mesh-101.pptx
data-mesh-101.pptxTarekHamdi8
 
Building End-to-End Delta Pipelines on GCP
Building End-to-End Delta Pipelines on GCPBuilding End-to-End Delta Pipelines on GCP
Building End-to-End Delta Pipelines on GCPDatabricks
 
Mastering Customer Data on Apache Spark
Mastering Customer Data on Apache SparkMastering Customer Data on Apache Spark
Mastering Customer Data on Apache SparkCaserta
 
2019 Slides - Michelangelo Palette: A Feature Engineering Platform at Uber
2019 Slides - Michelangelo Palette: A Feature Engineering Platform at Uber2019 Slides - Michelangelo Palette: A Feature Engineering Platform at Uber
2019 Slides - Michelangelo Palette: A Feature Engineering Platform at UberKarthik Murugesan
 
Considerations for Data Access in the Lakehouse
Considerations for Data Access in the LakehouseConsiderations for Data Access in the Lakehouse
Considerations for Data Access in the LakehouseDatabricks
 
Data Pipline Observability meetup
Data Pipline Observability meetup Data Pipline Observability meetup
Data Pipline Observability meetup Omid Vahdaty
 
Data Warehouse or Data Lake, Which Do I Choose?
Data Warehouse or Data Lake, Which Do I Choose?Data Warehouse or Data Lake, Which Do I Choose?
Data Warehouse or Data Lake, Which Do I Choose?DATAVERSITY
 
Machine Learning Platformization & AutoML: Adopting ML at Scale in the Enterp...
Machine Learning Platformization & AutoML: Adopting ML at Scale in the Enterp...Machine Learning Platformization & AutoML: Adopting ML at Scale in the Enterp...
Machine Learning Platformization & AutoML: Adopting ML at Scale in the Enterp...Ed Fernandez
 
MLOps Virtual Event: Automating ML at Scale
MLOps Virtual Event: Automating ML at ScaleMLOps Virtual Event: Automating ML at Scale
MLOps Virtual Event: Automating ML at ScaleDatabricks
 
Apache Kafka With Spark Structured Streaming With Emma Liu, Nitin Saksena, Ra...
Apache Kafka With Spark Structured Streaming With Emma Liu, Nitin Saksena, Ra...Apache Kafka With Spark Structured Streaming With Emma Liu, Nitin Saksena, Ra...
Apache Kafka With Spark Structured Streaming With Emma Liu, Nitin Saksena, Ra...HostedbyConfluent
 
AutoML - The Future of AI
AutoML - The Future of AIAutoML - The Future of AI
AutoML - The Future of AINing Jiang
 
Databricks Overview for MLOps
Databricks Overview for MLOpsDatabricks Overview for MLOps
Databricks Overview for MLOpsDatabricks
 
Apply MLOps at Scale by H&M
Apply MLOps at Scale by H&MApply MLOps at Scale by H&M
Apply MLOps at Scale by H&MDatabricks
 

What's hot (20)

Introducing Databricks Delta
Introducing Databricks DeltaIntroducing Databricks Delta
Introducing Databricks Delta
 
Advanced MLflow: Multi-Step Workflows, Hyperparameter Tuning and Integrating ...
Advanced MLflow: Multi-Step Workflows, Hyperparameter Tuning and Integrating ...Advanced MLflow: Multi-Step Workflows, Hyperparameter Tuning and Integrating ...
Advanced MLflow: Multi-Step Workflows, Hyperparameter Tuning and Integrating ...
 
Observability
ObservabilityObservability
Observability
 
Data as a Product by Wayne Eckerson
Data as a Product by Wayne EckersonData as a Product by Wayne Eckerson
Data as a Product by Wayne Eckerson
 
Getting the Most Out of EPM: A deep dive into Account Reconciliation Manager
Getting the Most Out of EPM: A deep dive into Account Reconciliation ManagerGetting the Most Out of EPM: A deep dive into Account Reconciliation Manager
Getting the Most Out of EPM: A deep dive into Account Reconciliation Manager
 
Observability at Spotify
Observability at SpotifyObservability at Spotify
Observability at Spotify
 
data-mesh-101.pptx
data-mesh-101.pptxdata-mesh-101.pptx
data-mesh-101.pptx
 
Building End-to-End Delta Pipelines on GCP
Building End-to-End Delta Pipelines on GCPBuilding End-to-End Delta Pipelines on GCP
Building End-to-End Delta Pipelines on GCP
 
Mastering Customer Data on Apache Spark
Mastering Customer Data on Apache SparkMastering Customer Data on Apache Spark
Mastering Customer Data on Apache Spark
 
2019 Slides - Michelangelo Palette: A Feature Engineering Platform at Uber
2019 Slides - Michelangelo Palette: A Feature Engineering Platform at Uber2019 Slides - Michelangelo Palette: A Feature Engineering Platform at Uber
2019 Slides - Michelangelo Palette: A Feature Engineering Platform at Uber
 
Considerations for Data Access in the Lakehouse
Considerations for Data Access in the LakehouseConsiderations for Data Access in the Lakehouse
Considerations for Data Access in the Lakehouse
 
Data Pipline Observability meetup
Data Pipline Observability meetup Data Pipline Observability meetup
Data Pipline Observability meetup
 
Data Warehouse or Data Lake, Which Do I Choose?
Data Warehouse or Data Lake, Which Do I Choose?Data Warehouse or Data Lake, Which Do I Choose?
Data Warehouse or Data Lake, Which Do I Choose?
 
Machine Learning Platformization & AutoML: Adopting ML at Scale in the Enterp...
Machine Learning Platformization & AutoML: Adopting ML at Scale in the Enterp...Machine Learning Platformization & AutoML: Adopting ML at Scale in the Enterp...
Machine Learning Platformization & AutoML: Adopting ML at Scale in the Enterp...
 
MLOps Virtual Event: Automating ML at Scale
MLOps Virtual Event: Automating ML at ScaleMLOps Virtual Event: Automating ML at Scale
MLOps Virtual Event: Automating ML at Scale
 
Apache Kafka With Spark Structured Streaming With Emma Liu, Nitin Saksena, Ra...
Apache Kafka With Spark Structured Streaming With Emma Liu, Nitin Saksena, Ra...Apache Kafka With Spark Structured Streaming With Emma Liu, Nitin Saksena, Ra...
Apache Kafka With Spark Structured Streaming With Emma Liu, Nitin Saksena, Ra...
 
AutoML - The Future of AI
AutoML - The Future of AIAutoML - The Future of AI
AutoML - The Future of AI
 
Databricks Overview for MLOps
Databricks Overview for MLOpsDatabricks Overview for MLOps
Databricks Overview for MLOps
 
Monitoring AI with AI
Monitoring AI with AIMonitoring AI with AI
Monitoring AI with AI
 
Apply MLOps at Scale by H&M
Apply MLOps at Scale by H&MApply MLOps at Scale by H&M
Apply MLOps at Scale by H&M
 

Similar to Michelangelo - Machine Learning Platform - 2018

Deploying ML models in the enterprise
Deploying ML models in the enterpriseDeploying ML models in the enterprise
Deploying ML models in the enterprisedoppenhe
 
ML Platform Q1 Meetup: Airbnb's End-to-End Machine Learning Infrastructure
ML Platform Q1 Meetup: Airbnb's End-to-End Machine Learning InfrastructureML Platform Q1 Meetup: Airbnb's End-to-End Machine Learning Infrastructure
ML Platform Q1 Meetup: Airbnb's End-to-End Machine Learning InfrastructureFei Chen
 
Slides-Артем Коваль-Cloud-Native MLOps Framework - DataFest 2021.pdf
Slides-Артем Коваль-Cloud-Native MLOps Framework - DataFest 2021.pdfSlides-Артем Коваль-Cloud-Native MLOps Framework - DataFest 2021.pdf
Slides-Артем Коваль-Cloud-Native MLOps Framework - DataFest 2021.pdfvitm11
 
Machine learning and big data @ uber a tale of two systems
Machine learning and big data @ uber a tale of two systemsMachine learning and big data @ uber a tale of two systems
Machine learning and big data @ uber a tale of two systemsZhenxiao Luo
 
Machine Learning Orchestration with Airflow
Machine Learning Orchestration with AirflowMachine Learning Orchestration with Airflow
Machine Learning Orchestration with AirflowAnant Corporation
 
Data Engineer's Lunch 89: Machine Learning Orchestration with AirflowMachine ...
Data Engineer's Lunch 89: Machine Learning Orchestration with AirflowMachine ...Data Engineer's Lunch 89: Machine Learning Orchestration with AirflowMachine ...
Data Engineer's Lunch 89: Machine Learning Orchestration with AirflowMachine ...Anant Corporation
 
Scaling machinelearning as a service at uber li Erran li - 2016
Scaling machinelearning as a service at uber li Erran li - 2016Scaling machinelearning as a service at uber li Erran li - 2016
Scaling machinelearning as a service at uber li Erran li - 2016Karthik Murugesan
 
Scaling machine learning as a service at Uber — Li Erran Li at #papis2016
Scaling machine learning as a service at Uber — Li Erran Li at #papis2016Scaling machine learning as a service at Uber — Li Erran Li at #papis2016
Scaling machine learning as a service at Uber — Li Erran Li at #papis2016PAPIs.io
 
[DSC Europe 23] Petar Zecevic - ML in Production on Databricks
[DSC Europe 23] Petar Zecevic - ML in Production on Databricks[DSC Europe 23] Petar Zecevic - ML in Production on Databricks
[DSC Europe 23] Petar Zecevic - ML in Production on DatabricksDataScienceConferenc1
 
Machine Learning Platform @Flipkart - Slash N Conference 2018
Machine Learning Platform @Flipkart - Slash N Conference 2018Machine Learning Platform @Flipkart - Slash N Conference 2018
Machine Learning Platform @Flipkart - Slash N Conference 2018Naresh Sankapelly
 
Scaling Ride-Hailing with Machine Learning on MLflow
Scaling Ride-Hailing with Machine Learning on MLflowScaling Ride-Hailing with Machine Learning on MLflow
Scaling Ride-Hailing with Machine Learning on MLflowDatabricks
 
[DSC Europe 22] Engineers guide for shepherding models in to production - Mar...
[DSC Europe 22] Engineers guide for shepherding models in to production - Mar...[DSC Europe 22] Engineers guide for shepherding models in to production - Mar...
[DSC Europe 22] Engineers guide for shepherding models in to production - Mar...DataScienceConferenc1
 
Mohamed Sabri: Operationalize machine learning with Kubeflow
Mohamed Sabri: Operationalize machine learning with KubeflowMohamed Sabri: Operationalize machine learning with Kubeflow
Mohamed Sabri: Operationalize machine learning with KubeflowLviv Startup Club
 
Mohamed Sabri: Operationalize machine learning with Kubeflow
Mohamed Sabri: Operationalize machine learning with KubeflowMohamed Sabri: Operationalize machine learning with Kubeflow
Mohamed Sabri: Operationalize machine learning with KubeflowEdunomica
 
Serverless machine learning architectures at Helixa
Serverless machine learning architectures at HelixaServerless machine learning architectures at Helixa
Serverless machine learning architectures at HelixaData Science Milan
 
Machine learning at scale - Webinar By zekeLabs
Machine learning at scale - Webinar By zekeLabsMachine learning at scale - Webinar By zekeLabs
Machine learning at scale - Webinar By zekeLabszekeLabs Technologies
 
World Artificial Intelligence Conference Shanghai 2018
World Artificial Intelligence Conference Shanghai 2018World Artificial Intelligence Conference Shanghai 2018
World Artificial Intelligence Conference Shanghai 2018Adam Gibson
 
GenAI at UBER: Scaling Infrastructure
GenAI at UBER: Scaling InfrastructureGenAI at UBER: Scaling Infrastructure
GenAI at UBER: Scaling InfrastructureMemory Fabric Forum
 

Similar to Michelangelo - Machine Learning Platform - 2018 (20)

Deploying ML models in the enterprise
Deploying ML models in the enterpriseDeploying ML models in the enterprise
Deploying ML models in the enterprise
 
ML Platform Q1 Meetup: Airbnb's End-to-End Machine Learning Infrastructure
ML Platform Q1 Meetup: Airbnb's End-to-End Machine Learning InfrastructureML Platform Q1 Meetup: Airbnb's End-to-End Machine Learning Infrastructure
ML Platform Q1 Meetup: Airbnb's End-to-End Machine Learning Infrastructure
 
Slides-Артем Коваль-Cloud-Native MLOps Framework - DataFest 2021.pdf
Slides-Артем Коваль-Cloud-Native MLOps Framework - DataFest 2021.pdfSlides-Артем Коваль-Cloud-Native MLOps Framework - DataFest 2021.pdf
Slides-Артем Коваль-Cloud-Native MLOps Framework - DataFest 2021.pdf
 
Machine learning and big data @ uber a tale of two systems
Machine learning and big data @ uber a tale of two systemsMachine learning and big data @ uber a tale of two systems
Machine learning and big data @ uber a tale of two systems
 
Machine Learning Orchestration with Airflow
Machine Learning Orchestration with AirflowMachine Learning Orchestration with Airflow
Machine Learning Orchestration with Airflow
 
Data Engineer's Lunch 89: Machine Learning Orchestration with AirflowMachine ...
Data Engineer's Lunch 89: Machine Learning Orchestration with AirflowMachine ...Data Engineer's Lunch 89: Machine Learning Orchestration with AirflowMachine ...
Data Engineer's Lunch 89: Machine Learning Orchestration with AirflowMachine ...
 
Scaling machinelearning as a service at uber li Erran li - 2016
Scaling machinelearning as a service at uber li Erran li - 2016Scaling machinelearning as a service at uber li Erran li - 2016
Scaling machinelearning as a service at uber li Erran li - 2016
 
Scaling machine learning as a service at Uber — Li Erran Li at #papis2016
Scaling machine learning as a service at Uber — Li Erran Li at #papis2016Scaling machine learning as a service at Uber — Li Erran Li at #papis2016
Scaling machine learning as a service at Uber — Li Erran Li at #papis2016
 
[DSC Europe 23] Petar Zecevic - ML in Production on Databricks
[DSC Europe 23] Petar Zecevic - ML in Production on Databricks[DSC Europe 23] Petar Zecevic - ML in Production on Databricks
[DSC Europe 23] Petar Zecevic - ML in Production on Databricks
 
Machine Learning Platform @Flipkart - Slash N Conference 2018
Machine Learning Platform @Flipkart - Slash N Conference 2018Machine Learning Platform @Flipkart - Slash N Conference 2018
Machine Learning Platform @Flipkart - Slash N Conference 2018
 
Scaling Ride-Hailing with Machine Learning on MLflow
Scaling Ride-Hailing with Machine Learning on MLflowScaling Ride-Hailing with Machine Learning on MLflow
Scaling Ride-Hailing with Machine Learning on MLflow
 
[DSC Europe 22] Engineers guide for shepherding models in to production - Mar...
[DSC Europe 22] Engineers guide for shepherding models in to production - Mar...[DSC Europe 22] Engineers guide for shepherding models in to production - Mar...
[DSC Europe 22] Engineers guide for shepherding models in to production - Mar...
 
Mohamed Sabri: Operationalize machine learning with Kubeflow
Mohamed Sabri: Operationalize machine learning with KubeflowMohamed Sabri: Operationalize machine learning with Kubeflow
Mohamed Sabri: Operationalize machine learning with Kubeflow
 
Mohamed Sabri: Operationalize machine learning with Kubeflow
Mohamed Sabri: Operationalize machine learning with KubeflowMohamed Sabri: Operationalize machine learning with Kubeflow
Mohamed Sabri: Operationalize machine learning with Kubeflow
 
Microservices patterns
Microservices patternsMicroservices patterns
Microservices patterns
 
Serverless machine learning architectures at Helixa
Serverless machine learning architectures at HelixaServerless machine learning architectures at Helixa
Serverless machine learning architectures at Helixa
 
C2_W1---.pdf
C2_W1---.pdfC2_W1---.pdf
C2_W1---.pdf
 
Machine learning at scale - Webinar By zekeLabs
Machine learning at scale - Webinar By zekeLabsMachine learning at scale - Webinar By zekeLabs
Machine learning at scale - Webinar By zekeLabs
 
World Artificial Intelligence Conference Shanghai 2018
World Artificial Intelligence Conference Shanghai 2018World Artificial Intelligence Conference Shanghai 2018
World Artificial Intelligence Conference Shanghai 2018
 
GenAI at UBER: Scaling Infrastructure
GenAI at UBER: Scaling InfrastructureGenAI at UBER: Scaling Infrastructure
GenAI at UBER: Scaling Infrastructure
 

More from Karthik Murugesan

Rakuten - Recommendation Platform
Rakuten - Recommendation PlatformRakuten - Recommendation Platform
Rakuten - Recommendation PlatformKarthik Murugesan
 
Yahoo's Knowledge Graph - 2014 slides
Yahoo's Knowledge Graph - 2014 slidesYahoo's Knowledge Graph - 2014 slides
Yahoo's Knowledge Graph - 2014 slidesKarthik Murugesan
 
Free servers to build Big Data Systems on: Bing's Approach
Free servers to build Big Data Systems on: Bing's  Approach Free servers to build Big Data Systems on: Bing's  Approach
Free servers to build Big Data Systems on: Bing's Approach Karthik Murugesan
 
Microsoft AI Platform - AETHER Introduction
Microsoft AI Platform - AETHER IntroductionMicrosoft AI Platform - AETHER Introduction
Microsoft AI Platform - AETHER IntroductionKarthik Murugesan
 
BIng NLP Expert - Dl summer-school-2017.-jianfeng-gao.v2
BIng NLP Expert - Dl summer-school-2017.-jianfeng-gao.v2BIng NLP Expert - Dl summer-school-2017.-jianfeng-gao.v2
BIng NLP Expert - Dl summer-school-2017.-jianfeng-gao.v2Karthik Murugesan
 
Lyft data Platform - 2019 slides
Lyft data Platform - 2019 slidesLyft data Platform - 2019 slides
Lyft data Platform - 2019 slidesKarthik Murugesan
 
The Evolution of Spotify Home Architecture - Qcon 2019
The Evolution of Spotify Home Architecture - Qcon 2019The Evolution of Spotify Home Architecture - Qcon 2019
The Evolution of Spotify Home Architecture - Qcon 2019Karthik Murugesan
 
Unifying Twitter around a single ML platform - Twitter AI Platform 2019
Unifying Twitter around a single ML platform  - Twitter AI Platform 2019Unifying Twitter around a single ML platform  - Twitter AI Platform 2019
Unifying Twitter around a single ML platform - Twitter AI Platform 2019Karthik Murugesan
 
The magic behind your Lyft ride prices: A case study on machine learning and ...
The magic behind your Lyft ride prices: A case study on machine learning and ...The magic behind your Lyft ride prices: A case study on machine learning and ...
The magic behind your Lyft ride prices: A case study on machine learning and ...Karthik Murugesan
 
The journey toward a self-service data platform at Netflix - sf 2019
The journey toward a self-service data platform at Netflix - sf 2019The journey toward a self-service data platform at Netflix - sf 2019
The journey toward a self-service data platform at Netflix - sf 2019Karthik Murugesan
 
Developing a ML model using TF Estimator
Developing a ML model using TF EstimatorDeveloping a ML model using TF Estimator
Developing a ML model using TF EstimatorKarthik Murugesan
 
Production Model Deployment - StitchFix - 2018
Production Model Deployment - StitchFix - 2018Production Model Deployment - StitchFix - 2018
Production Model Deployment - StitchFix - 2018Karthik Murugesan
 
Netflix factstore for recommendations - 2018
Netflix factstore  for recommendations - 2018Netflix factstore  for recommendations - 2018
Netflix factstore for recommendations - 2018Karthik Murugesan
 
Trends in Music Recommendations 2018
Trends in Music Recommendations 2018Trends in Music Recommendations 2018
Trends in Music Recommendations 2018Karthik Murugesan
 
Netflix Ads Personalization Solution - 2017
Netflix Ads Personalization Solution - 2017Netflix Ads Personalization Solution - 2017
Netflix Ads Personalization Solution - 2017Karthik Murugesan
 
Spotify Machine Learning Solution for Music Discovery
Spotify Machine Learning Solution for Music DiscoverySpotify Machine Learning Solution for Music Discovery
Spotify Machine Learning Solution for Music DiscoveryKarthik Murugesan
 
AirBNB - Zipline: Airbnb’s Machine Learning Data Management Platform
AirBNB - Zipline: Airbnb’s Machine Learning Data Management Platform AirBNB - Zipline: Airbnb’s Machine Learning Data Management Platform
AirBNB - Zipline: Airbnb’s Machine Learning Data Management Platform Karthik Murugesan
 
Uber - Building Intelligent Applications, Experimental ML with Uber’s Data Sc...
Uber - Building Intelligent Applications, Experimental ML with Uber’s Data Sc...Uber - Building Intelligent Applications, Experimental ML with Uber’s Data Sc...
Uber - Building Intelligent Applications, Experimental ML with Uber’s Data Sc...Karthik Murugesan
 

More from Karthik Murugesan (20)

Rakuten - Recommendation Platform
Rakuten - Recommendation PlatformRakuten - Recommendation Platform
Rakuten - Recommendation Platform
 
Yahoo's Knowledge Graph - 2014 slides
Yahoo's Knowledge Graph - 2014 slidesYahoo's Knowledge Graph - 2014 slides
Yahoo's Knowledge Graph - 2014 slides
 
Free servers to build Big Data Systems on: Bing's Approach
Free servers to build Big Data Systems on: Bing's  Approach Free servers to build Big Data Systems on: Bing's  Approach
Free servers to build Big Data Systems on: Bing's Approach
 
Microsoft cosmos
Microsoft cosmosMicrosoft cosmos
Microsoft cosmos
 
Microsoft AI Platform - AETHER Introduction
Microsoft AI Platform - AETHER IntroductionMicrosoft AI Platform - AETHER Introduction
Microsoft AI Platform - AETHER Introduction
 
BIng NLP Expert - Dl summer-school-2017.-jianfeng-gao.v2
BIng NLP Expert - Dl summer-school-2017.-jianfeng-gao.v2BIng NLP Expert - Dl summer-school-2017.-jianfeng-gao.v2
BIng NLP Expert - Dl summer-school-2017.-jianfeng-gao.v2
 
Lyft data Platform - 2019 slides
Lyft data Platform - 2019 slidesLyft data Platform - 2019 slides
Lyft data Platform - 2019 slides
 
The Evolution of Spotify Home Architecture - Qcon 2019
The Evolution of Spotify Home Architecture - Qcon 2019The Evolution of Spotify Home Architecture - Qcon 2019
The Evolution of Spotify Home Architecture - Qcon 2019
 
Unifying Twitter around a single ML platform - Twitter AI Platform 2019
Unifying Twitter around a single ML platform  - Twitter AI Platform 2019Unifying Twitter around a single ML platform  - Twitter AI Platform 2019
Unifying Twitter around a single ML platform - Twitter AI Platform 2019
 
The magic behind your Lyft ride prices: A case study on machine learning and ...
The magic behind your Lyft ride prices: A case study on machine learning and ...The magic behind your Lyft ride prices: A case study on machine learning and ...
The magic behind your Lyft ride prices: A case study on machine learning and ...
 
The journey toward a self-service data platform at Netflix - sf 2019
The journey toward a self-service data platform at Netflix - sf 2019The journey toward a self-service data platform at Netflix - sf 2019
The journey toward a self-service data platform at Netflix - sf 2019
 
Developing a ML model using TF Estimator
Developing a ML model using TF EstimatorDeveloping a ML model using TF Estimator
Developing a ML model using TF Estimator
 
Production Model Deployment - StitchFix - 2018
Production Model Deployment - StitchFix - 2018Production Model Deployment - StitchFix - 2018
Production Model Deployment - StitchFix - 2018
 
Netflix factstore for recommendations - 2018
Netflix factstore  for recommendations - 2018Netflix factstore  for recommendations - 2018
Netflix factstore for recommendations - 2018
 
Trends in Music Recommendations 2018
Trends in Music Recommendations 2018Trends in Music Recommendations 2018
Trends in Music Recommendations 2018
 
Netflix Ads Personalization Solution - 2017
Netflix Ads Personalization Solution - 2017Netflix Ads Personalization Solution - 2017
Netflix Ads Personalization Solution - 2017
 
State Of AI 2018
State Of AI 2018State Of AI 2018
State Of AI 2018
 
Spotify Machine Learning Solution for Music Discovery
Spotify Machine Learning Solution for Music DiscoverySpotify Machine Learning Solution for Music Discovery
Spotify Machine Learning Solution for Music Discovery
 
AirBNB - Zipline: Airbnb’s Machine Learning Data Management Platform
AirBNB - Zipline: Airbnb’s Machine Learning Data Management Platform AirBNB - Zipline: Airbnb’s Machine Learning Data Management Platform
AirBNB - Zipline: Airbnb’s Machine Learning Data Management Platform
 
Uber - Building Intelligent Applications, Experimental ML with Uber’s Data Sc...
Uber - Building Intelligent Applications, Experimental ML with Uber’s Data Sc...Uber - Building Intelligent Applications, Experimental ML with Uber’s Data Sc...
Uber - Building Intelligent Applications, Experimental ML with Uber’s Data Sc...
 

Recently uploaded

SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024Scott Keck-Warren
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitecturePixlogix Infotech
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksSoftradix Technologies
 
Artificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraArtificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraDeakin University
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxOnBoard
 
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024BookNet Canada
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...shyamraj55
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Patryk Bandurski
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 

Recently uploaded (20)

SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC Architecture
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other Frameworks
 
Artificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraArtificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning era
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptx
 
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 

Michelangelo - Machine Learning Platform - 2018

  • 1. Michelangelo Jeremy Hermann, Machine Learning Platform @ Uber
  • 2. MISSION Enable engineers and data scientists across the company to easily build and deploy machine learning solutions at scale.
  • 3. AGENDA ○ ML at Uber ○ Why Build an ML Platform? ○ Key Platform Components ○ System Architecture ○ ML as Software Engineering ○ What’s Next?
  • 5. ML at Uber ○ Uber Eats ○ ETAs ○ Autonomous Cars ○ Customer Support ○ Dispatch ○ Personalization ○ Demand Modeling ○ Dynamic Pricing ○ Forecasting ○ Maps ○ Fraud ○ Destination Predictions ○ Anomaly Detection ○ Capacity Planning ○ And many more...
  • 6. ML at Uber - ETAs ○ ETAs are core to customer experience and used by myriad internal systems ○ ETA are generated by route-based algorithm called Garafu ○ Garafu is often incorrect - but it’s incorrect in predictable ways ○ ML model predicts the Garafu error ○ Use the predicted error to correct the ETA ○ ETAs now dramatically more accurate
  • 7. ML at Uber - Eats ○ Models used for ○ Ranking of restaurants and dishes ○ Delivery times ○ Search ranking ○ 100s of ML models called to render Eats homepage
  • 8. ML at Uber - Autonomous Cars
  • 9. ML at Uber - Dispatch ○ Optimize matching of rider and driver ○ Predict if open rider app will make trip request
  • 10. ML at Uber - Map Making
  • 11. ML at Uber - Map Making
  • 12. ML at Uber - Map Making
  • 13. ML at Uber - Destination Prediction
  • 14. ML at Uber - Spatiotemporal Forecasting Supply ○ Available Drivers Demand ○ Open Apps Other ○ Request Times ○ Arrival Times ○ Airport Demand
  • 15. ML at Uber - Customer Support ○ 5 customer-agent communication channels ○ Hundreds of thousands of tickets surfacing daily on the platform across 400+ cities ○ NLP models classify tickets and suggest response templates ○ Reduce ticket resolution time by 10%+ with same or higher CSAT
  • 16. Why build an ML platform?
  • 17. Early challenges with machine learning ○ Limited scale with Python and R ○ Pipelines not reliable or reproducible ○ Many one-off production systems for serving Goals of platform ○ Standardize workflows and tools ○ Provide scalable support for end-to-end ML workflow ○ Democratize and accelerate machine learning through ease of use Motivation for Platform
  • 18. Same basic ML workflow & system requirements for ○ Traditional ML & deep learning ○ Supervised, unsupervised, & semi-supervised learning ○ Online or continuous learning ○ Batch, online, & mobile deployments ○ Time-series forecasting Machine Learning Workflow MANAGE DATA TRAIN MODELS EVALUATE MODELS DEPLOY MODELS MAKE PREDICTIONS MONITOR PREDICTIONS
  • 20. Key Components: Feature Store & Feature Engineering
  • 21. Problem ○ Hardest part of ML is finding good features ○ Same features are often used by different models built by different teams Solution ○ Centralized feature store for collecting and sharing features ○ Platform team curates core set of widely applicable features ○ Modellers contribute more features as part of ongoing model building ○ Meta-data for each feature to track ownership, how computed, where used, etc ○ Modellers select features by name & join key. Offline & online pipelines auto-configured Feature Store (aka Palette)
  • 22. DSL for Feature Engineering Batch Training Job (Spark) Training Algo DSL Batch Prediction Job (Spark) Trained Model DSL Pure function expressions for ○ Feature selection ○ Feature transformations (for derived & composite features) Standard set of accessor functions for ○ Feature store ○ Basis features ○ Column stats (min, max, mean, std-dev, etc) Standard transformation functions + UDFs Examples ○ @palette:store:orders:prep_time_avg_1week:rs_uuid ○ nFill(@basis:distance, mean(@basis:distance))
  • 23. Pipeline for Offline Training with Feature Store SPARK or SQL FEATURE DSL TRAINING ALGO RAW DATA BASIS FEATURES TRANSFORMED FEATURES MODEL HIVE FEATURE STORE FEATURE STORE FEATURES HIVE DATA LAKE
  • 24. Pipeline for Online Serving with Feature Store FEATURE DSL SERVABLE MODEL BASIS FEATURES TRANSFORMED FEATURES CASSANDRA FEATURE STORE FEATURE STORE FEATURES CLIENT SERVICE
  • 26. Large-scale distributed training (billions of samples) ○ Decision trees ○ Linear and logistic models ○ Unsupervised learning ○ Time series forecasting ○ Hyperparameter search for all model types Smart pipeline management to balance speed and reliability ○ Fuse operators into single job for speed ○ Break operators into separate jobs to reliability Distributed Training of Non-DL Models
  • 27. ○ Data-parallelism works best when model is small enough to fit on each GPU ○ Ring-allreduce is more efficient than parameter servers for averaging weights ○ Faster training and better GPU utilization ○ Much simpler training scripts ○ More details at http://eng.uber.com/horovod Distributed Training of Deep Learning Models with Horovod
  • 29. Problem ○ Often want to train a model per city or per product ○ Hard to train and deploy 100s or 1000s of individual models Solution ○ Let users define hierarchical partitioning scheme ○ Automatically train model per partition ○ Manage and deploy as single logical model Partitioned Models
  • 32. GLOBAL COUNTRY CITY Keep same split and partition for each level3
  • 33. M M M M M M M M M GLOBAL COUNTRY CITY Train model for every node4
  • 34. M M M M M M M M GLOBAL COUNTRY CITY Prune bad models5
  • 35. M M M M M M M M GLOBAL COUNTRY CITY At serving time, route to best model for each node6
  • 37. Evaluate Models Problem ○ It takes many iterations to produce a good model ○ Keeping track of how a model was built is important ○ Evaluating and comparing models is hard With every trained model, we capture standard metadata and reports ○ Full model configuration, including train and test datasets ○ Training job metrics ○ Model accuracy metrics ○ Performance of model after deployment
  • 38. Model Visualization - Regression Model
  • 39. Model Visualization - Classification Model
  • 40. Model Visualization - Feature Report
  • 41. Model Visualization - Decision Tree
  • 43. Prediction Service ○ Thrift service container for one or more models ○ Scale out in Docker on Mesos ○ Single or multi-tenant deployments ○ Connection management and batched / parallelized queries to Cassandra ○ Monitoring & alerting Deployment ○ Model & DSL packaged as JAR file ○ One click deploy across DCs via standard Uber deployment infrastructure ○ Health checks and rollback Online Prediction Service
  • 44. Realtime Predict Service Deployed ModelDeployed Model Client Service Deployed Model Model Cassandra Feature StoreRouting Infra DSLModel Manager1 2 3 4 5 Online Prediction Service
  • 45. Online Prediction Service Typical p95 latency from client service ○ ~5ms when all features from client service ○ ~10ms when joining pre-computed features from Cassandra Peak prediction volume across current online deployments ○ 600k+ QPS
  • 46. Problem ○ Prediction service can serve as many models as will fit into memory ○ Easy to run out of memory with large deployments of complex models Solution ○ Organize serving cluster into number of physical shards ○ Introduce client facing concept of ‘virtual shard’ that is specified at deploy time ○ Virtual shards are mapped by system to physical shards ○ Models are loaded by service instances in the correct physical shard(s) ○ Gateway service routes to correct physical shard based on request header Sharded Deployment
  • 47. Client Service Predict Service Predict Service Predict Service Predict Service Predict Service A B C D E F G H I Unsharded Deployment
  • 48. Client Service Gateway Routing Table Predict Service Predict Service Predict Service Predict Service Predict Service (Shard 2) E F G H I Predict Service Predict Service Predict Service Predict Service Predict Service (Shard 1) A B C D Sharded Deployment
  • 50. Problem ○ Multiple models per container (entirely different or multiple versions of same) ○ Support experimentation ○ Support automated retrain / redeploy ○ Cumbersome to have client service manage routing Solution ○ Models deployed to 'label' ○ Labels can be used for experimentation or different use cases ○ Predict service routes request to most recent model w/ specified label ○ Labels have schema so deploys won't break Deployment Labels
  • 51. Key Components: Live Model Performance Monitoring
  • 52. Monitor Predictions Problem ○ Models trained and evaluated against historical data ○ Need to ensure deployed model is making good predictions going forward Solution ○ Log predictions & join to actual outcomes ○ Publish metrics feature and prediction distributions over time ○ Dashboards and alerts
  • 54.
  • 55.
  • 56.
  • 57.
  • 58.
  • 61. What’s Next? ○ Python ML for ease of use and broader algorithm support ○ Notebook-centered model building workflow ○ Online / continuous learning ○ AutoML to automate more of the modelling work
  • 63. Proprietary and confidential © 2017 Uber Technologies, Inc. All rights reserved. No part of this document may be reproduced or utilized in any form or by any means, electronic or mechanical, including photocopying, recording, or by any information storage or retrieval systems, without permission in writing from Uber. This document is intended only for the use of the individual or entity to whom it is addressed and contains information that is privileged, confidential or otherwise exempt from disclosure under applicable law. All recipients of this document are notified that the information contained herein includes proprietary and confidential information of Uber, and recipient may not make use of, disseminate, or in any way disclose this document or any of the enclosed information to any person other than employees of addressee to the extent necessary for consultations with authorized personnel of Uber.