This document summarizes Ahmed Kamal's presentation on scaling machine learning at Careem. Careem uses machine learning to enhance customer and driver experiences, ensure platform integrity through fraud detection and anomaly detection, and improve demand and supply forecasting. To scale ML usage across Careem's operations, the company built a machine learning platform that automates the end-to-end ML workflow, from problem formulation to model deployment. The platform reduces costs, speeds up development, and has increased the impact of ML more than doubling the data science team would have. Automating tasks like dataset generation, model training, and serving APIs allows more models to be deployed and lets non-experts benefit from ML.
2. Who?
Ahmed Kamal
- Tech Lead @ Machine Learning Platform, Careem
- Computer Engineer by training
I blog @ ahmedkamal.me
Find me on Twitter @_akamal_
Presenting the work of my team and other awesome colleagues @ Careem
6. To simplify and improve the lives of people…
...and build an awesome organisation that inspires
9. Sneak Peek into ML @ Careem
● Enhance Customer and Captain Experience
○ ETAs & Accurate Prices
○ Cancellations & Captain Acceptance
● Platform Integrity
○ Fraud Prevention and Detection
○ Anomaly Detection
● Ensure efficiency of our two-sided marketplace
○ Demand and Supply forecasting
○ Smart Dispatching & Peak
10. Building an AI Ecosystem
Scalable AI Ecosystem:
● ML Infra: Scalable machine learning platforms
● Data Warehouses: Well governed, trustworthy and documented data
● Big Data Capabilities: Easy and reliable access to large volumes of data
● Know How: AI-aware colleagues
11. ML Workflow - Challenges of ML at Scale
[Chart: expectation vs. reality of time spent (%) across the ML workflow stages]
● Formulate the problem
● Data selection and feature engineering
● ML model development
● Model deployment, integration and monitoring
13. ML Development Challenges
● Prepare Data: Large amounts of data to process.
● Train a Model: Training is very costly and takes a long time; reproducibility is hard.
● Transfer to Prod: The development environment mismatches the production environment.
● Deploy: Models need to run in production with:
  - Low Latency & High Throughput
  - Monitoring & Alerting
  - Fault Tolerance & Auto Scaling
19. Post Deployment Challenges
● Which cities are ready for ML?
● Performance monitoring and alerting
● Continuously refresh and update deployed models
● A/B testing between new/old or new/new models
● Too many APIs? Integration headache
● An additional 100 models?
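The A/B-testing challenge above is often solved with deterministic hash-based bucketing, so the same entity always sees the same model variant. A minimal sketch of that idea; the function name and the choice of MD5 are illustrative assumptions, not Careem's implementation:

```python
import hashlib

def assign_variant(entity_id: str, experiment: str, new_share: float = 0.1) -> str:
    """Deterministically bucket an entity (e.g. a booking id) into the
    'new' or 'old' model variant. Hashing (experiment, entity_id) means
    the same id always lands in the same bucket for a given experiment."""
    digest = hashlib.md5(f"{experiment}:{entity_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF  # uniform value in [0, 1]
    return "new" if bucket < new_share else "old"
```

Because assignment is stateless, any serving instance can compute it without a shared store, which matters once there are hundreds of deployed models.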
21. Tackling the ML Life Cycle
ML lifecycle:
Formulate the problem → Select data and feature engineering → Train and test models → Deploy model to production → Monitor and improve
25. Batch Serving
- Batch predictions generated offline for a target dataset.
- Store predictions in different data-stores.
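The batch-serving pattern above can be sketched as a chunked scoring job that writes predictions out to a store. Everything here is an illustrative assumption, not Careem's pipeline: `run_batch_predictions` is a made-up helper, and a CSV file stands in for a real data warehouse table:

```python
import csv

def run_batch_predictions(model, rows, out_path, chunk_size=1000):
    """Score an offline dataset in chunks and write (uuid, prediction)
    pairs to a CSV acting as the target data-store."""
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["uuid", "prediction"])
        chunk = []
        for row in rows:
            chunk.append(row)
            if len(chunk) == chunk_size:
                _score_chunk(model, chunk, writer)
                chunk = []
        if chunk:  # flush the final partial chunk
            _score_chunk(model, chunk, writer)

def _score_chunk(model, chunk, writer):
    predictions = model.predict([r["features"] for r in chunk])
    for row, pred in zip(chunk, predictions):
        writer.writerow([row["uuid"], pred])
```

Chunking keeps memory bounded when the target dataset is large, which is the "lots of data to process" challenge from the earlier slide.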
26. Realtime Serving
- Config Based Modular Serving Framework
- Inject Custom Feature Engineering Logic
- Access to external data & A/B testing support
- Prediction and performance logging
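A config-based modular serving framework with injectable feature-engineering logic could look roughly like this. All names (`FEATURE_TRANSFORMS`, `ModelServer`, the `haversine_eta` transform) are assumptions for illustration, not Careem's code:

```python
# Registry of custom feature-engineering functions, selectable by config key.
FEATURE_TRANSFORMS = {}

def feature_transform(name):
    """Decorator that registers a feature-engineering hook under a config key."""
    def wrap(fn):
        FEATURE_TRANSFORMS[name] = fn
        return fn
    return wrap

@feature_transform("eta_deltas")
def eta_features(payload):
    # Injected custom logic: turn raw coordinates into model inputs.
    return [payload["captain_lat"] - payload["booking_lat"],
            payload["captain_long"] - payload["booking_long"]]

class ModelServer:
    """Assembles a serving endpoint from a config: which model to load
    and which registered feature transform to apply."""
    def __init__(self, config, model):
        self.transform = FEATURE_TRANSFORMS[config["feature_transform"]]
        self.model = model

    def predict(self, payload):
        features = self.transform(payload)
        prediction = self.model.predict([features])[0]
        # In a real service, prediction/performance logging would happen here.
        return {"uuid": payload["uuid"], "prediction": prediction}
```

The registry is what makes the framework modular: a data scientist ships only a transform function and a config entry, and the serving shell stays the same across models.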
27. One Click Deployment
- Configuration-based deployment service.
- Latency, integration and API tests.
- Auto-rollout capabilities to smooth out the model update experience.
[Diagram: Configs → Production-Level API. One Click!]
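A deployment config for such a one-click flow might carry the model identity, artifact location, and rollout/test gates, validated before anything ships. The field names and the S3-style path below are hypothetical, chosen only to illustrate the shape:

```python
# Hypothetical one-click deployment config; field names are assumptions.
DEPLOY_CONFIG = {
    "model_name": "eta",
    "model_version": "v1",
    "artifact_path": "s3://models/eta/v1",          # assumed artifact location
    "latency_budget_ms": 50,                        # gate for the latency test
    "rollout": {"strategy": "canary", "initial_traffic": 0.05},
}

REQUIRED_FIELDS = ("model_name", "model_version", "artifact_path")

def validate_config(cfg):
    """Fail fast before deployment if the config is missing required fields."""
    missing = [f for f in REQUIRED_FIELDS if f not in cfg]
    if missing:
        raise ValueError(f"deploy config missing fields: {missing}")
    return True
```

Validating configs up front is what lets "one click" be safe: the deployment service can run the latency and integration tests only after it knows the config is complete.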
28. A Glued Dynamic System
Data Generation → Model Training → Model Serving
29. Now you have an API
URL => eta-service.careem.com/v1/101/eta
Request =>
[{"uuid": "9fdsaf9as9da9sd9", "assignment_time": "2019-03-14 14:09:46", "captain_lat": 30.0039, "captain_long": 31.1422, "booking_lat": 30.0022, "booking_long": 31.1405}]
Response =>
{
  "response": [
    {"uuid": "9fdsaf9as9da9sd9", "prediction": 2.5}
  ],
  "result": "ok"
}
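A client consuming this response shape only needs to check the status flag and map uuids to predictions. A minimal parser following the JSON schema shown on the slide (the function name is illustrative, and the HTTP transport is left out):

```python
import json

def parse_eta_response(body: str) -> dict:
    """Parse the eta-service response body into {uuid: prediction},
    raising if the service did not report result == 'ok'."""
    data = json.loads(body)
    if data.get("result") != "ok":
        raise RuntimeError(f"eta-service returned {data.get('result')!r}")
    return {item["uuid"]: item["prediction"] for item in data["response"]}
```

Because every model deployed through the platform exposes the same envelope, one parser like this serves all of them, which is how the "too many APIs" integration headache is avoided.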
30. Impact
Productivity
- Dataset Generation: Reduced the time needed by a DE from 2 weeks to 12 minutes per use-case.
- One-Click Serving API Deployment, training pipelines and auto-rollout: Reduced the time needed by a DS from days to minutes per model.
- Model Reports (visualizations + metrics over time): Saving hours of DS time; reduced analysis and evaluation time from hours to minutes.
Infra Cost
- Serverless Job Training: Up to 90% saving on training cost.
Overall: More impact than doubling the size of our DS team. We are able to have more models in production, with a much higher compounded impact.
33. From Scaling Infra to Scaling Usage: AI For Everyone
● Auto Machine Learning
● Custom AI Powered Toolings
● Generic Time Series Forecasting
● Supply Forecasting
● Demand Forecasting
● Campaign Management System
● Customer Care
34. AutoML
- Enables experimenting with ML in a short time.
- Lowers the barrier to using ML for lots of people.
Capabilities:
- Auto Feature Selection/Engineering
- Auto Hyperparameter Tuning
- Auto Model Selection
- Auto Model Ensembling
- Auto Rollout
- Rich Feature Store
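The auto model selection and hyperparameter tuning capabilities above boil down to searching a candidate space and keeping the best scorer. A toy sketch of that core loop, under the assumption of a simple exhaustive search (real AutoML systems search far larger spaces with smarter strategies):

```python
def auto_select(candidates, score_fn):
    """Pick the best model from (name, model) pairs by a validation score.

    candidates: iterable of (name, model) pairs, e.g. different model
                families or hyperparameter settings.
    score_fn:   model -> validation score, higher is better.
    """
    best_name, best_model, best_score = None, None, float("-inf")
    for name, model in candidates:
        score = score_fn(model)
        if score > best_score:
            best_name, best_model, best_score = name, model, score
    return best_name, best_model, best_score
```

Wrapping this loop behind a platform API is what lowers the barrier: a non-expert supplies data and a metric, and the system handles the search.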
35. Learnings from the journey
- ML development is cross-functional work.
- Heavy investment in automation is the key to scaling ML.
- Design with different user segments in mind.