PyParis2017 / Machine learning to moderate classifieds, by Vaibhav Singh

PyParis 2017
http://pyparis.org

  1. Machine Learning to moderate Classifieds
     Vaibhav Singh, Machine Learning Scientist
     Content Moderation & Quality, OLX
  2. Agenda
     ➔ Scale and Problem
     ➔ Feature generation
     ➔ Model Generation Pipeline
     ➔ Model Performance
     ➔ Architecture
     ➔ Model Validation and Management
  3. Scale of business at OLX
     ➔ 4.4 app rating
     ➔ #1 app in +22 countries (1)
     ➔ People spend more than twice as long in OLX apps versus competitors
     ➔ became one of the top 3 classifieds apps in the US less than a year after its launch
     ➔ 130 countries
     ➔ +60 million monthly listings
     ➔ +18 million monthly sellers
     ➔ +52 million cars are listed every year on our platforms; 77% of the total amount of cars manufactured!
     ➔ +160,000 properties are listed daily
     ➔ At OLX, every second are listed: 2 houses, 2 cars, 3 fashion items, 2.5 mobile phones
     (1) Google Play store; shopping/lifestyle categories. Note: excludes Letgo. Associates at proportionate share.
  4. Problem with User Posted Ads
     ● Change title and description in a paid category so that they don’t need to buy another ad post.
     ● Duplicate ads to get higher ranking and a higher chance of selling
     ● Add phone numbers and company information on the image rather than in the description
     ● Create multiple accounts to bypass the free-ad-per-user limit
     ● Try to sell forbidden items with a title and description that may evade keyword filters
  5. Feature Engineering
     “Feature engineering is the process of transforming raw data into features that better represent the underlying problem to the predictive models, resulting in improved model accuracy on unseen data”
  6. Data Leakage
     ➔ Remove obvious fields, e.g. id, account numbers
     ➔ Remove variance and standardize
     ➔ Cross Validation
     ➔ Add Noise
  7. Feature hashing
     ➔ Good when dealing with high-dimensional, sparse features (dimensionality reduction)
     ➔ Memory efficient
     ➔ Cons: getting back to feature names is difficult
     ➔ Cons: hash collisions can have negative effects
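The hashing trick from this slide can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the code used in the talk: the function name `hash_features`, the choice of MD5 and the signed-bucket scheme are assumptions made for the example. scikit-learn ships a `FeatureHasher` class for real workloads.

```python
import hashlib


def hash_features(tokens, n_buckets=16):
    """Map string tokens to a sparse {bucket_index: value} dict of fixed width.

    Hypothetical helper for illustration only.
    """
    vec = {}
    for tok in tokens:
        digest = hashlib.md5(tok.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        idx = h % n_buckets  # the bucket is all we keep; the token name is lost
        # The top bit of the hash picks a sign, so colliding tokens
        # tend to cancel out instead of always adding up.
        sign = 1.0 if (h >> 63) & 1 == 0 else -1.0
        vec[idx] = vec.get(idx, 0.0) + sign
    return vec
```

The output width is fixed by `n_buckets` regardless of vocabulary size, which is where the memory saving comes from. Both drawbacks listed on the slide are visible here: nothing maps a bucket back to its token, and two different tokens may share a bucket.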
  8. SVM Light Data Format
     ➔ Memory efficient. Features can be created on one machine and do not require huge clusters
     ➔ Cons: number of features is unknown
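For readers unfamiliar with the format, each SVMLight line holds a label followed by `index:value` pairs for the non-zero features only. The sketch below writes and parses such lines with plain Python. The helper names are hypothetical, and whether indices start at 0 or 1 is a convention the reader and writer must agree on. scikit-learn provides `dump_svmlight_file` and `load_svmlight_file` for production use.

```python
def to_svmlight_line(label, features):
    """Serialize a sparse {index: value} dict as one SVMLight line."""
    parts = [str(label)]
    for idx in sorted(features):  # the format expects ascending indices
        value = features[idx]
        if value != 0:  # zeros are simply omitted, which keeps files small
            parts.append(f"{idx}:{value:g}")
    return " ".join(parts)


def parse_svmlight_line(line):
    """Parse one SVMLight line back into (label, {index: value})."""
    label, *pairs = line.split()
    features = {}
    for pair in pairs:
        idx, value = pair.split(":")
        features[int(idx)] = float(value)
    return float(label), features
```

Because a line stores only the indices that are present, the total number of features cannot be read off a single row. It has to be supplied separately or inferred from the largest index seen, which is the drawback noted on the slide.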
  9. Lessons Learnt
     ➔ Choose your tech based on data size. Do not go for hype-driven development
     ➔ Spend time on feature generation and selection
     ➔ Increase relevance and minimize redundancy
     ➔ Use the same feature generation pipeline for both training and prediction
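One simple way to follow the last lesson is to route both code paths through a single function. The sketch below assumes ads are dicts with a `title` key and invents three toy features. None of these names or features come from the talk.

```python
def make_features(ad):
    """Single source of truth for feature generation (hypothetical features)."""
    title = ad.get("title", "").lower()
    return {
        "title_len": len(title),
        "has_digits": int(any(ch.isdigit() for ch in title)),
        "n_words": len(title.split()),
    }


def build_training_rows(labelled_ads):
    """Training path: (ad, label) pairs -> (features, label) pairs."""
    return [(make_features(ad), label) for ad, label in labelled_ads]


def features_for_prediction(ad):
    """Prediction path: reuses exactly the same function as training."""
    return make_features(ad)
```

With this layout a change to `make_features` reaches training and serving together, so the two cannot silently drift apart.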
  10. Model Generation Pipeline
  11. Lessons Learnt
     ➔ Automate and make things deterministic
     ➔ Airflow, Luigi and many others are good choices for job dependency management
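The core idea behind job dependency managers is that each task declares what it depends on and the tool derives a valid run order. The sketch below illustrates that idea with Python's standard `graphlib` module (Python 3.9+) rather than Airflow or Luigi, and the task names are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical task graph: each task maps to the set of tasks it depends on.
DEPENDENCIES = {
    "generate_features": {"extract_ads"},
    "train_model": {"generate_features"},
    "evaluate_model": {"train_model"},
}


def run_order(dependencies):
    """Return the tasks in an order that respects every declared dependency."""
    return list(TopologicalSorter(dependencies).static_order())
```

Calling `run_order(DEPENDENCIES)` yields the extraction step first and the evaluation step last. Real schedulers add retries, scheduling and tracking of completed outputs on top of this ordering.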
  12. Measuring Classifier Performance
     ➔ Accuracy is not always the best metric
     ➔ PR (precision-recall) is good for measuring classifier performance
     ➔ Can use ROC for general classifier performance
     ➔ Choose one evaluation metric
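A small worked example shows why accuracy can mislead when bad ads are rare. The data below is made up for illustration (5 bad ads out of 100), and the helper names are hypothetical. scikit-learn's `sklearn.metrics` module offers equivalent functions.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary labels (1 = bad ad)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


def accuracy(y_true, y_pred):
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    return (tp + tn) / len(y_true)


def precision_recall(y_true, y_pred):
    tp, fp, fn, _ = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Made-up data: 5 bad ads among 100, and a "model" that approves everything.
y_true = [1] * 5 + [0] * 95
approve_all = [0] * 100
```

Here `accuracy(y_true, approve_all)` is 0.95 even though no bad ad is caught, while `precision_recall(y_true, approve_all)` returns `(0.0, 0.0)` and exposes the problem.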
  13. Architecture
     [Diagram. Components: Flask API; Queue; Prediction Module; Mongo; Monitoring & Stats (Graphite, Grafana); Learning Module (Scikit, XGBoost, Luigi). Flows: Ask Prediction, Return Prediction, Learning Ads]
  14. Lessons Learnt
     ➔ Always batch
       Batching will reduce CPU utilization, and the same machines will be able to handle many more requests
     ➔ Modularize, Dockerize and orchestrate
       Containerize your code so that it is transparent to machine configurations
     ➔ Monitoring
       Use a monitoring service
     ➔ Choose simple and easy tech
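The batching lesson can be illustrated with an in-process queue: instead of calling the model once per ad, a worker drains up to `max_size` pending items and scores them in one call. This is a sketch under assumptions. The talk does not say which queue technology was used, and `predict_many` stands in for whatever batch prediction function the model exposes.

```python
import queue


def drain_batch(q, max_size):
    """Pull up to max_size items that are already waiting, without blocking."""
    batch = []
    while len(batch) < max_size:
        try:
            batch.append(q.get_nowait())
        except queue.Empty:
            break
    return batch


def predict_in_batches(q, predict_many, max_size):
    """Score everything currently queued, one model call per batch.

    predict_many is a hypothetical callable: list of items -> list of predictions.
    """
    results = []
    while True:
        batch = drain_batch(q, max_size)
        if not batch:
            return results
        results.extend(predict_many(batch))
```

With five queued items and `max_size=2`, the model is called three times instead of five. Larger batches reduce the number of calls further, at the cost of more waiting per item.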
  15. Validating Models
     ➔ Sample predictions and manually verify
     ➔ Measure error rate
     ➔ Modify thresholds to achieve desired error rate
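The three validation steps can be combined into a small threshold search. In this sketch the error rate is assumed to be the fraction of flagged ads that manual review found to be fine. The slide does not define it, so treat that definition, the function name and the sample data as assumptions.

```python
def pick_threshold(scores, labels, max_error_rate):
    """Return the lowest score threshold whose flagged set stays within the
    allowed error rate, or None if no threshold qualifies.

    scores: model scores for a manually verified sample
    labels: 1 if the reviewer confirmed the ad is bad, 0 otherwise
    """
    best = None
    for t in sorted(set(scores), reverse=True):
        flagged = [y for s, y in zip(scores, labels) if s >= t]
        errors = sum(1 for y in flagged if y == 0)
        if errors / len(flagged) <= max_error_rate:
            best = t  # keep going: a lower qualifying threshold flags more ads
    return best


# Made-up verified sample for illustration.
sample_scores = [0.9, 0.8, 0.7, 0.6, 0.5]
sample_labels = [1, 1, 0, 1, 0]
```

On this sample, allowing a 25% error rate gives a threshold of 0.6, while allowing no errors pushes it up to 0.8. The search deliberately returns the lowest qualifying threshold, so the classifier flags as many ads as the error budget allows.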
  16. Model Management
