PyParis2017 / Machine learning to moderate classifieds, by Vaibhav Singh

PyParis 2017
http://pyparis.org

Published in: Technology

  1. Machine Learning to Moderate Classifieds. Vaibhav Singh, Machine Learning Scientist, Content Moderation & Quality, OLX
  2. Agenda ➔ Scale and Problem ➔ Feature Generation ➔ Model Generation Pipeline ➔ Model Performance ➔ Architecture ➔ Model Validation and Management
  3. Scale of business at OLX ➔ 4.4 app rating ➔ #1 app in +22 countries (1) ➔ People spend more than twice as long in OLX apps versus competitors ➔ became one of the top 3 classifieds apps in the US less than a year after its launch ➔ 130 countries ➔ +60 million monthly listings ➔ +18 million monthly sellers ➔ +52 million cars are listed every year on our platforms; 77% of the total number of cars manufactured! ➔ +160,000 properties are listed daily ➔ At OLX, every second sees the listing of: 2 houses, 2 cars, 3 fashion items, 2.5 mobile phones ➔ (1) Google Play Store; shopping/lifestyle categories. Note: excludes Letgo. Associates at proportionate share
  4. Problems with User-Posted Ads ● Changing the title and description in a paid category to avoid buying another ad post ● Duplicating ads to get a higher ranking and a better chance of selling ● Putting phone numbers and company information on the image rather than in the description ● Creating multiple accounts to bypass the free-ad-per-user limit ● Trying to sell forbidden items with a title and description that may evade keyword filters
  5. Feature Engineering ➔ “Feature engineering is the process of transforming raw data into features that better represent the underlying problem to the predictive models, resulting in improved model accuracy on unseen data”
  6. Data Leakage ➔ Remove obvious fields, e.g. id, account numbers ➔ Remove variance and standardize ➔ Cross-validation ➔ Add noise
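One common way leakage creeps in through the "standardize" and "cross-validation" steps is computing the mean and standard deviation on the full dataset before splitting. The sketch below is not from the talk; it is a minimal standard-library illustration, with hypothetical price values, of fitting the scaler on each training fold only and then applying it to the held-out fold.

```python
from statistics import mean, pstdev

def k_fold_indices(n_samples, n_folds):
    """Yield (train_idx, test_idx) pairs for contiguous k-fold splits."""
    fold_sizes = [n_samples // n_folds + (1 if i < n_samples % n_folds else 0)
                  for i in range(n_folds)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train_idx, test_idx
        start += size

def fit_scaler(values):
    """Learn mean and standard deviation from training values only."""
    mu = mean(values)
    sigma = pstdev(values) or 1.0  # fall back to 1.0 for constant features
    return mu, sigma

def transform(values, scaler):
    """Standardize values with statistics learned elsewhere."""
    mu, sigma = scaler
    return [(v - mu) / sigma for v in values]

# Hypothetical ad prices, including one extreme value.
prices = [10.0, 12.0, 11.0, 13.0, 500.0, 9.0]

folds = []
for train_idx, test_idx in k_fold_indices(len(prices), n_folds=3):
    # The scaler never sees the held-out fold, so no test statistics leak in.
    scaler = fit_scaler([prices[i] for i in train_idx])
    train_scaled = transform([prices[i] for i in train_idx], scaler)
    test_scaled = transform([prices[i] for i in test_idx], scaler)
    folds.append((train_idx, test_idx, scaler, train_scaled, test_scaled))
```

In the last fold the extreme price sits in the held-out data, so the scaler's mean is computed from the four ordinary prices only, which is exactly the situation a model would face at prediction time.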
  7. Feature Hashing ➔ Good when dealing with high-dimensional, sparse features (dimensionality reduction) ➔ Memory efficient ➔ Con: getting back to feature names is difficult ➔ Con: hash collisions can have negative effects
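The hashing trick described on this slide can be sketched in a few lines. This is an illustrative standard-library version, not the speaker's implementation: the function name `hash_features`, the use of MD5 and the bucket count are all assumptions made for the example. Libraries such as scikit-learn ship a production-grade `FeatureHasher` that serves the same purpose.

```python
import hashlib

def hash_features(tokens, n_buckets=16):
    """Map tokens into a fixed-length vector using the hashing trick."""
    vector = [0.0] * n_buckets
    for token in tokens:
        digest = hashlib.md5(token.encode("utf-8")).digest()
        # The first 8 bytes pick the bucket; no vocabulary is ever stored.
        index = int.from_bytes(digest[:8], "big") % n_buckets
        # One further bit picks a sign, which softens the effect of collisions.
        sign = 1.0 if digest[8] & 1 else -1.0
        vector[index] += sign
    return vector

# Hypothetical ad title, tokenized on whitespace.
vector = hash_features("cheap iphone 7 like new".split())
```

Both cons from the slide are visible here: the vector has no record of which token landed in which bucket, and two different tokens can share a bucket when `n_buckets` is small.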
  8. SVM Light Data Format ➔ Memory efficient: features can be created on one machine and do not require huge clusters ➔ Con: the number of features is unknown
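For readers unfamiliar with the format, each line holds a label followed by sparse `index:value` pairs, and zero-valued features are simply omitted. The helpers below are a minimal sketch with hypothetical names (`to_svmlight_line`, `parse_svmlight_line`); they assume integer labels and 1-based feature indices, and they are not the code used in the talk.

```python
def to_svmlight_line(label, features):
    """Serialize one sample; `features` maps a 1-based feature index to a value."""
    pairs = " ".join(
        f"{idx}:{val:g}" for idx, val in sorted(features.items()) if val != 0
    )
    return f"{label} {pairs}".rstrip()

def parse_svmlight_line(line):
    """Parse one line back into (label, {index: value})."""
    label, *pairs = line.split()
    features = {}
    for pair in pairs:
        idx, val = pair.split(":")
        features[int(idx)] = float(val)
    return int(label), features

# Two hypothetical samples; zeros are dropped on write.
lines = [
    to_svmlight_line(1, {3: 0.5, 1: 2.0, 7: 0}),
    to_svmlight_line(0, {12: 1.0}),
]

# The file never states the feature count, so a reader has to infer it
# from the largest index it happens to see.
n_features_seen = max(
    idx for line in lines for idx in parse_svmlight_line(line)[1]
)
```

The last statement illustrates the con on the slide: the width of the feature space is only known after scanning the data, and a later file may contain a larger index.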
  9. Lessons Learnt ➔ Choose your tech based on data size; do not go for hype-driven development ➔ Spend time on feature generation and selection ➔ Increase relevance and minimize redundancy ➔ Use the same feature generation pipeline for both training and prediction
  10. Model Generation Pipeline
  11. Lessons Learnt ➔ Automate and make things deterministic ➔ Airflow, Luigi and many others are good choices for job dependency management
  12. Measuring Classifier Performance ➔ Accuracy is not always the best metric ➔ Precision-recall (PR) is good for measuring classifier performance ➔ ROC can be used for general classifier performance ➔ Choose one evaluation metric
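A small worked example shows why accuracy can mislead on imbalanced moderation data. The numbers below are invented for illustration (95 acceptable ads, 5 bad ads, and a classifier that approves everything); they are not results from the talk.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary labels (1 = bad ad)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn

def precision_recall_accuracy(y_true, y_pred):
    """Return (precision, recall, accuracy); undefined ratios are reported as 0.0."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    accuracy = (tp + tn) / len(y_true)
    return precision, recall, accuracy

# Hypothetical imbalanced sample: the classifier never flags anything.
y_true = [0] * 95 + [1] * 5
y_pred = [0] * 100

precision, recall, accuracy = precision_recall_accuracy(y_true, y_pred)
print(precision, recall, accuracy)  # prints: 0.0 0.0 0.95
```

The classifier scores 95% accuracy while catching none of the bad ads, which is why precision and recall are the more informative pair when the positive class is rare.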
  13. Architecture (diagram) ➔ Flask API ➔ Queue ➔ Prediction Module ➔ Mongo ➔ Monitoring & Stats: Graphite, Grafana ➔ Learning Module: Scikit, XGBoost, Luigi ➔ Flows: Ask Prediction, Return Prediction, Learning Ads
  14. Lessons Learnt ➔ Always batch: batching reduces CPU utilization, so the same machines can handle many more requests ➔ Modularize, Dockerize and orchestrate: containerize your code so that it does not depend on machine configuration ➔ Monitoring: use a monitoring service ➔ Choose simple and easy tech
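The "always batch" advice can be sketched as grouping queued ads and calling the model once per group instead of once per ad. Everything in this example is hypothetical: `predict_batch` is a toy stand-in for a real model's vectorized predict call, and the keyword rule inside it exists only to make the sketch runnable.

```python
from itertools import islice

def batched(items, batch_size):
    """Yield successive lists of at most `batch_size` items."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch

def predict_batch(ads):
    """Hypothetical stand-in for a model's vectorized predict call."""
    return [1 if "replica" in ad.lower() else 0 for ad in ads]

def moderate(ads, batch_size=3):
    """Score all ads, returning the predictions and the number of model calls."""
    predictions = []
    calls = 0
    for batch in batched(ads, batch_size):
        predictions.extend(predict_batch(batch))
        calls += 1
    return predictions, calls
```

With seven ads and a batch size of three, `moderate` makes three model calls rather than seven; the fixed per-call overhead (feature extraction setup, serialization, model dispatch) is paid once per batch.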
  15. Validating Models ➔ Sample predictions and manually verify ➔ Measure error rate ➔ Modify thresholds to achieve desired error rate
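The three validation steps can be sketched as a small threshold search over a manually verified sample. The slide does not define "error rate", so this example assumes one reading: the share of flagged ads (score at or above the threshold) that reviewers judged acceptable. The scores, labels and function names are made up for illustration.

```python
def error_rate_at(threshold, scores, labels):
    """Share of flagged ads (score >= threshold) that reviewers judged acceptable."""
    flagged = [label for score, label in zip(scores, labels) if score >= threshold]
    if not flagged:
        return 0.0
    return sum(1 for label in flagged if label == 0) / len(flagged)

def pick_threshold(scores, labels, max_error_rate):
    """Return the lowest candidate threshold whose error rate meets the target."""
    for threshold in sorted(set(scores)):
        if error_rate_at(threshold, scores, labels) <= max_error_rate:
            return threshold
    return None

# Hypothetical manually verified sample: 1 = reviewer confirmed a bad ad.
scores = [0.95, 0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
labels = [1, 1, 1, 0, 1, 0, 0]

threshold = pick_threshold(scores, labels, max_error_rate=0.2)
print(threshold)  # prints: 0.6
```

Choosing the lowest threshold that satisfies the target keeps as many ads as possible under automatic moderation; tightening `max_error_rate` to 0.1 on the same sample moves the threshold up to 0.8.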
  16. Model Management
