Machine Learning to moderate ads in real world classified's business
by Vaibhav Singh & Jaroslaw Szymczak
Agenda
● Moderation problem
● Offline model creation
○ feature generation
○ feature selection
○ data leakage
○ the algorithm
● Model evaluation
● Going live with the product
○ is your data really big?
○ automatic model creation pipeline
○ consistent development and production environments
○ platform architecture
○ performance monitoring
50+ countries
60+ million new monthly listings
18+ million unique monthly sellers
What do moderators look for?
● Avoidance of payment
○ Selling another item in a paid listing by changing its content
○ Flooding the site with duplicate posts to increase visibility
○ Creating multiple accounts to bypass the free-ads-per-user limit
● Violation of ToS
○ Adding phone numbers or company information to the image rather than to the description or dedicated fields
○ Trying to sell forbidden items, very often with a title and description that try to evade keyword filters
● Miscategorized listings
○ Item is placed in the wrong category
○ Item comes from a legitimate business but is marked as coming from an individual
○ ‘Seek’ problem in job offers
Offline model creation
Feature
engineering...
… and selection
Feature selection:
● necessary for some algorithms, less so for others
● most important features
● avoiding leakage
Feature generation - one-hot-encoding
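As a minimal sketch of one-hot encoding (the category values are hypothetical, and this is plain Python rather than whatever tooling the talk used), each distinct value gets its own binary column:

```python
def one_hot_encode(values):
    """Map each categorical value to a binary indicator vector."""
    # A sorted vocabulary gives a stable column order across runs.
    vocabulary = sorted(set(values))
    index = {value: i for i, value in enumerate(vocabulary)}
    rows = []
    for value in values:
        row = [0] * len(vocabulary)
        row[index[value]] = 1
        rows.append(row)
    return vocabulary, rows


# Hypothetical ad categories, used only for illustration.
vocabulary, rows = one_hot_encode(["cars", "jobs", "cars", "pets"])
```

The number of columns grows with the number of distinct values, which is what motivates hashing for high-cardinality fields.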
Feature generation - feature hashing
Feature hashing
➔ Good for high-dimensional, sparse features: it acts as dimensionality reduction
➔ Memory efficient
➔ Cons - Mapping back to feature names is difficult
➔ Cons - Hash collisions can have negative effects
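The hashing trick can be sketched in a few lines. This is an illustration under assumptions, not the talk's implementation: the token strings and bucket count are made up, MD5 is used only because Python's built-in `hash()` for strings changes between processes, and the signed update is one common variant intended to soften the effect of collisions.

```python
import hashlib


def hash_features(tokens, n_buckets=16):
    """Project string features into a fixed-size vector (the hashing trick)."""
    vector = [0.0] * n_buckets
    for token in tokens:
        digest = hashlib.md5(token.encode("utf-8")).digest()
        # The first four bytes choose the bucket.
        bucket = int.from_bytes(digest[:4], "little") % n_buckets
        # One further bit chooses the sign, so colliding tokens can cancel
        # instead of always piling up in the same direction.
        sign = 1.0 if digest[4] & 1 else -1.0
        vector[bucket] += sign
    return vector


# Hypothetical "field=value" tokens for a single ad.
vec = hash_features(["title=iphone", "category=phones", "city=berlin"], n_buckets=8)
```

The output width is fixed by `n_buckets` regardless of vocabulary size, and nothing in `vec` records which token landed in which bucket, which is the "getting back to feature names" cost listed above.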
Data Leakage
➔ Remove obvious fields, e.g. IDs, account numbers
➔ Check the feature importances for anything unusual
➔ Keep a hold-out set that is excluded from any processing that uses the target variable
➔ Closely monitor live performance
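One way to read the hold-out advice: split first, then fit anything that looks at the label on the training part only. The sketch below uses target (mean-label) encoding as the label-dependent step. That choice, the function names, and the toy data are assumptions made for illustration, not something the slides specify.

```python
import random
from collections import defaultdict


def split_holdout(rows, holdout_fraction=0.2, seed=0):
    """Shuffle and split before any target-dependent processing."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_holdout = round(len(shuffled) * holdout_fraction)
    return shuffled[n_holdout:], shuffled[:n_holdout]


def fit_target_encoding(train_rows):
    """Mean label per category, computed from training rows only."""
    sums, counts = defaultdict(float), defaultdict(int)
    for category, label in train_rows:
        sums[category] += label
        counts[category] += 1
    prior = sum(label for _, label in train_rows) / len(train_rows)
    encoding = {category: sums[category] / counts[category] for category in counts}
    return encoding, prior


def apply_target_encoding(rows, encoding, prior):
    """Encode rows; categories unseen in training fall back to the prior."""
    return [(encoding.get(category, prior), label) for category, label in rows]


# Hypothetical (category, is_bad_ad) pairs.
data = [("cars", 0), ("cars", 1), ("jobs", 1), ("jobs", 1), ("pets", 0)] * 4
train, holdout = split_holdout(data)
encoding, prior = fit_target_encoding(train)
holdout_encoded = apply_target_encoding(holdout, encoding, prior)
```

Because the hold-out labels never enter `fit_target_encoding`, a score on `holdout_encoded` is not inflated by the encoding itself.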
The algorithm
Desired features:
● state-of-the-art for structured binary classification problems
● allows reducing variance errors (overfitting)
● allows reducing bias errors (underfitting)
● has an efficient implementation
eXtreme Gradient Boosting (XGBoost)
Source: https://www.slideshare.net/JaroslawSzymczak1/xgboost-the-algorithm-that-wins-every-competition
Model evaluation
Beyond accuracy
● ROC AUC (area under the Receiver Operating Characteristic curve):
○ can be interpreted as a concordance probability (i.e. the probability that a random positive example gets a higher score than a random negative example equals the AUC)
○ is too abstract to use as a standalone quality metric
○ does not depend on the class ratio
● PRC AUC (area under the Precision-Recall Curve):
○ depends on data balance
○ is not intuitively interpretable
● Precision @ fixed Recall, Recall @ fixed Precision:
○ can be found using thresholding
○ depend heavily on data balance
○ best reflect the business requirements
○ and can take processing capacity into account (in which case Precision @k is actually more accurate)
● choose one, and only one, as your KPI and treat the others as constraints
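The concordance reading of ROC AUC can be checked directly by counting pairs. The sketch below is a quadratic-time illustration with made-up labels and scores. It is meant to show the definition, not to replace a library implementation.

```python
def roc_auc_concordance(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    Tied scores count as half a concordant pair.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    concordant = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                concordant += 1.0
            elif p == n:
                concordant += 0.5
    return concordant / (len(positives) * len(negatives))


# Hypothetical labels (1 = ad should be rejected) and model scores.
labels = [1, 0, 1, 0, 1]
scores = [0.9, 0.8, 0.7, 0.3, 0.2]
auc = roc_auc_concordance(labels, scores)
```

Only the ordering of positives relative to negatives enters the count, which is why the value does not change when one class is duplicated or subsampled.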
Example ROC for moderation problem
Precision-recall curve example
Precision @recall
Recall @precision
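Precision at a fixed recall can be found by sweeping the threshold down the ranked scores. The following is a minimal sketch with hypothetical data; the function name and return shape are choices made here for illustration.

```python
def precision_at_recall(labels, scores, min_recall):
    """Best precision among thresholds whose recall is at least min_recall.

    Returns (precision, threshold); items scoring >= threshold are flagged.
    """
    total_positives = sum(labels)
    if total_positives == 0:
        raise ValueError("need at least one positive example")
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    best = None
    true_positives = 0
    for i, (score, label) in enumerate(ranked, start=1):
        true_positives += label
        # A threshold cannot separate tied scores, so evaluate only at the
        # last item of each tied group.
        if i < len(ranked) and ranked[i][0] == score:
            continue
        recall = true_positives / total_positives
        precision = true_positives / i
        if recall >= min_recall and (best is None or precision > best[0]):
            best = (precision, score)
    return best


# Hypothetical labels and scores.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.95, 0.9, 0.8, 0.6, 0.4, 0.1]
loose = precision_at_recall(labels, scores, min_recall=0.66)
strict = precision_at_recall(labels, scores, min_recall=1.0)
```

Recall at a fixed precision follows the same sweep with the two roles swapped.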
Going live with the product
Is your data really big?
SVMlight data format
➔ Memory efficient. Features can be created on one machine and do not require huge clusters
➔ Cons - The number of features is not stored in the file, so store it separately
1 191:-0.44 87214:-0.44 200004:0.20 200012:1 206976:1 206983:-1 207015:1 207017:1 226201:1
1 1738:0.57 130440:-0.57 206999:0.32 207000:28 207001:6 207013:1 207015:1 207017:1 226300:1
0 2812:-0.63 34755:-0.31 206995:2.28 206997:1 206998:2 206999:0.00 207000:1 207001:28 226192:1
1 4019:0.35 206999:0.43 207000:40 207001:18 207013:1 207014:1 207016:1 226261:1
0 8903:0.37 207000:4 207001:14 207013:1 207014:1 207016:1 226262:1
1 5878:-0.27 206995:2.28 206998:1 206999:5.80 207000:1 207001:24 226187:1
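Each row above is a label followed by sparse `index:value` pairs. A small parser makes the "number of features is unknown" point concrete. This is a sketch: it assumes integer labels as in the sample rows and 1-based feature indices (the usual SVMlight convention), and `to_dense` is a hypothetical helper.

```python
def parse_svmlight_line(line):
    """Parse '<label> <index>:<value> ...' into (label, {index: value})."""
    parts = line.split()
    label = int(parts[0])
    features = {}
    for item in parts[1:]:
        index, value = item.split(":")
        features[int(index)] = float(value)
    return label, features


def to_dense(features, n_features):
    """Expand a sparse row; n_features has to be supplied from outside the file."""
    row = [0.0] * n_features
    for index, value in features.items():
        row[index - 1] = value  # assumes indices start at 1
    return row


# One of the sample rows shown above.
line = "0 8903:0.37 207000:4 207001:14 207013:1 207014:1 207016:1 226262:1"
label, features = parse_svmlight_line(line)
largest_index_in_row = max(features)
```

Here `largest_index_in_row` is 226262, while another sample row uses index 226300, so the width of the feature space cannot be inferred reliably from the rows at hand and is better stored alongside the data.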
Lessons Learnt
➔ Do not go for distributed learning if you don’t need to
➔ Choose your tech based on data size. Do not go for hype-driven development
➔ Your machine is not the limit: there’s the cloud
➔ Ask yourself: what’s the most difficult problem to scale? → People
Model Generation Pipeline
Automatic model creation pipeline
● Automation makes things deterministic
● Airflow, Luigi and many others are good choices for job dependency management
Luigi Dashboard
Luigi Task Visualizer
Lessons Learnt
➔ when you write to the output path yourself, create the output at the very end of the task
➔ you can create dependencies dynamically by yielding a task
➔ adding the workers parameter to your command parallelizes tasks that are ready to run (e.g. python run.py Task … --workers 15)
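The first lesson matters because Luigi treats a task as complete once its output exists, so a half-written file at the output path looks like a finished task. A library-agnostic way to follow the advice is to write to a temporary file and move it into place as the last step. The sketch below uses only the standard library, and `write_output_atomically` is a hypothetical helper, not part of Luigi.

```python
import os
import tempfile


def write_output_atomically(path, lines):
    """Write to a temporary file, then move it to `path` as the final step."""
    directory = os.path.dirname(os.path.abspath(path))
    # Same directory as the target, so the final move stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            for line in lines:
                handle.write(line + "\n")
        # The target path appears only after all content has been written.
        os.replace(tmp_path, path)
    except BaseException:
        # On any failure, clean up and leave the target path absent.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

If the task crashes midway, the output path never appears, so a rerun of the pipeline picks the task up again instead of skipping it.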
Consistent development and production environments
Model Serving Architecture
● Flask API
● Queue
● Prediction Module
● Mongo
● Monitoring & Stats: Graphite, Grafana
● Learning Module: Scikit, XGBoost, Luigi
● Flows: Ask Prediction, Return Prediction, Learning Ads
Image Model Serving Architecture
● AWS Kinesis Stream
● Incoming Pictures
● Hash Generation
● Country Specific Image Moderation
● General Moderation NSFW
● Tag and Category Prediction
● Mongo
● OLX Site
● S3
● Models
● GPU Clusters
● Learning Cluster: TF, Keras, MxNet
Performance monitoring
Model monitoring and management
Lessons Learnt
➔ Always batch
Batching reduces CPU utilization, so the same machines can handle many more requests
➔ Modularize, Dockerize and orchestrate
Containerize your code so that it is independent of machine configuration
➔ Monitoring
Use a monitoring service
➔ Choose simple and easy tech
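The batching lesson can be sketched with a queue that is drained in chunks, so the model is called once per batch rather than once per request. Everything here is illustrative: `predict_batch` is a stand-in scoring rule rather than a real model, and the request shape `(ad_id, feature_row)` is an assumption.

```python
import queue


def drain_batch(requests, max_batch_size):
    """Take up to max_batch_size pending requests without blocking."""
    batch = []
    while len(batch) < max_batch_size:
        try:
            batch.append(requests.get_nowait())
        except queue.Empty:
            break
    return batch


def predict_batch(feature_rows):
    """Stand-in for a real model call (hypothetical scoring rule)."""
    return [sum(row) / len(row) for row in feature_rows]


def serve_pending(requests, max_batch_size=32):
    """Answer everything in the queue using one model call per batch."""
    results = {}
    model_calls = 0
    while True:
        batch = drain_batch(requests, max_batch_size)
        if not batch:
            break
        ids = [request_id for request_id, _ in batch]
        scores = predict_batch([row for _, row in batch])
        model_calls += 1
        results.update(zip(ids, scores))
    return results, model_calls


# Ten hypothetical requests served in batches of at most four.
pending = queue.Queue()
for ad_id in range(10):
    pending.put((ad_id, [float(ad_id), 1.0]))
results, model_calls = serve_pending(pending, max_batch_size=4)
```

Ten requests are answered with three model calls instead of ten, so any fixed per-call overhead is paid less often.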
Acknowledgements
● Andrzej Prałat
● Wojciech Rybicki
Vaibhav Singh
vaibhav.singh@olx.com
Jaroslaw Szymczak
jaroslaw.szymczak@olx.com
PYDATA BERLIN 2017
July 2nd, 2017
