Automatic image moderation in classifieds, Jarosław Szymczak

Pôle Systematic Paris-Region
Automatic image moderation in classifieds
By Jaroslaw Szymczak
PYDATA PARIS @ PYPARIS 2017
June 12, 2017
Agenda
● Image moderation problem
● Brief sketch of approach
● Machine learning foundations of the solution:
○ Image features
○ Listing features (and combination of both)
● Class imbalance problem:
○ proper training
○ proper testing
○ proper evaluation
● Going live with the product:
○ consistent development and production environments
○ batch model creation
○ live application
○ performance monitoring
Image moderation problem
Scale of business at OLX
● 4.4 app rating
● #1 app in +22 countries (1)
● people spend more than twice as long in OLX apps versus competitors
● became one of the top 3 classifieds apps in the US less than a year after its launch
● 130 countries
● +60 million monthly listings
● +18 million monthly sellers
● +52 million cars are listed every year on our platforms; 77% of the total number of cars manufactured!
● +160,000 properties are listed daily
● listed at OLX every second:
○ 2 houses
○ 2 cars
○ 3 fashion items
○ 2.5 mobile phones
(1) Google Play Store; shopping/lifestyle categories
Note: excludes Letgo. Associates at proportionate share
✔ real photo of the phones
✔ selfie with a dress
✔ real shoes photo
✘ human on the picture (OLX India)
✘ stock photo (OLX Poland)
✘ contact details, e.g. "CALL 555-555-555" (all sites)
✘ NSFW (all sites)
Brief sketch of approach
Binary image classification
Image features:
● CNN fine tuning
● transfer learning
● image represented as 1D vector
Classic features:
● category of listing
● is the listing from a business or a private person?
● what is the price?
All fed to
Why not more, e.g. title, description, user history?
For pragmatic reasons: we don't want to overcomplicate the model:
● CNNs are the state of the art for image recognition
● classic features help to improve accuracy, but having too many of them would decrease the significance of the image features
Image features
Classic image features
And many others, more or less sophisticated methods of feature extraction...
Convolutional Neural Networks
Source: lecture notes to Stanford Course CS231n: http://cs231n.stanford.edu/slides/2017
Fine tuning and transfer learning
Source: lecture notes to Stanford Course CS231n: http://cs231n.stanford.edu/slides/2017
Inception network
Source: http://redcatlabs.com/2016-07-30_FifthElephant-DeepLearning-Workshop/
Inception 21k
Trained on 21,841 classes of the ImageNet set
Top-1 accuracy above 37%
Available for mxnet:
https://github.com/dmlc/mxnet-model-gallery/blob/master/imagenet-21k-inception.md
VGG16 network
Source: https://www.cs.toronto.edu/~frossard/post/vgg16/
● the model used comes from Keras
● easy to freeze arbitrary layers (layer.trainable = False)
Listing features
With eXtreme Gradient Boosting (XGBoost)
Feature preparation
After encoding, the “classic features” are concatenated with the image features
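The concatenation step can be sketched in plain Python. This is only an illustration under stated assumptions: the category vocabulary, the field names (`category`, `is_business`, `price`) and the three-element image vector are hypothetical, and one-hot encoding is one possible choice of encoding, not necessarily the one used in the talk.

```python
from typing import Dict, List, Sequence


def one_hot(value: str, vocabulary: Sequence[str]) -> List[float]:
    """Return a one-hot vector; values outside the vocabulary map to all zeros."""
    return [1.0 if value == item else 0.0 for item in vocabulary]


def build_feature_vector(image_vector: Sequence[float],
                         listing: Dict[str, object],
                         categories: Sequence[str]) -> List[float]:
    """Encode the classic listing features and append them to the image vector."""
    classic = one_hot(str(listing["category"]), categories)
    classic.append(1.0 if listing["is_business"] else 0.0)
    classic.append(float(listing["price"]))
    return list(image_vector) + classic


# Hypothetical vocabulary and listing, for illustration only.
CATEGORIES = ["electronics", "fashion", "cars"]
image_vector = [0.12, 0.80, 0.05]  # stands in for the 1D CNN representation
listing = {"category": "fashion", "is_business": False, "price": 49.0}

features = build_feature_vector(image_vector, listing, CATEGORIES)
print(features)
```

The resulting flat vector is the kind of input a tree-based model such as XGBoost can consume directly.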
Adaptive Boosting
Gradient boosting?
● instead of updating example weights in each round, you fit the weak learner to the residuals or pseudo-residuals
● as with the learning rate in neural networks, a shrinkage parameter is used when updating the model, scaling each step taken to reduce the loss
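The two points above can be illustrated with a toy implementation. This is a minimal sketch, assuming squared-error loss (where the pseudo-residuals are simply `y - prediction`) and one-split regression stumps on a single feature; the function names are hypothetical and it is not how XGBoost is implemented.

```python
from typing import List, Tuple

Stump = Tuple[float, float, float]  # (threshold, value if x <= threshold, value otherwise)


def fit_stump(xs: List[float], residuals: List[float]) -> Stump:
    """Fit a one-split regression stump that minimises squared error."""
    best_error, best_stump = None, None
    for threshold in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right) if right else 0.0
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best_error is None or error < best_error:
            best_error, best_stump = error, (threshold, left_mean, right_mean)
    return best_stump


def predict_stump(stump: Stump, x: float) -> float:
    threshold, left_value, right_value = stump
    return left_value if x <= threshold else right_value


def gradient_boost(xs, ys, rounds=50, shrinkage=0.1):
    """Each round fits a stump to the current residuals and adds a shrunken step."""
    base = sum(ys) / len(ys)
    predictions = [base] * len(ys)
    stumps = []
    for _ in range(rounds):
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        predictions = [p + shrinkage * predict_stump(stump, x)
                       for p, x in zip(predictions, xs)]
    return base, shrinkage, stumps


def predict(model, x: float) -> float:
    base, shrinkage, stumps = model
    return base + shrinkage * sum(predict_stump(s, x) for s in stumps)


xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]
model = gradient_boost(xs, ys)
print(predict(model, 1.0), predict(model, 6.0))
```

A smaller shrinkage value needs more rounds to reach the same fit, which is the usual trade-off when tuning the two together.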
eXtreme Gradient Boosting (XGBoost)
Source: https://www.slideshare.net/JaroslawSzymczak1/xgboost-the-algorithm-that-wins-every-competition
Class imbalance problem
Class imbalance - proper training
● ways to deal with the problem:
○ undersampling majority class
○ oversampling minority class:
■ randomly
■ by creating artificial examples (SMOTE)
○ reweighting
● undersampling suits our needs the most
○ the general population of good images is hardly “hurt” by undersampling
○ given training data size limitations, we can train on more unique examples of bad images
○ we undersample so that the ratio changes from 99:1 to 9:1
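A minimal sketch of the undersampling step, assuming the good and bad examples are available as two separate lists; the function name, the identifiers and the fixed seed are hypothetical and only serve the illustration.

```python
import random
from typing import List, Sequence, Tuple


def undersample_majority(good: Sequence[str], bad: Sequence[str],
                         target_ratio: int = 9,
                         seed: int = 0) -> Tuple[List[str], List[str]]:
    """Keep every bad example and a random subset of good ones (good:bad = target_ratio:1)."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    n_good = min(len(good), target_ratio * len(bad))
    sampled_good = rng.sample(list(good), n_good)
    return sampled_good, list(bad)


# Hypothetical data with the 99:1 ratio mentioned above.
good_ids = [f"good-{i}" for i in range(9900)]
bad_ids = [f"bad-{i}" for i in range(100)]

train_good, train_bad = undersample_majority(good_ids, bad_ids)
print(len(train_good), len(train_bad))  # 900 100
```

All bad examples are kept, so the limited training budget is spent on as many unique minority examples as possible.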
Class imbalance - proper testing
● use the real-life ratio
Class imbalance - proper evaluation
● accuracy is a useless measure in such a case
● sensible measures are:
○ ROC AUC
○ PR AUC
○ Precision @ fixed Recall
○ Recall @ fixed Precision
● ROC AUC:
○ can be interpreted as a concordance probability (i.e. the AUC equals the probability that a random positive example gets a higher score than a random negative example)
○ it is, however, too abstract to use as a standalone quality metric
○ does not depend on the class ratio
● PR AUC:
○ depends on data balance
○ is not intuitively interpretable
● Precision @ fixed Recall, Recall @ fixed Precision:
○ they heavily depend on data balance
○ they best reflect the business requirements
○ and they can take processing capabilities into account (in that case Precision @ k is actually more accurate)
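Two of these measures are easy to compute by hand, which may help with the interpretation. The sketch below is an illustration only: the labels and scores are made up, the ROC AUC is computed directly as a concordance probability over all positive/negative pairs, and `precision_at_recall` assumes distinct scores.

```python
from typing import List


def roc_auc(labels: List[int], scores: List[float]) -> float:
    """Share of (positive, negative) pairs where the positive scores higher; ties count half."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    concordant = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                concordant += 1.0
            elif p == n:
                concordant += 0.5
    return concordant / (len(positives) * len(negatives))


def precision_at_recall(labels: List[int], scores: List[float],
                        min_recall: float) -> float:
    """Precision at the highest score threshold whose recall reaches min_recall."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    total_positives = sum(labels)
    true_positives = 0
    for flagged, (_, label) in enumerate(ranked, start=1):
        true_positives += label
        if true_positives / total_positives >= min_recall:
            return true_positives / flagged
    raise ValueError("min_recall cannot be reached")


# Made-up example: 1 marks a bad image, scores come from a hypothetical model.
labels = [1, 0, 1, 0, 0, 0, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.35, 0.3, 0.2, 0.1]

print(roc_auc(labels, scores))                   # 16 of 21 pairs are concordant
print(precision_at_recall(labels, scores, 0.6))  # 2 of the top 3 flagged are bad
print(precision_at_recall(labels, scores, 1.0))  # 3 of the top 7 flagged are bad
```

Adding more good images to this test set would leave the ROC AUC roughly unchanged but lower the precision figures, which is why the test set has to keep the real-life ratio.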
ROC AUC - inception-21k and vgg16
PR AUC - inception-21k
PR AUC - vgg16
Going live with the product
Consistent development and production environments
● ensure you have the drivers installed:
    nvidia-smi
● create a Docker image:
    FROM nvidia/cuda:8.0-cudnn5-devel-ubuntu16.04
    ...
    ENV BUILD_OPTS "USE_CUDA=1 USE_CUDA_PATH=/usr/local/cuda USE_CUDNN=1"
    RUN cd /home && git clone https://github.com/dmlc/mxnet.git mxnet \
        --recursive --branch v0.10.0 --depth 1 \
        && cd mxnet && make -j$(nproc) $BUILD_OPTS
    ...
    RUN pip3 install tensorflow==1.1.0
    RUN pip3 install tensorflow-gpu==1.1.0
    RUN pip3 install keras==2.0
● use the nvidia-docker-compose wrapper
Batch process using the Luigi framework
● re-usability of processing
● fully automated pipeline
● containerized with Docker
Luigi Task
Luigi Dashboard
Luigi Task Visualizer
Luigi tips
● create your output at the very end of the task
● you can dynamically create dependencies by yielding the task
● adding the workers parameter to your command parallelizes tasks that are ready to be run (e.g. python run.py Task … --workers 15)
● for straightforward workflows inheritance comes in handy:
    import luigi

    class SimpleDependencyTask(luigi.Task):
        def create_simple_dependency(self, predecessor_task_class,
                                     additional_parameters_dict=None):
            # Instantiate the predecessor with this task's parameters of the same name
            if additional_parameters_dict is None:
                additional_parameters_dict = {}
            result_dict = {k: v for k, v in self.__dict__.items()
                           if k in predecessor_task_class.get_param_names()}
            result_dict.update(additional_parameters_dict)
            return predecessor_task_class(**result_dict)

    # Dynamic dependency, yielded from inside a task's run() method:
    ads_from_one_day = yield DownloadAdsFromOneDay(self.site_code,
                                                   effective_current_date)
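The first tip matters because Luigi by default treats a task as complete once its output exists, so a half-written output file can make a failed task look finished. The idea can be sketched with the standard library alone: write to a temporary file and move it to the final path only at the very end. The helper name and file names below are hypothetical, and this is a plain-Python illustration rather than Luigi's own API.

```python
import os
import tempfile
from typing import Iterable


def write_output_atomically(path: str, lines: Iterable[str]) -> None:
    """Write to a temporary file first; the final path appears only once writing succeeded."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            for line in lines:
                handle.write(line + "\n")
        os.replace(tmp_path, path)  # atomic rename within the same directory
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)  # never leave a partial output behind
        raise


with tempfile.TemporaryDirectory() as workdir:
    output_path = os.path.join(workdir, "ads_from_one_day.csv")
    write_output_atomically(output_path, ["id,category", "1,fashion"])
    with open(output_path, encoding="utf-8") as handle:
        written = handle.read()

print(written)
```

If the task fails halfway through, no file exists at the final path and the scheduler will simply run the task again.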
Live process using Flask
● hosted in AWS
● horizontally scaled
● containerized with Docker
Live service architecture
Performance monitoring
Performance monitoring (with Grafana)
Acknowledgements
● Vaibhav Singh
● Jaydeep De
● Andrzej Prałat