
Automatic image moderation in classifieds, Jarosław Szymczak


Published at PyParis 2017
http://pyparis.org

Published in: Technology

  1. Automatic image moderation in classifieds. By Jarosław Szymczak, PyData Paris @ PyParis 2017, June 12, 2017
  2. Agenda
     ● Image moderation problem
     ● Brief sketch of approach
     ● Machine learning foundations of the solution:
       ○ Image features
       ○ Listing features (and combination of both)
     ● Class imbalance problem:
       ○ proper training
       ○ proper testing
       ○ proper evaluation
     ● Going live with the product:
       ○ consistent development and production environments
       ○ batch model creation
       ○ live application
       ○ performance monitoring
  3. Image moderation problem
  4. Scale of business at OLX
     ● 4.4 app rating (Google Play store, shopping/lifestyle categories), #1 app in 22+ countries (excludes Letgo; associates at proportionate share)
     ● people spend more than twice as long in OLX apps versus competitors
     ● became one of the top 3 classifieds apps in the US less than a year after its launch
     ● 130 countries, 60+ million monthly listings, 18+ million monthly sellers
     ● 52+ million cars are listed every year on our platforms; 77% of the total number of cars manufactured!
     ● 160,000+ properties are listed daily
     ● listed at OLX every second: 2 houses, 2 cars, 3 fashion items, 2.5 mobile phones
  5. ✔ real photo of the phones
     ✔ selfie with a dress
     ✔ real shoes photo
     ✘ human in the picture (OLX India)
     ✘ stock photo (OLX Poland)
     ✘ contact details, e.g. "CALL 555-555-555" (all sites)
     ✘ NSFW (all sites)
  6. Brief sketch of approach
  7. Binary image classification
     Image features:
     ● CNN fine-tuning
     ● transfer learning
     ● image represented as a 1D vector
     Classic features:
     ● category of the listing
     ● is the listing from a business or a private person?
     ● what is the price?
     All fed to XGBoost. Why not more, e.g. title, description, user history? Out of pragmatism: we don't want to overcomplicate the model:
     ● CNNs are the state of the art for image recognition
     ● classic features help improve accuracy, but having too many of them would decrease the significance of the image features
  8. Image features
  9. Classic image features (and many other, more or less sophisticated, methods of feature extraction...)
  10. Convolutional Neural Networks. Source: lecture notes for the Stanford course CS231n: http://cs231n.stanford.edu/slides/2017
  11. Fine-tuning and transfer learning. Source: lecture notes for the Stanford course CS231n: http://cs231n.stanford.edu/slides/2017
  12. Inception network. Source: http://redcatlabs.com/2016-07-30_FifthElephant-DeepLearning-Workshop/
      Inception-21k: trained on 21,841 classes of the ImageNet set; top-1 accuracy above 37%. Available for mxnet: https://github.com/dmlc/mxnet-model-gallery/blob/master/imagenet-21k-inception.md
  13. VGG16 network. Source: https://www.cs.toronto.edu/~frossard/post/vgg16/
      ● used the model from Keras
      ● easy to freeze arbitrary layers (layer.trainable = False)
  14. Listing features, with eXtreme Gradient Boosting (XGBoost)
  15. Feature preparation. After encoding, the "classic features" are concatenated with the image ones.
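The encoding-and-concatenation step can be sketched as follows. The category list, seller types, and vector sizes below are illustrative assumptions, not the talk's actual schema:

```python
import numpy as np

# Hypothetical listing schema (illustrative only).
CATEGORIES = ["electronics", "fashion", "cars"]
SELLER_TYPES = ["private", "business"]

def encode_listing(category, seller_type, price):
    """One-hot encode the categorical fields and keep price as a single number."""
    cat_vec = [1.0 if category == c else 0.0 for c in CATEGORIES]
    seller_vec = [1.0 if seller_type == s else 0.0 for s in SELLER_TYPES]
    return np.array(cat_vec + seller_vec + [price], dtype=np.float32)

# Stand-in for the CNN embedding of the photo (an inner-layer activation
# flattened to a 1D vector); here just random numbers of a plausible size.
image_features = np.random.rand(1024).astype(np.float32)

classic_features = encode_listing("fashion", "private", 49.99)
combined = np.concatenate([classic_features, image_features])
```

The combined vector is what the gradient-boosting model consumes; the image part dominates its length, which is why the slides warn against adding too many extra classic features.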
  16. Adaptive Boosting
  17. Gradient boosting?
     ● instead of updating example weights, in each round you fit the weak learner to the residuals (or pseudo-residuals, i.e. the negative gradient of the loss)
     ● similarly to the learning rate in neural networks, a shrinkage parameter is used when updating the ensemble
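The residual-fitting loop can be sketched for squared loss with a toy weak learner. A single decision stump is used here purely for illustration; the talk's pipeline uses XGBoost's regularized trees:

```python
def fit_stump(x, residuals):
    """Pick the threshold split on x that best fits the residuals (least squares)."""
    best = None
    for t in sorted(set(x)):
        left = [r for xi, r in zip(x, residuals) if xi <= t]
        right = [r for xi, r in zip(x, residuals) if xi > t]
        if not left or not right:
            continue
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    _, t, lm, rm = best
    return lambda xi: lm if xi <= t else rm

def gradient_boost(x, y, n_rounds=20, shrinkage=0.3):
    """Each round fits a weak learner to the current residuals (the negative
    gradient of squared loss) and adds it with a shrinkage factor."""
    pred = [0.0] * len(y)
    for _ in range(n_rounds):
        residuals = [yi - pi for yi, pi in zip(y, pred)]
        stump = fit_stump(x, residuals)
        pred = [pi + shrinkage * stump(xi) for xi, pi in zip(x, pred)]
    return pred

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [1.0, 1.5, 3.5, 4.0, 6.0]
pred = gradient_boost(x, y)
mse = sum((p - t) ** 2 for p, t in zip(pred, y)) / len(y)
```

With shrinkage below 1 each learner corrects only part of the remaining error, which is what makes the ensemble robust to overfitting.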
  18. eXtreme Gradient Boosting (XGBoost). Source: https://www.slideshare.net/JaroslawSzymczak1/xgboost-the-algorithm-that-wins-every-competition
  19. Class imbalance problem
  20. Class imbalance - proper training
     ● possibilities to deal with the problem:
       ○ undersampling the majority class
       ○ oversampling the minority class:
         ■ randomly
         ■ by creating artificial examples (SMOTE)
       ○ reweighting
     ● undersampling suits our needs the most:
       ○ the general population of good images is not hurt very much by undersampling
       ○ given training data size limitations, we can train on more unique examples of bad images
       ○ we undersample in such a manner that we change the ratio from 99:1 to 9:1
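A minimal sketch of the undersampling step, assuming a binary label where 1 marks the rare "bad image" class (function and variable names are illustrative):

```python
import random

def undersample(examples, labels, target_neg_per_pos=9, seed=0):
    """Keep all positives (the rare class) and randomly keep only enough
    negatives to reach the desired negatives-per-positive ratio."""
    rng = random.Random(seed)
    pos = [(e, l) for e, l in zip(examples, labels) if l == 1]
    neg = [(e, l) for e, l in zip(examples, labels) if l == 0]
    kept_neg = rng.sample(neg, min(len(neg), target_neg_per_pos * len(pos)))
    sample = pos + kept_neg
    rng.shuffle(sample)
    return [e for e, _ in sample], [l for _, l in sample]

# 100 bad images vs 9,900 good ones: a 99:1 ratio, as on the slide.
X = list(range(10_000))
y = [1] * 100 + [0] * 9_900
Xs, ys = undersample(X, y)  # 100 positives + 900 negatives, i.e. 9:1
```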
  21. Class imbalance - proper testing: use the real-life ratio
  22. Class imbalance - proper evaluation
     ● accuracy is a useless measure in such a case
     ● sensible measures are:
       ○ ROC AUC
       ○ PR AUC
       ○ Precision @ fixed Recall
       ○ Recall @ fixed Precision
     ● ROC AUC:
       ○ can be interpreted as a concordance probability (i.e. with probability equal to the AUC, a random positive example has a higher score than a random negative example)
       ○ it is, though, too abstract to use as a standalone quality metric
       ○ does not depend on the class ratio
     ● PR AUC:
       ○ depends on the data balance
       ○ is not intuitively interpretable
     ● Precision @ fixed Recall, Recall @ fixed Precision:
       ○ heavily depend on the data balance
       ○ are the best at reflecting the business requirements
       ○ and at taking processing capabilities into account (then, actually, Precision @ k is more accurate)
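The concordance interpretation of ROC AUC, and Precision @ fixed Recall, can be made concrete with a small brute-force sketch (for illustration only; in practice a library such as scikit-learn would be used):

```python
import itertools

def roc_auc(scores, labels):
    """AUC as a concordance probability: the fraction of (positive, negative)
    pairs where the positive example gets the higher score (ties count half)."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p, n in itertools.product(pos, neg))
    return wins / (len(pos) * len(neg))

def precision_at_recall(scores, labels, min_recall):
    """Best precision achievable by any score threshold whose recall >= min_recall."""
    n_pos = sum(labels)
    best = 0.0
    for t in sorted(set(scores)):
        predicted = [s >= t for s in scores]
        tp = sum(1 for p, l in zip(predicted, labels) if p and l == 1)
        fp = sum(1 for p, l in zip(predicted, labels) if p and l == 0)
        if tp / n_pos >= min_recall and tp + fp > 0:
            best = max(best, tp / (tp + fp))
    return best

scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2]
labels = [1,   1,   0,   1,   0,   0]
```

Here 8 of the 9 positive/negative pairs are correctly ordered, so the AUC is 8/9; at 100% recall the best achievable precision is 0.75.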
  23. ROC AUC - inception-21k and vgg16
  24. PR AUC - inception-21k
  25. PR AUC - vgg16
  26. Going live with the product
  27. Consistent development and production environments
     ● ensure you have the drivers installed: nvidia-smi
     ● create a docker image:
       FROM nvidia/cuda:8.0-cudnn5-devel-ubuntu16.04
       ...
       ENV BUILD_OPTS "USE_CUDA=1 USE_CUDA_PATH=/usr/local/cuda USE_CUDNN=1"
       RUN cd /home && git clone https://github.com/dmlc/mxnet.git mxnet --recursive --branch v0.10.0 --depth 1 && cd mxnet && make -j$(nproc) $BUILD_OPTS
       ...
       RUN pip3 install tensorflow==1.1.0
       RUN pip3 install tensorflow-gpu==1.1.0
       RUN pip3 install keras==2.0
     ● use the nvidia-docker-compose wrapper
  28. Batch process using the Luigi framework
     ● reusability of processing
     ● fully automated pipeline
     ● containerized with Docker
  29. Luigi Task
  30. Luigi Dashboard
  31. Luigi Task Visualizer
  32. Luigi tips
     ● create your output at the very end of the task
     ● you can dynamically create dependencies by yielding the task:
       ads_from_one_day = yield DownloadAdsFromOneDay(self.site_code, effective_current_date)
     ● adding the workers parameter to your command parallelizes tasks that are ready to be run (e.g. python run.py Task … --workers 15)
     ● for straightforward workflows, inheritance comes in handy:

       class SimpleDependencyTask(luigi.Task):
           def create_simple_dependency(self, predecessor_task_class, additional_parameters_dict=None):
               if additional_parameters_dict is None:
                   additional_parameters_dict = {}
               result_dict = {k: v for k, v in self.__dict__.items()
                              if k in predecessor_task_class.get_param_names()}
               result_dict.update(additional_parameters_dict)
               return predecessor_task_class(**result_dict)
  33. Live process using Flask
     ● hosted in AWS
     ● horizontally scaled
     ● containerized with Docker
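A minimal sketch of what such a Flask service might look like. The route name, payload fields, and threshold are illustrative assumptions; the real service would load the trained CNN feature extractor and XGBoost model instead of the placeholder scorer:

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

THRESHOLD = 0.5  # illustrative moderation cut-off, not the production value

def score_listing(payload):
    """Placeholder for the real model: returns a moderation score for a listing.
    In production this would run feature extraction + the trained classifier."""
    return float(payload.get("suspicion", 0.0))

@app.route("/moderate", methods=["POST"])
def moderate():
    payload = request.get_json(force=True)
    score = score_listing(payload)
    return jsonify({"score": score, "flagged": score >= THRESHOLD})
```

Because each instance is stateless, containers running this app can be scaled horizontally behind a load balancer, matching the bullet points above.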
  34. Live service architecture
  35. Performance monitoring
  36. Performance monitoring (with Grafana)
  37. Acknowledgements
     ● Vaibhav Singh
     ● Jaydeep De
     ● Andrzej Prałat
     By Jarosław Szymczak, PyData Paris @ PyParis 2017, June 12, 2017
