SlideShare a Scribd company logo
Alexey Grigorev
Team ololobhi (Abhishek & ololo)
Data set
● ~3 mln train pairs, ~1 mln test pairs
● ~10.8 mln images (~45 gb) Target
Evaluation
metric: AUC
Title
Category_ID
Pictures
Price
Description
locationID
attrsJSON
No seller
data
Process Overview
mongo
- text comparisons
- image comparisons
- text cleaning
- image raw features
CSVs with pair features
(than train the models on it)
ItemInfo train/test
images
Preprocessing Computing features
features computed
in small batches
(4k-20k rows in batch)
Features 1
● Simple Features
○ CategoryID (plain, no OHE)
○ Number of images
○ Absolute price difference
● Simple Text Features
○ Num of Rus/Eng/Digits chars
○ Length of Title, Description
○ 2-4 ngram similarity on char level
○ Fuzzy string matches (via FuzzyWuzzy)
● Simple Picture Features
○ Channel statistics (min, mean, max, etc)
○ File size differences
○ Geometry matches
○ Num of exact matches via md5 hash
● Simple GEO Features
○ MetroID
○ LocationID
○ Euclidean distance
|10 - 1| |5 - 1| |15 - 1|
|10 - 50| |5 - 50| |15 - 50|
9 4 14 40 45 35
reshape
stats
1
50
10 5 15
Features 2: Attributes
● Regularized Jaccard of values and key=value pairs
● Number of fields both ads didn’t fill
● TF-IDF on key=value pairs
○ dot product in TF-IDF space (norm=None) was better than cosine
● Cosine in SVD of TF-IDF
Features 3: Text
● Jaccard & Cosine on digits only and on English tokens only
● Russian chars in English words
○ E.g. “о” in “iphоne” is Cyrillic
● Cosine in TF, TF-IDF, BM25, cosine of SVD of them
● Common tokens & differences:
○ Text1: “продам iphone”, Text2: “продам айфон”
○ Common: {продам}, Difference: {iphone, айфон}
○ Cosine in TF (binary), SVD of it
● Word2Vec & GloVe
○ Cosine and manhattan b/w average title vectors
○ Stats of pairwise cosines between all tokens excluding the same ones
○ Tokens from title, description, title + description, nouns only
Features 3 cont’d: Word’s Mover Distance
● “True” WMD is complex and slow
● “Poor Man’s” WMD is faster:
○ WMD(A, B): For each term in doc A take distance to closest term in doc B, sum over them
○ WMD_sym(A, B) = WMD(A, B) + WMD(B, A)
figure from https://github.com/mkusner/wmd
Features 3 cont’d: Misspellings
● Idea: same author can make same types of mistakes
○ No space after dot/comma (“продам айфон.дешево”)
○ Morphological errors
○ And others
● Represent ads as “Bag of Misspellings”
● Use Regularized Jaccard and Cosine
● Misspellings extracted with languagetool.org
● Stuff that everybody used
○ Image hashes from imagehash library and forums
○ Chi2 & Bhattacharyya on histograms (with openIMAJ)
○ SIFT keypoints + matching (with openIMAJ)
○ Structural Similarity (computed with pyssim)
● Perceptive hashes computed with imagemagick
○ hashes computed on each channel separately and on the mean channel
● Image moments: Centroids (“Ellipses” in imagemagick)
○ Centroids = centers of masses of each channel
○ Distances between image centroids in each channel
● Image moment invariants (imagemagick)
○ 7 moments, invariant to translation, scale and rotation
○ Put all 7 invariants in a vector, compute cosine and distance
Features 4: Images
Features 5: GEO
● Reverse (lat, lon) code to location
● Features like same_region, same_city, same_zip
● |zip1 - zip2|
(53.87, 27.66) (region, city, zip)
OpenStreetMap
Feature Selection
● Correlation
○ A lot of features. Many correlated ones
○ Find feature groups of 0.90 correlation
○ Keep only one of the features
● XGBoost Feature Importance
○ https://github.com/Far0n/xgbfi
○ Run xgb on a sample with 100 trees
○ Use xgbfi to extract most important features
● Combined
○ In a correlated group, choose the most important feature using xgbfi output
Most Important Features
SVM fit on common & diff tokens
Models & Ensembling
● Parameter tuning
○ Random search
● My best model: 0.939 public LB
○ XGB with depth=8 and 2.5k trees
○ Trained a few days
● Ensembling:
○ Sample group of features
○ Randomly choose the parameters
○ Build ETs and XGBs
○ Stack with Log Reg (L2 regularization with low C)
● Our final model:
○ Neural network on ETs and XGBs outputs + some selected 1st level features
Lessons Learned
● It’s important to get CV right
○ My scheme: shuffle 3 fold (leaky)
○ Couldn’t use some nice features because of it
○ CV score of the ensemble was too good
○ Result: CV via LB
○ The right one: by connected components
● A lot of features is not always good
○ Computed too many features
○ Had hard time managing to use them all
○ Had to start stacking early
That’s a graph!
https://github.com/alexeygrigorev/avito-duplicates-kaggle
Questions?

More Related Content

What's hot

Neural Networks and Deep Learning Basics
Neural Networks and Deep Learning BasicsNeural Networks and Deep Learning Basics
Neural Networks and Deep Learning Basics
Jon Lederman
 
Introduction to Graph Neural Networks: Basics and Applications - Katsuhiko Is...
Introduction to Graph Neural Networks: Basics and Applications - Katsuhiko Is...Introduction to Graph Neural Networks: Basics and Applications - Katsuhiko Is...
Introduction to Graph Neural Networks: Basics and Applications - Katsuhiko Is...
Preferred Networks
 
Lstm
LstmLstm
Monte Carlo Statistical Methods
Monte Carlo Statistical MethodsMonte Carlo Statistical Methods
Monte Carlo Statistical Methods
Christian Robert
 
Artificial Neural Network
Artificial Neural NetworkArtificial Neural Network
Artificial Neural Network
Knoldus Inc.
 
Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...
Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...
Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...
DataWorks Summit/Hadoop Summit
 
Main principles of Data Science and Machine Learning
Main principles of Data Science and Machine LearningMain principles of Data Science and Machine Learning
Main principles of Data Science and Machine Learning
Nikolay Karelin
 
Deep Learning for Natural Language Processing
Deep Learning for Natural Language ProcessingDeep Learning for Natural Language Processing
Deep Learning for Natural Language Processing
Sangwoo Mo
 
Markov decision process
Markov decision processMarkov decision process
Markov decision process
Jie-Han Chen
 
Guiding through a typical Machine Learning Pipeline
Guiding through a typical Machine Learning PipelineGuiding through a typical Machine Learning Pipeline
Guiding through a typical Machine Learning Pipeline
Michael Gerke
 
Real-time Object Detection with YOLO v5, Hands-on-Lab
Real-time Object Detection with YOLO v5, Hands-on-LabReal-time Object Detection with YOLO v5, Hands-on-Lab
Real-time Object Detection with YOLO v5, Hands-on-Lab
JongHyunKim78
 
Natural language processing and transformer models
Natural language processing and transformer modelsNatural language processing and transformer models
Natural language processing and transformer models
Ding Li
 
Hybrid neural networks for time series learning by Tian Guo, EPFL, Switzerland
Hybrid neural networks for time series learning by Tian Guo,  EPFL, SwitzerlandHybrid neural networks for time series learning by Tian Guo,  EPFL, Switzerland
Hybrid neural networks for time series learning by Tian Guo, EPFL, Switzerland
EuroIoTa
 
An introduction on normalizing flows
An introduction on normalizing flowsAn introduction on normalizing flows
An introduction on normalizing flows
Grigoris C
 
Big Data and Architectural Patterns on AWS - Pop-up Loft Tel Aviv
Big Data and Architectural Patterns on AWS - Pop-up Loft Tel AvivBig Data and Architectural Patterns on AWS - Pop-up Loft Tel Aviv
Big Data and Architectural Patterns on AWS - Pop-up Loft Tel Aviv
Amazon Web Services
 
Artificial Neural Network | Deep Neural Network Explained | Artificial Neural...
Artificial Neural Network | Deep Neural Network Explained | Artificial Neural...Artificial Neural Network | Deep Neural Network Explained | Artificial Neural...
Artificial Neural Network | Deep Neural Network Explained | Artificial Neural...
Simplilearn
 
[기초개념] Graph Convolutional Network (GCN)
[기초개념] Graph Convolutional Network (GCN)[기초개념] Graph Convolutional Network (GCN)
[기초개념] Graph Convolutional Network (GCN)
Donghyeon Kim
 
머피의 머신러닝 : Gaussian Processes
머피의 머신러닝 : Gaussian Processes머피의 머신러닝 : Gaussian Processes
머피의 머신러닝 : Gaussian ProcessesJungkyu Lee
 
Transfer learning-presentation
Transfer learning-presentationTransfer learning-presentation
Transfer learning-presentation
Bushra Jbawi
 
Optuna: A Define-by-Run Hyperparameter Optimization Framework
Optuna: A Define-by-Run Hyperparameter Optimization FrameworkOptuna: A Define-by-Run Hyperparameter Optimization Framework
Optuna: A Define-by-Run Hyperparameter Optimization Framework
Preferred Networks
 

What's hot (20)

Neural Networks and Deep Learning Basics
Neural Networks and Deep Learning BasicsNeural Networks and Deep Learning Basics
Neural Networks and Deep Learning Basics
 
Introduction to Graph Neural Networks: Basics and Applications - Katsuhiko Is...
Introduction to Graph Neural Networks: Basics and Applications - Katsuhiko Is...Introduction to Graph Neural Networks: Basics and Applications - Katsuhiko Is...
Introduction to Graph Neural Networks: Basics and Applications - Katsuhiko Is...
 
Lstm
LstmLstm
Lstm
 
Monte Carlo Statistical Methods
Monte Carlo Statistical MethodsMonte Carlo Statistical Methods
Monte Carlo Statistical Methods
 
Artificial Neural Network
Artificial Neural NetworkArtificial Neural Network
Artificial Neural Network
 
Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...
Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...
Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...
 
Main principles of Data Science and Machine Learning
Main principles of Data Science and Machine LearningMain principles of Data Science and Machine Learning
Main principles of Data Science and Machine Learning
 
Deep Learning for Natural Language Processing
Deep Learning for Natural Language ProcessingDeep Learning for Natural Language Processing
Deep Learning for Natural Language Processing
 
Markov decision process
Markov decision processMarkov decision process
Markov decision process
 
Guiding through a typical Machine Learning Pipeline
Guiding through a typical Machine Learning PipelineGuiding through a typical Machine Learning Pipeline
Guiding through a typical Machine Learning Pipeline
 
Real-time Object Detection with YOLO v5, Hands-on-Lab
Real-time Object Detection with YOLO v5, Hands-on-LabReal-time Object Detection with YOLO v5, Hands-on-Lab
Real-time Object Detection with YOLO v5, Hands-on-Lab
 
Natural language processing and transformer models
Natural language processing and transformer modelsNatural language processing and transformer models
Natural language processing and transformer models
 
Hybrid neural networks for time series learning by Tian Guo, EPFL, Switzerland
Hybrid neural networks for time series learning by Tian Guo,  EPFL, SwitzerlandHybrid neural networks for time series learning by Tian Guo,  EPFL, Switzerland
Hybrid neural networks for time series learning by Tian Guo, EPFL, Switzerland
 
An introduction on normalizing flows
An introduction on normalizing flowsAn introduction on normalizing flows
An introduction on normalizing flows
 
Big Data and Architectural Patterns on AWS - Pop-up Loft Tel Aviv
Big Data and Architectural Patterns on AWS - Pop-up Loft Tel AvivBig Data and Architectural Patterns on AWS - Pop-up Loft Tel Aviv
Big Data and Architectural Patterns on AWS - Pop-up Loft Tel Aviv
 
Artificial Neural Network | Deep Neural Network Explained | Artificial Neural...
Artificial Neural Network | Deep Neural Network Explained | Artificial Neural...Artificial Neural Network | Deep Neural Network Explained | Artificial Neural...
Artificial Neural Network | Deep Neural Network Explained | Artificial Neural...
 
[기초개념] Graph Convolutional Network (GCN)
[기초개념] Graph Convolutional Network (GCN)[기초개념] Graph Convolutional Network (GCN)
[기초개념] Graph Convolutional Network (GCN)
 
머피의 머신러닝 : Gaussian Processes
머피의 머신러닝 : Gaussian Processes머피의 머신러닝 : Gaussian Processes
머피의 머신러닝 : Gaussian Processes
 
Transfer learning-presentation
Transfer learning-presentationTransfer learning-presentation
Transfer learning-presentation
 
Optuna: A Define-by-Run Hyperparameter Optimization Framework
Optuna: A Define-by-Run Hyperparameter Optimization FrameworkOptuna: A Define-by-Run Hyperparameter Optimization Framework
Optuna: A Define-by-Run Hyperparameter Optimization Framework
 

Viewers also liked

Outbrain Click Prediction
Outbrain Click PredictionOutbrain Click Prediction
Outbrain Click Prediction
Alexey Grigorev
 
WSDM Cup 2017: Vandalism Detection
WSDM Cup 2017: Vandalism DetectionWSDM Cup 2017: Vandalism Detection
WSDM Cup 2017: Vandalism Detection
Alexey Grigorev
 
Spot Sigs
Spot SigsSpot Sigs
Spot Sigs
infoblog
 
Site Crawling: What To Do & What To Look For
Site Crawling: What To Do & What To Look ForSite Crawling: What To Do & What To Look For
Site Crawling: What To Do & What To Look For
Outspoken Media
 
novel and efficient approch for detection of duplicate pages in web crawling
novel and efficient approch for detection of duplicate pages in web crawlingnovel and efficient approch for detection of duplicate pages in web crawling
novel and efficient approch for detection of duplicate pages in web crawling
Vipin Kp
 
CIKM Cup 2016: Cross-Device Linking
CIKM Cup 2016: Cross-Device LinkingCIKM Cup 2016: Cross-Device Linking
CIKM Cup 2016: Cross-Device Linking
Alexey Grigorev
 
Daniel Shank, Data Scientist, Talla at MLconf SF 2016
Daniel Shank, Data Scientist, Talla at MLconf SF 2016Daniel Shank, Data Scientist, Talla at MLconf SF 2016
Daniel Shank, Data Scientist, Talla at MLconf SF 2016
MLconf
 

Viewers also liked (7)

Outbrain Click Prediction
Outbrain Click PredictionOutbrain Click Prediction
Outbrain Click Prediction
 
WSDM Cup 2017: Vandalism Detection
WSDM Cup 2017: Vandalism DetectionWSDM Cup 2017: Vandalism Detection
WSDM Cup 2017: Vandalism Detection
 
Spot Sigs
Spot SigsSpot Sigs
Spot Sigs
 
Site Crawling: What To Do & What To Look For
Site Crawling: What To Do & What To Look ForSite Crawling: What To Do & What To Look For
Site Crawling: What To Do & What To Look For
 
novel and efficient approch for detection of duplicate pages in web crawling
novel and efficient approch for detection of duplicate pages in web crawlingnovel and efficient approch for detection of duplicate pages in web crawling
novel and efficient approch for detection of duplicate pages in web crawling
 
CIKM Cup 2016: Cross-Device Linking
CIKM Cup 2016: Cross-Device LinkingCIKM Cup 2016: Cross-Device Linking
CIKM Cup 2016: Cross-Device Linking
 
Daniel Shank, Data Scientist, Talla at MLconf SF 2016
Daniel Shank, Data Scientist, Talla at MLconf SF 2016Daniel Shank, Data Scientist, Talla at MLconf SF 2016
Daniel Shank, Data Scientist, Talla at MLconf SF 2016
 

Similar to Avito Duplicate Ads Detection @ kaggle

Chromatic Sparse Learning
Chromatic Sparse LearningChromatic Sparse Learning
Chromatic Sparse Learning
Databricks
 
BUD17-302: LLVM Internals #2
BUD17-302: LLVM Internals #2 BUD17-302: LLVM Internals #2
BUD17-302: LLVM Internals #2
Linaro
 
A DSTL to bridge concrete and abstract syntax
A DSTL to bridge concrete and abstract syntaxA DSTL to bridge concrete and abstract syntax
A DSTL to bridge concrete and abstract syntax
University of York
 
Graph Analytics with ArangoDB
Graph Analytics with ArangoDBGraph Analytics with ArangoDB
Graph Analytics with ArangoDB
ArangoDB Database
 
Dart the better Javascript 2015
Dart the better Javascript 2015Dart the better Javascript 2015
Dart the better Javascript 2015
Jorg Janke
 
TinkerPop 2020
TinkerPop 2020TinkerPop 2020
TinkerPop 2020
Joshua Shinavier
 
Tour de Jackson: Forgotten Features of Jackson JSON processor
Tour de Jackson: Forgotten Features of Jackson JSON processorTour de Jackson: Forgotten Features of Jackson JSON processor
Tour de Jackson: Forgotten Features of Jackson JSON processor
Tatu Saloranta
 
DALL-E.pdf
DALL-E.pdfDALL-E.pdf
DALL-E.pdf
dsfajkh
 
Mender.io | Develop embedded applications faster | Comparing C and Golang
Mender.io | Develop embedded applications faster | Comparing C and GolangMender.io | Develop embedded applications faster | Comparing C and Golang
Mender.io | Develop embedded applications faster | Comparing C and Golang
Mender.io
 
Odoo Technical Concepts Summary
Odoo Technical Concepts SummaryOdoo Technical Concepts Summary
Odoo Technical Concepts Summary
Mohamed Magdy
 
Sorry - How Bieber broke Google Cloud at Spotify
Sorry - How Bieber broke Google Cloud at SpotifySorry - How Bieber broke Google Cloud at Spotify
Sorry - How Bieber broke Google Cloud at Spotify
Neville Li
 
51st solution of Avito demand prediction competition on Kaggle
51st solution of Avito demand prediction competition on Kaggle51st solution of Avito demand prediction competition on Kaggle
51st solution of Avito demand prediction competition on Kaggle
Nasuka Sumino
 
Ledingkart Meetup #2: Scaling Search @Lendingkart
Ledingkart Meetup #2: Scaling Search @LendingkartLedingkart Meetup #2: Scaling Search @Lendingkart
Ledingkart Meetup #2: Scaling Search @Lendingkart
Mukesh Singh
 
Machine Learning in News Media: Case study 24sata.hr - Marko Velic, Enes Deumic
Machine Learning in News Media: Case study 24sata.hr - Marko Velic, Enes DeumicMachine Learning in News Media: Case study 24sata.hr - Marko Velic, Enes Deumic
Machine Learning in News Media: Case study 24sata.hr - Marko Velic, Enes Deumic
Institute of Contemporary Sciences
 
Golang
GolangGolang
Golang
Felipe Mamud
 
Why go ?
Why go ?Why go ?
Why go ?
Mailjet
 
Driving Moore's Law with Python-Powered Machine Learning: An Insider's Perspe...
Driving Moore's Law with Python-Powered Machine Learning: An Insider's Perspe...Driving Moore's Law with Python-Powered Machine Learning: An Insider's Perspe...
Driving Moore's Law with Python-Powered Machine Learning: An Insider's Perspe...
PyData
 
Druid
DruidDruid
ML using MATLAB
ML using MATLABML using MATLAB
ML using MATLAB
lemonlemon20
 
spaGO: A self-contained ML & NLP library in GO
spaGO: A self-contained ML & NLP library in GOspaGO: A self-contained ML & NLP library in GO
spaGO: A self-contained ML & NLP library in GO
Matteo Grella
 

Similar to Avito Duplicate Ads Detection @ kaggle (20)

Chromatic Sparse Learning
Chromatic Sparse LearningChromatic Sparse Learning
Chromatic Sparse Learning
 
BUD17-302: LLVM Internals #2
BUD17-302: LLVM Internals #2 BUD17-302: LLVM Internals #2
BUD17-302: LLVM Internals #2
 
A DSTL to bridge concrete and abstract syntax
A DSTL to bridge concrete and abstract syntaxA DSTL to bridge concrete and abstract syntax
A DSTL to bridge concrete and abstract syntax
 
Graph Analytics with ArangoDB
Graph Analytics with ArangoDBGraph Analytics with ArangoDB
Graph Analytics with ArangoDB
 
Dart the better Javascript 2015
Dart the better Javascript 2015Dart the better Javascript 2015
Dart the better Javascript 2015
 
TinkerPop 2020
TinkerPop 2020TinkerPop 2020
TinkerPop 2020
 
Tour de Jackson: Forgotten Features of Jackson JSON processor
Tour de Jackson: Forgotten Features of Jackson JSON processorTour de Jackson: Forgotten Features of Jackson JSON processor
Tour de Jackson: Forgotten Features of Jackson JSON processor
 
DALL-E.pdf
DALL-E.pdfDALL-E.pdf
DALL-E.pdf
 
Mender.io | Develop embedded applications faster | Comparing C and Golang
Mender.io | Develop embedded applications faster | Comparing C and GolangMender.io | Develop embedded applications faster | Comparing C and Golang
Mender.io | Develop embedded applications faster | Comparing C and Golang
 
Odoo Technical Concepts Summary
Odoo Technical Concepts SummaryOdoo Technical Concepts Summary
Odoo Technical Concepts Summary
 
Sorry - How Bieber broke Google Cloud at Spotify
Sorry - How Bieber broke Google Cloud at SpotifySorry - How Bieber broke Google Cloud at Spotify
Sorry - How Bieber broke Google Cloud at Spotify
 
51st solution of Avito demand prediction competition on Kaggle
51st solution of Avito demand prediction competition on Kaggle51st solution of Avito demand prediction competition on Kaggle
51st solution of Avito demand prediction competition on Kaggle
 
Ledingkart Meetup #2: Scaling Search @Lendingkart
Ledingkart Meetup #2: Scaling Search @LendingkartLedingkart Meetup #2: Scaling Search @Lendingkart
Ledingkart Meetup #2: Scaling Search @Lendingkart
 
Machine Learning in News Media: Case study 24sata.hr - Marko Velic, Enes Deumic
Machine Learning in News Media: Case study 24sata.hr - Marko Velic, Enes DeumicMachine Learning in News Media: Case study 24sata.hr - Marko Velic, Enes Deumic
Machine Learning in News Media: Case study 24sata.hr - Marko Velic, Enes Deumic
 
Golang
GolangGolang
Golang
 
Why go ?
Why go ?Why go ?
Why go ?
 
Driving Moore's Law with Python-Powered Machine Learning: An Insider's Perspe...
Driving Moore's Law with Python-Powered Machine Learning: An Insider's Perspe...Driving Moore's Law with Python-Powered Machine Learning: An Insider's Perspe...
Driving Moore's Law with Python-Powered Machine Learning: An Insider's Perspe...
 
Druid
DruidDruid
Druid
 
ML using MATLAB
ML using MATLABML using MATLAB
ML using MATLAB
 
spaGO: A self-contained ML & NLP library in GO
spaGO: A self-contained ML & NLP library in GOspaGO: A self-contained ML & NLP library in GO
spaGO: A self-contained ML & NLP library in GO
 

More from Alexey Grigorev

MLOps week 1 intro
MLOps week 1 introMLOps week 1 intro
MLOps week 1 intro
Alexey Grigorev
 
Codementor - Data Science at OLX
Codementor - Data Science at OLX Codementor - Data Science at OLX
Codementor - Data Science at OLX
Alexey Grigorev
 
Data Monitoring with whylogs
Data Monitoring with whylogsData Monitoring with whylogs
Data Monitoring with whylogs
Alexey Grigorev
 
Data engineering zoomcamp introduction
Data engineering zoomcamp  introductionData engineering zoomcamp  introduction
Data engineering zoomcamp introduction
Alexey Grigorev
 
AI in Fashion - Size & Fit - Nour Karessli
 AI in Fashion - Size & Fit - Nour Karessli AI in Fashion - Size & Fit - Nour Karessli
AI in Fashion - Size & Fit - Nour Karessli
Alexey Grigorev
 
AI-Powered Computer Vision Applications in Media Industry - Yulia Pavlova
AI-Powered Computer Vision Applications in Media Industry - Yulia PavlovaAI-Powered Computer Vision Applications in Media Industry - Yulia Pavlova
AI-Powered Computer Vision Applications in Media Industry - Yulia Pavlova
Alexey Grigorev
 
ML Zoomcamp 10 - Kubernetes
ML Zoomcamp 10 - KubernetesML Zoomcamp 10 - Kubernetes
ML Zoomcamp 10 - Kubernetes
Alexey Grigorev
 
Paradoxes in Data Science
Paradoxes in Data ScienceParadoxes in Data Science
Paradoxes in Data Science
Alexey Grigorev
 
ML Zoomcamp 8 - Neural networks and deep learning
ML Zoomcamp 8 - Neural networks and deep learningML Zoomcamp 8 - Neural networks and deep learning
ML Zoomcamp 8 - Neural networks and deep learning
Alexey Grigorev
 
Algorithmic fairness
Algorithmic fairnessAlgorithmic fairness
Algorithmic fairness
Alexey Grigorev
 
MLOps at OLX
MLOps at OLXMLOps at OLX
MLOps at OLX
Alexey Grigorev
 
ML Zoomcamp 6 - Decision Trees and Ensemble Learning
ML Zoomcamp 6 - Decision Trees and Ensemble LearningML Zoomcamp 6 - Decision Trees and Ensemble Learning
ML Zoomcamp 6 - Decision Trees and Ensemble Learning
Alexey Grigorev
 
ML Zoomcamp 5 - Model deployment
ML Zoomcamp 5 - Model deploymentML Zoomcamp 5 - Model deployment
ML Zoomcamp 5 - Model deployment
Alexey Grigorev
 
Introduction to Transformers for NLP - Olga Petrova
Introduction to Transformers for NLP - Olga PetrovaIntroduction to Transformers for NLP - Olga Petrova
Introduction to Transformers for NLP - Olga Petrova
Alexey Grigorev
 
ML Zoomcamp 4 - Evaluation Metrics for Classification
ML Zoomcamp 4 - Evaluation Metrics for ClassificationML Zoomcamp 4 - Evaluation Metrics for Classification
ML Zoomcamp 4 - Evaluation Metrics for Classification
Alexey Grigorev
 
ML Zoomcamp 3 - Machine Learning for Classification
ML Zoomcamp 3 - Machine Learning for ClassificationML Zoomcamp 3 - Machine Learning for Classification
ML Zoomcamp 3 - Machine Learning for Classification
Alexey Grigorev
 
ML Zoomcamp Week #2 Office Hours
ML Zoomcamp Week #2 Office HoursML Zoomcamp Week #2 Office Hours
ML Zoomcamp Week #2 Office Hours
Alexey Grigorev
 
AMLD2021 - ML in online marketplaces
AMLD2021 - ML in online marketplacesAMLD2021 - ML in online marketplaces
AMLD2021 - ML in online marketplaces
Alexey Grigorev
 
ML Zoomcamp 2 - Slides
ML Zoomcamp 2 - SlidesML Zoomcamp 2 - Slides
ML Zoomcamp 2 - Slides
Alexey Grigorev
 
ML Zoomcamp 2.1 - Car Price Prediction Project
ML Zoomcamp 2.1 - Car Price Prediction ProjectML Zoomcamp 2.1 - Car Price Prediction Project
ML Zoomcamp 2.1 - Car Price Prediction Project
Alexey Grigorev
 

More from Alexey Grigorev (20)

MLOps week 1 intro
MLOps week 1 introMLOps week 1 intro
MLOps week 1 intro
 
Codementor - Data Science at OLX
Codementor - Data Science at OLX Codementor - Data Science at OLX
Codementor - Data Science at OLX
 
Data Monitoring with whylogs
Data Monitoring with whylogsData Monitoring with whylogs
Data Monitoring with whylogs
 
Data engineering zoomcamp introduction
Data engineering zoomcamp  introductionData engineering zoomcamp  introduction
Data engineering zoomcamp introduction
 
AI in Fashion - Size & Fit - Nour Karessli
 AI in Fashion - Size & Fit - Nour Karessli AI in Fashion - Size & Fit - Nour Karessli
AI in Fashion - Size & Fit - Nour Karessli
 
AI-Powered Computer Vision Applications in Media Industry - Yulia Pavlova
AI-Powered Computer Vision Applications in Media Industry - Yulia PavlovaAI-Powered Computer Vision Applications in Media Industry - Yulia Pavlova
AI-Powered Computer Vision Applications in Media Industry - Yulia Pavlova
 
ML Zoomcamp 10 - Kubernetes
ML Zoomcamp 10 - KubernetesML Zoomcamp 10 - Kubernetes
ML Zoomcamp 10 - Kubernetes
 
Paradoxes in Data Science
Paradoxes in Data ScienceParadoxes in Data Science
Paradoxes in Data Science
 
ML Zoomcamp 8 - Neural networks and deep learning
ML Zoomcamp 8 - Neural networks and deep learningML Zoomcamp 8 - Neural networks and deep learning
ML Zoomcamp 8 - Neural networks and deep learning
 
Algorithmic fairness
Algorithmic fairnessAlgorithmic fairness
Algorithmic fairness
 
MLOps at OLX
MLOps at OLXMLOps at OLX
MLOps at OLX
 
ML Zoomcamp 6 - Decision Trees and Ensemble Learning
ML Zoomcamp 6 - Decision Trees and Ensemble LearningML Zoomcamp 6 - Decision Trees and Ensemble Learning
ML Zoomcamp 6 - Decision Trees and Ensemble Learning
 
ML Zoomcamp 5 - Model deployment
ML Zoomcamp 5 - Model deploymentML Zoomcamp 5 - Model deployment
ML Zoomcamp 5 - Model deployment
 
Introduction to Transformers for NLP - Olga Petrova
Introduction to Transformers for NLP - Olga PetrovaIntroduction to Transformers for NLP - Olga Petrova
Introduction to Transformers for NLP - Olga Petrova
 
ML Zoomcamp 4 - Evaluation Metrics for Classification
ML Zoomcamp 4 - Evaluation Metrics for ClassificationML Zoomcamp 4 - Evaluation Metrics for Classification
ML Zoomcamp 4 - Evaluation Metrics for Classification
 
ML Zoomcamp 3 - Machine Learning for Classification
ML Zoomcamp 3 - Machine Learning for ClassificationML Zoomcamp 3 - Machine Learning for Classification
ML Zoomcamp 3 - Machine Learning for Classification
 
ML Zoomcamp Week #2 Office Hours
ML Zoomcamp Week #2 Office HoursML Zoomcamp Week #2 Office Hours
ML Zoomcamp Week #2 Office Hours
 
AMLD2021 - ML in online marketplaces
AMLD2021 - ML in online marketplacesAMLD2021 - ML in online marketplaces
AMLD2021 - ML in online marketplaces
 
ML Zoomcamp 2 - Slides
ML Zoomcamp 2 - SlidesML Zoomcamp 2 - Slides
ML Zoomcamp 2 - Slides
 
ML Zoomcamp 2.1 - Car Price Prediction Project
ML Zoomcamp 2.1 - Car Price Prediction ProjectML Zoomcamp 2.1 - Car Price Prediction Project
ML Zoomcamp 2.1 - Car Price Prediction Project
 

Recently uploaded

一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理
一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理
一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理
nuttdpt
 
End-to-end pipeline agility - Berlin Buzzwords 2024
End-to-end pipeline agility - Berlin Buzzwords 2024End-to-end pipeline agility - Berlin Buzzwords 2024
End-to-end pipeline agility - Berlin Buzzwords 2024
Lars Albertsson
 
My burning issue is homelessness K.C.M.O.
My burning issue is homelessness K.C.M.O.My burning issue is homelessness K.C.M.O.
My burning issue is homelessness K.C.M.O.
rwarrenll
 
Intelligence supported media monitoring in veterinary medicine
Intelligence supported media monitoring in veterinary medicineIntelligence supported media monitoring in veterinary medicine
Intelligence supported media monitoring in veterinary medicine
AndrzejJarynowski
 
一比一原版(UofS毕业证书)萨省大学毕业证如何办理
一比一原版(UofS毕业证书)萨省大学毕业证如何办理一比一原版(UofS毕业证书)萨省大学毕业证如何办理
一比一原版(UofS毕业证书)萨省大学毕业证如何办理
v3tuleee
 
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...
Social Samosa
 
一比一原版(Glasgow毕业证书)格拉斯哥大学毕业证如何办理
一比一原版(Glasgow毕业证书)格拉斯哥大学毕业证如何办理一比一原版(Glasgow毕业证书)格拉斯哥大学毕业证如何办理
一比一原版(Glasgow毕业证书)格拉斯哥大学毕业证如何办理
g4dpvqap0
 
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
ahzuo
 
Population Growth in Bataan: The effects of population growth around rural pl...
Population Growth in Bataan: The effects of population growth around rural pl...Population Growth in Bataan: The effects of population growth around rural pl...
Population Growth in Bataan: The effects of population growth around rural pl...
Bill641377
 
一比一原版(Harvard毕业证书)哈佛大学毕业证如何办理
一比一原版(Harvard毕业证书)哈佛大学毕业证如何办理一比一原版(Harvard毕业证书)哈佛大学毕业证如何办理
一比一原版(Harvard毕业证书)哈佛大学毕业证如何办理
zsjl4mimo
 
Enhanced Enterprise Intelligence with your personal AI Data Copilot.pdf
Enhanced Enterprise Intelligence with your personal AI Data Copilot.pdfEnhanced Enterprise Intelligence with your personal AI Data Copilot.pdf
Enhanced Enterprise Intelligence with your personal AI Data Copilot.pdf
GetInData
 
一比一原版(Dalhousie毕业证书)达尔豪斯大学毕业证如何办理
一比一原版(Dalhousie毕业证书)达尔豪斯大学毕业证如何办理一比一原版(Dalhousie毕业证书)达尔豪斯大学毕业证如何办理
一比一原版(Dalhousie毕业证书)达尔豪斯大学毕业证如何办理
mzpolocfi
 
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
slg6lamcq
 
The Building Blocks of QuestDB, a Time Series Database
The Building Blocks of QuestDB, a Time Series DatabaseThe Building Blocks of QuestDB, a Time Series Database
The Building Blocks of QuestDB, a Time Series Database
javier ramirez
 
原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
u86oixdj
 
一比一原版(UMN文凭证书)明尼苏达大学毕业证如何办理
一比一原版(UMN文凭证书)明尼苏达大学毕业证如何办理一比一原版(UMN文凭证书)明尼苏达大学毕业证如何办理
一比一原版(UMN文凭证书)明尼苏达大学毕业证如何办理
nyfuhyz
 
办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样
办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样
办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样
apvysm8
 
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
v7oacc3l
 
一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理
一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理
一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理
74nqk8xf
 
Challenges of Nation Building-1.pptx with more important
Challenges of Nation Building-1.pptx with more importantChallenges of Nation Building-1.pptx with more important
Challenges of Nation Building-1.pptx with more important
Sm321
 

Recently uploaded (20)

一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理
一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理
一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理
 
End-to-end pipeline agility - Berlin Buzzwords 2024
End-to-end pipeline agility - Berlin Buzzwords 2024End-to-end pipeline agility - Berlin Buzzwords 2024
End-to-end pipeline agility - Berlin Buzzwords 2024
 
My burning issue is homelessness K.C.M.O.
My burning issue is homelessness K.C.M.O.My burning issue is homelessness K.C.M.O.
My burning issue is homelessness K.C.M.O.
 
Intelligence supported media monitoring in veterinary medicine
Intelligence supported media monitoring in veterinary medicineIntelligence supported media monitoring in veterinary medicine
Intelligence supported media monitoring in veterinary medicine
 
一比一原版(UofS毕业证书)萨省大学毕业证如何办理
一比一原版(UofS毕业证书)萨省大学毕业证如何办理一比一原版(UofS毕业证书)萨省大学毕业证如何办理
一比一原版(UofS毕业证书)萨省大学毕业证如何办理
 
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...
 
一比一原版(Glasgow毕业证书)格拉斯哥大学毕业证如何办理
一比一原版(Glasgow毕业证书)格拉斯哥大学毕业证如何办理一比一原版(Glasgow毕业证书)格拉斯哥大学毕业证如何办理
一比一原版(Glasgow毕业证书)格拉斯哥大学毕业证如何办理
 
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
 
Population Growth in Bataan: The effects of population growth around rural pl...
Population Growth in Bataan: The effects of population growth around rural pl...Population Growth in Bataan: The effects of population growth around rural pl...
Population Growth in Bataan: The effects of population growth around rural pl...
 
一比一原版(Harvard毕业证书)哈佛大学毕业证如何办理
一比一原版(Harvard毕业证书)哈佛大学毕业证如何办理一比一原版(Harvard毕业证书)哈佛大学毕业证如何办理
一比一原版(Harvard毕业证书)哈佛大学毕业证如何办理
 
Enhanced Enterprise Intelligence with your personal AI Data Copilot.pdf
Enhanced Enterprise Intelligence with your personal AI Data Copilot.pdfEnhanced Enterprise Intelligence with your personal AI Data Copilot.pdf
Enhanced Enterprise Intelligence with your personal AI Data Copilot.pdf
 
一比一原版(Dalhousie毕业证书)达尔豪斯大学毕业证如何办理
一比一原版(Dalhousie毕业证书)达尔豪斯大学毕业证如何办理一比一原版(Dalhousie毕业证书)达尔豪斯大学毕业证如何办理
一比一原版(Dalhousie毕业证书)达尔豪斯大学毕业证如何办理
 
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
 
The Building Blocks of QuestDB, a Time Series Database
The Building Blocks of QuestDB, a Time Series DatabaseThe Building Blocks of QuestDB, a Time Series Database
The Building Blocks of QuestDB, a Time Series Database
 
原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
 
一比一原版(UMN文凭证书)明尼苏达大学毕业证如何办理
一比一原版(UMN文凭证书)明尼苏达大学毕业证如何办理一比一原版(UMN文凭证书)明尼苏达大学毕业证如何办理
一比一原版(UMN文凭证书)明尼苏达大学毕业证如何办理
 
办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样
办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样
办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样
 
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
 
一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理
一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理
一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理
 
Challenges of Nation Building-1.pptx with more important
Challenges of Nation Building-1.pptx with more importantChallenges of Nation Building-1.pptx with more important
Challenges of Nation Building-1.pptx with more important
 

Avito Duplicate Ads Detection @ kaggle

  • 1. Alexey Grigorev Team ololobhi (Abhishek & ololo)
  • 2. Data set ● ~3 mln train pairs, ~1 mln test pairs ● ~10.8 mln images (~45 gb) Target Evaluation metric: AUC
  • 4. Process Overview mongo - text comparisons - image comparisons - text cleaning - image raw features CSVs with pair features (than train the models on it) ItemInfo train/test images Preprocessing Computing features features computed in small batches (4k-20k rows in batch)
  • 5. Features 1 ● Simple Features ○ CategoryID (plain, no OHE) ○ Number of images ○ Absolute price difference ● Simple Text Features ○ Num of Rus/Eng/Digits chars ○ Length of Title, Description ○ 2-4 ngram similarity on char level ○ Fuzzy string matches (via FuzzyWuzzy) ● Simple Picture Features ○ Channel statistics (min, mean, max, etc) ○ File size differences ○ Geometry matches ○ Num of exact matches via md5 hash ● Simple GEO Features ○ MetroID ○ LocationID ○ Euclidean distance
  • 6. |10 - 1| |5 - 1| |15 - 1| |10 - 50| |5 - 50| |15 - 50| 9 4 14 40 45 35 reshape stats 1 50 10 5 15
  • 7. Features 2: Attributes ● Regularized Jaccard of values and key=value pairs ● Number of fields both ads didn’t fill ● TF-IDF on key=value pairs ○ dot product in TF-IDF space (norm=None) was better than cosine ● Cosine in SVD of TF-IDF
  • 8. Features 3: Text ● Jaccard & Cosine on digits only and on English tokens only ● Russian chars in English words ○ E.g. “о” in “iphоne” is Cyrillic ● Cosine in TF, TF-IDF, BM25, cosine of SVD of them ● Common tokens & differences: ○ Text1: “продам iphone”, Text2: “продам айфон” ○ Common: {продам}, Difference: {iphone, айфон} ○ Cosine in TF (binary), SVD of it ● Word2Vec & GloVe ○ Cosine and manhattan b/w average title vectors ○ Stats of pairwise cosines between all tokens excluding the same ones ○ Tokens from title, description, title + description, nouns only
  • 9. Features 3 cont’d: Word’s Mover Distance ● “True” WMD is complex and slow ● “Poor Man’s” WMD is faster: ○ WMD(A, B): For each term in doc A take distance to closest term in doc B, sum over them ○ WMD_sym(A, B) = WMD(A, B) + WMD(B, A) figure from https://github.com/mkusner/wmd
  • 10. Features 3 cont’d: Misspellings ● Idea: same author can make same types of mistakes ○ No space after dot/comma (“продам айфон.дешево”) ○ Morphological errors ○ And others ● Represent ads as “Bag of Misspellings” ● Use Regularized Jaccard and Cosine ● Misspellings extracted with languagetool.org
  • 11. ● Stuff that everybody used ○ Image hashes from imagehash library and forums ○ Chi2 & Bhattacharyya on histograms (with openIMAJ) ○ SIFT keypoints + matching (with openIMAJ) ○ Structural Similarity (computed with pyssim) ● Perceptive hashes computed with imagemagick ○ hashes computed on each channel separately and on the mean channel ● Image moments: Centroids (“Ellipses” in imagemagick) ○ Centroids = centers of masses of each channel ○ Distances between image centroids in each channel ● Image moment invariants (imagemagick) ○ 7 moments, invariant to translation, scale and rotation ○ Put all 7 invariants in a vector, compute cosine and distance Features 4: Images
  • 12. Features 5: GEO ● Reverse (lat, lon) code to location ● Features like same_region, same_city, same_zip ● |zip1 - zip2| (53.87, 27.66) (region, city, zip) OpenStreetMap
  • 13. Feature Selection ● Correlation ○ A lot of features. Many correlated ones ○ Find feature groups of 0.90 correlation ○ Keep only one of the features ● XGBoost Feature Importance ○ https://github.com/Far0n/xgbfi ○ Run xgb on a sample with 100 trees ○ Use xgbfi to extract most important features ● Combined ○ In a correlated group, choose the most important feature using xgbfi output
  • 14. Most Important Features SVM fit on common & diff tokens
  • 15. Models & Ensembling ● Parameter tuning ○ Random search ● My best model: 0.939 public LB ○ XGB with depth=8 and 2.5k trees ○ Trained a few days ● Ensembling: ○ Sample group of features ○ Randomly choose the parameters ○ Build ETs and XGBs ○ Stack with Log Reg (L2 regularization with low C) ● Our final model: ○ Neural network on ETs and XGBs outputs + some selected 1st level features
  • 16. Lessons Learned ● It’s important to get CV right ○ My scheme: shuffle 3 fold (leaky) ○ Couldn’t use some nice features because of it ○ CV score of the ensemble was too good ○ Result: CV via LB ○ The right one: by connected components ● A lot of features is not always good ○ Computed too many features ○ Had hard time managing to use them all ○ Had to start stacking early That’s a graph!