SlideShare a Scribd company logo
1 of 18
Download to read offline
Alexey Grigorev
Team ololobhi (Abhishek & ololo)
Data set
● ~3 mln train pairs, ~1 mln test pairs
● ~10.8 mln images (~45 gb) Target
Evaluation
metric: AUC
Title
Category_ID
Pictures
Price
Description
locationID
attrsJSON
No seller
data
Process Overview
mongo
- text comparisons
- image comparisons
- text cleaning
- image raw features
CSVs with pair features
(than train the models on it)
ItemInfo train/test
images
Preprocessing Computing features
features computed
in small batches
(4k-20k rows in batch)
Features 1
● Simple Features
○ CategoryID (plain, no OHE)
○ Number of images
○ Absolute price difference
● Simple Text Features
○ Num of Rus/Eng/Digits chars
○ Length of Title, Description
○ 2-4 ngram similarity on char level
○ Fuzzy string matches (via FuzzyWuzzy)
● Simple Picture Features
○ Channel statistics (min, mean, max, etc)
○ File size differences
○ Geometry matches
○ Num of exact matches via md5 hash
● Simple GEO Features
○ MetroID
○ LocationID
○ Euclidean distance
|10 - 1| |5 - 1| |15 - 1|
|10 - 50| |5 - 50| |15 - 50|
9 4 14 40 45 35
reshape
stats
1
50
10 5 15
Features 2: Attributes
● Regularized Jaccard of values and key=value pairs
● Number of fields both ads didn’t fill
● TF-IDF on key=value pairs
○ dot product in TF-IDF space (norm=None) was better than cosine
● Cosine in SVD of TF-IDF
Features 3: Text
● Jaccard & Cosine on digits only and on English tokens only
● Russian chars in English words
○ E.g. “о” in “iphоne” is Cyrillic
● Cosine in TF, TF-IDF, BM25, cosine of SVD of them
● Common tokens & differences:
○ Text1: “продам iphone”, Text2: “продам айфон”
○ Common: {продам}, Difference: {iphone, айфон}
○ Cosine in TF (binary), SVD of it
● Word2Vec & GloVe
○ Cosine and manhattan b/w average title vectors
○ Stats of pairwise cosines between all tokens excluding the same ones
○ Tokens from title, description, title + description, nouns only
Features 3 cont’d: Word’s Mover Distance
● “True” WMD is complex and slow
● “Poor Man’s” WMD is faster:
○ WMD(A, B): For each term in doc A take distance to closest term in doc B, sum over them
○ WMD_sym(A, B) = WMD(A, B) + WMD(B, A)
figure from https://github.com/mkusner/wmd
Features 3 cont’d: Misspellings
● Idea: same author can make same types of mistakes
○ No space after dot/comma (“продам айфон.дешево”)
○ Morphological errors
○ And others
● Represent ads as “Bag of Misspellings”
● Use Regularized Jaccard and Cosine
● Misspellings extracted with languagetool.org
● Stuff that everybody used
○ Image hashes from imagehash library and forums
○ Chi2 & Bhattacharyya on histograms (with openIMAJ)
○ SIFT keypoints + matching (with openIMAJ)
○ Structural Similarity (computed with pyssim)
● Perceptive hashes computed with imagemagick
○ hashes computed on each channel separately and on the mean channel
● Image moments: Centroids (“Ellipses” in imagemagick)
○ Centroids = centers of masses of each channel
○ Distances between image centroids in each channel
● Image moment invariants (imagemagick)
○ 7 moments, invariant to translation, scale and rotation
○ Put all 7 invariants in a vector, compute cosine and distance
Features 4: Images
Features 5: GEO
● Reverse (lat, lon) code to location
● Features like same_region, same_city, same_zip
● |zip1 - zip2|
(53.87, 27.66) (region, city, zip)
OpenStreetMap
Feature Selection
● Correlation
○ A lot of features. Many correlated ones
○ Find feature groups of 0.90 correlation
○ Keep only one of the features
● XGBoost Feature Importance
○ https://github.com/Far0n/xgbfi
○ Run xgb on a sample with 100 trees
○ Use xgbfi to extract most important features
● Combined
○ In a correlated group, choose the most important feature using xgbfi output
Most Important Features
SVM fit on common & diff tokens
Models & Ensembling
● Parameter tuning
○ Random search
● My best model: 0.939 public LB
○ XGB with depth=8 and 2.5k trees
○ Trained a few days
● Ensembling:
○ Sample group of features
○ Randomly choose the parameters
○ Build ETs and XGBs
○ Stack with Log Reg (L2 regularization with low C)
● Our final model:
○ Neural network on ETs and XGBs outputs + some selected 1st level features
Lessons Learned
● It’s important to get CV right
○ My scheme: shuffle 3 fold (leaky)
○ Couldn’t use some nice features because of it
○ CV score of the ensemble was too good
○ Result: CV via LB
○ The right one: by connected components
● A lot of features is not always good
○ Computed too many features
○ Had hard time managing to use them all
○ Had to start stacking early
That’s a graph!
https://github.com/alexeygrigorev/avito-duplicates-kaggle
Questions?

More Related Content

What's hot

IndirectDraw with unity
IndirectDraw with unityIndirectDraw with unity
IndirectDraw with unityJung Suk Ko
 
Rendering Technologies from Crysis 3 (GDC 2013)
Rendering Technologies from Crysis 3 (GDC 2013)Rendering Technologies from Crysis 3 (GDC 2013)
Rendering Technologies from Crysis 3 (GDC 2013)Tiago Sousa
 
은닉 마르코프 모델, Hidden Markov Model(HMM)
은닉 마르코프 모델, Hidden Markov Model(HMM)은닉 마르코프 모델, Hidden Markov Model(HMM)
은닉 마르코프 모델, Hidden Markov Model(HMM)찬희 이
 
[NDC19] 모바일에서 사용가능한 유니티 커스텀 섭스턴스 PBR 셰이더 만들기
[NDC19] 모바일에서 사용가능한 유니티 커스텀 섭스턴스 PBR 셰이더 만들기[NDC19] 모바일에서 사용가능한 유니티 커스텀 섭스턴스 PBR 셰이더 만들기
[NDC19] 모바일에서 사용가능한 유니티 커스텀 섭스턴스 PBR 셰이더 만들기Madumpa Park
 
이무림, Enum의 Boxing을 어찌할꼬? 편리하고 성능좋게 Enum 사용하기, NDC2019
이무림, Enum의 Boxing을 어찌할꼬? 편리하고 성능좋게 Enum 사용하기, NDC2019이무림, Enum의 Boxing을 어찌할꼬? 편리하고 성능좋게 Enum 사용하기, NDC2019
이무림, Enum의 Boxing을 어찌할꼬? 편리하고 성능좋게 Enum 사용하기, NDC2019devCAT Studio, NEXON
 
[Ndc11 박민근] deferred shading
[Ndc11 박민근] deferred shading[Ndc11 박민근] deferred shading
[Ndc11 박민근] deferred shadingMinGeun Park
 
나만의 엔진 개발하기
나만의 엔진 개발하기나만의 엔진 개발하기
나만의 엔진 개발하기YEONG-CHEON YOU
 
Modeling Impression discounting in large-scale recommender systems
Modeling Impression discounting in large-scale recommender systemsModeling Impression discounting in large-scale recommender systems
Modeling Impression discounting in large-scale recommender systemsMitul Tiwari
 
[0122 구경원]게임에서의 충돌처리
[0122 구경원]게임에서의 충돌처리[0122 구경원]게임에서의 충돌처리
[0122 구경원]게임에서의 충돌처리KyeongWon Koo
 
Screen Space Reflections in The Surge
Screen Space Reflections in The SurgeScreen Space Reflections in The Surge
Screen Space Reflections in The SurgeMichele Giacalone
 
PBR 기초 이론과 사용되는 맵들 Vol.3
PBR 기초 이론과 사용되는 맵들 Vol.3PBR 기초 이론과 사용되는 맵들 Vol.3
PBR 기초 이론과 사용되는 맵들 Vol.3Jooyoung Yi
 
물리 기반 셰이더의 이해
물리 기반 셰이더의 이해물리 기반 셰이더의 이해
물리 기반 셰이더의 이해tartist
 
매쉬 베이크 마스터리
매쉬 베이크 마스터리매쉬 베이크 마스터리
매쉬 베이크 마스터리Jooyoung Yi
 
심예람, <프로젝트DH> AI 내비게이션 시스템, NDC2018
심예람, <프로젝트DH> AI 내비게이션 시스템, NDC2018심예람, <프로젝트DH> AI 내비게이션 시스템, NDC2018
심예람, <프로젝트DH> AI 내비게이션 시스템, NDC2018devCAT Studio, NEXON
 
[0107 박민근] 쉽게 배우는 hdr과 톤맵핑
[0107 박민근] 쉽게 배우는 hdr과 톤맵핑[0107 박민근] 쉽게 배우는 hdr과 톤맵핑
[0107 박민근] 쉽게 배우는 hdr과 톤맵핑MinGeun Park
 
모바일 게임 최적화
모바일 게임 최적화 모바일 게임 최적화
모바일 게임 최적화 tartist
 
Ndc12 이창희 render_pipeline
Ndc12 이창희 render_pipelineNdc12 이창희 render_pipeline
Ndc12 이창희 render_pipelinechangehee lee
 
[IGC 2016] 넷게임즈 김영희 - Unreal4를 사용해 모바일 프로젝트 제작하기
[IGC 2016] 넷게임즈 김영희 - Unreal4를 사용해 모바일 프로젝트 제작하기[IGC 2016] 넷게임즈 김영희 - Unreal4를 사용해 모바일 프로젝트 제작하기
[IGC 2016] 넷게임즈 김영희 - Unreal4를 사용해 모바일 프로젝트 제작하기강 민우
 
Killzone Shadow Fall Demo Postmortem
Killzone Shadow Fall Demo PostmortemKillzone Shadow Fall Demo Postmortem
Killzone Shadow Fall Demo PostmortemGuerrilla
 
【Unite Tokyo 2019】Unityプログレッシブライトマッパー2019
【Unite Tokyo 2019】Unityプログレッシブライトマッパー2019【Unite Tokyo 2019】Unityプログレッシブライトマッパー2019
【Unite Tokyo 2019】Unityプログレッシブライトマッパー2019UnityTechnologiesJapan002
 

What's hot (20)

IndirectDraw with unity
IndirectDraw with unityIndirectDraw with unity
IndirectDraw with unity
 
Rendering Technologies from Crysis 3 (GDC 2013)
Rendering Technologies from Crysis 3 (GDC 2013)Rendering Technologies from Crysis 3 (GDC 2013)
Rendering Technologies from Crysis 3 (GDC 2013)
 
은닉 마르코프 모델, Hidden Markov Model(HMM)
은닉 마르코프 모델, Hidden Markov Model(HMM)은닉 마르코프 모델, Hidden Markov Model(HMM)
은닉 마르코프 모델, Hidden Markov Model(HMM)
 
[NDC19] 모바일에서 사용가능한 유니티 커스텀 섭스턴스 PBR 셰이더 만들기
[NDC19] 모바일에서 사용가능한 유니티 커스텀 섭스턴스 PBR 셰이더 만들기[NDC19] 모바일에서 사용가능한 유니티 커스텀 섭스턴스 PBR 셰이더 만들기
[NDC19] 모바일에서 사용가능한 유니티 커스텀 섭스턴스 PBR 셰이더 만들기
 
이무림, Enum의 Boxing을 어찌할꼬? 편리하고 성능좋게 Enum 사용하기, NDC2019
이무림, Enum의 Boxing을 어찌할꼬? 편리하고 성능좋게 Enum 사용하기, NDC2019이무림, Enum의 Boxing을 어찌할꼬? 편리하고 성능좋게 Enum 사용하기, NDC2019
이무림, Enum의 Boxing을 어찌할꼬? 편리하고 성능좋게 Enum 사용하기, NDC2019
 
[Ndc11 박민근] deferred shading
[Ndc11 박민근] deferred shading[Ndc11 박민근] deferred shading
[Ndc11 박민근] deferred shading
 
나만의 엔진 개발하기
나만의 엔진 개발하기나만의 엔진 개발하기
나만의 엔진 개발하기
 
Modeling Impression discounting in large-scale recommender systems
Modeling Impression discounting in large-scale recommender systemsModeling Impression discounting in large-scale recommender systems
Modeling Impression discounting in large-scale recommender systems
 
[0122 구경원]게임에서의 충돌처리
[0122 구경원]게임에서의 충돌처리[0122 구경원]게임에서의 충돌처리
[0122 구경원]게임에서의 충돌처리
 
Screen Space Reflections in The Surge
Screen Space Reflections in The SurgeScreen Space Reflections in The Surge
Screen Space Reflections in The Surge
 
PBR 기초 이론과 사용되는 맵들 Vol.3
PBR 기초 이론과 사용되는 맵들 Vol.3PBR 기초 이론과 사용되는 맵들 Vol.3
PBR 기초 이론과 사용되는 맵들 Vol.3
 
물리 기반 셰이더의 이해
물리 기반 셰이더의 이해물리 기반 셰이더의 이해
물리 기반 셰이더의 이해
 
매쉬 베이크 마스터리
매쉬 베이크 마스터리매쉬 베이크 마스터리
매쉬 베이크 마스터리
 
심예람, <프로젝트DH> AI 내비게이션 시스템, NDC2018
심예람, <프로젝트DH> AI 내비게이션 시스템, NDC2018심예람, <프로젝트DH> AI 내비게이션 시스템, NDC2018
심예람, <프로젝트DH> AI 내비게이션 시스템, NDC2018
 
[0107 박민근] 쉽게 배우는 hdr과 톤맵핑
[0107 박민근] 쉽게 배우는 hdr과 톤맵핑[0107 박민근] 쉽게 배우는 hdr과 톤맵핑
[0107 박민근] 쉽게 배우는 hdr과 톤맵핑
 
모바일 게임 최적화
모바일 게임 최적화 모바일 게임 최적화
모바일 게임 최적화
 
Ndc12 이창희 render_pipeline
Ndc12 이창희 render_pipelineNdc12 이창희 render_pipeline
Ndc12 이창희 render_pipeline
 
[IGC 2016] 넷게임즈 김영희 - Unreal4를 사용해 모바일 프로젝트 제작하기
[IGC 2016] 넷게임즈 김영희 - Unreal4를 사용해 모바일 프로젝트 제작하기[IGC 2016] 넷게임즈 김영희 - Unreal4를 사용해 모바일 프로젝트 제작하기
[IGC 2016] 넷게임즈 김영희 - Unreal4를 사용해 모바일 프로젝트 제작하기
 
Killzone Shadow Fall Demo Postmortem
Killzone Shadow Fall Demo PostmortemKillzone Shadow Fall Demo Postmortem
Killzone Shadow Fall Demo Postmortem
 
【Unite Tokyo 2019】Unityプログレッシブライトマッパー2019
【Unite Tokyo 2019】Unityプログレッシブライトマッパー2019【Unite Tokyo 2019】Unityプログレッシブライトマッパー2019
【Unite Tokyo 2019】Unityプログレッシブライトマッパー2019
 

Viewers also liked

Outbrain Click Prediction
Outbrain Click PredictionOutbrain Click Prediction
Outbrain Click PredictionAlexey Grigorev
 
WSDM Cup 2017: Vandalism Detection
WSDM Cup 2017: Vandalism DetectionWSDM Cup 2017: Vandalism Detection
WSDM Cup 2017: Vandalism DetectionAlexey Grigorev
 
Site Crawling: What To Do & What To Look For
Site Crawling: What To Do & What To Look ForSite Crawling: What To Do & What To Look For
Site Crawling: What To Do & What To Look ForOutspoken Media
 
novel and efficient approch for detection of duplicate pages in web crawling
novel and efficient approch for detection of duplicate pages in web crawlingnovel and efficient approch for detection of duplicate pages in web crawling
novel and efficient approch for detection of duplicate pages in web crawlingVipin Kp
 
CIKM Cup 2016: Cross-Device Linking
CIKM Cup 2016: Cross-Device LinkingCIKM Cup 2016: Cross-Device Linking
CIKM Cup 2016: Cross-Device LinkingAlexey Grigorev
 
Daniel Shank, Data Scientist, Talla at MLconf SF 2016
Daniel Shank, Data Scientist, Talla at MLconf SF 2016Daniel Shank, Data Scientist, Talla at MLconf SF 2016
Daniel Shank, Data Scientist, Talla at MLconf SF 2016MLconf
 

Viewers also liked (7)

Outbrain Click Prediction
Outbrain Click PredictionOutbrain Click Prediction
Outbrain Click Prediction
 
WSDM Cup 2017: Vandalism Detection
WSDM Cup 2017: Vandalism DetectionWSDM Cup 2017: Vandalism Detection
WSDM Cup 2017: Vandalism Detection
 
Spot Sigs
Spot SigsSpot Sigs
Spot Sigs
 
Site Crawling: What To Do & What To Look For
Site Crawling: What To Do & What To Look ForSite Crawling: What To Do & What To Look For
Site Crawling: What To Do & What To Look For
 
novel and efficient approch for detection of duplicate pages in web crawling
novel and efficient approch for detection of duplicate pages in web crawlingnovel and efficient approch for detection of duplicate pages in web crawling
novel and efficient approch for detection of duplicate pages in web crawling
 
CIKM Cup 2016: Cross-Device Linking
CIKM Cup 2016: Cross-Device LinkingCIKM Cup 2016: Cross-Device Linking
CIKM Cup 2016: Cross-Device Linking
 
Daniel Shank, Data Scientist, Talla at MLconf SF 2016
Daniel Shank, Data Scientist, Talla at MLconf SF 2016Daniel Shank, Data Scientist, Talla at MLconf SF 2016
Daniel Shank, Data Scientist, Talla at MLconf SF 2016
 

Similar to Avito Duplicate Ads Detection @ kaggle

Chromatic Sparse Learning
Chromatic Sparse LearningChromatic Sparse Learning
Chromatic Sparse LearningDatabricks
 
BUD17-302: LLVM Internals #2
BUD17-302: LLVM Internals #2 BUD17-302: LLVM Internals #2
BUD17-302: LLVM Internals #2 Linaro
 
A DSTL to bridge concrete and abstract syntax
A DSTL to bridge concrete and abstract syntaxA DSTL to bridge concrete and abstract syntax
A DSTL to bridge concrete and abstract syntaxUniversity of York
 
Graph Analytics with ArangoDB
Graph Analytics with ArangoDBGraph Analytics with ArangoDB
Graph Analytics with ArangoDBArangoDB Database
 
Dart the better Javascript 2015
Dart the better Javascript 2015Dart the better Javascript 2015
Dart the better Javascript 2015Jorg Janke
 
Tour de Jackson: Forgotten Features of Jackson JSON processor
Tour de Jackson: Forgotten Features of Jackson JSON processorTour de Jackson: Forgotten Features of Jackson JSON processor
Tour de Jackson: Forgotten Features of Jackson JSON processorTatu Saloranta
 
DALL-E.pdf
DALL-E.pdfDALL-E.pdf
DALL-E.pdfdsfajkh
 
Mender.io | Develop embedded applications faster | Comparing C and Golang
Mender.io | Develop embedded applications faster | Comparing C and GolangMender.io | Develop embedded applications faster | Comparing C and Golang
Mender.io | Develop embedded applications faster | Comparing C and GolangMender.io
 
Odoo Technical Concepts Summary
Odoo Technical Concepts SummaryOdoo Technical Concepts Summary
Odoo Technical Concepts SummaryMohamed Magdy
 
Sorry - How Bieber broke Google Cloud at Spotify
Sorry - How Bieber broke Google Cloud at SpotifySorry - How Bieber broke Google Cloud at Spotify
Sorry - How Bieber broke Google Cloud at SpotifyNeville Li
 
51st solution of Avito demand prediction competition on Kaggle
51st solution of Avito demand prediction competition on Kaggle51st solution of Avito demand prediction competition on Kaggle
51st solution of Avito demand prediction competition on KaggleNasuka Sumino
 
Ledingkart Meetup #2: Scaling Search @Lendingkart
Ledingkart Meetup #2: Scaling Search @LendingkartLedingkart Meetup #2: Scaling Search @Lendingkart
Ledingkart Meetup #2: Scaling Search @LendingkartMukesh Singh
 
Machine Learning in News Media: Case study 24sata.hr - Marko Velic, Enes Deumic
Machine Learning in News Media: Case study 24sata.hr - Marko Velic, Enes DeumicMachine Learning in News Media: Case study 24sata.hr - Marko Velic, Enes Deumic
Machine Learning in News Media: Case study 24sata.hr - Marko Velic, Enes DeumicInstitute of Contemporary Sciences
 
Why go ?
Why go ?Why go ?
Why go ?Mailjet
 
Driving Moore's Law with Python-Powered Machine Learning: An Insider's Perspe...
Driving Moore's Law with Python-Powered Machine Learning: An Insider's Perspe...Driving Moore's Law with Python-Powered Machine Learning: An Insider's Perspe...
Driving Moore's Law with Python-Powered Machine Learning: An Insider's Perspe...PyData
 
spaGO: A self-contained ML & NLP library in GO
spaGO: A self-contained ML & NLP library in GOspaGO: A self-contained ML & NLP library in GO
spaGO: A self-contained ML & NLP library in GOMatteo Grella
 

Similar to Avito Duplicate Ads Detection @ kaggle (20)

Chromatic Sparse Learning
Chromatic Sparse LearningChromatic Sparse Learning
Chromatic Sparse Learning
 
BUD17-302: LLVM Internals #2
BUD17-302: LLVM Internals #2 BUD17-302: LLVM Internals #2
BUD17-302: LLVM Internals #2
 
A DSTL to bridge concrete and abstract syntax
A DSTL to bridge concrete and abstract syntaxA DSTL to bridge concrete and abstract syntax
A DSTL to bridge concrete and abstract syntax
 
Graph Analytics with ArangoDB
Graph Analytics with ArangoDBGraph Analytics with ArangoDB
Graph Analytics with ArangoDB
 
Dart the better Javascript 2015
Dart the better Javascript 2015Dart the better Javascript 2015
Dart the better Javascript 2015
 
TinkerPop 2020
TinkerPop 2020TinkerPop 2020
TinkerPop 2020
 
Tour de Jackson: Forgotten Features of Jackson JSON processor
Tour de Jackson: Forgotten Features of Jackson JSON processorTour de Jackson: Forgotten Features of Jackson JSON processor
Tour de Jackson: Forgotten Features of Jackson JSON processor
 
DALL-E.pdf
DALL-E.pdfDALL-E.pdf
DALL-E.pdf
 
Mender.io | Develop embedded applications faster | Comparing C and Golang
Mender.io | Develop embedded applications faster | Comparing C and GolangMender.io | Develop embedded applications faster | Comparing C and Golang
Mender.io | Develop embedded applications faster | Comparing C and Golang
 
Odoo Technical Concepts Summary
Odoo Technical Concepts SummaryOdoo Technical Concepts Summary
Odoo Technical Concepts Summary
 
Sorry - How Bieber broke Google Cloud at Spotify
Sorry - How Bieber broke Google Cloud at SpotifySorry - How Bieber broke Google Cloud at Spotify
Sorry - How Bieber broke Google Cloud at Spotify
 
51st solution of Avito demand prediction competition on Kaggle
51st solution of Avito demand prediction competition on Kaggle51st solution of Avito demand prediction competition on Kaggle
51st solution of Avito demand prediction competition on Kaggle
 
Ledingkart Meetup #2: Scaling Search @Lendingkart
Ledingkart Meetup #2: Scaling Search @LendingkartLedingkart Meetup #2: Scaling Search @Lendingkart
Ledingkart Meetup #2: Scaling Search @Lendingkart
 
Machine Learning in News Media: Case study 24sata.hr - Marko Velic, Enes Deumic
Machine Learning in News Media: Case study 24sata.hr - Marko Velic, Enes DeumicMachine Learning in News Media: Case study 24sata.hr - Marko Velic, Enes Deumic
Machine Learning in News Media: Case study 24sata.hr - Marko Velic, Enes Deumic
 
Golang
GolangGolang
Golang
 
Why go ?
Why go ?Why go ?
Why go ?
 
Driving Moore's Law with Python-Powered Machine Learning: An Insider's Perspe...
Driving Moore's Law with Python-Powered Machine Learning: An Insider's Perspe...Driving Moore's Law with Python-Powered Machine Learning: An Insider's Perspe...
Driving Moore's Law with Python-Powered Machine Learning: An Insider's Perspe...
 
Druid
DruidDruid
Druid
 
ML using MATLAB
ML using MATLABML using MATLAB
ML using MATLAB
 
spaGO: A self-contained ML & NLP library in GO
spaGO: A self-contained ML & NLP library in GOspaGO: A self-contained ML & NLP library in GO
spaGO: A self-contained ML & NLP library in GO
 

More from Alexey Grigorev

Codementor - Data Science at OLX
Codementor - Data Science at OLX Codementor - Data Science at OLX
Codementor - Data Science at OLX Alexey Grigorev
 
Data Monitoring with whylogs
Data Monitoring with whylogsData Monitoring with whylogs
Data Monitoring with whylogsAlexey Grigorev
 
Data engineering zoomcamp introduction
Data engineering zoomcamp  introductionData engineering zoomcamp  introduction
Data engineering zoomcamp introductionAlexey Grigorev
 
AI in Fashion - Size & Fit - Nour Karessli
 AI in Fashion - Size & Fit - Nour Karessli AI in Fashion - Size & Fit - Nour Karessli
AI in Fashion - Size & Fit - Nour KaressliAlexey Grigorev
 
AI-Powered Computer Vision Applications in Media Industry - Yulia Pavlova
AI-Powered Computer Vision Applications in Media Industry - Yulia PavlovaAI-Powered Computer Vision Applications in Media Industry - Yulia Pavlova
AI-Powered Computer Vision Applications in Media Industry - Yulia PavlovaAlexey Grigorev
 
ML Zoomcamp 10 - Kubernetes
ML Zoomcamp 10 - KubernetesML Zoomcamp 10 - Kubernetes
ML Zoomcamp 10 - KubernetesAlexey Grigorev
 
Paradoxes in Data Science
Paradoxes in Data ScienceParadoxes in Data Science
Paradoxes in Data ScienceAlexey Grigorev
 
ML Zoomcamp 8 - Neural networks and deep learning
ML Zoomcamp 8 - Neural networks and deep learningML Zoomcamp 8 - Neural networks and deep learning
ML Zoomcamp 8 - Neural networks and deep learningAlexey Grigorev
 
ML Zoomcamp 6 - Decision Trees and Ensemble Learning
ML Zoomcamp 6 - Decision Trees and Ensemble LearningML Zoomcamp 6 - Decision Trees and Ensemble Learning
ML Zoomcamp 6 - Decision Trees and Ensemble LearningAlexey Grigorev
 
ML Zoomcamp 5 - Model deployment
ML Zoomcamp 5 - Model deploymentML Zoomcamp 5 - Model deployment
ML Zoomcamp 5 - Model deploymentAlexey Grigorev
 
Introduction to Transformers for NLP - Olga Petrova
Introduction to Transformers for NLP - Olga PetrovaIntroduction to Transformers for NLP - Olga Petrova
Introduction to Transformers for NLP - Olga PetrovaAlexey Grigorev
 
ML Zoomcamp 4 - Evaluation Metrics for Classification
ML Zoomcamp 4 - Evaluation Metrics for ClassificationML Zoomcamp 4 - Evaluation Metrics for Classification
ML Zoomcamp 4 - Evaluation Metrics for ClassificationAlexey Grigorev
 
ML Zoomcamp 3 - Machine Learning for Classification
ML Zoomcamp 3 - Machine Learning for ClassificationML Zoomcamp 3 - Machine Learning for Classification
ML Zoomcamp 3 - Machine Learning for ClassificationAlexey Grigorev
 
ML Zoomcamp Week #2 Office Hours
ML Zoomcamp Week #2 Office HoursML Zoomcamp Week #2 Office Hours
ML Zoomcamp Week #2 Office HoursAlexey Grigorev
 
AMLD2021 - ML in online marketplaces
AMLD2021 - ML in online marketplacesAMLD2021 - ML in online marketplaces
AMLD2021 - ML in online marketplacesAlexey Grigorev
 
ML Zoomcamp 2.1 - Car Price Prediction Project
ML Zoomcamp 2.1 - Car Price Prediction ProjectML Zoomcamp 2.1 - Car Price Prediction Project
ML Zoomcamp 2.1 - Car Price Prediction ProjectAlexey Grigorev
 

More from Alexey Grigorev (20)

MLOps week 1 intro
MLOps week 1 introMLOps week 1 intro
MLOps week 1 intro
 
Codementor - Data Science at OLX
Codementor - Data Science at OLX Codementor - Data Science at OLX
Codementor - Data Science at OLX
 
Data Monitoring with whylogs
Data Monitoring with whylogsData Monitoring with whylogs
Data Monitoring with whylogs
 
Data engineering zoomcamp introduction
Data engineering zoomcamp  introductionData engineering zoomcamp  introduction
Data engineering zoomcamp introduction
 
AI in Fashion - Size & Fit - Nour Karessli
 AI in Fashion - Size & Fit - Nour Karessli AI in Fashion - Size & Fit - Nour Karessli
AI in Fashion - Size & Fit - Nour Karessli
 
AI-Powered Computer Vision Applications in Media Industry - Yulia Pavlova
AI-Powered Computer Vision Applications in Media Industry - Yulia PavlovaAI-Powered Computer Vision Applications in Media Industry - Yulia Pavlova
AI-Powered Computer Vision Applications in Media Industry - Yulia Pavlova
 
ML Zoomcamp 10 - Kubernetes
ML Zoomcamp 10 - KubernetesML Zoomcamp 10 - Kubernetes
ML Zoomcamp 10 - Kubernetes
 
Paradoxes in Data Science
Paradoxes in Data ScienceParadoxes in Data Science
Paradoxes in Data Science
 
ML Zoomcamp 8 - Neural networks and deep learning
ML Zoomcamp 8 - Neural networks and deep learningML Zoomcamp 8 - Neural networks and deep learning
ML Zoomcamp 8 - Neural networks and deep learning
 
Algorithmic fairness
Algorithmic fairnessAlgorithmic fairness
Algorithmic fairness
 
MLOps at OLX
MLOps at OLXMLOps at OLX
MLOps at OLX
 
ML Zoomcamp 6 - Decision Trees and Ensemble Learning
ML Zoomcamp 6 - Decision Trees and Ensemble LearningML Zoomcamp 6 - Decision Trees and Ensemble Learning
ML Zoomcamp 6 - Decision Trees and Ensemble Learning
 
ML Zoomcamp 5 - Model deployment
ML Zoomcamp 5 - Model deploymentML Zoomcamp 5 - Model deployment
ML Zoomcamp 5 - Model deployment
 
Introduction to Transformers for NLP - Olga Petrova
Introduction to Transformers for NLP - Olga PetrovaIntroduction to Transformers for NLP - Olga Petrova
Introduction to Transformers for NLP - Olga Petrova
 
ML Zoomcamp 4 - Evaluation Metrics for Classification
ML Zoomcamp 4 - Evaluation Metrics for ClassificationML Zoomcamp 4 - Evaluation Metrics for Classification
ML Zoomcamp 4 - Evaluation Metrics for Classification
 
ML Zoomcamp 3 - Machine Learning for Classification
ML Zoomcamp 3 - Machine Learning for ClassificationML Zoomcamp 3 - Machine Learning for Classification
ML Zoomcamp 3 - Machine Learning for Classification
 
ML Zoomcamp Week #2 Office Hours
ML Zoomcamp Week #2 Office HoursML Zoomcamp Week #2 Office Hours
ML Zoomcamp Week #2 Office Hours
 
AMLD2021 - ML in online marketplaces
AMLD2021 - ML in online marketplacesAMLD2021 - ML in online marketplaces
AMLD2021 - ML in online marketplaces
 
ML Zoomcamp 2 - Slides
ML Zoomcamp 2 - SlidesML Zoomcamp 2 - Slides
ML Zoomcamp 2 - Slides
 
ML Zoomcamp 2.1 - Car Price Prediction Project
ML Zoomcamp 2.1 - Car Price Prediction ProjectML Zoomcamp 2.1 - Car Price Prediction Project
ML Zoomcamp 2.1 - Car Price Prediction Project
 

Recently uploaded

RadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfRadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfgstagge
 
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Callshivangimorya083
 
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degreeyuu sss
 
Call Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceCall Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceSapana Sha
 
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...limedy534
 
ASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel CanterASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel Cantervoginip
 
Brighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingBrighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingNeil Barnes
 
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /WhatsappsBeautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsappssapnasaifi408
 
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理e4aez8ss
 
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130Suhani Kapoor
 
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptxEMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptxthyngster
 
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...Suhani Kapoor
 
Dubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls DubaiDubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls Dubaihf8803863
 
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝DelhiRS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhijennyeacort
 
20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdfHuman37
 
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一fhwihughh
 
GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]📊 Markus Baersch
 
Industrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfIndustrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfLars Albertsson
 
How we prevented account sharing with MFA
How we prevented account sharing with MFAHow we prevented account sharing with MFA
How we prevented account sharing with MFAAndrei Kaleshka
 

Recently uploaded (20)

RadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfRadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdf
 
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
 
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
 
Call Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceCall Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts Service
 
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
 
ASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel CanterASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel Canter
 
Brighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingBrighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data Storytelling
 
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /WhatsappsBeautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
 
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
 
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
 
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptxEMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
 
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
 
Dubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls DubaiDubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls Dubai
 
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝DelhiRS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
 
20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf
 
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
 
GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]
 
E-Commerce Order PredictionShraddha Kamble.pptx
E-Commerce Order PredictionShraddha Kamble.pptxE-Commerce Order PredictionShraddha Kamble.pptx
E-Commerce Order PredictionShraddha Kamble.pptx
 
Industrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfIndustrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdf
 
How we prevented account sharing with MFA
How we prevented account sharing with MFAHow we prevented account sharing with MFA
How we prevented account sharing with MFA
 

Avito Duplicate Ads Detection @ kaggle

  • 1. Alexey Grigorev Team ololobhi (Abhishek & ololo)
  • 2. Data set ● ~3 mln train pairs, ~1 mln test pairs ● ~10.8 mln images (~45 gb) Target Evaluation metric: AUC
  • 4. Process Overview mongo - text comparisons - image comparisons - text cleaning - image raw features CSVs with pair features (than train the models on it) ItemInfo train/test images Preprocessing Computing features features computed in small batches (4k-20k rows in batch)
  • 5. Features 1 ● Simple Features ○ CategoryID (plain, no OHE) ○ Number of images ○ Absolute price difference ● Simple Text Features ○ Num of Rus/Eng/Digits chars ○ Length of Title, Description ○ 2-4 ngram similarity on char level ○ Fuzzy string matches (via FuzzyWuzzy) ● Simple Picture Features ○ Channel statistics (min, mean, max, etc) ○ File size differences ○ Geometry matches ○ Num of exact matches via md5 hash ● Simple GEO Features ○ MetroID ○ LocationID ○ Euclidean distance
  • 6. |10 - 1| |5 - 1| |15 - 1| |10 - 50| |5 - 50| |15 - 50| 9 4 14 40 45 35 reshape stats 1 50 10 5 15
  • 7. Features 2: Attributes ● Regularized Jaccard of values and key=value pairs ● Number of fields both ads didn’t fill ● TF-IDF on key=value pairs ○ dot product in TF-IDF space (norm=None) was better than cosine ● Cosine in SVD of TF-IDF
  • 8. Features 3: Text ● Jaccard & Cosine on digits only and on English tokens only ● Russian chars in English words ○ E.g. “о” in “iphоne” is Cyrillic ● Cosine in TF, TF-IDF, BM25, cosine of SVD of them ● Common tokens & differences: ○ Text1: “продам iphone”, Text2: “продам айфон” ○ Common: {продам}, Difference: {iphone, айфон} ○ Cosine in TF (binary), SVD of it ● Word2Vec & GloVe ○ Cosine and manhattan b/w average title vectors ○ Stats of pairwise cosines between all tokens excluding the same ones ○ Tokens from title, description, title + description, nouns only
  • 9. Features 3 cont’d: Word’s Mover Distance ● “True” WMD is complex and slow ● “Poor Man’s” WMD is faster: ○ WMD(A, B): For each term in doc A take distance to closest term in doc B, sum over them ○ WMD_sym(A, B) = WMD(A, B) + WMD(B, A) figure from https://github.com/mkusner/wmd
  • 10. Features 3 cont’d: Misspellings ● Idea: same author can make same types of mistakes ○ No space after dot/comma (“продам айфон.дешево”) ○ Morphological errors ○ And others ● Represent ads as “Bag of Misspellings” ● Use Regularized Jaccard and Cosine ● Misspellings extracted with languagetool.org
  • 11. ● Stuff that everybody used ○ Image hashes from imagehash library and forums ○ Chi2 & Bhattacharyya on histograms (with openIMAJ) ○ SIFT keypoints + matching (with openIMAJ) ○ Structural Similarity (computed with pyssim) ● Perceptive hashes computed with imagemagick ○ hashes computed on each channel separately and on the mean channel ● Image moments: Centroids (“Ellipses” in imagemagick) ○ Centroids = centers of masses of each channel ○ Distances between image centroids in each channel ● Image moment invariants (imagemagick) ○ 7 moments, invariant to translation, scale and rotation ○ Put all 7 invariants in a vector, compute cosine and distance Features 4: Images
  • 12. Features 5: GEO ● Reverse (lat, lon) code to location ● Features like same_region, same_city, same_zip ● |zip1 - zip2| (53.87, 27.66) (region, city, zip) OpenStreetMap
  • 13. Feature Selection ● Correlation ○ A lot of features. Many correlated ones ○ Find feature groups of 0.90 correlation ○ Keep only one of the features ● XGBoost Feature Importance ○ https://github.com/Far0n/xgbfi ○ Run xgb on a sample with 100 trees ○ Use xgbfi to extract most important features ● Combined ○ In a correlated group, choose the most important feature using xgbfi output
  • 14. Most Important Features SVM fit on common & diff tokens
  • 15. Models & Ensembling ● Parameter tuning ○ Random search ● My best model: 0.939 public LB ○ XGB with depth=8 and 2.5k trees ○ Trained a few days ● Ensembling: ○ Sample group of features ○ Randomly choose the parameters ○ Build ETs and XGBs ○ Stack with Log Reg (L2 regularization with low C) ● Our final model: ○ Neural network on ETs and XGBs outputs + some selected 1st level features
  • 16. Lessons Learned ● It’s important to get CV right ○ My scheme: shuffle 3 fold (leaky) ○ Couldn’t use some nice features because of it ○ CV score of the ensemble was too good ○ Result: CV via LB ○ The right one: by connected components ● A lot of features is not always good ○ Computed too many features ○ Had hard time managing to use them all ○ Had to start stacking early That’s a graph!