SlideShare a Scribd company logo
WSDM Cup 2017:
Vandalism Detection
Alexey Grigorev
21.01.2017
https://www.wikidata.org/wiki/Q2539
Pseudoscience ← Misinformation!
WSDM Cup 2017: Vandalism Detection
● http://www.wsdm-cup-2017.org/vandalism-detection.html
● Vandalism in KBs is very bad
● Goal of the competition:
● Predict if a Wikidata revision should be rolled back or not
● Solutions ranked by AUC
Not a usual competition
● No forum
● Code to put on VM
○ VM constraints: 1 core, 4 Gb RAM
● Evaluation
○ Read data from the socket
○ Transform the data, make a prediction
○ Write the prediction back to the socket
● Test data not available
● Results of test runs aren’t shown, only validation runs
● No leaderboard
● Only one attempt at the end - by default, last successful run on test
Data
● Wikidata dump with edits
○ From 2012 to 2016
○ 72 mln revisions
○ 24.8 Gb compressed
○ 400 Gb uncompressed
● Very skewed:
○ 127k rollbacks (out of 72m)
○ i.e. 0.0025 - fraction of positives
● For each revision know:
○ Title of page, timestamp
○ Comment and json data
○ Username or IP + geo information
○ Was reverted or not
<page>
<title>Q5704066</title>
<ns>0</ns>
<id>5468191</id>
<revision>
<id>185142906</id>
<parentid>185051367</parentid>
<timestamp>2015-01-01</timestamp>
<contributor>
<username>username</username>
<id>52267</id>
</contributor>
<comment>/* wbsetdescription-add:1|es */
futbolista irlandes</comment>
<model>wikibase-item</model>
<format>application/json</format>
<text xml:space="preserve">{{PAGE_JSON}}</text>
</revision>
</page>
IP if anonymous &
geo info in meta file
Cross-Validation
Train 2012-10-29 to 2016-02-29 65 mln
Validation 2016-03-01 to 2016-04-30 7.2 mln
Test from 2016-05-01 10.4 mln
Train 2015-01-01 to 2015-31-31 27.9 mln
Validation 2016-01-01 to 2016-02-29 9.3 mln
Test 2016-03-01 to 2016-04-30 7.2 mln
← Not available
← Available from start
← Available towards end
Features
3 groups:
● Page feature: page title
● User features
● Comment features
Not considered:
● Timestamps
● Counters
● JSON Content
Problematic due to 4Gb constraint
User Features
● Username, if logged in, if not “anonymous=True”
● IP: 90.219.230.105 → 90, 90_219, 90_219_230, 90_219_230_105
● Geo information from the meta file:
○ country_code=GB
○ continent_code=EU
○ time_zone=GMT
○ regio_code=EN
○ city_name=LEEDS
○ county_name=WEST_YORKSHIRE
● OHE: put this together as one string
○ anonymous 90 90_219 90_219_230 90_219_230_105 country_code=GB ...
○ Use CountVectorizer
Comment Features
● “Structured” part: inside /* */ - split on “:” and “|”:
○ wbsetdescription-add 1 es
○ wbsetsitelink-add 1 idwiki
○ clientsitelink-remove 1 frwiki
● Wiki-Links: [[Property:P31]], [[Q5]]
● Free text: outside of /* */:
○ futbolista irlandes
○ autolist2
○ origyn web browser
/* wbsetdescription-add:1|es */ futbolista irlandes
/* wbcreateclaim-create:1| */ [[Property:P31]]: [[Q5]], #autolist2
/* wbsetsitelink-add:1|idwiki */ Megaloharpya
/* clientsitelink-remove:1||frwiki */ Origyn Web Browser
Models
● LinearSVM in primal with L1 regularization
● Stacking: XGBoost and L2-LogReg
○ Both good on validation, but overfit testing
● Online Learning: worse than SVM on one year
● Upsampling and Undersampling: always worse than as-is
Final Model
● Put all features together as one string
● title=Q123 username=user wbsetdescription-add 1 es P31 Q5 …
● OHE?
○ CountVectorizer - dictionary too large
○ HashingVectorizer with 10m columns
Final model:
● OHE with HashingVectorizer (no memory)
● SVM on OHE matrix (300 mb)
● ~ 0.96 AUC on my test
Final Results
That’s me :-)
Conclusions
New things I learned/tried:
● Feather - very fast!
● Twisted
Lessons Learned:
● Trust your CV
● Sometimes ensembling does not work
● Prefer simpler models to avoid overfitting
Links and Further Info
● Competition website: http://www.wsdm-cup-2017.org/
● Competition platform: http://tira.io/
● My solution:
○ https://github.com/alexeygrigorev/wsdmcup17-vandalism-detection
Thank you
Questions?

More Related Content

What's hot

Functional Programming In JS
Functional Programming In JSFunctional Programming In JS
Functional Programming In JS
Damian Łabas
 
Temporal Performance Modelling of Serverless Computing Platforms - WoSC6
Temporal Performance Modelling of Serverless Computing Platforms - WoSC6Temporal Performance Modelling of Serverless Computing Platforms - WoSC6
Temporal Performance Modelling of Serverless Computing Platforms - WoSC6
Nima Mahmoudi
 
Unit test demo for calculatechinesenamenumber
Unit test demo for calculatechinesenamenumberUnit test demo for calculatechinesenamenumber
Unit test demo for calculatechinesenamenumberJuggernaut Liu
 
Text Indexing / Inverted Indices
Text Indexing / Inverted IndicesText Indexing / Inverted Indices
Text Indexing / Inverted Indices
Carlos Castillo (ChaTo)
 
Formalising Graph Pattern Matching Gremlin traversals in Graph Alegra
Formalising Graph Pattern Matching Gremlin traversals in Graph AlegraFormalising Graph Pattern Matching Gremlin traversals in Graph Alegra
Formalising Graph Pattern Matching Gremlin traversals in Graph Alegra
Harsh Thakkar
 
Grails workshops
Grails workshopsGrails workshops
Grails workshops
Łukasz Tenerowicz
 
[Question Paper] Object Oriented Programming (Old Course) [April / 2014]
[Question Paper] Object Oriented Programming (Old Course) [April / 2014][Question Paper] Object Oriented Programming (Old Course) [April / 2014]
[Question Paper] Object Oriented Programming (Old Course) [April / 2014]
Mumbai B.Sc.IT Study
 
Go bei der 4Com GmbH & Co. KG
Go bei der 4Com GmbH & Co. KGGo bei der 4Com GmbH & Co. KG
Go bei der 4Com GmbH & Co. KG
Jonas Riedel
 
Demo the reactive jargons
Demo the reactive jargonsDemo the reactive jargons
Demo the reactive jargonsThoughtworks
 
Globe Infographics
Globe InfographicsGlobe Infographics
Globe Infographics
LINE Corporation
 
Green Custard Friday Talk 8: GraphQL
Green Custard Friday Talk 8: GraphQLGreen Custard Friday Talk 8: GraphQL
Green Custard Friday Talk 8: GraphQL
Green Custard
 
Implementation
ImplementationImplementation
Implementation
Syed Zaid Irshad
 
Kaggle bosch presentation material for Kaggle Tokyo Meetup #2
Kaggle bosch presentation material for Kaggle Tokyo Meetup #2Kaggle bosch presentation material for Kaggle Tokyo Meetup #2
Kaggle bosch presentation material for Kaggle Tokyo Meetup #2
Keisuke Hosaka
 
Grokking TechTalk #24: Thiết kế hệ thống Background Job Queue bằng Ruby & Pos...
Grokking TechTalk #24: Thiết kế hệ thống Background Job Queue bằng Ruby & Pos...Grokking TechTalk #24: Thiết kế hệ thống Background Job Queue bằng Ruby & Pos...
Grokking TechTalk #24: Thiết kế hệ thống Background Job Queue bằng Ruby & Pos...
Grokking VN
 
D422 7-2 string hadeling
D422 7-2  string hadelingD422 7-2  string hadeling
D422 7-2 string hadeling
Omkar Rane
 

What's hot (18)

Functional Programming In JS
Functional Programming In JSFunctional Programming In JS
Functional Programming In JS
 
Temporal Performance Modelling of Serverless Computing Platforms - WoSC6
Temporal Performance Modelling of Serverless Computing Platforms - WoSC6Temporal Performance Modelling of Serverless Computing Platforms - WoSC6
Temporal Performance Modelling of Serverless Computing Platforms - WoSC6
 
Unit test demo for calculatechinesenamenumber
Unit test demo for calculatechinesenamenumberUnit test demo for calculatechinesenamenumber
Unit test demo for calculatechinesenamenumber
 
Toy Model Overview
Toy Model OverviewToy Model Overview
Toy Model Overview
 
Case Study
Case Study Case Study
Case Study
 
Text Indexing / Inverted Indices
Text Indexing / Inverted IndicesText Indexing / Inverted Indices
Text Indexing / Inverted Indices
 
Formalising Graph Pattern Matching Gremlin traversals in Graph Alegra
Formalising Graph Pattern Matching Gremlin traversals in Graph AlegraFormalising Graph Pattern Matching Gremlin traversals in Graph Alegra
Formalising Graph Pattern Matching Gremlin traversals in Graph Alegra
 
Grails workshops
Grails workshopsGrails workshops
Grails workshops
 
Build web server
Build web serverBuild web server
Build web server
 
[Question Paper] Object Oriented Programming (Old Course) [April / 2014]
[Question Paper] Object Oriented Programming (Old Course) [April / 2014][Question Paper] Object Oriented Programming (Old Course) [April / 2014]
[Question Paper] Object Oriented Programming (Old Course) [April / 2014]
 
Go bei der 4Com GmbH & Co. KG
Go bei der 4Com GmbH & Co. KGGo bei der 4Com GmbH & Co. KG
Go bei der 4Com GmbH & Co. KG
 
Demo the reactive jargons
Demo the reactive jargonsDemo the reactive jargons
Demo the reactive jargons
 
Globe Infographics
Globe InfographicsGlobe Infographics
Globe Infographics
 
Green Custard Friday Talk 8: GraphQL
Green Custard Friday Talk 8: GraphQLGreen Custard Friday Talk 8: GraphQL
Green Custard Friday Talk 8: GraphQL
 
Implementation
ImplementationImplementation
Implementation
 
Kaggle bosch presentation material for Kaggle Tokyo Meetup #2
Kaggle bosch presentation material for Kaggle Tokyo Meetup #2Kaggle bosch presentation material for Kaggle Tokyo Meetup #2
Kaggle bosch presentation material for Kaggle Tokyo Meetup #2
 
Grokking TechTalk #24: Thiết kế hệ thống Background Job Queue bằng Ruby & Pos...
Grokking TechTalk #24: Thiết kế hệ thống Background Job Queue bằng Ruby & Pos...Grokking TechTalk #24: Thiết kế hệ thống Background Job Queue bằng Ruby & Pos...
Grokking TechTalk #24: Thiết kế hệ thống Background Job Queue bằng Ruby & Pos...
 
D422 7-2 string hadeling
D422 7-2  string hadelingD422 7-2  string hadeling
D422 7-2 string hadeling
 

Similar to WSDM Cup 2017: Vandalism Detection

Voldemort : Prototype to Production
Voldemort : Prototype to ProductionVoldemort : Prototype to Production
Voldemort : Prototype to Production
Vinoth Chandar
 
Sprint 53
Sprint 53Sprint 53
Sprint 53
ManageIQ
 
Tools and libraries for common problems (Early Draft)
Tools and libraries for common problems (Early Draft)Tools and libraries for common problems (Early Draft)
Tools and libraries for common problems (Early Draft)
rc2209
 
Logs @ OVHcloud
Logs @ OVHcloudLogs @ OVHcloud
Logs @ OVHcloud
OVHcloud
 
Troubleshooting MySQL from a MySQL Developer Perspective
Troubleshooting MySQL from a MySQL Developer PerspectiveTroubleshooting MySQL from a MySQL Developer Perspective
Troubleshooting MySQL from a MySQL Developer Perspective
Marcelo Altmann
 
Web Assembly
Web AssemblyWeb Assembly
Web Assembly
Valerio Como
 
AOT-compilation of JavaScript with V8
AOT-compilation of JavaScript with V8AOT-compilation of JavaScript with V8
AOT-compilation of JavaScript with V8
Phil Eaton
 
Tips & tricks for Firebase storage and Firebase functions
Tips & tricks for Firebase storage and Firebase functionsTips & tricks for Firebase storage and Firebase functions
Tips & tricks for Firebase storage and Firebase functions
GameCamp
 
How to build TiDB
How to build TiDBHow to build TiDB
How to build TiDB
PingCAP
 
Java 9 - Part1: New Features (Not Jigsaw Modules)
Java 9 - Part1: New Features (Not Jigsaw Modules)Java 9 - Part1: New Features (Not Jigsaw Modules)
Java 9 - Part1: New Features (Not Jigsaw Modules)
Simone Bordet
 
Under the hood of the Altalis Platform
Under the hood of the Altalis PlatformUnder the hood of the Altalis Platform
Under the hood of the Altalis Platform
Safe Software
 
BKK16-302: Android Optimizing Compiler: New Member Assimilation Guide
BKK16-302: Android Optimizing Compiler: New Member Assimilation GuideBKK16-302: Android Optimizing Compiler: New Member Assimilation Guide
BKK16-302: Android Optimizing Compiler: New Member Assimilation Guide
Linaro
 
Gobblin @ NerdWallet (Nov 2015)
Gobblin @ NerdWallet (Nov 2015)Gobblin @ NerdWallet (Nov 2015)
Gobblin @ NerdWallet (Nov 2015)
NerdWalletHQ
 
Tips to drive maria db cluster performance for nextcloud
Tips to drive maria db cluster performance for nextcloudTips to drive maria db cluster performance for nextcloud
Tips to drive maria db cluster performance for nextcloud
Severalnines
 
Scaling xtext
Scaling xtextScaling xtext
Scaling xtext
Lieven Lemiengre
 
Google Cloud Platform Special Training
Google Cloud Platform Special TrainingGoogle Cloud Platform Special Training
Google Cloud Platform Special Training
Simon Su
 
HKG18-TR08 - Upstreaming SVE in QEMU
HKG18-TR08 - Upstreaming SVE in QEMUHKG18-TR08 - Upstreaming SVE in QEMU
HKG18-TR08 - Upstreaming SVE in QEMU
Linaro
 
Secure software supply chain on a shoestring budget
Secure software supply chain on a shoestring budgetSecure software supply chain on a shoestring budget
Secure software supply chain on a shoestring budget
Lars Albertsson
 
Parallel programing in web applications - public.pptx
Parallel programing in web applications - public.pptxParallel programing in web applications - public.pptx
Parallel programing in web applications - public.pptx
Guy Bary
 

Similar to WSDM Cup 2017: Vandalism Detection (20)

Voldemort : Prototype to Production
Voldemort : Prototype to ProductionVoldemort : Prototype to Production
Voldemort : Prototype to Production
 
Sprint 53
Sprint 53Sprint 53
Sprint 53
 
Tools and libraries for common problems (Early Draft)
Tools and libraries for common problems (Early Draft)Tools and libraries for common problems (Early Draft)
Tools and libraries for common problems (Early Draft)
 
Logs @ OVHcloud
Logs @ OVHcloudLogs @ OVHcloud
Logs @ OVHcloud
 
Troubleshooting MySQL from a MySQL Developer Perspective
Troubleshooting MySQL from a MySQL Developer PerspectiveTroubleshooting MySQL from a MySQL Developer Perspective
Troubleshooting MySQL from a MySQL Developer Perspective
 
Web Assembly
Web AssemblyWeb Assembly
Web Assembly
 
AOT-compilation of JavaScript with V8
AOT-compilation of JavaScript with V8AOT-compilation of JavaScript with V8
AOT-compilation of JavaScript with V8
 
Tips & tricks for Firebase storage and Firebase functions
Tips & tricks for Firebase storage and Firebase functionsTips & tricks for Firebase storage and Firebase functions
Tips & tricks for Firebase storage and Firebase functions
 
How to build TiDB
How to build TiDBHow to build TiDB
How to build TiDB
 
Java 9 - Part1: New Features (Not Jigsaw Modules)
Java 9 - Part1: New Features (Not Jigsaw Modules)Java 9 - Part1: New Features (Not Jigsaw Modules)
Java 9 - Part1: New Features (Not Jigsaw Modules)
 
Under the hood of the Altalis Platform
Under the hood of the Altalis PlatformUnder the hood of the Altalis Platform
Under the hood of the Altalis Platform
 
BKK16-302: Android Optimizing Compiler: New Member Assimilation Guide
BKK16-302: Android Optimizing Compiler: New Member Assimilation GuideBKK16-302: Android Optimizing Compiler: New Member Assimilation Guide
BKK16-302: Android Optimizing Compiler: New Member Assimilation Guide
 
Gobblin @ NerdWallet (Nov 2015)
Gobblin @ NerdWallet (Nov 2015)Gobblin @ NerdWallet (Nov 2015)
Gobblin @ NerdWallet (Nov 2015)
 
Tips to drive maria db cluster performance for nextcloud
Tips to drive maria db cluster performance for nextcloudTips to drive maria db cluster performance for nextcloud
Tips to drive maria db cluster performance for nextcloud
 
Scaling xtext
Scaling xtextScaling xtext
Scaling xtext
 
Google Cloud Platform Special Training
Google Cloud Platform Special TrainingGoogle Cloud Platform Special Training
Google Cloud Platform Special Training
 
HKG18-TR08 - Upstreaming SVE in QEMU
HKG18-TR08 - Upstreaming SVE in QEMUHKG18-TR08 - Upstreaming SVE in QEMU
HKG18-TR08 - Upstreaming SVE in QEMU
 
Secure software supply chain on a shoestring budget
Secure software supply chain on a shoestring budgetSecure software supply chain on a shoestring budget
Secure software supply chain on a shoestring budget
 
nebulaconf
nebulaconfnebulaconf
nebulaconf
 
Parallel programing in web applications - public.pptx
Parallel programing in web applications - public.pptxParallel programing in web applications - public.pptx
Parallel programing in web applications - public.pptx
 

More from Alexey Grigorev

MLOps week 1 intro
MLOps week 1 introMLOps week 1 intro
MLOps week 1 intro
Alexey Grigorev
 
Codementor - Data Science at OLX
Codementor - Data Science at OLX Codementor - Data Science at OLX
Codementor - Data Science at OLX
Alexey Grigorev
 
Data Monitoring with whylogs
Data Monitoring with whylogsData Monitoring with whylogs
Data Monitoring with whylogs
Alexey Grigorev
 
Data engineering zoomcamp introduction
Data engineering zoomcamp  introductionData engineering zoomcamp  introduction
Data engineering zoomcamp introduction
Alexey Grigorev
 
AI in Fashion - Size & Fit - Nour Karessli
 AI in Fashion - Size & Fit - Nour Karessli AI in Fashion - Size & Fit - Nour Karessli
AI in Fashion - Size & Fit - Nour Karessli
Alexey Grigorev
 
AI-Powered Computer Vision Applications in Media Industry - Yulia Pavlova
AI-Powered Computer Vision Applications in Media Industry - Yulia PavlovaAI-Powered Computer Vision Applications in Media Industry - Yulia Pavlova
AI-Powered Computer Vision Applications in Media Industry - Yulia Pavlova
Alexey Grigorev
 
ML Zoomcamp 10 - Kubernetes
ML Zoomcamp 10 - KubernetesML Zoomcamp 10 - Kubernetes
ML Zoomcamp 10 - Kubernetes
Alexey Grigorev
 
Paradoxes in Data Science
Paradoxes in Data ScienceParadoxes in Data Science
Paradoxes in Data Science
Alexey Grigorev
 
ML Zoomcamp 8 - Neural networks and deep learning
ML Zoomcamp 8 - Neural networks and deep learningML Zoomcamp 8 - Neural networks and deep learning
ML Zoomcamp 8 - Neural networks and deep learning
Alexey Grigorev
 
Algorithmic fairness
Algorithmic fairnessAlgorithmic fairness
Algorithmic fairness
Alexey Grigorev
 
MLOps at OLX
MLOps at OLXMLOps at OLX
MLOps at OLX
Alexey Grigorev
 
ML Zoomcamp 6 - Decision Trees and Ensemble Learning
ML Zoomcamp 6 - Decision Trees and Ensemble LearningML Zoomcamp 6 - Decision Trees and Ensemble Learning
ML Zoomcamp 6 - Decision Trees and Ensemble Learning
Alexey Grigorev
 
ML Zoomcamp 5 - Model deployment
ML Zoomcamp 5 - Model deploymentML Zoomcamp 5 - Model deployment
ML Zoomcamp 5 - Model deployment
Alexey Grigorev
 
Introduction to Transformers for NLP - Olga Petrova
Introduction to Transformers for NLP - Olga PetrovaIntroduction to Transformers for NLP - Olga Petrova
Introduction to Transformers for NLP - Olga Petrova
Alexey Grigorev
 
ML Zoomcamp 4 - Evaluation Metrics for Classification
ML Zoomcamp 4 - Evaluation Metrics for ClassificationML Zoomcamp 4 - Evaluation Metrics for Classification
ML Zoomcamp 4 - Evaluation Metrics for Classification
Alexey Grigorev
 
ML Zoomcamp 3 - Machine Learning for Classification
ML Zoomcamp 3 - Machine Learning for ClassificationML Zoomcamp 3 - Machine Learning for Classification
ML Zoomcamp 3 - Machine Learning for Classification
Alexey Grigorev
 
ML Zoomcamp Week #2 Office Hours
ML Zoomcamp Week #2 Office HoursML Zoomcamp Week #2 Office Hours
ML Zoomcamp Week #2 Office Hours
Alexey Grigorev
 
AMLD2021 - ML in online marketplaces
AMLD2021 - ML in online marketplacesAMLD2021 - ML in online marketplaces
AMLD2021 - ML in online marketplaces
Alexey Grigorev
 
ML Zoomcamp 2 - Slides
ML Zoomcamp 2 - SlidesML Zoomcamp 2 - Slides
ML Zoomcamp 2 - Slides
Alexey Grigorev
 
ML Zoomcamp 2.1 - Car Price Prediction Project
ML Zoomcamp 2.1 - Car Price Prediction ProjectML Zoomcamp 2.1 - Car Price Prediction Project
ML Zoomcamp 2.1 - Car Price Prediction Project
Alexey Grigorev
 

More from Alexey Grigorev (20)

MLOps week 1 intro
MLOps week 1 introMLOps week 1 intro
MLOps week 1 intro
 
Codementor - Data Science at OLX
Codementor - Data Science at OLX Codementor - Data Science at OLX
Codementor - Data Science at OLX
 
Data Monitoring with whylogs
Data Monitoring with whylogsData Monitoring with whylogs
Data Monitoring with whylogs
 
Data engineering zoomcamp introduction
Data engineering zoomcamp  introductionData engineering zoomcamp  introduction
Data engineering zoomcamp introduction
 
AI in Fashion - Size & Fit - Nour Karessli
 AI in Fashion - Size & Fit - Nour Karessli AI in Fashion - Size & Fit - Nour Karessli
AI in Fashion - Size & Fit - Nour Karessli
 
AI-Powered Computer Vision Applications in Media Industry - Yulia Pavlova
AI-Powered Computer Vision Applications in Media Industry - Yulia PavlovaAI-Powered Computer Vision Applications in Media Industry - Yulia Pavlova
AI-Powered Computer Vision Applications in Media Industry - Yulia Pavlova
 
ML Zoomcamp 10 - Kubernetes
ML Zoomcamp 10 - KubernetesML Zoomcamp 10 - Kubernetes
ML Zoomcamp 10 - Kubernetes
 
Paradoxes in Data Science
Paradoxes in Data ScienceParadoxes in Data Science
Paradoxes in Data Science
 
ML Zoomcamp 8 - Neural networks and deep learning
ML Zoomcamp 8 - Neural networks and deep learningML Zoomcamp 8 - Neural networks and deep learning
ML Zoomcamp 8 - Neural networks and deep learning
 
Algorithmic fairness
Algorithmic fairnessAlgorithmic fairness
Algorithmic fairness
 
MLOps at OLX
MLOps at OLXMLOps at OLX
MLOps at OLX
 
ML Zoomcamp 6 - Decision Trees and Ensemble Learning
ML Zoomcamp 6 - Decision Trees and Ensemble LearningML Zoomcamp 6 - Decision Trees and Ensemble Learning
ML Zoomcamp 6 - Decision Trees and Ensemble Learning
 
ML Zoomcamp 5 - Model deployment
ML Zoomcamp 5 - Model deploymentML Zoomcamp 5 - Model deployment
ML Zoomcamp 5 - Model deployment
 
Introduction to Transformers for NLP - Olga Petrova
Introduction to Transformers for NLP - Olga PetrovaIntroduction to Transformers for NLP - Olga Petrova
Introduction to Transformers for NLP - Olga Petrova
 
ML Zoomcamp 4 - Evaluation Metrics for Classification
ML Zoomcamp 4 - Evaluation Metrics for ClassificationML Zoomcamp 4 - Evaluation Metrics for Classification
ML Zoomcamp 4 - Evaluation Metrics for Classification
 
ML Zoomcamp 3 - Machine Learning for Classification
ML Zoomcamp 3 - Machine Learning for ClassificationML Zoomcamp 3 - Machine Learning for Classification
ML Zoomcamp 3 - Machine Learning for Classification
 
ML Zoomcamp Week #2 Office Hours
ML Zoomcamp Week #2 Office HoursML Zoomcamp Week #2 Office Hours
ML Zoomcamp Week #2 Office Hours
 
AMLD2021 - ML in online marketplaces
AMLD2021 - ML in online marketplacesAMLD2021 - ML in online marketplaces
AMLD2021 - ML in online marketplaces
 
ML Zoomcamp 2 - Slides
ML Zoomcamp 2 - SlidesML Zoomcamp 2 - Slides
ML Zoomcamp 2 - Slides
 
ML Zoomcamp 2.1 - Car Price Prediction Project
ML Zoomcamp 2.1 - Car Price Prediction ProjectML Zoomcamp 2.1 - Car Price Prediction Project
ML Zoomcamp 2.1 - Car Price Prediction Project
 

Recently uploaded

Standard Reomte Control Interface - Neometrix
Standard Reomte Control Interface - NeometrixStandard Reomte Control Interface - Neometrix
Standard Reomte Control Interface - Neometrix
Neometrix_Engineering_Pvt_Ltd
 
Governing Equations for Fundamental Aerodynamics_Anderson2010.pdf
Governing Equations for Fundamental Aerodynamics_Anderson2010.pdfGoverning Equations for Fundamental Aerodynamics_Anderson2010.pdf
Governing Equations for Fundamental Aerodynamics_Anderson2010.pdf
WENKENLI1
 
Design and Analysis of Algorithms-DP,Backtracking,Graphs,B&B
Design and Analysis of Algorithms-DP,Backtracking,Graphs,B&BDesign and Analysis of Algorithms-DP,Backtracking,Graphs,B&B
Design and Analysis of Algorithms-DP,Backtracking,Graphs,B&B
Sreedhar Chowdam
 
English lab ppt no titlespecENG PPTt.pdf
English lab ppt no titlespecENG PPTt.pdfEnglish lab ppt no titlespecENG PPTt.pdf
English lab ppt no titlespecENG PPTt.pdf
BrazilAccount1
 
The role of big data in decision making.
The role of big data in decision making.The role of big data in decision making.
The role of big data in decision making.
ankuprajapati0525
 
Immunizing Image Classifiers Against Localized Adversary Attacks
Immunizing Image Classifiers Against Localized Adversary AttacksImmunizing Image Classifiers Against Localized Adversary Attacks
Immunizing Image Classifiers Against Localized Adversary Attacks
gerogepatton
 
Top 10 Oil and Gas Projects in Saudi Arabia 2024.pdf
Top 10 Oil and Gas Projects in Saudi Arabia 2024.pdfTop 10 Oil and Gas Projects in Saudi Arabia 2024.pdf
Top 10 Oil and Gas Projects in Saudi Arabia 2024.pdf
Teleport Manpower Consultant
 
Pile Foundation by Venkatesh Taduvai (Sub Geotechnical Engineering II)-conver...
Pile Foundation by Venkatesh Taduvai (Sub Geotechnical Engineering II)-conver...Pile Foundation by Venkatesh Taduvai (Sub Geotechnical Engineering II)-conver...
Pile Foundation by Venkatesh Taduvai (Sub Geotechnical Engineering II)-conver...
AJAYKUMARPUND1
 
在线办理(ANU毕业证书)澳洲国立大学毕业证录取通知书一模一样
在线办理(ANU毕业证书)澳洲国立大学毕业证录取通知书一模一样在线办理(ANU毕业证书)澳洲国立大学毕业证录取通知书一模一样
在线办理(ANU毕业证书)澳洲国立大学毕业证录取通知书一模一样
obonagu
 
一比一原版(SFU毕业证)西蒙菲莎大学毕业证成绩单如何办理
一比一原版(SFU毕业证)西蒙菲莎大学毕业证成绩单如何办理一比一原版(SFU毕业证)西蒙菲莎大学毕业证成绩单如何办理
一比一原版(SFU毕业证)西蒙菲莎大学毕业证成绩单如何办理
bakpo1
 
CFD Simulation of By-pass Flow in a HRSG module by R&R Consult.pptx
CFD Simulation of By-pass Flow in a HRSG module by R&R Consult.pptxCFD Simulation of By-pass Flow in a HRSG module by R&R Consult.pptx
CFD Simulation of By-pass Flow in a HRSG module by R&R Consult.pptx
R&R Consult
 
MCQ Soil mechanics questions (Soil shear strength).pdf
MCQ Soil mechanics questions (Soil shear strength).pdfMCQ Soil mechanics questions (Soil shear strength).pdf
MCQ Soil mechanics questions (Soil shear strength).pdf
Osamah Alsalih
 
WATER CRISIS and its solutions-pptx 1234
WATER CRISIS and its solutions-pptx 1234WATER CRISIS and its solutions-pptx 1234
WATER CRISIS and its solutions-pptx 1234
AafreenAbuthahir2
 
Investor-Presentation-Q1FY2024 investor presentation document.pptx
Investor-Presentation-Q1FY2024 investor presentation document.pptxInvestor-Presentation-Q1FY2024 investor presentation document.pptx
Investor-Presentation-Q1FY2024 investor presentation document.pptx
AmarGB2
 
ML for identifying fraud using open blockchain data.pptx
ML for identifying fraud using open blockchain data.pptxML for identifying fraud using open blockchain data.pptx
ML for identifying fraud using open blockchain data.pptx
Vijay Dialani, PhD
 
Cosmetic shop management system project report.pdf
Cosmetic shop management system project report.pdfCosmetic shop management system project report.pdf
Cosmetic shop management system project report.pdf
Kamal Acharya
 
weather web application report.pdf
weather web application report.pdfweather web application report.pdf
weather web application report.pdf
Pratik Pawar
 
CME397 Surface Engineering- Professional Elective
CME397 Surface Engineering- Professional ElectiveCME397 Surface Engineering- Professional Elective
CME397 Surface Engineering- Professional Elective
karthi keyan
 
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单专业办理
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单专业办理一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单专业办理
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单专业办理
zwunae
 
block diagram and signal flow graph representation
block diagram and signal flow graph representationblock diagram and signal flow graph representation
block diagram and signal flow graph representation
Divya Somashekar
 

Recently uploaded (20)

Standard Reomte Control Interface - Neometrix
Standard Reomte Control Interface - NeometrixStandard Reomte Control Interface - Neometrix
Standard Reomte Control Interface - Neometrix
 
Governing Equations for Fundamental Aerodynamics_Anderson2010.pdf
Governing Equations for Fundamental Aerodynamics_Anderson2010.pdfGoverning Equations for Fundamental Aerodynamics_Anderson2010.pdf
Governing Equations for Fundamental Aerodynamics_Anderson2010.pdf
 
Design and Analysis of Algorithms-DP,Backtracking,Graphs,B&B
Design and Analysis of Algorithms-DP,Backtracking,Graphs,B&BDesign and Analysis of Algorithms-DP,Backtracking,Graphs,B&B
Design and Analysis of Algorithms-DP,Backtracking,Graphs,B&B
 
English lab ppt no titlespecENG PPTt.pdf
English lab ppt no titlespecENG PPTt.pdfEnglish lab ppt no titlespecENG PPTt.pdf
English lab ppt no titlespecENG PPTt.pdf
 
The role of big data in decision making.
The role of big data in decision making.The role of big data in decision making.
The role of big data in decision making.
 
Immunizing Image Classifiers Against Localized Adversary Attacks
Immunizing Image Classifiers Against Localized Adversary AttacksImmunizing Image Classifiers Against Localized Adversary Attacks
Immunizing Image Classifiers Against Localized Adversary Attacks
 
Top 10 Oil and Gas Projects in Saudi Arabia 2024.pdf
Top 10 Oil and Gas Projects in Saudi Arabia 2024.pdfTop 10 Oil and Gas Projects in Saudi Arabia 2024.pdf
Top 10 Oil and Gas Projects in Saudi Arabia 2024.pdf
 
Pile Foundation by Venkatesh Taduvai (Sub Geotechnical Engineering II)-conver...
Pile Foundation by Venkatesh Taduvai (Sub Geotechnical Engineering II)-conver...Pile Foundation by Venkatesh Taduvai (Sub Geotechnical Engineering II)-conver...
Pile Foundation by Venkatesh Taduvai (Sub Geotechnical Engineering II)-conver...
 
在线办理(ANU毕业证书)澳洲国立大学毕业证录取通知书一模一样
在线办理(ANU毕业证书)澳洲国立大学毕业证录取通知书一模一样在线办理(ANU毕业证书)澳洲国立大学毕业证录取通知书一模一样
在线办理(ANU毕业证书)澳洲国立大学毕业证录取通知书一模一样
 
一比一原版(SFU毕业证)西蒙菲莎大学毕业证成绩单如何办理
一比一原版(SFU毕业证)西蒙菲莎大学毕业证成绩单如何办理一比一原版(SFU毕业证)西蒙菲莎大学毕业证成绩单如何办理
一比一原版(SFU毕业证)西蒙菲莎大学毕业证成绩单如何办理
 
CFD Simulation of By-pass Flow in a HRSG module by R&R Consult.pptx
CFD Simulation of By-pass Flow in a HRSG module by R&R Consult.pptxCFD Simulation of By-pass Flow in a HRSG module by R&R Consult.pptx
CFD Simulation of By-pass Flow in a HRSG module by R&R Consult.pptx
 
MCQ Soil mechanics questions (Soil shear strength).pdf
MCQ Soil mechanics questions (Soil shear strength).pdfMCQ Soil mechanics questions (Soil shear strength).pdf
MCQ Soil mechanics questions (Soil shear strength).pdf
 
WATER CRISIS and its solutions-pptx 1234
WATER CRISIS and its solutions-pptx 1234WATER CRISIS and its solutions-pptx 1234
WATER CRISIS and its solutions-pptx 1234
 
Investor-Presentation-Q1FY2024 investor presentation document.pptx
Investor-Presentation-Q1FY2024 investor presentation document.pptxInvestor-Presentation-Q1FY2024 investor presentation document.pptx
Investor-Presentation-Q1FY2024 investor presentation document.pptx
 
ML for identifying fraud using open blockchain data.pptx
ML for identifying fraud using open blockchain data.pptxML for identifying fraud using open blockchain data.pptx
ML for identifying fraud using open blockchain data.pptx
 
Cosmetic shop management system project report.pdf
Cosmetic shop management system project report.pdfCosmetic shop management system project report.pdf
Cosmetic shop management system project report.pdf
 
weather web application report.pdf
weather web application report.pdfweather web application report.pdf
weather web application report.pdf
 
CME397 Surface Engineering- Professional Elective
CME397 Surface Engineering- Professional ElectiveCME397 Surface Engineering- Professional Elective
CME397 Surface Engineering- Professional Elective
 
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单专业办理
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单专业办理一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单专业办理
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单专业办理
 
block diagram and signal flow graph representation
block diagram and signal flow graph representationblock diagram and signal flow graph representation
block diagram and signal flow graph representation
 

WSDM Cup 2017: Vandalism Detection

  • 1. WSDM Cup 2017: Vandalism Detection Alexey Grigorev 21.01.2017
  • 3. WSDM Cup 2017: Vandalism Detection ● http://www.wsdm-cup-2017.org/vandalism-detection.html ● Vandalism in KBs is very bad ● Goal of the competition: ● Predict if a Wikidata revision should be rolled back or not ● Solutions ranked by AUC
  • 4. Not a usual competition ● No forum ● Code to put on VM ○ VM constraints: 1 core, 4 Gb RAM ● Evaluation ○ Read data from the socket ○ Transform the data, make a prediction ○ Write the prediction back to the socket ● Test data not available ● Results of test runs aren’t shown, only validation runs ● No leaderboard ● Only one attempt at the end - by default, last successful run on test
  • 5.
  • 6. Data ● Wikidata dump with edits ○ From 2012 to 2016 ○ 72 mln revisions ○ 24.8 Gb compressed ○ 400 Gb uncompressed ● Very skewed: ○ 127k rollbacks (out of 72m) ○ i.e. 0.0025 - fraction of positives ● For each revision know: ○ Title of page, timestamp ○ Comment and json data ○ Username or IP + geo information ○ Was reverted or not <page> <title>Q5704066</title> <ns>0</ns> <id>5468191</id> <revision> <id>185142906</id> <parentid>185051367</parentid> <timestamp>2015-01-01</timestamp> <contributor> <username>username</username> <id>52267</id> </contributor> <comment>/* wbsetdescription-add:1|es */ futbolista irlandes</comment> <model>wikibase-item</model> <format>application/json</format> <text xml:space="preserve">{{PAGE_JSON}}</text> </revision> </page> IP if anonymous & geo info in meta file
  • 7. Cross-Validation Train 2012-10-29 to 2016-02-29 65 mln Validation 2016-03-01 to 2016-04-30 7.2 mln Test from 2016-05-01 10.4 mln Train 2015-01-01 to 2015-31-31 27.9 mln Validation 2016-01-01 to 2016-02-29 9.3 mln Test 2016-03-01 to 2016-04-30 7.2 mln ← Not available ← Available from start ← Available towards end
  • 8. Features 3 groups: ● Page feature: page title ● User features ● Comment features Not considered: ● Timestamps ● Counters ● JSON Content Problematic due to 4Gb constraint
  • 9. User Features ● Username, if logged in, if not “anonymous=True” ● IP: 90.219.230.105 → 90, 90_219, 90_219_230, 90_219_230_105 ● Geo information from the meta file: ○ country_code=GB ○ continent_code=EU ○ time_zone=GMT ○ regio_code=EN ○ city_name=LEEDS ○ county_name=WEST_YORKSHIRE ● OHE: put this together as one string ○ anonymous 90 90_219 90_219_230 90_219_230_105 country_code=GB ... ○ Use CountVectorizer
  • 10. Comment Features ● “Structured” part: inside /* */ - split on “:” and “|”: ○ wbsetdescription-add 1 es ○ wbsetsitelink-add 1 idwiki ○ clientsitelink-remove 1 frwiki ● Wiki-Links: [[Property:P31]], [[Q5]] ● Free text: outside of /* */: ○ futbolista irlandes ○ autolist2 ○ origyn web browser /* wbsetdescription-add:1|es */ futbolista irlandes /* wbcreateclaim-create:1| */ [[Property:P31]]: [[Q5]], #autolist2 /* wbsetsitelink-add:1|idwiki */ Megaloharpya /* clientsitelink-remove:1||frwiki */ Origyn Web Browser
  • 11. Models ● LinearSVM in primal with L1 regularization ● Stacking: XGBoost and L2-LogReg ○ Both good on validation, but overfit testing ● Online Learning: worse than SVM on one year ● Upsampling and Undersampling: always worse than as-is
  • 12. Final Model ● Put all features together as one string ● title=Q123 username=user wbsetdescription-add 1 es P31 Q5 … ● OHE? ○ CountVectorizer - dictionary too large ○ HashingVectorizer with 10m columns Final model: ● OHE with HashingVectorizer (no memory) ● SVM on OHE matrix (300 mb) ● ~ 0.96 AUC on my test
  • 14. Conclusions New things I learned/tried: ● Feather - very fast! ● Twisted Lessons Learned: ● Trust your CV ● Sometimes ensembling does not work ● Prefer simpler models to avoid overfitting
  • 15.
  • 16. Links and Further Info ● Competition website: http://www.wsdm-cup-2017.org/ ● Competition platform: http://tira.io/ ● My solution: ○ https://github.com/alexeygrigorev/wsdmcup17-vandalism-detection