UVA DSI Capstone Project
Wikimedia Foundation, Trust & Safety
Cyber Harassment Classification and Prediction of User Blocks in Wikipedia
Students - Charu Rawat, Arnab Sarkar, Sameer Singh
Faculty Advisor – Rafael Alvarado
University of Virginia Master’s in Data Science Program
Clients – Patrick Earley, Sydney Poore
Advisor – Lane Rasberry
Trust and Safety Team, Wikimedia Foundation
The English Wikipedia over the last few years…
• Popularity: Alexa Rank #5
• Total Edits: 868,717,569
• All Pages: 46,612,179
• Articles: 5,767,464
• Registered Users: 35,185,308
• Administrators: ~1,193
https://commons.wikimedia.org/wiki/File:Wikidata_Items_Map_2014_%E2%80%93_2017_gif.gif
Online Harassment at Wikipedia
• 41% have experienced personal attacks
• 66% observed attacks directed towards others
• 47% reported a decrease in contribution and engagement levels
Pew Research Center Survey (America-centric)
Problem
Human Driven Process:
• Failure to identify behavior
• Bias to overlook behavior
• Inability to understand context and scope of
Our goal was to use machine learning to develop a model that can detect abusive content as well as predict problematic users in the community. To that end, we leveraged a variety of data to analyse the prevalence and nature of online harassment at scale.
There is a need for a more robust process to combat harassment, one that can scale well as the Wikipedia community continues to grow in size and diversity.
Block User Trends in Wikipedia
• ~1.02 million unique users have been blocked on Wikipedia up to Oct 2018
• 91.5% are registered users
• 8.5% are anonymous users
• Increase in user blocks in 2017 and 2018
For what reasons are users getting blocked over the years?
Data Pipeline
• User Blocks: identified blocked users (SQL queries)
• User Comment Text: extracted user comments (Python web scraper)
• User Activity: extracted activity statistics (SQL queries)
• ORES score data: Wiki automated edit quality scores (Python web scraper)
• Google Ex: Machina: human annotated wiki user comments (CSV files)
• Data Repository on AWS
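The deck does not show the SQL used to identify blocked users, so the following is only a minimal sketch of the kind of aggregation such queries might perform. The `block_log` table, its columns, and the sample rows are all hypothetical and do not reflect the actual Wikipedia database schema or the project's data.

```python
import sqlite3

# Hypothetical, simplified block log with made-up rows.
# The real Wikipedia tables and the project's queries are not shown in the deck.
rows = [
    ("UserA", 0, 2016, "spam"),
    ("UserB", 0, 2017, "sock puppetry"),
    ("192.0.2.7", 1, 2017, "vandalism"),
    ("UserC", 0, 2018, "harassment"),
    ("UserD", 0, 2018, "spam"),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE block_log ("
    "user_name TEXT, is_anonymous INTEGER, block_year INTEGER, reason TEXT)"
)
conn.executemany("INSERT INTO block_log VALUES (?, ?, ?, ?)", rows)


def blocks_per_year(connection):
    """Count unique blocked users per year, oldest year first."""
    query = (
        "SELECT block_year, COUNT(DISTINCT user_name) "
        "FROM block_log GROUP BY block_year ORDER BY block_year"
    )
    return connection.execute(query).fetchall()


def anonymous_share(connection):
    """Fraction of unique blocked users flagged as anonymous."""
    total, anonymous = connection.execute(
        "SELECT COUNT(*), SUM(is_anonymous) "
        "FROM (SELECT DISTINCT user_name, is_anonymous FROM block_log)"
    ).fetchone()
    return anonymous / total


print(blocks_per_year(conn))
print(anonymous_share(conn))
```

An in-memory SQLite database is used here only so the sketch runs without any external service; the project itself stored its data in a repository on AWS.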
Model Building Methodology
• Corpus Aggregation + Annotations
• Feature Extraction
  ◦ Orthographical Features
  ◦ Char/Word n-gram
  ◦ NLTK Sentiment Analyzer
  ◦ Topic Modelling
  ◦ GloVe Word Embedding
  ◦ FastText
• Implementing Machine Learning Algorithms
  ◦ XGBoost – AUC 84% (Best)
• Model Tweaking
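As a rough illustration of two of the feature families listed above (orthographical features and char/word n-grams) and of the AUC metric, here is a minimal standard-library sketch. The specific feature definitions and function names are assumptions made for illustration; the deck does not specify them, and the project's XGBoost model is not reproduced here.

```python
from collections import Counter


def orthographic_features(text):
    """Simple surface features; the exact set used in the project is unknown."""
    letters = [c for c in text if c.isalpha()]
    symbols = [c for c in text if not c.isalnum() and not c.isspace()]
    return {
        "length": len(text),
        "upper_ratio": sum(c.isupper() for c in letters) / len(letters) if letters else 0.0,
        "exclamations": text.count("!"),
        "symbol_ratio": len(symbols) / len(text) if text else 0.0,
    }


def char_ngrams(text, n=3):
    """Counts of overlapping character n-grams in the lowercased text."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def word_ngrams(text, n=2):
    """Counts of overlapping word n-grams, splitting on whitespace."""
    words = text.lower().split()
    return Counter(" ".join(words[i:i + n]) for i in range(len(words) - n + 1))


def roc_auc(labels, scores):
    """AUC as the chance a random positive outscores a random negative (ties count half)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative label")
    wins = sum((p > q) + 0.5 * (p == q) for p in positives for q in negatives)
    return wins / (len(positives) * len(negatives))


print(orthographic_features("F%$@ I will delete everything you post!!!"))
print(char_ngrams("abab", n=2))
print(roc_auc([1, 0, 1, 0], [0.8, 0.3, 0.6, 0.6]))
```

The pairwise AUC above is quadratic in the number of examples, which is fine for a sketch but not for a large corpus; a library implementation would normally be used instead.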
Toxicity Score Evolution
User Risk Model - Early Detection of Problematic Users
FEATURES
Machine Learning
XGBoost
• Wikipedia is full of morons. I’m done with this!
• Thanks! This is helpful
• I agree with him on the editing trend
• F%$@ I will delete everything you post!!!
• I don’t think this is relevant
• How is this going chicken shit
0.5, 0.3, 0.8, 0.6, 0.1, 0.0
WEEKLY STATS: edit count: 5, avg length: 500, active days: 4
WEEKLY STATS: edit count: 7, avg length: 1230, active days: 2
WEEKLY STATS: edit count: 9, avg length: 30, active days: 10
WEEKLY STATS: edit count: 4, avg length: 740, active days: 7
WEEKLY STATS: edit count: 3, avg length: 100, active days: 4
WEEKLY STATS: edit count: 30, avg length: 510, active days: 14
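One way per-comment scores and weekly activity statistics like those above could be combined into a single feature vector per user is sketched below. The aggregation choices, the field names, and the sample values are assumptions made for illustration; the deck does not describe how the User Risk Model's features were actually built.

```python
from statistics import mean


def user_risk_features(weeks):
    """Aggregate a chronological list of weekly records into one feature dict.

    Each record is a dict with the hypothetical keys
    'toxicity', 'edit_count', 'avg_length' and 'active_days'.
    """
    if not weeks:
        raise ValueError("need at least one week of activity")
    toxicity = [week["toxicity"] for week in weeks]
    return {
        "mean_toxicity": mean(toxicity),
        "max_toxicity": max(toxicity),
        # Change from the first to the last observed week.
        "toxicity_trend": toxicity[-1] - toxicity[0],
        "total_edits": sum(week["edit_count"] for week in weeks),
        "mean_avg_length": mean(week["avg_length"] for week in weeks),
        "mean_active_days": mean(week["active_days"] for week in weeks),
    }


# Made-up example values, not taken from the project data.
example_weeks = [
    {"toxicity": 0.1, "edit_count": 5, "avg_length": 500, "active_days": 4},
    {"toxicity": 0.3, "edit_count": 7, "avg_length": 1230, "active_days": 2},
    {"toxicity": 0.8, "edit_count": 9, "avg_length": 30, "active_days": 5},
]

print(user_risk_features(example_weeks))
```

A feature dict of this shape could then be passed to a classifier such as the XGBoost model named on the slide, with the user's eventual block status as the label.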
Thank you!

Editor's Notes

  1. Users who engage in problematic behavior are blocked on Wikipedia. About 1.2 million users were blocked from 2004 to 2018; in fact, the number of blocked users reached an all-time high in 2018. Only ~9% of those blocked users are anonymous, contradicting the widespread assumption that anonymity is a major factor in such behavior. Wikipedia vs Wikimedia. Fellow students.
  2. If we look at the reasons users get blocked, that is, the kinds of toxic behavior that lead to a block, some reasons dominate, such as sock puppetry and spam. These reasons have changed over time, which also shows that harassment or toxic behavior can take multiple forms.
  3. To uncover such insights and build upon them, we analyzed a variety of data made publicly available by Wikipedia, ranging from structured data about blocked users and general user activity to large volumes of text from user conversations on the platform. As you can see, we used several methods to access these data sources for our analysis and model building.