SlideShare a Scribd company logo
UVA DSI Capstone Project
Wikimedia Foundation, Trust & Safety
Cyber Harassment Classification and Prediction of User Blocks in Wikipedia
Students - Charu Rawat , Arnab Sarkar, Sameer Singh
Faculty Advisor – Rafael Alvarado
University of Virginia Master’s in Data Science Program
Clients – Patrick Earley , Sydney Poore
Advisor – Lane Rasberry
Trust and Safety Team, Wikimedia Foundation
1
The English Wikipedia over the last few years…
• Popularity: Alexa Rank
#5
• Total Edits: 868,717,569
• All Pages: 46,612,179
• Articles 5,767,464
• Registered Users:
35,185,308
• Administrators: ~1,193
https://commons.wikimedia.org/wiki/File:Wikidata_Items_Map_2014_%E2%80%93_2017_gif.gif
2
41% have experienced
personal attacks
66% observed attacks
directed towards others
47% reported a decrease in
contribution and engagement levels
3
Online Harassment at Wikipedia
Pew Research Centre Survey (America centric)
Human Driven
Process
Failure to identify
behavior
Bias to overlook
behavior
Inability to
understand context
and scope of
Our goal was to use machine learning to develop a model that can detect abusive content as well as predict
problematic users in the community.
To that end, we leveraged a variety of data to analyse the prevalence and nature of online harassment at scale.
There is a need for a more robust process to combat
harassment, one that can scale well as theWikipedia
community continues to grow in its size and diversity.
Problem
• ~1.02 million unique
users have been blocked
Wikipedia till Oct 2018
• 91.5% registered users
• 8.5% are anonymous
users
• Increase in user bocks in
2017, 2018
5
Block User Trends in Wikipedia
6
For what reasons are users getting blocked over the years?
User Comment Text
Data Repository
on AWS
User Blocks
Identified blocked
users
SQL Queries
Extracted user
comments
Python web scraper
ORES score data
User Activity
Extracted activity
statistics
SQL Queries
Wiki automated
edit quality scores
Python web scraper
Google Ex: Machina
Human annotated wiki
user comments
CSV files
Data Pipeline
Model Building Methodology
• Corpus Aggregation + Annotations
• Feature Extraction
 Orthographical Features
 Char/Word n-gram
 NLTK Sentiment Analyzer
 Topic Modelling
 GloVe Word Embedding
 FastText
• Implementing Machine Learning Algorithms
 XGBoost – AUC 84% (Best)
• Model Tweaking
8
9
Toxicity Score Evolution
User Risk Model - Early Detection of Problematic Users
FEATURES
Machine
Learning
XGBoost
Wikipedia is
full of
morons. I’m
done with
this!
Thanks! This
is helpful
I agree with
him on the
editing trend
F%$@ I will
delete
everything
you post!!!
I don’t think
this is
relevant
How is this
going chicken
shit
0.5
0.3
0.8
0.6
0.1
0.0
WEEKLY STATS
edit count: 5
avg length: 500
active days : 4
WEEKLY STATS
edit count: 7
avg length: 1230
active days : 2
WEEKLY STATS
edit count: 9
avg length: 30
active days : 10
WEEKLY STATS
edit count: 4
avg length: 740
active days : 7
WEEKLY STATS
edit count: 3
avg length: 100
active days : 4
WEEKLY STATS
edit count: 30
avg length: 510
active days : 14
11
Thank you!
12

More Related Content

What's hot

Research Networking SEO state of the union 2015
Research Networking SEO state of the union 2015Research Networking SEO state of the union 2015
Research Networking SEO state of the union 2015
lesliey
 
Scholarometer Altmetrics2014
Scholarometer Altmetrics2014Scholarometer Altmetrics2014
Scholarometer Altmetrics2014
Jasleen Kaur
 
UCCSC Sauter Award for Profiles
UCCSC Sauter Award for ProfilesUCCSC Sauter Award for Profiles
UCCSC Sauter Award for Profilesericmeeks
 
UCCSC 2013 Presentation on UCSF Profiles
UCCSC 2013 Presentation on UCSF Profiles UCCSC 2013 Presentation on UCSF Profiles
UCCSC 2013 Presentation on UCSF Profiles
lesliey
 
Accessibility Compliance: One State, Two Approaches
Accessibility Compliance: One State, Two ApproachesAccessibility Compliance: One State, Two Approaches
Accessibility Compliance: One State, Two Approaches
NASIG
 
10 simple ways UCSF Profiles has been used to win funding, find collaborators...
10 simple ways UCSF Profiles has been used to win funding, find collaborators...10 simple ways UCSF Profiles has been used to win funding, find collaborators...
10 simple ways UCSF Profiles has been used to win funding, find collaborators...
lesliey
 
Research Summaries: An Evolving Tool in the KMb Tool Box
Research Summaries: An Evolving Tool in the KMb Tool BoxResearch Summaries: An Evolving Tool in the KMb Tool Box
Research Summaries: An Evolving Tool in the KMb Tool Box
Shawna Reibling
 
5 Secrets of a Successful UCSF Profiles page 2015
5 Secrets of a Successful UCSF Profiles page 20155 Secrets of a Successful UCSF Profiles page 2015
5 Secrets of a Successful UCSF Profiles page 2015
anirvanchatterjee
 
Enhance your rese​arch impact through open science
Enhance your rese​arch impact through open scienceEnhance your rese​arch impact through open science
Enhance your rese​arch impact through open science
London School of Hygiene and Tropical Medicine
 

What's hot (10)

Research Networking SEO state of the union 2015
Research Networking SEO state of the union 2015Research Networking SEO state of the union 2015
Research Networking SEO state of the union 2015
 
Scholarometer Altmetrics2014
Scholarometer Altmetrics2014Scholarometer Altmetrics2014
Scholarometer Altmetrics2014
 
UCCSC Sauter Award for Profiles
UCCSC Sauter Award for ProfilesUCCSC Sauter Award for Profiles
UCCSC Sauter Award for Profiles
 
UCCSC 2013 Presentation on UCSF Profiles
UCCSC 2013 Presentation on UCSF Profiles UCCSC 2013 Presentation on UCSF Profiles
UCCSC 2013 Presentation on UCSF Profiles
 
Accessibility Compliance: One State, Two Approaches
Accessibility Compliance: One State, Two ApproachesAccessibility Compliance: One State, Two Approaches
Accessibility Compliance: One State, Two Approaches
 
10 simple ways UCSF Profiles has been used to win funding, find collaborators...
10 simple ways UCSF Profiles has been used to win funding, find collaborators...10 simple ways UCSF Profiles has been used to win funding, find collaborators...
10 simple ways UCSF Profiles has been used to win funding, find collaborators...
 
Research Summaries: An Evolving Tool in the KMb Tool Box
Research Summaries: An Evolving Tool in the KMb Tool BoxResearch Summaries: An Evolving Tool in the KMb Tool Box
Research Summaries: An Evolving Tool in the KMb Tool Box
 
5 Secrets of a Successful UCSF Profiles page 2015
5 Secrets of a Successful UCSF Profiles page 20155 Secrets of a Successful UCSF Profiles page 2015
5 Secrets of a Successful UCSF Profiles page 2015
 
Enhance your rese​arch impact through open science
Enhance your rese​arch impact through open scienceEnhance your rese​arch impact through open science
Enhance your rese​arch impact through open science
 
International publication flash course
International publication flash courseInternational publication flash course
International publication flash course
 

Similar to Wikimedia Foundation, Trust & Safety: Cyber Harassment Classification and Prediction of User Blocks in Wikipedia

RA21 Charleston Library Conference Presentation
RA21 Charleston Library Conference Presentation RA21 Charleston Library Conference Presentation
RA21 Charleston Library Conference Presentation
National Information Standards Organization (NISO)
 
OCLC Research Update at ALA Chicago. June 26, 2017.
OCLC Research Update at ALA Chicago. June 26, 2017.OCLC Research Update at ALA Chicago. June 26, 2017.
OCLC Research Update at ALA Chicago. June 26, 2017.
OCLC
 
Ucsd research-it-09-11-18
Ucsd research-it-09-11-18Ucsd research-it-09-11-18
Ucsd research-it-09-11-18
Nancy Wilkins-Diehr
 
SC11 Science Gateway Group Overview
SC11 Science Gateway Group OverviewSC11 Science Gateway Group Overview
SC11 Science Gateway Group Overview
marpierc
 
Alamw15 VIVO
Alamw15 VIVOAlamw15 VIVO
Alamw15 VIVO
Kristi Holmes
 
Web 2.0 and its Implications for Libraries by Ata ur Rehman & Farzana Shafique
Web 2.0 and its Implications for Libraries by Ata ur Rehman & Farzana ShafiqueWeb 2.0 and its Implications for Libraries by Ata ur Rehman & Farzana Shafique
Web 2.0 and its Implications for Libraries by Ata ur Rehman & Farzana Shafique
Ata Rehman
 
FAIR Software (and Data) Citation: Europe, Research Object Systems, Networks ...
FAIR Software (and Data) Citation: Europe, Research Object Systems, Networks ...FAIR Software (and Data) Citation: Europe, Research Object Systems, Networks ...
FAIR Software (and Data) Citation: Europe, Research Object Systems, Networks ...
Carole Goble
 
OSFair2017 Workshop | Frontiers’ Ambition for Open Science
OSFair2017 Workshop | Frontiers’ Ambition for Open ScienceOSFair2017 Workshop | Frontiers’ Ambition for Open Science
OSFair2017 Workshop | Frontiers’ Ambition for Open Science
Open Science Fair
 
SGCI - The Science Gateways Community Institute: International Collaboration ...
SGCI - The Science Gateways Community Institute: International Collaboration ...SGCI - The Science Gateways Community Institute: International Collaboration ...
SGCI - The Science Gateways Community Institute: International Collaboration ...
Sandra Gesing
 
Sgci nasa-esds-10-29-18
Sgci nasa-esds-10-29-18Sgci nasa-esds-10-29-18
Sgci nasa-esds-10-29-18
Nancy Wilkins-Diehr
 
Assessing Your Library Website: Using User Research Methods and Other Tools
Assessing Your Library Website: Using User Research Methods and Other ToolsAssessing Your Library Website: Using User Research Methods and Other Tools
Assessing Your Library Website: Using User Research Methods and Other Tools
Rachel Vacek
 
H2O for Medicine and Intro to H2O in Python
H2O for Medicine and Intro to H2O in PythonH2O for Medicine and Intro to H2O in Python
H2O for Medicine and Intro to H2O in Python
Sri Ambati
 
The sum of all human knowledge in the age of machines: A new research agenda ...
The sum of all human knowledge in the age of machines: A new research agenda ...The sum of all human knowledge in the age of machines: A new research agenda ...
The sum of all human knowledge in the age of machines: A new research agenda ...
Dario Taraborelli
 
IGeLU 2014 - Interoperability Special Interest Working Group
IGeLU 2014 - Interoperability Special Interest Working GroupIGeLU 2014 - Interoperability Special Interest Working Group
IGeLU 2014 - Interoperability Special Interest Working Group
Masud Khokhar
 
ESWC 2014 Tutorial Part 4
ESWC 2014 Tutorial Part 4ESWC 2014 Tutorial Part 4
ESWC 2014 Tutorial Part 4
Miriam Fernandez
 
Automatic detection of online abuse and analysis of problematic users in wiki...
Automatic detection of online abuse and analysis of problematic users in wiki...Automatic detection of online abuse and analysis of problematic users in wiki...
Automatic detection of online abuse and analysis of problematic users in wiki...
Melissa Moody
 
NISO April 30th RA21 Webinar
NISO April 30th RA21 WebinarNISO April 30th RA21 Webinar
Smarter Data for Smarter Libraries
Smarter Data for Smarter LibrariesSmarter Data for Smarter Libraries
Smarter Data for Smarter Libraries
OCLC
 
Clinical Anatomy 9566
Clinical Anatomy 9566Clinical Anatomy 9566
Clinical Anatomy 9566
Robin Featherstone
 

Similar to Wikimedia Foundation, Trust & Safety: Cyber Harassment Classification and Prediction of User Blocks in Wikipedia (20)

RA21 Charleston Library Conference Presentation
RA21 Charleston Library Conference Presentation RA21 Charleston Library Conference Presentation
RA21 Charleston Library Conference Presentation
 
OCLC Research Update at ALA Chicago. June 26, 2017.
OCLC Research Update at ALA Chicago. June 26, 2017.OCLC Research Update at ALA Chicago. June 26, 2017.
OCLC Research Update at ALA Chicago. June 26, 2017.
 
Ucsd research-it-09-11-18
Ucsd research-it-09-11-18Ucsd research-it-09-11-18
Ucsd research-it-09-11-18
 
SC11 Science Gateway Group Overview
SC11 Science Gateway Group OverviewSC11 Science Gateway Group Overview
SC11 Science Gateway Group Overview
 
Alamw15 VIVO
Alamw15 VIVOAlamw15 VIVO
Alamw15 VIVO
 
Web 2.0 and its Implications for Libraries by Ata ur Rehman & Farzana Shafique
Web 2.0 and its Implications for Libraries by Ata ur Rehman & Farzana ShafiqueWeb 2.0 and its Implications for Libraries by Ata ur Rehman & Farzana Shafique
Web 2.0 and its Implications for Libraries by Ata ur Rehman & Farzana Shafique
 
FAIR Software (and Data) Citation: Europe, Research Object Systems, Networks ...
FAIR Software (and Data) Citation: Europe, Research Object Systems, Networks ...FAIR Software (and Data) Citation: Europe, Research Object Systems, Networks ...
FAIR Software (and Data) Citation: Europe, Research Object Systems, Networks ...
 
OSFair2017 Workshop | Frontiers’ Ambition for Open Science
OSFair2017 Workshop | Frontiers’ Ambition for Open ScienceOSFair2017 Workshop | Frontiers’ Ambition for Open Science
OSFair2017 Workshop | Frontiers’ Ambition for Open Science
 
SGCI - The Science Gateways Community Institute: International Collaboration ...
SGCI - The Science Gateways Community Institute: International Collaboration ...SGCI - The Science Gateways Community Institute: International Collaboration ...
SGCI - The Science Gateways Community Institute: International Collaboration ...
 
Sgci nasa-esds-10-29-18
Sgci nasa-esds-10-29-18Sgci nasa-esds-10-29-18
Sgci nasa-esds-10-29-18
 
Assessing Your Library Website: Using User Research Methods and Other Tools
Assessing Your Library Website: Using User Research Methods and Other ToolsAssessing Your Library Website: Using User Research Methods and Other Tools
Assessing Your Library Website: Using User Research Methods and Other Tools
 
H2O for Medicine and Intro to H2O in Python
H2O for Medicine and Intro to H2O in PythonH2O for Medicine and Intro to H2O in Python
H2O for Medicine and Intro to H2O in Python
 
The sum of all human knowledge in the age of machines: A new research agenda ...
The sum of all human knowledge in the age of machines: A new research agenda ...The sum of all human knowledge in the age of machines: A new research agenda ...
The sum of all human knowledge in the age of machines: A new research agenda ...
 
IGeLU 2014 - Interoperability Special Interest Working Group
IGeLU 2014 - Interoperability Special Interest Working GroupIGeLU 2014 - Interoperability Special Interest Working Group
IGeLU 2014 - Interoperability Special Interest Working Group
 
Ir1
Ir1Ir1
Ir1
 
ESWC 2014 Tutorial Part 4
ESWC 2014 Tutorial Part 4ESWC 2014 Tutorial Part 4
ESWC 2014 Tutorial Part 4
 
Automatic detection of online abuse and analysis of problematic users in wiki...
Automatic detection of online abuse and analysis of problematic users in wiki...Automatic detection of online abuse and analysis of problematic users in wiki...
Automatic detection of online abuse and analysis of problematic users in wiki...
 
NISO April 30th RA21 Webinar
NISO April 30th RA21 WebinarNISO April 30th RA21 Webinar
NISO April 30th RA21 Webinar
 
Smarter Data for Smarter Libraries
Smarter Data for Smarter LibrariesSmarter Data for Smarter Libraries
Smarter Data for Smarter Libraries
 
Clinical Anatomy 9566
Clinical Anatomy 9566Clinical Anatomy 9566
Clinical Anatomy 9566
 

More from Melissa Moody

Connecting citizens with public data to drive policy change
Connecting citizens with public data to drive policy changeConnecting citizens with public data to drive policy change
Connecting citizens with public data to drive policy change
Melissa Moody
 
Data Collection Methods for Building a Free Response Training Simulation
Data Collection Methods for Building a Free Response Training SimulationData Collection Methods for Building a Free Response Training Simulation
Data Collection Methods for Building a Free Response Training Simulation
Melissa Moody
 
Improving Credit Card Fraud Detection: Using Machine Learning to Profile and ...
Improving Credit Card Fraud Detection: Using Machine Learning to Profile and ...Improving Credit Card Fraud Detection: Using Machine Learning to Profile and ...
Improving Credit Card Fraud Detection: Using Machine Learning to Profile and ...
Melissa Moody
 
Deep Learning Meets Biology: How Does a Protein Helix Know Where to Start and...
Deep Learning Meets Biology: How Does a Protein Helix Know Where to Start and...Deep Learning Meets Biology: How Does a Protein Helix Know Where to Start and...
Deep Learning Meets Biology: How Does a Protein Helix Know Where to Start and...
Melissa Moody
 
Plans for the University of Virginia School of Data Science
Plans for the University of Virginia School of Data SciencePlans for the University of Virginia School of Data Science
Plans for the University of Virginia School of Data Science
Melissa Moody
 
Balanced Datasets Are Not Enough: Estimating and Mitigating Gender Bias in De...
Balanced Datasets Are Not Enough: Estimating and Mitigating Gender Bias in De...Balanced Datasets Are Not Enough: Estimating and Mitigating Gender Bias in De...
Balanced Datasets Are Not Enough: Estimating and Mitigating Gender Bias in De...
Melissa Moody
 
Collective Biographies of Women: A Deep Learning Approach to Paragraph Annota...
Collective Biographies of Women: A Deep Learning Approach to Paragraph Annota...Collective Biographies of Women: A Deep Learning Approach to Paragraph Annota...
Collective Biographies of Women: A Deep Learning Approach to Paragraph Annota...
Melissa Moody
 
Introduction to Social Network Analysis
Introduction to Social Network AnalysisIntroduction to Social Network Analysis
Introduction to Social Network Analysis
Melissa Moody
 
Ethical Priniciples for the All Data Revolution
Ethical Priniciples for the All Data RevolutionEthical Priniciples for the All Data Revolution
Ethical Priniciples for the All Data Revolution
Melissa Moody
 
Steering Model Selection with Visual Diagnostics
Steering Model Selection with Visual DiagnosticsSteering Model Selection with Visual Diagnostics
Steering Model Selection with Visual Diagnostics
Melissa Moody
 
Assessing the reproducibility of DNA microarray studies
Assessing the reproducibility of DNA microarray studiesAssessing the reproducibility of DNA microarray studies
Assessing the reproducibility of DNA microarray studies
Melissa Moody
 
Modeling the Impact of R & Python Packages: Dependency and Contributor Networks
Modeling the Impact of R & Python Packages: Dependency and Contributor NetworksModeling the Impact of R & Python Packages: Dependency and Contributor Networks
Modeling the Impact of R & Python Packages: Dependency and Contributor Networks
Melissa Moody
 
How to Beat the House: Predicting Football Results with Hyperparameter Optimi...
How to Beat the House: Predicting Football Results with Hyperparameter Optimi...How to Beat the House: Predicting Football Results with Hyperparameter Optimi...
How to Beat the House: Predicting Football Results with Hyperparameter Optimi...
Melissa Moody
 
A Modified K-Means Clustering Approach to Redrawing US Congressional Districts
A Modified K-Means Clustering Approach to Redrawing US Congressional DistrictsA Modified K-Means Clustering Approach to Redrawing US Congressional Districts
A Modified K-Means Clustering Approach to Redrawing US Congressional Districts
Melissa Moody
 
Joining Separate Paradigms: Text Mining & Deep Neural Networks to Character...
Joining Separate Paradigms:  Text Mining & Deep Neural Networks  to Character...Joining Separate Paradigms:  Text Mining & Deep Neural Networks  to Character...
Joining Separate Paradigms: Text Mining & Deep Neural Networks to Character...
Melissa Moody
 

More from Melissa Moody (15)

Connecting citizens with public data to drive policy change
Connecting citizens with public data to drive policy changeConnecting citizens with public data to drive policy change
Connecting citizens with public data to drive policy change
 
Data Collection Methods for Building a Free Response Training Simulation
Data Collection Methods for Building a Free Response Training SimulationData Collection Methods for Building a Free Response Training Simulation
Data Collection Methods for Building a Free Response Training Simulation
 
Improving Credit Card Fraud Detection: Using Machine Learning to Profile and ...
Improving Credit Card Fraud Detection: Using Machine Learning to Profile and ...Improving Credit Card Fraud Detection: Using Machine Learning to Profile and ...
Improving Credit Card Fraud Detection: Using Machine Learning to Profile and ...
 
Deep Learning Meets Biology: How Does a Protein Helix Know Where to Start and...
Deep Learning Meets Biology: How Does a Protein Helix Know Where to Start and...Deep Learning Meets Biology: How Does a Protein Helix Know Where to Start and...
Deep Learning Meets Biology: How Does a Protein Helix Know Where to Start and...
 
Plans for the University of Virginia School of Data Science
Plans for the University of Virginia School of Data SciencePlans for the University of Virginia School of Data Science
Plans for the University of Virginia School of Data Science
 
Balanced Datasets Are Not Enough: Estimating and Mitigating Gender Bias in De...
Balanced Datasets Are Not Enough: Estimating and Mitigating Gender Bias in De...Balanced Datasets Are Not Enough: Estimating and Mitigating Gender Bias in De...
Balanced Datasets Are Not Enough: Estimating and Mitigating Gender Bias in De...
 
Collective Biographies of Women: A Deep Learning Approach to Paragraph Annota...
Collective Biographies of Women: A Deep Learning Approach to Paragraph Annota...Collective Biographies of Women: A Deep Learning Approach to Paragraph Annota...
Collective Biographies of Women: A Deep Learning Approach to Paragraph Annota...
 
Introduction to Social Network Analysis
Introduction to Social Network AnalysisIntroduction to Social Network Analysis
Introduction to Social Network Analysis
 
Ethical Priniciples for the All Data Revolution
Ethical Priniciples for the All Data RevolutionEthical Priniciples for the All Data Revolution
Ethical Priniciples for the All Data Revolution
 
Steering Model Selection with Visual Diagnostics
Steering Model Selection with Visual DiagnosticsSteering Model Selection with Visual Diagnostics
Steering Model Selection with Visual Diagnostics
 
Assessing the reproducibility of DNA microarray studies
Assessing the reproducibility of DNA microarray studiesAssessing the reproducibility of DNA microarray studies
Assessing the reproducibility of DNA microarray studies
 
Modeling the Impact of R & Python Packages: Dependency and Contributor Networks
Modeling the Impact of R & Python Packages: Dependency and Contributor NetworksModeling the Impact of R & Python Packages: Dependency and Contributor Networks
Modeling the Impact of R & Python Packages: Dependency and Contributor Networks
 
How to Beat the House: Predicting Football Results with Hyperparameter Optimi...
How to Beat the House: Predicting Football Results with Hyperparameter Optimi...How to Beat the House: Predicting Football Results with Hyperparameter Optimi...
How to Beat the House: Predicting Football Results with Hyperparameter Optimi...
 
A Modified K-Means Clustering Approach to Redrawing US Congressional Districts
A Modified K-Means Clustering Approach to Redrawing US Congressional DistrictsA Modified K-Means Clustering Approach to Redrawing US Congressional Districts
A Modified K-Means Clustering Approach to Redrawing US Congressional Districts
 
Joining Separate Paradigms: Text Mining & Deep Neural Networks to Character...
Joining Separate Paradigms:  Text Mining & Deep Neural Networks  to Character...Joining Separate Paradigms:  Text Mining & Deep Neural Networks  to Character...
Joining Separate Paradigms: Text Mining & Deep Neural Networks to Character...
 

Recently uploaded

Q1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year ReboundQ1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year Rebound
Oppotus
 
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
axoqas
 
一比一原版(TWU毕业证)西三一大学毕业证成绩单
一比一原版(TWU毕业证)西三一大学毕业证成绩单一比一原版(TWU毕业证)西三一大学毕业证成绩单
一比一原版(TWU毕业证)西三一大学毕业证成绩单
ocavb
 
1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx
1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx
1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx
Tiktokethiodaily
 
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单
ewymefz
 
Machine learning and optimization techniques for electrical drives.pptx
Machine learning and optimization techniques for electrical drives.pptxMachine learning and optimization techniques for electrical drives.pptx
Machine learning and optimization techniques for electrical drives.pptx
balafet
 
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
ahzuo
 
一比一原版(UofS毕业证书)萨省大学毕业证如何办理
一比一原版(UofS毕业证书)萨省大学毕业证如何办理一比一原版(UofS毕业证书)萨省大学毕业证如何办理
一比一原版(UofS毕业证书)萨省大学毕业证如何办理
v3tuleee
 
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
slg6lamcq
 
一比一原版(BU毕业证)波士顿大学毕业证成绩单
一比一原版(BU毕业证)波士顿大学毕业证成绩单一比一原版(BU毕业证)波士顿大学毕业证成绩单
一比一原版(BU毕业证)波士顿大学毕业证成绩单
ewymefz
 
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
nscud
 
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP
 
社内勉強会資料_LLM Agents                              .
社内勉強会資料_LLM Agents                              .社内勉強会資料_LLM Agents                              .
社内勉強会資料_LLM Agents                              .
NABLAS株式会社
 
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
John Andrews
 
一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单
enxupq
 
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
ewymefz
 
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
yhkoc
 
Best best suvichar in gujarati english meaning of this sentence as Silk road ...
Best best suvichar in gujarati english meaning of this sentence as Silk road ...Best best suvichar in gujarati english meaning of this sentence as Silk road ...
Best best suvichar in gujarati english meaning of this sentence as Silk road ...
AbhimanyuSinha9
 
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...
pchutichetpong
 
原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
u86oixdj
 

Recently uploaded (20)

Q1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year ReboundQ1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year Rebound
 
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
 
一比一原版(TWU毕业证)西三一大学毕业证成绩单
一比一原版(TWU毕业证)西三一大学毕业证成绩单一比一原版(TWU毕业证)西三一大学毕业证成绩单
一比一原版(TWU毕业证)西三一大学毕业证成绩单
 
1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx
1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx
1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx
 
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单
 
Machine learning and optimization techniques for electrical drives.pptx
Machine learning and optimization techniques for electrical drives.pptxMachine learning and optimization techniques for electrical drives.pptx
Machine learning and optimization techniques for electrical drives.pptx
 
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
 
一比一原版(UofS毕业证书)萨省大学毕业证如何办理
一比一原版(UofS毕业证书)萨省大学毕业证如何办理一比一原版(UofS毕业证书)萨省大学毕业证如何办理
一比一原版(UofS毕业证书)萨省大学毕业证如何办理
 
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
 
一比一原版(BU毕业证)波士顿大学毕业证成绩单
一比一原版(BU毕业证)波士顿大学毕业证成绩单一比一原版(BU毕业证)波士顿大学毕业证成绩单
一比一原版(BU毕业证)波士顿大学毕业证成绩单
 
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
 
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdf
 
社内勉強会資料_LLM Agents                              .
社内勉強会資料_LLM Agents                              .社内勉強会資料_LLM Agents                              .
社内勉強会資料_LLM Agents                              .
 
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
 
一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单
 
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
 
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
 
Best best suvichar in gujarati english meaning of this sentence as Silk road ...
Best best suvichar in gujarati english meaning of this sentence as Silk road ...Best best suvichar in gujarati english meaning of this sentence as Silk road ...
Best best suvichar in gujarati english meaning of this sentence as Silk road ...
 
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...
 
原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
 

Wikimedia Foundation, Trust & Safety: Cyber Harassment Classification and Prediction of User Blocks in Wikipedia

  • 1. UVA DSI Capstone Project Wikimedia Foundation, Trust & Safety Cyber Harassment Classification and Prediction of User Blocks in Wikipedia Students - Charu Rawat , Arnab Sarkar, Sameer Singh Faculty Advisor – Rafael Alvarado University of Virginia Master’s in Data Science Program Clients – Patrick Earley , Sydney Poore Advisor – Lane Rasberry Trust and Safety Team, Wikimedia Foundation 1
  • 2. The English Wikipedia over the last few years… • Popularity: Alexa Rank #5 • Total Edits: 868,717,569 • All Pages: 46,612,179 • Articles 5,767,464 • Registered Users: 35,185,308 • Administrators: ~1,193 https://commons.wikimedia.org/wiki/File:Wikidata_Items_Map_2014_%E2%80%93_2017_gif.gif 2
  • 3. 41% have experienced personal attacks 66% observed attacks directed towards others 47% reported a decrease in contribution and engagement levels 3 Online Harassment at Wikipedia Pew Research Centre Survey (America centric)
  • 4. Human Driven Process Failure to identify behavior Bias to overlook behavior Inability to understand context and scope of Our goal was to use machine learning to develop a model that can detect abusive content as well as predict problematic users in the community. To that end, we leveraged a variety of data to analyse the prevalence and nature of online harassment at scale. There is a need for a more robust process to combat harassment, one that can scale well as theWikipedia community continues to grow in its size and diversity. Problem
  • 5. • ~1.02 million unique users have been blocked Wikipedia till Oct 2018 • 91.5% registered users • 8.5% are anonymous users • Increase in user bocks in 2017, 2018 5 Block User Trends in Wikipedia
  • 6. 6 For what reasons are users getting blocked over the years?
  • 7. User Comment Text Data Repository on AWS User Blocks Identified blocked users SQL Queries Extracted user comments Python web scraper ORES score data User Activity Extracted activity statistics SQL Queries Wiki automated edit quality scores Python web scraper Google Ex: Machina Human annotated wiki user comments CSV files Data Pipeline
  • 8. Model Building Methodology • Corpus Aggregation + Annotations • Feature Extraction  Orthographical Features  Char/Word n-gram  NLTK Sentiment Analyzer  Topic Modelling  GloVe Word Embedding  FastText • Implementing Machine Learning Algorithms  XGBoost – AUC 84% (Best) • Model Tweaking 8
  • 10. User Risk Model - Early Detection of Problematic Users FEATURES Machine Learning XGBoost
  • 11. Wikipedia is full of morons. I’m done with this! Thanks! This is helpful I agree with him on the editing trend F%$@ I will delete everything you post!!! I don’t think this is relevant How is this going chicken shit 0.5 0.3 0.8 0.6 0.1 0.0 WEEKLY STATS edit count: 5 avg length: 500 active days : 4 WEEKLY STATS edit count: 7 avg length: 1230 active days : 2 WEEKLY STATS edit count: 9 avg length: 30 active days : 10 WEEKLY STATS edit count: 4 avg length: 740 active days : 7 WEEKLY STATS edit count: 3 avg length: 100 active days : 4 WEEKLY STATS edit count: 30 avg length: 510 active days : 14 11

Editor's Notes

  1. Users who indulge in problematic behavior blocked in Wikipedia. About 1.2 mil users have been blocked from 2004 to 2018 , infact the number of blocked users reached an all time high in 2018. And only ~9% of those blocked users are anonymous contradicting a widespread assumption that anonymity is a major factor when it comes to users doing such things. Wikipedia vs wikimedia Fellow students
  2. If we look at the reasons that lead to users getting blocked or the kind of toxic behavior they indulge in that leads to them getting blocked – we can see that there are some dominant ones like sock puppetry and spam, These reasons have been changing through time, Also goes on to show that harassment or toxic behavior can take multiple forms
  3. To uncover such insights and build upon them, we analyzed a variety of data made publicly available by Wikipedia. Ranging from structured data pertaining to information about blocked users and general user activity to big data textual in nature relating to user conversations on the platform. As you can see we used many methods to access the these data sources for our analysis and model building.