A presentation by UVA Data Science Institute MSDS 2019 students Charu Rawat, Arnab Sarkar, and Sameer Singh, advised by DSI professor Raf Alvarado and researcher Lane Rasberry, at the 2019 Tom Tom Applied Machine Learning Conference in Charlottesville, VA.
Empirical user studies in Semantic Web contextsCatia Pesquita
My presentation at EKAW 2018 for our position paper that argues better user studies and their reporting are needed in the Semantic Web community, and proposes a framework to design and report empirical studies.
Read the paper at: http://steffen-lohmann.de/publications/2018_EKAW_user_studies_semweb.pdf
In collaboration with participating faculty at California State University Northridge, Oviatt Library Research Fellows have developed a pilot that facilitates both course and information literacy objectives. Using Scalar, an open source semantic Web-authoring program, the Guided Resource Inquiry Tool (GRI) combines assignment prompts, digitized primary sources and digital information literacy media into a single online interface. The GRI leverages renewed pedagogical interest in ‘knowledge synthesis’ by placing documentary evidence as the point of departure for critical inquiry. The tool incorporates materials from archival and digital collections repositories to supply content for course assignments. The GRI contains links to existing archives and information literacy resources that teach students as they complete the assignment. Specific benefits include: 1) archives and information literacy instruction within an applied research context; 2) course learning through critical analysis of primary documents, 3) extended outreach for digital collections and archival repositories; 4) 24/7 access to the GRI assignment online; 5) no limits on class size, 6) integrated exposure to finding aid and secondary source databases as well as Special Collections patron protocols.
Presenter: Steve Kutay
This presentation was provided by Todd Digby and Robert Phillips of the University of Florida during the NISO Virtual Conference held on Feb 15, 2017, entitled Institutional Repositories: Ensuring Yours is Populated, Useful and Thriving.
Outline of the UCSF approach to Research Networking, which focuses on rapid iterations of adding new data sources and features to see what works, and abandon what doesn't work.
Empirical user studies in Semantic Web contextsCatia Pesquita
My presentation at EKAW 2018 for our position paper that argues better user studies and their reporting are needed in the Semantic Web community, and proposes a framework to design and report empirical studies.
Read the paper at: http://steffen-lohmann.de/publications/2018_EKAW_user_studies_semweb.pdf
In collaboration with participating faculty at California State University Northridge, Oviatt Library Research Fellows have developed a pilot that facilitates both course and information literacy objectives. Using Scalar, an open source semantic Web-authoring program, the Guided Resource Inquiry Tool (GRI) combines assignment prompts, digitized primary sources and digital information literacy media into a single online interface. The GRI leverages renewed pedagogical interest in ‘knowledge synthesis’ by placing documentary evidence as the point of departure for critical inquiry. The tool incorporates materials from archival and digital collections repositories to supply content for course assignments. The GRI contains links to existing archives and information literacy resources that teach students as they complete the assignment. Specific benefits include: 1) archives and information literacy instruction within an applied research context; 2) course learning through critical analysis of primary documents, 3) extended outreach for digital collections and archival repositories; 4) 24/7 access to the GRI assignment online; 5) no limits on class size, 6) integrated exposure to finding aid and secondary source databases as well as Special Collections patron protocols.
Presenter: Steve Kutay
This presentation was provided by Todd Digby and Robert Phillips of the University of Florida during the NISO Virtual Conference held on Feb 15, 2017, entitled Institutional Repositories: Ensuring Yours is Populated, Useful and Thriving.
Outline of the UCSF approach to Research Networking, which focuses on rapid iterations of adding new data sources and features to see what works, and abandon what doesn't work.
Scholarometer (scholarometer.indiana.edu) is a citation-based impact analysis tool. This work was presented at Altmetrics 2014 held at Websci 2014 conference at Bloomington, Indiana.
Accessibility Compliance: One State, Two ApproachesNASIG
Accessibility compliance is a growing concern for academic institutions as it pertains to instructional materials on websites, course management systems, and in course documents. This extends to materials provided by academic libraries such as electronic resources. This presentation will discuss the approaches that both systems governing Tennessee public colleges and universities are using to ensure that vendors are compliant with standards as described in WCAG 2.0, EPUB 3, and Section 508 of the Rehabilitation Act of 1973.
The session will be divided into three parts as follows:
Introduction to the difference between accessibility and accommodation. Discussion of the types of disabilities of which librarians should be aware when acquiring and assessing different electronic resources. Brief mention of the laws and standards related to accessibility compliance.
An overview of the University of Tennessee System’s approach to encouraging accessibility compliance by incorporating detailed conformance language into licenses with the vendors and publishers of electronic and information technology.
A discussion of the Tennessee Board of Regents system’s approach to encouraging accessibility compliance by conducting an accessibility audit of resources held in common among the system’s libraries and through a collaborative process of compliance document collection from vendors/publishers and sharing in an AIMT (Accessible Instructional Materials and Technology) database. An introduction to the different types of documents and their content: Accessibility Statement, Voluntary Product Accessibility Template (VPAT), WCAG 2.0 (Web Content Accessibility Guidelines) Checklist, EPUB 3 Accessibility Checklist, and a Conformance and Remediation Form.
Stephanie J. Adams
Electronic Resources Librarian, Tennessee Tech University
Ms. Adams is the Electronic Resources Librarian at Tennessee Tech University where she is responsible for the acquisition and set-up of all electronic resources at the Volpe Library.
Corey S. Halaychik
The University of Tennessee, Knoxville
Licensing guy, negotiator of master agreements at the University of Tennessee Libraries, and co-chair of The Collective, I work to make libraries more efficient, saving time and money for institutions and the people they serve.
Jennifer Mezick
Pellissippi State Community College
Acquisitions and Collection Development Librarian at Pellissippi State Community College in Knoxville, TN. In addition to these roles, I manage the libraries' electronic resources and website, and provide instruction and research support to students and faculty.
Research Summaries: An Evolving Tool in the KMb Tool BoxShawna Reibling
Presented by Anne Bergen, Shasta Carr-Harris Jason Guriel, Michael Johnny, Shawna Reibling at the KMb Forum 2013.
This presentation discusses the experiences using the research snapshot format at four different institutions
Presentation slides on Open Science and research reproducibility. Presented by Gareth Knight (LSHTM Research Data Manager) on 18th September 2018, as part of an Open Science event for LSHTM Week 2018.
Presentation slides from Charleston Library Conference, November 10, 2017 on the Resource Access in the 21st Century Initiative #RA21 presented by Todd Carpenter, Robert Kelshian, Don Hemparian and Ann Gabrail.
OCLC Research Update at ALA Chicago. June 26, 2017.OCLC
Rachel Frick, OCLC Executive Director of the OCLC Research Library Partnership, reviews some of the broad agenda items and recent publications related to the work of OCLC Research. Rachel is then joined for two presentations on specific research topics. First, Sharon Streams (OCLC Director of WebJunction) and Monika Sengul-Jones (OCLC Wikipedian-in-Residence) present on “Public Libraries and Wikipedia.” Next, Kenning Arlitsch (Dean, Montana State University Library) and Jeff Mixter (OCLC Senior Software Engineer) share their findings on “Accurate Institutional Repository Download Measurement using RAMP, the Repository Analytics and Metrics Portal.”
NSF Workshop Data and Software Citation, 6-7 June 2016, Boston USA, Software Panel
FIndable, Accessible, Interoperable, Reusable Software and Data Citation: Europe, Research Objects, and BioSchemas.org
OSFair2017 Workshop | Frontiers’ Ambition for Open ScienceOpen Science Fair
Frederick Fenter presenting the Frontiers' platform | OSFair2017 Workshop
Workshop title: Open Access Models & Platforms
Workshop overview:
What are the emerging models of Open Access for publications? Who should be involved? How are costs distributed over the stakeholders involved? How can OA platforms innovate further to embrace Open Science? This workshop will discuss and showcase the range of models available, including their costs and organisational aspects, to discuss their relative strengths and weaknesses in different academic contexts.
When: DAY 1 - PARALLEL SESSION 1 & 2
SGCI - The Science Gateways Community Institute: International Collaboration ...Sandra Gesing
Science gateways - also called virtual research environments or virtual labs - allow science and engineering communities to access shared data, software, computing services, instruments, and other resources specific to their disciplines. The US Science Gateways Community Institute (SGCI), opened in August 2016, provides free resources, services, experts, and ideas for creating and sustaining science gateways. It offers five areas of services to the science gateway developer and user communities: the Incubator, Extended Developer Support, the Scientific Software Collaborative, Community Engagement and Exchange, and Workforce Development. While all these services are available to US-based communities, the Incubator, the Scientific Software Collaborative and the Community Engagement and Exchange serve also the international communities. SGCI aims at supporting beyond borders on international scale with diverse measures and to form and deepen collaborations with partner organizations and coalitions beneficial and/or related to the science gateways community. Research topics are independent of national borders and researchers spread worldwide can benefit from each other’s research results, software, data and from lessons learned — via online materials and publications or at international events. The gateway community has benefitted from this type of exchange for years and one mission of SGCI is to support the international community. This talk will present related work describing the benefits of international collaborations generally, and specifically as they relate to science gateways. It will go into detail regarding SGCI’s ongoing work on an international scale and SGCI's work planned in the near future to foster collaborations under consideration of challenges such as different timezones and long distances between collaborators.
Assessing Your Library Website: Using User Research Methods and Other ToolsRachel Vacek
This is a presentation given to the Oklahoma chapter of the Association of College and Research Libraries. It's about using web analytics and content audits as well as a variety of user research methods to better understand your users and assess and improve your website.
Scholarometer (scholarometer.indiana.edu) is a citation-based impact analysis tool. This work was presented at Altmetrics 2014 held at Websci 2014 conference at Bloomington, Indiana.
Accessibility Compliance: One State, Two ApproachesNASIG
Accessibility compliance is a growing concern for academic institutions as it pertains to instructional materials on websites, course management systems, and in course documents. This extends to materials provided by academic libraries such as electronic resources. This presentation will discuss the approaches that both systems governing Tennessee public colleges and universities are using to ensure that vendors are compliant with standards as described in WCAG 2.0, EPUB 3, and Section 508 of the Rehabilitation Act of 1973.
The session will be divided into three parts as follows:
Introduction to the difference between accessibility and accommodation. Discussion of the types of disabilities of which librarians should be aware when acquiring and assessing different electronic resources. Brief mention of the laws and standards related to accessibility compliance.
An overview of the University of Tennessee System’s approach to encouraging accessibility compliance by incorporating detailed conformance language into licenses with the vendors and publishers of electronic and information technology.
A discussion of the Tennessee Board of Regents system’s approach to encouraging accessibility compliance by conducting an accessibility audit of resources held in common among the system’s libraries and through a collaborative process of compliance document collection from vendors/publishers and sharing in an AIMT (Accessible Instructional Materials and Technology) database. An introduction to the different types of documents and their content: Accessibility Statement, Voluntary Product Accessibility Template (VPAT), WCAG 2.0 (Web Content Accessibility Guidelines) Checklist, EPUB 3 Accessibility Checklist, and a Conformance and Remediation Form.
Stephanie J. Adams
Electronic Resources Librarian, Tennessee Tech University
Ms. Adams is the Electronic Resources Librarian at Tennessee Tech University where she is responsible for the acquisition and set-up of all electronic resources at the Volpe Library.
Corey S. Halaychik
The University of Tennessee, Knoxville
Licensing guy, negotiator of master agreements at the University of Tennessee Libraries, and co-chair of The Collective, I work to make libraries more efficient, saving time and money for institutions and the people they serve.
Jennifer Mezick
Pellissippi State Community College
Acquisitions and Collection Development Librarian at Pellissippi State Community College in Knoxville, TN. In addition to these roles, I manage the libraries' electronic resources and website, and provide instruction and research support to students and faculty.
Research Summaries: An Evolving Tool in the KMb Tool BoxShawna Reibling
Presented by Anne Bergen, Shasta Carr-Harris Jason Guriel, Michael Johnny, Shawna Reibling at the KMb Forum 2013.
This presentation discusses the experiences using the research snapshot format at four different institutions
Presentation slides on Open Science and research reproducibility. Presented by Gareth Knight (LSHTM Research Data Manager) on 18th September 2018, as part of an Open Science event for LSHTM Week 2018.
Presentation slides from Charleston Library Conference, November 10, 2017 on the Resource Access in the 21st Century Initiative #RA21 presented by Todd Carpenter, Robert Kelshian, Don Hemparian and Ann Gabrail.
OCLC Research Update at ALA Chicago. June 26, 2017.OCLC
Rachel Frick, OCLC Executive Director of the OCLC Research Library Partnership, reviews some of the broad agenda items and recent publications related to the work of OCLC Research. Rachel is then joined for two presentations on specific research topics. First, Sharon Streams (OCLC Director of WebJunction) and Monika Sengul-Jones (OCLC Wikipedian-in-Residence) present on “Public Libraries and Wikipedia.” Next, Kenning Arlitsch (Dean, Montana State University Library) and Jeff Mixter (OCLC Senior Software Engineer) share their findings on “Accurate Institutional Repository Download Measurement using RAMP, the Repository Analytics and Metrics Portal.”
NSF Workshop Data and Software Citation, 6-7 June 2016, Boston USA, Software Panel
FIndable, Accessible, Interoperable, Reusable Software and Data Citation: Europe, Research Objects, and BioSchemas.org
OSFair2017 Workshop | Frontiers’ Ambition for Open ScienceOpen Science Fair
Frederick Fenter presenting the Frontiers' platform | OSFair2017 Workshop
Workshop title: Open Access Models & Platforms
Workshop overview:
What are the emerging models of Open Access for publications? Who should be involved? How are costs distributed over the stakeholders involved? How can OA platforms innovate further to embrace Open Science? This workshop will discuss and showcase the range of models available, including their costs and organisational aspects, to discuss their relative strengths and weaknesses in different academic contexts.
When: DAY 1 - PARALLEL SESSION 1 & 2
SGCI - The Science Gateways Community Institute: International Collaboration ...Sandra Gesing
Science gateways - also called virtual research environments or virtual labs - allow science and engineering communities to access shared data, software, computing services, instruments, and other resources specific to their disciplines. The US Science Gateways Community Institute (SGCI), opened in August 2016, provides free resources, services, experts, and ideas for creating and sustaining science gateways. It offers five areas of services to the science gateway developer and user communities: the Incubator, Extended Developer Support, the Scientific Software Collaborative, Community Engagement and Exchange, and Workforce Development. While all these services are available to US-based communities, the Incubator, the Scientific Software Collaborative and the Community Engagement and Exchange serve also the international communities. SGCI aims at supporting beyond borders on international scale with diverse measures and to form and deepen collaborations with partner organizations and coalitions beneficial and/or related to the science gateways community. Research topics are independent of national borders and researchers spread worldwide can benefit from each other’s research results, software, data and from lessons learned — via online materials and publications or at international events. The gateway community has benefitted from this type of exchange for years and one mission of SGCI is to support the international community. This talk will present related work describing the benefits of international collaborations generally, and specifically as they relate to science gateways. It will go into detail regarding SGCI’s ongoing work on an international scale and SGCI's work planned in the near future to foster collaborations under consideration of challenges such as different timezones and long distances between collaborators.
Assessing Your Library Website: Using User Research Methods and Other ToolsRachel Vacek
This is a presentation given to the Oklahoma chapter of the Association of College and Research Libraries. It's about using web analytics and content audits as well as a variety of user research methods to better understand your users and assess and improve your website.
Automatic detection of online abuse and analysis of problematic users in wiki...Melissa Moody
For their 2019 capstone project, DSI Master of Science in Data Science students Charu Rawat, Arnab Sarkar, and Sameer Singh proposed a framework to understand and detect such abuse in the English Wikipedia community.
Rawat, Sarkar, and Singh received the award for Best Paper in the Data Science for Society category at the 2019 Systems & Information Design Symposium (SIEDS). In "Automatic Detection of Online Abuse and Analysis of Problematic Users in Wikipedia," the team presented an analysis of user misconduct in Wikipedia and a system for the automated early detection of inappropriate behavior.
This presentation was provided during the NISO Working Group Connection Live event held on April 30, 2019. Speakers include Todd Carpenter, Ralph Youngen, and Chris Shillum.
Connecting citizens with public data to drive policy changeMelissa Moody
UVA Data Science Institute Master of Science in Data Science researchers Lucas Beane and Elena Gillis undertook a capstone project to investigate possible reasons for the stagnation of the Charlottesville Open Data Portal.
Data Collection Methods for Building a Free Response Training SimulationMelissa Moody
Master of Science in Data Science capstone project researchers Vaibhav Sharma, Beni Shpringer, and Michael Yang, along with UVA School of Engineering M.S. student Martin Bolger and Ph.D. students Sodiq Adewole and Erfaneh Gharavi, sought to develop new methods for collecting, generating, and labeling data to aid in the creation of educational, free-input dialogue simulations.
Improving Credit Card Fraud Detection: Using Machine Learning to Profile and ...Melissa Moody
Researchers Navin Kasa, Andrew Dahbura, and Charishma Ravoori undertook a capstone project—part of the UVA Data Science Institute Master of Science in Data Science program—that addresses credit card fraud detection through a semi-supervised approach, in which clusters of account profiles are created and used for modeling classifiers.
Deep Learning Meets Biology: How Does a Protein Helix Know Where to Start and...Melissa Moody
UVA Data Science Institute Master of Science in Data Science students Sean Mullane, Ruoyan Chen and Sri Vaishnavi Vemulapalli were motivated to apply data science tools and techniques to the problem, and see if protein structures can be quantitatively described, compared and otherwise analyzed in a more robust, efficient and automated manner. Potential applications include more effectively designed drugs to inhibit disease-related proteins, or even newly engineered ones.
The researchers received the award for Best Paper in the Data Science for Health category at the 2019 Systems & Information Design Symposium (SIEDS) meeting. Their project, "Machine Learning for Classification of Protein Helix Capping Motifs," focused on small segments of a protein called secondary structural elements. These structural elements are the basic molecular-scale building blocks that all proteins—and therefore life—build upon.
Plans for the University of Virginia School of Data ScienceMelissa Moody
The University of Virginia, through the largest gift in the University’s history, has the opportunity to play a national and international leadership role in data science training, research, and service by expanding the already successful Data Science Institute (DSI) to become a School of Data Science (SDS). When first presented to then President-elect James Ryan, he pointed out that a gift alone does not make a school. Particular concerns were sustainability and the impact on other schools of the University. Throughout 2018 and early 2019, we have crafted a proposal for the SDS that is financially and academically sustainable and that works in concert with all schools to enrich every student’s experience at a time when our society is increasingly data driven.
Balanced Datasets Are Not Enough: Estimating and Mitigating Gender Bias in De...Melissa Moody
A presentation by UVA Data Science Institute 2019-20 Presidential Fellow in Data Science Tianlu Wang, at the 2019 Tom Tom Applied Machine Learning Conference in Charlottesville, VA. Learn more at datascience.virginia.edu.
Collective Biographies of Women: A Deep Learning Approach to Paragraph Annota...Melissa Moody
A presentation by UVA Data Science Institute MSDS 2019 students Sakshi Jawarani, Murugesan Ramakrishnan, and Varshini Sriram, advised by MSDS Program Director and professor Rafael Alvarado, at the 2019 Tom Tom Applied Machine Learning Conference in Charlottesville, VA.
Ethical Priniciples for the All Data RevolutionMelissa Moody
A presentation by Stephanie Shipp, from the Research Highlights session at the 2019 Women in Data Science Charlottesville Conference. Hosted by the UVA Data Science Institute.
Assessing the reproducibility of DNA microarray studiesMelissa Moody
A presentation by Eva Lancaster, from the Research Highlights session at the 2019 Women in Data Science Charlottesville Conference. Hosted by the UVA Data Science Institute.
Modeling the Impact of R & Python Packages: Dependency and Contributor NetworksMelissa Moody
A presentation by Gizem Korkmaz, from the Research Highlights session at the 2019 Women in Data Science Charlottesville Conference. Hosted by the UVA Data Science Institute.
How to Beat the House: Predicting Football Results with Hyperparameter Optimi...Melissa Moody
UVA Data Science Institute MSDS student Abhimanyu Roy ('18) presented a talk at the 2018 Tom Tom Applied Machine Learning Conference in Charlottesville, Va. His presentation highlights how data science can be used to predict results in sporting events.
Learn more about Abhimanyu at https://dsi.virginia.edu/people/abhimanyu-roy.
A Modified K-Means Clustering Approach to Redrawing US Congressional DistrictsMelissa Moody
UVA Data Science Institute MSDS student Jack Prominski ('18) presented a talk at the 2018 Tom Tom Applied Machine Learning Conference in Charlottesville, Va. His talk highlights how data science can create a more equitable redistricting process.
Learn more about Jack at https://dsi.virginia.edu/people/jack-prominski.
Joining Separate Paradigms: Text Mining & Deep Neural Networks to Character...Melissa Moody
UVA Data Science Institute MSDS students Caitlin Dreisbach ('18), Morgan Wall ('18), and Ali Zaidi ('18) presented a talk based on their capstone research project, part of the MSDS program, at the 2018 Tom Tom Applied Machine Learning Conference in Charlottesville, Va.
Learn more about the project at https://dsi.virginia.edu/projects/connecting-mind-and-body.
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...John Andrews
SlideShare Description for "Chatty Kathy - UNC Bootcamp Final Project Presentation"
Title: Chatty Kathy: Enhancing Physical Activity Among Older Adults
Description:
Discover how Chatty Kathy, an innovative project developed at the UNC Bootcamp, aims to tackle the challenge of low physical activity among older adults. Our AI-driven solution uses peer interaction to boost and sustain exercise levels, significantly improving health outcomes. This presentation covers our problem statement, the rationale behind Chatty Kathy, synthetic data and persona creation, model performance metrics, a visual demonstration of the project, and potential future developments. Join us for an insightful Q&A session to explore the potential of this groundbreaking project.
Project Team: Jay Requarth, Jana Avery, John Andrews, Dr. Dick Davis II, Nee Buntoum, Nam Yeongjin & Mat Nicholas
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...pchutichetpong
M Capital Group (“MCG”) expects to see demand and the changing evolution of supply, facilitated through institutional investment rotation out of offices and into work from home (“WFH”), while the ever-expanding need for data storage as global internet usage expands, with experts predicting 5.3 billion users by 2023. These market factors will be underpinned by technological changes, such as progressing cloud services and edge sites, allowing the industry to see strong expected annual growth of 13% over the next 4 years.
Whilst competitive headwinds remain, represented through the recent second bankruptcy filing of Sungard, which blames “COVID-19 and other macroeconomic trends including delayed customer spending decisions, insourcing and reductions in IT spending, energy inflation and reduction in demand for certain services”, the industry has seen key adjustments, where MCG believes that engineering cost management and technological innovation will be paramount to success.
MCG reports that the more favorable market conditions expected over the next few years, helped by the winding down of pandemic restrictions and a hybrid working environment will be driving market momentum forward. The continuous injection of capital by alternative investment firms, as well as the growing infrastructural investment from cloud service providers and social media companies, whose revenues are expected to grow over 3.6x larger by value in 2026, will likely help propel center provision and innovation. These factors paint a promising picture for the industry players that offset rising input costs and adapt to new technologies.
According to M Capital Group: “Specifically, the long-term cost-saving opportunities available from the rise of remote managing will likely aid value growth for the industry. Through margin optimization and further availability of capital for reinvestment, strong players will maintain their competitive foothold, while weaker players exit the market to balance supply and demand.”
Wikimedia Foundation, Trust & Safety: Cyber Harassment Classification and Prediction of User Blocks in Wikipedia
1. UVA DSI Capstone Project
Wikimedia Foundation, Trust & Safety
Cyber Harassment Classification and Prediction of User Blocks in Wikipedia
Students - Charu Rawat , Arnab Sarkar, Sameer Singh
Faculty Advisor – Rafael Alvarado
University of Virginia Master’s in Data Science Program
Clients – Patrick Earley , Sydney Poore
Advisor – Lane Rasberry
Trust and Safety Team, Wikimedia Foundation
1
2. The English Wikipedia over the last few years…
• Popularity: Alexa Rank
#5
• Total Edits: 868,717,569
• All Pages: 46,612,179
• Articles 5,767,464
• Registered Users:
35,185,308
• Administrators: ~1,193
https://commons.wikimedia.org/wiki/File:Wikidata_Items_Map_2014_%E2%80%93_2017_gif.gif
2
3. 41% have experienced
personal attacks
66% observed attacks
directed towards others
47% reported a decrease in
contribution and engagement levels
3
Online Harassment at Wikipedia
Pew Research Centre Survey (America centric)
4. Human Driven
Process
Failure to identify
behavior
Bias to overlook
behavior
Inability to
understand context
and scope of
Our goal was to use machine learning to develop a model that can detect abusive content as well as predict
problematic users in the community.
To that end, we leveraged a variety of data to analyse the prevalence and nature of online harassment at scale.
There is a need for a more robust process to combat
harassment, one that can scale well as theWikipedia
community continues to grow in its size and diversity.
Problem
5. • ~1.02 million unique
users have been blocked
Wikipedia till Oct 2018
• 91.5% registered users
• 8.5% are anonymous
users
• Increase in user bocks in
2017, 2018
5
Block User Trends in Wikipedia
7. User Comment Text
Data Repository
on AWS
User Blocks
Identified blocked
users
SQL Queries
Extracted user
comments
Python web scraper
ORES score data
User Activity
Extracted activity
statistics
SQL Queries
Wiki automated
edit quality scores
Python web scraper
Google Ex: Machina
Human annotated wiki
user comments
CSV files
Data Pipeline
8. Model Building Methodology
• Corpus Aggregation + Annotations
• Feature Extraction
Orthographical Features
Char/Word n-gram
NLTK Sentiment Analyzer
Topic Modelling
GloVe Word Embedding
FastText
• Implementing Machine Learning Algorithms
XGBoost – AUC 84% (Best)
• Model Tweaking
8
10. User Risk Model - Early Detection of Problematic Users
FEATURES
Machine
Learning
XGBoost
11. Wikipedia is
full of
morons. I’m
done with
this!
Thanks! This
is helpful
I agree with
him on the
editing trend
F%$@ I will
delete
everything
you post!!!
I don’t think
this is
relevant
How is this
going chicken
shit
0.5
0.3
0.8
0.6
0.1
0.0
WEEKLY STATS
edit count: 5
avg length: 500
active days : 4
WEEKLY STATS
edit count: 7
avg length: 1230
active days : 2
WEEKLY STATS
edit count: 9
avg length: 30
active days : 10
WEEKLY STATS
edit count: 4
avg length: 740
active days : 7
WEEKLY STATS
edit count: 3
avg length: 100
active days : 4
WEEKLY STATS
edit count: 30
avg length: 510
active days : 14
11
Users who indulge in problematic behavior blocked in Wikipedia. About 1.2 mil users have been blocked from 2004 to 2018 , infact the number of blocked users
reached an all time high in 2018. And only ~9% of those blocked users are anonymous contradicting a widespread assumption that anonymity is a major factor when it comes to users doing such things.
Wikipedia vs wikimedia
Fellow students
If we look at the reasons that lead to users getting blocked or the kind of toxic behavior they indulge in that leads to them getting blocked – we can see that
there are some dominant ones like sock puppetry and spam, These reasons have been changing through time,
Also goes on to show that harassment or toxic behavior can take multiple forms
To uncover such insights and build upon them, we analyzed a variety of data made publicly available by Wikipedia. Ranging from structured data pertaining to information about blocked users and general user activity to big data textual in nature relating to user conversations on the platform.
As you can see we used many methods to access the these data sources for our analysis and model building.