Fine-Grained Controversy Detection on Wikipedia

The advent of Web 2.0 gave birth to a new kind of application where content is generated through the collaborative contribution of many different users. This form of content generation is believed to produce data of higher quality since the “wisdom of the crowds” makes its way into the data. However, as is generally the case in real life, there are many issues for which there is no generally accepted opinion. These issues are characterised as controversial. Knowing these issues when reading user-generated content is of major importance in understanding the quality of the data and the trust that should be placed in it. In this work we describe a technique that finds these controversial issues by analyzing the edits that have been performed on the data over time. We apply our technique to Wikipedia, the world’s largest known collaboratively generated database, and report our findings.

Fine-Grained Controversy Detection on Wikipedia

  1. Fine-Grained Controversy Detection in Wikipedia
     Siarhei Bykau (Purdue University), Flip Korn (Google Research), Divesh Srivastava (AT&T Labs-Research), Yannis Velegrakis (University of Trento)
  2. Wikipedia: The Wisdom of Crowds
     ● Collaborative Content Creation [Giles 2005]
       – Up-to-date
       – Pluralistic
       – Neutral point of view
     ● Data Quality Problems:
       – Reputation & Trust [Adler and de Alfaro 2007, Adler et al. 2008]
       – Vandalism [Chin et al. 2010, Potthast et al. 2008, Smets et al. 2008]
       – Stability [Druck et al. 2008]
       – Controversy
  3. Controversy
     ● A prolonged dispute by a number of people on the same topic *
     ● Should be distinguished from:
       – regular edits
       – vandalism
     ● Helps in
       – preserving a neutral point of view (NPOV)
       – requesting supporting evidence
     * http://en.wikipedia.org/wiki/Controversy
  4. Arab-Israeli Conflict
     ● Sensitive page, rife with controversial content
       – Number of casualties, Israeli per-capita GDP, etc.
  5. The Beatles
     ● Non-sensitive page, with controversial content
       – Should it be “The Beatles” or “the Beatles”?
  6. Caesar salad
  7. Controversy Detection: Related Work
     ● machine learning [Kittur et al. 2007]
       ● # of revisions, # of unique authors, page length
     ● mutual reinforcement principle [Vuong et al. 2008]
       ● content is more controversial if the page’s controversy is low
     ● bipolarities in the edit graph [Sepehri Rad and Barbosa 2011]
       ● nodes = authors
       ● edges = one author deletes/reverts content written by another
     ● revert statistics [Yasseri et al. 2012]
       ● number of authors who revert an article back to a previous version
  8. Controversy Detection: Related Work
     ● None of these methods work for fine-grained controversies
       – WHERE a controversy is located
       – WHO is involved in a controversy
       – WHEN a controversy occurred
       – WHAT are the arguments of a controversy
  9. Caesar salad
     ● Previous work only detects that the Caesar salad page is controversial
     Revision 1: The history of this popular salad is a controversial issue, even in the spelling of the name. There is a widely held misconception that it is named after [[Julius Caesar]], but the salad's creation is generally attributed to restaurateur '''[[Cesar Cardini]]''' (an [[Italy|Italian]]-born Mexican). As his daughter Rosa (1928–2003) reported,[2] her father invented the dish when a Fourth of July 1924 rush depleted the kitchen's supplies. Cardini made do with what he had, adding the dramatic flair of the table-side tossing "by the chef".
     Revision 2: The history of this popular salad is a controversial issue, even in the spelling of the name. There is a widely held misconception that it is named after '''[[Cesar Cardini]]''', but the salad's creation is generally attributed to [[Julius Caesar]] (an [[Italy|Italian]]-born emperor). As his daughter Rosa (1928–2003) reported,[2] her father invented the dish when a Fourth of July 1924 rush depleted the kitchen's supplies. Cardini made do with what he had, adding the dramatic flair of the table-side tossing "by the chef".
     - What are the different alternatives?
     - When did the controversy occur?
     - Who created the salad?
     - After whom is it named?
  10. Challenge: Fine-grained Controversies
      ● Controversies are typically expressed via substitutions (see the sketch below)
        – Not insertions/deletions
        – Alternating content
      ...There is a widely held misconception that it is named after [[Julius Caesar]], but the salad's creation is generally attributed to restaurateur '''[[Cesar Cardini]]''' (an [[Italy|Italian]]-born Mexican). As his daughter Rosa (1928–2003) reported,...
      ...There is a widely held misconception that it is named after '''[[Cesar Cardini]]''', but the salad's creation is generally attributed to [[Julius Caesar]] (an [[Italy|Italian]]-born emperor). As his daughter Rosa (1928–2003) reported,...
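A minimal sketch of substitution extraction over tokenized revisions, using the Caesar salad excerpt above. Python's difflib is a stand-in for the authors' diff (Myers' algorithm, mentioned on slide 14); the variable names are illustrative.

```python
# Sketch: extract substitution edits (old span replaced by new span) between two
# tokenized revisions. The "replace" opcodes are the substitutions; pure
# insertions/deletions are ignored, matching the slide's distinction.
from difflib import SequenceMatcher

def extract_substitutions(old_tokens, new_tokens):
    """Return (position, old_span, new_span) triples where content was replaced."""
    subs = []
    for tag, i1, i2, j1, j2 in SequenceMatcher(None, old_tokens, new_tokens).get_opcodes():
        if tag == "replace":
            subs.append((i1, old_tokens[i1:i2], new_tokens[j1:j2]))
    return subs

old = "named after [[Julius Caesar]] , attributed to [[Cesar Cardini]]".split()
new = "named after [[Cesar Cardini]] , attributed to [[Julius Caesar]]".split()
print(extract_substitutions(old, new))
# -> [(2, ['[[Julius', 'Caesar]]'], ['[[Cesar', 'Cardini]]']),
#     (7, ['[[Cesar', 'Cardini]]'], ['[[Julius', 'Caesar]]'])]
```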
  11. Challenge: Track Topic across Revisions
      ● Positions of edits change significantly across revisions
      ● Text is ambiguous
      ● Surrounding context of an edit clarifies its semantics
        – Edits with the same or similar context likely refer to the same topic (see the sketch below)
      ...There is a widely held misconception that it is named after [[Julius Caesar]], but the salad's creation is generally attributed to restaurateur '''[[Cesar Cardini]]''' (an [[Italy|Italian]]-born Mexican). As his daughter Rosa (1928–2003) reported,...
      ...There is a widely held misconception that it is named after '''[[Cesar Cardini]]''', but the salad's creation is generally attributed to [[Julius Caesar]] (an [[Italy|Italian]]-born emperor). As his daughter Rosa (1928–2003) reported,...
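A small sketch of the context idea, assuming a token-window context and Jaccard overlap; the slide does not spell out the exact context definition or similarity measure, and the offsets in the example are hypothetical.

```python
# Sketch: decide whether two edits refer to the same topic by comparing the
# tokens that surround them (their context), even when their positions differ.
def context(tokens, start, end, radius=8):
    """Tokens within `radius` positions before and after the edited span."""
    return set(tokens[max(0, start - radius):start] + tokens[end:end + radius])

def jaccard(a, b):
    return len(a & b) / len(a | b) if a or b else 1.0

rev1 = "it is named after [[Julius Caesar]] but the salad's creation".split()
rev2 = "some text moved here ; it is named after [[Cesar Cardini]] but the salad's creation".split()
# The disputed link occupies positions 4-6 in rev1 and 9-11 in rev2 (hypothetical offsets).
sim = jaccard(context(rev1, 4, 6), context(rev2, 9, 11))
print(round(sim, 2))  # -> 0.67; edits whose context similarity exceeds a threshold are grouped
```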
  12. Challenge: Distinguish from Other Edits
      ● Cardinality
        – # of edits
      ● Duration
        – Lifespan of a controversy
      ● Plurality
        – # of distinct authors
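A sketch of the three signals named on the slide above, computed over a hypothetical cluster of edits; the Edit record is illustrative, not the paper's data model.

```python
# Sketch: score a cluster of edits by the three signals used to tell
# controversies apart from ordinary edits.
from dataclasses import dataclass
from datetime import datetime

@dataclass
class Edit:                 # illustrative record, not the paper's data model
    author: str
    timestamp: datetime

def cardinality(cluster):   # number of edits in the cluster
    return len(cluster)

def duration(cluster):      # lifespan of the controversy
    times = [e.timestamp for e in cluster]
    return max(times) - min(times)

def plurality(cluster):     # number of distinct authors involved
    return len({e.author for e in cluster})

cluster = [Edit("A", datetime(2010, 1, 1)), Edit("B", datetime(2010, 3, 1)),
           Edit("A", datetime(2011, 1, 1))]
print(cardinality(cluster), duration(cluster).days, plurality(cluster))  # -> 3 365 2
```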
  13. Challenge: Variability of Text Content
      ● sequence of wiki links, not words (see the sketch below)
        – Link -> semantic concept
        – Wikipedia encourages a high density of wiki links
      Revision 1 links: olive oil, Worcestershire sauce, Julius Caesar, Cesar Cardini, Italy, Mexican, Hollywood
      Revision 2 links: olive oil, Worcestershire sauce, Caesar Cadini, Julius Caesar, Caesar Cadini, Italy, Hollywood
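A minimal sketch of how a revision can be reduced to its sequence of wiki links for the link model, assuming a simple regular expression over wikitext; real wikitext parsing (e.g. with JWPL, mentioned on slide 15) is more involved.

```python
# Sketch: reduce wikitext to its sequence of wiki links, keeping the link
# target (the part before '|') as the semantic concept.
import re

WIKILINK = re.compile(r"\[\[([^\]|]+)(?:\|[^\]]*)?\]\]")

def link_sequence(wikitext):
    return [m.group(1).strip() for m in WIKILINK.finditer(wikitext)]

text = ("...named after [[Julius Caesar]], but the salad's creation is generally "
        "attributed to '''[[Cesar Cardini]]''' (an [[Italy|Italian]]-born Mexican).")
print(link_sequence(text))  # -> ['Julius Caesar', 'Cesar Cardini', 'Italy']
```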
  14. Challenge: Large Number of Revisions
      ● 4.5 million content pages, about 100 million revisions, 7 TB of data
      ● scalable controversy detection algorithm (CDA)
      ● Input: a Wikipedia page with its revision history
        – Edit extraction // use Myers' algorithm, find substitutions
        – Eliminate edits with low user support
        – Cluster edits based on context // use DBSCAN for efficiency
        – Cluster and merge the sets of edits based on the subject
      ● Output: ranked clusters of edits which represent controversies (see the sketch below)
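A hedged end-to-end sketch of the CDA steps listed on the slide, for a single page's revision history. It substitutes difflib for Myers' diff and a simple single-link grouping for DBSCAN, and it approximates "user support" as the number of distinct authors per cluster; all names are illustrative and the thresholds follow the defaults on the next slide.

```python
# Sketch of the CDA steps for one page (illustrative stand-ins, not the
# authors' Java implementation).
from difflib import SequenceMatcher

def substitutions(old, new):
    ops = SequenceMatcher(None, old, new).get_opcodes()
    return [(i1, i2, tuple(new[j1:j2])) for tag, i1, i2, j1, j2 in ops if tag == "replace"]

def context(tokens, i1, i2, radius=8):
    return set(tokens[max(0, i1 - radius):i1] + tokens[i2:i2 + radius])

def jaccard(a, b):
    return len(a & b) / len(a | b) if a or b else 1.0

def detect_controversies(revisions, authors, sim=0.75, min_authors=2):
    """revisions: list of token lists; authors[k] is the author of revision k."""
    # 1. Edit extraction: substitutions between consecutive revisions.
    edits = []
    for k in range(1, len(revisions)):
        old = revisions[k - 1]
        for i1, i2, repl in substitutions(old, revisions[k]):
            edits.append({"author": authors[k], "ctx": context(old, i1, i2), "new": repl})
    # 2. Cluster edits whose surrounding contexts are similar (single-link grouping).
    clusters = []
    for e in edits:
        for c in clusters:
            if any(jaccard(e["ctx"], other["ctx"]) >= sim for other in c):
                c.append(e)
                break
        else:
            clusters.append([e])
    # 3. Eliminate clusters with low user support (too few distinct authors).
    clusters = [c for c in clusters if len({e["author"] for e in c}) >= min_authors]
    # 4. Rank clusters, e.g. by cardinality (duration or plurality work the same way).
    return sorted(clusters, key=len, reverse=True)

revs = ["named after [[Julius Caesar]] by many accounts".split(),
        "named after [[Cesar Cardini]] by many accounts".split(),
        "named after [[Julius Caesar]] by many accounts".split()]
print(len(detect_controversies(revs, authors=["u0", "u1", "u2"])))  # -> 1 alternating cluster
```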
  15. Experimental Evaluation Setup
      ● Dataset: English-language Wikipedia dump from December 2013
        – 4.5 million content pages, about 100 million revisions, 7 TB of data
      ● Implemented CDA in Java, used the JWPL parser to discover links
        – Baseline identifies controversies based on the number of revisions
      Parameter                  | Range         | Default Value
      model                      | link, text    | link
      radius of context          | 2, 4, 6, 8    | 8
      max tokens in substitution | 1, 2, 3, 4, 5 | 2
      context similarity         | [0...1]       | 0.75
      number of authors          | 1, 2, 3, 4, 5 | 2
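Read as configuration, the parameter table above amounts to a handful of knobs; a sketch with the listed defaults (the field names are my own, not the paper's).

```python
# Sketch: the evaluation parameters as a configuration object, using the
# default values from the table; field names are illustrative.
from dataclasses import dataclass

@dataclass
class CDAConfig:
    model: str = "link"               # "link" or "text"
    context_radius: int = 8           # radius of context around an edit
    max_tokens_in_substitution: int = 2
    context_similarity: float = 0.75  # threshold in [0, 1]
    min_authors: int = 2              # "number of authors" used as user support

print(CDAConfig())
```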
  16. Sources of Controversy
      ● Wikipedia Provided Controversies (WPC)
        – Metrics:
          ● Recall (see the sketch below)
      ● User surveys
        – Metrics:
          ● noise/signal ratio
          ● Top-1 precision
          ● # of distinct controversies
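The WPC evaluation above boils down to a recall-at-k style measurement; a sketch under the assumption that some relevance judgment `matches` decides whether a detected cluster corresponds to a known WPC controversy (the paper's exact matching criterion is not shown on these slides).

```python
# Sketch: recall@k against the Wikipedia Provided Controversies (WPC) for one
# page. `matches` is an abstract placeholder for the relevance judgment that
# links a detected cluster to a known WPC controversy.
def recall_at_k(ranked_clusters, wpc, matches, k=10):
    top = ranked_clusters[:k]
    found = {w for w in wpc if any(matches(c, w) for c in top)}
    return len(found) / len(wpc) if wpc else 0.0

# Aggregated over pages, retrieving 117 of 263 WPCs in the top-10 (slide 18)
# corresponds to a recall of 117 / 263 ≈ 0.44.
```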
  17. Recall on selected WPC
      ● Baseline – adapted from [Kittur et al. 2007]
      ● Text model has higher recall than the link model; the baseline is worst
  18. Recall on full WPC using Text Model
      ● Text model can retrieve 117 out of 263 WPCs in the top-10 results
        – Clean controversies don't have irrelevant substitutions
  19. New Previously Unknown Controversies
      page            | WPC           | New controversies
      Chopin          | nationality   | birthday, photo, name
      Avril Lavigne   | song spelling | music genre, birthplace, religion
      Bolzano         | name spelling | language
      Futurama        | verb spelling | TV, seasons, channel
      Freddie Mercury | origin        | name spelling, image
  20. Precision
      ● Link model has considerably higher precision than the text model
        – For many (cardinality, duration, plurality) ranking functions
      [Charts: precision of the Link Model vs. the Text Model]
  21. Substitutions vs Insertions/Deletions
      metric                      | link | text | link ins/del | text ins/del | baseline
      noise/signal                | 0.19 | 0.25 | 0.64         | 0.57         | 0.75
      # of distinct controversies | 65   | 80   | 29           | 25           | 17
      ● Link model with substitutions has the lowest noise/signal ratio
      ● Models with insertions/deletions have a very high noise/signal ratio
      ● Text model with substitutions finds the highest # of controversies
      ● Models with insertions/deletions find a low number of controversies
  22. Experiment Takeaways
      ● Text model with substitutions has a higher recall
        – Able to retrieve 23% more controversies among WPC
      ● Link model with substitutions has a much higher precision
        – Use of semantic concepts in wiki links doubles the precision
      ● Cardinality, duration, plurality – good ranking functions
        – Validates the definition of controversy
  23. Conclusions
      ● Detection of fine-grained controversies in Wikipedia
        – answers the Where, What, Who and When questions
      ● Link model generates more semantically meaningful controversies than the text model
      ● Experimental evaluation shows the efficiency and effectiveness of the proposed solutions
