1. WikiSym 2012
Mutual Evaluation of Editors and Texts
for Assessing Quality of Wikipedia Articles
Yu Suzuki, Nagoya University, Japan
Masatoshi Yoshikawa, Kyoto University, Japan
2. Have you ever used Wikipedia?
[Bar chart: percentage usage of Wikipedia (red) and blogs (blue) by age group, from under 18 to 65-74. Source: Oxford University SPIRE Project, "Results and analysis of Web 2.0 services survey", http://spire.conted.ox.ac.uk/]
3. Have you ever used Wikipedia?
Users under 18 and over 65 years old (novice users) use Wikipedia frequently.
[Same bar chart as the previous slide. Source: Oxford University SPIRE Project, "Results and analysis of Web 2.0 services survey", http://spire.conted.ox.ac.uk/]
4. What is the main purpose?
56% of users use Wikipedia for work and study.
But really?
5. What is the main purpose?
[Pie chart: For Study 36%, For Fun 28%, For Work 20%, Never used 8%, Never heard 8%]
56% of users use Wikipedia for work and study.
Wikipedia is trusted by many users.
But really?
6. Are Wikipedia articles high quality?
80% of all articles are low quality.
[Histogram: number of articles vs. quality degree, from low to high, calculated using our proposed method.]
7. Objectives
• Calculate quality values for articles automatically and accurately.
• For readers: readers can judge which articles are high quality.
• For editors: editors can decide which articles need to be edited.
• For administrators: administrators can decide which articles are not appropriate for Wikipedia, to keep up the quality of articles.
8. Output of Our Proposed System
[Screenshot: an article with high quality parts and low quality parts highlighted; overall Quality Value 40%.]
9. What is quality?
From the dictionary:
【Quality】the degree of excellence of something
【Credibility】the quality of being trusted and believed in
From psychology (Fogg 2003):
Trustworthiness: how many users believe something
Expertise: the expert's opinion
We use "trustworthiness" as the definition of quality.
Quality is not true or false, but how many users believe.
10. Related Work
Link analysis based method [Bellomi 2005, Chin 2011]
Identify high quality articles using HITS or PageRank.
This method can easily identify major articles, but cannot identify minor but high quality articles.
Using editor reputation [Adler 2007, Wilkinson 2007] (we use this method)
Identify which articles are high quality using the reputation of editors, as judged by the other editors.
Good point: these methods can calculate accurate quality, because editors and viewers do not directly decide text quality.
Bad point: vandals (bad editors) can easily change text quality.
14. Plan for Calculating Quality
Who evaluates?
・reader (voting)
・reader themselves (personalization)
・editor (reputation-based)
What quality do we measure?
・whole article
・a part of article
・editor
How to evaluate?
・reader's voting
・article analysis
・article edit history
15. Plan to Measure Quality
• Why do we use a reputation-based approach?
• Users' votes are not always true.
• On YouTube, almost all votes are 5 stars (the highest score).
• Why do we calculate editor quality?
• We assume that the same editor writes articles of the same quality.
• Why do we use edit history?
• Our proposed system should be language independent.
16. Overview
1. Identify editors of articles.
2. Get the edit history of each editor.
3. Calculate the text's Quality Value (QV).
4. Calculate the editor's QV.
5. Calculate the article's QV.
[Diagram: Editors A and B are identified from the edit history; QV of Editor A = 70%, QV of Editor B = 40%, QV = 60%; article quality degree 55%.]
17. Key Idea
High quality texts survive beyond multiple edits.
・if a text remains, the QV of the text ↑
・if a text is deleted, the QV of the text ↓
[Diagram: Editor A writes a part, Editor B adds a part, and Editor C deletes Editor A's part.]
18. Calculate Text's Quality Values
• A writes 100 letters.
• A's texts do not gain QV at this point: A cannot evaluate A herself.
• B deletes 20 letters and keeps A's remaining 80 letters.
• B thereby evaluates A's 80 letters as good.
• C deletes 60 letters and keeps A's remaining 20 letters.
• C thereby evaluates A's 20 letters as good.
• A's text QV = log 80 + log 20
[Graph: number of letters vs. version number; 100 letters at version 1, 80 after B's deletion, 20 after C's deletion.]
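The arithmetic on this slide can be sketched in code. This is a minimal illustration, not the authors' implementation; the function name and data layout are assumptions. Each later editor who keeps part of an author's text contributes the log of the kept letter count to that text's QV, and the author's own first version contributes nothing:

```python
import math

def text_qv(kept_letters_by_later_editors):
    """Sum log(kept letters) over each later editor who retained
    part of the author's text; the author's own version is not
    counted, since an editor cannot evaluate herself."""
    return sum(math.log(k) for k in kept_letters_by_later_editors if k > 0)

# A writes 100 letters; B keeps 80 of them, C keeps 20 of them.
qv_a = text_qv([80, 20])  # = log 80 + log 20, as on the slide
```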
19. Problem
• Editor's quality is not considered.
• C deletes A's text, so A's QV decreases.
• If C has low quality, C may delete high quality texts: A's QV should NOT be decreased.
• If C has high quality, C should delete only low quality texts: A's QV should be decreased.
[Diagram: A adds text, B adds text, C deletes A's text.]
20. Use Editor's QV for Text's QV
• If B's QV is 100%, B should delete only low quality texts.
• A's text is counted as losing 25 letters to B's deletion.
• If C's QV is 50%, C may delete high quality texts 50% of the time.
• A's text is counted as losing only 30 letters (60 letters × 50%) to C's deletion.
[Graph: letters vs. version number; without editor's QV: 100 → 80 → 20; with editor's QV: 100 → 80 → 50.]
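The discounting on this slide can be written as a one-line helper (the name is hypothetical; the slide only gives the arithmetic). A deleting editor with quality value q is trusted for a fraction q of the letters they delete:

```python
def effective_deleted(deleted_letters, deleter_qv):
    """Letters counted as actually deleted, discounted by the
    deleting editor's quality value (a fraction in [0, 1])."""
    return deleted_letters * deleter_qv

# C deletes 60 letters but has QV 50%, so only 30 letters of
# A's text are counted as deleted; a QV-100% editor counts fully.
print(effective_deleted(60, 0.5))  # 30.0
print(effective_deleted(20, 1.0))  # 20.0
```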
21. Chicken-or-the-egg problem
• The text's QV is calculated from both the edit history and the editors' QVs.
• The editor's QV is calculated from the texts' QVs.
• Editor's QV ⇆ text's QV is a chicken-or-the-egg problem.
Mutually calculate editors' and texts' QVs until they converge.
[Diagram: QV of Editor A = 70%, QV of Editor B = 40%, QV = 60%.]
22. Our proposed method
1. Identify editors of articles.
2. Get the edit history of each editor.
3. Calculate the text's QV using the editors' QVs.
• The first time, every editor's QV is set to 1 (the highest value).
4. Calculate the editor's QV.
5. If the texts' QVs and editors' QVs have not converged, return to step 3.
6. Calculate the article's QV.
[Diagram: QV of Editor A = 70%, QV of Editor B = 40%, QV = 60%.]
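Steps 3-5 form a fixed-point iteration. The toy sketch below uses hypothetical update rules (a QV-weighted average for texts and a plain mean for editors) only to show the alternation and the convergence test; the paper's actual formulas differ:

```python
def mutual_qv(evaluations, max_iters=20, eps=1e-6):
    """Alternate text-QV and editor-QV updates until convergence.

    `evaluations` maps each text id to (editor, score) pairs;
    score in [0, 1] stands in for the kept-letter evaluation.
    """
    editors = {e for evs in evaluations.values() for e, _ in evs}
    editor_qv = {e: 1.0 for e in editors}   # first pass: every editor QV is 1
    text_qv = {t: 0.0 for t in evaluations}
    for _ in range(max_iters):
        # Step 3: text QV = editor-QV-weighted average of its scores.
        for t, evs in evaluations.items():
            w = sum(editor_qv[e] for e, _ in evs)
            text_qv[t] = sum(editor_qv[e] * s for e, s in evs) / w if w else 0.0
        # Step 4: editor QV = mean QV of the texts the editor evaluated.
        new_qv = {}
        for e in editors:
            touched = [text_qv[t] for t, evs in evaluations.items()
                       if any(ed == e for ed, _ in evs)]
            new_qv[e] = sum(touched) / len(touched)
        # Step 5: repeat from step 3 unless the editor QVs have converged.
        converged = max(abs(new_qv[e] - editor_qv[e]) for e in editors) < eps
        editor_qv = new_qv
        if converged:
            break
    return text_qv, editor_qv
```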
23. Experimental Setup
• Data set
• Japanese Wikipedia edit history data (as of Nov. 2, 2010)
• 1,889,129 articles, 2,178,003 editors (including bots and anonymous IP users)
• High quality articles (correct dataset)
• "Featured articles" and "Good articles" selected by Wikipedians
• Evaluation measure
• 11-pt interpolated recall-precision graph
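The 11-pt interpolated recall-precision graph is the standard IR measure: at each recall level r in {0.0, 0.1, ..., 1.0}, plot the maximum precision achieved at any recall ≥ r. A generic sketch (not the paper's code; here the ranking would be articles sorted by computed QV, with featured/good articles as the relevant set):

```python
def eleven_point_precision(ranked_is_relevant, n_relevant):
    """11-pt interpolated precision for a ranked result list.

    ranked_is_relevant: booleans, True where the ranked item is in
    the correct set (here: featured/good articles).
    """
    recalls, precisions = [], []
    hits = 0
    for rank, rel in enumerate(ranked_is_relevant, start=1):
        if rel:
            hits += 1
            recalls.append(hits / n_relevant)
            precisions.append(hits / rank)
    # Interpolated precision at r = max precision at any recall >= r.
    return [max((p for rec, p in zip(recalls, precisions) if rec >= r / 10),
                default=0.0)
            for r in range(11)]
```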
24. Experimental Result
• Precision improves by about 10% on average.
• At recall 0 to 0.5, precision improves by about 20%, whereas precision does not improve at recall 0.6 to 1.
• When an article about current events is high quality, our system can judge it as high quality, but it is not in the featured articles.
• When one editor writes excellent texts and the other editors do not edit, the article is "featured" but is not judged as high quality.
• The texts' and editors' QVs converge after about 20 rounds of calculation each.
[Graph: 11-pt interpolated recall-precision curves, with and without editor's QV; precision roughly 0.01 to 0.10 over recall 0 to 1.]
25. Conclusion
• Calculate texts' quality values using editors' quality values.
• The relation between texts' and editors' quality values is a chicken-or-the-egg problem.
• Mutually calculate texts' and editors' quality values until they converge.
• Average precision improves by about 10%.
• At low recall, precision improves by about 20%.
• Future Work
• Confidence of quality values
• When A edits 100 articles many times, B edits only ONE article once, and A and B have the same QV, the system judges the qualities of A and B as the same. But they should differ, because the confidence differs.
• Other effective assumptions
• When a high quality editor confirms a text, the text should be high quality even if it was written by a low quality editor.
26. Open problems
• Using content analysis
• Estimate terms which appear frequently in high quality articles but do not appear in low quality articles.
• Using multiple-language articles
• If an article in Japanese is similar to the one in English, is the article high quality?
• For Web documents, SNS, ...
• How can we calculate quality degrees without an edit history?
I am Yu Suzuki, from the Information Technology Center at Nagoya University. The title of today's presentation is quality assessment of Wikipedia articles using edit history. The purpose of this presentation is to show how to calculate quality values for Wikipedia articles.
This chart shows users' ages against the percentage usage of services, from a survey by the SPIRE Project at Oxford University. The red bars show Wikipedia, and the blue bars show blogs. From this graph, users under 18 and over 65 years old use Wikipedia more frequently than the other Web services. These users may not have enough knowledge, so if there is a wrong story in Wikipedia, they will believe it. This is a problem.
I show another graph, about the purpose of using Wikipedia. From this graph, more than 56 percent of users use Wikipedia for work and study. This shows that Wikipedia is trusted by many users; at least 56 percent of users trust it. However, do you think Wikipedia is reliable?
This graph shows the relationship between quality degrees and the number of articles. The quality is calculated by our proposed system, which I will describe later. From this graph, when our system calculates quality values, about 80% of all articles are not credible. This means that almost all users trust Wikipedia, whereas almost all articles are not credible. So I think quality values are important to prevent many users from believing wrong articles.
The objective of this study is to calculate quality degrees automatically, quickly, and accurately. These quality degrees are useful for readers, editors, and administrators. Readers can judge which articles are credible. Editors can decide which articles need to be edited. And administrators can decide which articles are not appropriate for Wikipedia, to keep up the quality of articles.
This is the output of our proposed system. In our system, the original Wikipedia article is overlaid with two kinds of colored lines. Blue lines show credible parts, and red lines show non-credible parts. The upper-left part shows the overall quality degree, and the blue and red bars show the ratio of credible and non-credible parts.
First, we should define what quality is. This is a very difficult question, but in the dictionary, quality is defined as the degree of excellence of something, and credibility as the quality of being trusted and believed in. This definition is ambiguous, so I borrowed a definition from psychology. Fogg said that credibility has two components, trustworthiness and expertise. Trustworthiness is how many users believe something, and expertise is the expert's opinion. In our study, we use trustworthiness as the definition of quality. Therefore, quality is not true or false, but how many users believe.
Next, I introduce several related works. There are two approaches: link-analysis-based methods and editor-reputation-based methods. Link-analysis-based methods identify high quality articles using link analysis such as HITS or PageRank. These methods can easily identify major articles, but cannot identify minor but high quality articles. The other approach uses editor reputation; in this approach, the reputation of editors is judged by the other editors. Our method is based on this approach. The good point of these methods is that they can calculate accurate quality, because editors or viewers of articles do not decide text quality directly, but through implicit decisions. The bad point is that vandals, that is, bad editors, can easily change text quality.
To calculate quality values, I should define the quality measurement method. To define it, I should consider three questions: who evaluates articles, what quality we measure, and how to evaluate articles. Readers' decisions can be used, through voting or personalization. In our system, I select editors' reputation, because I think this method is fair. Next, I measure editors' quality instead of measuring articles or parts of articles, because I think the same user writes articles of the same quality. And I evaluate using the edit history, because this method is simple and effective.
I talk again about the plan to measure quality. I used a reputation-based approach because users' votes are not always true; on YouTube, almost all votes are the highest score. I used editors' quality because we assume that the same editor writes articles of the same quality. I used the edit history because this method is simple, and our proposed system should be language independent. If I used linguistic analysis, the system would be language dependent.
This is an overview of our proposed system. First, I analyze an article and identify its editors. In this example, I identified editors A and B from the edit history. Next, I get the editors' edit histories for the other articles. Then I analyze these edit histories and calculate text and editor quality values; here, the quality value of A is 70% and that of B is 40%. Finally, by combining these two editors' quality values, I calculate the article's quality value. In this case, the article's quality degree is 55%.
The key idea is the remaining ratio of texts. If a part of an article is high quality, the part is not deleted by the other editors; if a part is low quality, the part is soon deleted or replaced. Consider the situation where editor A writes one part, editor B adds another part, and editor C deletes editor A's part and replaces it. Because editor B keeps editor A's part, editor B decides that editor A's part is high quality. Because editor C keeps editor B's part, editor C decides that editor B's part is high quality. However, because editor C deletes editor A's part, editor C decides that editor A's part is low quality.
I explain how to calculate the quality values of texts. First, A writes 100 letters to an article, then B deletes 20 letters from A's text, and then C deletes 60 letters from A's text. At version 1, A cannot gain any quality value, because A cannot evaluate A herself. At version 2, B keeps A's 80 letters, so B evaluates A's 80 letters as good, and A gains a positive evaluation of 80 from B. At version 3, C keeps A's 20 letters, so C evaluates A's 20 letters as good, and A gains a positive evaluation of 20 from C. As a result, from this edit history, editor A gains log 80 plus log 20 quality value from editors B and C.
However, the problem is that this system does not consider editors' quality. In this case, C deletes A's text, so our system decreases A's quality value. However, if C has low quality, C may delete high quality texts, and A's quality value should not be decreased. But if C has high quality, C should delete only low quality texts, and A's quality value should be decreased. Therefore, editors' quality is important for calculating text quality values.
I explain how to calculate the quality values of texts using editor quality values. If B's quality value is 100%, that is, if B is a high quality editor, then B should delete only low quality texts. Therefore, when B deletes 25 letters, A's text is counted as losing 25 letters. However, if C deletes 60 letters and C's quality value is 50%, C may delete high quality texts 50% of the time. Therefore, A's text is counted as losing only 30 letters, half of the actually deleted letters, to C.
However, there is another problem. The text quality value is calculated from both the edit history and the editors' quality values, while an editor's quality value is calculated from text quality values. Therefore, calculating editors' and texts' quality values is a kind of chicken-or-the-egg problem. To solve this problem, we mutually calculate editors' and texts' quality values until these values converge.
Using these observations, we refine our proposed method. First, we identify the editors of articles. Then, we get the edit history of each editor. Then, we calculate texts' quality values using editors' quality values; the first time we calculate text quality values, every editor's quality value is set to 1, the highest value. Then we calculate editors' quality values. Next, if the text and editor quality values have not converged, we return to step 3. Finally, we calculate the articles' quality values.
I used Japanese Wikipedia edit history data from the Wikipedia site. I used 85,028 articles, about 13.6% of all articles, written by 705,713 editors, excluding bots. As credible articles, I used the featured articles and good articles selected by Wikipedians. In this experiment I used Japanese Wikipedia, but any language edition of Wikipedia can be used. However, the English Wikipedia edit history is not available now, so I cannot use the English version.
This is the experimental result. From this recall-precision graph, we can confirm that precision improves by about 10%. From recall 0 to 0.5, precision improves by about 20%, whereas precision does not improve at recall 0.6 to 1. When an article about current events is high quality, our system can judge it as high quality, but such articles are not featured articles. When one editor writes excellent texts and the other editors do not edit, the article is a featured article, but it is not judged as high quality by our method. Moreover, the text and editor quality values converge when we calculate the quality values about 20 times each.
Finally, I conclude our study. In this study, we calculate texts' quality values using editors' quality values. The relation between text and editor quality values is a kind of chicken-or-the-egg problem; to solve it, we mutually calculate text and editor quality values until they converge. As a result, we improve average precision by about 10%, and at low recall, precision improves by about 20%. Next, I introduce our future work. The first topic is the confidence of quality values. When A edits 100 articles many times, B edits only one article once, and A and B have the same quality value, the system judges the qualities of A and B as the same. But they should differ, because the confidence of A and B differs. Another topic is other effective assumptions. When a high quality editor confirms a text, the text should be high quality even if it was written by a low quality editor.
I consider several open problems, such as content analysis techniques. With content analysis, I would estimate terms which appear frequently in credible articles but do not appear in non-credible articles. Next is using multiple-language articles. I think the English Wikipedia is the richest, so if an article in Japanese is similar to the one in English, the article may be credible or rich. I also want to apply my system to Web documents and SNS, but there is no edit history for Web documents, so I should discover how to calculate quality without edit history.