1. What’s on Wikipedia, and What’s Not…?
Completeness of Information on the Online Collaborative Encyclopedia
Cindy Royal, Ph.D.
Assistant Professor
Texas State University
School of Journalism and Mass Communication
Deepina Kapila
Graduate Student
Texas State University
School of Journalism and Mass Communication
2. Introduction - Wikipedia
• Wikipedia (www.wikipedia.com), deemed “the free
encyclopedia,” was launched on the web in 2001.
• Since then, it has become the Web’s 3rd most
popular news and information source
• It uses the Wiki software format, which allows a
community of users to develop and monitor content
• Wikipedia operates under the assumption that the
public will act as a policing force, keeping content
reliable and up to date.
3. Introduction - Research
• Denning et al. (2005) listed the risks inherent in
Wikipedia’s model: accuracy, motives, uncertain
expertise, volatility, coverage, sources.
• Bopp and Smith (2001) state that coverage in an
encyclopedia should be “Even across all subjects”
• Shoemaker and Reese (1995) identified the
individual as a news influencer. Web users and
content creators tend to be young.
• Tankard/Royal (2005) – inherent biases in Web
content, based on systematic searches.
4. Research Questions
This project measures the content of Wikipedia against
various indexes or standards of completeness to identify
and uncover potential inherent biases.
We are asking:
1. Are there some systematic gaps or biases in the overall presentation of
information made available on Wikipedia?
2. Is recency (or currency) a predictor of amount of information on Wikipedia?
3. Is importance of information a predictor of amount of information on
Wikipedia?
4. Is population a predictor of amount of information about particular countries
on Wikipedia?
5. Is economic power a predictor of amount of information about individual
corporations on Wikipedia?
5. Method
• Using predictors of recency, importance, country
population, and economic power, several systematic
searches on Wikipedia were conducted
• Each article for each topic was visited, the relevant
content highlighted, and the selection’s words were
counted
• Word counts were captured in a spreadsheet, and
items were plotted on charts
• Ascending order
• Predictor variable
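The counting and ordering steps above can be sketched in a few lines of Python. This is a minimal illustration, not the procedure actually used in the study: the `selections` dictionary, its placeholder text, and the helper name `word_count` are assumptions, and words are simply taken to be whitespace-separated tokens.

```python
def word_count(text: str) -> int:
    """Count whitespace-separated words in a highlighted selection."""
    return len(text.split())


# Hypothetical placeholder selections, keyed by topic (here, year articles).
selections = {
    "1900": "alpha beta gamma delta",
    "2005": "alpha beta gamma delta epsilon zeta eta",
    "2010": "alpha beta",
}

# One row per topic, as it might be captured in a spreadsheet.
counts = {topic: word_count(text) for topic, text in selections.items()}

# Ascending order by word count (the first chart on each results slide).
ascending = sorted(counts.items(), key=lambda item: item[1])

# Ordered by the predictor variable (here the year itself).
by_predictor = sorted(counts.items(), key=lambda item: int(item[0]))

for topic, n in ascending:
    print(topic, n)
```

Sorting the same rows two ways mirrors the paired charts: one view shows the shape of the distribution, the other shows how word count moves with the predictor.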
6. Topics Covered
• Years (1900-2010)
• Academy Award Winning Films
• Time Magazine’s Person of the Year
• #1 Song on Billboard Top 100 (1940-2006)
• Encyclopedia Terms
• Countries in the United Nations
• Fortune 1000 companies
7. Results - Years
[Charts: article word count (0 to 12,000) for each year, 1900-2010. Panels: Ascending Order; Chronological Order]
-Backward L-shaped curve
-Clear progression of length of article with year; dramatic increase in
years after 2001
-Future years understandably had shorter word counts
-Spearman Correlation between variables: .79
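The Spearman figures on the results slides are rank correlations between a predictor (year, population, revenue) and article word count. The slides do not say which tool produced them, so the following is only a sketch of the standard definition: rank both variables (ties get the average rank), then take the Pearson correlation of the ranks. The function names and sample values are hypothetical.

```python
def average_ranks(values):
    """Return 1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    sd_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return cov / (sd_x * sd_y)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical example: years versus made-up word counts.
years = [1900, 1950, 2000, 2005]
words = [300, 250, 900, 4000]
print(round(spearman(years, words), 2))
```

Because only ranks matter, a few very long articles affect the coefficient less than they would a Pearson correlation on raw word counts.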
8. Results - Films
[Charts: article word count (0 to 9,000) for Academy Award winning films, 1928-2004. Panels: Ascending Order; Chronological Order]
-Backward L-shaped curve is apparent.
-With few exceptions (e.g., Gone with the Wind, 1939, and Casablanca, 1943), the results
show a progression favoring more current films. Recency is important, but certain films
transcend time and are deemed important for other reasons.
-Average word count for films since 2001 was 80% higher than word count before
2001.
-Spearman correlation between variables: .49; increased to .62 simply by removing 2
9. Results - Person of the Year
[Charts: article word count (0 to 18,000) for Time Magazine's Person of the Year, 1927-2001. Panels: Ascending Order; Chronological Order]
-Softer backward L-shaped curve
-Even distribution suggests that article length is unrelated to recency and is instead
driven by another variable, importance
-Spearman correlation between variables: 0; there was no relationship with time.
10. Results - Billboard Top 100
[Charts: article word count (0 to 14,000) for artists with the #1 Billboard song, 1940-2006. Panels: Ascending Order; Chronological Order]
-Backward L-shaped curve
-Although average word count was 32% higher for artists since 1990, the distribution
shows a trend similar to movies in that some artists transcend time.
-Spearman correlation between variables: .40 (after eliminating 2 outliers)
11. Encyclopedia Terms
[Chart: Wikipedia article word count (0 to 14,000) for 100 encyclopedia terms. Panel: Ascending Order]
-Comparison between Encyclopedia Britannica and Wikipedia articles
-Backward L-shaped distribution apparent
-Spearman correlation used to compare inches of content in Encyclopedia Britannica
with word count in Wikipedia: .26
-Of 100 terms, 14 were not represented in Wikipedia
12. Results - UN Countries
[Charts: article word count (0 to 14,000) for United Nations member countries. Panels: Ascending Order; Ordered by Population]
-Backward L-shaped curve: although fairly evenly distributed, a sharp increase appears
for the top 22 countries.
-Gradual upward curve in the 2nd chart shows that as population increases, so does word count
-Average word count for top 10% of countries was 63% higher than the rest on the list
-Spearman correlation between variables: .55
13. Results - Fortune 1000
[Charts: article word count (0 to 6,000) for Fortune 1000 companies. Panels: Ascending Order; Ordered by Revenue]
-Backward L-shaped curve
-Sharp increase for top 10% of companies by revenue
-Top 10% of companies by revenue accounted for 30% of total word count on companies
-Spearman correlation between variables: .49
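Summary figures such as "X% higher" and "top 10% accounted for 30% of total word count" follow from simple arithmetic on the word-count table. The sketch below shows one plausible way to compute them. The function names, the rounding of the top-group size, and the sample rows are assumptions for illustration, not the study's data or exact method.

```python
def percent_higher(group, rest):
    """Percent by which the mean of `group` exceeds the mean of `rest`."""
    mean_group = sum(group) / len(group)
    mean_rest = sum(rest) / len(rest)
    return (mean_group - mean_rest) / mean_rest * 100


def top_share(rows, fraction=0.10):
    """Share of total word count held by the top `fraction` of rows.

    `rows` is a list of (predictor, word_count) pairs, e.g. (revenue, words).
    Rows are ranked by the predictor, largest first.
    """
    ranked = sorted(rows, key=lambda row: row[0], reverse=True)
    k = max(1, round(len(ranked) * fraction))  # size of the top group
    top_words = sum(words for _, words in ranked[:k])
    total_words = sum(words for _, words in ranked)
    return top_words / total_words


# Hypothetical (revenue, word_count) rows for ten companies.
companies = [(10, 300), (9, 100), (8, 100), (7, 100), (6, 100),
             (5, 100), (4, 50), (3, 50), (2, 50), (1, 50)]

print(f"{top_share(companies):.0%}")              # share held by the top 10%
print(f"{percent_higher([180, 180], [100]):.0f}%")  # mean 180 vs. mean 100
```

Ranking by the predictor rather than by word count keeps the comparison tied to the research question, which asks whether revenue or population predicts coverage.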
14. Conclusion
-Information on Wikipedia is volatile and dynamic, changing constantly over time
-Wikipedia's purpose is to serve as a general reference source, but the content is
skewed by its contributors' demographics
-In each search performed for the dimensions, strong biases were evident and strong
correlations were observed:
-Currency/Recency: the more current topics were covered the most
-Random Selection: Encyclopedia terms showed clear bias towards more
common or popular terms
-Relevancy: Wikipedia's word count correlates with inches in a traditional
encyclopedia, showing a strong agenda by each publication
-Population: the larger a country's population, the higher the
word count
-Revenue: the larger the revenue, the higher the word count