2. PRESENTATION DESCRIPTION
• With 4.8 million articles in the English version of Wikipedia, this crowd-sourced online
encyclopedia is regularly one of the top-ten visited sites online. For many, this is the go-to
source for a first read on a topic. The open-source and free Network Overview, Discovery
and Exploration for Excel (NodeXL), which is an add-on to Microsoft Excel, enables the
capture of “article networks” from Wikipedia. Data visualizations based on this type of content
network analysis support the development of research leads; some understanding of public
conceptualizations of related concepts, peoples, events, and phenomena; the profiling of
Wikipedia editors (both humans and bots); and other research insights. This presentation will
showcase this capability of NodeXL and offer some ideas for practical applications of this
channel of research and knowing.
3. OVERVIEW
• Wikipedia ethos and practices
• Wikipedia
• The many Wikipedias; the English Wikipedia
• The Wikimedia Foundation
• MediaWiki and basic functionalities
• Basic article network analysis
• NodeXL and basic functionalities; automation
4. OVERVIEW (CONT.)
• http page networks on Wikipedia:
• article networks
• human author / editor networks
• robot networks
• Live demos
• Other (future) networks from Wikipedia
5. WIKIPEDIA ETHOS AND PRACTICES
• Objective, fact-based, and
research-focused
• Full research citations
• Isolating of opinions into Talk pages
• Open
• Open-access
• Open-source, public domain-released
• Crowd-sourced knowledge co-creation; curated public data
• Crowd-funded 501(c)(3); transparent finances ($58.5 million goal for FY 2015)
• Editing via email-verified accounts
or Internet Protocol (IP) capture
6. WIKIPEDIA
THE MANY WIKIPEDIAS
• 288 Wikipedias (with 277 active)
• In order of articles: English (13.9%),
Swedish (5.6%), Dutch (5.2%), German
(5.2%), French (4.6%), Waray-Waray
(3.6%), Russian (3.5%), Cebuano
(3.4%), Italian (3.4%), Spanish (3.4%),
and Other (48.2%)
• (“List of Wikipedias” on Wikipedia)
THE ENGLISH WIKIPEDIA
• Founded on Jan. 15, 2001
• 4.8 million articles
• 25 million user accounts
• 1,347 administrators (“English
Wikipedia” on Wikipedia)
7. THE WIKIMEDIA FOUNDATION
• Objective: to encourage “the growth, development and distribution of free,
multilingual, educational content,” and to provide “the full content of these
wiki-based projects to the public free of charge”
• A range of projects: Wikipedia, Wikibooks, Wikiversity, Wikimedia
Commons, Wiktionary, Wikiquote, Wikivoyage, Wikidata, Wikinews,
Wikisource, Wikispecies, and MediaWiki (Wikimedia Foundation)
8. MEDIAWIKI AND BASIC FUNCTIONALITIES
• “wiki wiki”: “quick” or “fast” in Hawaiian
• Ward Cunningham as the developer of the first wiki software (WikiWikiWeb) in 1994 to
enable online collaborations with history versioning and rollback capabilities
• MediaWiki first released in 2002 for Wikipedia (the Wikimedia Foundation, which now maintains it, was founded in 2003)
• Magnus Manske and Lee Daniel Crocker were the initial developers of this tool using PHP
(MediaWiki)
11. BASIC ARTICLE NETWORK ANALYSIS
• Basics of network graphs: nodes-links, entities-relationships, vertices-edges;
undirected or directed (digraphs) graphs; networks and meta-networks;
subgraphs and clusters, motifs; network centrality
• Direct ties represented in ego neighborhoods (1 degree, with a maximum geodesic
distance or graph diameter of 2: alter to ego to alter); also 1.5 degree ties, which add
the links among the alters to show transitivity (the maximum geodesic distance stays at
2 at most); and 2 degree ties to include networks of the respective “alters” (with a
maximum geodesic distance of up to 4)
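The ego-neighborhood levels above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the edge list `EDGES`, the ego name "E", and the helper names `ego_network` and `max_geodesic_distance` are all hypothetical, and ties are treated as undirected when distances are measured.

```python
from collections import deque

# Hypothetical toy data: "E" is the ego; every name here is illustrative only.
EDGES = [("E", "A"), ("E", "B"), ("E", "C"), ("A", "B"),
         ("A", "X"), ("C", "Z"), ("X", "Y")]


def ego_network(edges, ego, degree):
    """Return the edges kept at a given ego-network level (1, 1.5, or 2)."""
    alters = {b for a, b in edges if a == ego} | {a for a, b in edges if b == ego}
    members = alters | {ego}
    if degree == 1:        # only ego-alter ties
        keep = lambda a, b: ego in (a, b)
    elif degree == 1.5:    # ego-alter ties plus ties among the alters
        keep = lambda a, b: a in members and b in members
    elif degree == 2:      # also the ties from alters to their own alters
        keep = lambda a, b: a in members or b in members
    else:
        raise ValueError("degree must be 1, 1.5, or 2")
    return [(a, b) for a, b in edges if keep(a, b)]


def max_geodesic_distance(edge_list):
    """Longest shortest path (diameter), treating edges as undirected."""
    adj = {}
    for a, b in edge_list:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    longest = 0
    for start in adj:
        dist = {start: 0}
        queue = deque([start])
        while queue:  # breadth-first search from each vertex
            node = queue.popleft()
            for nxt in adj[node]:
                if nxt not in dist:
                    dist[nxt] = dist[node] + 1
                    queue.append(nxt)
        longest = max(longest, max(dist.values()))
    return longest


for level in (1, 1.5, 2):
    sub = ego_network(EDGES, "E", level)
    print(level, len(sub), max_geodesic_distance(sub))
```

Running the loop on a small hand-made graph like this one makes it easy to check how many ties each level adds and how the diameter changes.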
12. BASIC ARTICLE NETWORK ANALYSIS (CONT.)
• Entities may be individuals or groups, contents, and other elements
• Relatedness: Article networks created based on in-links and outlinks; node
“degree”
• Other types of relatedness are possible such as based on word co-occurrences, title
relatedness (same synset or “synonym set”), shared categories, and others
• Relations are conceptualized as enabling paths
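As a rough illustration of link-based relatedness, the sketch below counts in-links, out-links, and total degree per page from a directed edge list. The page-to-page pairs in `LINKS` are made up for the example and are not real extraction results; `degree_table` is a hypothetical helper name.

```python
from collections import Counter

# Hypothetical (source page, target page) out-links; not real extraction data.
LINKS = [
    ("Wiki", "MediaWiki"),
    ("Wiki", "Ward Cunningham"),
    ("MediaWiki", "Wiki"),
    ("MediaWiki", "PHP"),
    ("Ward Cunningham", "Wiki"),
]


def degree_table(edges):
    """Map each page to its in-degree, out-degree, and total degree."""
    out_deg = Counter(src for src, _ in edges)
    in_deg = Counter(dst for _, dst in edges)
    pages = set(out_deg) | set(in_deg)
    return {
        page: {
            "in": in_deg[page],
            "out": out_deg[page],
            "degree": in_deg[page] + out_deg[page],
        }
        for page in pages
    }


TABLE = degree_table(LINKS)
for page in sorted(TABLE):
    print(page, TABLE[page])
```

Pages with many in-links but few out-links (or the reverse) stand out quickly in such a table, which is one simple way to pick research leads from an article network.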
13. NODEXL AND BASIC FUNCTIONALITIES; AUTOMATION
• A free and open-source add-on to Microsoft Excel available on the Microsoft
CodePlex platform
• Enables…
• Graph visualization (with datasets from UCINET, GraphML, and other types)
• Data extraction from a number of social media platform APIs; refreshed runs based on
the same parameters (macros)
• A large number of graph analysis tools
• A number of layout algorithms and selections to represent the data visually
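Since GraphML is one of the dataset types mentioned above, here is a minimal sketch of writing an edge list out as GraphML using only the Python standard library. The sample edges and the function name `edges_to_graphml` are hypothetical, and how a particular NodeXL version maps the imported vertex ids to its worksheet columns is not confirmed here.

```python
import xml.etree.ElementTree as ET

GRAPHML_NS = "http://graphml.graphdrawing.org/xmlns"

# Hypothetical directed edges between Wikipedia page titles.
SAMPLE_EDGES = [
    ("MediaWiki", "PHP"),
    ("MediaWiki", "Wikipedia"),
    ("Wikipedia", "MediaWiki"),
]


def edges_to_graphml(edges, directed=True):
    """Serialize (source, target) pairs as a GraphML document string."""
    ET.register_namespace("", GRAPHML_NS)  # emit GraphML as the default namespace
    tag = lambda name: f"{{{GRAPHML_NS}}}{name}"
    root = ET.Element(tag("graphml"))
    graph = ET.SubElement(
        root, tag("graph"), id="G",
        edgedefault="directed" if directed else "undirected",
    )
    # One <node> per distinct vertex, in first-seen order.
    for vertex in dict.fromkeys(v for edge in edges for v in edge):
        ET.SubElement(graph, tag("node"), id=vertex)
    # One <edge> per pair.
    for source, target in edges:
        ET.SubElement(graph, tag("edge"), source=source, target=target)
    return ET.tostring(root, encoding="unicode")


GRAPHML_TEXT = edges_to_graphml(SAMPLE_EDGES)
print(GRAPHML_TEXT)
```

Saving the returned string to a `.graphml` file gives a plain interchange file that graph tools with GraphML import can read.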
14. HTTP PAGE NETWORKS ON WIKIPEDIA (IN THIS CASE)
• http page links within Wikipedia, not connecting out to the Surface Web
• Directed (outlink-only) graph seeded from the target Wikipedia page
• May include article page networks, human page networks, robot page networks, and
others
• Networks seeded by one target title or name (as long as the string appears as a
page in Wikipedia)
• No need for an application programming interface (API) on the MediaWiki platform
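The seeded, outlink-only extraction described above can be approximated as a breadth-first crawl to a fixed depth. The sketch below is not NodeXL's importer: `crawl_outlinks` is a hypothetical helper, and the link lookup is an in-memory dictionary (`SAMPLE_LINKS`) standing in for whatever actually fetches a page's internal links.

```python
from collections import deque

# Hypothetical internal links per page; a stand-in for real page fetching.
SAMPLE_LINKS = {
    "MediaWiki": ["PHP", "Wikipedia"],
    "PHP": ["MediaWiki", "Zend Engine"],
    "Wikipedia": ["MediaWiki"],
    "Zend Engine": ["PHP", "C"],
}


def sample_outlinks(page):
    """Return the outlinks of a page (empty list if the page is unknown)."""
    return SAMPLE_LINKS.get(page, [])


def crawl_outlinks(seed, get_outlinks, max_depth=2):
    """Collect directed (page, linked page) edges up to max_depth from the seed."""
    edges = []
    depth = {seed: 0}
    queue = deque([seed])
    while queue:
        page = queue.popleft()
        if depth[page] >= max_depth:
            continue  # pages at the depth limit are kept but not expanded
        for target in get_outlinks(page):
            edges.append((page, target))
            if target not in depth:
                depth[target] = depth[page] + 1
                queue.append(target)
    return edges


ONE_DEG = crawl_outlinks("MediaWiki", sample_outlinks, max_depth=1)
TWO_DEG = crawl_outlinks("MediaWiki", sample_outlinks, max_depth=2)
print(len(ONE_DEG), len(TWO_DEG))
```

Because each extra degree expands every page found so far, the edge count can grow very quickly on a real wiki, which is worth keeping in mind when choosing the depth.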
17. MEDIAWIKI ARTICLE NETWORK ON WIKIPEDIA (2 DEG., 923,006 VERTICES)
In the first run, the software threw an “out of memory” exception and crashed. Another run was
conducted on a different machine with more processing capability; the screenshots are from that
data extraction. The extracted data included some edge pairs (over half a dozen) in which one of
the two vertices was missing.
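Edge pairs with a missing vertex, as noted above, can be separated out before analysis. The following is a minimal sketch assuming the edge list has been exported as rows of two cells; the rows in `RAW_ROWS` and the helper name `split_incomplete_edges` are hypothetical.

```python
# Hypothetical exported rows; blank strings, None, or short rows mark a missing vertex.
RAW_ROWS = [
    ("MediaWiki", "PHP"),
    ("MediaWiki", ""),
    (None, "Wikipedia"),
    ("Wiki",),
]


def _clean(cell):
    """Normalize a cell to a stripped string; non-strings count as missing."""
    return cell.strip() if isinstance(cell, str) else ""


def split_incomplete_edges(rows):
    """Return (complete, incomplete) edge lists based on missing vertices."""
    complete, incomplete = [], []
    for row in rows:
        padded = (list(row) + [None, None])[:2]  # tolerate short rows
        v1, v2 = _clean(padded[0]), _clean(padded[1])
        if v1 and v2:
            complete.append((v1, v2))
        else:
            incomplete.append((v1, v2))
    return complete, incomplete


COMPLETE, INCOMPLETE = split_incomplete_edges(RAW_ROWS)
print(len(COMPLETE), "complete;", len(INCOMPLETE), "incomplete")
```

Keeping the incomplete rows in their own list, rather than discarding them silently, makes it possible to report how many were affected.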
18. EXAMPLE: ARTICLE NETWORK
• Who are individuals related to a topic? Events? Years? Topics? Which of
these may be useful leads to learn more about the basic seed topic?
• Based on a real-world individual, what is he or she known for? Who are
people that this person is connected with?
• Based on a technology, when did it originate? Who originated it? What
were precursor inventions? What inventions were linked to the particular
technology?
19. EXAMPLE: ARTICLE NETWORK (CONT.)
• Based on collected lists, who is on a target list, and for what?
• Based on a particular topic, are there gaps in the information based on
“missing” article links?
• Based on a particular phenomenon, event, phrase, or individual, in a foreign
context and foreign language, what may be learned?
22. EXAMPLE: HUMAN (AUTHOR / EDITOR) USER NETWORK
• Based on the human user’s network on Wikipedia, what articles does he or she
tend to edit? In total, what does this network suggest about the person behind
the edits?
• (This requires the existence of a user page though.)
25. EXAMPLE: ROBOT NETWORK
• Based on the approved robot user’s network, what are the interests of the
maker of the robot? What other accounts is the robot connected to?
28. ADDITIONAL APPROACHES
• Chaining from one target account to related others
• Cross-comparing information on the Wikipedia site with the extracted
networks
• Connecting the Wikipedia information with related sites on the Surface Web /
World Wide Web (WWW) and Internet
29. OTHER (FUTURE) NETWORKS FROM WIKIPEDIA
• The third-party tool for NodeXL has placeholders for user-content (two-mode)
network extractions and the mapping of co-editing networks…but those
functions do not appear to be enabled currently
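To make the idea of a two-mode (user-content) network and its co-editing projection concrete, here is a small sketch. It does not reflect how the NodeXL tool would implement these functions; the (user, article) pairs in `USER_ARTICLE_PAIRS` are invented for illustration and `co_editing_network` is a hypothetical helper name.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical two-mode data: (editor account, article edited).
USER_ARTICLE_PAIRS = [
    ("UserA", "MediaWiki"),
    ("UserB", "MediaWiki"),
    ("UserA", "Wiki"),
    ("UserB", "Wiki"),
    ("BotC", "Wiki"),
]


def co_editing_network(pairs):
    """Project user-article pairs onto weighted editor-editor ties.

    The weight of a tie is the number of articles both editors have edited.
    """
    editors_by_article = defaultdict(set)
    for user, article in pairs:
        editors_by_article[article].add(user)
    weights = defaultdict(int)
    for editors in editors_by_article.values():
        # Sort so each undirected pair has one canonical key.
        for a, b in combinations(sorted(editors), 2):
            weights[(a, b)] += 1
    return dict(weights)


CO_EDITS = co_editing_network(USER_ARTICLE_PAIRS)
for pair, weight in sorted(CO_EDITS.items()):
    print(pair, weight)
```

The projection trades detail for readability: the resulting one-mode network shows who edits alongside whom, but no longer shows which articles connect them.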