Webometrics Revisited in Big Data Age (DISC 2013)
Virtual Knowledge Studio (VKS)
“Webometrics Studies” Revisited
in the Age of “Big Data”
Asso. Prof. Dr. Han Woo PARK
CyberEmotions Research Institute
Dept. of Media & Communication
214-1 Dae-dong, Gyeongsan-si,
Republic of Korea
The term “big data” refers to “analytical technologies that
have existed for years but can now be applied faster, on a
greater scale and are accessible to more users.” (Miller,
Big data sizes may vary per discipline.
Characteristics: Gartner’s 3Vs plus SAS’s VC and IBM’s
- Volume (amount of data), Velocity (speed of data in and
out), Variety (range of data types and sources)
- Variability: Data flows can be highly inconsistent with daily,
seasonal, and event-triggered peak data loads
- Complexity: Multiple data sources requiring cleaning,
linking, and matching the data across systems
- Veracity: 1 in 3 business leaders don’t trust the information
they use to make decisions.
Data-driven research that focuses on
extracting meaningful data from techno-socio-economic systems to discover
hidden patterns.
Today’s “big” is probably tomorrow’s “medium” and
next week’s “small”, and thus the most effective definition
of “big data” may be derived when the size of
the data itself becomes part of the research problem.
Webometrics is broadly defined as the study of
web-based content (e.g., text, images, audio-visual
objects, and hyperlinks) with primarily quantitative
indicators for social science research goals and
visualization techniques derived from information
science and social network analysis.
• Han Woo Park
- “hidden” and “relational” data about
lots of people as well as the few
individuals, or small groups
• Lev Manovich
- “surface” data about lots of people (i.e.,
statistical, mathematical or computational
techniques for analyzing data)
- “deep” data about the few individuals or small
groups (i.e., hermeneutics, participant
observation, thick description, semiotics, and
First type of Webometrics
• Hyperlink Network Analysis
Inter-linkage: a who-linked-to-whom matrix
Co-inlink: two different nodes both receive a link from a third node
Co-outlink: two different nodes both link to a third node
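The three link types can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `colink_counts` and the 0/1 adjacency-list input are hypothetical and are not taken from the studies cited in these slides.

```python
def colink_counts(adj):
    """Count co-inlinks and co-outlinks for every pair of sites.

    adj is an inter-linkage (who-linked-to-whom) matrix:
    adj[i][j] == 1 means site i links to site j.
    """
    n = len(adj)
    co_in = [[0] * n for _ in range(n)]
    co_out = [[0] * n for _ in range(n)]
    for a in range(n):
        for b in range(n):
            if a == b:
                continue
            for k in range(n):
                if k in (a, b):
                    continue
                # Co-inlink: a third site k links to both a and b.
                co_in[a][b] += adj[k][a] * adj[k][b]
                # Co-outlink: a and b both link to a third site k.
                co_out[a][b] += adj[a][k] * adj[b][k]
    return co_in, co_out


# Hypothetical example: site 2 links to both site 0 and site 1.
co_in, co_out = colink_counts([[0, 0, 0], [0, 0, 0], [1, 1, 0]])
print(co_in[0][1], co_out[0][1])
```

Both returned matrices are symmetric, so they can be drawn as undirected networks, whereas the inter-linkage matrix itself is directed.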
Inter-link network analysis diagram among Korean e-science sites within the public domain
Mapping the e-science landscape
In South Korea using the Webometrics method
Co-inlink network analysis
As seen in Figure 4, the network structure shows a clear butterfly pattern. There is one hub (ghism)
that belongs to Park Geun-Hye (Park GH, www.cyworld.com/ghism), the daughter of ex-president
Park Jeong-Hee and one of two major GNP candidates (along with president-elect Lee MB) in the
2007 presidential race.
Figure 4: Cyworld Mini-hompies of Korean legislators
How do social scientists use link data
from search engines to understand
Internet-based political and electoral
INVESTIGATING INTERNET-BASED POLITICS WITH E-RESEARCH TOOLS
Case 2. Cyworld Mini-hompies of Korean Legislators
Sociology of Hyperlink Networks of Web 1.0,
Web 2.0, and Twitter
A Case Study of South Korea
‣ Online & offline lives ➭ co-constructing (e.g. Beer & Burrows, 2007)
‣ Politicians communicate with their constituencies using different platforms
- What are the structural similarities and/or differences in South Korean
politicians’ networks from Web 1.0 to Web 2.0 (and Twitter)?
- Are online structures similar to structures in the physical world?
- Are online patterns affected by offline relationships?
‣ Related studies conducted:
- online social network analysis
- online networks in Web 2.0
- role of Twitter on online politics
‣ 59 isolated in 2000
‣ more centralised in 2001
‣ network of 2001 ➭ a ‘star’ network
- might have been affected by political events
➭ presidential election in 2001
‣ easy use of blogs
‣ clear boundaries between different parties
‣ strong presence of GNP Assembly members
➭ party policy on using blogs
Politician Twitter Network (Following and Mention Network)
Bi-linked network of politically active
A-list Korean citizen blogs (July 2005)
Just A-list blogs exchanging links with politicians
Affiliation network diagram using pages
linked to Lee’s and Park’s sites
N = 901 (Lee: 215, Park: 692, Shared: 6)
Tweets on the name of S. Korea president
A Study of the 2012 South Korean Presidential Debate
Reply-To Networks of Park’s & Moon’s
Facebook page visitors during TV debates
“Those studies perpetuate the idea that linking behaviour
is not random, and that links are ‘socially significant in
some way’. In this perspective, links have an
‘information side-effect’, they can be used to
understand other facts even though they were not
individually designed to do so: ‘information side-effects
are by-products of data intended for one use which
can be mined in order to understand some tangential,
and possibly larger scale, phenomena’
Park and his colleagues were
extensively cited: 9 times!
Barnett GA, Chung CJ and Park HW (2011) Uncovering transnational hyperlink patterns
and web mediated contents: a new approach based on cracking .com domain. Social
Science Computer Review 29(3): 369–384.
Hsu C and Park HW (2011) Sociology of hyperlink networks of Web 1.0, Web 2.0, and
Twitter: a case study of South Korea. Social Science Computer Review 29(3): 354–368.
Park HW (2003) Hyperlink network analysis: a new method for the study of social structure
on the web. Connections 25(1): 49–61.
Park HW (2010) Mapping the e-science landscape in South Korea using the webometrics
method. Journal of Computer-Mediated Communication 15(2): 211–229.
Park HW and Jankowski NW (2008) A hyperlink network analysis of citizen blogs in South
Korean politics. Javnost: The Public 15(2): 5–16.
Park HW and Thelwall M (2003) Hyperlink analyses of the World Wide Web: a review.
Journal of Computer-Mediated Communication 8(4).
Park HW and Thelwall M (2008) Developing network indicators for ideological landscapes
from the political blogosphere in South Korea. Journal of Computer-Mediated
Communication 13(4): 856–879.
Park HW, Kim C and Barnett GA (2004) Socio-communicational structure among political
actors on the web in South Korea. New Media & Society 6(3): 403–423.
Park HW, Thelwall M and Kluver R (2005) Political hyperlinking in South Korea: technical
indicators of ideology and content. Sociological Research Online 12(3).
A comment from those who are
NOT doing a hyperlink analysis
• In a chapter of The Sage Handbook of
Online Research Methods edited by
Fielding et al. (2008), Horgan emphasizes
that ‘link analysis’ has become an active
research domain in examining social
A threat to Webometrics
• The key application in this area is to collect
some incoming, outgoing, inter-linking, and
co-linking data from search engines
- AltaVista in the early 2000s
- Yahoo renewed AltaVista’s hyperlink
commands via “Site Explorer” and its API
- Yahoo discontinued its API option for
interlinkage data in April 2011, and finally
stopped its popular Site Explorer service in
A new proposal
• Mike Thelwall
- URL citation searches with the Bing search
• Liwen Vaughan
- Incoming hyperlinks from Alexa.com
Can these "alternative" techniques be
acceptable for scientific publishing?
A new proposal : SEO Tools
Search Engine Optimization Tools
Enrique Orduña-Malea & John J.
Regazzi (2013). Influence of the academic
Library on U.S. university reputation:
a webometric approach. Technologies, 1(2), 26-43. http://www.mdpi.com/2227-7080/1/2/26
Webometrics Ranking of
The link visibility data is collected from the two most
important providers of this information: Majestic
SEO and ahrefs.
Both use their own crawlers, generating different
databases that should be used jointly for filling
gaps or correcting mistakes.
The indicator is the product of the square root of the
number of backlinks and the number of
domains originating those backlinks, so link
popularity matters, but link diversity matters
even more.
The maximum of the normalized results is the impact
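A rough sketch of the indicator described above, under stated assumptions. It reads the formula as sqrt(backlinks) × domains, which fits the remark that diversity counts for more than popularity, and it normalizes each provider's values by that provider's maximum. The slide does not specify the normalization, and the function names and sample figures are hypothetical.

```python
import math


def visibility(backlinks, domains):
    """Assumed reading of the indicator: sqrt(backlinks) * referring domains."""
    return math.sqrt(backlinks) * domains


def impact_scores(providers):
    """providers maps provider name -> {institution: (backlinks, domains)}.

    Each provider's values are divided by that provider's maximum
    (an assumed normalization), then the larger of the normalized
    values is kept for every institution.
    """
    normalized = {}
    for provider, rows in providers.items():
        raw = {inst: visibility(b, d) for inst, (b, d) in rows.items()}
        top = max(raw.values())
        normalized[provider] = {
            inst: (value / top if top else 0.0) for inst, value in raw.items()
        }
    institutions = set()
    for rows in providers.values():
        institutions.update(rows)
    return {
        inst: max(scores.get(inst, 0.0) for scores in normalized.values())
        for inst in institutions
    }


# Hypothetical figures for two institutions from two link databases.
sample = {
    "majestic": {"Univ A": (10000, 50), "Univ B": (400, 100)},
    "ahrefs": {"Univ A": (2500, 40), "Univ B": (900, 10)},
}
print(impact_scores(sample))
```

Taking the maximum across providers means a gap in one crawler's database does not penalize an institution that the other crawler covers well.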
Interlinkage among world universities
• Barnett, G.A., Park, H. W., Jiang, K., Tang, C.,
& Aguillo, I. F. (2013 forthcoming). A Multi-Level Network Analysis of Web-Citations
Among The World’s Universities.
Isidro F. Aguillo
“Large interlinking matrix (1000*1000) are no
longer possible to obtain. Perhaps national
academic systems (200 or 300 institutions)”
among Information Scientists?
• Robert Ackland (2013). Web Social Science.
• Richard Rogers (2013). Digital Methods.
Let us move to Web Visibility Analysis
Frequently occurring key words in e-science webpages in Korea
Created on Many Eyes(http://many-eyes.com)
Words are larger according to the frequency of their occurrence but their
positions are randomly-chosen for the best visualization
Websites retrieved more than two times
Note: Websites are larger according to their frequency of retrieval; however, their
colors and locations are randomly-chosen for the best visualization
2nd type of Webometrics: Web Visibility
Web visibility as an indicator of online political power
Presence or appearance of actors or issues being
discussed by the public (Internet users) on the web.
Tracking web visibility is a powerful way to gain insight
into public reactions to actors or issues.
Recent studies indicate a positive relationship
between politicians’ web visibility levels and election results.
Also, the co-occurrence web visibility between two
politicians represents their hidden online political
relationships based on the public perception.
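Using e-research tools: web visibility analysis
In the blogosphere, candidates’ web visibility levels are closely
correlated with their vote counts (Lim & Park, 2010,
Average number of blogs
Gyeong Dae-su, Jeong Beom-gu, Jeong Won-heon, Park Gi-su, Lee Tae-hui, Kim Gyeong-hoe
Results of the October 28, 2009 by-elections
- All winners had high blog visibility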
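Characteristics and influence of social media
Case: the October 26 by-election
Network of names mentioned together on Facebook
Early in the campaign, the two candidates were mentioned about equally;
in the middle phase, Park Won-soon and his supporters were mentioned more, while
Na Kyung-won’s supporters dropped out of view;
in the final phase, the network reorganized around Park Won-soon.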
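Comparing centrality in the semantic network
Case: the October 26 by-election
Word frequencies obtained by analyzing the content of
messages about the Seoul mayoral election
From the early phase on, candidate Na Kyung-won’s frequency
declined; apart from late mentions about her contest
with Park Won-soon and the election result,
she was consistently
The Ahn Cheol-soo effect was large early on and faded
after the middle phase, but frequent mentions of the
Grand National Party revealed sentiment
against the ruling party,
which indicates the character of the election.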
As Lim & Park (2011, 2013)
claim, the use of web
mentions of politicians’
names is particularly useful
for hierarchically ranking
However, it may not
sufficiently capture the
entropy probability of an
event (hidden in changing
resulting from the amount of
information conveyed by the
occurrence of that event
Taleb (2012) argues that society
can be conceived as a complex
fabric consisting of the extended
disorder family including
uncertainty, chance, entropy, etc.
Therefore, such a disordered system
can be better understood through
empirical data mining than
through an a priori theorem.
Uncertainty exists when three or
more events take place
simultaneously and is
increasingly beyond the control of
In social and communication
indicators have been widely
used for exploring entropy
values generated from
This “Triple Helix Model”
(THM) can be applied to
the concurrence of a pair
of two or three terms in
the public search engine
Mapping Election Campaigns Through Negative Entropy:
Triple and Quadruple Helix Approach
to Korea’s 2012 Presidential Election
Social media platforms have become a notable venue for Korean
voters wishing to share their opinions and predictions with others
(Park et al., 2011; Sams & Park, 2013).
Politicians have made increasing use of SNSs to provide updates
and communicate with citizens (Hsu & Park, 2012).
With the increasing proliferation of smartphones and portable
computers in Korea, SNSs have been widely used for facilitating
Prior studies have found that Web 1.0 contents tended to contain the
more enduring political and electoral statements of the public in
To better understand the dynamics of the 2012 presidential election
in Korea, this study estimates the web visibility of the three major
candidates— Geun-Hye Park (PARK), Cheol-Soo Ahn (AHN), and
Jae-In Moon (MOON)—in the entire digital sphere.
The total probabilistic entropy (uncertainty) produced by changes in one or
two dimensions is always positive, which is in accordance with the second
law of thermodynamics (Theil, 1972, p. 59).
On the other hand, the relative contribution of each event to the
summation in three or four dimensions can be positive, zero, or negative
This configurational information provides a measure of synergy within a
complex communication system. Network effects occur in a systemic and
nonlinear manner when loops in the configuration generate redundancies
in relationships between three or four events (Leydesdorff, 2008).
Method: Data collection
The number of hits for each search query per media
channel (Facebook, Twitter, and Google) was harvested.
The hit counts obtained from Google.com were
employed to look primarily at entropies represented on a
set of digitally accessible documents (e.g., online
versions of newspapers, online word-of-mouth, Web 1.0
We measured the occurrence and co-occurrence of the
politicians’ names based on their bilateral, trilateral, and
quadruple relationships by using Boolean operators.
For example, we measured the number of web and
social media mentions referring only to PARK (that is, no
mention of AHN, MOON, or the term “president”).
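The exclusive queries could be generated along the following lines. This is a sketch, not the study's actual procedure: the quoted-term and minus-prefix exclusion syntax is an assumption (each search service has its own operators), and the function name `exclusive_queries` is hypothetical.

```python
from itertools import product

TERMS = ["PARK", "AHN", "MOON", "president"]


def exclusive_queries(terms):
    """Build one query per non-empty subset of terms.

    Each query requires the terms in the subset and excludes all the
    others, so the resulting hit counts do not overlap.
    """
    queries = {}
    for mask in product([True, False], repeat=len(terms)):
        included = tuple(t for t, keep in zip(terms, mask) if keep)
        if not included:
            continue  # a query that excludes every term is not useful
        parts = [f'"{t}"' if keep else f'-"{t}"' for t, keep in zip(terms, mask)]
        queries[included] = " ".join(parts)
    return queries


queries = exclusive_queries(TERMS)
print(queries[("PARK",)])
print(len(queries))
```

With four terms this gives 15 non-overlapping queries covering every single, bilateral, trilateral, and quadruple combination.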
Visualization of centrality by SNS medium
Twitter can be very effective at amplifying messages, particularly through its
one-to-many mode of communication (Barash & Golder, 2010).
Twitter is viable both as a political news and communication channel
(González-Bailón, Borge-Holthoefer, Rivero & Moreno, 2011; Hsu &
Park, 2011, 2012; Otterbacher, Shapiro, & Hemphill, 2013)
and as a platform for citizens seeking political participation and engagement
(Hsu, Park, & Park, 2013; Kim & Park, 2011; Tufekci & Wilson, 2012).
The mode of information sharing on Facebook differs from that on Twitter.
Facebook functions as a living room where friends talk to one another.
Facebook can be a mixture of interpersonal and mass channels for the sharing of
informational as well as social messages in a context of political campaign (Bond
et al., 2012; Effing, van Hillegersberg, & Huibers, 2011; Robertson, Vatrapu, &
Medina, 2010; Vitak et al., 2011).
Both Twitter and Facebook communications seem to be biased because the two
platforms have been particularly dominated by the “2040 Generation”, who are
generally categorized as political liberals in Korea (Kwak et al., 2011).
Therefore, it is important to examine which (social) media
conversations are likely to generate more entropy than
others, and for which politician:
RQ 1) What (social) media generate (negative) entropy more than
others across different periods?
RQ 2) Which politician (or which pair of politicians) generates
entropy more than others for bilateral, trilateral, or quadruple
relationships across various media and periods?
Entropy values (expressed as T for transmission)
for bilateral relationships are, by
definition, non-negative. Here T is defined as the
difference in uncertainty when the probability
distributions of two incidents (e.g., i and j) are
combined. The mutual information transmission
capacity, expressed in T values, is measured in
“bits” of information (for a more detailed
mathematical definition, see Leydesdorff, 2003):
$$H_i = -\sum_i p_i \log_2 p_i; \qquad H_{ij} = -\sum_i \sum_j p_{ij} \log_2 p_{ij}$$
$$H_{ij} = H_i + H_j - T_{ij}$$
$$T_{ij} = H_i + H_j - H_{ij}$$
Here $T_{ij}$ is zero if the two distributions are mutually
independent and positive otherwise (Theil, 1972).
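As a worked illustration of the bilateral case, the sketch below computes T from a table of joint counts. The function names and the sample counts are hypothetical; the slides do not show how the hit counts were arranged into a table.

```python
import math


def entropy(probs):
    """Shannon entropy in bits; zero-probability cells contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def transmission(joint_counts):
    """T_ij = H_i + H_j - H_ij for a two-way table of counts.

    joint_counts[i][j] is the number of documents with outcome i on the
    first variable and outcome j on the second.
    """
    total = sum(sum(row) for row in joint_counts)
    p_ij = [[count / total for count in row] for row in joint_counts]
    p_i = [sum(row) for row in p_ij]        # row marginals
    p_j = [sum(col) for col in zip(*p_ij)]  # column marginals
    h_ij = entropy(p for row in p_ij for p in row)
    return entropy(p_i) + entropy(p_j) - h_ij


# Hypothetical counts: rows = first name mentioned (yes/no),
# columns = second name mentioned (yes/no).
print(transmission([[25, 25], [25, 25]]))  # independent mentions
print(transmission([[50, 0], [0, 50]]))    # names always appear together
```

The first table has independent rows and columns, so T is zero; in the second, each variable fully determines the other, so T equals the one bit of entropy in either marginal.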
On the other hand, T values for trilateral and quadruple
relationships can be negative, positive, or zero depending on the
size of contributing terms. Therefore, it is necessary to compare
the absolute value of each (negative) entropy value when entropy
values are calculated for trilateral and quadruple relationships. In
the case of entropy values for trilateral and quadruple
relationships, the higher the absolute entropy value, the more
balanced the communication system is. Let p denote PARK;
a, AHN; and m, MOON and formulate mutual information in these
three dimensions as follows (Abramson, 1963, p. 129):
$$T_{pam} = H_p + H_a + H_m - H_{pa} - H_{pm} - H_{am} + H_{pam}$$
Here we are interested not only in information on mutual
relationships between these three candidates but also in semantic
relationships with respect to the term “president.” Accordingly, we
measure the entropy value by using mutual information in these
four dimensions (here “r” denotes “president”):
$$T_{pamr} = H_p + H_a + H_m + H_r - H_{pa} - H_{pm} - H_{pr} - H_{am} - H_{ar} - H_{mr} + H_{pam} + H_{par} + H_{pmr} + H_{amr} - H_{pamr}$$
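The three- and four-dimensional formulas share one pattern: add the entropies of odd-sized subsets of dimensions and subtract those of even-sized subsets. A minimal sketch of that alternating sum follows, assuming the data are available as counts per joint outcome. The function names and the toy distributions are hypothetical and only show that the sign of T can go either way.

```python
import math
from itertools import combinations


def entropy(counts):
    """Shannon entropy in bits of a distribution given as raw counts."""
    counts = list(counts)
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def marginal_entropy(joint, dims):
    """Entropy of the joint distribution projected onto the chosen dimensions."""
    marginal = {}
    for outcome, count in joint.items():
        key = tuple(outcome[d] for d in dims)
        marginal[key] = marginal.get(key, 0) + count
    return entropy(marginal.values())


def mutual_information(joint, n_dims):
    """Alternating sum: + single entropies, - pair entropies, + triples, ..."""
    t = 0.0
    for size in range(1, n_dims + 1):
        sign = 1 if size % 2 == 1 else -1
        for dims in combinations(range(n_dims), size):
            t += sign * marginal_entropy(joint, dims)
    return t


# Toy joint counts over three yes/no variables (e.g. p, a, m).
xor_like = {(0, 0, 0): 1, (0, 1, 1): 1, (1, 0, 1): 1, (1, 1, 0): 1}
identical = {(0, 0, 0): 1, (1, 1, 1): 1}
print(mutual_information(xor_like, 3))   # negative
print(mutual_information(identical, 3))  # positive
```

With `n_dims=2` the same function reduces to the bilateral T, and with `n_dims=4` it follows the four-dimensional formula term by term.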
Figure 2. Entropy Values Across Media Channels and Time Periods
Figure 3. T Values for Bilateral and Trilateral Relationships on November 3.
Figure 4. T Values for Bilateral Relationships between Park and Moon
Discussion and conclusions
Twitter scored the most negative entropy
values, followed by Facebook. Google came last.
This indicates that Twitter is the most open
The entropy values for liberal candidates (AHN and
MOON) have been higher than those of their conservative
opponent PARK, more so on social media than on Google
This may not be surprising because both Twitter
and Facebook have particularly appealed to
Korean citizens from their late teens to
Discussion and conclusions
PARK’s entropy has been slightly higher on
Google than that of her liberal challenger MOON.
PARK succeeded in garnering strong support
from senior voters: people in their 50s and 60s accounted
for 39% of the population, up from 29% a decade
ago (Wall Street Journal, 2012).
Exit polls also revealed that PARK gained support
from 62% of voters in their 50s and 72% of voters
in their 60s. Indeed, the most significant statistic of
the election was turnout: South Koreans in their
20s, 30s, and 40s voted at rates of
65.2%, 72.5%, and 78.7% respectively, whereas 89.9% of those
in their 50s and 78.8% of those over 60 went to the polling