Webometrics Revisited in Big Data Age_DISC2013


Daegu Gyeongbuk International Social Network Conference

Published in: Technology, Travel
  • It may be more intuitive to look at the graphics. We depicted the politicians’ Twitter network, drawing the mention network over the following-follower network (explanation, if necessary).
  • (conclusion)
  • As shown in Figure 4, Park’s network suggests that users constructed an organized and hierarchical issue network. In addition, productive users who continuously participated in the issue network acted as hubs of user interaction. Red nodes indicate users who were consistently engaged in Park’s network. These users had more communication power and were more centrally positioned in the network than users who participated only temporarily. In other words, their tweets were more likely to be retweeted and to draw responses from others. In Park’s issue network, 7,103 users generated 8,018 retweets, 122 replies, and 22 mentions. Notably, as shown in Figure 5, Lee’s issue network was unique in its topology. As shown in Table 2, the high clusterability of the network indicates that users who tweeted about Lee formed an extremely cohesive network and were more connected to one another than those in Park’s and Moon’s networks. In Lee’s network, a total of 6,292 users produced 7,561 retweets, 208 replies, and 48 mentions. As shown in Figure 6, Moon’s issue network indicates that users with more communication power were not necessarily productive, which differs from the case of Park’s network. In Moon’s network, a total of 5,328 users generated 5,707 retweets, 78 replies, and 24 mentions. Given that he was the major opposition candidate against Park and had public support comparable to hers, the differences between Moon and Park in the number of users and user interactions were substantial (The Press, 2012).
  • http://www.accountancyage.com/IMG/750/145750/regulatory-uncertainty.JPG
  • http://lerablog.org/wp-content/uploads/2013/05/facebook-marketing.jpg
  • Source: http://en.wikipedia.org/wiki/Binary_entropy_function.

    1. 1. Virtual Knowledge Studio (VKS) “Webometrics Studies” Revisited in the Age of “Big Data” Assoc. Prof. Dr. Han Woo PARK, CyberEmotions Research Institute, Dept. of Media & Communication, YeungNam University, 214-1 Dae-dong, Gyeongsan-si, Gyeongsangbuk-do 712-749, Republic of Korea. www.hanpark.net cerc.yu.ac.kr eastasia.yu.ac.kr asia-triplehelix.org
    2. 2. Big data • The term “big data” refers to “analytical technologies that have existed for years but can now be applied faster, on a greater scale and are accessible to more users” (Miller, 2013). • Big data sizes may vary per discipline. • Characteristics: Gartner’s 3Vs plus SAS’s VC and IBM’s Veracity - Volume (amount of data), Velocity (speed of data in and out), Variety (range of data types and sources) - Variability: Data flows can be highly inconsistent, with daily, seasonal, and event-triggered peak data loads - Complexity: Multiple data sources require cleaning, linking, and matching the data across systems - Veracity: 1 in 3 business leaders don’t trust the information they use to make decisions. http://en.wikipedia.org/wiki/Big_data http://www-01.ibm.com/software/data/bigdata/
    3. 3. http://www.emc.com/leadership/digitaluniverse/iview/executive-summary-a-universe-of.htm
    4. 4. http://www.emc.com/leadership/digitaluniverse/iview/images/impact-ofconsumers-lg.jpg
    5. 5. Data-driven Research that focuses on extracting meaningful data from techno-socio-economic systems to discover hidden patterns. Today’s “big” is probably tomorrow’s “medium” and next week’s “small”, and thus the most effective definition of “big data” may be derived when the size of the data itself becomes part of the research problem.
    6. 6. Introduction • Webometrics is broadly defined as the study of web-based content (e.g., text, images, audio-visual objects, and hyperlinks) with primarily quantitative indicators for social science research goals and visualization techniques derived from information science and social network analysis.
    7. 7. • Han Woo Park - “hidden” and “relational” data about lots of people as well as the few individuals, or small groups • Lev Manovich - “surface” data about lots of people (i.e., statistical, mathematical or computational techniques for analyzing data) - “deep” data about the few individuals or small groups (i.e., hermeneutics, participant observation, thick description, semiotics, and close reading)
    8. 8. First type of Webometrics • Hyperlink Network Analysis - Inter-linkage: who-linked-to-whom matrix - Co-inlink: a link to two different nodes from a third node - Co-outlink: a link from two different nodes to a third node (Björneborn, 2003)
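The three link types on this slide can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the tooling used in the studies: the site names and the `links` dictionary are hypothetical, and the helper names `adjacency`, `co_inlinks`, and `co_outlinks` are invented here.

```python
# Hypothetical who-linked-to-whom data: each site maps to the sites it links to.
links = {
    "A": {"B", "C"},
    "B": {"C"},
    "C": set(),
    "D": {"B", "C"},
}
sites = sorted(links)


def adjacency(links, sites):
    """Inter-linkage matrix: m[i][j] is 1 if sites[i] links to sites[j]."""
    return [[1 if t in links[s] else 0 for t in sites] for s in sites]


def co_inlinks(links, a, b):
    """Third-party sites that link to both a and b."""
    return {
        s for s, targets in links.items()
        if s not in (a, b) and a in targets and b in targets
    }


def co_outlinks(links, a, b):
    """Third-party sites that both a and b link to."""
    return (links[a] & links[b]) - {a, b}


matrix = adjacency(links, sites)
print(matrix)
print(co_inlinks(links, "B", "C"))   # sites pointing at both B and C
print(co_outlinks(links, "A", "D"))  # sites that A and D both point at
```

Counting the size of each co-inlink or co-outlink set for every pair of sites gives the weighted networks that the diagrams on the following slides visualize.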
    9. 9. Inter-link network analysis diagram among Korean e-science sites within the public domain. WCU WEBOMETRICS INSTITUTE: Mapping the e-science landscape in South Korea using the Webometrics method
    10. 10. Co-inlink network analysis
    11. 11. Findings: As seen in Figure 4, the network structure shows a clear butterfly pattern. There is one hub (ghism) that belongs to Park Geun-Hye (Park GH, www.cyworld.com/ghism), the daughter of ex-president Park Jeong-Hee and one of two major GNP candidates (along with president-elect Lee MB) in the 2007 presidential race. Figure 4: Cyworld Mini-hompies of Korean legislators. How do social scientists use link data from search engines to understand Internet-based political and electoral communication? INVESTIGATING INTERNET-BASED POLITICS WITH E-RESEARCH TOOLS, Case 2. Cyworld Mini-hompies of Korean Legislators
    12. 12. Sociology of Hyperlink Networks of Web 1.0, Web 2.0, and Twitter A Case Study of South Korea
    13. 13. Introduction ‣ Online & offline lives ➭ co-constructing (e.g. Beer & Burrows, 2007) ‣ Politicians communicate with their constituencies using different platforms ‣ Questions: - What are the structural similarities and/or differences in South Korean politicians’ networks from Web 1.0 to Web 2.0 (and Twitter)? - Are online structures similar to structures in the physical world? - Are online patterns affected by offline relationships? ‣ Related studies conducted: - online social network analysis - online networks in Web 2.0 - role of Twitter in online politics
    14. 14. Web 1.0 (2000 and 2001) ‣ 59 isolated in 2000 ‣ more centralised in 2001 ‣ network of 2001 ➭ a ‘star’ network - might be affected by political events ➭ presidential election in 2001
    15. 15. Web 2.0 (2005 and 2006) ‣ hubs disappearing ‣ easy use of blogs ‣ clear boundaries between different parties ‣ strong presence of GNP Assembly members ➭ party policy on using blogs
    16. 16. Politician Twitter Network (Following and Mention Network)
    17. 17. Conclusion Politicians Twitter Following-follower Network Politicians Twitter Mention Network
    18. 18. Bi-linked network of politically active A-list Korean citizen blogs (July 2005) URI=Centre DLP=Left GNP=Right Just A-list blogs exchanging links with politicians
    19. 19. Affiliation network diagram using pages linked to Lee’s and Park’s sites N = 901 (Lee: 215, Park: 692, Shared: 6)
    20. 20. Tweets on the name of S. Korea president
    21. 21. Viewertariat Networks: A Study of the 2012 South Korean Presidential Debate - Park’s network, Moon’s network
    22. 22. Reply-To Networks of Park’s & Moon’s Facebook page visitors during TV debates
    23. 23. “Those studies perpetuate the idea that linking behaviour is not random, and that links are ‘socially significant in some way’. In this perspective, links have an ‘information side-effect’, they can be used to understand other facts even though they were not individually designed to do so: ‘information side-effects are by-products of data intended for one use which can be mined in order to understand some tangential, and possibly larger scale, phenomena’”
    24. 24. Park and his colleagues were extensively cited: 9 times!
        • Barnett GA, Chung CJ and Park HW (2011) Uncovering transnational hyperlink patterns and web mediated contents: a new approach based on cracking .com domain. Social Science Computer Review 29(3): 369–384.
        • Hsu C and Park HW (2011) Sociology of hyperlink networks of Web 1.0, Web 2.0, and Twitter: a case study of South Korea. Social Science Computer Review 29(3): 354–368.
        • Park HW (2003) Hyperlink network analysis: a new method for the study of social structure on the web. Connections 25(1): 49–61.
        • Park HW (2010) Mapping the e-science landscape in South Korea using the webometrics method. Journal of Computer-Mediated Communication 15(2): 211–229.
        • Park HW and Jankowski NW (2008) A hyperlink network analysis of citizen blogs in South Korean politics. Javnost: The Public 15(2): 5–16.
        • Park HW and Thelwall M (2003) Hyperlink analyses of the World Wide Web: a review. Journal of Computer-Mediated Communication 8(4).
        • Park HW and Thelwall M (2008) Developing network indicators for ideological landscapes from the political blogosphere in South Korea. Journal of Computer-Mediated Communication 13(4): 856–879.
        • Park HW, Kim C and Barnett GA (2004) Socio-communicational structure among political actors on the web in South Korea. New Media & Society 6(3): 403–423.
        • Park HW, Thelwall M and Kluver R (2005) Political hyperlinking in South Korea: technical indicators of ideology and content. Sociological Research Online 12(3).
    25. 25. A comment from those who are NOT doing a hyperlink analysis • In a chapter of The Sage Handbook of Online Research Methods edited by Fielding et al. (2008), Horgan emphasizes that ‘link analysis’ has become an active research domain in examining social behavior online.
    26. 26. A threat to Webometrics • The key application in this area is to collect incoming, outgoing, inter-linking, and co-linking data from search engines - AltaVista in the early 2000s - Yahoo renewed AltaVista’s hyperlink commands via “Site Explorer” and its API - Yahoo discontinued its API option for interlinkage data in April 2011, and finally stopped its popular Site Explorer service in November 2011
    27. 27. http://cybermetrics.wlv.ac.uk/QueriesForWebometrics.htm
    28. 28. A new proposal • Mike Thelwall - URL citation searches with the Bing search API facilities • Liwen Vaughan - Incoming hyperlinks from Alexa.com Can these "alternative" techniques be acceptable for scientific publishing?
    29. 29. A new proposal: SEO Tools • Search Engine Optimization Tools - http://www.majesticseo.com/ - http://www.opensiteexplorer.org/ - https://ahrefs.com/ • Enrique Orduña-Malea & John J. Regazzi (2013). Influence of the academic Library on U.S. university reputation: a webometric approach. Technologies, 1, 26–43, http://www.mdpi.com/2227-7080/1/2/26
    30. 30. Webometrics Ranking of World Universities: The link visibility data is collected from the two most important providers of this information, Majestic SEO and ahrefs. Both use their own crawlers, generating different databases that should be used jointly for filling gaps or correcting mistakes. The indicator is the product of the square root of the number of backlinks and the number of domains originating those backlinks, so link diversity matters even more than link popularity. The maximum of the normalized results is the impact indicator. http://www.webometrics.info/en/Methodology
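As a rough sketch of the indicator described on this slide, the Python below computes it for made-up data. Several points are assumptions rather than the ranking's documented procedure: the formula is read as sqrt(backlinks) × domains, normalization is taken to mean dividing by the largest value per provider, and all university names and counts are hypothetical.

```python
import math

# Hypothetical (backlinks, referring domains) per university from two providers.
majestic = {"Univ A": (10_000, 400), "Univ B": (2_500, 900), "Univ C": (400, 50)}
ahrefs = {"Univ A": (8_100, 500), "Univ B": (3_600, 600), "Univ C": (900, 100)}


def raw_indicator(backlinks, domains):
    """Assumed reading of the slide: sqrt(backlinks) times number of domains."""
    return math.sqrt(backlinks) * domains


def normalize(scores):
    """Assumed normalization: divide every score by the provider's maximum."""
    top = max(scores.values())
    return {name: value / top for name, value in scores.items()}


def impact(provider_a, provider_b):
    """Impact indicator: the larger of the two normalized provider scores."""
    norm_a = normalize({u: raw_indicator(*v) for u, v in provider_a.items()})
    norm_b = normalize({u: raw_indicator(*v) for u, v in provider_b.items()})
    return {u: max(norm_a[u], norm_b[u]) for u in provider_a}


scores = impact(majestic, ahrefs)
for name, value in sorted(scores.items()):
    print(f"{name}: {value:.3f}")
```

In this toy data Univ B has far fewer backlinks than Univ A in the first provider but more referring domains, and it still reaches the top normalized score there, which is the link-diversity effect the slide points to.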
    31. 31. Interlinkage among world universities • Barnett, G.A., Park, H. W., Jiang, K., Tang, C., & Aguillo, I. F. (2013, forthcoming). A Multi-Level Network Analysis of Web-Citations Among the World’s Universities. Scientometrics. • Isidro F. Aguillo: “Large interlinking matrix (1000*1000) are no longer possible to obtain. Perhaps national academic systems (200 or 300 institutions)”
    32. 32. Intentional inattention among Information Scientists? • Robert Ackland (2013). Web Social Science. - http://voson.anu.edu.au/ • Richard Rogers (2013). Digital Methods. - https://www.issuecrawler.net/index.php - https://www.digitalmethods.net/Dmi/ToolDatabase
    33. 33. Let us move to Web Visibility Analysis: Frequently occurring key words in e-science webpages in Korea. Created on Many Eyes (http://many-eyes.com). Words are larger according to the frequency of their occurrence, but their positions are randomly chosen for the best visualization.
    34. 34. Websites retrieved more than two times. Note: Websites are larger according to their frequency of retrieval; however, their colors and locations are randomly chosen for the best visualization.
    35. 35. 2nd type of Webometrics: Web Visibility • Web visibility as an indicator of online political power • Presence or appearance of actors or issues being discussed by the public (Internet users) on the web. Tracking web visibility is a powerful way to gain insight into public reactions to actors or issues. • Recent studies indicate a positive relationship between politicians’ web visibility level and election results. • Also, the co-occurrence web visibility between two politicians represents their hidden online political relationships based on public perception.
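A small Python sketch may make the two visibility measures concrete: single-name visibility (how many documents mention a politician) and co-occurrence visibility (how many mention two politicians together). The documents and names are invented for illustration, and the simple whitespace tokenization is an assumption; the studies on these slides relied on search-engine hit counts rather than a local corpus.

```python
from collections import Counter
from itertools import combinations

# Hypothetical documents (e.g., blog posts) and politician names.
docs = [
    "Kim and Lee debated the budget",
    "Lee visited the market",
    "Kim, Lee and Choi met voters",
    "Choi gave a speech",
]
names = ["Kim", "Lee", "Choi"]


def visibility(docs, names):
    """Return (single-name counts, pairwise co-occurrence counts) per document."""
    single = Counter()
    pair = Counter()
    for text in docs:
        words = set(text.replace(",", " ").split())
        present = sorted(n for n in names if n in words)
        single.update(present)
        pair.update(combinations(present, 2))
    return single, pair


single, pair = visibility(docs, names)
print(single)  # documents mentioning each name
print(pair)    # documents mentioning both names of a pair
```

The pair counts can be read as edge weights of a co-occurrence network among politicians, which is the kind of structure the next slide reports.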
    36. 36. Results – Web Visibility (co-occurrence)
    37. 37. Results – Correlation & Path Analysis
        Correlation             1 (N=278)   2 (N=278)   3 (N=234)
        1 Finance               1           0.420**     0.101
        2 Web                               1           0.184**
        3 Vote                                          1
        Spearman Correlation    1 (N=278)   2 (N=278)   3 (N=234)
        1 Finance               1           0.513**     0.090
        2 Web                               1           0.163*
        3 Vote                                          1
        Political finance’s indirect effect = .076
        Note. * p<.05, ** p<.01
    38. 38. Results – QAP Correlation (upper-triangular matrix among eight variables). Variables: 1 Committee, 2 Constituency, 3 Party, 4 Gender, 5 Age, 6 Incumbent, 7 Web, 8 Finance. Coefficients in the order they appear on the slide (diagonal entries of 1 omitted): 0.004, -0.016, 0.025, -0.074**, 0.045**, -0.037**, 0.097**, -0.007, -0.043**, -0.064**, 0.105**, -0.119**, -0.045*, -0.050*, 0.242**, -0.094**, 0.024, 0.031, 0.179**, -0.051*, 0.049*, 0.098**, 0.027, -0.021, 0.041, -0.060**, -0.224**, -0.158**. Note. * p<.05, ** p<.01
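    39. 39. Using e-research tools: web visibility analysis • In the blogosphere, candidates’ web visibility levels were closely correlated with the number of votes they received (Lim & Park, 2010, JKDAS). Chart: actual number of votes and average number of blogs per candidate; values shown: 29,120, 19,427, 14,218, 3,071, 2,125, 504; candidates: 경대수, 정범구, 정원헌, 박기수, 이태희, 김경회
    40. 40. Results of the October 28, 2009 by-elections - all winners had high blog visibility
    41. 41. I. Characteristics and influence of social media: the case of the October 26 by-elections • (2) Analysis of a name network built from names mentioned together on Facebook • Early in the campaign the two candidates were mentioned about equally; by the middle, Park Won-soon (박원순) and his supporters were being mentioned while supporters of candidate Na Kyung-won (나경원) dropped out of view; by the end, the network had reorganized around Park Won-soon
    42. 42. I. Comparing centrality in the semantic network: the case of the October 26 by-elections (2) • Frequency analysis of the words found in messages about the Seoul mayoral election • Candidate Na Kyung-won’s frequency declined from the start, and she stayed on the periphery of the discourse throughout, except late in the campaign when she appeared in talk about the contest with candidate Park Won-soon and the election result • The Ahn Cheol-soo (안철수) effect was large early on and faded from mid-campaign, but frequent mentions of the Grand National Party (한나라당) revealed sentiment against the ruling party, which speaks to the character of the election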
    43. 43. • As Lim & Park (2011, 2013) claim, the use of web mentions of politicians’ names is particularly useful for hierarchically ranking individual politicians. • However, it may not sufficiently capture the probabilistic entropy of an event (hidden in changing communication structures), that is, the amount of information conveyed by the occurrence of that event (Shannon, 1948).
    44. 44. • Taleb (2012) argues that society can be conceived as a complex fabric consisting of the extended disorder family, including uncertainty, chance, entropy, etc. • Therefore, such a disordered system is better derived from empirical data mining than obtained from an a priori theorem. • Uncertainty exists when three or more events take place simultaneously and is increasingly beyond the control of individual events (Leydesdorff, 2008).
    45. 45. • In social and communication sciences, entropy-based indicators have been widely used for exploring entropy values generated from university-industry-government (UIG) relationships. • This “Triple Helix Model” (THM) can be applied to the co-occurrence of two or three terms in a public search engine database.
    46. 46. Mapping Election Campaigns Through Negative Entropy: Triple and Quadruple Helix Approach to Korea’s 2012 Presidential Election • Social media platforms have become a notable venue for Korean voters wishing to share their opinions and predictions with others (Park et al., 2011; Sams & Park, 2013). • Politicians have made increasing use of SNSs to provide updates and communicate with citizens (Hsu & Park, 2012). • With the increasing proliferation of smartphones and portable computers in Korea, SNSs have been widely used for facilitating political discourse. • Prior studies have found that Web 1.0 contents tended to contain the more enduring political and electoral statements of the public in various contexts.
    47. 47. Introduction • To better understand the dynamics of the 2012 presidential election in Korea, this study estimates the web visibility of the three major candidates, Geun-Hye Park (PARK), Cheol-Soo Ahn (AHN), and Jae-In Moon (MOON), in the entire digital sphere.
    48. 48. Literature Review • The total probabilistic entropy (uncertainty) produced by changes in one or two dimensions is always positive, which is in accordance with the second law of thermodynamics (Theil, 1972, p. 59). • On the other hand, the relative contribution of each event to the summation in three or four dimensions can be positive, zero, or negative (configurational information). • This configurational information provides a measure of synergy within a complex communication system. Network effects occur in a systemic and nonlinear manner when loops in the configuration generate redundancies in relationships between three or four events (Leydesdorff, 2008).
    49. 49. Method: Data collection • The number of hits for each search query per media channel (Facebook, Twitter, and Google) was harvested. • The hit counts obtained from Google.com were employed to look primarily at entropies represented in a set of digitally accessible documents (e.g., online versions of newspapers, online word-of-mouth, Web 1.0 contents, etc.). • We measured the occurrence and co-occurrence of the politicians’ names based on their bilateral, trilateral, and quadruple relationships by using Boolean operators. • For example, we measured the number of web and social media mentions referring only to PARK (that is, with no mention of AHN, MOON, or the term “president”).
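The Boolean query design described on this slide can be sketched as follows. This is only an illustration: the `AND` and minus-sign syntax is an assumption (search engines differ in how they express inclusion and exclusion), the term list uses the slide's labels as placeholders for the actual search strings, and `exclusive_queries` is a hypothetical helper name.

```python
from itertools import combinations

# Placeholders for the actual search terms (three candidates plus "president").
TERMS = ["PARK", "AHN", "MOON", "president"]


def exclusive_queries(terms):
    """Build one query per non-empty subset: require the subset, exclude the rest."""
    queries = {}
    for size in range(1, len(terms) + 1):
        for subset in combinations(terms, size):
            include = " AND ".join(f'"{t}"' for t in subset)
            exclude = " ".join(f'-"{t}"' for t in terms if t not in subset)
            queries[subset] = f"{include} {exclude}".strip()
    return queries


queries = exclusive_queries(TERMS)
print(len(queries))           # one query per non-empty subset of the four terms
print(queries[("PARK",)])     # mentions of PARK only
print(queries[("PARK", "MOON")])
```

Each query's hit count then fills one cell of the occurrence and co-occurrence table for a given media channel and date.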
    50. 50. Visualization of centrality by SNS medium
    51. 51. Literature Review • Twitter can be very effective at amplifying messages, particularly through its one-to-many mode of communication (Barash & Golder, 2010). • Twitter is viable both as a political news and communication channel (González-Bailón, Borge-Holthoefer, Rivero & Moreno, 2011; Hsu & Park, 2011, 2012; Otterbacher, Shapiro, & Hemphill, 2013) and as a platform for citizens seeking political participation and engagement (Hsu, Park, & Park, 2013; Kim & Park, 2011; Tufekci & Wilson, 2012).
    52. 52. Literature Review • The mode of information sharing on Facebook differs from that on Twitter. Facebook functions as a living room where friends talk to one another. • Facebook can be a mixture of interpersonal and mass channels for the sharing of informational as well as social messages in the context of political campaigns (Bond et al., 2012; Effing, van Hillegersberg, & Huibers, 2011; Robertson, Vatrapu, & Medina, 2010; Vitak et al., 2011). • Both Twitter and Facebook communications seem to be biased because the two platforms have been particularly dominated by the “2040 Generation”, who are generally categorized as political liberals in Korea (Kwak et al., 2011).
    53. 53. Research questions • Therefore, it is important to examine which (social) media conversations are more likely to generate entropy than others, and for which politician: • RQ 1) Which (social) media generate more (negative) entropy than others across different periods? • RQ 2) Which politician (or which pair of politicians) generates more entropy than others in bilateral, trilateral, or quadruple relationships across various media and periods?
    54. 54. Method: Measuring (negative) entropy • Figure 1. Binary Entropy Plot
    55. 55. • Entropy values (expressed as T for transmission) for bilateral relationships are, by definition, non-negative. Here T is defined as the difference in uncertainty when the probability distributions of two incidents (e.g., i and j) are combined. The mutual information transmission capacity, expressed in T values, is measured in bits of information (for a more detailed mathematical definition, see Leydesdorff, 2003): $H_i = -\sum_i p_i \log_2 p_i$; $H_{ij} = -\sum_i \sum_j p_{ij} \log_2 p_{ij}$; $H_{ij} = H_i + H_j - T_{ij}$, so that $T_{ij} = H_i + H_j - H_{ij}$ (1) • Here $T_{ij}$ is zero if the two distributions are mutually independent and positive otherwise (Theil, 1972).
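Equation (1) can be checked numerically with a short Python sketch. The contingency tables below are hypothetical, and treating each relationship as a table of counts (rows for incident i, columns for incident j) is an assumption made for illustration; how the study mapped hit counts to probabilities is not shown on the slide.

```python
import math


def entropy(probs):
    """Shannon entropy in bits; zero-probability cells contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def transmission(table):
    """T_ij = H_i + H_j - H_ij for a 2-D table of counts (rows: i, columns: j)."""
    total = sum(sum(row) for row in table)
    p_ij = [[count / total for count in row] for row in table]
    p_i = [sum(row) for row in p_ij]          # marginal distribution of i
    p_j = [sum(col) for col in zip(*p_ij)]    # marginal distribution of j
    h_ij = entropy(p for row in p_ij for p in row)
    return entropy(p_i) + entropy(p_j) - h_ij


independent = [[25, 25], [25, 25]]  # i tells us nothing about j
dependent = [[50, 0], [0, 50]]      # i and j always occur together or not at all

print(entropy([0.5, 0.5]))          # peak of the binary entropy curve
print(transmission(independent))
print(transmission(dependent))
```

The independent table gives zero transmission and the fully dependent one gives one bit, which matches the statement that $T_{ij}$ is zero for independent distributions and positive otherwise.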
    56. 56. • On the other hand, T values for trilateral and quadruple relationships can be negative, positive, or zero depending on the size of the contributing terms. Therefore, it is necessary to compare the absolute value of each (negative) entropy value when entropy values are calculated for trilateral and quadruple relationships. In the case of entropy values for trilateral and quadruple relationships, the higher the absolute entropy value, the more balanced the communication system is. Let p denote PARK; a, AHN; and m, MOON, and formulate mutual information in these three dimensions as follows (Abramson, 1963, p. 129): $T_{pam} = H_p + H_a + H_m - H_{pa} - H_{pm} - H_{am} + H_{pam}$ (2) • Here we are interested not only in information on mutual relationships between these three candidates but also in semantic relationships with respect to the term “president.” Accordingly, we measure the entropy value by using mutual information in these four dimensions (here “r” denotes “president”): $T_{pamr} = H_p + H_a + H_m + H_r - H_{pa} - H_{pm} - H_{pr} - H_{am} - H_{ar} - H_{mr} + H_{pam} + H_{par} + H_{pmr} + H_{amr} - H_{pamr}$ (3)
    57. 57. Results • Figure 2. Entropy Values Across Media Channels and Time Periods
    58. 58. Results • Figure 3. T Values for Bilateral and Trilateral Relationships on November 3
    59. 59. Results • Figure 4. T Values for Bilateral Relationships between Park and Moon
    60. 60. Discussion and conclusions • Twitter scored the most negative entropy values, followed by Facebook; Google came last. This indicates that Twitter is the most open communication system. • The entropy values for the liberal candidates (AHN and MOON) have been higher than those for their conservative opponent PARK on social media than in the Google sphere. • This may not be surprising because both Twitter and Facebook have particularly appealed to Korean citizens from their late teens to their early 40s.
    61. 61. Discussion and conclusions • PARK’s entropy has been slightly higher on Google than that of her liberal challenger MOON. • Park was successful in garnering strong support from senior voters; those in their 50s and 60s accounted for 39% of the population, up from 29% a decade ago (Wall Street Journal, 2012). • The exit poll also revealed that PARK gained support from 62% of voters in their 50s and 72% of voters in their 60s. Indeed, the most significant statistic of the election was turnout: 65.2%, 72.5%, and 78.7% of South Koreans in their 20s, 30s, and 40s voted, while 89.9% of those in their 50s and 78.8% of those over 60 went to the polling booth.