International Collaboration Networks in the Emerging (Big) Data Science


Presentation at the 2013 Korea Data Science inaugural symposium - Prof. Han Woo Park, Yeungnam University

Published in: Technology, Education

  1. 1. International Collaboration Networks in the Emerging (Big) Data Science HanWoo Park Dept. of Media & Communication YeungNam University 214-1 Dae-dong, Gyeongsan-si, Gyeongsangbuk-do 712-749 Republic of Korea Loet Leydesdorff Amsterdam School of Communication Research (ASCoR) University of Amsterdam Kloveniersburgwal 48, 1012 CX Amsterdam, The Netherlands This presentation is based on Park, H.W., & Leydesdorff, L. (2013 forthcoming). Decomposing Social and Semantic Networks in Emerging “Big Data” Research. Journal of Informetrics*.
  2. 2. [Contents] 1. Concept and characteristics of big data 2. Background of data science 3. (Big) data R&D trends 4. Social issues and implications
  3. 3. Big data • The term “big data” refers to “analytical technologies that have existed for years but can now be applied faster, on a greater scale and are accessible to more users” (Miller, 2013). • Big data sizes may vary per discipline. • Characteristics: Gartner’s 3Vs plus SAS’s VC and IBM’s Veracity - Volume (amount of data), Velocity (speed of data in and out), Variety (range of data types and sources) - Variability: Data flows can be highly inconsistent with daily, seasonal, and event-triggered peak data loads - Complexity: Multiple data sources requiring cleaning, linking, and matching the data across systems - Veracity: 1 in 3 business leaders don’t trust the information they use to make decisions.
  4. 4. Data-driven Research that focuses on extracting meaningful data from techno-socio-economic systems to discover some hidden patterns.
  5. 5. [Contents] 1. Concept and characteristics of big data 2. Background of data science 3. (Big) data R&D trends 4. Social issues and implications
  6. 6. “Data Science” refers to “a discipline that incorporates varying elements and builds on techniques and theories from many fields, including data visualization with the goal of extracting meaning from data and creating data products.”
  7. 7. Today’s “big” is probably tomorrow’s “medium” and next week’s “small” and thus the most effective definition of “big data” may be derived when the size of data itself becomes part of the research problem. Loukides (2012)
  8. 8. Origin of Data Science • One origin is Peter Naur’s 1974 book “Concise Survey of Computer Methods”, a survey of contemporary data processing methods in a wide range of applications (Gilpress, 2012). • The other is the first appearance of the term “big data” in the Scopus database in 1970 (Halevi and Moed, 2012). There has been no particular key milestone since the 1970s. • During the 1990s, the term was usually related to computer modeling and software development for large datasets. Knowledge Discovery and Data Mining in 1997. Rousseau (2012) also regards the 1993 publication as the first document indexed in the Web version of Web of Science.
  9. 9. A more recent development was made with the establishment of journals that included the term “Data Science” in their titles: • Data Science Journal in 2002 • Journal of Data Science in 2003 • EPJ Data Science in 2012 • Journal of Big Data in 2013 • GigaScience in 2012
  10. 10. Science published a special issue (February 11, 2011) looking broadly at increasingly data-driven research efforts as a scientific domain (Science staff, 2011). Data Science is composed of interrelated clusters of research tasks. For example, the technologies on data collection, curation, and access, and the unique skill sets have increasingly been central to Data Science (Science staff, 2011).
  11. 11. An international conference called “Data Science Summit” (
  12. 12. Re-cited from
  13. 13. All models are wrong but some are useful Emergence of data author on dataverse
  14. 14. Anderson’s claims • Data is everything we need. • We don't have to settle for models. • Agnostic statistics. • Out with every theory of human behavior. • This approach to science (hypothesize, model, test) is becoming obsolete. • Petabytes allow us to say: "Correlation is enough." We can stop looking for models. • What can science learn from Google? E-Science.
  15. 15. Computational (Social) Science Park, H.W., & Leydesdorff, L. (2013 Work-In-Progress). Decomposing a Data-Driven Science Using a Scientometric Method. • Focus on the methodological perspective based on the use of new digital tools to manage the data deluge. • Development of e-science tools to automate the research process. • Experimentation with new types of data visualization.
  16. 16. php?title=Online_Research
  17. 17. Why Data Science? Savage and Burrows (2007, p. 886) lament, “Fifty years ago, academic social scientists might be seen as occupying the apex of the – generally limited – social science research ‘apparatus’. Now they occupy an increasingly marginal position in the huge research infrastructure”. Bonacich, P. (2004). The Invasion of the Physicists. Social Networks 26(3): 285-288
  18. 18. This approach to science is attributed to the late Jim Gray, one of the most influential computer scientists, at Microsoft.
  19. 19. “The fourth paradigm” Research purpose lies in handling huge amounts of data from technological, sociological, and economic systems to discover some hidden patterns. Jim Gray
  20. 20. Global Communication 2team Challenges of (big) data science The end of theory: evidence-based management Jeffrey Pfeffer, Robert I. Sutton (2006) How companies can bolster performance and trump the competition through evidence-based management, an approach to decision-making and action that is driven by hard facts rather than half-truths or hype. · With the rise of big data, traditional scientific research methodology is losing ground · Data beyond the threshold of human perception (patterns, not facts)
  21. 21. The Signal and the Noise: Why Most Predictions Fail but Some Don't. Nate Silver I do not go as far as a Popper in asserting that such theories are therefore unscientific or that they lack any value. However, the fact that the few theories we can test have produced quite poor results suggests that many of the ideas we haven’t tested are very wrong as well. We are undoubtedly living with many delusions that we do not even realize. page 15
  22. 22. OECD (2012). OECD Technology Foresight Forum 2012 - Harnessing data as a new source of growth: Big data analytics and policies. OECD Headquarters, Paris, France, 22 October 2012
  23. 23. Big data and the end of theory? • Does big data have the answers? Maybe some, but not all, says Mark Graham • In 2008, Chris Anderson, then editor of Wired, wrote a provocative piece titled The End of Theory. Anderson was referring to the ways that computers, algorithms, and big data can potentially generate more insightful, useful, accurate, or true results than specialists or domain experts who traditionally craft carefully targeted hypotheses and research strategies. • We may one day get to the point where sufficient quantities of big data can be harvested to answer all of the social questions that most concern us. I doubt it though. There will always be digital divides; always be uneven data shadows; and always be biases in how information and technology are used and produced. • And so we shouldn't forget the important role of specialists to contextualize and offer insights into what our data do, and maybe more importantly, don't tell us.
  24. 24. [Contents] 1. Concept and characteristics of big data 2. Background of data science 3. (Big) data R&D trends 4. Social issues and implications
  25. 25. Number of “Big data” papers per year Halevi, G., & Moed, H. F. (2012).
  26. 26. Rousseau (2012) We performed a similar search in the WoS (TS=“Big data”) on October 2, 2012, leading to 142 articles. We removed the oldest one (1974), and kept 141 published during the period 1993-2012. Halevi and Moed observed an over-exponential growth over the period 1970-2011, while we found a growth curve that could best be described by a cubic polynomial (R² = 0.963, with year 1992 = 0), which is illustrated in Fig. 1.
  27. 27. Subject areas researching Big Data Halevi, G., & Moed, H. F. (2012).
  28. 28. Rousseau (2012)
  29. 29. Geographical Distribution of Big Data papers Halevi, G., & Moed, H. F. (2012).
  30. 30. Rousseau (2012)
  31. 31. Phrase map of highly occurring keywords 1999-2005 Halevi, G., & Moed, H. F. (2012).
  32. 32. Phrase map of highly occurring keywords 2006-2012 Halevi, G., & Moed, H. F. (2012).
  33. 33. Park, H. W., & Leydesdorff, L. (2013 Work-In-Progress). Decomposing a Data-Driven Science Using a Scientometric Method. • However, Halevi and Moed (2012) and Rousseau (2012) rely on descriptive statistics. We therefore add the network perspective, in terms of both social (co-authorship) and semantic networks. • Furthermore, we extend search queries to various terminologies related to Data Science because the term “big data” is regarded only as one among a list of policy priority issues. • We show where the research system in Data Science is “hot” in terms of international collaborations and prevailing semantics.
  34. 34. Problem Statement Previous studies have not systematically examined whether research efforts driven by various sources of big data are really becoming increasingly widespread across the world. Further, the status of the literature based on big data has not been extensively discussed or sufficiently examined with respect to its semantic variations, disciplinary scope, institutional adoption, and international collaboration.
  35. 35. • We employed a method rooted in social network analysis (SNA) (Hanneman & Riddle, 2005). • Here the unit of analysis is often the node, which refers to a point in a network where ties cross or connect nodes. • A tie is a connection between parts (i.e., nodes) in a network. • We considered countries as nodes and a tie as the number of papers co-authored by a pair of researchers with different addresses in terms of their country of origin.
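The node and tie definitions on this slide can be illustrated with a minimal sketch. It assumes whole counting (each paper adds 1 to every pair of distinct countries among its author addresses), which the slides do not specify; the function name `country_ties` and the sample records are hypothetical.

```python
from collections import Counter
from itertools import combinations


def country_ties(papers):
    """Count, for each pair of countries, the papers they co-authored.

    Assumption: whole counting, so a paper adds 1 to a country pair
    no matter how many authors from each country it lists.
    """
    ties = Counter()
    for countries in papers:
        # Deduplicate so a country counts once per paper.
        for a, b in combinations(sorted(set(countries)), 2):
            ties[(a, b)] += 1
    return ties


# Hypothetical records: the country of each author address on a paper.
papers = [
    ["USA", "GERMANY"],
    ["USA", "USA", "FRANCE"],
    ["USA", "GERMANY", "FRANCE"],
    ["SOUTH_KOREA"],  # single-country paper: contributes no tie
]

ties = country_ties(papers)
for (a, b), n in sorted(ties.items()):
    print(a, b, n)
```

Single-country papers add nothing to the network, so only internationally co-authored papers shape the ties.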
  36. 36. • We considered papers published in SCI journals in 2011. • We selected three types of documents: journal articles, letters, and reviews. • We obtained the data from the DVD version of the SCI database by using several search terms based on titles, author key words, and keyword-plus.
  37. 37. As expected, the global co-authorship network was far denser than the subnetwork, that is, co-authorship in big data research. Note that these were not really co-authorship relationships between countries but relationships between them measured in terms of co-authorship relationships. The sum of ties in the global network and that of the subnetwork were 1,073,764 and 10,798, respectively. In addition, the global network was more centralized around hub countries than the network of big data science in terms of all three measures of centrality. However, the QAP correlation between the whole 2011 co-authorship network and big data research demonstrates their significant relationship: this (Pearson) correlation was .740 (p < .001).
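A QAP correlation of the kind reported on this slide can be sketched as follows. This is a generic illustration of the technique, not the authors' procedure: the two 5-country matrices are invented, the function names are hypothetical, and the number of permutations and the one-sided p-value are assumptions.

```python
import random
from math import sqrt


def offdiag(m):
    """Flatten a square matrix, skipping the diagonal."""
    n = len(m)
    return [m[i][j] for i in range(n) for j in range(n) if i != j]


def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def qap_correlation(a, b, permutations=1000, seed=0):
    """Pearson correlation of two networks with a permutation p-value.

    Rows and columns of `a` are relabelled together, which keeps the
    network structure intact while breaking its alignment with `b`.
    """
    rng = random.Random(seed)
    n = len(a)
    flat_b = offdiag(b)
    observed = pearson(offdiag(a), flat_b)
    hits = 0
    for _ in range(permutations):
        p = list(range(n))
        rng.shuffle(p)
        permuted = [[a[p[i]][p[j]] for j in range(n)] for i in range(n)]
        if pearson(offdiag(permuted), flat_b) >= observed:
            hits += 1
    # One-sided p-value; the +1 counts the observed arrangement itself.
    return observed, (hits + 1) / (permutations + 1)


# Invented symmetric co-authorship counts for five countries.
global_net = [
    [0, 9, 7, 1, 0],
    [9, 0, 6, 1, 0],
    [7, 6, 0, 2, 1],
    [1, 1, 2, 0, 0],
    [0, 0, 1, 0, 0],
]
big_data_net = [
    [0, 3, 2, 0, 0],
    [3, 0, 2, 0, 0],
    [2, 2, 0, 1, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 0, 0, 0],
]

r, p_value = qap_correlation(global_net, big_data_net)
print(round(r, 3), p_value)
```

Permuting node labels rather than individual cells is what separates QAP from an ordinary significance test, because the cells of a network matrix are not independent observations.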
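  38. 38. Comparison of Density and Centralization Values (N=201)

      | Network Type | Density (S.D.) | Centralization (%): Degree | Centralization (%): Node | Centralization (%): Flow |
      |---|---|---|---|---|
      | Global | 26.71 (245.70) | 5.11 | 10.08 | 9.83 |
      | Big Data | 0.01 (0.18) | 4.37 | 2.70 | 2.28 |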
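The two kinds of measure in this comparison can be sketched on a toy network. The sketch assumes that density is the mean tie value over all ordered pairs of distinct nodes (with the population standard deviation) and that degree centralization follows Freeman's formula on binarized ties; the software settings behind the reported values are not stated in the slides, and the function names and the 4-node matrix are hypothetical.

```python
from statistics import mean, pstdev


def valued_density(m):
    """Mean and population S.D. of tie values over ordered pairs i != j."""
    n = len(m)
    values = [m[i][j] for i in range(n) for j in range(n) if i != j]
    return mean(values), pstdev(values)


def degree_centralization(m):
    """Freeman degree centralization (%) on binarized, undirected ties."""
    n = len(m)
    degrees = [sum(1 for j in range(n) if j != i and m[i][j] > 0)
               for i in range(n)]
    top = max(degrees)
    # (n - 1) * (n - 2) is the maximum possible sum, reached by a star.
    return 100 * sum(top - d for d in degrees) / ((n - 1) * (n - 2))


# Invented star network: node 0 co-authors with everyone else.
star = [
    [0, 3, 1, 2],
    [3, 0, 0, 0],
    [1, 0, 0, 0],
    [2, 0, 0, 0],
]

dens, sd = valued_density(star)
cent = degree_centralization(star)
print(dens, round(sd, 4), cent)
```

Under this density definition, the sum of ties reported for the global network, 1,073,764, divided by the 201 × 200 ordered country pairs gives about 26.71, which agrees with the density listed for the global network.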
  39. 39. Table 4. Centrality Values for Countries

      | Rank | Country | Degree | Country | Betweenness | Country | FlowBet |
      |---|---|---|---|---|---|---|
      | 1 | U.S. | 4.450 | U.S. | 2.734 | USA | 2.309 |
      | 2 | GERMANY | 1.650 | FRANCE | 1.253 | FRANCE | 0.929 |
      | 3 | U.K. | 1.600 | U.K. | 0.680 | CANADA | 0.537 |
      | 4 | FRANCE | 1.400 | CANADA | 0.643 | ITALY | 0.510 |
      | 5 | AUSTRALIA | 1.150 | ITALY | 0.620 | UK | 0.377 |
      | 6 | NETHERLANDS | 1.150 | AUSTRALIA | 0.602 | SOUTH_KOREA | 0.359 |
      | 7 | CHINA | 1.100 | SOUTH_KOREA | 0.346 | BELGIUM | 0.331 |
      | 8 | DENMARK | 0.950 | GERMANY | 0.291 | AUSTRALIA | 0.328 |
      | 9 | CANADA | 0.900 | BELGIUM | 0.290 | JAPAN | 0.262 |
      | 10 | TAIWAN | 0.850 | PORTUGAL | 0.266 | SLOVENIA | 0.200 |
      | 11 | ISRAEL | 0.750 | JAPAN | 0.256 | PORTUGAL | 0.185 |
      | 12 | SOUTH_KOREA | 0.750 | CHINA | 0.137 | CHINA | 0.132 |
      | 13 | SWEDEN | 0.750 | NETHERLANDS | 0.104 | SPAIN | 0.129 |
      | 14 | ITALY | 0.700 | DENMARK | 0.099 | GERMANY | 0.108 |
      | 15 | PORTUGAL | 0.700 | SAUDI_ARABIA | 0.088 | MALAYSIA | 0.103 |
      | 16 | IRELAND | 0.650 | SLOVENIA | 0.068 | TANZANIA | 0.095 |
      | 17 | NORWAY | 0.650 | TAIWAN | 0.057 | VENEZUELA | 0.095 |
      | 18 | SPAIN | 0.650 | SPAIN | 0.055 | NETHERLANDS | 0.089 |
      | 19 | SINGAPORE | 0.500 | ISRAEL | 0.037 | SAUDI_ARABIA | 0.071 |
      | 20 | SWITZERLAND | 0.450 | AUSTRIA | 0.036 | AUSTRIA | 0.063 |
  40. 40. Table 5. Structural Hole Values by Country

      | Rank | Country | Effectiveness | Country | Efficiency | Country | Constraint |
      |---|---|---|---|---|---|---|
      | 1 | U.K. | 13.071 | EGYPT | 1.000 | DENMARK | 0.312 |
      | 2 | AUSTRALIA | 12.879 | INDIA | 1.000 | NETHERLANDS | 0.331 |
      | 3 | FRANCE | 12.562 | POLAND | 1.000 | PORTUGAL | 0.338 |
      | 4 | U.S. | 11.563 | UZBEKISTAN | 1.000 | ISRAEL | 0.343 |
      | 5 | GERMANY | 10.746 | GREECE | 0.805 | NORWAY | 0.345 |
      | 6 | NETHERLANDS | 8.873 | JAPAN | 0.789 | IRELAND | 0.352 |
      | 7 | DENMARK | 8.530 | AUSTRIA | 0.725 | UK | 0.364 |
      | 8 | PORTUGAL | 8.229 | BRAZIL | 0.722 | SWEDEN | 0.365 |
      | 9 | ISRAEL | 8.208 | NEW_ZEALAND | 0.722 | AUSTRALIA | 0.381 |
      | 10 | CANADA | 7.672 | MALAYSIA | 0.698 | GERMANY | 0.397 |
      | 11 | ITALY | 7.554 | AUSTRALIA | 0.678 | FRANCE | 0.411 |
      | 12 | IRELAND | 7.252 | SAUDI_ARABIA | 0.667 | CANADA | 0.532 |
      | 13 | NORWAY | 7.214 | IRAN | 0.667 | ITALY | 0.535 |
      | 14 | SOUTH_KOREA | 6.365 | THAILAND | 0.667 | SAUDI_ARABIA | 0.548 |
      | 15 | CHINA | 6.057 | SINGAPORE | 0.659 | SWITZERLAND | 0.556 |
      | 16 | SWEDEN | 5.978 | CZECH_REPUBLIC | 0.644 | USA | 0.573 |
      | 17 | JAPAN | 5.520 | CANADA | 0.639 | SOUTH_KOREA | 0.578 |
      | 18 | TAIWAN | 5.490 | SLOVENIA | 0.638 | BELGIUM | 0.583 |
      | 19 | SPAIN | 5.312 | SOUTH_KOREA | 0.636 | SPAIN | 0.625 |
      | 20 | SWITZERLAND | 4.224 | PORTUGAL | 0.633 | TAIWAN | 0.627 |
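The three structural hole indicators in Table 5 can be illustrated with a minimal sketch of Burt's measures. It assumes a binary, undirected network, so the simplified effective-size formula (number of alters minus 2t/n, where t counts ties among the alters) applies; whether the reported values were computed on binary or valued ties is not stated in the slides. The function name and the four-node network are hypothetical.

```python
def structural_holes(adj, ego):
    """Effective size, efficiency, and constraint of `ego`.

    `adj` maps each node to the set of its neighbours (binary, undirected).
    """
    alters = adj[ego]
    n = len(alters)
    if n == 0:
        raise ValueError("ego needs at least one tie")

    # Ties among ego's alters make those contacts redundant.
    t = sum(1 for a in alters for b in alters if a < b and b in adj[a])
    effective_size = n - 2 * t / n
    efficiency = effective_size / n

    def p(i, j):
        # Share of i's ties invested in j (equal shares in a binary network).
        return 1 / len(adj[i]) if j in adj[i] else 0.0

    constraint = 0.0
    for j in alters:
        # Indirect investment in j through ego's other contacts q.
        indirect = sum(p(ego, q) * p(q, j) for q in alters if q != j)
        constraint += (p(ego, j) + indirect) ** 2
    return effective_size, efficiency, constraint


# Invented network: A bridges D to the connected pair B and C.
adj = {
    "A": {"B", "C", "D"},
    "B": {"A", "C"},
    "C": {"A", "B"},
    "D": {"A"},
}

size_a, eff_a, con_a = structural_holes(adj, "A")
size_d, eff_d, con_d = structural_holes(adj, "D")
print(round(size_a, 3), round(eff_a, 3), round(con_a, 3))
print(size_d, eff_d, con_d)
```

In this toy case node D, with a single partner, reaches an efficiency of 1.0 while its effective size is only 1, which shows how a country with very few ties can lead the efficiency ranking without holding a large brokerage position.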
  41. 41. International Co-Authorship Network of Big Data Research
  42. 42. Semantic Network of Paper Titles in Big Data (50 Most Frequently Occurring Terms with the Cosine ≥ 0.1)
  43. 43. Semantic Network of Paper Titles and Countries in Big Data (50 Most Frequently Occurring Terms and the Top 20 Countries with the Cosine ≥ 0.2)
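A cosine-thresholded term network like the one in slide 42 can be sketched as follows. The slides do not say how the term-by-title matrix was built, so binary occurrence vectors and whitespace tokenization are assumptions here, and the titles, the term list, and the function name `cosine_edges` are invented.

```python
from itertools import combinations
from math import sqrt


def cosine_edges(titles, terms, threshold=0.1):
    """Link two terms when the cosine of their title vectors meets the threshold."""
    tokenized = [set(title.lower().split()) for title in titles]
    # One binary vector per term: 1 if the term occurs in that title.
    vectors = {w: [1 if w in doc else 0 for doc in tokenized] for w in terms}
    edges = {}
    for a, b in combinations(terms, 2):
        dot = sum(x * y for x, y in zip(vectors[a], vectors[b]))
        # For binary vectors the squared norm equals the occurrence count.
        norm = sqrt(sum(vectors[a])) * sqrt(sum(vectors[b]))
        if norm and dot / norm >= threshold:
            edges[(a, b)] = dot / norm
    return edges


# Invented paper titles and term list.
titles = [
    "large data sets analysis",
    "data mining algorithms",
    "mapreduce for large data",
    "gene expression analysis",
]
terms = ["data", "large", "analysis", "mapreduce"]

edges = cosine_edges(titles, terms)
for pair, weight in sorted(edges.items()):
    print(pair, round(weight, 3))
```

Terms that never share a title have a cosine of 0 and drop out at any positive threshold, so raising the cutoff (0.1 here, 0.2 for the term-and-country map) thins the map to the strongest associations.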
  44. 44. [Contents] 1. Concept and characteristics of big data 2. Background of data science 3. (Big) data R&D trends 4. Social issues and implications
  45. 45. • Internationally co-authored papers in the field of data science have generally focused on primary technologies. • SCI papers do not necessarily focus on conceptually new methodologies for analyzing and synthesizing massive data sets. The results suggest the emergence of some new subjects such as MapReduce.
  46. 46. • The U.S. was central in various aspects because of its connections with E.U. member countries as well as individual Asian countries. • Various European countries occupy the second most central positions based on centrality measures. • In terms of structural hole indicators, some smaller and less advanced countries were more efficient than effective in terms of controlling central positions. • The results suggest that a combination of words and locations in a two-mode network can provide a richer representation of the emerging field of big data science than the sum of two representations.
  47. 47. Yet, there still are serious problems to overcome. A trenchant critique concerning the big data field as it is nowadays came in the form of six statements intending to temper unbridled enthusiasm. [42] These six provocative statements are: • Big data change the definition of knowledge; • Claims to accuracy and objectivity are misleading; • More data are not always better data; • Taken out of context, big data loses its meaning; • Just because it is accessible, it does not make it ethical; and • (Limited) access to big data creates a new digital divide. Rousseau (2012)
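  48. 48. Global Communication 2team Challenges of big data Emergence of negative views of big data - The value of big data - Limits exist in storage, analysis, and interpretation technologies - The current boom is partly hype Big data gap: Promise vs. Capabilities
  49. 49. Global Communication 2team Challenges of big data Big data ‘Gap’ case analysis · A big data gap survey was conducted with 151 federal government CIOs and IT managers. · In practice, few agencies currently make proper use of their data, and data ownership issues also appear to be unresolved. [‘Meritalk’, a U.S. government IT network, finds that a gap exists between the promise and the reality of big data]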
  50. 50.
  51. 51. Do we know what experiments are being run?
  52. 52. We consented without fully understanding what we agreed to
  53. 53. User Content vs. Site Content Most SNS services have “Site Content” provisions that render “User Content” powerless (p. 60).
  54. 54. Issues in “Big Data” Internet Research Cugelman, B., Thelwall, M. & Dawes, P. (in press). The psychology of online behavioural influence interventions: a meta analysis. Journal of Medical Internet Research. • The Health Insurance Portability and Accountability Act (HIPAA) in the U.S. puts strict limits on the sharing of an individual’s health information. • How should practices such as hospitals live-broadcasting surgeries be handled? Besides Henry Ford Hospital, the most active hospital on Twitter, a growing number of U.S. hospitals actively use social network services such as Twitter, Facebook, and YouTube. • Development of health-related smartphone applications
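  55. 55. Global Communication 2team 3. Conclusions and implications Close examination of technological and sociocultural factors - Adverse effects such as personal data leaks and privacy invasion are a constant in discussions of big data and AI. - Alongside technological progress, what matters is a clear understanding of the future we want and a fundamental search for the political and social foundations needed to achieve it. Professor Han Woo Park cited an incident that took place in the United States in February 2012. Two British university students, while entering the United States, wrote on Twitter that they would blow up Los Angeles airport, and the U.S. government detected it. Professor Park explained: “In this case the government regulated not Twitter as a whole but the person who posted and what was posted, yet it grew into the issue that the U.S. government routinely looks into Twitter.”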
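  56. 56. Prof. Han Woo PARK World Class University Webometrics Institute CyberEmotions Research Center Department of Media and Communication, YeungNam University, Korea I thank the researchers of the CyberEmotions Research Center and the students in my undergraduate and graduate courses who helped prepare these slides. These slides are private material made for personal purposes. Distribution and copying are prohibited.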