Invited Talk: Early Detection of Research Topics

Slides of my talk at Chan Zuckerberg Initiative (Meta)
Abstract:
The ability to promptly recognise new research trends is strategic for many stakeholders, including universities, institutional funding bodies, academic publishers and companies. While the literature describes several approaches which aim to identify the emergence of new research topics early in their lifecycle, these rely on the assumption that the topic in question is already associated with a number of publications and consistently referred to by a community of researchers. Hence, detecting the emergence of a new research area at an embryonic stage, i.e., before the topic has been consistently labelled by a community of researchers and associated with a number of publications, is still an open challenge. In this paper, we begin to address this challenge by performing a study of the dynamics preceding the creation of new topics. This study indicates that the emergence of a new topic is anticipated by a significant increase in the pace of collaboration between relevant research areas, which can be seen as the ‘parents’ of the new topic. These initial findings (i) confirm our hypothesis that it is possible in principle to detect the emergence of a new topic at the embryonic stage, (ii) provide new empirical evidence supporting relevant theories in Philosophy of Science, and also (iii) suggest that new topics tend to emerge in an environment in which weakly interconnected research areas begin to cross-fertilise.

Published in: Science

  1. 1. Early Detection of Research Topics Angelo Antonio Salatino Knowledge Media Institute The Open University, UK @angelosalatino Aug 2 2018 @ Chan Zuckerberg Initiative
  2. 2. $whoami • Raised in Bari (IT), where I got my Master's degree in Computer Systems Engineering from the Polytechnic University of Bari • Finishing my PhD at the Knowledge Media Institute (KMi) of The Open University (UK) under the supervision of Prof. Enrico Motta and Dr. Francesco Osborne • Research Assistant in the SKM3 group in the same department at the OU [Map: Bari; The Open University, Milton Keynes]
  3. 3. Scholarly Knowledge: Modelling, Mining and SenseMaking The SKM3 team aims at producing innovative approaches leveraging large-scale data mining, semantic technologies, machine learning, and visual analytics for making sense of scholarly data and forecasting research dynamics. We collaborate with major publishers, universities, and companies and produce a variety of services supporting researchers, editors, and policy makers. http://skm.kmi.open.ac.uk/
  4. 4. How can we detect the emergence of new research topics early?
  5. 5. Background Anatomy of a research topic • Early stage: researchers build their conceptual framework and establish their community • Recognised: many researchers work on the topic and start to produce and disseminate their results
  6. 6. How early in the topic lifecycle is it possible to identify an emerging topic?
  7. 7. Background «[…] successive transition from one paradigm to another via revolution is the usual developmental pattern of mature science.» Thomas Kuhn - The Structure of Scientific Revolutions
  8. 8. Background «As the work and the points of view grow more specialised, men in different disciplines have fewer things in common, in their background and in their daily problems» Clark - The Study of Campus Cultures «Sometimes, of course, friendly relations may be established to mutual benefit ...» Becher and Trowler - Academic Tribes and Territories
  9. 9. Hypothesis The fundamental assumption of this research is that it should be possible to detect the emergence of new research topics even before they are consistently labelled by the community • The approach focuses on uncovering the relevant patterns in the dynamics of existing topics Salatino, Angelo A., Francesco Osborne, and Enrico Motta. "How are topics born? Understanding the research dynamics preceding the emergence of new areas." PeerJ Computer Science 3 (2017): e119. https://peerj.com/articles/cs-119/
  10. 10. Approach • Study #1: How are research topics born? • Prove the feasibility of the hypothesis by demonstrating the existence of regular patterns preceding the «birth» of 75 debutant topics which emerged in the years 2000-10 • Study #2: Augur • Following on from the successful results of Study #1, develop a method which can effectively predict the emergence of new topics before they are explicitly recognised
  11. 11. How are research topics born?
  12. 12. How are research topics born? • Selection Phase • Treatment group (debutant topics) vs. Control group (non-debutant topics) • Analysis Phase • Statistical analysis of the pace of collaboration (diachronic activity of triangles of collaborating topics) and changes in network density over the two populations (treatment and control groups)
  13. 13. Datasets Scholarly Data • Dump of Scopus until 2014 • 3M papers in Computer Science • Co-occurrence networks • Nodes: keywords in papers • Links: number of times two keywords co-occur in a particular year Computer Science Ontology* • Large-scale ontology of research areas automatically generated using the Klink-2 algorithm** • Defines when a topic is broader than another topic • Defines when two topics express the same subject of study * Salatino A. A., Thanapalasingam T., Mannocci A., Osborne F., and Motta E.: “The Computer Science Ontology: A Large-Scale Taxonomy of Research Areas.” In ISWC 2018, http://oro.open.ac.uk/55484/ ** Osborne, F. and Motta, E.: “Klink-2: integrating multiple web sources to generate semantic topic networks.” In ISWC 2015
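The yearly co-occurrence networks described on this slide can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function name `cooccurrence_networks` and the `(year, keywords)` input format are assumptions, and the sample papers are made up.

```python
from collections import Counter, defaultdict
from itertools import combinations


def cooccurrence_networks(papers):
    """Build one weighted keyword network per year.

    papers: iterable of (year, keywords) pairs (hypothetical input format).
    Returns {year: Counter({(kw_a, kw_b): n_papers_where_both_occur})}.
    """
    links = defaultdict(Counter)
    for year, keywords in papers:
        # Sort and de-duplicate so each undirected pair has one canonical key
        for u, v in combinations(sorted(set(keywords)), 2):
            links[year][(u, v)] += 1
    return links


# Made-up sample data, for illustration only
papers = [
    (2000, ["ontology", "semantic web"]),
    (2000, ["semantic web", "ontology", "rdf"]),
    (2001, ["rdf", "ontology"]),
]
networks = cooccurrence_networks(papers)
```

Each link weight is the number of papers in that year in which the two keywords appear together, which is the definition given on the slide.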
  14. 14. Dataset: Semantic Enhanced Topic Networks [Figure: the yearly co-occurrence networks are combined with the Computer Science Ontology to produce yearly semantic enhanced topic networks]
  15. 15. Selection Phase Testing topics: • Debutant Topics (treatment group) • Non-Debutant Topics (control group) Related topics: co-occurring topics For each testing topic we have: • Network of Topics • Graph extraction for each of the five years preceding the year of analysis (year of analysis -1 to year of analysis -5)
  16. 16. Analysis Phase (1/2) Pace of Collaboration • Matched 3-clique (A, B, C) in the 5 sub-graphs • Timeline of measures associated with the 3-cliques • Slope α = Δy/Δt, with t in years and y the value of the measure [Figure: a 3-clique A, B, C with weighted nodes and links, and a plot of y against t]
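A rough sketch of the slope computation for one 3-clique follows. The slide only states α = Δy/Δt; fitting the slope by least squares over the five yearly values is an assumption made here, and the yearly measure values are hypothetical.

```python
def slope(years, values):
    """Least-squares slope of values over years (an estimate of Δy/Δt)."""
    n = len(years)
    mean_t = sum(years) / n
    mean_y = sum(values) / n
    num = sum((t - mean_t) * (y - mean_y) for t, y in zip(years, values))
    den = sum((t - mean_t) ** 2 for t in years)
    return num / den


# Hypothetical measure for one 3-clique (A, B, C) in the five years
# preceding a year of analysis of 2005
years = [2000, 2001, 2002, 2003, 2004]
clique_measure = [0.10, 0.12, 0.15, 0.19, 0.26]
alpha = slope(years, clique_measure)  # positive: collaboration is accelerating
```

A positive α indicates that the three topics are collaborating at an increasing pace over the window.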
  17. 17. Analysis Phase (2/2) Network Density • Triad Census*: 4 isomorphism classes of triad: H0 (empty), H1 (one edge), H2 (two-star), H3 (triangle) • $\%GrowthH_i = \frac{H_i^{Yr-1} - H_i^{Yr-5}}{H_i^{Yr-5}} \times 100$ • $GrowingIndex_{topic} = \sum_{i=0}^{3} i \times \%GrowthH_i$ * Davis, James A., and Samuel Leinhardt. "The structure of positive interpersonal relations in small groups." (1967).
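The density measure can be illustrated with a small sketch: count the four undirected triad classes in two yearly snapshots, then combine their percentage growth into the growing index. The function names and the toy counts are assumptions; the sketch also assumes every class count in the older snapshot is non-zero, since the growth formula divides by it.

```python
from itertools import combinations


def triad_census(nodes, edges):
    """Return [H0, H1, H2, H3]: number of node triples with 0, 1, 2 or 3 edges."""
    edge_set = {frozenset(e) for e in edges}
    counts = [0, 0, 0, 0]
    for triple in combinations(nodes, 3):
        k = sum(frozenset(pair) in edge_set for pair in combinations(triple, 2))
        counts[k] += 1
    return counts


def growth_percent(h_old, h_new):
    """Percentage growth of one triad class between year-5 and year-1."""
    return (h_new - h_old) * 100 / h_old  # assumes h_old != 0


def growing_index(census_yr5, census_yr1):
    """Weighted sum of growth, weighting class H_i by i (denser triads count more)."""
    return sum(i * growth_percent(census_yr5[i], census_yr1[i]) for i in range(4))


# Toy graph: triangle A-B-C plus a pendant edge C-D
census = triad_census("ABCD", [("A", "B"), ("B", "C"), ("C", "A"), ("C", "D")])

# Hypothetical census counts five years and one year before the year of analysis
index = growing_index([100, 50, 20, 10], [80, 60, 30, 20])
```

Because H3 carries the largest weight, growth in closed triangles dominates the index, which matches the idea that denser collaboration signals an emerging topic.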
  18. 18. Findings • Pace of Collaboration • Density
  19. 19. How can we forecast the emergence of new Research Topics?
  20. 20. How? [Diagram: Emerging Topics, Related Topics Selection, Data Analysis, Dynamics] • Study #1 assumed we knew which topics were debutant and uncovered the dynamics associated with their emergence • Study #2 takes the dynamics as its starting point and tries to predict the birth of debutant topics
  21. 21. AUGUR Salatino, Angelo, Francesco Osborne, and Enrico Motta. "AUGUR: Forecasting the Emergence of New Research Topics." JCDL’18: The 18th ACM/IEEE Joint Conference on Digital Libraries. ACM, Fort Worth, TX, USA, 2018.
  22. 22. Evolutionary Network Snapshot of the collaborations of topics in a period of five years
     Edge weights: $\hat{w}^{year}_{topic_{u,v}} = HarmonicMean\left(\frac{w^{year}_{topic_{u,v}}}{p^{year}_{topic_u}}, \frac{w^{year}_{topic_{u,v}}}{p^{year}_{topic_v}}\right)$ and $w^{evol}_{u,v} = \frac{\sum_{i=0}^{4} (year_{t-i} - \overline{year})(\hat{w}^{year_{t-i}}_{topic_{u,v}} - \overline{\hat{w}}_{topic_{u,v}})}{\sum_{i=0}^{4} (year_{t-i} - \overline{year})^2}$
     Node weights: $p^{evol}_{k} = \frac{\sum_{i=0}^{4} (year_{t-i} - \overline{year})(p^{year_{t-i}}_{topic_k} - \overline{p}_{topic_k})}{\sum_{i=0}^{4} (year_{t-i} - \overline{year})^2}$
     $G^{evol} = (V^{evol}, E^{evol}, p^{evol}, w^{evol})$
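As a sketch of how the evolutionary weights could be computed, the snippet below normalises each yearly link weight with a harmonic mean and then takes the least-squares slope over five years. The function names and all numbers are illustrative assumptions; only the two formulas come from the slide.

```python
def normalised_weight(w_uv, p_u, p_v):
    """Harmonic mean of w_uv / p_u and w_uv / p_v for one year."""
    a, b = w_uv / p_u, w_uv / p_v
    return 2 * a * b / (a + b) if (a + b) else 0.0


def slope(years, values):
    """Least-squares slope of values over years."""
    n = len(years)
    mean_t = sum(years) / n
    mean_v = sum(values) / n
    num = sum((t - mean_t) * (v - mean_v) for t, v in zip(years, values))
    den = sum((t - mean_t) ** 2 for t in years)
    return num / den


# Hypothetical five-year history for topics u and v:
# (year, co-occurrences of u and v, papers on u, papers on v)
history = [
    (1996, 2, 40, 80),
    (1997, 4, 40, 80),
    (1998, 6, 40, 80),
    (1999, 8, 40, 80),
    (2000, 10, 40, 80),
]
years = [row[0] for row in history]
w_hat = [normalised_weight(w, pu, pv) for _, w, pu, pv in history]
w_evol = slope(years, w_hat)                                # evolutionary link weight
p_evol_u = slope(years, [row[2] for row in history])        # evolutionary node weight of u
```

In this toy history the number of papers on each topic is flat, so the node weight slope is zero, while the link between the two topics strengthens every year and gets a positive evolutionary weight.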
  23. 23. Advanced Clique Percolation Method • Standard*: Topic network → Clique detection → Clique-graph construction → Communities • Advanced: Topic network → Clique detection → Clique-graph construction → Measure & Filtering → Find local maxima and select neighbors → Communities • An example: 1 connected component with 4 local maxima * Palla, G., Derényi, I., Farkas, I., & Vicsek, T. (2005). “Uncovering the overlapping community structure of complex networks in nature and society.” Nature, 435(7043), 814.
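For orientation, here is a minimal sketch of the standard Clique Percolation Method only (clique detection, clique-graph construction, communities as connected components). It does not implement the measure, filtering and local-maxima steps of the advanced variant, whose details are not given in these slides. The brute-force clique search and the function names are assumptions suited to small graphs.

```python
from itertools import combinations


def k_cliques(adj, k):
    """All k-node cliques, found by brute force (fine for small graphs only)."""
    return [frozenset(c) for c in combinations(sorted(adj), k)
            if all(v in adj[u] for u, v in combinations(c, 2))]


def clique_percolation(edges, k=3):
    """Standard CPM: communities are unions of k-cliques chained by k-1 shared nodes."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    cliques = k_cliques(adj, k)
    unvisited = set(range(len(cliques)))
    communities = []
    while unvisited:
        # Depth-first walk over the clique graph, starting from any unvisited clique
        start = unvisited.pop()
        stack, members = [start], set(cliques[start])
        while stack:
            i = stack.pop()
            for j in list(unvisited):
                if len(cliques[i] & cliques[j]) == k - 1:  # adjacent cliques
                    unvisited.remove(j)
                    stack.append(j)
                    members |= cliques[j]
        communities.append(frozenset(members))
    return communities


# Toy topic network: two triangles sharing an edge, a bridge, and a third triangle
edges = [(1, 2), (2, 3), (1, 3), (2, 4), (3, 4), (4, 5), (5, 6), (6, 7), (5, 7)]
communities = clique_percolation(edges, k=3)
```

The bridge 4-5 belongs to no triangle, so the two groups stay separate communities even though the graph itself is connected.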
  24. 24. Evolutionary Network of year 2000 CPM ACPM
  25. 25. Post-processing & Sense Making • Some clusters are filtered and others merged • Extraction of influential papers • Extraction of influential authors [Figure: clusters A and B merged into A ∪ B]
  26. 26. Final Outcome Influential Authors W. Bruce Croft, Dieter Fensel, Dan Suciu, William W. Cohen, Berthier Ribeiro-Neto, Clement T. Yu, James Allan, Justin Zobel, Dragomir R. Radev, Victor Vianu Influential Papers - A Sheth et al. "Managing semantic content for the Web" (2002) - RWP Luk et al. "A survey in indexing and searching XML documents" (2002) - J Kahan et al. "Annotea: An open RDF infrastructure for shared Web annotations" (2002) - R Manmatha et al. "Modeling score distributions for combining the outputs of search engines" (2001) - S Dagtas et al. "Models for motion-based video indexing and retrieval" (2000) Portion of the evolutionary network in 2002, reflecting the emergence of Semantic Search in 2003
  27. 27. Evaluating We evaluated AUGUR and the ACPM against a gold standard of emerging topics and compared them with four state-of-the-art algorithms: • Fast Greedy (FG) • Leading Eigenvector (LE) • Fuzzy C-Means (FCM) • Clique Percolation Method (CPM)
  28. 28. Evaluating • Gold standard of 1408 emerging topics in 2000-11 • For each emerging topic we extracted 25 ancestors • Ancestors are the topics that collaborated most with the debutant topic during its first five years of activity
     Year     2000  2001  2002  2003  2004  2005  2006  2007  2008  2009  2010  2011
     #topics   149   194   221   216   137   241   134    60    27    12    12     5
     IF match(cluster, ancestors) THEN claim_match(cluster, debutant) END IF
  29. 29. Evaluating Given a cluster in a year, let’s see if it leads to the emergence of a new research topic in the following two years Clusters($y_t$) vs. (Ancestors_Of_)Debutant_Topics($y_{t+1}$, $y_{t+2}$)
  30. 30. Evaluating Metrics: • Precision, number of matching clusters divided by the total number of clusters • Recall, number of matching debutant topics divided by the total number of debutant topics
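The two metrics can be sketched as below. This uses a plain Jaccard similarity with a threshold as the matching rule, and treats a similarity equal to the threshold as a match; both choices, as well as the function names and the sample data, are assumptions for illustration.

```python
def jaccard(a, b):
    """Plain Jaccard similarity between two collections of topics."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def evaluate(clusters, ancestors_by_debutant, threshold):
    """Return (precision, recall) as defined on the slide.

    clusters: list of sets of topics detected in year t.
    ancestors_by_debutant: {debutant topic: set of its ancestor topics}.
    """
    matched_clusters, matched_debutants = set(), set()
    for idx, cluster in enumerate(clusters):
        for debutant, ancestors in ancestors_by_debutant.items():
            if jaccard(cluster, ancestors) >= threshold:
                matched_clusters.add(idx)
                matched_debutants.add(debutant)
    precision = len(matched_clusters) / len(clusters) if clusters else 0.0
    recall = (len(matched_debutants) / len(ancestors_by_debutant)
              if ancestors_by_debutant else 0.0)
    return precision, recall


# Made-up clusters and gold standard
clusters = [{"a", "b", "c"}, {"x", "y"}]
gold = {"topic_1": {"a", "b", "d"}, "topic_2": {"m", "n"}}
precision, recall = evaluate(clusters, gold, threshold=0.5)
```

One of the two clusters matches and one of the two debutant topics is matched, so both metrics come out at one half for this toy input.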
  31. 31. Evaluating Matching: • Jaccard Similarity between a cluster ($C_i$) and the ancestors of a debutant topic ($D_k$) • From CSO, we added the same-as ($SA_i$) of the topics in the cluster ($C_i$) • $0 \le J(C_i, D_k, SA_i) \le 1$ • Example: {fuzzy c means} ∩ {fuzzy c-means (fcm)} finds no match, whereas ({fuzzy c means} ∪ {fuzzy c-means (fcm)} ∪ {fuzzy c-means}) ∩ {fuzzy c-means (fcm)} does
  32. 32. Evaluating Four strategies: • Strategy (1): $C_i$ vs. $D_k$ (shown previously) • Strategy (2): $(C_i \cup EC_i)$ vs. $D_k$ • Strategy (3): $C_i$ vs. $(D_k \cup ED_k)$ • Strategy (4): $(C_i \cup EC_i)$ vs. $(D_k \cup ED_k)$
     $J(C_i, D_k, SA_i, EC_i, ED_k) = \frac{|(C_i \cup EC_i \cup SA_i) \cap (D_k \cup ED_k)|}{|C_i \cup EC_i \cup D_k \cup ED_k|}$
     $0 \le J(C_i, D_k, SA_i, EC_i, ED_k) \le 1$
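The extended similarity can be sketched directly from the formula. The four strategies then differ only in which of the extra sets are passed in. How the sets EC_i and ED_k are built is not described in this transcript, so they are treated as given inputs here; the function name is also an assumption.

```python
def extended_jaccard(C, D, SA=(), EC=(), ED=()):
    """|(C ∪ EC ∪ SA) ∩ (D ∪ ED)| / |C ∪ EC ∪ D ∪ ED|, always in [0, 1].

    C: topics in the cluster; D: ancestors of the debutant topic;
    SA: same-as labels of the cluster topics (from CSO);
    EC, ED: the extra sets used by strategies 2-4 (taken as given).
    """
    C, D, SA, EC, ED = (set(s) for s in (C, D, SA, EC, ED))
    numerator = (C | EC | SA) & (D | ED)
    denominator = C | EC | D | ED
    return len(numerator) / len(denominator) if denominator else 0.0


# Label variants from the matching slide
C = {"fuzzy c means"}
D = {"fuzzy c-means (fcm)"}
SA = {"fuzzy c-means (fcm)", "fuzzy c-means"}

without_same_as = extended_jaccard(C, D)       # strategy 1, no same-as labels
with_same_as = extended_jaccard(C, D, SA=SA)   # strategy 1 with same-as labels
```

Strategy 2 would add `EC=...`, strategy 3 would add `ED=...`, and strategy 4 would pass both. The same-as labels only enlarge the numerator, which is why the score stays bounded by 1.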
  33. 33. Evaluating Degrees of freedom: • Year, [1999-2010] • Similarity threshold, [0-1] • Strategy, {Strategy 1, 2, 3, 4} • Algorithm, {FG, LE, FCM, CPM, ACPM} precision = f(year, similarity, strategy, algorithm) recall = f(year, similarity, strategy, algorithm)
  34. 34. Results Which strategy? Fixed variables: algorithm(ACPM), year(1999) [Plot legend: (1) $C_i$ vs. $D_k$ (2) $(C_i \cup EC_i)$ vs. $D_k$ (3) $C_i$ vs. $(D_k \cup ED_k)$ (4) $(C_i \cup EC_i)$ vs. $(D_k \cup ED_k)$]
  35. 35. Results Which algorithm? Fixed variables: similarity(0.1), strategy(4) (Pr = precision, Re = recall)
     Year   FG        LE        FCM       CPM       ACPM
            Pr   Re   Pr   Re   Pr   Re   Pr   Re   Pr   Re
     1999   .27  .11  .00  .00  .00  .00  .06  .01  .86  .76
     2000   .21  .07  .14  .02  .96  .01  .05  .00  .78  .70
     2001   .13  .04  .11  .01  .00  .00  .17  .00  .77  .72
     2002   .14  .04  .11  .01  .00  .00  .29  .01  .82  .80
     2003   .09  .02  .20  .02  .00  .00  .08  .02  .83  .79
     2004   .11  .05  .06  .00  .00  .00  .00  .00  .84  .68
     2005   .07  .11  .06  .01  .00  .00  .00  .00  .71  .66
     2006   .01  .01  .07  .01  .00  .00  .00  .00  .43  .51
     2007   .01  .08  .00  .00  .00  .00  .00  .00  .28  .44
     2008   .01  .04  .00  .00  .00  .00  .00  .00  .15  .33
     2009   .00  .00  .00  .00  .00  .00  .00  .00  .09  .76
  36. 36. Results Final results Fixed variables: algorithm(ACPM), strategy(4)
  37. 37. Conclusion • We evaluated Augur and ACPM against four alternative approaches on a gold standard of 1,408 debutant topics in the 2000-2011 timeframe. • The results show that our approach outperforms state-of-the-art solutions and successfully identifies clusters that will produce new topics in the following two years.
  38. 38. Future work • Analysis on more recent data • We are currently applying Augur to MAG (Microsoft Academic Graph) • Gold Standard • Scope • We plan to analyse the fields of Engineering and Medicine • Further dynamics • Involving author collaborations, funding and publication venues
  39. 39. Thank you References • Salatino, Angelo A., Francesco Osborne, and Enrico Motta. "How are topics born? Understanding the research dynamics preceding the emergence of new areas." PeerJ Computer Science 3 (2017): e119. https://peerj.com/articles/cs-119/ • Salatino, Angelo, Francesco Osborne, and Enrico Motta. "AUGUR: Forecasting the Emergence of New Research Topics." JCDL’18: The 18th ACM/IEEE Joint Conference on Digital Libraries. ACM, Fort Worth, TX, USA, 2018. • Salatino, Angelo A., Thiviyan Thanapalasingam, Andrea Mannocci, Francesco Osborne, and Enrico Motta. "The Computer Science Ontology: A Large-Scale Taxonomy of Research Areas." In ISWC 2018, Monterey, CA, USA. • Osborne, Francesco, and Enrico Motta. "Klink-2: integrating multiple web sources to generate semantic topic networks." In ISWC 2015, Bethlehem, PA, USA. Angelo Salatino @angelosalatino angelo.salatino@open.ac.uk salatino.org http://skm.kmi.open.ac.uk
