
AUGUR: Forecasting the Emergence of New Research Topics

Being able to rapidly recognise new research trends is strategic for many stakeholders, including universities, institutional funding bodies, academic publishers and companies. The literature presents several approaches to identifying the emergence of new research topics, which rely on the assumption that the topic is already exhibiting a certain degree of popularity and is consistently referred to by a community of researchers. However, detecting the emergence of a new research area at an embryonic stage, i.e., before the topic has been consistently labelled by a community of researchers and associated with a number of publications, is still an open challenge. We address this issue by introducing Augur, a novel approach to the early detection of research topics. Augur analyses the diachronic relationships between research areas and is able to detect clusters of topics that exhibit dynamics correlated with the emergence of new research topics. Here we also present the Advanced Clique Percolation Method (ACPM), a new community detection algorithm developed specifically for supporting this task. Augur was evaluated on a gold standard of 1,408 debutant topics in the 2000-2011 interval and outperformed four alternative approaches in terms of both precision and recall.

Published in: Data & Analytics

  1. 1. AUGUR: Forecasting the Emergence of New Research Topics Angelo A. Salatino, Francesco Osborne, Enrico Motta @angelosalatino
  2. 2. Background Anatomy of a research topic • Early stage: researchers build their conceptual framework, establish their community • Recognised: many researchers work in this topic, start to produce and disseminate their results
  3. 3. Background «[…] successive transition from one paradigm to another via revolution is the usual developmental pattern of mature science.» Thomas Kuhn - The Structure of Scientific Revolutions
  4. 4. How are research topics born? • The fundamental assumption of this research is that it should be possible to detect the emergence of new research topics even before they are consistently labelled by the community – The approach focuses on uncovering the relevant patterns in the dynamics of existing topics Salatino, Angelo A., Francesco Osborne, and Enrico Motta. "How are topics born? Understanding the research dynamics preceding the emergence of new areas." PeerJ Computer Science 3 (2017): e119. https://peerj.com/articles/cs-119/
  5. 5. How are research topics born? The creation of novel topics is anticipated by a significant increase in the pace of collaboration and in the density of the portions of the network in which they will appear.
  6. 6. Forecasting the Emergence of New Research Topics? [Figure: workflow with the stages Emerging Topics, Related Topics Selection, Data Analysis, Dynamics] Salatino, Angelo A., Francesco Osborne, and Enrico Motta. "How are topics born? Understanding the research dynamics preceding the emergence of new areas." PeerJ Computer Science 3 (2017): e119. https://peerj.com/articles/cs-119/
  7. 7. AUGUR
  8. 8. Background data Scholarly Data • Dump of Scopus until 2014 • Co-occurrence network – Nodes: keywords in papers – Links: number of times two keywords co-occur together Computer Science Ontology* • Large-scale ontology of research areas automatically generated using the Klink-2 algorithm** • Defines when a topic is broader than another topic • Defines when two topics express the same subject of study * Computer Science Ontology Portal: https://w3id.org/cso ** Osborne, F. and Motta, E.: Klink-2: integrating multiple web sources to generate semantic topic networks. In ISWC 2015
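The co-occurrence network described on this slide can be sketched in a few lines. This is a minimal illustration, not the authors' pipeline: the function name `build_cooccurrence_network` and the toy `papers` list are hypothetical, and counting the number of papers per keyword as the node weight is an assumption (the slide only defines the link weight).

```python
from collections import Counter
from itertools import combinations


def build_cooccurrence_network(papers):
    """Build node and edge weights from an iterable of keyword lists.

    Node weight (assumption): number of papers containing the keyword.
    Edge weight (from the slide): number of papers in which two keywords co-occur.
    """
    node_weights = Counter()
    edge_weights = Counter()
    for keywords in papers:
        unique = sorted(set(keywords))  # ignore duplicates, fix the pair order
        node_weights.update(unique)
        edge_weights.update(combinations(unique, 2))
    return node_weights, edge_weights


# Toy data, invented for illustration only.
papers = [
    ["semantic web", "ontology", "rdf"],
    ["semantic web", "information retrieval"],
    ["ontology", "semantic web"],
]
nodes, edges = build_cooccurrence_network(papers)
print(nodes["semantic web"], edges[("ontology", "semantic web")])
```

Edge keys are stored as alphabetically sorted tuples, so a pair is always looked up as `(smaller, larger)`.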
  9. 9. Evolutionary Network Snapshot of the collaborations of topics in a period of five years
     Normalised link weight between topics u and v in a given year:
     $\hat{w}^{year}_{u,v} = \mathrm{HarmonicMean}\left(\frac{w^{year}_{u,v}}{p^{year}_{u}}, \frac{w^{year}_{u,v}}{p^{year}_{v}}\right)$
     Edge weights:
     $w^{evol}_{u,v} = \frac{\sum_{i=0}^{4} (year_{t-i} - \overline{year})(\hat{w}^{year_{t-i}}_{u,v} - \overline{\hat{w}_{u,v}})}{\sum_{i=0}^{4} (year_{t-i} - \overline{year})^{2}}$
     Node weights:
     $p^{evol}_{k} = \frac{\sum_{i=0}^{4} (year_{t-i} - \overline{year})(p^{year_{t-i}}_{k} - \overline{p_{k}})}{\sum_{i=0}^{4} (year_{t-i} - \overline{year})^{2}}$
     $G^{evol} = (V^{evol}, E^{evol}, p^{evol}, w^{evol})$
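A small sketch of the two weighting steps on this slide, assuming the formulas are read as (a) the harmonic mean of the two conditional probabilities w/p_u and w/p_v and (b) the least-squares slope of a quantity over a five-year window. The function names and the sample numbers are hypothetical.

```python
from statistics import harmonic_mean


def normalised_link_weight(w_uv, p_u, p_v):
    """Harmonic mean of w_uv / p_u and w_uv / p_v (assumed reading of the slide)."""
    if w_uv == 0:
        return 0.0
    return harmonic_mean([w_uv / p_u, w_uv / p_v])


def slope(years, values):
    """Least-squares slope of values against years."""
    n = len(years)
    mean_year = sum(years) / n
    mean_value = sum(values) / n
    numerator = sum((y - mean_year) * (v - mean_value) for y, v in zip(years, values))
    denominator = sum((y - mean_year) ** 2 for y in years)
    return numerator / denominator


# Invented numbers: 10 co-occurrences, 20 and 50 papers for the two topics.
w_hat = normalised_link_weight(10, 20, 50)

# Invented five-year series of normalised weights for one link.
years = [1996, 1997, 1998, 1999, 2000]
w_evol = slope(years, [0.1, 0.2, 0.3, 0.4, 0.5])
print(round(w_hat, 4), round(w_evol, 4))
```

A positive slope means the two topics collaborated at an increasing pace over the window; the same `slope` function applies to the node series p.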
  10. 10. Advanced Clique Percolation Method
      Standard (Palla et al., 2005): Topic network → Clique detection → Clique-graph construction → Communities
      Advanced: Topic network → Clique detection → Clique-graph construction → Measure & Filtering → Find local maxima and select neighbors → Communities
      An example: 1 connected component with 4 local maxima
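For orientation, the standard Clique Percolation Method (the baseline on this slide, not ACPM) can be sketched as follows: find all k-cliques, link two cliques when they share k-1 nodes, and take the connected components of that clique graph as communities. This brute-force version is only suitable for toy graphs; the function names and the example edges are hypothetical, and the ACPM filtering and local-maxima steps are not reproduced here.

```python
from itertools import combinations


def k_cliques(adj, k):
    """Enumerate all k-cliques by brute force (fine for small graphs only)."""
    return [
        frozenset(c)
        for c in combinations(sorted(adj), k)
        if all(v in adj[u] for u, v in combinations(c, 2))
    ]


def clique_percolation(edges, k=3):
    """Standard CPM: communities are unions of k-cliques chained by k-1 shared nodes."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    cliques = k_cliques(adj, k)

    # Union-find over cliques: merge two cliques that share k-1 nodes.
    parent = list(range(len(cliques)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(cliques)), 2):
        if len(cliques[i] & cliques[j]) >= k - 1:
            parent[find(i)] = find(j)

    communities = {}
    for i, clique in enumerate(cliques):
        communities.setdefault(find(i), set()).update(clique)
    return sorted(sorted(c) for c in communities.values())


# Two triangles sharing an edge, plus a separate triangle joined by a bridge d-e.
edges = [("a", "b"), ("b", "c"), ("a", "c"), ("b", "d"), ("c", "d"),
         ("d", "e"), ("e", "f"), ("f", "g"), ("e", "g")]
communities = clique_percolation(edges, k=3)
print(communities)
```

The bridge d-e belongs to no triangle, so it does not merge the two communities.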
  11. 11. Evolutionary Network of year 2000 CPM ACPM
  12. 12. Post-processing & Sense Making • Some clusters are filtered and others merged (e.g. clusters A and B become A ∪ B) • Extraction of influential papers • Extraction of influential authors
  13. 13. Final Outcome Portion of the evolutionary network in 2002, reflecting the emergence of Semantic Search in 2003
      Influential Authors: W. Bruce Croft, Dieter Fensel, Dan Suciu, William W. Cohen, Berthier Ribeiro-Neto, Clement T. Yu, James Allan, Justin Zobel, Dragomir R. Radev, Victor Vianu
      Influential Papers:
      - A Sheth et al. "Managing semantic content for the Web" (2002)
      - RWP Luk et al. "A survey in indexing and searching XML documents" (2002)
      - J Kahan et al. "Annotea: An open RDF infrastructure for shared Web annotations" (2002)
      - R Manmatha et al. "Modeling score distributions for combining the outputs of search engines" (2001)
      - S Dagtas et al. "Models for motion-based video indexing and retrieval" (2000)
  14. 14. Evaluating We evaluated AUGUR and the ACPM against a gold standard of emerging topics and compared them with four state-of-the-art algorithms: • Fast Greedy (FG) • Leading Eigenvector (LE) • Fuzzy C-Means (FCM) • Clique Percolation Method (CPM)
  15. 15. Evaluating • Gold standard of 1,408 emerging topics in 2000-2011 • For each emerging topic we extracted 25 ancestors – Topics that mostly collaborated with the debutant topic during its first five years of activity
      Year     2000  2001  2002  2003  2004  2005  2006  2007  2008  2009  2010  2011
      #topics   149   194   221   216   137   241   134    60    27    12    12     5
  16. 16. Evaluating clusters(y_t) vs. ancestorsOfDebutantTopics(y_{t+1}, y_{t+2})
  17. 17. Evaluating Metrics: • Precision, number of matching clusters divided by the total number of clusters • Recall, number of matching debutant topics divided by the total number of debutant topics
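The two metrics can be sketched as below. This is an illustrative reading, not the authors' evaluation code: a cluster and a debutant topic are taken to "match" when the Jaccard similarity between the cluster and the topic's ancestors reaches a threshold (whether the comparison is `>=` or `>` is an assumption), and all names and data are hypothetical.

```python
def jaccard(a, b):
    """Plain Jaccard similarity between two collections of topics."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def precision_recall(clusters, ancestors_by_topic, threshold):
    """Precision: matching clusters / all clusters.
    Recall: matched debutant topics / all debutant topics."""
    matched_clusters = set()
    matched_topics = set()
    for index, cluster in enumerate(clusters):
        for topic, ancestors in ancestors_by_topic.items():
            if jaccard(cluster, ancestors) >= threshold:  # assumed comparison
                matched_clusters.add(index)
                matched_topics.add(topic)
    precision = len(matched_clusters) / len(clusters) if clusters else 0.0
    recall = len(matched_topics) / len(ancestors_by_topic) if ancestors_by_topic else 0.0
    return precision, recall


# Invented toy data: two clusters, two debutant topics with their ancestors.
clusters = [{"a", "b", "c"}, {"x", "y"}]
ancestors_by_topic = {"t1": {"a", "b", "d"}, "t2": {"p", "q"}}
precision, recall = precision_recall(clusters, ancestors_by_topic, threshold=0.5)
print(precision, recall)
```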
  18. 18. Evaluating Matching: • Jaccard Similarity between the cluster (C_i) and the ancestors of debutant topics (D_k) • We added the same-as (SA_i) of the cluster (C_i)
      Example: {fuzzy c means} ∩ {fuzzy c-means (fcm)} is empty, whereas ({fuzzy c means} ∪ {fuzzy c-means (fcm)} ∪ {fuzzy c-means}) ∩ {fuzzy c-means (fcm)} is not.
      $0 \le J(C_i, D_k, SA_i) \le 1$
  19. 19. Evaluating Four strategies:
      Strategy (1) C_i vs. D_k (shown previously)
      Strategy (2) (C_i ∪ EC_i) vs. D_k
      Strategy (3) C_i vs. (D_k ∪ ED_k)
      Strategy (4) (C_i ∪ EC_i) vs. (D_k ∪ ED_k)
      $J(C_i, D_k, SA_i, EC_i, ED_k) = \frac{|(C_i \cup EC_i \cup SA_i) \cap (D_k \cup ED_k)|}{|C_i \cup EC_i \cup D_k \cup ED_k|}$
      $0 \le J(C_i, D_k, SA_i, EC_i, ED_k) \le 1$
  20. 20. Evaluating Degrees of freedom • Year, [1999-2010] • Similarity threshold, [0-1] • Strategy, {Strategy 1, 2, 3, 4} • Algorithm, {FG, LE, FCM, CPM, ACPM} precision = f(year, similarity, strategy, algorithm) recall = f(year, similarity, strategy, algorithm)
  21. 21. Results Which strategy? Fixed variables: algorithm(ACPM), year(1999) [Plot legend: (1) C_i vs. D_k, (2) (C_i ∪ EC_i) vs. D_k, (3) C_i vs. (D_k ∪ ED_k), (4) (C_i ∪ EC_i) vs. (D_k ∪ ED_k)]
  22. 22. Results Which algorithm? Fixed variables: similarity(0.1), strategy(4). Pr = precision, Re = recall.
             FG         LE         FCM        CPM        ACPM
      Year   Pr   Re    Pr   Re    Pr   Re    Pr   Re    Pr   Re
      1999   .27  .11   .00  .00   .00  .00   .06  .01   .86  .76
      2000   .21  .07   .14  .02   .96  .01   .05  .00   .78  .70
      2001   .13  .04   .11  .01   .00  .00   .17  .00   .77  .72
      2002   .14  .04   .11  .01   .00  .00   .29  .01   .82  .80
      2003   .09  .02   .20  .02   .00  .00   .08  .02   .83  .79
      2004   .11  .05   .06  .00   .00  .00   .00  .00   .84  .68
      2005   .07  .11   .06  .01   .00  .00   .00  .00   .71  .66
      2006   .01  .01   .07  .01   .00  .00   .00  .00   .43  .51
      2007   .01  .08   .00  .00   .00  .00   .00  .00   .28  .44
      2008   .01  .04   .00  .00   .00  .00   .00  .00   .15  .33
      2009   .00  .00   .00  .00   .00  .00   .00  .00   .09  .76
  23. 23. Results Fixed variables: algorithm(ACPM), strategy(4) Final results
  24. 24. Conclusion • We evaluated Augur and ACPM versus four alternative approaches on a gold standard of 1,408 debutant topics in the 2000-2011 timeframe. • The results show that our approach outperforms state of the art solutions and is able to successfully identify clusters that will produce new topics in the two following years.
  25. 25. Future work • Gold Standard • Scope • Further dynamics • Analysis on more recent data
  26. 26. Thank you SKM3 Scholarly Knowledge: Modelling, Mining and Sense Making http://skm.kmi.open.ac.uk/ Angelo A. Salatino email: angelo.salatino@open.ac.uk twitter: @angelosalatino Web: salatino.org Francesco Osborne Enrico Motta Angelo Salatino
