Being able to rapidly recognise new research trends is strategic for many stakeholders, including universities, institutional funding bodies, academic publishers and companies. The literature presents several approaches to identifying the emergence of new research topics, which rely on the assumption that the topic is already exhibiting a certain degree of popularity and is consistently referred to by a community of researchers. However, detecting the emergence of a new research area at an embryonic stage, i.e., before the topic has been consistently labelled by a community of researchers and associated with a number of publications, is still an open challenge. We address this issue by introducing Augur, a novel approach to the early detection of research topics. Augur analyses the diachronic relationships between research areas and is able to detect clusters of topics that exhibit dynamics correlated with the emergence of new research topics. Here we also present the Advanced Clique Percolation Method (ACPM), a new community detection algorithm developed specifically for supporting this task. Augur was evaluated on a gold standard of 1,408 debutant topics in the 2000-2011 interval and outperformed four alternative approaches in terms of both precision and recall.
The rise of the digital economy could open a range of new opportunities for firms to play a more active role in global value chains (GVCs).
New digital technologies are radically changing the outlook of the manufacturing and services industries by altering how companies organize their production processes and which business models they adopt.
How digitalization is affecting, or could affect in the future, enterprises' (actors') contributions to GVCs.
The paper examines the various opportunities that the digital economy opens for actors, especially in terms of cost reductions and the emergence of new business models, and discusses policy measures that could be taken to promote actors' participation in GVCs.
Significant challenges remain for SMEs to enter GVCs, some of which are exacerbated by the new digital economy.
Application of Bradford's Law of Scattering to the Literature of Stellar Physics Ghouse Modin Mamdapur
The present paper tests one of the important bibliometric laws, Bradford's Law of Scattering, on the literature related to ‘stellar physics’ for the period 1988–2013 as available in the Web of Science Core Collection database. A total of 2738 articles related to stellar physics, published in English-language journals during the study period, are retrieved. Data are analysed with respect to year-wise growth of articles, relative growth rate, and doubling time of the literature. The 2738 articles are scattered across 188 journals. A list of ranked journals was prepared, and it was found that the Astrophysical Journal, with 895 articles, is the most productive journal publishing stellar physics literature, followed by Monthly Notices of the Royal Astronomical Society with 507 articles and Astronomy and Astrophysics with 380 articles. The theoretical formulation of Bradford's Law of Scattering is tested and found not to fit the present sample. The Leimkuhler model is tested and found to fit the data for a Bradford multiplier (k) of 11.65. Bradford's law is also tested through graphical formulation by drawing the Bradford bibliograph, which is found to confirm all three characteristics.
Paper presented at the International Conference "What's next in libraries? Trends, Space, and Partnerships", held during January 21-23, 2015 at NIT Silchar, Assam. It was jointly organized by NIT Silchar in association with its USA partner, the Mortenson Center for International Library Programs, University of Illinois at Urbana-Champaign.
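The Bradford zone analysis described above can be sketched numerically. The following is a minimal illustration, assuming a hypothetical ranked list of per-journal article counts (not the paper's actual dataset): it splits the ranked journals into three zones of roughly equal article yield and estimates the Bradford multiplier k as the ratio of journal counts between successive zones; the Leimkuhler model R(r) = a·ln(1 + b·r) is included as a plain function.

```python
import math

# Hypothetical ranked article counts per journal (descending order).
# These numbers are invented for illustration, not the study's data.
counts = [895, 507, 380] + [40] * 20 + [5] * 165

total = sum(counts)
zone_target = total / 3  # three Bradford zones of equal article yield

# Count how many journals it takes to fill each zone.
zones, running, journals_in_zone = [], 0, 0
for c in counts:
    running += c
    journals_in_zone += 1
    if running >= zone_target and len(zones) < 2:
        zones.append(journals_in_zone)
        running, journals_in_zone = 0, 0
zones.append(journals_in_zone)  # remaining journals form the last zone

# Bradford multiplier k: ratio of journal counts in successive zones.
k1 = zones[1] / zones[0]
k2 = zones[2] / zones[1]
print("journals per zone:", zones, "multipliers:", round(k1, 2), round(k2, 2))

# Leimkuhler model: cumulative articles in the top-r journals.
def leimkuhler(r, a, b):
    return a * math.log(1 + b * r)
```

On this made-up distribution the successive multipliers come out roughly constant, which is the qualitative pattern Bradford's law predicts; fitting a and b of the Leimkuhler model to real data would be done by regression on the ranked cumulative counts.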
Invited Talk: Early Detection of Research Topics Angelo Salatino
Slides of my talk at Chan Zuckerberg Initiative (Meta)
Abstract:
The ability to promptly recognise new research trends is strategic for many stakeholders, including universities, institutional funding bodies, academic publishers and companies. While the literature describes several approaches which aim to identify the emergence of new research topics early in their lifecycle, these rely on the assumption that the topic in question is already associated with a number of publications and consistently referred to by a community of researchers. Hence, detecting the emergence of a new research area at an embryonic stage, i.e., before the topic has been consistently labelled by a community of researchers and associated with a number of publications, is still an open challenge. In this paper, we begin to address this challenge by performing a study of the dynamics preceding the creation of new topics. This study indicates that the emergence of a new topic is anticipated by a significant increase in the pace of collaboration between relevant research areas, which can be seen as the ‘parents’ of the new topic. These initial findings (i) confirm our hypothesis that it is possible in principle to detect the emergence of a new topic at the embryonic stage, (ii) provide new empirical evidence supporting relevant theories in Philosophy of Science, and also (iii) suggest that new topics tend to emerge in an environment in which weakly interconnected research areas begin to cross-fertilise.
Automatic Classification of Springer Nature Proceedings with Smart Topic Miner Francesco Osborne
The process of classifying scholarly outputs is crucial to ensure timely access to knowledge. However, this process is typically carried out manually by expert editors, leading to high costs and slow throughput. In this paper we present Smart Topic Miner (STM), a novel solution which uses semantic web technologies to classify scholarly publications on the basis of a very large automatically generated ontology of research areas. STM was developed to support the Springer Nature Computer Science editorial team in classifying proceedings in the LNCS family. It analyses in real time a set of publications provided by an editor and produces a structured set of topics and a number of Springer Nature classification tags which best characterise the given input. We present the architecture of the system and report on an evaluation study conducted with a team of Springer Nature editors. The results of the evaluation showed that STM classifies publications with a high degree of accuracy; they are very encouraging, and as a result we are currently discussing the next steps required to ensure large-scale deployment within the company.
Information theoretic analysis of entity dynamics on the linked open data cloud MOVING Project
The Linked Open Data (LOD) cloud is expanding continuously. Entities appear, change, and disappear over time. However, relatively little is known about the dynamics of the entities, i.e., the characteristics of their temporal evolution. In this paper, we employ clustering techniques over the dynamics of entities to determine common temporal patterns. We define an entity as an RDF resource together with its attached RDF types and properties. The quality of the clusterings is evaluated using entity features such as the entities' properties, RDF types, and pay-level domain. In addition, we investigate to what extent entities that share a feature value change together over time. As dataset, we use weekly LOD snapshots over a period of more than three years provided by the Dynamic Linked Data Observatory. Insights into the dynamics of entities on the LOD cloud have strong practical implications for any application requiring fresh caches of LOD. The range of applications spans from determining crawling strategies for LOD and caching SPARQL queries to programming against LOD and recommending vocabularies for reuse.
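The clustering of entity dynamics can be illustrated with a toy example (a simplification, not the paper's actual method or data): represent each entity as a binary vector of weekly change indicators and group similar temporal patterns with a small k-means. The entities and vectors below are invented.

```python
# Each vector records, per weekly snapshot, whether the entity
# changed (1) or not (0). A tiny k-means groups entities with
# similar temporal change patterns. All data here is invented.
entities = {
    "e1": [1, 1, 1, 1, 0, 0],  # changes early in the observation window
    "e2": [1, 1, 1, 0, 0, 0],
    "e3": [0, 0, 0, 1, 1, 1],  # changes late
    "e4": [0, 0, 1, 1, 1, 1],
}

def dist(a, b):
    """Squared Euclidean distance between two change vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(vectors, k=2, iters=10):
    # Seed centers with two distinct patterns to keep the run deterministic.
    centers = [vectors[0], vectors[2]]
    clusters = []
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for v in vectors:
            clusters[min(range(k), key=lambda i: dist(v, centers[i]))].append(v)
        # Recompute each center as the component-wise mean of its cluster.
        centers = [
            [sum(col) / len(c) for col in zip(*c)] if c else centers[i]
            for i, c in enumerate(clusters)
        ]
    return clusters

clusters = kmeans(list(entities.values()))
print(clusters)  # two groups: early-changing vs late-changing entities
```

In the paper's setting the vectors would be derived from the weekly Dynamic Linked Data Observatory snapshots, and cluster quality would then be compared against features such as RDF types and pay-level domains.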
EMPIRICAL PROJECT Objective to help students put in practice w.docx SALU18
EMPIRICAL PROJECT
Objective: * to help students put in practice what they have learned in Econometrics I
* to teach students how to write an “economic paper”.
Steps
a) Selecting a topic
Topic areas:
- Macroeconomics: consumption function, investment function, demand function, the Phillips curve…
- Microeconomics: estimating production, cost, supply and demand. Data are hard to obtain here.
- Urban and Regional Economics: demand for housing, transportation…
- International Economics: estimating import and export functions, estimating purchasing power parity, estimating capital mobility…
- Development Economics: measuring the determinants of per-capita income, testing the per-capita output convergence among nations…
- Labor Economics: testing theories of unionization, estimating labor force participation, estimating wage differentials among women, minorities…
- Resource and Environmental Economics: estimating water pollution, estimating the determinants of toxic emissions…
The resource journal is the JEL (Journal of Economic Literature), plus the EconLit database on the Internet.
b) Statement of the Problem
State clearly the problem that you are interested in (what are you trying
to achieve)
c) Review of literature
Point out (critically) what others have done concerning the topic of interest.
d) Formulation of a general model
The final model can be derived in several ways: utility maximization, profit maximization, cost minimization, etc. The review of literature is generally helpful to accomplish this task. In the course of deriving the model, one must sort out clearly the dependent variable and the independent variables. After transforming the economic model into an econometric model, one writes up the hypotheses to be tested: the expected signs and magnitudes of the parameters. To elaborate a bit, let us use the following demand function for some good:
Q_be = α + β·P_be + γ·P_o + δ·Y + u

where Q_be, P_be, P_o, Y, and u represent the quantity of the good of interest (e.g., beef), the price of that good, the price of another good (pork, etc.), income, and the error term, respectively. Here β < 0, and γ <> 0 depending on the nature of the other good: γ > 0 if it is a substitute and γ < 0 if it is a complement. The size of β depends on the nature of the product. Thus, if the product is a necessity, price and income elasticities are expected to be small.
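As a sketch of the estimation step that follows, the demand equation above can be fitted by ordinary least squares. The data and "true" parameter values below are entirely synthetic, chosen only to illustrate the mechanics of building a design matrix and checking the hypothesised signs:

```python
import numpy as np

# Synthetic data for the demand equation Q = a + b*P_be + g*P_o + d*Y + u.
# All numbers are invented for illustration.
rng = np.random.default_rng(0)
n = 200
p_be = rng.uniform(2, 10, n)   # own price
p_o = rng.uniform(2, 10, n)    # price of the other good (e.g., pork)
y = rng.uniform(20, 80, n)     # income
u = rng.normal(0, 1, n)        # error term

alpha, beta, gamma, delta = 50.0, -3.0, 1.5, 0.4  # invented "true" values
q = alpha + beta * p_be + gamma * p_o + delta * y + u

# Design matrix with a constant column for the intercept, then OLS.
X = np.column_stack([np.ones(n), p_be, p_o, y])
coef, *_ = np.linalg.lstsq(X, q, rcond=None)
print("estimates (a, b, g, d):", coef.round(2))
```

With the hypothesised signs, one would then check that the estimated β is negative (downward-sloping demand) and interpret the sign of γ as evidence of substitutability or complementarity.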
e) Collecting Data
Sources: international, national, regional
primary or secondary.
Notes.
f) Empirical Analysis
Data analysis: outliers, level of variation…
Model estimation and hypothesis testing
g) Writing a Report
Statement of the problem: describe the problem you have studied,
the questi ...
Detection of Embryonic Research Topics by Analysing Semantic Topic Networks Angelo Salatino
Being aware of new research topics is an important asset for anybody involved in the research environment, including researchers, academic publishers and institutional funding bodies. In recent years, the amount of scholarly data available on the web has increased steadily, allowing the development of several approaches for detecting emerging research topics and assessing their trends. However, current methods focus on the detection of topics which are already associated with a label or a substantial number of documents. In this paper, we address instead the issue of detecting embryonic topics, which do not yet possess these characteristics. We suggest that it is possible to forecast the emergence of novel research topics even at such an early stage and demonstrate that the emergence of a new topic can be anticipated by analysing the dynamics of pre-existing topics. We present an approach to evaluate such dynamics and an experiment on a sample of 3 million research papers, which confirms our hypothesis. In particular, we found that the pace of collaboration in sub-graphs of topics that will give rise to novel topics is significantly higher than in the control group.
The CSO Classifier: Ontology-Driven Detection of Research Topics in Scholarly... Angelo Salatino
Classifying research papers according to their research topics is an important task to improve their retrievability, assist the creation of smart analytics, and support a variety of approaches for analysing and making sense of the research environment. In this paper, we present the CSO Classifier, a new unsupervised approach for automatically classifying research papers according to the Computer Science Ontology (CSO), a comprehensive ontology of research areas in the field of Computer Science. The CSO Classifier takes as input the metadata associated with a research paper (title, abstract, keywords) and returns a selection of research concepts drawn from the ontology. The approach was evaluated on a gold standard of manually annotated articles, yielding a significant improvement over alternative methods.
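As a rough illustration of the kind of syntactic matching such a classifier can start from (this is not the actual CSO Classifier, which also exploits word embeddings and the ontology's hierarchical structure; the topic list below is a small invented sample, not CSO itself):

```python
import re

# Toy ontology-driven topic detection: match n-grams from a paper's
# metadata against a flat list of research-area labels. The labels
# below are an invented sample for illustration.
topics = {"machine learning", "semantic web", "neural networks", "ontology"}

def classify(title, abstract, max_n=3):
    """Return the topic labels whose n-grams appear in title + abstract."""
    words = re.findall(r"[a-z]+", (title + " " + abstract).lower())
    found = set()
    for n in range(1, max_n + 1):
        for i in range(len(words) - n + 1):
            gram = " ".join(words[i:i + n])
            if gram in topics:
                found.add(gram)
    return sorted(found)

print(classify("A Semantic Web approach",
               "We apply machine learning and an ontology of topics."))
# → ['machine learning', 'ontology', 'semantic web']
```

A real system would additionally normalise variants of the same label and use the ontology's relationships to infer broader areas from the matched concepts.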
Using Text Comprehension Model for Learning Concepts, Context, and Topic of... Kent State University
Concepts in web ontologies help machines to understand data through the meanings they hold. Furthermore, learning the contexts and topics of web documents has also helped in better semantic-oriented structuring and retrieval of data on the web. In this short paper we present a novel approach for domain-independent open learning of the domain concepts, context, and topic of any given web document. Our approach is based on a computational version of the Construction-Integration (CI) model of text comprehension. Our proposed system mimics the way humans learn the meanings of textual units and identifies domain concepts, contexts, and topics in the form of semantic networks. We apply our system to a number of web documents with a range of topics and domains. The resulting semantic networks provide quantitative and qualitative insights into the nature of the given web documents.
A Network-Aware Approach for Searching As-You-Type in Social Media INRIA-OAK
We present a novel approach for as-you-type top-k keyword search over social media. We adopt a natural "network-aware" interpretation of information relevance, by which information produced by users who are closer to the seeker is considered more relevant. In practice, this query model poses new challenges for effectiveness and efficiency in online search, even when a complete query is given as input in one keystroke. This is mainly because it requires a joint exploration of the social space and classic IR indexes such as inverted lists. We describe a memory-efficient and incremental prefix-based retrieval algorithm, which also exhibits anytime behavior, allowing it to output the most likely answer within any chosen running-time limit. We evaluate it through extensive experiments for several applications and search scenarios, including searching for posts in micro-blogging (Twitter and Tumblr), as well as searching for businesses based on reviews in Yelp. The results show that our solution is effective in answering real-time as-you-type searches over social media.
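The network-aware ranking idea can be shown with a deliberately simplified sketch (it omits the paper's incremental, memory-efficient index structures; the posts and proximity scores below are invented): posts containing a word that matches the typed prefix are ranked by the social proximity of their author to the seeker.

```python
import heapq

# Invented corpus: (author, post text) pairs and each author's social
# proximity to the seeker (higher = closer in the network).
posts = [
    ("alice", "great coffee in paris"),
    ("bob",   "coffee machine review"),
    ("carol", "concert tonight"),
]
proximity = {"alice": 0.9, "bob": 0.4, "carol": 0.7}

def as_you_type(prefix, k=2):
    """Top-k posts whose text matches the typed prefix, closest authors first."""
    matches = [
        (proximity[user], user, text)
        for user, text in posts
        if any(w.startswith(prefix) for w in text.split())
    ]
    return [(u, t) for _, u, t in heapq.nlargest(k, matches)]

print(as_you_type("cof"))
# → [('alice', 'great coffee in paris'), ('bob', 'coffee machine review')]
```

The paper's actual algorithm interleaves this social-proximity ordering with prefix traversal of inverted lists so that answers can be refined incrementally after every keystroke, rather than recomputed from scratch as this sketch does.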
In the last decade, several Scientific Knowledge Graphs (SKGs) have been released, representing scientific knowledge in a structured, interlinked, and semantically rich manner. But what kind of information do they describe? How have they been built? What can we do with them? In this lecture, I will first provide an overview of well-known SKGs, like Microsoft Academic Graph, Dimensions, and others. Then, I will present the Academia/Industry DynAmics (AIDA) Knowledge Graph, which describes 21M publications and 8M patents according to i) the research topics drawn from the Computer Science Ontology, ii) the type of the authors' affiliations (e.g., academia, industry), and iii) 66 industrial sectors (e.g., automotive, financial, energy, electronics) from the Industrial Sectors Ontology (INDUSO). Finally, I will showcase a number of tools and approaches using such SKGs, supporting researchers, companies, and policymakers in making sense of research dynamics.
Applying machine learning techniques to big data in the scholarly domain Angelo Salatino
Slides of the Lecture at the 5th International School on Applied Probability Theory, Communications Technologies & Data Science (APTCT-2020)
12 Nov 2020
Similar to AUGUR: Forecasting the Emergence of New Research Topics
ResearchFlow: Understanding the Knowledge Flow between Academia and Industry Angelo Salatino
Understanding, monitoring, and predicting the flow of knowledge between academia and industry is of critical importance for a variety of stakeholders, including governments, funding bodies, researchers, investors, and companies. To this purpose, we introduce ResearchFlow, an approach that integrates semantic technologies and machine learning to quantify the diachronic behaviour of research topics across academia and industry. ResearchFlow exploits the novel Academia/Industry DynAmics (AIDA) Knowledge Graph in order to characterize each topic according to the frequency in time of the related i) publications from academia, ii) publications from industry, iii) patents from academia, and iv) patents from industry. This representation is then used to produce several analytics regarding the academia/industry knowledge flow and to forecast the impact of research topics on industry. We applied ResearchFlow to a dataset of 3.5M papers and 2M patents in Computer Science and highlighted several interesting patterns. We found that 89.8% of the topics first emerge in academic publications, which typically precede industrial publications by about 5.6 years and industrial patents by about 6.6 years. However, this does not mean that academia always dictates the research agenda. In fact, our analysis also shows that industrial trends tend to influence academia more than academic trends affect industry. We evaluated ResearchFlow on the task of forecasting the impact of research topics on the industrial sector and found that its granular characterization of topics significantly improves performance with respect to alternative solutions.
Early Detection of Research Trends [thesis defence] - Angelo Salatino
Being able to rapidly recognise new research trends is strategic for many stakeholders, including universities, institutional funding bodies, academic publishers and companies. The literature presents several approaches to identifying the emergence of new research topics, which rely on the assumption that the topic is already exhibiting a certain degree of popularity and consistently referred to by a community of researchers. However, detecting the emergence of a new research area at an embryonic stage, i.e., before the topic has been consistently labelled by a community of researchers and associated with a number of publications, is still an open challenge. In this dissertation, we begin to address this challenge by performing a study of the dynamics preceding the creation of new topics. This study indicates that the emergence of a new topic is anticipated by a significant increase in the pace of collaboration between relevant research areas, which can be seen as the ‘ancestors’ of the new topic. Based on this understanding, we developed Augur, a novel approach to effectively detecting the emergence of new research topics. Augur analyses the diachronic relationships between research areas and is able to detect clusters of topics that exhibit dynamics correlated with the emergence of new research topics. Here we also present the Advanced Clique Percolation Method (ACPM), a new community detection algorithm developed specifically for supporting this task. Augur was evaluated on a gold standard of 1,408 debutant topics in the 2000-2011 timeframe and outperformed four alternative approaches in terms of both precision and recall.
The Computer Science Ontology: A Large-Scale Taxonomy of Research Areas - Angelo Salatino
Ontologies of research areas are important tools for characterising, exploring, and analysing the research landscape. Some fields of research are comprehensively described by large-scale taxonomies, e.g., MeSH in Biology and PhySH in Physics. Conversely, current Computer Science taxonomies are coarse-grained and tend to evolve slowly. For instance, the ACM classification scheme contains only about 2K research topics and the last version dates back to 2012. In this paper, we introduce the Computer Science Ontology (CSO), a large-scale, automatically generated ontology of research areas, which includes about 15K topics and 70K semantic relationships. It was created by applying the Klink-2 algorithm on a very large dataset of 16M scientific articles. CSO presents two main advantages over the alternatives: i) it includes a very large number of topics that do not appear in other classifications, and ii) it can be updated automatically by running Klink-2 on recent corpora of publications. CSO powers several tools adopted by the editorial team at Springer Nature and has been used to enable a variety of solutions, such as classifying research publications, detecting research communities, and predicting research trends. To facilitate the uptake of CSO we have developed the CSO Portal, a web application that enables users to download, explore, and provide granular feedback on CSO at different levels. Users can use the portal to rate topics and relationships, suggest missing relationships, and visualise sections of the ontology. The portal will support the publication of and access to regular new releases of CSO, with the aim of providing a comprehensive resource to the various communities engaged with scholarly data.
Techniques to optimize the PageRank algorithm usually fall into two categories: reducing the work per iteration, and reducing the number of iterations. These goals are often at odds with one another. Skipping computation on vertices which have already converged can save iteration time. Skipping in-identical vertices, i.e. vertices with the same in-links, avoids duplicate computations and can likewise reduce iteration time. Road networks often contain chains which can be short-circuited before the PageRank computation, since the final ranks of chain nodes are easy to calculate; this can reduce both the iteration time and the number of iterations. If a graph has no dangling nodes, the PageRank of each strongly connected component can be computed in topological order, which can reduce the iteration time and the number of iterations, and also enables multi-iteration concurrency in the computation. The combination of all of the above methods is the STICD algorithm [sticd]. For dynamic graphs, unchanged components whose ranks are unaffected can be skipped altogether.
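The first of these optimisations, skipping computation on converged vertices, can be sketched as follows. This is a heuristic illustration, not the STICD implementation: frozen vertices still contribute rank to their neighbours, but are no longer re-ranked themselves, which trades a little accuracy for less work per iteration.

```python
def pagerank_skip_converged(adj, d=0.85, eps=1e-10, iters=100):
    """Power-iteration PageRank that stops updating vertices whose rank has
    stabilised. adj maps each vertex to its list of out-neighbours; the
    sketch assumes no dangling nodes (every vertex has out-links)."""
    n = len(adj)
    rank = {v: 1.0 / n for v in adj}
    active = set(adj)
    for _ in range(iters):
        if not active:
            break
        # Contributions come from every vertex, frozen or not.
        contrib = {v: 0.0 for v in adj}
        for u, outs in adj.items():
            share = rank[u] / len(outs)
            for v in outs:
                contrib[v] += share
        # Only active vertices are re-ranked; stable ones are skipped.
        still_active = set()
        for v in active:
            new_rank = (1 - d) / n + d * contrib[v]
            if abs(new_rank - rank[v]) > eps:
                still_active.add(v)
            rank[v] = new_rank
        active = still_active
    return rank

g = {"a": ["b"], "b": ["c"], "c": ["a"]}
print(pagerank_skip_converged(g))  # each rank is 1/3 on a symmetric 3-cycle
```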
As Europe's leading economic powerhouse and the fourth-largest economy globally, Germany stands at the forefront of innovation and industrial might. Renowned for its precision engineering and high-tech sectors, Germany's economic structure is heavily supported by a robust service industry, accounting for approximately 68% of its GDP. This economic clout and strategic geopolitical stance position Germany as a focal point in the global cyber threat landscape.
In the face of escalating global tensions, particularly those emanating from geopolitical disputes with nations like Russia and China, Germany has witnessed a significant uptick in targeted cyber operations. Our analysis indicates a marked increase in cyberattack sophistication aimed at critical infrastructure and key industrial sectors. These attacks range from ransomware campaigns to Advanced Persistent Threats (APTs), threatening national security and business integrity.
🔑 Key findings include:
🔍 Increased frequency and complexity of cyber threats.
🔍 Escalation of state-sponsored and criminally motivated cyber operations.
🔍 Active dark web exchanges of malicious tools and tactics.
Our comprehensive report delves into these challenges, using a blend of open-source and proprietary data collection techniques. By monitoring activity on critical networks and analyzing attack patterns, our team provides a detailed overview of the threats facing German entities.
This report aims to equip stakeholders across public and private sectors with the knowledge to enhance their defensive strategies, reduce exposure to cyber risks, and reinforce Germany's resilience against cyber threats.
StarCompliance is a leading firm specializing in the recovery of stolen cryptocurrency. Our comprehensive services are designed to assist individuals and organizations in navigating the complex process of fraud reporting, investigation, and fund recovery. We combine cutting-edge technology with expert legal support to provide a robust solution for victims of crypto theft.
Our Services Include:
Reporting to Tracking Authorities:
We immediately notify all relevant centralized exchanges (CEX), decentralized exchanges (DEX), and wallet providers about the stolen cryptocurrency. This ensures that the stolen assets are flagged as scam transactions, making it impossible for the thief to use them.
Assistance with Filing Police Reports:
We guide you through the process of filing a valid police report. Our support team provides detailed instructions on which police department to contact and helps you complete the necessary paperwork within the critical 72-hour window.
Launching the Refund Process:
Our team of experienced lawyers can initiate lawsuits on your behalf and represent you in various jurisdictions around the world. They work diligently to recover your stolen funds and ensure that justice is served.
At StarCompliance, we understand the urgency and stress involved in dealing with cryptocurrency theft. Our dedicated team works quickly and efficiently to provide you with the support and expertise needed to recover your assets. Trust us to be your partner in navigating the complexities of the crypto world and safeguarding your investments.
Empowering the Data Analytics Ecosystem: A Laser Focus on Value
The data analytics ecosystem thrives when every component functions at its peak, unlocking the true potential of data. Here's a laser focus on key areas for an empowered ecosystem:
1. Democratize Access, Not Data:
Granular Access Controls: Provide users with self-service tools tailored to their specific needs, preventing data overload and misuse.
Data Catalogs: Implement robust data catalogs for easy discovery and understanding of available data sources.
2. Foster Collaboration with Clear Roles:
Data Mesh Architecture: Break down data silos by creating a distributed data ownership model with clear ownership and responsibilities.
Collaborative Workspaces: Utilize interactive platforms where data scientists, analysts, and domain experts can work seamlessly together.
3. Leverage Advanced Analytics Strategically:
AI-powered Automation: Automate repetitive tasks like data cleaning and feature engineering, freeing up data talent for higher-level analysis.
Right-Tool Selection: Strategically choose the most effective advanced analytics techniques (e.g., AI, ML) based on specific business problems.
4. Prioritize Data Quality with Automation:
Automated Data Validation: Implement automated data quality checks to identify and rectify errors at the source, minimizing downstream issues.
Data Lineage Tracking: Track the flow of data throughout the ecosystem, ensuring transparency and facilitating root cause analysis for errors.
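The automated validation described above can be as simple as a rule table applied to rows as they arrive, rejecting bad records at the source. A minimal sketch; the rules and column names here are illustrative, not any specific product's API:

```python
# Rule-based validation at the source: each rule is (column, predicate, message).
def validate(rows, rules):
    """Return (clean_rows, errors); errors carry the row index and messages."""
    clean, errors = [], []
    for i, row in enumerate(rows):
        failed = [msg for col, pred, msg in rules
                  if col in row and not pred(row[col])]
        if failed:
            errors.append((i, row, failed))
        else:
            clean.append(row)
    return clean, errors

rules = [
    ("age", lambda v: isinstance(v, int) and 0 <= v <= 120, "age out of range"),
    ("email", lambda v: isinstance(v, str) and "@" in v, "malformed email"),
]
rows = [{"age": 34, "email": "a@b.com"}, {"age": -1, "email": "oops"}]
clean, errors = validate(rows, rules)
print(len(clean), len(errors))  # 1 1
```

The captured error messages are exactly the raw material that lineage tracking and root-cause analysis need downstream.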
5. Cultivate a Data-Driven Mindset:
Metrics-Driven Performance Management: Align KPIs and performance metrics with data-driven insights to ensure actionable decision making.
Data Storytelling Workshops: Equip stakeholders with the skills to translate complex data findings into compelling narratives that drive action.
Benefits of a Precise Ecosystem:
Sharpened Focus: Precise access and clear roles ensure everyone works with the most relevant data, maximizing efficiency.
Actionable Insights: Strategic analytics and automated quality checks lead to more reliable and actionable data insights.
Continuous Improvement: Data-driven performance management fosters a culture of learning and continuous improvement.
Sustainable Growth: Empowered by data, organizations can make informed decisions to drive sustainable growth and innovation.
By focusing on these precise actions, organizations can create an empowered data analytics ecosystem that delivers real value by driving data-driven decisions and maximizing the return on their data investment.
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23... - John Andrews
SlideShare Description for "Chatty Kathy - UNC Bootcamp Final Project Presentation"
Title: Chatty Kathy: Enhancing Physical Activity Among Older Adults
Description:
Discover how Chatty Kathy, an innovative project developed at the UNC Bootcamp, aims to tackle the challenge of low physical activity among older adults. Our AI-driven solution uses peer interaction to boost and sustain exercise levels, significantly improving health outcomes. This presentation covers our problem statement, the rationale behind Chatty Kathy, synthetic data and persona creation, model performance metrics, a visual demonstration of the project, and potential future developments. Join us for an insightful Q&A session to explore the potential of this groundbreaking project.
Project Team: Jay Requarth, Jana Avery, John Andrews, Dr. Dick Davis II, Nee Buntoum, Nam Yeongjin & Mat Nicholas
Opendatabay - Open Data Marketplace.pptx - Opendatabay
Opendatabay.com unlocks the power of data for everyone. Open Data Marketplace fosters a collaborative hub for data enthusiasts to explore, share, and contribute to a vast collection of datasets.
The first open hub for data enthusiasts to collaborate and innovate. A platform to explore, share, and contribute to a vast collection of datasets. Through robust quality control and innovative technologies like blockchain verification, Opendatabay ensures the authenticity and reliability of datasets, empowering users to make data-driven decisions with confidence. Leverage cutting-edge AI technologies to enhance the data exploration, analysis, and discovery experience.
From intelligent search and recommendations to automated data productisation and quotation, Opendatabay's AI-driven features streamline the data workflow. Finding the data you need shouldn't be complex. Opendatabay simplifies the data acquisition process with an intuitive interface and robust search tools. Effortlessly explore, discover, and access the data you need, allowing you to focus on extracting valuable insights. Opendatabay breaks new ground with dedicated, AI-generated synthetic datasets.
Leverage these privacy-preserving datasets for training and testing AI models without compromising sensitive information. Opendatabay prioritizes transparency by providing detailed metadata, provenance information, and usage guidelines for each dataset, ensuring users have a comprehensive understanding of the data they're working with. By leveraging a powerful combination of distributed ledger technology and rigorous third-party audits, Opendatabay ensures the authenticity and reliability of every dataset. Security is at the core of Opendatabay: the marketplace implements stringent security measures, including encryption, access controls, and regular vulnerability assessments, to safeguard your data and protect your privacy.
AUGUR: Forecasting the Emergence of New Research Topics
1. AUGUR: Forecasting the Emergence of New Research Topics
Angelo A. Salatino, Francesco Osborne, Enrico Motta
@angelosalatino
2. Background
Anatomy of a research topic
• Early stage: researchers build their conceptual framework and establish their community
• Recognised: many researchers work on the topic and start to produce and disseminate their results
3. Background
«[…] successive transition from one paradigm to
another via revolution is the usual developmental
pattern of mature science.»
Thomas Kuhn - The Structure of Scientific Revolutions
4. How are research topics born?
• The fundamental assumption of this research is that it
should be possible to detect the emergence of new
research topics even before they are consistently
labelled by the community
– The approach focuses on uncovering the relevant patterns in the
dynamics of existing topics
Salatino, Angelo A., Francesco Osborne, and Enrico Motta. "How are topics born?
Understanding the research dynamics preceding the emergence of new areas." PeerJ
Computer Science 3 (2017): e119. https://peerj.com/articles/cs-119/
5. How are research topics born?
The creation of novel topics is anticipated by a significant
increase in the pace of collaboration and density of the
portions of the network in which they will appear.
6. Forecasting the Emergence of New Research Topics
[Diagram: pipeline from the selection of related topics, through data analysis of their dynamics, to the emerging topics]
8. Background data
Scholarly Data
• Dump of Scopus until 2014
• Co-occurrence network
– Nodes: keywords in papers
– Links: number of times two keywords co-occur
Computer Science Ontology*
• Large-scale ontology of research areas automatically generated using the Klink-2 algorithm**
• Defines when a topic is broader than another topic
• Defines when two topics express the same subject of study
* Computer Science Ontology Portal: https://w3id.org/cso
** Osborne, F. and Motta, E.: Klink-2: integrating multiple web sources to generate semantic topic networks. In ISWC 2015
9. Evolutionary Network
Snapshot of the collaborations of topics in a period of five years

Edge weights:

\hat{w}^{year}_{u,v} = \mathrm{HarmonicMean}\left( \frac{w^{year}_{u,v}}{p^{year}_{u}}, \frac{w^{year}_{u,v}}{p^{year}_{v}} \right)

w^{evol}_{u,v} = \frac{\sum_{i=0}^{4} (year_{t-i} - \overline{year}) \, (\hat{w}^{year_{t-i}}_{u,v} - \overline{\hat{w}}_{u,v})}{\sum_{i=0}^{4} (year_{t-i} - \overline{year})^{2}}

Node weights:

p^{evol}_{k} = \frac{\sum_{i=0}^{4} (year_{t-i} - \overline{year}) \, (p^{year_{t-i}}_{k} - \overline{p}_{k})}{\sum_{i=0}^{4} (year_{t-i} - \overline{year})^{2}}

where w^{year}_{u,v} is the number of co-occurrences of topics u and v in a year, p^{year}_{u} is the frequency of topic u in that year, and the evolutionary weights are the least-squares slopes of these quantities over the five-year window.
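Under the slide's definitions, the yearly edge weight combines co-occurrence counts with topic frequencies via a harmonic mean, and the evolutionary weight is the slope of a least-squares fit over the five-year window. A minimal numeric sketch; the counts are invented and the normalisation details are assumed:

```python
import numpy as np

def harmonic_mean(a, b):
    return 2 * a * b / (a + b) if a + b else 0.0

def yearly_edge_weight(w_uv, p_u, p_v):
    """Co-occurrences w_uv normalised by each topic's frequency p_u, p_v."""
    return harmonic_mean(w_uv / p_u, w_uv / p_v)

def evolutionary_weight(years, values):
    """Least-squares slope of the series over the window (five years here)."""
    slope, _ = np.polyfit(years, values, 1)
    return slope

years = [2008, 2009, 2010, 2011, 2012]
w_hat = [yearly_edge_weight(w, pu, pv)
         for w, pu, pv in [(2, 50, 40), (5, 60, 45), (9, 70, 50),
                           (14, 80, 55), (22, 90, 60)]]
print(round(evolutionary_weight(years, w_hat), 4))
```

A positive slope indicates an increasing pace of collaboration between the two topics, which is the signal AUGUR looks for.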
10. Advanced Clique Percolation Method
Standard CPM (Palla et al., 2005): topic network → clique detection → clique-graph construction → communities
Advanced: topic network → measure & filtering → clique detection → find local maxima and select neighbours → clique-graph construction → communities
An example: 1 connected component with 4 local maxima
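The standard CPM that ACPM extends is available off the shelf; a minimal sketch using networkx (assumed installed) with k = 3, on an invented toy graph:

```python
# Standard Clique Percolation Method (Palla et al., 2005): communities are
# unions of k-cliques that share k-1 nodes.
import networkx as nx
from networkx.algorithms.community import k_clique_communities

G = nx.Graph()
G.add_edges_from([("a", "b"), ("a", "c"), ("b", "c"),   # triangle 1
                  ("b", "d"), ("c", "d"),               # triangle 2, shares edge b-c
                  ("e", "f")])                          # no triangle: excluded
communities = [set(c) for c in k_clique_communities(G, 3)]
print(sorted(communities[0]))  # ['a', 'b', 'c', 'd']
```

The two triangles share two nodes (k - 1 = 2), so they percolate into a single community; the e-f edge belongs to no 3-clique and is left out.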
12. Post-processing & Sense Making
• Some clusters are filtered and others merged
• Extraction of influential papers
• Extraction of influential authors
Example: clusters A and B are merged into A ∪ B
13. Final Outcome
Influential Authors
W. Bruce Croft,
Dieter Fensel,
Dan Suciu,
William W. Cohen,
Berthier Ribeiro-Neto,
Clement T. Yu,
James Allan,
Justin Zobel,
Dragomir R. Radev,
Victor Vianu
Influential Papers
- A Sheth et al. "Managing semantic content for the Web" (2002)
- RWP Luk et al. "A survey in indexing and searching XML documents" (2002)
- J Kahan et al. "Annotea: An open RDF infrastructure for shared Web annotations"
(2002)
- R Manmatha et al. "Modeling score distributions for combining the outputs of
search engines" (2001)
- S Dagtas et al. "Models for motion-based video indexing and retrieval" (2000)
Portion of the evolutionary network in 2002, reflecting the emergence
of Semantic Search in 2003
14. Evaluating
We evaluated AUGUR and ACPM against a gold standard of emerging topics and compared them with four state-of-the-art algorithms:
• Fast Greedy (FG)
• Leading Eigenvector (LE)
• Fuzzy C-Means (FCM)
• Clique Percolation Method (CPM)
15. Evaluating
• Gold standard of 1408 emerging topics in 2000-11
• For each emerging topic we extracted 25 ancestors
– Topics that mostly collaborated with the debutant topic during its
first five years of activity
Year 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011
#topics 149 194 221 216 137 241 134 60 27 12 12 5
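The ancestor extraction described above can be sketched as follows; the data structures and topic names are illustrative assumptions, with the window of five years and the cut-off of 25 taken from the slide.

```python
# For a debutant topic, rank the topics that co-occurred with it during its
# first five years of activity and keep the top n collaborators as ancestors.
from collections import Counter

def ancestors(cooccurrences_by_year, debut_year, n=25):
    """cooccurrences_by_year: year -> Counter mapping topic ->
    co-occurrence count with the debutant topic."""
    total = Counter()
    for year in range(debut_year, debut_year + 5):
        total.update(cooccurrences_by_year.get(year, Counter()))
    return [t for t, _ in total.most_common(n)]

cooc = {2003: Counter({"information retrieval": 4, "ontology": 2}),
        2004: Counter({"ontology": 7, "xml": 1})}
print(ancestors(cooc, 2003, n=2))  # ['ontology', 'information retrieval']
```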
17. Evaluating
Metrics:
• Precision, number of matching clusters divided by the total
number of clusters
• Recall, number of matching debutant topics divided by the
total number of debutant topics
18. Evaluating
Matching:
• Jaccard Similarity between Cluster (Ci) with ancestors
of debutant topics (Dk)
• We added the same-as (SAi) of the cluster (Ci)
J(C_i, D_k, SA_i) = |(C_i ∪ SA_i) ∩ D_k| / |(C_i ∪ SA_i) ∪ D_k|
0 ≤ J(C_i, D_k, SA_i) ≤ 1
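A minimal sketch of this matching score, assuming sets of topic labels for a cluster C_i, its same-as topics SA_i, and the ancestor set D_k of a debutant topic (the example topics are invented):

```python
# Jaccard similarity between a cluster, extended with its same-as topics,
# and the ancestors of a debutant topic.
def match_score(C_i, SA_i, D_k):
    extended = set(C_i) | set(SA_i)
    inter = extended & set(D_k)
    union = extended | set(D_k)
    return len(inter) / len(union) if union else 0.0

C = {"semantic web", "ontology"}
SA = {"ontologies"}
D = {"ontology", "ontologies", "information retrieval"}
print(match_score(C, SA, D))  # 0.5
```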
19. Evaluating
Four strategies:
Strategy (1): C_i vs. D_k (shown previously)
Strategy (2): (C_i ∪ EC_i) vs. D_k
Strategy (3): C_i vs. (D_k ∪ ED_k)
Strategy (4): (C_i ∪ EC_i) vs. (D_k ∪ ED_k)

J(C_i, D_k, SA_i, EC_i, ED_k) = |(C_i ∪ EC_i ∪ SA_i) ∩ (D_k ∪ ED_k)| / |(C_i ∪ EC_i ∪ SA_i) ∪ (D_k ∪ ED_k)|
0 ≤ J(C_i, D_k, SA_i, EC_i, ED_k) ≤ 1
24. Conclusion
• We evaluated Augur and ACPM versus four alternative
approaches on a gold standard of 1,408 debutant topics
in the 2000-2011 timeframe.
• The results show that our approach outperforms state-of-the-art solutions and successfully identifies clusters that will produce new topics in the following two years.
25. Future work
• Gold Standard
• Scope
• Further dynamics
• Analysis on more recent data
26. Thank you
SKM3 Scholarly Knowledge: Modelling, Mining and Sense Making
http://skm.kmi.open.ac.uk/
Angelo A. Salatino
email: angelo.salatino@open.ac.uk
twitter: @angelosalatino
Web: salatino.org
Francesco Osborne, Enrico Motta, Angelo Salatino