The research environment evolves rapidly: new research areas emerge while others fade out, which makes it difficult to keep up with these dynamics. At the moment, the task of identifying the main emerging areas is accomplished either automatically or semi-automatically, using systems such as Rexplore, Saffron, ArnetMiner, MAS, Google Scholar, Faceted DBLP and CiteSeer.
If we take as an example the evolution over time of a topic, measured by the number of papers, such as the Semantic Web shown in the figure, we can recognise three main stages: embryonic, early stage and recognised.
In fact, it can be argued that a number of topics start to exist in an embryonic way, often as a combination of other topics, before being officially identified and then named by researchers. For example, the Semantic Web emerged as a common area for researchers working on Artificial Intelligence, WWW and Knowledge-Based Systems, before being acknowledged and labelled in the 2001 paper by Tim Berners-Lee. The early stage phase starts when a group of scientists agree on some theories related to the topic, build their own conceptual framework, and potentially give birth to a new scientific community. Finally, in the recognised phase, many authors become aware of the topic and start to work on it, producing results and publishing research papers.
The problem is that all the aforementioned systems can detect trends only when the research area is already recognised, and not before. They actually need some years to make sense of these new trends. Moreover, there are no systems able to forecast the impact of a trend at an early stage.
I am interested in identifying, making sense of, and forecasting the impact of research trends.
Who is really interested? Researchers need to be updated regularly on the evolution of the research environment because they are interested in new trends related to their topics. For academic publishers and editors, knowing in advance which topics are emerging is crucial for offering the most up-to-date and interesting content. For example, an editor can gain a competitive advantage by being the first to recognise the importance of a new trend and to publish a special issue or a journal about it. Indeed, my PhD project is supported by Springer-Verlag. Institutional funding bodies and companies also need to be aware of research developments and promising research trends. For example, being aware of future research trends allows them to plan important investments in advance.
This problem can be analysed from two points of view: the detection of topic trends and the forecasting of the impact of topics.
As far as trend detection is concerned, current approaches use bibliometric analysis: they extract either topics or main terms from the text and then analyse the evolution of these topics by investigating the citation network. The main limitation of these approaches is that the content for a specific topic first needs to be produced and then cited, so it takes years before a trend can be detected.
On the other hand, approaches to forecasting impact define the impact as the number of publications and authors associated with a topic. They are mainly based on statistical techniques such as exponential smoothing and simple moving average, as well as on machine learning algorithms.
The main limitations of these approaches are that they do not work in the first phases of the evolution of a topic and that they employ a limited set of features.
However, it can be argued that a different definition of impact, based also on social media data, can improve the forecasting phase and allow us to perform it on a shorter timescale.
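To make the statistical baselines mentioned above concrete, here is a minimal sketch of a simple moving average and of simple exponential smoothing applied to a yearly count of papers for one topic. The function names, the `window` and `alpha` parameters, and the sample counts are illustrative assumptions, not taken from any of the cited systems.

```python
from typing import Sequence


def simple_moving_average(series: Sequence[float], window: int) -> float:
    """Forecast the next value as the mean of the last `window` observations."""
    if window <= 0 or window > len(series):
        raise ValueError("window must be between 1 and len(series)")
    recent = series[-window:]
    return sum(recent) / window


def exponential_smoothing(series: Sequence[float], alpha: float) -> float:
    """Forecast the next value with simple exponential smoothing.

    The level starts at the first observation and is updated as
    level = alpha * observation + (1 - alpha) * level.
    """
    if not series:
        raise ValueError("series must not be empty")
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must be in (0, 1]")
    level = series[0]
    for value in series[1:]:
        level = alpha * value + (1.0 - alpha) * level
    return level


# Hypothetical yearly paper counts for one topic (illustrative numbers only).
papers_per_year = [2, 3, 5, 8, 13, 21]

sma_forecast = simple_moving_average(papers_per_year, window=3)
ses_forecast = exponential_smoothing(papers_per_year, alpha=0.5)
```

Both forecasts need several years of publication counts before they produce anything useful, which illustrates why such baselines struggle in the first phases of a topic's life.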
Initially, I will aim to integrate a variety of heterogeneous data sources, including scholarly data and social media data, in order to create a comprehensive knowledge base. This knowledge base will make use of an ontology to describe all the relationships between the research elements.
Afterwards, I will focus on analysing the patterns that can lead to the emergence of a new research topic. For example, before Tim Berners-Lee officially named the Semantic Web as a research area, it was already possible to identify that the AI, WWW and KBS communities were sharing their knowledge in this new common area. An interesting fact about scholarly data is that it stores information about papers, so many research elements such as topics, authors, communities, venues and organizations can be inferred from it. All these research elements are inherently interconnected: an author writes papers about certain topics, and an author belongs to a community that is connected to a topic. These relationships can be analysed diachronically to derive the dynamics that can lead to the emergence of new topics, and I can then design a comprehensive model that takes into account all the discovered patterns.
I conducted an initial study aiming to identify the dynamics that may lead to the emergence of a new topic using only scholarly data. To do so, I first combined the keywords network and the semantic topic network available in the Rexplore database. The keywords network, as the name suggests, is a network in which nodes represent the keywords tagged in papers, and the link between two keywords represents the number of papers in which these two keywords co-occur in each year. The semantic topic network is also a network of keywords, but in this case they are connected by semantic relationships (subAreaOf, sameAs and so on) that create a hierarchy of research topics. As a next step, I conducted a diachronic analysis on the portions of this joint network that are related to two different kinds of topics: debutant and non-debutant.
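The keywords network described above can be sketched as a per-year count of keyword pairs that appear together in a paper. This is only an illustration under assumed inputs: the `papers` records and the `build_keyword_network` helper are hypothetical and do not reflect the actual Rexplore schema.

```python
from collections import Counter, defaultdict
from itertools import combinations

# Hypothetical records: (publication year, keywords tagged on the paper).
papers = [
    (2000, ["artificial intelligence", "www"]),
    (2000, ["www", "knowledge-based systems", "artificial intelligence"]),
    (2001, ["artificial intelligence", "www"]),
]


def build_keyword_network(papers):
    """Return {year: Counter({(keyword_a, keyword_b): number_of_papers})}.

    Each pair is stored in alphabetical order so that (a, b) and (b, a)
    are counted as the same undirected link.
    """
    network = defaultdict(Counter)
    for year, keywords in papers:
        for pair in combinations(sorted(set(keywords)), 2):
            network[year][pair] += 1
    return network


network = build_keyword_network(papers)
```

Keeping one counter per year makes the diachronic analysis straightforward, because the weight of any link can be read off as a time series.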
As a result, I found that the pace of collaboration between topics is higher in the portions of the network related to the debutant group of topics than in those related to the non-debutant group.
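The talk does not spell out how the pace of collaboration is computed. One possible reading, shown here purely as an assumption, is the slope of a least-squares line fitted to the yearly co-occurrence counts of a pair of topics: a positive slope means the two topics appear together more and more often. The helper name and the sample counts are hypothetical.

```python
def least_squares_slope(years, values):
    """Slope of the ordinary least-squares line through (year, value) points."""
    n = len(years)
    if n < 2 or n != len(values):
        raise ValueError("need at least two points and equal-length inputs")
    mean_x = sum(years) / n
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in years)
    if sxx == 0:
        raise ValueError("years must not all be identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, values))
    return sxy / sxx


# Hypothetical yearly co-occurrence counts for two pairs of topics.
years = [1996, 1997, 1998, 1999, 2000]
growing_pair = [1, 2, 4, 7, 11]   # collaboration intensifies over time
stable_pair = [3, 3, 3, 3, 3]     # collaboration stays flat

growing_pace = least_squares_slope(years, growing_pair)
stable_pace = least_squares_slope(years, stable_pair)
```

Under this assumed definition, a group whose pace values cluster around zero shows no overall change, while a group shifted towards positive values shows increasing collaboration.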
In this picture we can see two different distributions of the pace of collaboration of topics over time. The green line is for the topics belonging to the non-debutant group, while the blue line is for the topics belonging to the debutant group. We can see that the distribution of the pace of collaboration for the non-debutant group is centred on zero, which means that overall this group does not show any increase in collaboration, while for the debutant group the distribution is shifted towards positive values, showing that in this case the pace of collaboration is increasing. Moreover, applying Student's t-test to the two distributions allows us to reject the null hypothesis, that is, the hypothesis that there is no relationship between the two measured phenomena.
For this reason, I believe that this finding can be applied to understanding the emergence of new topics on the basis of established ones.
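As an illustration of the test mentioned above, the following sketch computes the two-sample Student's t statistic with pooled variance, using only the standard library. The sample values are made up for the example and are not the data from this study. Turning the statistic into a p-value requires the t distribution with the returned degrees of freedom, which a library routine such as `scipy.stats.ttest_ind` can provide.

```python
from math import sqrt
from statistics import mean, variance


def student_t_statistic(sample_a, sample_b):
    """Two-sample Student's t statistic, assuming equal variances.

    Returns the t statistic and the degrees of freedom.
    """
    n_a, n_b = len(sample_a), len(sample_b)
    if n_a < 2 or n_b < 2:
        raise ValueError("each sample needs at least two observations")
    dof = n_a + n_b - 2
    # Pooled estimate of the common variance of the two groups.
    pooled = ((n_a - 1) * variance(sample_a)
              + (n_b - 1) * variance(sample_b)) / dof
    if pooled == 0:
        raise ValueError("pooled variance is zero; t statistic is undefined")
    t_stat = (mean(sample_a) - mean(sample_b)) / sqrt(pooled * (1 / n_a + 1 / n_b))
    return t_stat, dof


# Hypothetical pace-of-collaboration values (illustrative only).
debutant_pace = [1.0, 2.0, 3.0]
non_debutant_pace = [-1.0, 0.0, 1.0]

t_stat, dof = student_t_statistic(debutant_pace, non_debutant_pace)
```

A large positive statistic indicates that the mean pace of the first group is well above that of the second, relative to the spread within the groups.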
As preliminary results, I joined the Keywords Network, a co-occurrence graph whose nodes represent topics and whose links represent the number of co-occurrences between them, with the Semantic Topic Network, a taxonomy of topics connected by semantic relationships extracted by Klink. I conducted a diachronic analysis on some portions of this joined graph to check whether the creation of novel topics is actually correlated with an increase in the pace of collaboration of already existing ones. These portions of the graph were related to two different groups of topics: debutant and non-debutant.
I plan to evaluate my work from both a quantitative and a qualitative perspective. From the quantitative point of view, I will use historical data to estimate statistical measures such as precision, recall, F-measure and so on. From the qualitative perspective, I intend to collect informal feedback about future trends from domain experts, such as senior editors and publishers at Springer.
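The quantitative measures named above can be computed by comparing the set of topics a system flags as emerging with the set of topics that historically did emerge. The sketch below assumes both are available as plain sets; the function name and the placeholder topic labels are hypothetical.

```python
def precision_recall_f1(predicted, actual):
    """Precision, recall and F1 of a predicted set against a reference set."""
    predicted, actual = set(predicted), set(actual)
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Placeholder labels: topics flagged by the system vs. topics that emerged.
flagged = {"topic_a", "topic_b", "topic_c", "topic_d"}
emerged = {"topic_a", "topic_b", "topic_c", "topic_e", "topic_f", "topic_g"}

precision, recall, f1 = precision_recall_f1(flagged, emerged)
```

In a retrospective setting, the system would be run on data up to a given year and the reference set would be built from the topics that appeared in the following years.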
The initial experiments provided promising results, which also confirm the initial hypotheses about the emergence of new topics. In addition, the adoption of semantic technologies such as the semantic topic network has been beneficial in improving these results.
As a next step, I aim to analyse the dynamics of other research elements, such as authors, communities and venues, that can lead to the emergence of new research topics, and also to integrate entities from social media such as tweets and blog posts.
Early Detection and Forecasting of Research Trends
Angelo Antonio Salatino
Prof. Enrico Motta
Dr. Francesco Osborne
ISWC 2015 – Doctoral Consortium
• Researchers: following the evolution of the research environment
• Academic publishers: promoting up-to-date and interesting content
• Companies: early intelligence on potentially important research trends to remain at the forefront of innovation
• Funding bodies: improved understanding of the research landscape
State of the art: Trend detection
• Topic evolution using bibliometric analysis:
  – Content analysis
    • Topic extraction
    • Main terms in documents
  – Citation analysis
  – Main limitation: cannot detect new trends early enough in the lifecycle
[Wu et al. 2011, Bolelli et al. 2009, He et al. 2009]
State of the art: Forecasting impact
• Impact based on number of publications and authors associated with topics
• Approaches based on exponential smoothing, simple moving average and machine learning algorithms
– These approaches don't work at the embryonic and early stages
– They only use a limited set of data sources
[Budi et al. 2012, Jun et al. 2010, Tseng et al. 2009]
Wider range of data sources: comprehensive knowledge base integrating both scholarly data and social media
– For example, before the Semantic Web emerged explicitly as a research area we could identify new interesting dynamics involving authors from different research areas such as knowledge representation, agent systems, hypertext and databases.
– Creation of a model that takes into account all the discovered patterns, which may involve different entities (e.g., authors, venues, topics, communities)
Focus on discovering patterns emerging from the
• Goal: To identify the dynamics that may indicate the emergence of a new topic
– Integration of Keywords network and Semantic topics network (Klink-2, Osborne et al. @ ISWC
– Analysis of the evolution in time of sub-networks that will generate new topics vs. a control group of established topics.
• Debutant group (new topics)
• Non-debutant group (established topics)
• My analysis indicates that for Debutant Topics there is an intense activity between the most co-occurring keywords, which would normally be established topics
• My hypothesis is that I can use this understanding for the early detection of new topics on the basis of the activity of established topics
Student’s t-test on the two distributions:
• p-value = 2.81 × 10^-83
• null hypothesis can be rejected
• Quantitative: retrospective analysis and detection of historical trends
• Qualitative: informal feedback from domain experts, including senior editors and publishers at Springer, on the system suggestions for future trends
• So far, my initial experiments provided promising results which confirm the initial hypotheses
• The adoption of semantic technologies has been beneficial to improve these results
• Analyse dynamics in other networks (e.g., authors, communities and venues)
• Integration of social media data