A Framework to Automatically Extract Funding Information from Text
1. A Framework to Automatically Extract Funding Information from Text
Deep Kayal, Zubair Afzal, George Tsatsaronis et al.
Content and Innovation Group, Elsevier B.V., Amsterdam, NL.
d.kayal@elsevier.com
16 September, 2018
Deep Kayal, Zubair Afzal, George Tsatsaronis et al. (Elsevier B.V.) FundingFinder 16 September, 2018 1 / 25
2. Overview
1 Motivation and Problem Definition
2 Background
3 Methodology
4 Experiments and Results
5 Conclusions
3. Section 1
Motivation and Problem Definition
4. Motivation
Institutions and researchers are usually required to acknowledge their
funding sources and grants.
This information, if captured effectively, enables funding
organizations to demonstrate the impact of their allocated research funds.
It also helps researchers discover funding opportunities that match
their interests.
In this work, we address the problem of automating the
extraction of funding information from text, using natural
language processing and machine learning techniques.
5. Problem Statement
Can we automatically detect the funder information from scientific
papers?
Support for the Nurses’ Health Study and the Health Professionals
Follow-up Study was provided by grants (P01 CA87969 and UM1
CA167552, respectively) from the NCI. Support for the Women’s Health
Initiative program is provided by contracts (N01WH22110, N01WH24152,
N01WH3210032102 and N01WH32105) from the National Heart, Lung,
and Blood Institute.
Can we mark them with entities of the form Funding Body and Grant
Number?
7. Problem Definition
Given a scientific article as raw text input, we design a system to
perform two tasks:
1 identify all text segments which contain funding information.
2 process all the funding text segments in order to detect the set of the
funding bodies (FB) and the set of grants (GR) that appear in the text.
The former is a binary text classification task, while the latter can
be seen as a named entity recognition (NER) problem.
8. NER and Sequential Learning
NER extracts information, known as named entities, from
unstructured text; for example, the names of persons, locations and
organizations.
In the literature, NER systems employ rule-based, gazetteer-based
and machine learning approaches [Nadeau, 2007].
Sequential learning approaches are machine learning models that
leverage the relationships between nearby data points and their class
labels.
Hidden Markov Models [Zhou, 2002], Linear CRFs [McCallum, 2003]
and Maximum Entropy Models [Chieu, 2002] are popular ways of
modeling data for NER.
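As a concrete illustration of sequential decoding, the sketch below is a minimal Viterbi decoder for a toy HMM tagger. All states, transition and emission probabilities are invented for the example; they are not taken from any of the models above.

```python
# Toy Viterbi decoding for an HMM sequence tagger: the best label for each
# token depends on the labels of its neighbours, not just the token itself.
def viterbi(obs, states, start_p, trans_p, emit_p):
    # V[t][s] = (best probability of any path ending in state s at step t, that path)
    V = [{s: (start_p[s] * emit_p[s].get(obs[0], 1e-6), [s]) for s in states}]
    for o in obs[1:]:
        layer = {}
        for s in states:
            prob, path = max(
                (V[-1][prev][0] * trans_p[prev][s] * emit_p[s].get(o, 1e-6),
                 V[-1][prev][1] + [s])
                for prev in states)
            layer[s] = (prob, path)
        V.append(layer)
    return max(V[-1].values())[1]

states = ["O", "FB"]  # outside vs funding-body token (invented tag set)
start_p = {"O": 0.8, "FB": 0.2}
trans_p = {"O": {"O": 0.7, "FB": 0.3}, "FB": {"O": 0.4, "FB": 0.6}}
emit_p = {"O": {"funded": 0.5, "by": 0.5},
          "FB": {"National": 0.5, "Institute": 0.5}}

tags = viterbi(["funded", "by", "National", "Institute"],
               states, start_p, trans_p, emit_p)
print(tags)  # -> ['O', 'O', 'FB', 'FB']
```

The self-transition FB→FB being likely (0.6) is what lets the model keep "Institute" inside the entity once "National" has opened it.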
9. Implementations and Toolkits
The Stanford CoreNLP toolkit1 is a Java-based toolkit with a CRF
implementation, enhanced with long-distance features. An important
aspect of the toolkit is its ability to use distributional similarity
measures.
LingPipe2 is another NLP toolkit, whose efficient HMM
implementation includes n-gram features.
In this work, we also use the Apache OpenNLP toolkit3, which has a
MaxEnt implementation for NER.
Finally, this work also makes use of Elsevier's Fingerprint Engine
(FPE)4, which is an industrial solution for annotating text with
ontological concepts, given a vocabulary.
1 http://stanfordnlp.github.io/CoreNLP/
2 http://alias-i.com/lingpipe/demos/tutorial/read-me.html
3 https://opennlp.apache.org/
4 https://www.elsevier.com/solutions/elsevier-fingerprint-engine
11. Design Choice
As mentioned earlier, we use a two-stage approach to extract funding
information from text.
This design has the following benefits:
1 it minimizes the execution time of the approach, since the costliest
component, namely NER, runs only on the segments identified as
containing funding information.
2 it reduces the number of false positives, as there are many text
segments in a scientific full text article that contain strings which a
NER component could potentially annotate falsely.
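The two-stage design can be sketched as follows. The keyword filter and the regex annotator are hypothetical stand-ins for the trained SVM and NER models; only the control flow (cheap filter first, costly annotation second) reflects the actual design.

```python
import re

def is_funding_segment(segment):
    """Stage 1 stand-in: a keyword filter in place of the trained SVM."""
    return re.search(r"\b(grants?|funded|support(ed)?)\b", segment, re.I) is not None

def annotate_grants(segment):
    """Stage 2 stand-in: a toy pattern in place of the trained NER models."""
    return re.findall(r"\b[A-Z]{1,3}\d{2,}[A-Z0-9]*\b", segment)

def extract_funding(segments):
    hits = []
    for seg in segments:
        if is_funding_segment(seg):            # cheap filter runs on every segment
            hits.extend(annotate_grants(seg))  # costly annotation only on survivors
    return hits

found = extract_funding([
    "We thank our colleagues for helpful discussions.",
    "This work was supported by grant P01 CA87969 from the NCI.",
])
print(found)  # -> ['P01', 'CA87969']
```

The first segment never reaches stage 2, which is exactly the execution-time and false-positive benefit the design targets.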
12. Data Collection
Silver Set:
randomly sample articles from the last 10 years from the ScienceDirect5
database and keep only their acknowledgment sections.
using the Fingerprint Engine (FPE) and Crossref's open funder
registry6, annotate FBs in these acknowledgment sections.
at the end of this step, 44,660 sections with at least one annotated
FB were retained.
Gold Set:
journal articles were picked randomly from a large number of
publications, annotated by three different experts and harmonized.
1,682 articles, out of around 2,000, contained at least one
funding-related annotation, resulting in 4,537 FB and 3,156 GR
annotations in the set.
pair-wise averaged Cohen's kappa was used to measure inter-annotator
agreement and assess dataset quality; it was found to be 0.89,
suggesting high quality.
5 http://www.sciencedirect.com/
6 http://www.crossref.org/fundingdata/registry.html
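The agreement metric used above can be sketched in a few lines of Python; the two per-article annotation sequences below are invented for illustration.

```python
# Cohen's kappa: observed agreement corrected for the agreement two
# annotators would reach by chance, given their label frequencies.
from collections import Counter

def cohens_kappa(a, b):
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n       # observed agreement
    ca, cb = Counter(a), Counter(b)
    p_e = sum(ca[l] / n * cb[l] / n                   # chance agreement
              for l in set(a) | set(b))
    return (p_o - p_e) / (1 - p_e)

# Invented toy labels: did each article contain funding information?
ann1 = ["funding", "funding", "none", "none", "funding", "none"]
ann2 = ["funding", "funding", "none", "funding", "funding", "none"]
print(round(cohens_kappa(ann1, ann2), 2))  # -> 0.67
```

For three annotators, as in the Gold Set, kappa is computed for each of the three pairs and the results are averaged.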
13. Data Usage
The Silver Set was used to learn word clusters for the distributional
similarity measure that can be employed within the Stanford CoreNLP
toolkit
this was done by generating word embeddings from this dataset with the
Word2Vec algorithm [Mikolov, 2013],
followed by K-means clustering using cosine similarity.
Additionally, it was also used to train models to detect FB
annotations.
The Gold Set was used to train the binary text classifier that detects
the paragraphs of text which contain funding information.
It was also used to train models to detect FB and GR annotations.
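The cluster-learning step can be sketched as spherical K-means: after normalising vectors to unit length, the dot product equals cosine similarity, so standard K-means assignment becomes cosine-based. The two-dimensional word vectors below are invented stand-ins for real Word2Vec embeddings.

```python
import math

def normalise(v):
    n = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / n for x in v]

def kmeans_cosine(vectors, k, iters=20):
    # On unit vectors, cosine similarity is just the dot product.
    vecs = [normalise(v) for v in vectors]
    centroids = vecs[:k]  # deterministic initialisation for this sketch
    def nearest(v):
        return max(range(k),
                   key=lambda i: sum(a * b for a, b in zip(v, centroids[i])))
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for v in vecs:
            clusters[nearest(v)].append(v)
        # New centroid = renormalised mean of the cluster's unit vectors.
        centroids = [normalise([sum(col) for col in zip(*c)]) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return [nearest(v) for v in vecs]

words = ["nih", "nsf", "grant", "cell", "protein", "gene"]
toy_vectors = [[1.0, 0.1], [0.9, 0.2], [0.8, 0.1],
               [0.1, 1.0], [0.2, 0.9], [0.1, 0.8]]
labels = kmeans_cosine(toy_vectors, k=2)
print(dict(zip(words, labels)))  # funder-like vs biology-like words separate
```

The resulting cluster IDs are what the CRF consumes as distributional similarity features: two words in the same cluster share a feature value even if neither appeared in the training data with a label.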
14. Detecting Text Blocks with Funding Information
As the first step, the text segments which contain funding information
are to be separated from the rest (Binary Classification).
To address this problem, we use a cost-sensitive L2-regularized
linear Support Vector Machine (SVM), as SVMs are known to perform
well on text classification problems.
The SVM operates on TF-IDF vectors extracted from the segments of
each input text, based on a bigram bag-of-words representation.
The SVM was trained on the positive (1,682) and negative (47,565)
segments, i.e., paragraphs with and without funding information,
drawn from the Gold Set.
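The feature extraction can be sketched as follows; this is a minimal unigram-plus-bigram TF-IDF, not the exact weighting used in the work. In practice a library routine such as scikit-learn's TfidfVectorizer with ngram_range=(1, 2) would typically do this.

```python
import math
from collections import Counter

def unigrams_and_bigrams(text):
    toks = text.lower().split()
    return toks + [" ".join(toks[i:i + 2]) for i in range(len(toks) - 1)]

def tfidf(docs):
    counts = [Counter(unigrams_and_bigrams(d)) for d in docs]
    df = Counter(term for c in counts for term in c)  # document frequency
    n = len(docs)
    return [{term: (tf / sum(c.values())) * math.log(n / df[term])
             for term, tf in c.items()} for c in counts]

docs = ["supported by grant funding", "methods and materials section"]
vecs = tfidf(docs)
# "supported by" occurs only in the funding-like segment, so it carries
# a positive weight there and is absent from the other vector.
print(vecs[0]["supported by"] > 0)
```

Cost-sensitivity matters because the classes are heavily imbalanced (1,682 positives against 47,565 negatives): weighting errors on the minority class more strongly keeps recall from collapsing.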
15. Extracting Funding Information using NER
In order to annotate a piece of text with the FB label, a variety of
models were used:
1 pre-trained models packaged as part of the Stanford CoreNLP and
LingPipe suites; in this work they were used to identify the
Organization labels in the text, which were then stored as FB.
2 Stanford CRF, LingPipe HMM and OpenNLP MaxEnt models trained
on the Silver and Gold sets.
3 Stanford CRF classifiers using distributional similarity features based
on the word clusters created from the Silver Set data.
As for GR labels:
1 we use a rule-based approach, treating every word inside the funding
section that contains at least one digit as a grant ID.
2 we train all of the aforementioned models based on the labeled data in
the Gold Set.
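The rule-based baseline for grant IDs is simple enough to sketch directly; the example segment is adapted from the problem-statement slide.

```python
import string

def rule_based_grants(funding_text):
    # Baseline rule from the slide: inside a funding segment, any token
    # containing a digit is treated as a grant ID.
    ids = []
    for tok in funding_text.split():
        tok = tok.strip(string.punctuation)
        if any(ch.isdigit() for ch in tok):
            ids.append(tok)
    return ids

segment = ("Support was provided by grants (P01 CA87969 and UM1 CA167552) "
           "from the NCI.")
print(rule_based_grants(segment))  # -> ['P01', 'CA87969', 'UM1', 'CA167552']
```

This rule explains the baseline's high recall but weaker precision in the results: it catches almost every grant ID, but any other digit-bearing token in the segment is a false positive.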
16. Ensembling
Figure: An example of the ensemble approach for extracting funding information
from text.
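One straightforward way to combine base annotators is a per-token majority vote, sketched below; the actual FundingFinder ensembling scheme may differ, and the label sequences are invented for illustration.

```python
# Majority vote over per-token labels from several base annotators.
from collections import Counter

def ensemble_vote(predictions):
    """predictions: one label sequence per base annotator, all aligned
    to the same tokens."""
    voted = []
    for token_labels in zip(*predictions):
        label, _ = Counter(token_labels).most_common(1)[0]
        voted.append(label)
    return voted

# Hypothetical outputs of three base annotators on four tokens.
crf    = ["FB", "FB", "O", "GR"]
hmm    = ["FB", "O",  "O", "GR"]
maxent = ["FB", "FB", "O", "O"]
print(ensemble_vote([crf, hmm, maxent]))  # -> ['FB', 'FB', 'O', 'GR']
```

Each base annotator makes one mistake in this toy example, but no mistake is shared, so the vote recovers the correct sequence; this is the intuition behind ensembling diverse annotators.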
17. Overall Pipeline
Figure: Schematic showing the overall pipeline.
18. Section 4
Experiments and Results
19. Detection of Text Blocks with Funding Information
Method P R F1
SVM 99 5 9
Cost-sensitive L2-SVM (C=2) 95 85 90
Table: Results for the identification of text with funding information using SVMs.
20. Extraction of Funding Organization names
Method P R F1
HMM-Pre 18(±0) 31(±0) 23(±0)
CRF-Pre 35(±0) 54(±0) 42(±0)
FPE 48(±0) 46(±0) 47(±0)
CRF-S 49(±0) 43(±0) 46(±0)
HMM-S 36(±0) 48(±0) 41(±0)
MaxEnt-S 50(±0) 39(±0) 44(±0)
CRF-G 64(±.2) 58(±.2) 61(±.2)
CRF-dsim-G 66(±.2) 61(±.3) 63(±.2)
HMM-G 49(±.3) 54(±.2) 52(±.2)
MaxEnt-G 64(±.4) 54(±.2) 59(±.3)
FundingFinder 72(±.3) 63(±.2) 68(±.3)
Table: NER Results for Funding Body (FB) annotation label. Best performing
model is highlighted in bold while the second best is in italics.
21. Extraction of Grant IDs
Method P R F1
Rule-based 78(±0) 89(±0) 83(±0)
CRF-G 91(±.1) 91(±.08) 91(±.1)
HMM-G 76(±.2) 77(±.2) 76(±.2)
MaxEnt-G 87(±.2) 89(±.1) 88(±.2)
FundingFinder 92(±.1) 91(±.1) 92(±.1)
Table: NER Results for Grant (GR) annotation label. Best performing model is
highlighted in bold while the second best is in italics.
23. Conclusions and Contributions
1 We have discussed the practically important problem of extracting
funding information from text, and have experimentally provided an
overview of the state-of-the-art methods that can be applied to it.
This may give a significant head start to researchers pursuing
further work on the same problem.
2 Empirically, we have shown that a small and high quality dataset is
more suitable for this NER task than a larger, but noisier, dataset.
3 We have suggested an efficient two-stage pipeline for the task of
funding information extraction.
4 A learning mechanism, based on an ensemble of state-of-the-art base
annotators, was suggested, which should be easily extensible to any
NER task.
24. References
Nadeau, D., Sekine, S. (2007)
A survey of named entity recognition and classification
Linguisticae Investigationes 30(1), 3 – 26.
Chieu, H.L. (2002)
Named entity recognition: a maximum entropy approach using global information
Proceedings of the 2002 International Conference on Computational Linguistics 190 – 196.
McCallum, A., Li, W. (2003)
Early results for named entity recognition with conditional random fields, feature induction and web-enhanced lexicons
Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 4, 188 – 191.
Zhou, G., Su, J. (2002)
Named entity recognition using an HMM-based chunk tagger
Proceedings of the 40th Annual Meeting on Association for Computational Linguistics 473 – 480.
Mikolov, T., Sutskever, I., Chen, K., Corrado, G., Dean, J. (2013)
Distributed representations of words and phrases and their compositionality
Proceedings of the 26th International Conference on Neural Information Processing Systems 3111 – 3119.
25. Thank You!
Please email me at d.kayal@elsevier.com for
critiques, comments, advice, dataset inquiries, etc.