A Framework to Automatically Extract Funding Information from Text
1. A Framework to Automatically Extract Funding Information from Text
Deep Kayal, Zubair Afzal, George Tsatsaronis et al.
Content and Innovation Group, Elsevier B.V., Amsterdam, NL.
d.kayal@elsevier.com
16 September, 2018
Deep Kayal, Zubair Afzal, George Tsatsaronis et al. (Elsevier B.V.) FundingFinder 16 September, 2018 1 / 25
2. Overview
1 Motivation and Problem Definition
2 Background
3 Methodology
4 Experiments and Results
5 Conclusions
3. Section 1
Motivation and Problem Definition
4. Motivation
Institutions and researchers are usually required to acknowledge their
funding sources and grants.
This information, if captured effectively, enables funding
organizations to demonstrate the impact of their allocated research funds.
It also helps researchers discover funding opportunities that match
their interests.
In this work, we address the problem of automating the
extraction of funding information from text, using natural
language processing and machine learning techniques.
5. Problem Statement
Can we automatically detect the funder information from scientific
papers?
Support for the Nurses’ Health Study and the Health Professionals
Follow-up Study was provided by grants (P01 CA87969 and UM1
CA167552, respectively) from the NCI. Support for the Women’s Health
Initiative program is provided by contracts (N01WH22110, N01WH24152,
N01WH3210032102 and N01WH32105) from the National Heart, Lung,
and Blood Institute.
Can we mark them with entities of the form Funding Body and Grant
Number?
7. Problem Definition
Given a scientific article as raw text input, we design a system to
perform two tasks:
1 identify all text segments which contain funding information.
2 process all the funding text segments in order to detect the set of the
funding bodies (FB) and the set of grants (GR) that appear in the text.
The former is a binary text classification task, while the latter can
be seen as a named entity recognition (NER) problem.
8. NER and Sequential Learning
NER extracts information, known as named entities, from
unstructured text; for example, the names of persons, locations and
organizations.
In the literature, NER systems employ rule-based, gazetteer-based
and machine learning approaches [Nadeau, 2007].
Sequential learning approaches are machine learning models that
leverage the relationships between nearby data points and their class
labels.
Hidden Markov Models [Zhou, 2002], Linear CRFs [McCallum, 2003]
and Maximum Entropy Models [Chieu, 2002] are popular ways of
modeling data for NER.
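As a concrete illustration of sequential decoding, the sketch below is a minimal Viterbi decoder for a toy HMM tagger. All states, transition and emission probabilities are invented for the example; they are not taken from any of the models above.

```python
# Toy Viterbi decoding for an HMM sequence tagger: the best label for each
# token depends on the labels of its neighbours, not just the token itself.
def viterbi(obs, states, start_p, trans_p, emit_p):
    # V[t][s] = (best probability of any path ending in state s at step t, that path)
    V = [{s: (start_p[s] * emit_p[s].get(obs[0], 1e-6), [s]) for s in states}]
    for o in obs[1:]:
        layer = {}
        for s in states:
            prob, path = max(
                (V[-1][prev][0] * trans_p[prev][s] * emit_p[s].get(o, 1e-6),
                 V[-1][prev][1] + [s])
                for prev in states)
            layer[s] = (prob, path)
        V.append(layer)
    return max(V[-1].values())[1]

states = ["O", "FB"]  # outside vs funding-body token (invented tag set)
start_p = {"O": 0.8, "FB": 0.2}
trans_p = {"O": {"O": 0.7, "FB": 0.3}, "FB": {"O": 0.4, "FB": 0.6}}
emit_p = {"O": {"funded": 0.5, "by": 0.5},
          "FB": {"National": 0.5, "Institute": 0.5}}

tags = viterbi(["funded", "by", "National", "Institute"],
               states, start_p, trans_p, emit_p)
print(tags)  # -> ['O', 'O', 'FB', 'FB']
```

The self-transition FB→FB being likely (0.6) is what lets the model keep "Institute" inside the entity once "National" has opened it.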
9. Implementations and Toolkits
The Stanford CoreNLP toolkit1 is a Java-based toolkit with a CRF
implementation, enhanced with long-distance features. An important
aspect of the toolkit is its ability to use distributional similarity
measures.
LingPipe2 is another NLP toolkit, whose efficient HMM
implementation includes n-gram features.
In this work, we also use the Apache OpenNLP toolkit3, which has a
MaxEnt implementation for NER.
Finally, this work also makes use of Elsevier's Fingerprint Engine
(FPE)4, which is an industrial solution for annotating text with
ontological concepts, given a vocabulary.
1 http://stanfordnlp.github.io/CoreNLP/
2 http://alias-i.com/lingpipe/demos/tutorial/read-me.html
3 https://opennlp.apache.org/
4 https://www.elsevier.com/solutions/elsevier-fingerprint-engine
11. Design Choice
As mentioned earlier, we use a two-stage approach to extract funding
information from text.
This design has the following benefits:
1 it minimizes the execution time of the approach, since the costliest
component, namely NER, runs only on the segments identified as
containing funding information.
2 it reduces the number of false positives, as there are many text
segments in a scientific full text article that contain strings which a
NER component could potentially annotate falsely.
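The two-stage design can be sketched as follows. The keyword filter and the regex annotator are hypothetical stand-ins for the trained SVM and NER models; only the control flow (cheap filter first, costly annotation second) reflects the actual design.

```python
import re

def is_funding_segment(segment):
    """Stage 1 stand-in: a keyword filter in place of the trained SVM."""
    return re.search(r"\b(grants?|funded|support(ed)?)\b", segment, re.I) is not None

def annotate_grants(segment):
    """Stage 2 stand-in: a toy pattern in place of the trained NER models."""
    return re.findall(r"\b[A-Z]{1,3}\d{2,}[A-Z0-9]*\b", segment)

def extract_funding(segments):
    hits = []
    for seg in segments:
        if is_funding_segment(seg):            # cheap filter runs on every segment
            hits.extend(annotate_grants(seg))  # costly annotation only on survivors
    return hits

found = extract_funding([
    "We thank our colleagues for helpful discussions.",
    "This work was supported by grant P01 CA87969 from the NCI.",
])
print(found)  # -> ['P01', 'CA87969']
```

The first segment never reaches stage 2, which is exactly the execution-time and false-positive benefit the design targets.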
12. Data Collection
Silver Set:
randomly sample articles from the last 10 years from the ScienceDirect5
database and keep only their acknowledgment sections.
using the Fingerprint Engine (FPE) and Crossref's open funder
registry6, annotate FBs in these acknowledgment sections.
at the end of this step, 44,660 sections with at least one annotated
FB were retained.
Gold Set:
journal articles were picked randomly from a large number of
publications, annotated by three different experts and harmonized.
1,682 articles, out of around 2,000, contained at least one
funding-related annotation, resulting in 4,537 FB and 3,156 GR
annotations in the set.
pair-wise averaged Cohen's kappa was used to measure inter-annotator
agreement and assess dataset quality; it was found to be 0.89,
suggesting high quality.
5 http://www.sciencedirect.com/
6 http://www.crossref.org/fundingdata/registry.html
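The agreement metric used above can be sketched in a few lines of Python; the two per-article annotation sequences below are invented for illustration.

```python
# Cohen's kappa: observed agreement corrected for the agreement two
# annotators would reach by chance, given their label frequencies.
from collections import Counter

def cohens_kappa(a, b):
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n       # observed agreement
    ca, cb = Counter(a), Counter(b)
    p_e = sum(ca[l] / n * cb[l] / n                   # chance agreement
              for l in set(a) | set(b))
    return (p_o - p_e) / (1 - p_e)

# Invented toy labels: did each article contain funding information?
ann1 = ["funding", "funding", "none", "none", "funding", "none"]
ann2 = ["funding", "funding", "none", "funding", "funding", "none"]
print(round(cohens_kappa(ann1, ann2), 2))  # -> 0.67
```

For three annotators, as in the Gold Set, kappa is computed for each of the three pairs and the results are averaged.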
13. Data Usage
The Silver Set was used to learn word clusters for the distributional
similarity measure that can be employed within the Stanford CoreNLP
toolkit
this was done by generating word embeddings from this dataset with the
Word2Vec algorithm [Mikolov, 2013],
followed by K-means clustering using cosine similarity.
Additionally, it was also used to train models to detect FB
annotations.
The Gold Set was used to train the binary text classifier that detects
the paragraphs of text which contain funding information.
It was also used to train models to detect FB and GR annotations.
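The cluster-learning step can be sketched as spherical K-means: after normalising vectors to unit length, the dot product equals cosine similarity, so standard K-means assignment becomes cosine-based. The two-dimensional word vectors below are invented stand-ins for real Word2Vec embeddings.

```python
import math

def normalise(v):
    n = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / n for x in v]

def kmeans_cosine(vectors, k, iters=20):
    # On unit vectors, cosine similarity is just the dot product.
    vecs = [normalise(v) for v in vectors]
    centroids = vecs[:k]  # deterministic initialisation for this sketch
    def nearest(v):
        return max(range(k),
                   key=lambda i: sum(a * b for a, b in zip(v, centroids[i])))
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for v in vecs:
            clusters[nearest(v)].append(v)
        # New centroid = renormalised mean of the cluster's unit vectors.
        centroids = [normalise([sum(col) for col in zip(*c)]) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return [nearest(v) for v in vecs]

words = ["nih", "nsf", "grant", "cell", "protein", "gene"]
toy_vectors = [[1.0, 0.1], [0.9, 0.2], [0.8, 0.1],
               [0.1, 1.0], [0.2, 0.9], [0.1, 0.8]]
labels = kmeans_cosine(toy_vectors, k=2)
print(dict(zip(words, labels)))  # funder-like vs biology-like words separate
```

The resulting cluster IDs are what the CRF consumes as distributional similarity features: two words in the same cluster share a feature value even if neither appeared in the training data with a label.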
14. Detecting Text Blocks with Funding Information
As the first step, the text segments which contain funding information
are to be separated from the rest (Binary Classification).
To address this problem, we use a cost-sensitive L2-regularized
linear Support Vector Machine (SVM), as SVMs are known to perform
well on text classification problems.
The SVM operates on TF-IDF vectors extracted from the segments of
each input text, based on a bigram bag-of-words representation.
The SVM was trained on the positive (1,682) and negative (47,565)
segments, i.e., paragraphs with and without funding information,
drawn from the Gold Set.
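The feature extraction can be sketched as follows; this is a minimal unigram-plus-bigram TF-IDF, not the exact weighting used in the work. In practice a library routine such as scikit-learn's TfidfVectorizer with ngram_range=(1, 2) would typically do this.

```python
import math
from collections import Counter

def unigrams_and_bigrams(text):
    toks = text.lower().split()
    return toks + [" ".join(toks[i:i + 2]) for i in range(len(toks) - 1)]

def tfidf(docs):
    counts = [Counter(unigrams_and_bigrams(d)) for d in docs]
    df = Counter(term for c in counts for term in c)  # document frequency
    n = len(docs)
    return [{term: (tf / sum(c.values())) * math.log(n / df[term])
             for term, tf in c.items()} for c in counts]

docs = ["supported by grant funding", "methods and materials section"]
vecs = tfidf(docs)
# "supported by" occurs only in the funding-like segment, so it carries
# a positive weight there and is absent from the other vector.
print(vecs[0]["supported by"] > 0)
```

Cost-sensitivity matters because the classes are heavily imbalanced (1,682 positives against 47,565 negatives): weighting errors on the minority class more strongly keeps recall from collapsing.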
15. Extracting Funding Information using NER
In order to annotate a piece of text with the FB label, a variety of
models were used:
1 pre-trained models packaged as part of the Stanford CoreNLP and
LingPipe suites; in this work they were used to identify the
Organization labels in the text, which were then stored as FB.
2 Stanford CRF, LingPipe HMM and OpenNLP MaxEnt models trained
on the Silver and Gold sets.
3 Stanford CRF classifiers using distributional similarity features based
on the word clusters created from the Silver Set data.
As for GR labels:
1 we use a rule-based approach, treating every word inside the funding
section that contains at least one digit as a grant ID.
2 we train all of the aforementioned models based on the labeled data in
the Gold Set.
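The rule-based baseline for grant IDs is simple enough to sketch directly; the example segment is adapted from the problem-statement slide.

```python
import string

def rule_based_grants(funding_text):
    # Baseline rule from the slide: inside a funding segment, any token
    # containing a digit is treated as a grant ID.
    ids = []
    for tok in funding_text.split():
        tok = tok.strip(string.punctuation)
        if any(ch.isdigit() for ch in tok):
            ids.append(tok)
    return ids

segment = ("Support was provided by grants (P01 CA87969 and UM1 CA167552) "
           "from the NCI.")
print(rule_based_grants(segment))  # -> ['P01', 'CA87969', 'UM1', 'CA167552']
```

This rule explains the baseline's high recall but weaker precision in the results: it catches almost every grant ID, but any other digit-bearing token in the segment is a false positive.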
16. Ensembling
Figure: An example of the ensemble approach for extracting funding information
from text.
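One straightforward way to combine base annotators is a per-token majority vote, sketched below; the actual FundingFinder ensembling scheme may differ, and the label sequences are invented for illustration.

```python
# Majority vote over per-token labels from several base annotators.
from collections import Counter

def ensemble_vote(predictions):
    """predictions: one label sequence per base annotator, all aligned
    to the same tokens."""
    voted = []
    for token_labels in zip(*predictions):
        label, _ = Counter(token_labels).most_common(1)[0]
        voted.append(label)
    return voted

# Hypothetical outputs of three base annotators on four tokens.
crf    = ["FB", "FB", "O", "GR"]
hmm    = ["FB", "O",  "O", "GR"]
maxent = ["FB", "FB", "O", "O"]
print(ensemble_vote([crf, hmm, maxent]))  # -> ['FB', 'FB', 'O', 'GR']
```

Each base annotator makes one mistake in this toy example, but no mistake is shared, so the vote recovers the correct sequence; this is the intuition behind ensembling diverse annotators.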
17. Overall Pipeline
Figure: Schematic showing the overall pipeline.
18. Section 4
Experiments and Results
19. Detection of Text Blocks with Funding Information
Method P R F1
SVM 99 5 9
Cost-sensitive L2-SVM (C=2) 95 85 90
Table: Results for the identification of text with funding information using SVMs.
20. Extraction of Funding Organization names
Method P R F1
HMM-Pre 18(±0) 31(±0) 23(±0)
CRF-Pre 35(±0) 54(±0) 42(±0)
FPE 48(±0) 46(±0) 47(±0)
CRF-S 49(±0) 43(±0) 46(±0)
HMM-S 36(±0) 48(±0) 41(±0)
MaxEnt-S 50(±0) 39(±0) 44(±0)
CRF-G 64(±.2) 58(±.2) 61(±.2)
CRF-dsim-G 66(±.2) 61(±.3) 63(±.2)
HMM-G 49(±.3) 54(±.2) 52(±.2)
MaxEnt-G 64(±.4) 54(±.2) 59(±.3)
FundingFinder 72(±.3) 63(±.2) 68(±.3)
Table: NER Results for Funding Body (FB) annotation label. Best performing
model is highlighted in bold while the second best is in italics.
21. Extraction of Grant IDs
Method P R F1
Rule-based 78(±0) 89(±0) 83(±0)
CRF-G 91(±.1) 91(±.08) 91(±.1)
HMM-G 76(±.2) 77(±.2) 76(±.2)
MaxEnt-G 87(±.2) 89(±.1) 88(±.2)
FundingFinder 92(±.1) 91(±.1) 92(±.1)
Table: NER Results for Grant (GR) annotation label. Best performing model is
highlighted in bold while the second best is in italics.
23. Conclusions and Contributions
1 We have discussed the practically important problem of extracting
funding information from text, and have experimentally provided an
overview of the state-of-the-art methods that can be applied to it.
This may give a significant head start to researchers pursuing
further work on the same problem.
2 Empirically, we have shown that a small and high quality dataset is
more suitable for this NER task than a larger, but noisier, dataset.
3 We have suggested an efficient two-stage pipeline for the task of
funding information extraction.
4 A learning mechanism, based on an ensemble of state-of-the-art base
annotators, was suggested, which should be easily extensible to any
NER task.
24. References
Nadeau, D., Sekine, S. (2007)
A survey of named entity recognition and classification
Linguisticae Investigationes 30(1), 3 – 26.
Chieu, H.L. (2002)
Named entity recognition: a maximum entropy approach using global information
Proceedings of the 2002 International Conference on Computational Linguistics 190 – 196.
McCallum, A., Li, W. (2003)
Early results for named entity recognition with conditional random fields, feature induction and web-enhanced lexicons
Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 4, 188 – 191.
Zhou, G., Su, J. (2002)
Named entity recognition using an HMM-based chunk tagger
Proceedings of the 40th Annual Meeting on Association for Computational Linguistics 473 – 480.
Mikolov, T., Sutskever, I., Chen, K., Corrado, G., Dean, J. (2013)
Distributed representations of words and phrases and their compositionality
Proceedings of the 26th International Conference on Neural Information Processing Systems 3111 – 3119.
25. Thank You!
Please email me at d.kayal@elsevier.com for
critiques, comments, advice, dataset inquiries, etc.