A Framework to Automatically Extract Funding
Information from Text
Deep Kayal, Zubair Afzal, George Tsatsaronis et al.
Content and Innovation Group, Elsevier B.V., Amsterdam, NL.
d.kayal@elsevier.com
16 September, 2018
Deep Kayal, Zubair Afzal, George Tsatsaronis et al. (Elsevier B.V.)FundingFinder 16 September, 2018 1 / 25
Overview
1 Motivation and Problem Definition
2 Background
3 Methodology
4 Experiments and Results
5 Conclusions
Section 1
Motivation and Problem Definition
Motivation
Usually, institutions and researchers are required to acknowledge the
funding source and grants.
This information, if captured effectively, will enable funding
organizations to justify the impact of their allocated research funds.
It will also help researchers discover funding opportunities that
match their interests.
In this work, we address the problem of automating the
extraction of funding information from text, using natural
language processing and machine learning techniques.
Problem Statement
Can we automatically detect the funder information from scientific
papers?
Support for the Nurses’ Health Study and the Health Professionals
Follow-up Study was provided by grants (P01 CA87969 and UM1
CA167552, respectively) from the NCI. Support for the Women’s Health
Initiative program is provided by contracts (N01WH22110, N01WH24152,
N01WH3210032102 and N01WH32105) from the National Heart, Lung,
and Blood Institute.
Can we mark them with entities of the form Funding Body and Grant
Number?
Section 2
Background
Problem Definition
Given a scientific article as raw text input, we design a system to
perform two tasks:
1 identify all text segments which contain funding information.
2 process all the funding text segments in order to detect the set of the
funding bodies (FB) and the set of grants (GR) that appear in the text.
The former is a binary text classification task, while the latter
can be seen as a named entity recognition (NER) problem.
NER and Sequential Learning
NER extracts information, known as named entities, from
unstructured text; for example, the names of persons, locations and
organizations.
In the literature, NER systems employ rule-based, gazetteer-based
and machine learning approaches [Nadeau, 2007].
Sequential learning approaches are machine learning models that
leverage the relationships between nearby data points and their class
labels.
Hidden Markov Models [Zhou, 2002], Linear CRFs [McCallum, 2003]
and Maximum Entropy Models [Chieu, 2002] are popular ways of
modeling data for NER.
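To make the idea of sequential learning concrete, here is a minimal sketch of Viterbi decoding for a two-state HMM tagger. The states, probabilities and example tokens are invented for illustration; they are not taken from the toolkits or models discussed in these slides.

```python
from typing import Dict, List

Probs = Dict[str, Dict[str, float]]


def viterbi(tokens: List[str], states: List[str], start: Dict[str, float],
            trans: Probs, emit: Probs) -> List[str]:
    """Return the most probable label sequence for `tokens` under an HMM."""
    unseen = 1e-6  # small probability for tokens missing from the emission table
    # best[t][s]: probability of the best label path that ends in state s at position t
    best = [{s: start[s] * emit[s].get(tokens[0], unseen) for s in states}]
    back = [{}]
    for t in range(1, len(tokens)):
        best.append({})
        back.append({})
        for s in states:
            # The label at position t depends on the label chosen at t - 1.
            prev = max(states, key=lambda p: best[t - 1][p] * trans[p][s])
            best[t][s] = best[t - 1][prev] * trans[prev][s] * emit[s].get(tokens[t], unseen)
            back[t][s] = prev
    # Follow the back-pointers from the best final state.
    path = [max(states, key=lambda s: best[-1][s])]
    for t in range(len(tokens) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return path[::-1]


# Toy parameters (assumed values, for illustration only).
states = ["O", "FB"]
start = {"O": 0.9, "FB": 0.1}
trans = {"O": {"O": 0.8, "FB": 0.2}, "FB": {"O": 0.4, "FB": 0.6}}
emit = {
    "O": {"funded": 0.4, "by": 0.5, "NCI": 0.1},
    "FB": {"funded": 0.05, "by": 0.05, "NCI": 0.9},
}
labels = viterbi(["funded", "by", "NCI"], states, start, trans, emit)
print(labels)  # ['O', 'O', 'FB']
```

The transition table is what lets the label of one token influence the label of the next, which is the property that distinguishes sequential models from classifying each token independently.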
Implementations and Toolkits
The Stanford CoreNLP toolkit [1] is a Java-based toolkit that has a
CRF implementation, enhanced with long-distance features. An
important aspect of the toolkit is the ability to use distributional
similarity measures.
LingPipe [2] is another NLP toolkit, whose efficient HMM
implementation includes n-gram features.
In this work, we also use the Apache OpenNLP toolkit [3], which has a
MaxEnt implementation for NER.
Finally, this work also makes use of Elsevier's Fingerprint Engine
(FPE) [4], which is an industrial solution for annotating text with
ontological concepts, given a vocabulary.
[1] http://stanfordnlp.github.io/CoreNLP/
[2] http://alias-i.com/lingpipe/demos/tutorial/read-me.html
[3] https://opennlp.apache.org/
[4] https://www.elsevier.com/solutions/elsevier-fingerprint-engine
Section 3
Methodology
Design Choice
As mentioned earlier, we use a two-stage approach to extract funding
information from text.
This design has the following benefits:
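1 it minimizes the execution time of the approach, as the costliest
component, namely NER, is applied only to the text segments that the
first stage classifies as containing funding information.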
2 it reduces the number of false positives, as there are many text
segments in a scientific full text article that contain strings which a
NER component could potentially annotate falsely.
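The two-stage design can be sketched as follows. This is a minimal illustration, not the authors' implementation: `extract_funding`, `keyword_classifier` and `dummy_ner` are hypothetical names, and the stand-in classifier and annotator are deliberately trivial.

```python
from typing import Callable, Dict, List

Entities = Dict[str, List[str]]  # e.g. {"FB": [...], "GR": [...]}


def extract_funding(
    segments: List[str],
    has_funding: Callable[[str], bool],
    annotate: Callable[[str], Entities],
) -> List[Entities]:
    """Classify every segment, then run NER only on the positive ones."""
    results = []
    for segment in segments:
        if has_funding(segment):               # stage 1: cheap binary classifier
            results.append(annotate(segment))  # stage 2: costly NER
    return results


# Hypothetical stand-ins, for illustration only (not the models from the slides).
ner_calls: List[str] = []


def keyword_classifier(segment: str) -> bool:
    lowered = segment.lower()
    return "grant" in lowered or "funded" in lowered


def dummy_ner(segment: str) -> Entities:
    ner_calls.append(segment)  # record how often the costly stage runs
    grants = [tok for tok in segment.split() if any(c.isdigit() for c in tok)]
    return {"FB": [], "GR": grants}


segments = [
    "We measured 12 samples at 300 K.",
    "This work was funded by grant CA87969 from the NCI.",
]
found = extract_funding(segments, keyword_classifier, dummy_ner)
print(found)           # [{'FB': [], 'GR': ['CA87969']}]
print(len(ner_calls))  # 1
```

In this toy run the first segment never reaches the annotator, so its numbers ("12", "300") cannot be mislabeled as grants, and the costly stage runs once instead of twice.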
Data Collection
Silver Set:
randomly sample articles from the last 10 years of the ScienceDirect [5]
database and select only their acknowledgment sections.
using the Fingerprint Engine (FPE) and Crossref's open funder
registry [6], annotate FBs in these acknowledgment sections.
at the end of this step, 44,660 sections with at least one annotated
FB were retained.
Gold Set:
journal articles were picked randomly from a large number of
publications, annotated by three different experts and harmonized.
1,682 articles, out of around 2,000, contained at least one
funding-related annotation, resulting in 4,537 FB and 3,156 GR
annotations in the set.
pair-wise averaged Cohen's kappa was used to measure
inter-annotator agreement as an indicator of dataset quality; it was
0.89, suggesting high quality.
[5] http://www.sciencedirect.com/
[6] http://www.crossref.org/fundingdata/registry.html
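As a rough illustration of the agreement measure, the sketch below computes Cohen's kappa for each pair of annotators and averages the results. The token-level labels and annotator names are made up for the example; how the Gold Set annotations were actually aligned and compared is not described in the slides.

```python
from collections import Counter
from itertools import combinations
from typing import Dict, List, Sequence


def cohen_kappa(a: Sequence[str], b: Sequence[str]) -> float:
    """Cohen's kappa for two annotators labelling the same items."""
    if len(a) != len(b) or not a:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    freq_a, freq_b = Counter(a), Counter(b)
    # Agreement expected by chance, from each annotator's label distribution.
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


def pairwise_average_kappa(annotations: Dict[str, List[str]]) -> float:
    """Average Cohen's kappa over all pairs of annotators."""
    scores = [
        cohen_kappa(annotations[p], annotations[q])
        for p, q in combinations(annotations, 2)
    ]
    return sum(scores) / len(scores)


# Toy labels for six tokens from three hypothetical annotators.
toy = {
    "expert_a": ["FB", "FB", "O", "GR", "O", "O"],
    "expert_b": ["FB", "FB", "O", "GR", "O", "O"],
    "expert_c": ["FB", "O", "O", "GR", "O", "O"],
}
agreement = pairwise_average_kappa(toy)
print(round(agreement, 3))  # 0.81
```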
Data Usage
The Silver Set was used to learn word clusters for the distributional
similarity measure that can be employed within the Stanford CoreNLP
toolkit:
word embeddings were generated from this dataset using
the Word2Vec algorithm [Mikolov, 2013],
followed by K-means clustering using cosine similarity.
Additionally, the Silver Set was used to train models to detect FB
annotations.
The Gold Set was used to train the binary text classifier that detects
the paragraphs of text which contain funding information.
It was also used to train models to detect FB and GR annotations.
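The clustering step on the Silver Set can be sketched as K-means over unit-normalized vectors, where assignment uses cosine similarity. This is an assumed, simplified variant: the embeddings below are invented 2-dimensional values rather than Word2Vec output, the seeding is deterministic for reproducibility, and `cosine_kmeans` is a hypothetical name.

```python
import math
from typing import Dict, List

Vector = List[float]


def normalize(v: Vector) -> Vector:
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else v


def cosine(u: Vector, v: Vector) -> float:
    # Inputs are unit vectors, so the dot product equals the cosine similarity.
    return sum(a * b for a, b in zip(u, v))


def cosine_kmeans(embeddings: Dict[str, Vector], k: int,
                  iterations: int = 20) -> Dict[str, int]:
    """Assign each word to one of k clusters using cosine similarity."""
    words = list(embeddings)
    vectors = [normalize(embeddings[w]) for w in words]
    centroids = vectors[:k]  # deterministic seeding, for the sketch only
    assignment = [0] * len(words)
    for _ in range(iterations):
        # Assignment step: pick the centroid with the highest cosine similarity.
        assignment = [
            max(range(k), key=lambda c: cosine(v, centroids[c])) for v in vectors
        ]
        # Update step: mean of the members, projected back onto the unit sphere.
        for c in range(k):
            members = [v for v, a in zip(vectors, assignment) if a == c]
            if members:
                mean = [sum(dim) / len(members) for dim in zip(*members)]
                centroids[c] = normalize(mean)
    return dict(zip(words, assignment))


# Invented toy embeddings; real ones would come from a trained Word2Vec model.
toy_embeddings = {
    "foundation": [0.9, 0.1],
    "grant": [0.1, 0.9],
    "institute": [0.8, 0.2],
    "award": [0.2, 0.8],
}
clusters = cosine_kmeans(toy_embeddings, k=2)
print(clusters)  # {'foundation': 0, 'grant': 1, 'institute': 0, 'award': 1}
```

The resulting cluster IDs are the kind of word-level feature that a CRF can consume as a distributional similarity signal.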
Detecting Text Blocks with Funding Information
As the first step, the text segments which contain funding information
are to be separated from the rest (Binary Classification).
To address this problem, we use a cost-sensitive L2-regularized
linear Support Vector Machine (SVM), as SVMs are known to
perform well on text classification problems.
The SVM operates on TF-IDF vectors extracted from the segments
of each input text, based on a bigram bag-of-words representation.
The SVM was trained on the examples of positive (1,682) and
negative segments (47,565), i.e., paragraphs with and without
funding information, which could be found in the Gold Set.
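The feature extraction described above can be sketched in a few lines. This is an illustrative version under stated assumptions: it uses unigrams plus bigrams, whitespace tokenization, and the plain `log(N / df)` form of IDF; the slides do not specify which TF-IDF variant or tokenizer was used, and the function names are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List


def ngrams(text: str) -> List[str]:
    """Lower-cased unigrams and bigrams of a whitespace-tokenized segment."""
    tokens = text.lower().split()
    bigrams = [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    return tokens + bigrams


def tfidf(segments: List[str]) -> List[Dict[str, float]]:
    """Sparse TF-IDF vectors, one dict of term -> weight per segment."""
    counts = [Counter(ngrams(s)) for s in segments]
    n = len(segments)
    # Document frequency: number of segments in which each term occurs.
    df = Counter(term for c in counts for term in c)
    vectors = []
    for c in counts:
        total = sum(c.values())
        vectors.append(
            {term: (freq / total) * math.log(n / df[term]) for term, freq in c.items()}
        )
    return vectors


toy_segments = ["funded by the NCI", "measured by the sensor"]
vectors = tfidf(toy_segments)
print(vectors[0]["by the"])         # 0.0, the bigram occurs in every segment
print(vectors[0]["funded by"] > 0)  # True, the bigram is specific to one segment
```

With 1,682 positive and 47,565 negative segments, the classes are heavily imbalanced, which is why a cost-sensitive classifier is applied on top of such vectors.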
Extracting Funding Information using NER
In order to annotate a piece of text with the FB label, a variety of
models were used:
1 pre-trained models packaged as part of the Stanford CoreNLP and
LingPipe suites; in this work they were used to identify the
Organization labels in the text, which were then stored as FB.
2 Stanford CRF, LingPipe HMM and OpenNLP MaxEnt models trained
on the Silver and Gold sets.
3 Stanford CRF classifiers using distributional similarity features based
on the word clusters created from the Silver Set data.
As for GR labels:
1 we use a rule-based approach that treats every word in the
funding section containing at least one digit as a grant ID.
2 we train all of the aforementioned models based on the labeled data in
the Gold Set.
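The rule-based baseline for grant IDs is simple enough to sketch directly. Whitespace tokenization and the stripping of surrounding punctuation are assumptions made for this example; `rule_based_grants` is a hypothetical name.

```python
import string
from typing import List


def rule_based_grants(funding_text: str) -> List[str]:
    """Return every whitespace-separated word that contains at least one digit."""
    candidates = []
    for raw in funding_text.split():
        token = raw.strip(string.punctuation)  # drop brackets, commas, periods
        if any(ch.isdigit() for ch in token):
            candidates.append(token)
    return candidates


text = (
    "Support for the Nurses' Health Study and the Health Professionals "
    "Follow-up Study was provided by grants (P01 CA87969 and UM1 "
    "CA167552, respectively) from the NCI."
)
grants = rule_based_grants(text)
print(grants)  # ['P01', 'CA87969', 'UM1', 'CA167552']
```

Note that the rule splits "P01 CA87969" into two candidates, which hints at how such a rule can reach good recall while losing precision on multi-word grant numbers.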
Ensembling
Figure: An example of the ensemble approach for extracting funding information
from text.
Overall Pipeline
Figure: Schematic showing the overall pipeline.
Section 4
Experiments and Results
Detection of Text Blocks with Funding Information
Method                         P     R     F1
SVM                            99    5     9
Cost-sensitive L2-SVM (C=2)    95    85    90
Table: Results for the identification of text with funding information using SVM.
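For reference, F1 is the harmonic mean of precision (P) and recall (R). The small sketch below recomputes the F1 of the cost-sensitive row from its rounded P and R; the function name is illustrative.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall, on the same scale as the inputs."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


cost_sensitive_f1 = f1_score(95, 85)
print(round(cost_sensitive_f1, 1))  # 89.7, which rounds to the reported 90
```

Because the harmonic mean is dominated by the smaller value, the plain SVM's recall of 5 keeps its F1 very low despite a precision of 99.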
Extraction of Funding Organization names
Method           P          R          F1
HMM-Pre          18(±0)     31(±0)     23(±0)
CRF-Pre          35(±0)     54(±0)     42(±0)
FPE              48(±0)     46(±0)     47(±0)
CRF-S            49(±0)     43(±0)     46(±0)
HMM-S            36(±0)     48(±0)     41(±0)
MaxEnt-S         50(±0)     39(±0)     44(±0)
CRF-G            64(±.2)    58(±.2)    61(±.2)
CRF-dsim-G       66(±.2)    61(±.3)    63(±.2)
HMM-G            49(±.3)    54(±.2)    52(±.2)
MaxEnt-G         64(±.4)    54(±.2)    59(±.3)
FundingFinder    72(±.3)    63(±.2)    68(±.3)
Table: NER results for the Funding Body (FB) annotation label. The best
performing model is FundingFinder; the second best is CRF-dsim-G.
Extraction of Grant IDs
Method           P          R           F1
Rule-based       78(±0)     89(±0)      83(±0)
CRF-G            91(±.1)    91(±.08)    91(±.1)
HMM-G            76(±.2)    77(±.2)     76(±.2)
MaxEnt-G         87(±.2)    89(±.1)     88(±.2)
FundingFinder    92(±.1)    91(±.1)     92(±.1)
Table: NER results for the Grant (GR) annotation label. The best performing
model is FundingFinder; the second best is CRF-G.
Section 5
Conclusions
Conclusions and Contributions
1 We have discussed the practically important problem of extracting
funding information from text, and have experimentally provided an
overview of the state-of-the-art methods that could be used for
it. This may prove to be a significant head start for researchers
pursuing the same problem.
2 Empirically, we have shown that a small and high quality dataset is
more suitable for this NER task than a larger, but noisier, dataset.
3 We have suggested an efficient two-stage pipeline for the task of
funding information extraction.
4 A learning mechanism, based on an ensemble of state-of-the-art base
annotators, was suggested, which should be easily extensible to any
NER task.
References
Nadeau, D., Sekine, S. (2007)
A survey of named entity recognition and classification
Lingvisticae Investigationes 30(1), 3–26.
Chieu, H.L. (2002)
Named entity recognition: a maximum entropy approach using global information
Proceedings of the 2002 International Conference on Computational Linguistics, 190–196.
McCallum, A., Li, W. (2003)
Early results for named entity recognition with conditional random fields, feature induction and web-enhanced lexicons
Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 4, 188–191.
Zhou, G., Su, J. (2002)
Named entity recognition using an HMM-based chunk tagger
Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, 473–480.
Mikolov, T., Sutskever, I., Chen, K., Corrado, G., Dean, J. (2013)
Distributed representations of words and phrases and their compositionality
Proceedings of the 26th International Conference on Neural Information Processing Systems, 3111–3119.
Thank You!
Please email me at d.kayal@elsevier.com for
critiques, comments, advice, dataset inquiries, etc.
Transposable elements in prokaryotes.pptTransposable elements in prokaryotes.ppt
Transposable elements in prokaryotes.ppt
 
Call Girls in Munirka Delhi 💯Call Us 🔝9953322196🔝 💯Escort.
Call Girls in Munirka Delhi 💯Call Us 🔝9953322196🔝 💯Escort.Call Girls in Munirka Delhi 💯Call Us 🔝9953322196🔝 💯Escort.
Call Girls in Munirka Delhi 💯Call Us 🔝9953322196🔝 💯Escort.
 
insect anatomy and insect body wall and their physiology
insect anatomy and insect body wall and their  physiologyinsect anatomy and insect body wall and their  physiology
insect anatomy and insect body wall and their physiology
 
THE ROLE OF PHARMACOGNOSY IN TRADITIONAL AND MODERN SYSTEM OF MEDICINE.pptx
THE ROLE OF PHARMACOGNOSY IN TRADITIONAL AND MODERN SYSTEM OF MEDICINE.pptxTHE ROLE OF PHARMACOGNOSY IN TRADITIONAL AND MODERN SYSTEM OF MEDICINE.pptx
THE ROLE OF PHARMACOGNOSY IN TRADITIONAL AND MODERN SYSTEM OF MEDICINE.pptx
 
Temporomandibular joint Muscles of Mastication
Temporomandibular joint Muscles of MasticationTemporomandibular joint Muscles of Mastication
Temporomandibular joint Muscles of Mastication
 
Twin's paradox experiment is a meassurement of the extra dimensions.pptx
Twin's paradox experiment is a meassurement of the extra dimensions.pptxTwin's paradox experiment is a meassurement of the extra dimensions.pptx
Twin's paradox experiment is a meassurement of the extra dimensions.pptx
 
Cytokinin, mechanism and its application.pptx
Cytokinin, mechanism and its application.pptxCytokinin, mechanism and its application.pptx
Cytokinin, mechanism and its application.pptx
 
TOTAL CHOLESTEROL (lipid profile test).pptx
TOTAL CHOLESTEROL (lipid profile test).pptxTOTAL CHOLESTEROL (lipid profile test).pptx
TOTAL CHOLESTEROL (lipid profile test).pptx
 
Bentham & Hooker's Classification. along with the merits and demerits of the ...
Bentham & Hooker's Classification. along with the merits and demerits of the ...Bentham & Hooker's Classification. along with the merits and demerits of the ...
Bentham & Hooker's Classification. along with the merits and demerits of the ...
 
Behavioral Disorder: Schizophrenia & it's Case Study.pdf
Behavioral Disorder: Schizophrenia & it's Case Study.pdfBehavioral Disorder: Schizophrenia & it's Case Study.pdf
Behavioral Disorder: Schizophrenia & it's Case Study.pdf
 
Neurodevelopmental disorders according to the dsm 5 tr
Neurodevelopmental disorders according to the dsm 5 trNeurodevelopmental disorders according to the dsm 5 tr
Neurodevelopmental disorders according to the dsm 5 tr
 
‏‏VIRUS - 123455555555555555555555555555555555555555
‏‏VIRUS -  123455555555555555555555555555555555555555‏‏VIRUS -  123455555555555555555555555555555555555555
‏‏VIRUS - 123455555555555555555555555555555555555555
 
Hot Sexy call girls in Moti Nagar,🔝 9953056974 🔝 escort Service
Hot Sexy call girls in  Moti Nagar,🔝 9953056974 🔝 escort ServiceHot Sexy call girls in  Moti Nagar,🔝 9953056974 🔝 escort Service
Hot Sexy call girls in Moti Nagar,🔝 9953056974 🔝 escort Service
 
Is RISC-V ready for HPC workload? Maybe?
Is RISC-V ready for HPC workload? Maybe?Is RISC-V ready for HPC workload? Maybe?
Is RISC-V ready for HPC workload? Maybe?
 
BIOETHICS IN RECOMBINANT DNA TECHNOLOGY.
BIOETHICS IN RECOMBINANT DNA TECHNOLOGY.BIOETHICS IN RECOMBINANT DNA TECHNOLOGY.
BIOETHICS IN RECOMBINANT DNA TECHNOLOGY.
 
Artificial Intelligence In Microbiology by Dr. Prince C P
Artificial Intelligence In Microbiology by Dr. Prince C PArtificial Intelligence In Microbiology by Dr. Prince C P
Artificial Intelligence In Microbiology by Dr. Prince C P
 
Spermiogenesis or Spermateleosis or metamorphosis of spermatid
Spermiogenesis or Spermateleosis or metamorphosis of spermatidSpermiogenesis or Spermateleosis or metamorphosis of spermatid
Spermiogenesis or Spermateleosis or metamorphosis of spermatid
 
Welcome to GFDL for Take Your Child To Work Day
Welcome to GFDL for Take Your Child To Work DayWelcome to GFDL for Take Your Child To Work Day
Welcome to GFDL for Take Your Child To Work Day
 

A Framework to Automatically Extract Funding Information from Text

  • 1. A Framework to Automatically Extract Funding Information from Text. Deep Kayal, Zubair Afzal, George Tsatsaronis et al. Content and Innovation Group, Elsevier B.V., Amsterdam, NL. d.kayal@elsevier.com. 16 September, 2018.
  • 2. Overview: 1. Motivation and Problem Definition; 2. Background; 3. Methodology; 4. Experiments and Results; 5. Conclusions.
  • 3. Section 1: Motivation and Problem Definition
  • 4. Motivation. Institutions and researchers are usually required to acknowledge their funding sources and grants. This information, if captured effectively, enables funding organizations to justify the impact of their allocated research funds. It also helps researchers discover funding opportunities that match their interests. In this work, we address the problem of automating the extraction of funding information from text, using natural language processing and machine learning techniques.
  • 5. Problem Statement. Can we automatically detect funder information in scientific papers? Example: "Support for the Nurses’ Health Study and the Health Professionals Follow-up Study was provided by grants (P01 CA87969 and UM1 CA167552, respectively) from the NCI. Support for the Women’s Health Initiative program is provided by contracts (N01WH22110, N01WH24152, N01WH3210032102 and N01WH32105) from the National Heart, Lung, and Blood Institute." Can we mark such mentions with entities of the form Funding Body and Grant Number?
  • 6. Section 2: Background
  • 7. Problem Definition. Given a scientific article as raw text input, we design a system to perform two tasks: (1) identify all text segments that contain funding information; (2) process these funding text segments to detect the set of funding bodies (FB) and the set of grants (GR) that appear in the text. The former is a binary text classification task, while the latter can be seen as a named entity recognition (NER) problem.
  • 8. NER and Sequential Learning. NER extracts information, known as named entities, from unstructured text; for example, the names of persons, locations and organizations. In the literature, NER systems employ rule-based, gazetteer-based and machine learning approaches [Nadeau, 2007]. Sequential learning approaches are machine learning models that leverage the relationships between nearby data points and their class labels. Hidden Markov Models [Zhou, 2002], Linear CRFs [McCallum, 2003] and Maximum Entropy Models [Chieu, 2002] are popular ways of modeling data for NER.
  • 9. Implementations and Toolkits. The Stanford CoreNLP toolkit [1] is a Java-based toolkit that has a CRF implementation, enhanced with long-distance features. An important aspect of the toolkit is the ability to use distributional similarity measures. LingPipe [2] is another NLP toolkit, whose efficient HMM implementation includes n-gram features. In this work, we also use the Apache OpenNLP toolkit [3], which has a MaxEnt implementation for NER. Finally, this work also makes use of Elsevier's Fingerprint Engine (FPE) [4], which is an industrial solution for annotating text with ontological concepts, given a vocabulary.
      [1] http://stanfordnlp.github.io/CoreNLP/
      [2] http://alias-i.com/lingpipe/demos/tutorial/read-me.html
      [3] https://opennlp.apache.org/
      [4] https://www.elsevier.com/solutions/elsevier-fingerprint-engine
  • 10. Section 3: Methodology
  • 11. Design Choice. As mentioned earlier, we use a two-stage approach to extract funding information from text. This design has the following benefits: (1) it minimizes the execution time of the approach, because the costliest component, namely NER, is applied only to the text segments identified in the first stage; (2) it reduces the number of false positives, as a scientific full-text article contains many text segments with strings that an NER component could potentially annotate falsely.
  • 12. Data Collection.
      Silver Set: articles from the last 10 years were randomly sampled from the ScienceDirect [5] database, and only their acknowledgment sections were selected. FBs in these acknowledgment sections were annotated using the Fingerprint Engine (FPE) and Crossref's open funder registry [6]. At the end of this step, 44,660 sections with at least one annotated FB were retained.
      Gold Set: journal articles were picked randomly from a large number of publications, annotated by three different experts and harmonized. Of around 2,000 articles, 1,682 contained at least one funding-related annotation, resulting in 4,537 FB and 3,156 GR annotations in the set. Pair-wise averaged Cohen's kappa was used to measure inter-annotator agreement as an indicator of dataset quality, and was found to be 0.89, suggesting high quality.
      [5] http://www.sciencedirect.com/
      [6] http://www.crossref.org/fundingdata/registry.html
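The pair-wise averaged Cohen's kappa mentioned on this slide can be illustrated with a minimal Python sketch. It assumes each annotator assigns exactly one label per token; the label names and toy data are hypothetical, and the slides do not state the unit over which agreement was actually computed.

```python
from collections import Counter
from itertools import combinations


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's label distribution.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[l] * count_b[l] for l in set(count_a) | set(count_b)) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label
    return (observed - expected) / (1.0 - expected)


def pairwise_average_kappa(annotations):
    """Average Cohen's kappa over every pair of annotators."""
    pairs = list(combinations(annotations, 2))
    return sum(cohen_kappa(a, b) for a, b in pairs) / len(pairs)


# Hypothetical token-level labels from three annotators.
annotator_1 = ["FB", "FB", "O", "O"]
annotator_2 = ["FB", "FB", "O", "O"]
annotator_3 = ["FB", "O", "O", "O"]
print(round(pairwise_average_kappa([annotator_1, annotator_2, annotator_3]), 3))
```

With three annotators there are three pairs, so the reported figure is the mean of three kappa values.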
  • 13. Data Usage. The Silver Set was used to learn word clusters for the distributional similarity measure that can be employed within the Stanford CoreNLP toolkit. This was done by generating word embeddings from this dataset with the Word2Vec algorithm [Mikolov, 2013], followed by K-means clustering using cosine similarity. The Silver Set was also used to train models to detect FB annotations. The Gold Set was used to train the binary text classifier that detects the paragraphs of text which contain funding information. It was also used to train models to detect FB and GR annotations.
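The clustering step described on this slide can be sketched roughly as follows. This is a minimal, standard-library illustration of K-means with cosine similarity (often called spherical K-means), not the authors' implementation: the embeddings are hypothetical two-dimensional toy vectors standing in for Word2Vec output, and the deterministic centroid initialization is a simplification chosen for reproducibility.

```python
import math


def normalize(vector):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm else list(vector)


def cosine_kmeans(embeddings, k, iterations=10):
    """Cluster word vectors by cosine similarity; returns {word: cluster_id}."""
    words = list(embeddings)
    vectors = [normalize(embeddings[w]) for w in words]
    n = len(vectors)
    # Simplified deterministic initialization: evenly spaced input vectors.
    centroids = [vectors[i * n // k] for i in range(k)]
    labels = [0] * n
    for _ in range(iterations):
        # Assignment step: pick the centroid with the highest cosine similarity.
        labels = [
            max(range(k), key=lambda c: sum(a * b for a, b in zip(v, centroids[c])))
            for v in vectors
        ]
        # Update step: re-normalized mean of the members of each cluster.
        for c in range(k):
            members = [v for v, label in zip(vectors, labels) if label == c]
            if members:
                mean = [sum(column) / len(members) for column in zip(*members)]
                centroids[c] = normalize(mean)
    return dict(zip(words, labels))


# Hypothetical toy embeddings; real ones would come from Word2Vec.
toy_embeddings = {
    "foundation": [1.0, 0.1],
    "council": [0.9, 0.2],
    "grant": [0.1, 1.0],
    "contract": [0.2, 0.9],
}
print(cosine_kmeans(toy_embeddings, k=2))
```

The resulting cluster identifiers are the kind of word-class feature that a CRF tagger can consume alongside the surface form of each token.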
  • 14. Detecting Text Blocks with Funding Information. As the first step, the text segments which contain funding information are separated from the rest (binary classification). To address this problem, we used a cost-sensitive L2-regularized linear Support Vector Machine (SVM), as SVMs are known to perform well on text classification problems. The SVM operates on TF-IDF vectors extracted from the segments of each input text, based on a bigram bag-of-words representation. It was trained on the positive (1,682) and negative (47,565) segments, i.e., paragraphs with and without funding information, found in the Gold Set.
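The feature extraction and the class imbalance behind the cost-sensitive setup can be sketched in plain Python. Several things here are assumptions rather than details from the slides: the use of unigrams together with bigrams, the smoothed IDF formula (one common variant), and the "balanced" weighting heuristic, since the slides do not say how misclassification costs were chosen. The SVM training itself is not shown.

```python
import math
from collections import Counter


def word_ngrams(text, n_max=2):
    """Lower-cased word n-grams of length 1..n_max (n_max=2 is an assumption)."""
    tokens = text.lower().split()
    grams = []
    for n in range(1, n_max + 1):
        grams += [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return grams


def tfidf_vectors(segments, n_max=2):
    """Sparse TF-IDF vectors (dicts) using a smoothed IDF variant."""
    docs = [Counter(word_ngrams(s, n_max)) for s in segments]
    # Document frequency: number of segments in which each term occurs.
    df = Counter(term for doc in docs for term in doc)
    n_docs = len(docs)
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in df.items()}
    return [{t: count * idf[t] for t, count in doc.items()} for doc in docs]


def balanced_class_weights(n_pos, n_neg):
    """Weights inversely proportional to class frequency (assumed heuristic)."""
    total = n_pos + n_neg
    return {1: total / (2 * n_pos), 0: total / (2 * n_neg)}


# Hypothetical segments; the class counts are the ones reported on the slide.
segments = [
    "supported by grant 123",
    "we thank the reviewers",
    "supported by the foundation",
]
vectors = tfidf_vectors(segments)
weights = balanced_class_weights(1682, 47565)
print(sorted(vectors[0]))
print(weights)
```

Under this heuristic an error on a funding paragraph costs roughly 28 times as much as an error on a non-funding paragraph, which is one plausible way to counter the 1,682 to 47,565 imbalance.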
  • 15. Extracting Funding Information using NER. To annotate a piece of text with the FB label, a variety of models were used: (1) pre-trained models packaged as part of the Stanford CoreNLP and LingPipe suites, used in this work to identify Organization labels in the text, which were then stored as FB; (2) Stanford CRF, LingPipe HMM and OpenNLP MaxEnt models trained on the Silver and Gold sets; (3) Stanford CRF classifiers using distributional similarity features based on the word clusters created from the Silver Set data. For GR labels: (1) we use a rule-based approach that treats every word in the funding section containing at least one digit as a grant ID; (2) we train all of the aforementioned models on the labeled data in the Gold Set.
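The rule-based grant baseline described on this slide can be sketched in a few lines of Python. The whitespace tokenization and the stripping of surrounding punctuation are assumptions; the slides only state that every word with at least one digit is treated as a grant ID.

```python
import string


def rule_based_grant_ids(funding_text):
    """Return every whitespace-separated word that contains at least one digit."""
    grant_ids = []
    for word in funding_text.split():
        # Assumption: drop surrounding punctuation such as "(" or ",".
        cleaned = word.strip(string.punctuation)
        if any(ch.isdigit() for ch in cleaned):
            grant_ids.append(cleaned)
    return grant_ids


sentence = (
    "Support for the Nurses' Health Study and the Health Professionals "
    "Follow-up Study was provided by grants (P01 CA87969 and UM1 CA167552, "
    "respectively) from the NCI."
)
print(rule_based_grant_ids(sentence))
# ['P01', 'CA87969', 'UM1', 'CA167552']
```

On this sentence the rule splits each two-part identifier such as "P01 CA87969" into separate words, which hints at why a purely word-level rule can differ from how annotators delimit grant numbers.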
  • 16. Ensembling. Figure: An example of the ensemble approach for extracting funding information from text.
  • 17. Overall Pipeline. Figure: Schematic showing the overall pipeline.
  • 18. Section 4: Experiments and Results
  • 19. Detection of Text Blocks with Funding Information
      Method                         P     R     F1
      SVM                            99    5     9
      Cost-sensitive L2-SVM (C=2)    95    85    90
      Table: Results for the identification of text with funding information using SVM (P = precision, R = recall).
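As a reminder of how the F1 column relates to the other two, here is a small Python sketch of the harmonic mean of precision and recall. Because the table reports rounded values, recomputing F1 from the rounded P and R may differ from the reported figure by about a point.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (same scale as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Cost-sensitive L2-SVM row from the table above.
print(round(f1_score(95, 85), 1))
```

The harmonic mean is dominated by the smaller of the two values, which is why the plain SVM's very low recall pulls its F1 down despite near-perfect precision.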
  • 20. Extraction of Funding Organization names
      Method           P          R          F1
      HMM-Pre          18(±0)     31(±0)     23(±0)
      CRF-Pre          35(±0)     54(±0)     42(±0)
      FPE              48(±0)     46(±0)     47(±0)
      CRF-S            49(±0)     43(±0)     46(±0)
      HMM-S            36(±0)     48(±0)     41(±0)
      MaxEnt-S         50(±0)     39(±0)     44(±0)
      CRF-G            64(±.2)    58(±.2)    61(±.2)
      CRF-dsim-G       66(±.2)    61(±.3)    63(±.2)
      HMM-G            49(±.3)    54(±.2)    52(±.2)
      MaxEnt-G         64(±.4)    54(±.2)    59(±.3)
      FundingFinder    72(±.3)    63(±.2)    68(±.3)
      Table: NER results for the Funding Body (FB) annotation label. The best-performing model is FundingFinder; the second best is CRF-dsim-G.
  • 21. Extraction of Grant IDs
      Method           P          R           F1
      Rule-based       78(±0)     89(±0)      83(±0)
      CRF-G            91(±.1)    91(±.08)    91(±.1)
      HMM-G            76(±.2)    77(±.2)     76(±.2)
      MaxEnt-G         87(±.2)    89(±.1)     88(±.2)
      FundingFinder    92(±.1)    91(±.1)     92(±.1)
      Table: NER results for the Grant (GR) annotation label. The best-performing model is FundingFinder; the second best is CRF-G.
  • 22. Section 5: Conclusions
  • 23. Conclusions and Contributions. (1) We have discussed the practically important problem of extracting funding information from text, and have provided an experimental overview of the state-of-the-art methods that can be used for it. This may give researchers working on the same problem a significant head start. (2) Empirically, we have shown that a small, high-quality dataset is more suitable for this NER task than a larger but noisier dataset. (3) We have suggested an efficient two-stage pipeline for the task of funding information extraction. (4) We have suggested a learning mechanism, based on an ensemble of state-of-the-art base annotators, which should be easily extensible to any NER task.
  • 24. References
      Nadeau, D., Sekine, S. (2007). A survey of named entity recognition and classification. Linguisticae Investigationes 30(1), 3-26.
      Chieu, H.L. (2002). Named entity recognition: a maximum entropy approach using global information. Proceedings of the 2002 International Conference on Computational Linguistics, 190-196.
      McCallum, A., Li, W. (2003). Early results for named entity recognition with conditional random fields, feature induction and web-enhanced lexicons. Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 4, 188-191.
      Zhou, G., Su, J. (2002). Named entity recognition using an HMM-based chunk tagger. Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, 473-480.
      Mikolov, T., Sutskever, I., Chen, K., Corrado, G., Dean, J. (2013). Distributed representations of words and phrases and their compositionality. Proceedings of the 26th International Conference on Neural Information Processing Systems, 3111-3119.
  • 25. Thank You! Please email me at d.kayal@elsevier.com for critiques, comments, advice, dataset inquiries, etc.