Proceedings of the First Workshop on Scholarly Document Processing, pages 10–19
Online, November 19, 2020. © 2020 Association for Computational Linguistics
Acknowledgement Entity Recognition in CORD-19 Papers
Jian Wu
Computer Science
Old Dominion University
Norfolk, VA, USA
jwu@cs.odu.edu
Pei Wang, Xin Wei
Computer Science
Old Dominion University
Norfolk, VA, USA
{pwang001,xwei001}@odu.edu
Sarah Michele Rajtmajer, C. Lee Giles
Information Sciences and Technology
Pennsylvania State University
University Park, PA, USA
Christopher Griffin
Applied Research Laboratory
Pennsylvania State University
University Park, PA, USA
Abstract
Acknowledgements are ubiquitous in schol-
arly papers. Existing acknowledgement en-
tity recognition methods assume all named
entities are acknowledged. Here, we exam-
ine the nuances between acknowledged and
named entities by analyzing sentence struc-
ture. We develop an acknowledgement extrac-
tion system, ACKEXTRACT, based on
open-source text mining software, and eval-
uate our method using manually labeled
data. ACKEXTRACT takes the PDF of a
scholarly paper as input and outputs ac-
knowledgement entities. Results show an
overall performance of F1 = 0.92. We
build a supplementary database by link-
ing CORD-19 papers with acknowledge-
ment entities, including persons and
organizations, extracted by ACKEXTRACT, and
find that only about 50–60% of named enti-
ties are actually acknowledged. We fur-
ther analyze chronological trends of ac-
knowledgement entities in CORD-19 pa-
pers. All code and labeled data are pub-
licly available at https://github.com/
lamps-lab/ackextract.
1 Introduction
Acknowledgements have been an institutionalized
part of research publications for some time (Cronin,
2001). Acknowledgement statements show the au-
thors’ public gratitude and recognition to individ-
uals, organizations, and grants for various contri-
butions. Acknowledged individuals and organiza-
tions have been underrepresented in author ranking
and citation impact analysis mostly due to their
presumed sub-authorship contribution. A recent
survey found that discipline, academic rank, and
gender have a significant effect on coauthorship
disagreement rate (Smith et al., 2019), leading to
non-author collaborators receiving less attention.
A recent study of non-author collaborators in the
biomedical and social sciences (Paul-Hus et al.,
2017) showed that they are not rare and that their
presence varies significantly by discipline.
Acknowledgements can be classified depending
on the nature of the contribution. Song et al. (2020)
classified sentences in acknowledgement sections
into six categories: declaration, financial, peer inter-
active communication and technical support, pre-
sentation, general acknowledgement, and general
statement. They can also be classified based on
the type of entities such as individual, organization,
and grant. Since 2008, funding acknowledgements
have been indexed by Web of Science. However,
there is still no dedicated software to accurately rec-
ognize acknowledged people and organizations and
generate a centralized acknowledgement database.
Early works on acknowledgements were based on
datasets manually extracted from specific journals,
an approach that does not scale. Building such a large
database can support further study of acknowledge-
ments at a larger scale.
There are several scenarios that make the ac-
knowledgement entity recognition (AER) task chal-
lenging. The upper panel of Figure 1 shows ex-
amples of sentences appearing in an isolated ac-
knowledgement section. The “Utrecht University”
is mentioned but should not be counted as an ac-
knowledgement entity because it is just the affil-
iation of “Arno van Vliet” who is acknowledged.
Acknowledgement statements can also appear in
footnotes (Figure 1, bottom), mixed with other foot-
note and/or body text. Author names may also
appear in the statements, such as “Andreoni” in this
example, and should be excluded.
Figure 1: Upper: Acknowledgement statements appear in an isolated section (Stewart et al., 2018). Bottom: acknowledgement statements appear in a footnote (Andreoni and Gee, 2015).

Existing works on AER leverage off-the-shelf
named entity recognition (NER) packages, such
as the Natural Language Toolkit (NLTK) (Bird,
2006), e.g., Khabsa et al. (2012) and Paul-Hus et al.
(2020), followed by simple semi-manual data
cleansing, resulting in a fraction of entities that
are mentioned but not actually acknowledged. In
this paper, we design an automatic AER system
called ACKEXTRACT that further classifies extrac-
tion results from open-source NER packages, recog-
nizing people and organizations and distinguishing
entities that are actually acknowledged. The extrac-
tor finds acknowledgement statements from iso-
lated sections and other locations such as footnotes,
which is common for papers in social and behav-
ioral sciences. Our contributions are:
1. Develop ACKEXTRACT as open-source soft-
ware to automatically extract acknowledge-
ment entities from research papers.
2. Apply ACKEXTRACT on the CORD-19
dataset and supplement the dataset with a cor-
pus of classified acknowledgement entities for
further studies.
3. Use the CORD-19 dataset as a case study to
demonstrate that acknowledgement studies
that do not classify named entities can sig-
nificantly overestimate the number of entities
that are actually acknowledged because many
people and organizations are mentioned but
not explicitly acknowledged.
2 Related Work
Early work on acknowledgement extraction was
performed manually, which was labor-intensive.
Cronin et al. (1993) extracted a total of 9,561 peer
interactive communication (PIC) names, most of
them persons’ names, from 4,200 research sociology
articles. They also defined the following six
categories of acknowledgement: moral support,
financial support, editorial support, presentational
support, instrumental/technical support, and con-
ceptual support, or PIC (Cronin et al., 1992).
Councill et al. (2005) used a hybrid method for
automatic AER from research papers and automati-
cally created an acknowledgement index (Giles and
Councill, 2004). The algorithm first used a heuris-
tic method for identifying acknowledgement pas-
sages. It then used an SVM model for identifying
lines containing acknowledgement sentences out-
side labeled acknowledgement sections. A regular
expression was used to extract entity names from
acknowledging text. This method achieved an over-
all precision of about 0.785 and a recall of 0.896 on
CiteSeer papers (Giles et al., 1998). The algorithm
does not distinguish entity types.
Khabsa et al. (2012) leveraged OpenCalais [1] and
AlchemyAPI [2], both free services at that time, to extract
named entities from acknowledgement sections and
built ACKSEER, a search engine for acknowledge-
ment entities. They merged the outputs of both NER
APIs into a single list and disambiguated entity
mentions using the longest common subsequence
(LCS) algorithm. The ground truth contains 200
top-cited CiteSeerX papers in which 130 had ac-
knowledgement sections. They achieved a precision
of 92.3% and a recall of 91.6% for acknowledge-
ment section extraction but did not evaluate entity
extraction.
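For illustration, the following is a minimal sketch of LCS-based mention matching in the spirit of ACKSEER's disambiguation step; the function names and the 0.8 threshold are our own assumptions, not code from Khabsa et al. (2012).

```python
from functools import lru_cache

def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of two strings."""
    @lru_cache(maxsize=None)
    def rec(i: int, j: int) -> int:
        if i == len(a) or j == len(b):
            return 0
        if a[i] == b[j]:
            return 1 + rec(i + 1, j + 1)
        return max(rec(i + 1, j), rec(i, j + 1))
    return rec(0, 0)

def same_entity(m1: str, m2: str, threshold: float = 0.8) -> bool:
    # Treat two mentions as the same entity when their LCS covers
    # most of the shorter mention (the threshold is our assumption).
    return lcs_length(m1.lower(), m2.lower()) >= threshold * min(len(m1), len(m2))

print(same_entity("C. Lee Giles", "C. L. Giles"))  # True
```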
Recent studies of acknowledgements tend to use
results from off-the-shelf NER packages with sim-
ple filters, assuming that named entities were ac-
knowledged entities. For example, Paul-Hus et al.
(2020) use the Stanford NER module in NLTK to
extract persons. Song et al. (2020) also directly use
people and organizations recognized by the Stan-
ford CoreNLP (Manning et al., 2014). These works
achieved a high recall by recognizing most named
entities in the acknowledgements but ignored their
relations to the papers where they appear, resulting
in a fraction of entities that are mentioned but not
actually acknowledged. Song et al. (2020) consider
grammar structure, such as verb tense and voice,
and sentence patterns when labeling sentences into
their six categories. For example, “was funded” is
followed by an “organization”. However, they only
label sentences and do not annotate them down to
the entity level. Our system examines the relation-
ship between entities and the current work, with
the purpose of discriminating acknowledgement en-
tities from named entities. In this system, we focus
on people and organizations.

[1] https://web.archive.org/web/20081023021111/http://opencalais.com/calaisAPI#
[2] https://web.archive.org/web/20090921044923/http://www.alchemyapi.com/

Figure 2: Architecture of ACKEXTRACT.
Recently, Dai et al. (2019) proposed GrantEx-
tractor, a pipeline system to extract grant support in-
formation from scientific papers. A model combin-
ing BiLSTM-CRF and pattern matching was used
to extract entities of grant numbers and agencies
from funding sentences, which are identified using
heuristic methods. The system achieves a micro-F1
of up to 0.90 in extracting (agency, number) grant
pairs.
Kayal et al. (2019) proposed an ensemble ap-
proach called FUNDINGFINDER for extracting
funding information from text. The authors con-
struct feature vectors for candidate entities based on
whether the entities are recognized by four NER
implementations: Stanford (conditional random
field model), LingPipe (hidden Markov model),
OpenNLP (maximum entropy model), and Else-
vier’s Fingerprint Engine. The F1-measure for
funding bodies is only 0.68 ± 0.3.
Our method differs from existing methods in
three ways. (1) It is built on top of state-of-the-art
neural NER methods, which results in a relatively
high recall. (2) It uses a heuristic method to filter
out entities that are just mentioned but not acknowl-
edged. (3) It extracts both organizations and people.
3 Dataset
On March 16, 2020, the Allen Institute for Artifi-
cial Intelligence released the first version of the
COVID-19 Open Research Dataset (CORD-19)
(Wang et al., 2020), in collaboration with several
other institutions. The dataset contains metadata
and segmented full text of research articles selected
by searching a list of keywords about coronavirus,
SARS-CoV, MERS, and other related terms from
four digital libraries including WHO, PubMed Cen-
tral, BioRxiv, and MedRxiv. The initial dataset con-
tained about 28k papers and was updated weekly
with papers from new sources and the latest pub-
lications. We used the dataset released on April
10, 2020, containing 59,312 metadata records,
among which 54,756 have the full text in JSON
format. The CORD-19 papers were generated by
processing PDFs using the S2ORC pipeline (Lo
et al., 2019), in which GROBID (Lopez, 2009) was
employed for document segmentation and metadata
extraction.
The full text in JSON files is directly used for
sentence segmentation. However, we observe that
GROBID extraction results are not perfect. In par-
ticular, we estimate the fraction of acknowledge-
ment sections omitted. We also estimate the num-
ber of acknowledgement entities omitted in the
data release due to extraction errors of GROBID.
To do this, we downloaded 45,916 full-text PDF
papers collected by the Internet Archive (IA) be-
cause the CORD-19 dataset does not include PDF
files [3]. We found 13,103 CORD-19 papers in the
IA dataset.
4 Acknowledgement Extraction
4.1 Overview
The architecture of our acknowledgement entity ex-
traction system is depicted in Figure 2. The system
can be divided into the following modules.
1. Document segmentation. CORD-19 pro-
vides full text as JSON files, but in general,
most research articles are published in PDF,
so our first step is converting a PDF docu-
ment to text and segment it into sections. We
use GROBID that has shown superior perfor-
mance over many other document extraction
methods (Lipinski et al., 2013). The output is
an XML file in TEI schema.
2. Sentence segmentation. Paragraphs are seg-
mented into sentences. We compare several
sentence segmentation software packages and
choose Stanza (Qi et al., 2020) because of its
relatively high accuracy.
3. Sentence classification. Sentences are
classified into acknowledgement and non-
acknowledgement statements. The result is
a set of acknowledgement statements inside
or outside the acknowledgement sections.
[3] https://archive.org/download/covid19_fatcat_20200410
4. Entity recognition. Named entities are ex-
tracted from acknowledgement statements.
We compare four commonly used NER soft-
ware packages and choose Stanza because of
its relatively high performance. In this work,
we focus on person and organization entities.
5. Entity classification. In this module we clas-
sify named entities by analyzing sentence
structures, aiming at discriminating named
entities that are actually acknowledged, rather
than just mentioned. We demonstrate that
triple extraction packages such as REVERB
and OLLIE fail to handle acknowledgement
statements with multiple entities in objects in
our dataset. The result is a set of acknowledge-
ment entities, including people and organizations.
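To make the data flow between these modules concrete, here is a minimal skeleton of such a pipeline; all function names are hypothetical placeholders rather than ACKEXTRACT's actual API.

```python
# Hypothetical skeleton of the five-module pipeline; the helper
# functions are placeholders for the components described above.

def grobid_segment(pdf_path):          # 1. document segmentation (GROBID -> TEI XML)
    raise NotImplementedError

def find_ack_paragraphs(tei_xml):      #    acknowledgement sections and footnotes
    raise NotImplementedError

def segment_sentences(paragraph):      # 2. sentence segmentation (Stanza)
    raise NotImplementedError

def is_acknowledgement(sentence):      # 3. sentence classification (regular expressions)
    raise NotImplementedError

def recognize_entities(sentence):      # 4. NER (Stanza, PERSON/ORG)
    raise NotImplementedError

def classify_acknowledged(sentence, named_entities):  # 5. entity classification (RB)
    raise NotImplementedError

def ackextract(pdf_path):
    """Run the five modules in order and return acknowledgement entities."""
    tei_xml = grobid_segment(pdf_path)
    sentences = [s for p in find_ack_paragraphs(tei_xml)
                 for s in segment_sentences(p)]
    ack_sents = [s for s in sentences if is_acknowledgement(s)]
    return [e for s in ack_sents
            for e in classify_acknowledged(s, recognize_entities(s))]
```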
4.2 Document Segmentation
The majority of scholarly papers are published in
PDF format, which are not readily readable by text
processors. Several attempts have been made to
convert PDF into text (Bast and Korzen, 2017) and
segment the document into section and sub-section
levels. GROBID is a machine learning library for
extracting, parsing, and re-structuring raw docu-
ments into TEI encoded documents. Other simi-
lar methods have been recently developed such as
OCR++ (Singh et al., 2016) and Science Parse [4].
Lipinski et al. (2013) compared 7 metadata extrac-
tion methods and found that GROBID (version
0.4.0) achieved superior performance over the oth-
ers. GROBID trained a cascade of conditional
random field (CRF) models on PubMed and com-
puter science papers. The recent version (0.6.0)
has a set of powerful functionalities, such as extract-
ing and parsing headers and segmenting the full
text. GROBID supports a batch mode and an API
service mode; the latter enables large-scale
document processing on multi-core servers such
as in CiteSeerX (Wu et al., 2015). A benchmark-
ing result for version 0.6.0 shows that section
title parsing achieves an F1 = 0.70 under the
strict matching criteria and F1 = 0.75 under the
soft matching criteria [5]. Singh et al. (2016) claim
OCR++ achieves better performance than GROBID
in several fields evaluated on computer science pa-
pers. However, the lack of a service mode API
and multi-domain adaptability limits its usability.
Science Parse only extracts key metadata such as
title, authors, year, and venue. Therefore, we adopt
GROBID to convert PDF documents into XML
files.

[4] https://github.com/allenai/spv2
[5] https://grobid.readthedocs.io/en/latest/Benchmarking-pmc/
Depending on the structure and provenance of
PDFs, GROBID may miss acknowledgements in
certain papers. To estimate the fraction of papers
in which acknowledgements were missed by GRO-
BID, we visually inspected a random sample of 200
papers from the CORD-19 dataset, and found that
only 146 papers (73%) contain acknowledgement
statements, out of which GROBID successfully ex-
tracted all acknowledgement statements from 120
papers (82%). Of the remaining 26 papers that
GROBID failed to parse, the acknowledgements
appear in sections in 17 papers and in footnotes
in 9 papers. We developed a heuristic method
that extracts acknowledgement statements from
all 120 papers whose acknowledgement statements
were output by GROBID.
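As a concrete illustration of the service mode used for document segmentation, here is a minimal sketch of calling a locally running GROBID service; the processFulltextDocument endpoint is part of GROBID's documented REST API, while the host and port are assumptions about a default local installation.

```python
import requests

# Default address of a locally running GROBID service (an assumption).
GROBID_URL = "http://localhost:8070/api/processFulltextDocument"

def pdf_to_tei(pdf_path: str) -> str:
    """Send a PDF to a running GROBID service and return TEI-encoded XML."""
    with open(pdf_path, "rb") as f:
        resp = requests.post(GROBID_URL, files={"input": f}, timeout=300)
    resp.raise_for_status()
    return resp.text  # TEI XML with segmented sections

# Usage: tei = pdf_to_tei("paper.pdf")
```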
4.3 Sentence Segmentation
The acknowledgement sections or statements ex-
tracted above are paragraphs, which need to be
segmented (or tokenized) into sentences. We com-
pared four software packages for sentence segmen-
tation: NLTK (Bird, 2006), Stanza (Qi
et al., 2020), Gensim (Řehůřek and Sojka, 2010),
and the Pragmatic Segmenter [6].
NLTK includes a sentence tokenization method,
sent_tokenize(), which uses an unsupervised al-
gorithm to build a model for abbreviated words,
collocations, and words that start sentences; and
then uses that model to find sentence boundaries.
Stanza is a Python natural language analysis pack-
age developed by the Stanford NLP group. Sen-
tence segmentation is modeled as a tagging prob-
lem over character sequences, where the neural
model predicts whether a given character is the end
of a sentence. The split_sentences() function in
the Gensim package splits a text and returns a list of
sentences from a given text string using unsupervised
pattern recognition. The Pragmatic Segmenter is
a rule-based sentence boundary detection gem that
works out-of-the-box across many languages.
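For illustration, a minimal sketch comparing two of these segmenters on the same paragraph, assuming nltk and stanza are installed and their models downloaded:

```python
import nltk
import stanza

# One-time model downloads (assumed already done in practice).
nltk.download("punkt", quiet=True)
stanza.download("en", verbose=False)

text = ("We thank Dr. A. Smith for helpful comments. "
        "This work was supported by NSF grant No. 12345.")

# NLTK: unsupervised Punkt model for sentence boundaries.
nltk_sents = nltk.sent_tokenize(text)

# Stanza: neural tokenizer; sentence splitting is modeled as
# tagging over character sequences.
nlp = stanza.Pipeline(lang="en", processors="tokenize", verbose=False)
stanza_sents = [s.text for s in nlp(text).sentences]

print(nltk_sents)
print(stanza_sents)
```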
To compare the above methods, we created a
ground truth corpus by randomly selecting ac-
knowledgment sections or statements from 47 pa-
pers and manually segmenting them, resulting in
100 sentences. Table 1 shows the comparison re-
sults for the four methods.

Method     Precision  Recall  F1
Gensim     0.65       0.64    0.65
NLTK       0.72       0.69    0.70
Pragmatic  0.86       0.76    0.81
Stanza     0.81       0.88    0.84

Table 1: Sentence segmentation performance.

[6] https://github.com/diasks2/pragmatic_segmenter

The precision is calculated
by dividing the number of correctly segmented sen-
tences by the total number of sentences segmented.
The recall is calculated by dividing the number of
correctly segmented sentences by the total number
of manually segmented sentences. Stanza outper-
forms the other three, achieving an F1 = 0.84.
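In equation form, with C the set of correctly segmented sentences, S the set of sentences produced by a segmenter, and G the set of manually segmented gold sentences:

P = |C| / |S|,   R = |C| / |G|,   F1 = 2PR / (P + R).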
4.4 Sentence Classification
Not all sentences in acknowledgement sec-
tions express acknowledgement, for example:
“The funders played no role in the study or
preparation of the manuscript” (Song et al.,
2020). In this module,
we classify sentences into acknowledgement and
non-acknowledgement statements. We developed
a set of regular expressions that match
verbs (e.g., thank, gratitude to, indebted to),
adjectives (e.g., grateful to), and nouns (e.g.,
helpful comments, useful feedback) to cover as
many cases as possible. To evaluate this method,
we manually selected 100 sentences, 50 positive
and 50 negative, from the sentences obtained in
Section 4.3. Our results show that 96 out of 100
sentences were classified correctly, resulting in an
accuracy of 0.96.
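A minimal sketch of such a keyword-based regular-expression classifier; the cue list below is a small subset chosen for illustration, while the paper's full pattern set is in the released code.

```python
import re

# A few acknowledgement cues from the paper (verbs, adjectives, nouns);
# the real system uses a larger, hand-tuned set.
ACK_PATTERN = re.compile(
    r"\b(thank(s|ed|ing)?|gratitude to|indebted to|grateful to|"
    r"helpful comments|useful feedback|acknowledge[sd]?|"
    r"(is|was|were) (supported|funded))\b",
    re.IGNORECASE,
)

def is_acknowledgement(sentence: str) -> bool:
    """Classify a sentence as an acknowledgement statement."""
    return ACK_PATTERN.search(sentence) is not None

print(is_acknowledgement("We thank Arno van Vliet for helpful discussions."))  # True
print(is_acknowledgement("The funders played no role in the study."))          # False
```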
4.5 Entity Recognition
In this step, named entities are extracted using
state-of-the-art NER software packages, including
NLTK (Bird, 2006), Stanford CoreNLP (Manning
et al., 2014), spaCy (Honnibal and Montani, 2017),
and Stanza (Qi et al., 2020). Stanza is a Python
library offering fully neural pre-trained models that
provided state-of-the-art performance on many raw
text processing tasks at the time of its release. The
NER model adopted the contextualized sequence
tagger in Akbik et al. (2018). The architecture in-
cludes a Bi-LSTM character-level language model,
followed by a one-layer Bi-LSTM sequence tagger
with a conditional random field (CRF) encoder. Al-
though Stanza was developed based on Stanford
CoreNLP, the two exhibit different performance on
our NER task.
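For illustration, a minimal sketch of running Stanza's NER on an acknowledgement sentence and keeping only person and organization mentions:

```python
import stanza

stanza.download("en", verbose=False)  # one-time model download
nlp = stanza.Pipeline(lang="en", processors="tokenize,ner", verbose=False)

sentence = ("We thank Arno van Vliet of Utrecht University "
            "for helpful comments.")
doc = nlp(sentence)

# Keep only PERSON and ORG mentions; note that both entities here
# are *named*, but only one of them is actually acknowledged.
mentions = [(ent.text, ent.type) for ent in doc.ents
            if ent.type in ("PERSON", "ORG")]
print(mentions)
# Expected: [('Arno van Vliet', 'PERSON'), ('Utrecht University', 'ORG')]
```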
Method            Entity  Precision  Recall  F1
NLTK              Person  0.45       0.68    0.55
                  Org.    0.59       0.77    0.67
spaCy             Person  0.74       0.88    0.80
                  Org.    0.63       0.74    0.68
Stanford CoreNLP  Person  0.88       0.87    0.87
                  Org.    0.68       0.80    0.73
Stanza            Person  0.89       0.93    0.91
                  Org.    0.60       0.89    0.72

Table 2: Comparison of NER software packages.

The ground truth is built by randomly selecting
100 acknowledgement paragraphs from sections,
footnotes, and body text, and manually annotating
person and organization entities, without discrimi-
nating whether they are acknowledged or not. This
results in 146 person and 209 organization entities.
The comparison results indicate that overall Stanza
outperforms the other three, achieving F1 = 0.91
for person and F1 = 0.72 for organization. In
particular, the recall of Stanza is 9% higher than
that of Stanford CoreNLP (Table 2).
4.6 Entity Classification
As we showed, not all named entities are acknowl-
edged, such as “Utrecht University” in Fig-
ure 1. Therefore, it is necessary to build a classifier
to discriminate acknowledgement entities – entities
that are thanked by the paper or the authors, from
named entities.
The majority of acknowledgement statements in
academic articles have a relatively standard subject-
predicate-object (SPO) structure. They use a lim-
ited number of words or phrases, such as “thank”,
“acknowledge”, “are grateful”, “is supported”, and
“is funded” as predicates. However, the object can
contain multiple named entities, some of which
are used as attributes of the others. In rare cases,
certain sentences may not have subjects and predi-
cates, such as the first sentence in Figure 4.
Our approach can be divided into three steps.
Two representative examples are illustrated in Fig-
ure 3. The pseudo-code is shown in Algorithm 1.
Step 1: We resolve the type of voice (active or
passive), subject, and predicate of a sentence using
dependency parsing by Stanza. This is because
named entities can appear in the subject or object
part. We then locate all named entities. The
semantic meaning of a predicate and its type of
voice can be used to determine whether acknowl-
edged entities are in the object part or the subject
part. In most cases the target entities are objects.
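As an illustration of Step 1, a minimal sketch using Stanza's dependency parser to read off the subject, predicate, and voice; the Universal Dependencies relations nsubj and nsubj:pass mark active and passive subjects.

```python
import stanza

nlp = stanza.Pipeline(lang="en",
                      processors="tokenize,pos,lemma,depparse",
                      verbose=False)

def subject_predicate_voice(sentence: str):
    """Return (subject, predicate, voice) from a dependency parse."""
    sent = nlp(sentence).sentences[0]
    subject, predicate, voice = None, None, "active"
    for word in sent.words:
        if word.deprel == "root":
            predicate = word.text          # main verb of the sentence
        elif word.deprel == "nsubj":
            subject = word.text            # active-voice subject
        elif word.deprel == "nsubj:pass":
            subject, voice = word.text, "passive"
    return subject, predicate, voice

print(subject_predicate_voice("We thank Shirley Hauta for her help."))
# ('We', 'thank', 'active')
print(subject_predicate_voice("This work was funded by NIH."))
# ('work', 'funded', 'passive')
```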
Step 2: A sentence with multiple entities in the
objective part is split into shorter sentences, called
“subsentences”, so that each subsentence is asso-
ciated with only up to one named entity. This is
done by first splitting the sentence by “and”. For
each subsentence, if the subject and predicate are
missing, we fill them in using the subject and pred-
icate of the original sentence. The object in each
subsentence does not necessarily contain an entity.
For example, in the right panel of Figure 3, because
“expertise” is not a named entity, it is replaced by
“none” in the subsentence. The SPO relations that
do not contain named entities are removed.
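A minimal sketch of the subsentence-splitting idea in Step 2; this is our own simplification (split the object part on "and", then re-attach the subject and predicate of the original sentence), not the system's dependency-based implementation.

```python
def split_into_subsentences(sentence: str, subject: str, predicate: str) -> list:
    """Split the object part on 'and' and rebuild full subsentences."""
    # Everything after the predicate is treated as the object part
    # (a simplification of the dependency-based version).
    _, _, obj_part = sentence.partition(predicate)
    subsentences = []
    for chunk in obj_part.rstrip(".").split(" and "):
        chunk = chunk.strip().strip(",")
        # Re-attach the subject and predicate, which the chunk lacks.
        subsentences.append(f"{subject} {predicate} {chunk}.")
    return subsentences

sent = ("We thank Anne Broadwater and Nicolas Olivarez of the de Silva lab "
        "for technical assistance with the assays.")
for s in split_into_subsentences(sent, "We", "thank"):
    print(s)
# We thank Anne Broadwater.
# We thank Nicolas Olivarez of the de Silva lab for technical assistance with the assays.
```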
There are two scenarios. In the first scenario,
a sentence or a subsentence may contain a list of
entities, with only the first being acknowledged.
In the third example of Figure 4, the named en-
tities are [‘Qi Yang’, ‘Morehouse School of
Medicine’, ‘Atlanta’, ‘GA’] but only the first
entity is acknowledged. The remaining entities,
which are recognized as organizations or locations,
provide supplementary information. In this
scenario, only the first named entity is extracted.
Step 3: In the second scenario, acknowledged
entities are connected by commas or “and”,
such as in We thank Shirley Hauta, Yurij
Popowych, Elaine van Moorlehem and Yan
Zhou for help in the virus production. In
this scenario, entities in this structure have the
same type, indicating that they play similar roles.
This parallel pattern can be captured by regular
expressions. The entities resolved in Steps 2 and 3
are merged to form the final set.
The method to find the parallel structure is
as follows. First, we check whether each entity’s
type is person. If so, the entities are substituted
with integer indexes. The sentence becomes
The authors would like to thank 0, 1 and
2, and the kind support of Bayer Animal
Health GmbH and Virbac Group. If there are
three or more consecutive numbers in this form,
this is the parallel pattern, which is captured by
regular expressions. The pattern also allows text
between names (Figure 5). Next, the numbers
in this part are extracted and mapped to the
corresponding entities. In the example above, the
numbers [0, 1, 2] correspond to the indexes of the
entities [Norbert Mencke, Lourdes Mottier,
David McGahie]. Similar pattern recognition
is performed for organizations. The process is
depicted in Figure 5.
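A minimal sketch of this index-substitution trick for Step 3; this is our own illustration of the described pattern matching, not the released implementation.

```python
import re

def parallel_person_entities(sentence: str, persons: list) -> list:
    """Find person entities listed in a parallel 'A, B and C' structure."""
    # Substitute each person mention with its integer index.
    indexed = sentence
    for i, name in enumerate(persons):
        indexed = indexed.replace(name, str(i))
    # Three or more comma/'and'-separated indexes mark the parallel pattern.
    m = re.search(r"\d+(?:\s*(?:,|and)\s*\d+){2,}", indexed)
    if not m:
        return []
    # Map the matched indexes back to the original entity mentions.
    return [persons[int(n)] for n in re.findall(r"\d+", m.group(0))]

sent = ("The authors would like to thank Norbert Mencke, Lourdes Mottier "
        "and David McGahie, and the kind support of Bayer Animal Health "
        "GmbH and Virbac Group.")
persons = ["Norbert Mencke", "Lourdes Mottier", "David McGahie"]
print(parallel_person_entities(sent, persons))
# ['Norbert Mencke', 'Lourdes Mottier', 'David McGahie']
```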
Figure 3: Process mapping multiple subject-predicate-object relations to acknowledgement entities, with two representative examples. “None” means the object does not contain named entities.

Example 1: “We would also like to thank the Canadian Wildlife Health Cooperative at the University of Saskatchewan for their support and expertise.”
Subsentences: “We would also like to thank the Canadian Wildlife Health Cooperative at the University of Saskatchewan for their support.” / “We would also like to thank expertise.”
Relations: [‘we’, ‘thank’, ‘Canadian Wildlife Health Cooperative’]; [‘we’, ‘thank’, none]
Result: [‘Canadian Wildlife Health Cooperative’]

Example 2: “We thank Anne Broadwater and Nicolas Olivarez of the de Silva lab for technical assistance with the assays.”
Subsentences: “We thank Anne Broadwater of the de Silva lab for technical assistance with the assays.” / “We thank Nicolas Olivarez of the de Silva lab for technical assistance with the assays.”
Relations: [‘we’, ‘thank’, ‘Anne Broadwater’]; [‘we’, ‘thank’, ‘Nicolas Olivarez’]
Result: [‘Anne Broadwater’, ‘Nicolas Olivarez’]
Method               Precision  Recall  F1
Stanza               0.57       0.91    0.70
RB (without Step 3)  0.94       0.78    0.85
RB                   0.94       0.90    0.92
Table 3: Performance of Stanza and our relation-based
(RB) classifier. Precision and recall show overall re-
sults by combining person and organization.
We investigated open information extraction
methods such as REVERB (Fader et al., 2011) and
OLLIE (Mausam et al., 2012). We found that they
only work for sentences with relatively simple
structures, such as “The authors wish to thank
Ming-Wei Guo for reagents and technical
assistance”, but fail on more complicated
sentences with long object parts or parallel
subsentences (Figure 4). We also investigated the
semantic role labeling (SRL) library in AllenNLP
toolkit (Shi and Lin, 2019). For the third sentence
in Figure 4, the AllenNLP SRL library resolves the
entire string after “thank” as an argument, but fails
to distinguish entity types. Our method can handle
all statements in Figure 4.
As a baseline, we use Stanza to extract all named
entities and compare its performance with our
relation-based (RB) classifier. The ground truth
is compiled using the same corpus described in
Section 4.5 except that only acknowledgement en-
tities (as opposed to all named entities) are labeled
positive. The results (Table 3) show that Stanza
achieves high recall but poor precision, indicating
that a significant fraction (∼40%) of named enti-
ties are not acknowledged. In contrast, our classi-
fier (RB) achieves a precision of 0.94, with a small
loss of recall, achieving an F1 = 0.92.
One limitation of the RB classifier is that it relies
on text quality. Sentences are expected to follow
the editorial convention, such that sentences are
clearly delimited by periods. If two sentences are
not properly delimited, e.g., missing a period, the
classifier may make incorrect predictions.

Figure 4: Sentence examples from which REVERB or OLLIE failed to extract the acknowledgement entities (underlined in the original figure); our method identifies them.
(1) To Aulio Costa Zambenedetti, for the art work, Nilson Fidêncio, Silvio Marques, Tania Schepainski and Sibelli Tanjoni de Souza, for technical assistance.
(2) The authors also thank the Program for Technological Development in Tools for Health-PDTIS FIOCRUZ, for the use of its facilities (Platform RPT09H - Real Time PCR - Instituto Carlos Chagas/Fiocruz-PR).
(3) The authors wish to thank Ms. Qi Yang, DNA sequencing lab, Morehouse School of Medicine, Atlanta, GA, and the Research Cores at Morehouse School of Medicine supported through the Research Centers at Minority Institutions (RCMI) Program, NIH/NCRR/RCMI Grant G12-RR03034.

Figure 5: Merging the results of Step 2 and Step 3 to obtain the final result. For the sentence “The authors would like to thank Norbert Mencke, Lourdes Mottier and David McGahie, and the kind support of Bayer Animal Health GmbH and Virbac Group.”, Step 2 yields [‘Norbert Mencke’, ‘David McGahie’, ‘Bayer Animal Health’, ‘Virbac Group’] and Step 3 yields [‘Norbert Mencke’, ‘Lourdes Mottier’, ‘David McGahie’]; the merged final set is [‘Norbert Mencke’, ‘David McGahie’, ‘Bayer Animal Health’, ‘Virbac Group’, ‘Lourdes Mottier’].
5 Data Analysis
The CORD-19 dataset contains PMC and PDF
folders. We found that almost all papers in the
PMC folders are included in the PDF folders.
For consistency, we only work on papers in the
PDF folders, containing 37,612 unique PDF pa-
pers [7]. Using ACKEXTRACT, we extracted a total
of 102,965 named entities from 21,469 CORD-
19 papers. The fraction of papers with named
entities (21469/37612≈57%) is roughly consis-
tent with the fraction obtained from the 200 sam-
ples (120/200≈60% in Section 4.2). Among all
named entities, 58,742 are acknowledgement enti-
ties. These numbers suggest that using our model,
only about 50% of named entities are acknowl-
edged. Using only named entities, acknowledge-
ment studies could significantly overestimate the
number of acknowledgement entities by relying
on entities recognized by NER software packages
without further classification. Here, we analyze our
results and study some trends of acknowledgement
entities in the CORD-19 dataset.

[7] We excluded 2,003 duplicate papers with exactly the same SHA1 values.

Algorithm 1: Relation-Based Classifier

Input: sentence
Output: entity_list_all

Function find_entity(subsentence):
    pre-process with punctuation and format
    find candidate entities entity_list by Stanza
    for x ∈ entity_list:
        if x ∈ subject part (differs based on the predicate):
            x = none
    entity_list.remove(none)
    return entity_list

find the subject part, the predicate, and the object part
if "predicate value" ∈ reg_expression_1:
    if "and" ∈ sentence:
        split into subsentences by "and"
        for each subsentence:
            entity_list ← find_entity(subsentence)
            if entity_list ≠ none:
                entity_list_all ← append entity_list[0]
    else ("and" ∉ sentence):
        entity_list ← find_entity(sentence)
        entity_list_all ← entity_list[0]
else if "predicate value" ∈ reg_expression_2:
    do the same as in the if branch, but focus on entities in the subject
for x ∈ candidate_list by Stanza:
    if x ∈ sentence and x.type = type:
        orig_list ← append x
        sent ← replace x by index(x)
if reg_format ∈ sent:
    temp ← find reg_expression match in sent
    num_list ← find numbers in temp
    list ← index orig_list by num_list
combine list and entity_list_all
The top 10 acknowledged organizations (Ta-
ble 4) are all funding agencies. Overall, NIH (NI-
AID is an institute of NIH) is acknowledged the
most, but funding agencies in other countries are
also frequently acknowledged in CORD-19 papers.
Figure 6 shows the numbers of CORD-19 pa-
pers with vs. without acknowledgement (person
or organization) recognized from 1970 to 2020.
The huge leap around 2002 was due to the wave of
coronavirus studies during the SARS period. The
small drop in 2020 was due to data incompleteness.
The plot indicates that the fraction of papers with
acknowledgement has been increasing gradually
over the last 20 years.

Organization                                                 Count
National Institutes of Health / NIH                          1414
National Natural Science Foundation of China / NSFC           615
National Institute of Allergy & Infectious Diseases / NIAID   209
National Science Foundation / NSF                             183
Ministry of Health                                            160
Wellcome Trust                                                146
Deutsche Forschungsgemeinschaft                               119
Ministry of Education                                         119
National Science Council                                      104
Public Health Service                                          85

Table 4: Top 10 acknowledged organizations identified in our CORD-19 dataset.

Figure 6: Numbers of CORD-19 papers with vs. without acknowledgement recognized, from 1970 to 2020.

Figure 7 shows
the numbers of the top 10 acknowledged organiza-
tions from 1983 to 2020. The figure indicates that
the number of acknowledgements to NIH has been
gradually decreasing over the past 10 years, while
the number of acknowledgements to NSF is roughly
constant. In contrast, the number for NSFC (a
Chinese funding agency) has been gradually increas-
ing. Note that the distribution of acknowledged
organizations has a long tail, and organizations
behind the top 10 actually dominate the total number.
However, the top 10 organizations are the biggest
research funding agencies, and the trend to some
extent reflects strategic shifts of funding support.
6 Conclusion and Future Work
Here, we extended the work of Khabsa et al. (2012)
and built an acknowledgement extraction frame-
work denoted as ACKEXTRACT for research arti-
cles. ACKEXTRACT is based on heuristic meth-
ods and state-of-the-art text mining libraries (e.g.,
GROBID and Stanza) but features a classifier
that discriminates acknowledgement entities from
named entities by analyzing the multiple subject-
predicate-object relations in a sentence.

Figure 7: Trends of the top 10 acknowledged organizations from 1983 to 2020.

Our ap-
proach successfully recognizes acknowledgement
entities that cannot be recognized by OIE packages
such as REVERB, OLLIE, and the AllenNLP SRL
library.
This method is applied to the CORD-19 dataset
released on April 10, 2020, processing one PDF
document in 5 seconds on average. Our results
indicate that only 50–60% of named entities are ac-
knowledged. The rest are mentioned to provide
additional information (e.g., affiliation or location)
about acknowledgement entities. Working on clean
data, our method achieves an overall F1 = 0.92
for person and organization entities. The trend
analysis of the CORD-19 papers shows that more
and more papers have included acknowledgement
entities since 2002, when the SARS outbreak oc-
curred. The trend also reveals that the overall
number of acknowledgements to NIH has been
gradually decreasing over the past 10 years, while
more papers acknowledge NSFC, a Chinese funding
agency. One caveat of our method is that organi-
zations in different countries are not distinguished.
For example, many countries have agencies called
“Ministry of Health”. In the future, we plan to
build learning-based models for sentence classifi-
cation and entity classification. The code and data
of this project have been released on GitHub at:
https://github.com/lamps-lab/ackextract.
Acknowledgments
This work was partially supported by the Defense
Advanced Research Projects Agency (DARPA)
under cooperative agreement No. W911NF-19-2-
0272. The content of the information does not
necessarily reflect the position or the policy of the
Government, and no official endorsement should
be inferred.
References
Alan Akbik, Duncan Blythe, and Roland Vollgraf.
2018. Contextual string embeddings for sequence
labeling. In Proceedings of the 27th International
Conference on Computational Linguistics, pages
1638–1649, Santa Fe, New Mexico, USA. Associ-
ation for Computational Linguistics.
James Andreoni and Laura K. Gee. 2015. Gunning for
efficiency with third party enforcement in threshold
public goods. Experimental Economics, 18(1):154–
171.
Hannah Bast and Claudius Korzen. 2017. A bench-
mark and evaluation for text extraction from PDF.
In 2017 ACM/IEEE Joint Conference on Digital Li-
braries, JCDL 2017, Toronto, ON, Canada, June 19-
23, 2017, pages 99–108. IEEE Computer Society.
Steven Bird. 2006. NLTK: the natural language toolkit.
In ACL 2006, 21st International Conference on Com-
putational Linguistics and 44th Annual Meeting of
the Association for Computational Linguistics, Pro-
ceedings of the Conference, Sydney, Australia, 17-21
July 2006. The Association for Computer Linguis-
tics.
Blaise Cronin. 2001. Acknowledgement trends
in the research literature of information science.
Journal of Documentation, 57(3):427–433.
Isaac G. Councill, C. Lee Giles, Hui Han, and Eren
Manavoglu. 2005. Automatic acknowledgement in-
dexing: Expanding the semantics of contribution in
the citeseer digital library. In Proceedings of the 3rd
International Conference on Knowledge Capture, K-
CAP ’05, pages 19–26, New York, NY, USA. ACM.
Blaise Cronin, Gail McKenzie, Lourdes Rubio, and
Sherrill Weaver-Wozniak. 1993. Accounting for in-
fluence: Acknowledgments in contemporary sociol-
ogy. Journal of the American Society for Informa-
tion Science, 44(7):406–412.
Blaise Cronin, Gail McKenzie, and Michael Stiffler.
1992. Patterns of acknowledgement. Journal of
Documentation.
S. Dai, Y. Ding, Z. Zhang, W. Zuo, X. Huang, and
S. Zhu. 2019. GrantExtractor: Accurate grant sup-
port information extraction from biomedical full text
based on BiLSTM-CRF. IEEE/ACM Transactions on
Computational Biology and Bioinformatics, pages
1–1.
Anthony Fader, Stephen Soderland, and Oren Etzioni.
2011. Identifying relations for open information ex-
traction. In Proceedings of the 2011 Conference on
Empirical Methods in Natural Language Processing,
EMNLP 2011, 27-31 July 2011, John McIntyre Con-
ference Centre, Edinburgh, UK, A meeting of SIG-
DAT, a Special Interest Group of the ACL, pages
1535–1545.
C. Lee Giles, Kurt D. Bollacker, and Steve Lawrence.
1998. CiteSeer: An automatic citation indexing sys-
tem. In Proceedings of the 3rd ACM International
Conference on Digital Libraries, June 23-26, 1998,
Pittsburgh, PA, USA, pages 89–98.
C Lee Giles and Isaac G Councill. 2004. Who
gets acknowledged: Measuring scientific contribu-
tions through automatic acknowledgment indexing.
Proceedings of the National Academy of Sciences,
101(51):17599–17604.
Matthew Honnibal and Ines Montani. 2017. spaCy 2:
Natural language understanding with Bloom embed-
dings, convolutional neural networks and incremen-
tal parsing. To appear.
Subhradeep Kayal, Zubair Afzal, George Tsatsaronis,
Marius Doornenbal, Sophia Katrenko, and Michelle
Gregory. 2019. A framework to automatically ex-
tract funding information from text. In Machine
Learning, Optimization, and Data Science, pages
317–328, Cham. Springer International Publishing.
Madian Khabsa, Pucktada Treeratpituk, and C. Lee
Giles. 2012. Ackseer: a repository and search en-
gine for automatically extracted acknowledgments
from digital libraries. In Proceedings of the
12th ACM/IEEE-CS Joint Conference on Digital Li-
braries, JCDL ’12, Washington, DC, USA, June 10-
14, 2012, pages 185–194. ACM.
Mario Lipinski, Kevin Yao, Corinna Breitinger, Jo-
eran Beel, and Bela Gipp. 2013. Evaluation of
header metadata extraction approaches and tools for
scientific pdf documents. In Proceedings of the
13th ACM/IEEE-CS Joint Conference on Digital Li-
braries, JCDL ’13, pages 385–386, New York, NY,
USA. ACM.
Kyle Lo, Lucy Lu Wang, Mark Neumann, Rodney Kin-
ney, and Dan S. Weld. 2019. S2ORC: The Semantic
Scholar open research corpus.
Patrice Lopez. 2009. Grobid: Combining auto-
matic bibliographic data recognition and term ex-
traction for scholarship publications. In Proceed-
ings of the 13th European Conference on Re-
search and Advanced Technology for Digital Li-
braries, ECDL’09, pages 473–474, Berlin, Heidel-
berg. Springer-Verlag.
Christopher D. Manning, Mihai Surdeanu, John Bauer,
Jenny Finkel, Steven J. Bethard, and David Mc-
Closky. 2014. The Stanford CoreNLP natural lan-
guage processing toolkit. In Association for Compu-
tational Linguistics (ACL) System Demonstrations,
pages 55–60.
Mausam, Michael Schmitz, Robert Bart, Stephen
Soderland, and Oren Etzioni. 2012. Open language
learning for information extraction. In Proceedings
of the 2012 Joint Conference on Empirical Methods
in Natural Language Processing and Computational
Natural Language Learning, EMNLP-CoNLL ’12,
pages 523–534, Stroudsburg, PA, USA. Association
for Computational Linguistics.
Adèle Paul-Hus, Adrián A. Díaz-Faes, Maxime Sainte-
Marie, Nadine Desrochers, Rodrigo Costas, and Vin-
cent Larivière. 2017. Beyond funding: Acknowl-
edgement patterns in biomedical, natural and social
sciences. PLOS ONE, 12(10):1–14.
Adèle Paul-Hus, Philippe Mongeon, Maxime Sainte-
Marie, and Vincent Larivière. 2020. Who are the
acknowledgees? an analysis of gender and academic
status. Quantitative Science Studies, 0(0):1–17.
Peng Qi, Yuhao Zhang, Yuhui Zhang, Jason Bolton,
and Christopher D. Manning. 2020. Stanza: A
python natural language processing toolkit for many
human languages. CoRR, abs/2003.07082.
Radim Řehůřek and Petr Sojka. 2010. Software
Framework for Topic Modelling with Large Cor-
pora. In Proceedings of the LREC 2010 Workshop
on New Challenges for NLP Frameworks, pages 45–
50, Valletta, Malta. ELRA. http://is.muni.cz/
publication/884893/en.
Peng Shi and Jimmy Lin. 2019. Simple BERT mod-
els for relation extraction and semantic role labeling.
CoRR, abs/1904.05255.
Mayank Singh, Barnopriyo Barua, Priyank Palod,
Manvi Garg, Sidhartha Satapathy, Samuel Bushi,
Kumar Ayush, Krishna Sai Rohith, Tulasi Gamidi,
Pawan Goyal, and Animesh Mukherjee. 2016.
OCR++: A robust framework for information extrac-
tion from scholarly articles. In COLING 2016, 26th
International Conference on Computational Linguis-
tics, Proceedings of the Conference: Technical Pa-
pers, December 11-16, 2016, Osaka, Japan, pages
3390–3400. ACL.
E. Smith, B. Williams-Jones, Z. Master, V. Larivière,
C. R. Sugimoto, A. Paul-Hus, M. Shi, and D. B.
Resnik. 2019. Misconduct and misbehavior related
to authorship disagreements in collaborative science.
Sci Eng Ethics.
Min Song, Keun Young Kang, Tatsawan Timakum, and
Xinyuan Zhang. 2020. Examining influential fac-
tors for acknowledgements classification using su-
pervised learning. PLOS ONE, 15(2):1–21.
H. Stewart, K. Brown, A. M. Dinan, N. Irigoyen, E. J.
Snijder, and A. E. Firth. 2018. Transcriptional and
translational landscape of equine torovirus. J Virol,
92(17).
Lucy Lu Wang, Kyle Lo, Yoganand Chandrasekhar,
Russell Reas, Jiangjiang Yang, Darrin Eide, Kathryn
Funk, Rodney Kinney, Ziyang Liu, William Mer-
rill, Paul Mooney, Dewey Murdick, Devvret Rishi,
Jerry Sheehan, Zhihong Shen, Brandon Stilson,
Alex D. Wade, Kuansan Wang, Chris Wilhelm, Boya
Xie, Douglas Raymond, Daniel S. Weld, Oren Et-
zioni, and Sebastian Kohlmeier. 2020. CORD-19:
The COVID-19 Open Research Dataset. CoRR,
abs/2004.10706.
Jian Wu, Jason Killian, Huaiyu Yang, Kyle Williams,
Sagnik Ray Choudhury, Suppawong Tuarob, Cor-
nelia Caragea, and C. Lee Giles. 2015. PDFMEF:
A multi-entity knowledge extraction framework for
scholarly documents and semantic search. In Pro-
ceedings of the 8th International Conference on
Knowledge Capture, K-CAP 2015, pages 13:1–13:8,
New York, NY, USA. ACM.
 
Science 7 - LAND and SEA BREEZE and its Characteristics
Science 7 - LAND and SEA BREEZE and its CharacteristicsScience 7 - LAND and SEA BREEZE and its Characteristics
Science 7 - LAND and SEA BREEZE and its CharacteristicsKarinaGenton
 
Mastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory InspectionMastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory InspectionSafetyChain Software
 
Hybridoma Technology ( Production , Purification , and Application )
Hybridoma Technology  ( Production , Purification , and Application  ) Hybridoma Technology  ( Production , Purification , and Application  )
Hybridoma Technology ( Production , Purification , and Application ) Sakshi Ghasle
 
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...Marc Dusseiller Dusjagr
 
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdfBASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdfSoniaTolstoy
 
ENGLISH5 QUARTER4 MODULE1 WEEK1-3 How Visual and Multimedia Elements.pptx
ENGLISH5 QUARTER4 MODULE1 WEEK1-3 How Visual and Multimedia Elements.pptxENGLISH5 QUARTER4 MODULE1 WEEK1-3 How Visual and Multimedia Elements.pptx
ENGLISH5 QUARTER4 MODULE1 WEEK1-3 How Visual and Multimedia Elements.pptxAnaBeatriceAblay2
 
Enzyme, Pharmaceutical Aids, Miscellaneous Last Part of Chapter no 5th.pdf
Enzyme, Pharmaceutical Aids, Miscellaneous Last Part of Chapter no 5th.pdfEnzyme, Pharmaceutical Aids, Miscellaneous Last Part of Chapter no 5th.pdf
Enzyme, Pharmaceutical Aids, Miscellaneous Last Part of Chapter no 5th.pdfSumit Tiwari
 
How to Configure Email Server in Odoo 17
How to Configure Email Server in Odoo 17How to Configure Email Server in Odoo 17
How to Configure Email Server in Odoo 17Celine George
 
Blooming Together_ Growing a Community Garden Worksheet.docx
Blooming Together_ Growing a Community Garden Worksheet.docxBlooming Together_ Growing a Community Garden Worksheet.docx
Blooming Together_ Growing a Community Garden Worksheet.docxUnboundStockton
 
History Class XII Ch. 3 Kinship, Caste and Class (1).pptx
History Class XII Ch. 3 Kinship, Caste and Class (1).pptxHistory Class XII Ch. 3 Kinship, Caste and Class (1).pptx
History Class XII Ch. 3 Kinship, Caste and Class (1).pptxsocialsciencegdgrohi
 
Final demo Grade 9 for demo Plan dessert.pptx
Final demo Grade 9 for demo Plan dessert.pptxFinal demo Grade 9 for demo Plan dessert.pptx
Final demo Grade 9 for demo Plan dessert.pptxAvyJaneVismanos
 
Computed Fields and api Depends in the Odoo 17
Computed Fields and api Depends in the Odoo 17Computed Fields and api Depends in the Odoo 17
Computed Fields and api Depends in the Odoo 17Celine George
 

Recently uploaded (20)

Organic Name Reactions for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions  for the students and aspirants of Chemistry12th.pptxOrganic Name Reactions  for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions for the students and aspirants of Chemistry12th.pptx
 
Science lesson Moon for 4th quarter lesson
Science lesson Moon for 4th quarter lessonScience lesson Moon for 4th quarter lesson
Science lesson Moon for 4th quarter lesson
 
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
 
Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝
 
Painted Grey Ware.pptx, PGW Culture of India
Painted Grey Ware.pptx, PGW Culture of IndiaPainted Grey Ware.pptx, PGW Culture of India
Painted Grey Ware.pptx, PGW Culture of India
 
Presiding Officer Training module 2024 lok sabha elections
Presiding Officer Training module 2024 lok sabha electionsPresiding Officer Training module 2024 lok sabha elections
Presiding Officer Training module 2024 lok sabha elections
 
Science 7 - LAND and SEA BREEZE and its Characteristics
Science 7 - LAND and SEA BREEZE and its CharacteristicsScience 7 - LAND and SEA BREEZE and its Characteristics
Science 7 - LAND and SEA BREEZE and its Characteristics
 
Mastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory InspectionMastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory Inspection
 
Model Call Girl in Bikash Puri Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Bikash Puri  Delhi reach out to us at 🔝9953056974🔝Model Call Girl in Bikash Puri  Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Bikash Puri Delhi reach out to us at 🔝9953056974🔝
 
Hybridoma Technology ( Production , Purification , and Application )
Hybridoma Technology  ( Production , Purification , and Application  ) Hybridoma Technology  ( Production , Purification , and Application  )
Hybridoma Technology ( Production , Purification , and Application )
 
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
 
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdfBASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdf
 
ENGLISH5 QUARTER4 MODULE1 WEEK1-3 How Visual and Multimedia Elements.pptx
ENGLISH5 QUARTER4 MODULE1 WEEK1-3 How Visual and Multimedia Elements.pptxENGLISH5 QUARTER4 MODULE1 WEEK1-3 How Visual and Multimedia Elements.pptx
ENGLISH5 QUARTER4 MODULE1 WEEK1-3 How Visual and Multimedia Elements.pptx
 
Enzyme, Pharmaceutical Aids, Miscellaneous Last Part of Chapter no 5th.pdf
Enzyme, Pharmaceutical Aids, Miscellaneous Last Part of Chapter no 5th.pdfEnzyme, Pharmaceutical Aids, Miscellaneous Last Part of Chapter no 5th.pdf
Enzyme, Pharmaceutical Aids, Miscellaneous Last Part of Chapter no 5th.pdf
 
How to Configure Email Server in Odoo 17
How to Configure Email Server in Odoo 17How to Configure Email Server in Odoo 17
How to Configure Email Server in Odoo 17
 
Blooming Together_ Growing a Community Garden Worksheet.docx
Blooming Together_ Growing a Community Garden Worksheet.docxBlooming Together_ Growing a Community Garden Worksheet.docx
Blooming Together_ Growing a Community Garden Worksheet.docx
 
History Class XII Ch. 3 Kinship, Caste and Class (1).pptx
History Class XII Ch. 3 Kinship, Caste and Class (1).pptxHistory Class XII Ch. 3 Kinship, Caste and Class (1).pptx
History Class XII Ch. 3 Kinship, Caste and Class (1).pptx
 
Final demo Grade 9 for demo Plan dessert.pptx
Final demo Grade 9 for demo Plan dessert.pptxFinal demo Grade 9 for demo Plan dessert.pptx
Final demo Grade 9 for demo Plan dessert.pptx
 
Staff of Color (SOC) Retention Efforts DDSD
Staff of Color (SOC) Retention Efforts DDSDStaff of Color (SOC) Retention Efforts DDSD
Staff of Color (SOC) Retention Efforts DDSD
 
Computed Fields and api Depends in the Odoo 17
Computed Fields and api Depends in the Odoo 17Computed Fields and api Depends in the Odoo 17
Computed Fields and api Depends in the Odoo 17
 

Acknowledgement Entity Recognition In CORD-19 Papers

The upper panel of Figure 1 shows examples of sentences appearing in an isolated acknowledgement section. "Utrecht University" is mentioned but should not be counted as an acknowledgement entity, because it is merely the affiliation of "Arno van Vliet", who is acknowledged. Acknowledgement statements can also appear in footnotes (Figure 1, bottom), mixed with other footnote and/or body text. Author names may also appear in the statements, such as "Andreoni" in this example, and should be excluded.

Existing works on AER leverage off-the-shelf named entity recognition (NER) packages such as the Natural Language Toolkit (NLTK) (Bird, 2006), e.g., Khabsa et al. (2012) and Paul-Hus et al. (2020), followed by simple semi-manual data cleansing, which leaves a fraction of entities that are mentioned but not actually acknowledged.
Figure 1: Upper: acknowledgement statements appearing in an isolated section (Stewart et al., 2018). Bottom: acknowledgement statements appearing in a footnote (Andreoni and Gee, 2015).

In this paper, we design an automatic AER system called ACKEXTRACT that further classifies the extraction results of open-source NER packages, which recognize people and organizations, and distinguishes the entities that are actually acknowledged. The extractor finds acknowledgement statements in isolated sections and in other locations such as footnotes, which is common for papers in the social and behavioral sciences. Our contributions are:

1. Develop ACKEXTRACT as open-source software to automatically extract acknowledgement entities from research papers.

2. Apply ACKEXTRACT to the CORD-19 dataset and supplement the dataset with a corpus of classified acknowledgement entities for further studies.

3. Use the CORD-19 dataset as a case study to demonstrate that acknowledgement studies that do not classify named entities can significantly overestimate the number of entities that are actually acknowledged, because many people and organizations are mentioned but not explicitly acknowledged.

2 Related Works

Early work on acknowledgement extraction was performed manually, which was labor-intensive. Cronin et al. (1993) extracted 9,561 peer interactive communication (PIC) names, mostly persons' names, from 4,200 sociology research articles. They also defined six categories of acknowledgement: moral support, financial support, editorial support, presentational support, instrumental/technical support, and conceptual support, or PIC (Cronin et al., 1992).

Councill et al. (2005) used a hybrid method for automatic AER from research papers and automatically created an acknowledgement index (Giles and Councill, 2004). The algorithm first used a heuristic method to identify acknowledgement passages, then an SVM model to identify lines containing acknowledgement sentences outside labeled acknowledgement sections, and finally regular expressions to extract entity names from the acknowledging text. This method achieved an overall precision of about 0.785 and a recall of 0.896 on CiteSeer papers (Giles et al., 1998). The algorithm does not distinguish entity types.

Khabsa et al. (2012) leveraged OpenCalais [1] and AlchemyAPI [2], free services at that time, to extract named entities from acknowledgement sections and built ACKSEER, a search engine for acknowledgement entities. They merged the outputs of both NER APIs into a single list and disambiguated entity mentions using the longest common subsequence (LCS) algorithm. Their ground truth contained 200 top-cited CiteSeerX papers, 130 of which had acknowledgement sections. They achieved 92.3% precision and 91.6% recall for acknowledgement section extraction but did not evaluate entity extraction.

Recent studies of acknowledgements tend to use the output of off-the-shelf NER packages with simple filters, assuming that all named entities are acknowledged entities. For example, Paul-Hus et al. (2020) use the Stanford NER module in NLTK to extract persons, and Song et al. (2020) directly use the people and organizations recognized by Stanford CoreNLP (Manning et al., 2014).
These works achieve high recall by recognizing most named entities in the acknowledgements, but they ignore the relations between those entities and the papers in which they appear, so a fraction of the extracted entities are mentioned but not actually acknowledged. Song et al. (2020) do consider grammatical structure, such as verb tense, voice, and sentence patterns, when assigning sentences to their six categories; for example, "was funded" is followed by an "organization". However, they only label sentences and do not annotate them down to the entity level.

[1] https://web.archive.org/web/20081023021111/http://opencalais.com/calaisAPI#
[2] https://web.archive.org/web/20090921044923/http://www.alchemyapi.com/
Figure 2: Architecture of ACKEXTRACT.

Our system examines the relationship between entities and the current work, with the purpose of discriminating acknowledgement entities from named entities. In this work, we focus on people and organizations.

Recently, Dai et al. (2019) proposed GrantExtractor, a pipeline system that extracts grant support information from scientific papers. A model combining BiLSTM-CRF and pattern matching extracts grant numbers and agencies from funding sentences, which are identified using heuristic methods. The system achieves a micro-F1 of up to 0.90 in extracting (agency, number) grant pairs. Kayal et al. (2019) proposed an ensemble approach called FUNDINGFINDER for extracting funding information from text. The authors construct feature vectors for candidate entities based on whether the entities are recognized by four NER implementations: Stanford (conditional random field model), LingPipe (hidden Markov model), OpenNLP (maximum entropy model), and Elsevier's Fingerprint Engine. The F1-measure for funding body is only 0.68 ± 0.3.

Our method differs from existing methods in three ways: (1) it is built on top of state-of-the-art neural NER methods, which yields relatively high recall; (2) it uses a heuristic method to filter out entities that are merely mentioned but not acknowledged; and (3) it extracts both organizations and people.

3 Dataset

On March 16, 2020, the Allen Institute for Artificial Intelligence, in collaboration with several other institutions, released the first version of the COVID-19 Open Research Dataset (CORD-19) (Wang et al., 2020). The dataset contains metadata and segmented full text of research articles selected by searching for coronavirus, SARS-CoV, MERS, and other related keywords in four digital libraries: WHO, PubMed Central, BioRxiv, and MedRxiv. The initial dataset contained about 28k papers and was updated weekly with papers from new sources and the latest publications. We used the dataset released on April 10, 2020, containing 59,312 metadata records, 54,756 of which have full text in JSON format. The CORD-19 full text was generated by processing PDFs with the S2ORC pipeline (Lo et al., 2019), in which GROBID (Lopez, 2009) is employed for document segmentation and metadata extraction. The full text in the JSON files is used directly for sentence segmentation.

However, we observe that the GROBID extraction results are not perfect. In particular, we estimate the fraction of acknowledgement sections omitted, as well as the number of acknowledgement entities omitted from the data release due to GROBID extraction errors. To do this, we downloaded 45,916 full-text PDF papers collected by the Internet Archive (IA), because the CORD-19 dataset does not include PDF files [3]. We found 13,103 CORD-19 papers in the IA dataset.

[3] https://archive.org/download/covid19_fatcat_20200410
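Because the CORD-19 full text ships as JSON, locating candidate acknowledgement paragraphs reduces to a scan over the parsed records. The sketch below assumes the published CORD-19 JSON schema (body_text and back_matter lists of blocks with text and section fields); the file path in the comment is hypothetical and may differ between releases.

```python
import json
from pathlib import Path

def candidate_ack_paragraphs(json_path):
    """Collect paragraphs whose section header mentions acknowledgements."""
    paper = json.loads(Path(json_path).read_text())
    paragraphs = []
    # Acknowledgements usually land in back_matter, but the extraction
    # sometimes leaves them in body_text, so both lists are scanned.
    for block in paper.get("back_matter", []) + paper.get("body_text", []):
        if "acknowl" in block.get("section", "").lower():
            paragraphs.append(block["text"])
    return paragraphs

# Example (hypothetical path):
# candidate_ack_paragraphs("document_parses/pdf_json/<sha>.json")
```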
4 Acknowledgement Extraction

4.1 Overview

The architecture of our acknowledgement entity extraction system is depicted in Figure 2. The system can be divided into the following modules (a toy end-to-end sketch of modules 2-4 follows the list).

1. Document segmentation. CORD-19 provides full text as JSON files, but in general most research articles are published as PDF, so our first step is to convert a PDF document to text and segment it into sections. We use GROBID, which has shown superior performance over many other document extraction methods (Lipinski et al., 2013). The output is an XML file in the TEI schema.

2. Sentence segmentation. Paragraphs are segmented into sentences. We compare several sentence segmentation software packages and choose Stanza (Qi et al., 2020) because of its relatively high accuracy.

3. Sentence classification. Sentences are classified into acknowledgement and non-acknowledgement statements. The result is a set of acknowledgement statements found inside or outside the acknowledgement sections.

4. Entity recognition. Named entities are extracted from the acknowledgement statements. We compare four commonly used NER software packages and choose Stanza because of its relatively high performance. In this work, we focus on persons and organizations.

5. Entity classification. In this module we classify named entities by analyzing sentence structure, aiming to discriminate named entities that are actually acknowledged from those that are merely mentioned. We demonstrate that triple extraction packages such as REVERB and OLLIE fail to handle acknowledgement statements with multiple entities in their objects in our dataset. The results are acknowledgement entities, i.e., people or organizations.
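For orientation, here is a minimal, runnable compression of modules 2-4 on a single paragraph, using Stanza for both sentence segmentation and NER. The cue regex is a toy stand-in for the full rule set described in Section 4.4, and the example paragraph is invented.

```python
import re
import stanza

# Toy cue pattern standing in for the full rule set of module 3.
ACK_CUE = re.compile(r"\b(thank|grateful|gratitude|indebted|supported|funded)", re.I)

# Assumes the English models were fetched once with stanza.download("en").
nlp = stanza.Pipeline(lang="en", processors="tokenize,ner")

paragraph = ("We thank Jane Doe for helpful comments. "
             "This work was funded by the National Science Foundation. "
             "The funders played no role in the study.")

doc = nlp(paragraph)
for sent in doc.sentences:                 # module 2: sentence segmentation
    if not ACK_CUE.search(sent.text):      # module 3: sentence classification
        continue
    for ent in sent.ents:                  # module 4: named entity recognition
        if ent.type in ("PERSON", "ORG"):
            print(ent.type, ent.text)
```

The third sentence contains no cue word and is filtered out, which is exactly the behavior module 3 is designed for. Module 5 then decides which of the surviving entities are actually acknowledged.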
4.2 Document Segmentation

The majority of scholarly papers are published in PDF format, which is not readily readable by text processors. Several attempts have been made to convert PDF into text (Bast and Korzen, 2017) and to segment documents to the section and sub-section level. GROBID is a machine learning library for extracting, parsing, and re-structuring raw documents into TEI-encoded documents. Other similar tools have been developed recently, such as OCR++ (Singh et al., 2016) and Science Parse [4]. Lipinski et al. (2013) compared seven metadata extraction methods and found that GROBID (version 0.4.0) achieved superior performance over the others. GROBID trains a cascade of conditional random field (CRF) models on PubMed and computer science papers. The recent version (0.6.0) offers a set of powerful functionalities, such as extracting and parsing headers and segmenting full text. GROBID supports a batch mode and an API service mode; the latter enables large-scale document processing on multi-core servers, as in CiteSeerX (Wu et al., 2015). A benchmarking result for version 0.6.0 shows that section title parsing achieves F1 = 0.70 under the strict matching criterion and F1 = 0.75 under the soft matching criterion [5]. Singh et al. (2016) claim that OCR++ achieves better performance than GROBID on several fields when evaluated on computer science papers; however, the lack of a service-mode API and of multi-domain adaptability limits its usability. Science Parse only extracts key metadata such as title, authors, year, and venue. Therefore, we adopt GROBID to convert PDF documents into XML files.

Depending on the structure and provenance of the PDFs, GROBID may miss acknowledgements in certain papers. To estimate the fraction of papers in which acknowledgements were missed by GROBID, we visually inspected a random sample of 200 papers from the CORD-19 dataset and found that 146 papers (73%) contain acknowledgement statements, out of which GROBID successfully extracted all acknowledgement statements from 120 papers (82%). Of the remaining 26 papers that GROBID failed to parse, 17 have acknowledgements in sections and 9 in footnotes. We developed a heuristic method that extracts acknowledgement statements from all 120 papers with acknowledgement statements output by GROBID.

[4] https://github.com/allenai/spv2
[5] https://grobid.readthedocs.io/en/latest/Benchmarking-pmc/
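A minimal sketch of the GROBID interaction follows. processFulltextDocument is GROBID's standard REST route; the localhost URL assumes a default local deployment of the service.

```python
import requests

def pdf_to_tei(pdf_path, grobid_url="http://localhost:8070"):
    """Send a PDF to a running GROBID service and return the TEI XML."""
    with open(pdf_path, "rb") as f:
        resp = requests.post(
            f"{grobid_url}/api/processFulltextDocument",
            files={"input": f},   # GROBID expects the PDF in the "input" field
            timeout=120,
        )
    resp.raise_for_status()
    return resp.text  # TEI-encoded XML as a string
```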
4.3 Sentence Segmentation

The acknowledgement sections or statements extracted above are paragraphs, which need to be segmented (tokenized) into sentences. We compared four software packages for sentence segmentation: NLTK (Bird, 2006), Stanza (Qi et al., 2020), Gensim (Řehůřek and Sojka, 2010), and the Pragmatic Segmenter [6].

NLTK includes a sentence tokenization method, sent_tokenize(), which uses an unsupervised algorithm to build a model for abbreviated words, collocations, and sentence-initial words, and then uses that model to find sentence boundaries. Stanza is a Python natural language analysis package developed by the Stanford NLP group; it models sentence segmentation as a tagging problem over character sequences, where a neural model predicts whether a given character ends a sentence. The split_sentences() function in the Gensim package splits a text string into a list of sentences using unsupervised pattern recognition. The Pragmatic Segmenter is a rule-based Ruby gem for sentence boundary detection that works out of the box across many languages.

To compare these methods, we created a ground-truth corpus by randomly selecting acknowledgement sections or statements from 47 papers and manually segmenting them, resulting in 100 sentences. Table 1 shows the comparison results. Precision is the number of correctly segmented sentences divided by the total number of sentences produced by the segmenter; recall is the number of correctly segmented sentences divided by the total number of manually segmented sentences. Stanza outperforms the other three, achieving F1 = 0.84.

  Method     Precision  Recall  F1
  Gensim     0.65       0.64    0.65
  NLTK       0.72       0.69    0.70
  Pragmatic  0.86       0.76    0.81
  Stanza     0.81       0.88    0.84

Table 1: Sentence segmentation performance.

[6] https://github.com/diasks2/pragmatic_segmenter
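The sketch below implements these two definitions under an exact-match criterion (our reading of the evaluation): a predicted sentence counts as correct only if it is character-identical to a gold sentence.

```python
def segmentation_prf(predicted, gold):
    """Exact-match precision/recall/F1 over sentence strings."""
    correct = len(set(predicted) & set(gold))
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

gold = ["We thank A. Smith.", "This work was funded by NIH."]
# A segmenter that wrongly splits the second sentence in two:
pred = ["We thank A. Smith.", "This work was funded by", "NIH."]
print(segmentation_prf(pred, gold))  # (0.333..., 0.5, 0.4)
```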
4.4 Sentence Classification

Not all sentences in acknowledgement sections express acknowledgement; consider, for example, "The funders played no role in the study or preparation of the manuscript" (Song et al., 2020). In this module, we classify sentences into acknowledgement and non-acknowledgement statements. We developed a set of regular expressions that match verbs (e.g., "thank", "gratitude to", "indebted to"), adjectives (e.g., "grateful to"), and nouns (e.g., "helpful comments", "useful feedback") to cover as many cases as possible. To evaluate this method, we manually selected 100 sentences, 50 positive and 50 negative, from the sentences obtained in Section 4.3. Our method classified 96 of the 100 sentences correctly, an accuracy of 0.96.
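A minimal sketch of such a cue-based classifier is shown below. The handful of patterns is illustrative only; the full pattern set shipped with ACKEXTRACT is larger (see the project repository).

```python
import re

# Illustrative cue patterns, grouped as in the text: verbs/verb phrases,
# adjectives, and nouns/noun phrases, plus passive funding predicates.
CUES = [
    r"\bthanks?\b", r"\bgratitude to\b", r"\bindebted to\b",
    r"\bgrateful to\b",
    r"\bhelpful comments\b", r"\buseful feedback\b",
    r"\b(?:was|were|is|are) (?:supported|funded)\b",
]
ACK_RE = re.compile("|".join(CUES), re.IGNORECASE)

def is_acknowledgement(sentence):
    """True if the sentence matches at least one acknowledgement cue."""
    return ACK_RE.search(sentence) is not None

print(is_acknowledgement("We thank Jane Doe for helpful comments."))    # True
print(is_acknowledgement("The funders played no role in the study."))   # False
```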
4.5 Entity Recognition

In this step, named entities are extracted using state-of-the-art NER software packages, including NLTK (Bird, 2006), Stanford CoreNLP (Manning et al., 2014), spaCy (Honnibal and Montani, 2017), and Stanza (Qi et al., 2020). Stanza is a Python library offering fully neural pre-trained models that provided state-of-the-art performance on many raw-text processing tasks when it was released. Its NER model adopts the contextualized sequence tagger of Akbik et al. (2018): a Bi-LSTM character-level language model followed by a one-layer Bi-LSTM sequence tagger with a conditional random field (CRF) decoder. Although Stanza was developed based on Stanford CoreNLP, the two exhibit different performance on our NER task.

The ground truth was built by randomly selecting 100 acknowledgement paragraphs from sections, footnotes, and body text, and manually annotating person and organization entities, without discriminating whether they are acknowledged or not. This results in 146 person and 209 organization entities. The comparison results (Table 2) indicate that Stanza outperforms the other three overall, achieving F1 = 0.91 for persons and F1 = 0.72 for organizations. In particular, Stanza's recall is 9% higher than Stanford CoreNLP's.

  Method            Entity  Precision  Recall  F1
  NLTK              Person  0.45       0.68    0.55
                    Org.    0.59       0.77    0.67
  spaCy             Person  0.74       0.88    0.80
                    Org.    0.63       0.74    0.68
  Stanford CoreNLP  Person  0.88       0.87    0.87
                    Org.    0.68       0.80    0.73
  Stanza            Person  0.89       0.93    0.91
                    Org.    0.60       0.89    0.72

Table 2: Comparison of NER software packages.
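This kind of comparison can be reproduced on individual sentences as below. The sketch assumes the Stanza English models and spaCy's en_core_web_sm model are already installed; the example sentence is invented.

```python
import spacy
import stanza

text = "We thank Arno van Vliet of Utrecht University for helpful comments."

# Stanza: entities carry a .type such as PERSON or ORG.
snlp = stanza.Pipeline(lang="en", processors="tokenize,ner")
print([(e.text, e.type) for e in snlp(text).ents])

# spaCy: entities carry a .label_ such as PERSON or ORG.
spcy = spacy.load("en_core_web_sm")
print([(e.text, e.label_) for e in spcy(text).ents])
```

Note that both tools will report "Utrecht University" as an ORG; deciding that it is merely an affiliation, not an acknowledged entity, is the job of the next module.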
4.6 Entity Classification

As shown above, not all named entities are acknowledged; recall "Utrecht University" in Figure 1. It is therefore necessary to build a classifier that discriminates acknowledgement entities (entities thanked by the paper or its authors) from other named entities.

The majority of acknowledgement statements in academic articles have a relatively standard subject-predicate-object (SPO) structure. They use a limited number of words or phrases, such as "thank", "acknowledge", "are grateful", "is supported", and "is funded", as predicates. However, the object can contain multiple named entities, some of which serve only as attributes of the others. In rare cases, a sentence may have no subject or predicate, such as the first sentence in Figure 4.

Our approach can be divided into three steps. Two representative examples are illustrated in Figure 3, and the pseudo-code is shown in Algorithm 1.

Step 1: We resolve the voice (active or passive), subject, and predicate of a sentence using dependency parsing with Stanza, because named entities can appear in either the subject or the object. We then locate all named entities. The semantic meaning of a predicate and its voice determine whether the acknowledged entities are in the object or the subject; in most cases the target entities are objects.
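A sketch of Step 1 with Stanza's dependency parser follows. The Universal Dependencies relation names (root, nsubj, nsubj:pass, aux:pass) are standard; collapsing them into a single active/passive flag is our simplification of the step described above.

```python
import stanza

# depparse requires the pos and lemma processors in the pipeline.
nlp = stanza.Pipeline(lang="en", processors="tokenize,pos,lemma,depparse")

def spo_voice(sentence_text):
    """Return (subject, predicate, voice) for the first sentence."""
    sent = nlp(sentence_text).sentences[0]
    predicate = next(w.text for w in sent.words if w.deprel == "root")
    subject = next((w.text for w in sent.words
                    if w.deprel in ("nsubj", "nsubj:pass")), None)
    passive = any(w.deprel in ("nsubj:pass", "aux:pass") for w in sent.words)
    return subject, predicate, "passive" if passive else "active"

print(spo_voice("We thank Anne Broadwater of the de Silva lab."))
# ('We', 'thank', 'active')  -> targets expected in the object
print(spo_voice("This work was supported by NIH."))
# ('work', 'supported', 'passive')  -> targets expected after the predicate
```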
Step 2: A sentence with multiple entities in its object is split into shorter sentences, called "subsentences", so that each subsentence is associated with at most one named entity. This is done by first splitting the sentence at "and". For each subsentence, if the subject and predicate are missing, we fill them in using the subject and predicate of the original sentence. The object of a subsentence does not necessarily contain an entity; for example, in the right panel of Figure 3, "expertise" is not a named entity, so it is replaced by "none" in the subsentence. SPO relations that do not contain named entities are removed.

There are two scenarios. In the first, a sentence or subsentence contains a list of entities of which only the first is acknowledged. In the third example of Figure 4, the named entities are ['Qi Yang', 'Morehouse School of Medicine', 'Atlanta', 'GA'], but only the first entity is acknowledged; the rest, recognized as organizations or locations, merely supply additional information. In this scenario, only the first named entity is extracted.

Step 3: In the second scenario, the acknowledged entities are connected by commas or "and", as in "We thank Shirley Hauta, Yurij Popowych, Elaine van Moorlehem and Yan Zhou for help in the virus production." Entities in this structure have the same type, indicating that they play similar roles, and this parallel pattern can be captured by regular expressions. The method is as follows. First, check whether each entity's type is person; if so, substitute the entities with integer indexes. The sentence becomes "The authors would like to thank 0, 1 and 2, and the kind support of Bayer Animal Health GmbH and Virbac Group." If there are three or more consecutive numbers in this form, it is the parallel pattern, which is captured by regular expressions; the pattern also allows text between names (Figure 5). The matched numbers are then mapped back to the corresponding entities; in the example above, the numbers [0, 1, 2] correspond to [Norbert Mencke, Lourdes Mottier, David McGahie]. Similar pattern matching is performed for organizations. The entities resolved in Steps 2 and 3 are merged to form the final set; the process is depicted in Figure 5.

Figure 3: Mapping multiple subject-predicate-object relations to acknowledgement entities, with two representative examples. "None" means the object does not contain a named entity.

Figure 5: Merging the results of Step 2 and Step 3 to obtain the final entity set.
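The index-substitution trick of Step 3 can be sketched as follows. The regular expression requires three or more indexes and, unlike the full method, does not allow extra text between names; it is a simplified stand-in, not the pattern used in the released code.

```python
import re

def parallel_persons(sentence, persons):
    """Return the person entities that form a parallel (list) structure."""
    indexed = sentence
    # Replace each person name with its integer index.
    for i, name in enumerate(persons):
        indexed = indexed.replace(name, str(i))
    # Match a run like "0, 1 and 2" (at least three indexes).
    run = re.search(r"\d+(?:\s*,\s*\d+)+(?:,?\s*and\s*\d+)", indexed)
    if not run:
        return []
    return [persons[int(i)] for i in re.findall(r"\d+", run.group())]

sent = ("The authors would like to thank Norbert Mencke, Lourdes Mottier "
        "and David McGahie, and the kind support of Bayer Animal Health.")
print(parallel_persons(sent, ["Norbert Mencke", "Lourdes Mottier",
                              "David McGahie"]))
# ['Norbert Mencke', 'Lourdes Mottier', 'David McGahie']
```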
We investigated open information extraction (OIE) methods such as REVERB (Fader et al., 2011) and OLLIE (Mausam et al., 2012). We found that they work only for sentences with relatively simple structures, such as "The authors wish to thank Ming-Wei Guo for reagents and technical assistance", but fail on more complicated sentences with long objects or parallel subsentences (Figure 4). We also investigated the semantic role labeling (SRL) library in the AllenNLP toolkit (Shi and Lin, 2019). For the third sentence in Figure 4, the AllenNLP SRL library resolves the entire string after "thank" as a single argument and fails to distinguish entity types. Our method handles all the statements in Figure 4.

Figure 4: Example sentences from which REVERB and OLLIE fail to extract the acknowledgement entities, which our method identifies correctly.

As a baseline, we use Stanza to extract all named entities and compare its performance with our relation-based (RB) classifier. The ground truth is compiled from the same corpus described in Section 4.5, except that only acknowledgement entities (as opposed to all named entities) are labeled positive. The results (Table 3) show that Stanza achieves high recall but poor precision, indicating that a significant fraction (about 40%) of named entities are not acknowledged. In contrast, our RB classifier achieves a precision of 0.94 with only a small loss of recall, for an overall F1 = 0.92.

  Method               Precision  Recall  F1
  Stanza               0.57       0.91    0.70
  RB (without Step 3)  0.94       0.78    0.85
  RB                   0.94       0.90    0.92

Table 3: Performance of Stanza and our relation-based (RB) classifier. Precision and recall are overall results combining persons and organizations.

One limitation of the RB classifier is that it relies on text quality: sentences are expected to follow editorial convention, with entity names clearly delimited by periods. If two sentences are not properly delimited, e.g., a period is missing, the classifier may make incorrect predictions.

Algorithm 1: Relation-based classifier.

  Input: sentence
  Output: entity_list_all
  function FIND_ENTITY(s):
      pre-process s (punctuation and formatting)
      entity_list <- candidate entities in s found by Stanza
      remove from entity_list every entity in the subject part
          (which part counts as the subject depends on the predicate)
      return entity_list
  find the subject part, the predicate, and the object part of sentence
  if predicate matches reg_expression_1 (targets in the object) then
      if "and" in sentence then
          split sentence into subsentences at "and"
          for each subsentence do
              entity_list <- FIND_ENTITY(subsentence)
              if entity_list is not empty then
                  append entity_list[0] to entity_list_all
      else
          entity_list <- FIND_ENTITY(sentence)
          append entity_list[0] to entity_list_all
  else if predicate matches reg_expression_2 (targets in the subject) then
      do the same, but collect entities from the subject part
  // Step 3: parallel pattern
  orig_list <- candidate entities of the target type found by Stanza
  sent <- sentence with each entity in orig_list replaced by its index
  if sent matches the parallel pattern then
      num_list <- numbers in the matched span
      append the entities of orig_list indexed by num_list
  merge the Step 2 and Step 3 results into entity_list_all
5 Data Analysis

The CORD-19 dataset contains PMC and PDF folders. We found that almost all papers in the PMC folders are also included in the PDF folders, so for consistency we only work on papers in the PDF folders: 37,612 unique PDF papers [7]. Using ACKEXTRACT, we extracted a total of 102,965 named entities from 21,469 CORD-19 papers. The fraction of papers with named entities (21,469/37,612 ≈ 57%) is roughly consistent with the fraction obtained from the 200-paper sample in Section 4.2 (120/200 ≈ 60%). Among all named entities, 58,742 are acknowledgement entities. These numbers suggest that, under our model, only about 50% of named entities are acknowledged. Acknowledgement studies that rely on entities recognized by NER software packages without further classification could therefore significantly overestimate the number of acknowledgement entities. Here, we analyze our results and study some trends of acknowledgement entities in the CORD-19 dataset.

[7] We excluded 2,003 duplicate papers with exactly the same SHA1 values.

The top 10 acknowledged organizations (Table 4) are all funding agencies. Overall, NIH (NIAID is an institute of NIH) is acknowledged the most, but funding agencies in other countries are also acknowledged frequently in CORD-19 papers.

  Organization                                                  Count
  National Institutes of Health (NIH)                            1414
  National Natural Science Foundation of China (NSFC)             615
  National Institute of Allergy & Infectious Diseases (NIAID)     209
  National Science Foundation (NSF)                               183
  Ministry of Health                                              160
  Wellcome Trust                                                  146
  Deutsche Forschungsgemeinschaft                                 119
  Ministry of Education                                           119
  National Science Council                                        104
  Public Health Service                                            85

Table 4: Top 10 acknowledged organizations identified in our CORD-19 dataset.

Figure 6 shows the numbers of CORD-19 papers with and without recognized acknowledgements (person or organization) from 1970 to 2020. The large jump around 2002 was due to the wave of coronavirus studies during the SARS period; the small drop in 2020 reflects data incompleteness. The plot indicates that the fraction of papers with acknowledgements has been increasing gradually over the last 20 years.

Figure 6: Numbers of CORD-19 papers with vs. without recognized acknowledgements, 1970 to 2020.

Figure 7 shows the counts of the top 10 acknowledged organizations from 1983 to 2020. The number of acknowledgements to NIH has been gradually decreasing over the past 10 years, while acknowledgements to NSF have stayed roughly constant; in contrast, acknowledgements to NSFC (a Chinese funding agency) have been gradually increasing. Note that the distribution of acknowledged organizations has a long tail, and organizations outside the top 10 actually dominate the total count. However, the top 10 are the largest research agencies, and the trend to some extent reflects strategic shifts in funding support.

Figure 7: Trends of the top 10 acknowledged organizations, 1983 to 2020.
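A tally like Table 4 reduces to a counter over the extracted organization entities, plus the alias merging implied by rows such as "National Institutes of Health (NIH)". The tuple layout and alias table below are illustrative assumptions, not the released data format.

```python
from collections import Counter

# Hypothetical extraction output: (paper_id, entity_type, entity_name).
entities = [
    ("p1", "ORG", "National Institutes of Health"),
    ("p2", "ORG", "NIH"),
    ("p2", "ORG", "Wellcome Trust"),
]

# Minimal alias table; the real merging must cover many more variants.
ALIASES = {"NIH": "National Institutes of Health"}

counts = Counter(ALIASES.get(name, name)
                 for _, etype, name in entities if etype == "ORG")
print(counts.most_common(10))
# [('National Institutes of Health', 2), ('Wellcome Trust', 1)]
```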
6 Conclusion and Future Work

Here, we extended the work of Khabsa et al. (2012) and built an acknowledgement extraction framework, ACKEXTRACT, for research articles. ACKEXTRACT is based on heuristic methods and state-of-the-art text mining libraries (e.g., GROBID and Stanza), but adds a classifier that discriminates acknowledgement entities from named entities by analyzing the multiple subject-predicate-object relations in a sentence. Our approach successfully recognizes acknowledgement entities that cannot be recognized by OIE packages such as REVERB and OLLIE or by the AllenNLP SRL library. We applied the method to the CORD-19 dataset released on April 10, 2020, processing one PDF document in 5 seconds on average. Our results indicate that only 50-60% of named entities are acknowledged; the rest are mentioned to provide additional information (e.g., affiliation or location) about acknowledgement entities. On clean data, our method achieves an overall F1 = 0.92 for person and organization entities. The trend analysis of the CORD-19 papers shows that more and more papers have included acknowledgement entities since 2002, when the SARS outbreak happened. It also reveals that the overall number of acknowledgements to NIH has gradually decreased over the past 10 years, while more papers acknowledge NSFC, a Chinese funding agency.

One caveat of our method is that organizations in different countries are not distinguished; for example, many countries have an agency called "Ministry of Health". In the future, we plan to build learning-based models for sentence classification and entity classification. The code and data of this project have been released on GitHub at https://github.com/lamps-lab/ackextract.

Acknowledgments

This work was partially supported by the Defense Advanced Research Projects Agency (DARPA) under cooperative agreement No. W911NF-19-2-0272. The content of the information does not necessarily reflect the position or the policy of the Government, and no official endorsement should be inferred.
References

Alan Akbik, Duncan Blythe, and Roland Vollgraf. 2018. Contextual string embeddings for sequence labeling. In Proceedings of the 27th International Conference on Computational Linguistics, pages 1638–1649, Santa Fe, New Mexico, USA. Association for Computational Linguistics.

James Andreoni and Laura K. Gee. 2015. Gunning for efficiency with third party enforcement in threshold public goods. Experimental Economics, 18(1):154–171.

Hannah Bast and Claudius Korzen. 2017. A benchmark and evaluation for text extraction from PDF. In 2017 ACM/IEEE Joint Conference on Digital Libraries, JCDL 2017, Toronto, ON, Canada, June 19-23, 2017, pages 99–108. IEEE Computer Society.

Steven Bird. 2006. NLTK: The Natural Language Toolkit. In ACL 2006, 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference, Sydney, Australia, 17-21 July 2006. The Association for Computer Linguistics.

Cronin Blaise. 2001. Acknowledgement trends in the research literature of information science. Journal of Documentation, 57(3):427–433.

Isaac G. Councill, C. Lee Giles, Hui Han, and Eren Manavoglu. 2005. Automatic acknowledgement indexing: Expanding the semantics of contribution in the CiteSeer digital library. In Proceedings of the 3rd International Conference on Knowledge Capture, K-CAP '05, pages 19–26, New York, NY, USA. ACM.

Blaise Cronin, Gail McKenzie, Lourdes Rubio, and Sherrill Weaver-Wozniak. 1993. Accounting for influence: Acknowledgments in contemporary sociology. Journal of the American Society for Information Science, 44(7):406–412.

Blaise Cronin, Gail McKenzie, and Michael Stiffler. 1992. Patterns of acknowledgement. Journal of Documentation.

S. Dai, Y. Ding, Z. Zhang, W. Zuo, X. Huang, and S. Zhu. 2019. GrantExtractor: Accurate grant support information extraction from biomedical fulltext based on Bi-LSTM-CRF. IEEE/ACM Transactions on Computational Biology and Bioinformatics, pages 1–1.

Anthony Fader, Stephen Soderland, and Oren Etzioni. 2011. Identifying relations for open information extraction. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, EMNLP 2011, 27-31 July 2011, Edinburgh, UK, pages 1535–1545.

C. Lee Giles, Kurt D. Bollacker, and Steve Lawrence. 1998. CiteSeer: An automatic citation indexing system. In Proceedings of the 3rd ACM International Conference on Digital Libraries, June 23-26, 1998, Pittsburgh, PA, USA, pages 89–98.

C. Lee Giles and Isaac G. Councill. 2004. Who gets acknowledged: Measuring scientific contributions through automatic acknowledgment indexing. Proceedings of the National Academy of Sciences, 101(51):17599–17604.

Matthew Honnibal and Ines Montani. 2017. spaCy 2: Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing. To appear.

Subhradeep Kayal, Zubair Afzal, George Tsatsaronis, Marius Doornenbal, Sophia Katrenko, and Michelle Gregory. 2019. A framework to automatically extract funding information from text. In Machine Learning, Optimization, and Data Science, pages 317–328, Cham. Springer International Publishing.

Madian Khabsa, Pucktada Treeratpituk, and C. Lee Giles. 2012. AckSeer: A repository and search engine for automatically extracted acknowledgments from digital libraries. In Proceedings of the 12th ACM/IEEE-CS Joint Conference on Digital Libraries, JCDL '12, Washington, DC, USA, June 10-14, 2012, pages 185–194. ACM.

Mario Lipinski, Kevin Yao, Corinna Breitinger, Joeran Beel, and Bela Gipp. 2013. Evaluation of header metadata extraction approaches and tools for scientific PDF documents. In Proceedings of the 13th ACM/IEEE-CS Joint Conference on Digital Libraries, JCDL '13, pages 385–386, New York, NY, USA. ACM.

Kyle Lo, Lucy Lu Wang, Mark Neumann, Rodney Kinney, and Dan S. Weld. 2019. S2ORC: The Semantic Scholar Open Research Corpus.

Patrice Lopez. 2009. GROBID: Combining automatic bibliographic data recognition and term extraction for scholarship publications. In Proceedings of the 13th European Conference on Research and Advanced Technology for Digital Libraries, ECDL'09, pages 473–474, Berlin, Heidelberg. Springer-Verlag.

Christopher D. Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven J. Bethard, and David McClosky. 2014. The Stanford CoreNLP natural language processing toolkit. In Association for Computational Linguistics (ACL) System Demonstrations, pages 55–60.

Mausam, Michael Schmitz, Robert Bart, Stephen Soderland, and Oren Etzioni. 2012. Open language learning for information extraction. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, EMNLP-CoNLL '12, pages 523–534, Stroudsburg, PA, USA. Association for Computational Linguistics.

Adèle Paul-Hus, Adrián A. Díaz-Faes, Maxime Sainte-Marie, Nadine Desrochers, Rodrigo Costas, and Vincent Larivière. 2017. Beyond funding: Acknowledgement patterns in biomedical, natural and social sciences. PLOS ONE, 12(10):1–14.

Adèle Paul-Hus, Philippe Mongeon, Maxime Sainte-Marie, and Vincent Larivière. 2020. Who are the acknowledgees? An analysis of gender and academic status. Quantitative Science Studies, 0(0):1–17.

Peng Qi, Yuhao Zhang, Yuhui Zhang, Jason Bolton, and Christopher D. Manning. 2020. Stanza: A Python natural language processing toolkit for many human languages. CoRR, abs/2003.07082.

Radim Řehůřek and Petr Sojka. 2010. Software framework for topic modelling with large corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, pages 45–50, Valletta, Malta. ELRA. http://is.muni.cz/publication/884893/en.

Peng Shi and Jimmy Lin. 2019. Simple BERT models for relation extraction and semantic role labeling. CoRR, abs/1904.05255.

Mayank Singh, Barnopriyo Barua, Priyank Palod, Manvi Garg, Sidhartha Satapathy, Samuel Bushi, Kumar Ayush, Krishna Sai Rohith, Tulasi Gamidi, Pawan Goyal, and Animesh Mukherjee. 2016. OCR++: A robust framework for information extraction from scholarly articles. In COLING 2016, 26th International Conference on Computational Linguistics, Proceedings of the Conference: Technical Papers, December 11-16, 2016, Osaka, Japan, pages 3390–3400. ACL.

E. Smith, B. Williams-Jones, Z. Master, V. Larivière, C. R. Sugimoto, A. Paul-Hus, M. Shi, and D. B. Resnik. 2019. Misconduct and misbehavior related to authorship disagreements in collaborative science. Science and Engineering Ethics.

Min Song, Keun Young Kang, Tatsawan Timakum, and Xinyuan Zhang. 2020. Examining influential factors for acknowledgements classification using supervised learning. PLOS ONE, 15(2):1–21.

H. Stewart, K. Brown, A. M. Dinan, N. Irigoyen, E. J. Snijder, and A. E. Firth. 2018. Transcriptional and translational landscape of equine torovirus. Journal of Virology, 92(17).

Lucy Lu Wang, Kyle Lo, Yoganand Chandrasekhar, Russell Reas, Jiangjiang Yang, Darrin Eide, Kathryn Funk, Rodney Kinney, Ziyang Liu, William Merrill, Paul Mooney, Dewey Murdick, Devvret Rishi, Jerry Sheehan, Zhihong Shen, Brandon Stilson, Alex D. Wade, Kuansan Wang, Chris Wilhelm, Boya Xie, Douglas Raymond, Daniel S. Weld, Oren Etzioni, and Sebastian Kohlmeier. 2020. CORD-19: The COVID-19 Open Research Dataset. CoRR, abs/2004.10706.

Jian Wu, Jason Killian, Huaiyu Yang, Kyle Williams, Sagnik Ray Choudhury, Suppawong Tuarob, Cornelia Caragea, and C. Lee Giles. 2015. PDFMEF: A multi-entity knowledge extraction framework for scholarly documents and semantic search. In Proceedings of the 8th International Conference on Knowledge Capture, K-CAP 2015, pages 13:1–13:8, New York, NY, USA. ACM.