Structured affiliations extraction from the scientific literature
1. Structured affiliations extraction from scientific literature
D. Tkaczyk, B. Tarnawski and Ł. Bolikowski
Interdisciplinary Centre for Mathematical and Computational Modelling
University of Warsaw
24 June 2015
4. This presentation
The presentation focuses on the following tasks:
extracting a list of authors from a paper
extracting a list of affiliations from a paper
establishing relations between extracted authors and affiliations
detecting institution, address and country in affiliation strings
5. Motivation
CERMINE can be used to:
extract high-quality metadata from large PDF collections when it is missing or fragmentary
provide intelligent user interfaces for metadata acquisition
6. Requirements
The metadata extraction system should be:
comprehensive,
automatic,
modular,
open and widely available,
easily applicable,
flexible and able to adapt to new layouts,
well tested.
8. The workflow
Figure: workflow diagram leading from the input PDF to the output XML record, with the steps:
structure extraction
classification
splitting and association
affiliation parsing
XML record generation
10. Content Classification
general classification (labels: metadata, references, body and other)
metadata classification (labels: abstract, bib_info, type, title, affiliation, author, keywords, correspondence, dates and editor)
SVM classifiers with 83 and 62 features, respectively: geometrical, lexical, sequential, formatting and heuristic
the best SVM parameters were found automatically by maximizing the mean F-score on a validation dataset
the classifiers were trained on 2,551 and 2,716 documents, respectively
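The parameter selection described above can be sketched as follows. This is a minimal illustration, assuming that "mean F-score" is the unweighted mean of the per-label F1 values; the `predict` callable is a hypothetical stand-in for training an SVM with the given parameters and classifying the validation zones, and is not part of CERMINE.

```python
from collections import Counter


def mean_f_score(true_labels, predicted_labels):
    """Unweighted mean of per-label F1 values (an assumed reading of 'mean F-score')."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for true, predicted in zip(true_labels, predicted_labels):
        if true == predicted:
            tp[true] += 1
        else:
            fp[predicted] += 1
            fn[true] += 1
    scores = []
    for label in set(true_labels) | set(predicted_labels):
        denominator = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denominator if denominator else 0.0)
    return sum(scores) / len(scores) if scores else 0.0


def select_best_parameters(candidates, predict, validation_zones, validation_labels):
    """Return the candidate whose predictions maximize the mean F-score.

    `predict(params, zones)` is hypothetical: it stands for training a
    classifier with `params` and labeling the validation zones.
    """
    return max(
        candidates,
        key=lambda params: mean_f_score(
            validation_labels, predict(params, validation_zones)
        ),
    )
```

Any search strategy (grid, random) can supply `candidates`; only the selection criterion is shown here.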
11. The output so far
TrueViz XML format:
hierarchical structure containing:
pages, zones, lines, words, characters
all elements have bounding boxes
reading order is given
zones have labels
<Page>...
<PageID Value="0"/>
<Zone>...
<ZoneID Value="0"/>
<ZoneCorners>
<Vertex x="55.320" y="34.295"/>
<Vertex x="235.704" y="58.295"/>
</ZoneCorners>
<ZoneNext Value="1"/>
<Category Value="TITLE"/>
<Line>...
<Word>...
<Character>...
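A short sketch of how zone labels and bounding boxes could be read from such a file. The sample below is a hypothetical, simplified fragment that reuses only the element names shown above (lines, words and characters are left out); real TrueViz files contain more elements.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified fragment built from the element names shown above.
SAMPLE = """
<Page>
  <PageID Value="0"/>
  <Zone>
    <ZoneID Value="0"/>
    <ZoneCorners>
      <Vertex x="55.320" y="34.295"/>
      <Vertex x="235.704" y="58.295"/>
    </ZoneCorners>
    <ZoneNext Value="1"/>
    <Category Value="TITLE"/>
  </Zone>
</Page>
"""


def read_zones(xml_text):
    """Return (zone_id, label, (x_min, y_min, x_max, y_max)) for each zone."""
    page = ET.fromstring(xml_text)
    zones = []
    for zone in page.iter("Zone"):
        vertices = list(zone.find("ZoneCorners").iter("Vertex"))
        xs = [float(vertex.get("x")) for vertex in vertices]
        ys = [float(vertex.get("y")) for vertex in vertices]
        zones.append(
            (
                zone.find("ZoneID").get("Value"),
                zone.find("Category").get("Value"),
                (min(xs), min(ys), max(xs), max(ys)),
            )
        )
    return zones


print(read_zones(SAMPLE))
```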
12. Authors and Affiliations Extraction
authors are split based on a list of separators
affiliation indexes are found using a list of index symbols and superscripts
association is done by the detected indexes
when affiliations are already assigned to authors:
the first line is assumed to be the author
the email is found by a regexp
the remaining part is treated as the affiliation
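Both cases above can be sketched in a few lines. This is only an illustration under stated assumptions: the separator list, the index symbols and the email pattern are guesses (the actual lists are not given here), indexes are read from plain text rather than from superscript formatting, each author carries at most one index, and all sample names are made up.

```python
import re

# Assumed conventions; the actual separator and index symbol lists are not given.
SEPARATOR = re.compile(r"\s*(?:,|;|\band\b|&)\s*")
INDEXED_NAME = re.compile(r"^(?P<name>.*?)\s*(?P<index>[0-9*†‡§]+)?$")
AFFILIATION = re.compile(r"^(?P<index>[0-9*†‡§]+)\s*(?P<text>.+)$")
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def split_authors(author_line):
    """Split an author line on separators and peel off a trailing index."""
    authors = []
    for part in SEPARATOR.split(author_line.strip()):
        if part:
            match = INDEXED_NAME.match(part)
            authors.append((match.group("name"), match.group("index")))
    return authors


def associate(author_line, affiliation_lines):
    """Pair each author with the affiliation that carries the same index."""
    affiliations = {}
    for line in affiliation_lines:
        match = AFFILIATION.match(line.strip())
        if match:
            affiliations[match.group("index")] = match.group("text")
    return [
        (name, affiliations.get(index))
        for name, index in split_authors(author_line)
    ]


def parse_author_block(block):
    """Second case: first line is the author, the rest is email plus affiliation."""
    lines = [line.strip() for line in block.strip().splitlines() if line.strip()]
    author, rest = lines[0], " ".join(lines[1:])
    email_match = EMAIL.search(rest)
    email = email_match.group(0) if email_match else None
    if email:
        rest = rest.replace(email, " ")
    affiliation = " ".join(rest.split()).strip(" ,;")
    return {"author": author, "email": email, "affiliation": affiliation}
```

For example, `associate("A. Author1, B. Writer2 and C. Third", ["1 University A", "2 Institute B"])` pairs the first two authors with their affiliations and leaves the third unassigned.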
13. Affiliation Parsing
Interdisciplinary Centre for Mathematical and
Computational Modelling, University of Warsaw,
ul. Pawińskiego 5A blok D, 02-106 Warsaw, Poland
affiliation parsing detects the institution, address and country
the implementation is based on a CRF token classifier with the following features:
the classified word itself,
whether the token is a number, an all-uppercase word, an all-lowercase word, or a lowercase word that starts with an uppercase letter,
whether the token is contained in dictionaries of countries or of words commonly appearing in institutions or addresses,
the features of the two preceding and the two following tokens.
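The feature list above can be illustrated with a small feature extractor. This sketch covers only the feature computation, not the CRF itself; the tokenizer, the feature names and the three tiny dictionaries are placeholders chosen for the example, not the ones used by the parser.

```python
import re

# Tiny placeholder dictionaries; the actual dictionaries are not listed here.
COUNTRY_WORDS = {"poland", "germany", "france"}
INSTITUTION_WORDS = {"university", "institute", "centre", "department"}
ADDRESS_WORDS = {"ul", "street", "road", "avenue"}


def tokenize(affiliation):
    """Split into word tokens and single punctuation tokens (an assumed scheme)."""
    return re.findall(r"\w+|[^\w\s]", affiliation)


def own_features(token):
    """Features that depend on the token alone."""
    features = ["W=" + token]
    if token.isdigit():
        features.append("IsNumber")
    elif token.isalpha():
        if token.isupper():
            features.append("AllUpperCase")
        elif token.islower():
            features.append("AllLowerCase")
        elif token[0].isupper() and token[1:].islower():
            features.append("StartsUpperCase")
    lowered = token.lower()
    if lowered in COUNTRY_WORDS:
        features.append("InCountryDict")
    if lowered in INSTITUTION_WORDS:
        features.append("InInstitutionDict")
    if lowered in ADDRESS_WORDS:
        features.append("InAddressDict")
    return features


def token_features(tokens, position, window=2):
    """Own features plus the features of up to `window` neighbors on each side."""
    features = list(own_features(tokens[position]))
    for offset in range(-window, window + 1):
        neighbor = position + offset
        if offset != 0 and 0 <= neighbor < len(tokens):
            features += [
                f"{name}@{offset:+d}" for name in own_features(tokens[neighbor])
            ]
    return features
```

Each token's feature list would then be passed to a sequence labeler that assigns one of the labels institution, address or country.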
15. Datasets
GROTOAP2: the evaluation and training of the zone classifiers
GROTOAP2-affiliations: the evaluation and training of the affiliation parser
PubMed Central Open Access Subset: the evaluation of the entire workflow
Figure: dataset construction diagram (labels: PubMed Central, PDF and NLM files, CERMINE tools, zone text matching, rules)
16. Zone Classification
2,551 documents from
GROTOAP2, containing:
355,779 zones
68,557 metadata zones
5-fold cross-validation
              metadata  other labels  precision  recall
metadata        66,372         2,185     96.8 %  97.0 %
other labels     2,052       285,170          -       -

              affiliation  other labels  precision  recall
affiliation         3,496           185     95.0 %  95.3 %
other labels          173        64,703          -       -
17. Affiliation Parsing
8,267 affiliations from PubMed Central
5-fold cross-validation
Token classification:
             address  country  institution  precision  recall
address       44,481       12        1,225     96.8 %  97.3 %
country           50    8,108            8     99.6 %  99.3 %
institution    1,434       18       92,457     98.7 %  98.5 %
Affiliation metadata extraction:
institution recognized in 92.4% of cases
address recognized in 92.2% of cases
country recognized in 99.5% of cases
92.1% of affiliations entirely correctly parsed
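The precision and recall columns of the token classification table can be recomputed from its counts. The sketch below assumes that rows hold the true labels and columns the predicted labels; under that assumption the recomputed values agree with the table.

```python
LABELS = ["address", "country", "institution"]

# Counts from the token classification table; rows are assumed to be
# true labels and columns predicted labels.
CONFUSION = [
    [44481, 12, 1225],
    [50, 8108, 8],
    [1434, 18, 92457],
]


def precision_recall(matrix, labels):
    """Return {label: (precision, recall)} in percent."""
    results = {}
    for i, label in enumerate(labels):
        correct = matrix[i][i]
        predicted = sum(row[i] for row in matrix)  # column total
        actual = sum(matrix[i])                    # row total
        results[label] = (100 * correct / predicted, 100 * correct / actual)
    return results


for label, (precision, recall) in precision_recall(CONFUSION, LABELS).items():
    print(f"{label:12s} precision {precision:.1f} %  recall {recall:.1f} %")
```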
18. Workflow Evaluation
1,943 documents from PMC
evaluated tasks:
extracting author strings
extracting affiliation strings
determining author-affiliation relations
determining author-affiliation relations when authors and affiliations are extracted flawlessly
Figure: bar chart of F1 (scale 0 to 100) for the tasks authors, affiliations, relations (total) and relations (perfect input), comparing the systems CERMINE, GROBID, ParsCit and PDFX
20. System Features
CERMINE extracts metadata and content from scholarly articles in PDF format
the system is based on a modular workflow
the implementation uses machine learning and heuristics
the default system is trained on large and diverse datasets
the source code is open and available on GitHub
CERMINE is available as a web service and RESTful services