Structured affiliations extraction from the scientific literature

D. Tkaczyk, B. Tarnawski and Ł. Bolikowski
Interdisciplinary Centre for Mathematical and Computational Modelling
University of Warsaw
24 June 2015

CERMINE system
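[Figure: an article page annotated with extracted metadata fields (TITLE, AUTHORS, AFFILIATIONS, EMAILS, ABSTRACT, KEYWORDS) and a bibliographic reference annotated with its fields (AUTHOR, TITLE, SOURCE, YEAR, PAGES, URL, VOLUME)]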
CERMINE analyses born-digital scientific articles and extracts:
document metadata, e.g. title, authors, abstract, keywords, publication date, ...
a list of parsed bibliographic references
full text with section hierarchy

This presentation
The presentation focuses on the following tasks:
extracting a list of authors from a paper
extracting a list of affiliations from a paper
establishing relations between extracted authors and affiliations
detecting institution, address and country in affiliation strings

Motivation
CERMINE can be used to:
extract high-quality metadata from large PDF collections, when it is missing or fragmentary
provide intelligent user interfaces for metadata acquisition

Requirements
The metadata extraction system should be:
comprehensive,
automatic,
modular,
open and widely available,
easily applicable,
flexible and able to adapt to new layouts,
well tested.

Architecture and Implementation

The workflow
[Figure: workflow diagram. A PDF file passes through structure extraction, classification, splitting and association, affiliation parsing, and XML record generation. The result is an XML record in which <author> elements reference <aff> elements, and each <aff> contains <inst>, <addr> and <country>.]
Layout Extraction
1 Character extraction: iText library
2 Page segmentation: Docstrum
3 Reading order resolution: bottom-up, heuristic-based

Content Classification
general classification (labels: metadata, references, body and other)
metadata classification (labels: abstract, bib_info, type, title, affiliation, author, keywords, correspondence, dates and editor)
SVM with 83 and 62 features: geometrical, lexical, sequential, formatting, heuristics
the best SVM parameters were found automatically by maximizing mean F-score on a validation dataset
classifiers are trained on 2,551 and 2,716 documents, respectively
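The parameter search mentioned above can be sketched as follows. This is a minimal illustration, assuming that "mean F-score" means the unweighted average of per-label F-scores; the `predict` callable and the parameter dictionaries are hypothetical stand-ins for training an SVM and classifying the validation zones.

```python
def f_score(true, pred, label):
    """F-score of a single label, computed from two parallel label lists."""
    tp = sum(t == label and p == label for t, p in zip(true, pred))
    fp = sum(t != label and p == label for t, p in zip(true, pred))
    fn = sum(t == label and p != label for t, p in zip(true, pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def mean_f_score(true, pred):
    """Unweighted mean of the per-label F-scores (an assumed definition)."""
    labels = sorted(set(true))
    return sum(f_score(true, pred, label) for label in labels) / len(labels)


def select_parameters(candidates, predict, inputs, labels):
    """Return the candidate whose predictions maximize the mean F-score.

    `predict(params, inputs)` is a hypothetical callable that trains a
    classifier with `params` and returns one predicted label per input.
    """
    return max(candidates,
               key=lambda params: mean_f_score(labels, predict(params, inputs)))


# Toy usage: a threshold "classifier" stands in for the SVM.
def toy_predict(params, inputs):
    return ["metadata" if x > params["threshold"] else "body" for x in inputs]


validation_inputs = [1, 2, 3, 4]
validation_labels = ["body", "body", "metadata", "metadata"]
grid = [{"threshold": 0.5}, {"threshold": 2.5}, {"threshold": 3.5}]
best = select_parameters(grid, toy_predict, validation_inputs, validation_labels)
```

In practice the candidate grid would hold real SVM settings such as the kernel and its parameters; the selection logic stays the same.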

The output so far
TrueViz XML format:
hierarchical structure containing:
pages, zones, lines, words, characters
all elements have bounding boxes
reading order is given
zones have labels
<Page>...
  <PageID Value="0"/>
  <Zone>...
    <ZoneID Value="0"/>
    <ZoneCorners>
      <Vertex x="55.320" y="34.295"/>
      <Vertex x="235.704" y="58.295"/>
    </ZoneCorners>
    <ZoneNext Value="1"/>
    <Category Value="TITLE"/>
    <Line>...
      <Word>...
        <Character>...
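A short sketch of how such a file could be read with Python's standard XML parser. The sample below is a hypothetical, well-formed fragment modelled on the elided example above: the `Document` root, the second zone, its `AUTHOR` label and the `ZoneNext` value of `-1` are assumptions made for illustration, not details taken from the slide.

```python
import xml.etree.ElementTree as ET

# Hypothetical well-formed fragment; only the first zone mirrors the slide.
SAMPLE = """
<Document>
  <Page>
    <PageID Value="0"/>
    <Zone>
      <ZoneID Value="0"/>
      <ZoneCorners>
        <Vertex x="55.320" y="34.295"/>
        <Vertex x="235.704" y="58.295"/>
      </ZoneCorners>
      <ZoneNext Value="1"/>
      <Category Value="TITLE"/>
    </Zone>
    <Zone>
      <ZoneID Value="1"/>
      <ZoneCorners>
        <Vertex x="55.320" y="70.000"/>
        <Vertex x="235.704" y="82.000"/>
      </ZoneCorners>
      <ZoneNext Value="-1"/>
      <Category Value="AUTHOR"/>
    </Zone>
  </Page>
</Document>
"""


def read_zones(xml_text):
    """Return one dict per zone with its page, id, label and corner points."""
    root = ET.fromstring(xml_text.strip())
    zones = []
    for page in root.iter("Page"):
        page_id = int(page.find("PageID").get("Value"))
        for zone in page.iter("Zone"):
            corners = [(float(v.get("x")), float(v.get("y")))
                       for v in zone.find("ZoneCorners").iter("Vertex")]
            zones.append({
                "page": page_id,
                "id": int(zone.find("ZoneID").get("Value")),
                "label": zone.find("Category").get("Value"),
                "corners": corners,
            })
    return zones


zones = read_zones(SAMPLE)
```

Each returned dictionary carries the zone label and its two bounding-box corners, which is enough to filter zones by category before further processing.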

Authors and Affiliations Extraction
authors are split based on a list of separators
affiliation indexes are found using a list of index symbols and superscripts
association is done by detected indexes
affiliations are already assigned to authors
first line is assumed to be the author
email is found by a regexp
the remaining part is treated as the affiliation
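Both layouts described above can be sketched in a few lines. This is a simplified illustration, not CERMINE's implementation: the separator list, the index symbols and the email pattern are hypothetical, superscripts are assumed to arrive as plain trailing characters, and every index is assumed to be a single character.

```python
import re

# Hypothetical separator and index-symbol lists; the slide does not list the real ones.
AUTHOR_SEPARATORS = re.compile(r"\s*(?:,|;|\band\b|&)\s*")
TRAILING_INDEXES = re.compile(r"[\d*#]+$")
LEADING_INDEX = re.compile(r"^([\d*#])\s*(.+)$")
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def split_authors(line):
    """Split an author line into (name, [index symbols]) pairs."""
    authors = []
    for chunk in AUTHOR_SEPARATORS.split(line.strip()):
        if not chunk:
            continue
        match = TRAILING_INDEXES.search(chunk)
        if match:
            authors.append((chunk[:match.start()].strip(), list(match.group())))
        else:
            authors.append((chunk, []))
    return authors


def index_affiliations(lines):
    """Map each leading index symbol to the affiliation text that follows it."""
    result = {}
    for line in lines:
        match = LEADING_INDEX.match(line.strip())
        if match:
            result[match.group(1)] = match.group(2)
    return result


def associate(authors, affiliations):
    """Attach affiliation strings to authors through the shared indexes."""
    return [(name, [affiliations[i] for i in indexes if i in affiliations])
            for name, indexes in authors]


def parse_author_block(lines):
    """Second layout: first line is the author, emails are pulled out by
    regexp, and whatever remains is treated as the affiliation."""
    rest = " ".join(line.strip() for line in lines[1:])
    emails = EMAIL.findall(rest)
    affiliation = re.sub(r"\s+", " ", EMAIL.sub("", rest)).strip(" ,;")
    return {"author": lines[0].strip(), "emails": emails, "affiliation": affiliation}


authors = split_authors("M.K.1, J.I.2, T.W")
affiliations = index_affiliations(["1 University of Examples", "2 Institute of Samples"])
relations = associate(authors, affiliations)
block = parse_author_block(["Jan Kowalski",
                            "Institute of Samples, Warsaw, Poland",
                            "jan.kowalski@example.org"])
```

An author without any index, like the third one here, simply ends up with an empty affiliation list.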

Affiliation Parsing
Interdisciplinary Centre for Mathematical and Computational Modelling, University of Warsaw, ul. Pawińskiego 5A blok D, 02-106 Warsaw, Poland
affiliation parsing detects institution, address, country
the implementation is based on a CRF token classifier with features:
the classified word itself,
whether the token is a number, an all-uppercase word, an all-lowercase word, or a lowercase word that starts with an uppercase letter,
whether the token is contained in dictionaries of countries or of words commonly appearing in institutions or addresses,
the features of the two preceding and two following tokens.
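The feature list above could be computed roughly as follows. This is a sketch under stated assumptions: the tokenizer, the feature names and the three tiny dictionaries are hypothetical, and the resulting feature sets would still have to be passed to a CRF library for training.

```python
import re

# Hypothetical miniature dictionaries; the real ones are not shown on the slide.
COUNTRIES = {"poland", "germany", "france"}
INSTITUTION_WORDS = {"university", "institute", "centre", "department"}
ADDRESS_WORDS = {"ul.", "street", "avenue", "blok"}


def tokenize(affiliation):
    """Split into words (optionally ending with a dot) and punctuation marks."""
    return re.findall(r"\w+\.?|[^\w\s]", affiliation)


def local_features(token):
    """Features that depend on a single token only."""
    lower = token.lower()
    features = {"word=" + lower}
    if token.isdigit():
        features.add("is_number")
    if token.isalpha():
        if token.isupper():
            features.add("all_upper")
        elif token.islower():
            features.add("all_lower")
        elif token[0].isupper() and token[1:].islower():
            features.add("capitalized")
    if lower in COUNTRIES:
        features.add("in_country_dict")
    if lower in INSTITUTION_WORDS:
        features.add("in_institution_dict")
    if lower in ADDRESS_WORDS:
        features.add("in_address_dict")
    return features


def sequence_features(tokens, window=2):
    """Add the features of up to `window` neighbours on each side,
    prefixed with their signed offset."""
    rows = []
    for i in range(len(tokens)):
        features = set(local_features(tokens[i]))
        for offset in range(-window, window + 1):
            j = i + offset
            if offset != 0 and 0 <= j < len(tokens):
                features |= {f"{offset:+d}:{name}" for name in local_features(tokens[j])}
        rows.append(features)
    return rows


tokens = tokenize("University of Warsaw, Poland")
rows = sequence_features(tokens)
```

Prefixing neighbour features with their offset lets the classifier distinguish, for example, a country word two tokens ahead from one directly before the current token.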

Datasets
GROTOAP2: the evaluation and training of the zone classifiers
GROTOAP2-affiliations: the evaluation and training of the affiliation parser
PubMed Central Open Access Subset: the evaluation of the entire workflow
[Figure: dataset construction diagram with the labels: PubMed Central, PDF and NLM XML pairs, CERMINE tools, zone text matching, rules]

Zone Classification
2,551 documents from GROTOAP2, containing:
355,779 zones
68,557 metadata zones
5-fold cross-validation
               metadata   other labels   precision   recall
metadata         66,372          2,185      96.8 %   97.0 %
other labels      2,052        285,170           -        -
               affiliation   other labels   precision   recall
affiliation          3,496            185      95.0 %   95.3 %
other labels           173         64,703           -        -

Affiliation Parsing
8,267 affiliations from PubMed Central
5-fold cross-validation
Token classification:
              address   country   institution   precision   recall
address        44,481        12         1,225      96.8 %   97.3 %
country            50     8,108             8      99.6 %   99.3 %
institution     1,434        18        92,457      98.7 %   98.5 %
Affiliation metadata extraction:
institution recognized in 92.4% of cases
address recognized in 92.2% of cases
country recognized in 99.5% of cases
92.1% of affiliations entirely correctly parsed
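The precision and recall figures in the token classification table can be recomputed from the counts. The sketch below assumes that rows hold the true labels and columns the predicted labels; the slide does not say so explicitly, but this reading reproduces the reported percentages.

```python
LABELS = ["address", "country", "institution"]

# Counts from the token classification table.
# Assumption: rows are true labels, columns are predicted labels.
CONFUSION = [
    [44481, 12, 1225],
    [50, 8108, 8],
    [1434, 18, 92457],
]


def precision(matrix, k):
    """Correct predictions of label k divided by all predictions of label k."""
    predicted_as_k = sum(row[k] for row in matrix)
    return matrix[k][k] / predicted_as_k


def recall(matrix, k):
    """Correct predictions of label k divided by all true instances of label k."""
    return matrix[k][k] / sum(matrix[k])


scores = {
    label: (round(100 * precision(CONFUSION, k), 1),
            round(100 * recall(CONFUSION, k), 1))
    for k, label in enumerate(LABELS)
}
```

With these counts, `scores` holds a (precision, recall) pair in percent for each of the three labels.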

Workflow Evaluation
1,943 documents from PMC
evaluated tasks:
extracting author strings
extracting affiliation strings
determining author-affiliation relations
determining author-affiliation relations, assuming authors and affiliations were extracted flawlessly
[Figure: bar chart of F1 (axis from 0 to 100) for four systems (CERMINE, GROBID, ParsCit, PDFX) on four tasks: authors, affiliations, relations (total), relations (perfect input)]

System Features
CERMINE extracts metadata and content from scholarly articles in PDF format
the system is based on a modular workflow
the implementation uses machine learning and heuristics
the default system is trained on large and diverse datasets
the source code is open and available on GitHub
CERMINE is available as a web service and RESTful services

System Usage
Java + Maven
JAR file
RESTful services:
$ curl -X POST --data-binary @article.pdf \
    --header "Content-Type: application/binary" \
    http://cermine.ceon.pl/extract.do
$ curl -X POST --data "affiliation=the text of the affiliation" \
    http://cermine.ceon.pl/parse.do
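The same two calls can be prepared from Python with the standard library only. This is a sketch that mirrors the curl commands above: the function names are hypothetical, the endpoints are the ones shown on the slide and may no longer be reachable, and the response is assumed to be UTF-8 text.

```python
import urllib.parse
import urllib.request

BASE_URL = "http://cermine.ceon.pl"  # endpoint host as given on the slide


def build_extract_request(pdf_bytes):
    """POST the raw PDF bytes, as `curl --data-binary @article.pdf` does."""
    return urllib.request.Request(
        BASE_URL + "/extract.do",
        data=pdf_bytes,
        headers={"Content-Type": "application/binary"},
        method="POST",
    )


def build_parse_request(affiliation):
    """POST a form field named `affiliation`, URL-encoded."""
    body = urllib.parse.urlencode({"affiliation": affiliation}).encode("utf-8")
    return urllib.request.Request(BASE_URL + "/parse.do", data=body, method="POST")


def send(request, timeout=60):
    """Send a prepared request and return the response body as text
    (UTF-8 decoding is an assumption)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")


# Building the requests needs no network access; call send() to submit them.
extract_request = build_extract_request(b"%PDF-1.4 ...")
parse_request = build_parse_request("ICM, University of Warsaw")
```

Keeping request construction separate from sending makes the calls easy to inspect or log before any data leaves the machine.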
[Figure: scatter plot of Time [s] against Number of pages]

Links
CERMINE web service: http://cermine.ceon.pl
CERMINE source code: https://github.com/CeON/CERMINE
GROTOAP2: http://cermine.ceon.pl/grotoap2/
GROTOAP2-affiliations: http://cermine.ceon.pl/grotoap2/affiliations/

Thank you!
linkedin.com/in/bolikowski
twitter.com/bolikowski
lukasz.bolikowski@icm.edu.pl
© 2015 ICM, University of Warsaw. This document is distributed under the CC BY 4.0 license, see: http://creativecommons.org/licenses/by/4.0/
