Structured affiliations extraction
from scientific literature
D. Tkaczyk, B. Tarnawski and Ł. Bolikowski
Interdisciplinary Centre for Mathematical and Computational Modelling
University of Warsaw
24 June 2015
Introduction
CERMINE system
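[Figure: labels of the fields extracted from an article: TITLE, AUTHORS, AFFILIATIONS, EMAILS, ABSTRACT, KEYWORDS; and from a bibliographic reference: AUTHOR, TITLE, SOURCE, YEAR, PAGES, URL, VOLUME]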
CERMINE analyses born-digital scientific articles and extracts:
document metadata, e.g. title, authors, abstract, keywords, publication date, ...
a list of parsed bibliographic references
full text with the section hierarchy
This presentation
The presentation focuses on the following tasks:
extracting a list of authors from a paper
extracting a list of affiliations from a paper
establishing relations between extracted authors and affiliations
detecting institution, address and country in affiliation strings
Motivation
CERMINE can be used to:
extract high-quality metadata from large PDF collections when it is missing or fragmentary
provide intelligent user interfaces for metadata acquisition
Requirements
The metadata extraction system should be:
comprehensive,
automatic,
modular,
open and widely available,
easily applicable,
flexible and able to adapt to new layouts,
well tested.
Architecture and Implementation
The workflow
[Figure: workflow diagram. A PDF content stream passes through structure extraction, classification, splitting and association, affiliation parsing, and XML record generation. The author line "M.K.1, J.I.2, T.W" and the indexed affiliation lines become an XML record with <author>, <aff>, <inst>, <addr> and <country> elements.]
Layout Extraction
1 Character extraction: iText library
2 Page segmentation: Docstrum
3 Reading order resolving: bottom-up, heuristic-based
Content Classification
general classification (labels: metadata, references, body and other)
metadata classification (labels: abstract, bib_info, type, title, affiliation, author, keywords, correspondence, dates and editor)
both classifiers are SVMs, with 83 and 62 features respectively: geometrical, lexical, sequential, formatting, heuristics
the best SVM parameters were found automatically by maximizing the mean F-score on a validation dataset
the classifiers are trained on 2,551 and 2,716 documents, respectively
The output so far
TrueViz XML format:
hierarchical structure containing pages, zones, lines, words and characters
all elements have bounding boxes
reading order is given
zones have labels
<Page>...
  <PageID Value="0"/>
  <Zone>...
    <ZoneID Value="0"/>
    <ZoneCorners>
      <Vertex x="55.320" y="34.295"/>
      <Vertex x="235.704" y="58.295"/>
    </ZoneCorners>
    <ZoneNext Value="1"/>
    <Category Value="TITLE"/>
    <Line>...
      <Word>...
        <Character>...
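As an illustration of how such output could be consumed, the sketch below reads zone labels and bounding boxes with Python's standard library. The sample document is an assumption: it mirrors the fragment above, with the elided elements closed so that it is well-formed and the Line, Word and Character children left out. The function name read_zones is hypothetical and not part of CERMINE.

```python
import xml.etree.ElementTree as ET

# Assumed sample, modelled on the fragment shown on the slide.
SAMPLE = """
<Page>
  <PageID Value="0"/>
  <Zone>
    <ZoneID Value="0"/>
    <ZoneCorners>
      <Vertex x="55.320" y="34.295"/>
      <Vertex x="235.704" y="58.295"/>
    </ZoneCorners>
    <ZoneNext Value="1"/>
    <Category Value="TITLE"/>
  </Zone>
</Page>
"""


def read_zones(xml_text):
    """Return id, label, reading-order successor and bounding box of each zone."""
    page = ET.fromstring(xml_text.strip())
    zones = []
    for zone in page.iter("Zone"):
        vertices = list(zone.find("ZoneCorners").iter("Vertex"))
        xs = [float(v.get("x")) for v in vertices]
        ys = [float(v.get("y")) for v in vertices]
        zones.append({
            "id": zone.find("ZoneID").get("Value"),
            "label": zone.find("Category").get("Value"),
            "next": zone.find("ZoneNext").get("Value"),
            # Bounding box as (left, top, right, bottom).
            "bbox": (min(xs), min(ys), max(xs), max(ys)),
        })
    return zones


for z in read_zones(SAMPLE):
    print(z["id"], z["label"], z["bbox"], "next:", z["next"])
```

Real TrueViz files contain more elements per zone, so a production reader would need to handle missing children.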
Authors and Affiliations Extraction
authors are split based on a list of separators
affiliation indexes are found using a list of index symbols and superscripts
association is done by the detected indexes
when affiliations are already assigned to authors:
the first line is assumed to be the author
the email is found by a regular expression
the remaining part is treated as the affiliation
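The two cases above can be sketched in a few lines of Python. This is a simplified illustration, not the CERMINE implementation: the separator list, the index symbols and the email pattern are assumptions, superscripts are approximated by trailing characters, and the author names are made up.

```python
import re

# Assumed lists: the slides mention separators and index symbols without enumerating them.
SEPARATORS = re.compile(r"\s*(?:,|;|&|\band\b)\s*")
INDEXED_NAME = re.compile(r"^(?P<name>.*?)(?P<idx>[0-9*†‡]+)$")
INDEXED_AFF = re.compile(r"^(?P<idx>[0-9*†‡])\s*(?P<text>.+)$")
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def split_authors(line):
    """Split an author line on separators and peel off trailing index symbols."""
    authors = []
    for part in SEPARATORS.split(line):
        part = part.strip()
        if not part:
            continue
        match = INDEXED_NAME.match(part)
        if match:
            authors.append((match.group("name").strip(), list(match.group("idx"))))
        else:
            authors.append((part, []))
    return authors


def associate(author_line, affiliation_lines):
    """Case 1: link authors to affiliations through the detected indexes."""
    affiliations = {}
    for line in affiliation_lines:
        match = INDEXED_AFF.match(line.strip())
        if match:
            affiliations[match.group("idx")] = match.group("text")
    return {
        name: [affiliations[i] for i in indexes if i in affiliations]
        for name, indexes in split_authors(author_line)
    }


def parse_author_block(block):
    """Case 2: first line is the author, the email is matched, the rest is the affiliation."""
    lines = [line.strip() for line in block.splitlines() if line.strip()]
    if not lines:
        raise ValueError("empty author block")
    email, kept = None, []
    for line in lines[1:]:
        found = EMAIL.search(line)
        if found and email is None:
            email = found.group(0)
            line = EMAIL.sub("", line).strip(" ,;")
        if line:
            kept.append(line)
    return {"author": lines[0], "email": email, "affiliation": ", ".join(kept)}


print(associate("A. Nowak1, B. Kowalska2 and C. Lis1",
                ["1 University of Warsaw", "2 Example Institute"]))
print(parse_author_block("A. Nowak\nUniversity of Warsaw\nWarsaw, Poland\na.nowak@example.org"))
```

Multi-index authors written as "1,2" would be split incorrectly by this sketch, because the comma is also an author separator; handling them needs the superscript information that the layout analysis provides.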
Affiliation Parsing
Interdisciplinary Centre for Mathematical and Computational Modelling, University of Warsaw, ul. Pawińskiego 5A blok D, 02-106 Warsaw, Poland
affiliation parsing detects institution, address, country
the implementation is based on a CRF token classifier with features:
the classified word itself,
whether the token is a number, an all-uppercase word, an all-lowercase word, or a lowercase word that starts with an uppercase letter,
whether the token is contained in dictionaries of countries or of words commonly appearing in institutions or addresses,
the features of the two preceding and two following tokens.
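The feature list above can be illustrated with a small feature extractor. The sketch below only builds per-token feature sets and does not train a CRF; the tokenizer, the feature names and the tiny dictionaries are assumptions made for the example and are not taken from CERMINE.

```python
import re

# Toy dictionaries (assumed); real dictionaries would be much larger.
COUNTRIES = {"poland", "germany", "france"}
INSTITUTION_WORDS = {"university", "institute", "centre", "department"}
ADDRESS_WORDS = {"ul", "street", "road", "avenue"}


def tokenize(text):
    """Split into runs of letters, runs of digits and single punctuation marks."""
    return re.findall(r"[^\W\d_]+|\d+|[^\w\s]", text)


def base_features(token):
    """Features that depend on a single token only."""
    feats = {"W=" + token}
    if token.isdigit():
        feats.add("IsNumber")
    if token.isalpha() and token.isupper():
        feats.add("AllUpper")
    if token.isalpha() and token.islower():
        feats.add("AllLower")
    if token.isalpha() and token[0].isupper() and token[1:].islower():
        feats.add("Capitalized")
    lowered = token.lower()
    if lowered in COUNTRIES:
        feats.add("InCountryDict")
    if lowered in INSTITUTION_WORDS:
        feats.add("InInstitutionDict")
    if lowered in ADDRESS_WORDS:
        feats.add("InAddressDict")
    return feats


def token_features(tokens, i, window=2):
    """Own features plus the features of up to two neighbours on each side."""
    feats = set(base_features(tokens[i]))
    for offset in range(-window, window + 1):
        j = i + offset
        if offset != 0 and 0 <= j < len(tokens):
            feats |= {f"{f}@{offset:+d}" for f in base_features(tokens[j])}
    return feats


tokens = tokenize("University of Warsaw, Poland")
for i, token in enumerate(tokens):
    print(token, sorted(token_features(tokens, i)))
```

Neighbour features carry a position suffix such as "@-1", so the classifier can distinguish a country word at the current position from one two tokens earlier.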
Evaluation
Datasets
GROTOAP2: the evaluation and training of the zone classifiers
GROTOAP2-affiliations: the evaluation and training of the affiliation parser
PubMed Central Open Access Subset: the evaluation of the entire workflow
[Figure: dataset construction diagram with PDF and NLM files from PubMed Central, CERMINE tools, zone text matching, and rules]
Zone Classification
2,551 documents from GROTOAP2, containing:
355,779 zones
68,557 metadata zones
5-fold cross-validation

               metadata   other labels   precision   recall
metadata         66,372          2,185      96.8 %   97.0 %
other labels      2,052        285,170           -        -

               affiliation   other labels   precision   recall
affiliation          3,496            185      95.0 %   95.3 %
other labels           173         64,703           -        -
Affiliation Parsing
8,267 affiliations from PubMed Central
5-fold cross-validation
Token classification:

              address   country   institution   precision   recall
address        44,481        12         1,225      96.8 %   97.3 %
country            50     8,108             8      99.6 %   99.3 %
institution     1,434        18        92,457      98.7 %   98.5 %
Affiliation metadata extraction:
institution recognized in 92.4% of cases
address recognized in 92.2% of cases
country recognized in 99.5% of cases
92.1% of affiliations entirely correctly parsed
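The per-label precision and recall in the token classification table can be recomputed from the confusion matrix. The sketch below assumes that rows hold the true labels and columns the predicted labels; that orientation is not stated on the slide, but it reproduces the reported percentages.

```python
LABELS = ["address", "country", "institution"]

# Token confusion matrix from the slide.
# Assumption: rows are true labels, columns are predicted labels.
CONFUSION = [
    [44481, 12, 1225],
    [50, 8108, 8],
    [1434, 18, 92457],
]


def precision_recall(matrix, k):
    """Precision and recall of label k in a square confusion matrix."""
    true_positives = matrix[k][k]
    predicted_as_k = sum(row[k] for row in matrix)  # column sum
    actually_k = sum(matrix[k])                     # row sum
    return true_positives / predicted_as_k, true_positives / actually_k


for k, label in enumerate(LABELS):
    precision, recall = precision_recall(CONFUSION, k)
    print(f"{label:12s} precision={precision:.1%} recall={recall:.1%}")
```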
Workflow Evaluation
1,943 documents from PMC
evaluated tasks:
extracting author strings
extracting affiliation strings
determining author-affiliation relations
determining author-affiliation relations when authors and affiliations were extracted flawlessly
[Figure: F1 (0 to 100) for authors, affiliations, relations (total) and relations (perfect input), comparing the systems CERMINE, GROBID, ParsCit and PDFX]
Summary
System Features
CERMINE extracts metadata and content from scholarly articles in PDF format
the system is based on a modular workflow
the implementation uses machine learning and heuristics
the default system is trained on large and diverse datasets
the source code is open and available on GitHub
CERMINE is available as a web service and RESTful services
System Usage
Java + Maven
JAR file
RESTful services:
$ curl -X POST --data-binary @article.pdf \
    --header "Content-Type: application/binary" \
    http://cermine.ceon.pl/extract.do
$ curl -X POST --data "affiliation=the text of the affiliation" \
    http://cermine.ceon.pl/parse.do
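The same two calls can be made from Python with the standard library. The sketch below only mirrors the curl commands above: the function names are hypothetical, the response is assumed to be UTF-8 text, and the service may no longer be reachable at this address.

```python
import urllib.parse
import urllib.request

BASE_URL = "http://cermine.ceon.pl"


def extract_request(pdf_bytes):
    """Build the POST request that sends a whole PDF to extract.do."""
    return urllib.request.Request(
        BASE_URL + "/extract.do",
        data=pdf_bytes,
        headers={"Content-Type": "application/binary"},
        method="POST",
    )


def parse_request(affiliation):
    """Build the POST request that sends one affiliation string to parse.do."""
    body = urllib.parse.urlencode({"affiliation": affiliation}).encode("utf-8")
    return urllib.request.Request(BASE_URL + "/parse.do", data=body, method="POST")


def send(request, timeout=60):
    """Send a prepared request and return the response body as text (assumed UTF-8)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")


if __name__ == "__main__":
    # Requires network access; the affiliation text is only an example.
    print(send(parse_request("University of Warsaw, Poland")))
```

Building the request separately from sending it keeps the request construction easy to inspect without contacting the service.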
[Figure: plot of processing time in seconds against the number of pages]
Links
CERMINE web service: http://cermine.ceon.pl
CERMINE source code: https://github.com/CeON/CERMINE
GROTOAP2: http://cermine.ceon.pl/grotoap2/
GROTOAP2-affiliations: http://cermine.ceon.pl/grotoap2/affiliations/
Thank you!
linkedin.com/in/bolikowski
twitter.com/bolikowski
lukasz.bolikowski@icm.edu.pl
© 2015 ICM, University of Warsaw. This document is distributed under the CC BY 4.0 license, see: http://creativecommons.org/licenses/by/4.0/
