GROTOAP2 - The methodology of creating
a large ground truth dataset of scientific articles
Dominika Tkaczyk, Paweł Szostek and Łukasz Bolikowski
Interdisciplinary Centre for Mathematical and Computational Modelling
University of Warsaw
3rd International Workshop on Mining Scientific Publications
12 September 2014
D.Tkaczyk et al. (ICM UW) CERMINE WOSP 12 September 2014 1 / 23
Background 
CERMINE extracts:
document's metadata,
bibliographic references,
structured full text.
CERMINE needs a training set for its zone classifiers!
[Diagram: CERMINE workflow. Stages shown: PDF input, basic structure extraction, metadata extraction, references extraction, text extraction. Outputs shown: XML fragments for metadata, references and body text, and a JATS document with <front>, <body> and <back>.]
Requirements 
A good dataset for document region classification should be:
large,
diverse,
preserving document text and the way text is displayed,
with fine-grained labels,
open.
GROTOAP vs. GROTOAP2 
GROTOAP dataset:
113 documents
1,031 pages
20,121 zones
20 zone labels
12 publishers
created by automatic tools + manual correction of every document = non-scalable
100% accurate
GROTOAP2 dataset:
13,210 documents
119,334 pages
1,640,973 zones
22 zone labels
208 publishers
created by automatic tools + manually developed correction rules = scalable
93% accurate
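To make the difference in scale concrete, the figures above can be compared directly. The short Python sketch below computes the growth factor for each count and a rough number of correctly labelled zones. It assumes that "93% accurate" means the share of zones with a correct label, which the slide does not state; the function names are illustrative only.

```python
# Figures copied from the comparison above.
GROTOAP = {"documents": 113, "pages": 1_031, "zones": 20_121, "publishers": 12}
GROTOAP2 = {"documents": 13_210, "pages": 119_334, "zones": 1_640_973, "publishers": 208}


def growth_factors(old: dict, new: dict) -> dict:
    """Return the new/old ratio for every key, rounded to one decimal place."""
    return {key: round(new[key] / old[key], 1) for key in old}


def expected_correct_zones(zones: int, accuracy: float) -> int:
    """Assumption: accuracy is the fraction of zones carrying a correct label."""
    return round(zones * accuracy)


factors = growth_factors(GROTOAP, GROTOAP2)
correct = expected_correct_zones(GROTOAP2["zones"], 0.93)
print(factors)
print(correct)
```

Under that reading, GROTOAP2 trades a few percent of label accuracy for roughly two orders of magnitude more documents, pages and zones.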
The content 
GROTOAP2 is composed of:
13,210 ground-truth files in XML format storing the content of scientific publications from PubMed Central,
a list of URLs to corresponding PDF files,
a bash script for downloading PDF files from the PMC repository.
The model 
The document's model in GROTOAP2 contains:
geometric hierarchical structure: pages, zones, lines, words and characters,
the text content of all the objects,
the dimensions and positions,
the reading order,
zone labels.
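One way to picture this model is as a set of nested Python dataclasses. The sketch below is only an illustration of the hierarchy listed above, not the CERMINE or GROTOAP2 data model itself: all class and field names are assumptions, and reading order is represented simply by list order.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class BBox:
    """Axis-aligned box: position (x, y) plus dimensions (width, height)."""
    x: float
    y: float
    width: float
    height: float


@dataclass
class Char:
    bbox: BBox
    text: str


@dataclass
class Word:
    bbox: BBox
    chars: List[Char] = field(default_factory=list)  # list order = reading order

    @property
    def text(self) -> str:
        return "".join(c.text for c in self.chars)


@dataclass
class Line:
    bbox: BBox
    words: List[Word] = field(default_factory=list)

    @property
    def text(self) -> str:
        return " ".join(w.text for w in self.words)


@dataclass
class Zone:
    bbox: BBox
    label: str  # a zone label such as "TITLE"
    lines: List[Line] = field(default_factory=list)

    @property
    def text(self) -> str:
        return "\n".join(line.text for line in self.lines)


@dataclass
class Page:
    zones: List[Zone] = field(default_factory=list)


@dataclass
class Document:
    pages: List[Page] = field(default_factory=list)


# Build a one-character document to show how text is derived bottom-up.
box = BBox(55.4, 34.3, 18.7, 24.0)
zone = Zone(box, "TITLE", [Line(box, [Word(box, [Char(box, "B")])])])
doc = Document([Page([zone])])
print(doc.pages[0].zones[0].label, repr(doc.pages[0].zones[0].text))
```

Deriving the text of words, lines and zones from the characters keeps a single source of truth, at the cost of recomputing the strings on each access.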
Zone labels 
front: type, title, author, title author, editor, affiliation, abstract, keywords, bib info, dates, correspondence, glossary, copyright
body: body content, figure, table, equation
back: references, acknowledgment, conflict statement
other: page number, unknown
[Bar chart: % of documents (axis 0 to 100) per zone label. Labels in chart order: BIB_INFO, BODY_CONTENT, REFERENCES, AFFILIATION, PAGE_NUMBER, ABSTRACT, TITLE, COPYRIGHT, ACKNOWLEDGMENT, AUTHOR, DATES, UNKNOWN, TABLE, TYPE, KEYWORDS, FIGURE, CORRESPONDENCE, CONFLICT_STATEMENT, EDITOR, TITLE_AUTHOR, GLOSSARY, EQUATION.]
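The four groups and 22 labels on this slide map naturally onto a lookup table. The Python sketch below encodes the grouping from the slide, using the upper-case spellings from the chart; the names `ZoneCategory`, `LABELS_BY_CATEGORY` and `category_of` are illustrative assumptions and are not taken from the dataset's tooling.

```python
from enum import Enum


class ZoneCategory(Enum):
    FRONT = "front"
    BODY = "body"
    BACK = "back"
    OTHER = "other"


# Grouping as given on the slide; identifiers follow the chart's spelling.
LABELS_BY_CATEGORY = {
    ZoneCategory.FRONT: [
        "TYPE", "TITLE", "AUTHOR", "TITLE_AUTHOR", "EDITOR", "AFFILIATION",
        "ABSTRACT", "KEYWORDS", "BIB_INFO", "DATES", "CORRESPONDENCE",
        "GLOSSARY", "COPYRIGHT",
    ],
    ZoneCategory.BODY: ["BODY_CONTENT", "FIGURE", "TABLE", "EQUATION"],
    ZoneCategory.BACK: ["REFERENCES", "ACKNOWLEDGMENT", "CONFLICT_STATEMENT"],
    ZoneCategory.OTHER: ["PAGE_NUMBER", "UNKNOWN"],
}

# Reverse index: label -> category.
CATEGORY_OF = {
    label: category
    for category, labels in LABELS_BY_CATEGORY.items()
    for label in labels
}


def category_of(label: str) -> ZoneCategory:
    """Map a zone label to its general category (raises KeyError if unknown)."""
    return CATEGORY_OF[label.upper()]


print(len(CATEGORY_OF), category_of("references"))
```

A reverse index like this is handy when a classifier first predicts the general category and then the specific label, or when results are reported per category.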
TrueViz format 
<Document>
  <Page>
    <PageID Value="0"/>
    <PageNext Value="1"/>
    <Zone>
      <ZoneID Value="0"/>
      <ZoneNext Value="1"/>
      <ZoneCorners>
        <Vertex x="55.4" y="34.3"/>
        <Vertex x="250.5" y="58.3"/>
      </ZoneCorners>
      <Classification>
        <Category Value="TITLE"/>
        <Type Value=""/>
      </Classification>
      <Line>
        <LineID Value="0"/>
        <LineNext Value="1"/>
        <LineCorners>
          <Vertex x="55.4" y="34.3"/>
          <Vertex x="250.5" y="58.3"/>
        </LineCorners>
        <Word>
          <WordID Value="0"/>
          <WordNext Value="1"/>
          <WordCorners>
            <Vertex x="55.4" y="34.3"/>
            <Vertex x="115.3" y="58.3"/>
          </WordCorners>
          <Character>
            <CharacterID Value="0"/>
            <CharacterNext Value="1"/>
            <CharacterCorners>
              <Vertex x="55.4" y="34.3"/>
              <Vertex x="74.1" y="58.3"/>
            </CharacterCorners>
            <GT_Text Value="B"/>
          </Character> [...]
        </Word> [...]
      </Line> [...]
    </Zone> [...]
  </Page> [...]
</Document>
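A file in this format can be read with a standard XML parser. The Python sketch below uses `xml.etree.ElementTree` on a trimmed sample and assumes that the tag names on the slide are XML elements whose angle brackets and attribute quotes were lost in extraction. The `read_zones` helper is hypothetical and joins all words of a zone with spaces, ignoring line breaks.

```python
import xml.etree.ElementTree as ET

# Trimmed sample following the element names on the slide (one zone, one character).
SAMPLE = """
<Document>
  <Page>
    <PageID Value="0"/>
    <Zone>
      <ZoneID Value="0"/>
      <ZoneCorners>
        <Vertex x="55.4" y="34.3"/>
        <Vertex x="250.5" y="58.3"/>
      </ZoneCorners>
      <Classification>
        <Category Value="TITLE"/>
        <Type Value=""/>
      </Classification>
      <Line>
        <Word>
          <Character>
            <CharacterCorners>
              <Vertex x="55.4" y="34.3"/>
              <Vertex x="74.1" y="58.3"/>
            </CharacterCorners>
            <GT_Text Value="B"/>
          </Character>
        </Word>
      </Line>
    </Zone>
  </Page>
</Document>
"""


def read_zones(xml_text):
    """Yield (label, corners, text) for every Zone element in the document."""
    root = ET.fromstring(xml_text)
    for zone in root.iter("Zone"):
        label = zone.find("Classification/Category").get("Value")
        corners = [
            (float(v.get("x")), float(v.get("y")))
            for v in zone.findall("ZoneCorners/Vertex")
        ]
        # Rebuild each word from its characters' GT_Text values.
        words = [
            "".join(ch.find("GT_Text").get("Value") for ch in word.iter("Character"))
            for word in zone.iter("Word")
        ]
        yield label, corners, " ".join(words)


zones = list(read_zones(SAMPLE))
print(zones)
```

For full-size ground-truth files, a streaming parser such as `ET.iterparse` may be a better fit, since every character carries its own element subtree.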
The method 
[Diagram: dataset creation pipeline. Elements shown: PubMed Central as the source of PDF and NLM files, CERMINE tools, zone text matching, rules.]
Structure extraction 
CERMINE tools were used to:
extract individual characters and their bounding boxes from PDF files,
group individual characters into words, lines and zones,
compute the reading order of all the elements.
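The grouping step can be illustrated with a deliberately simple heuristic. The Python sketch below splits the characters of a single text line into words wherever the horizontal gap between neighbouring bounding boxes exceeds a threshold. This is not CERMINE's segmentation algorithm, which the slides do not describe; `CharBox`, `group_into_words` and the `max_gap` value are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CharBox:
    text: str
    x0: float  # left edge of the bounding box
    x1: float  # right edge of the bounding box


def group_into_words(chars: List[CharBox], max_gap: float = 2.0) -> List[str]:
    """Split one text line into words wherever the horizontal gap exceeds max_gap."""
    words: List[str] = []
    current = ""
    prev_right = None
    # Sort left to right so gaps are measured between neighbours.
    for ch in sorted(chars, key=lambda c: c.x0):
        if prev_right is not None and ch.x0 - prev_right > max_gap:
            words.append(current)
            current = ""
        current += ch.text
        prev_right = ch.x1
    if current:
        words.append(current)
    return words


line = [
    CharBox("t", 10.0, 13.0), CharBox("o", 13.5, 17.0),
    CharBox("b", 22.0, 26.0), CharBox("e", 26.4, 30.0),
]
print(group_into_words(line))
```

A fixed threshold is fragile across font sizes; a real system would more likely derive the gap limit from the character dimensions on the page.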

GROTOAP2 - The methodology of creating a large ground truth dataset of scientific articles

  • 1. GROTOAP2 | The methodology of creating a large ground truth dataset of scienti
  • 2. c articles Dominika Tkaczyk, Pawe l Szostek and Lukasz Bolikowski Interdisciplinary Centre for Mathematical and Computational Modelling University of Warsaw 3rd International Workshop on Mining Scienti
  • 3. c Publications 12 September 2014 D.Tkaczyk et al. (ICM UW) CERMINE WOSP 12 September 2014 1 / 23
  • 4. Background CERMINE extracts: document's metadata, bibliographic references, structured full text. CERMINE needs a training set for its zone classi
  • 5. ers! PDF BT /F13 10 Tf 250 720 Td (PDF) Tj ET <XML> <title>Syst... <author>M... <author>J.I... <journal>J... <date>2009.. <XML> <ref> <author>M.. <title>Sys... <journal>J... </ref> <ref>... Basic structure extraction Metadata extraction Text extraction <JATS> <front> <meta><title </front> <body> <sec><title> </body> <back> <ref>1. <aut </back> <XML> <body> <sec> <title>1. In <p>The ... ... </body> References extraction D.Tkaczyk et al. (ICM UW) CERMINE WOSP 12 September 2014 2 / 23
  • 6. Requirements A good dataset for document region classi
  • 7. cation should be: large, diverse, preserving document text, and the way text is displayed, with
  • 8. ne-grained labels, open. D.Tkaczyk et al. (ICM UW) CERMINE WOSP 12 September 2014 3 / 23
  • 9. GROTOAP GROTOAP dataset: 113 documents 1,031 pages 20,121 zones 20 zone labels 12 publishers created by automatic tools + manual correction of every document = non-scalable 100% accurate D.Tkaczyk et al. (ICM UW) CERMINE WOSP 12 September 2014 4 / 23
  • 10. GROTOAP vs. GROTOAP2 GROTOAP dataset: 113 documents 1,031 pages 20,121 zones 20 zone labels 12 publishers created by automatic tools + manual correction of every document = non-scalable 100% accurate GROTOAP2 dataset: 13,210 documents 119,334 pages 1,640,973 zones 22 zone labels 208 publishers created by automatic tools + manually developed correction rules = scalable 93% accurate D.Tkaczyk et al. (ICM UW) CERMINE WOSP 12 September 2014 5 / 23
  • 11. The content GROTOAP2 is composed of: 13,210 ground-truth
  • 12. les in XML format storing the content of scienti
  • 13. c publications from PubMed Central, a list of URLs to corresponding PDF
  • 14. les, a bash script for downloading PDF
  • 15. les from PMC repository. D.Tkaczyk et al. (ICM UW) CERMINE WOSP 12 September 2014 6 / 23
  • 16. The model The document's model in GROTOAP2 contains: geometric hierarchical structure: pages, zones, lines, words and characters, the text content of all the objects, the dimentions and positions, the reading order, zone labels. D.Tkaczyk et al. (ICM UW) CERMINE WOSP 12 September 2014 7 / 23
  • 17. Zone labels front: type, title, author, title author, editor, aliation, abstract, keywords, bib info, dates, correspondence, glossary, copyright body: body content,
  • 18. gure, table, equation back: references, acknowledgment, con ict statement other: page number, unknown BIB_INFO BODY_CONTENT REFERENCES AFFILIATION PAGE_NUMBER ABSTRACT TITLE COPYRIGHT ACKNOWLEDGMENT AUTHOR DATES UNKNOWN TABLE TYPE KEYWORDS FIGURE CORRESPONDENCE CONFLICT_STATEMENT EDITOR TITLE_AUTHOR GLOSSARY EQUATION 100 80 60 40 20 0 % of documents D.Tkaczyk et al. (ICM UW) CERMINE WOSP 12 September 2014 8 / 23
  • 19. TrueViz format Document Page PageID Value=0/ PageNext Value=1/ Zone ZoneID Value=0/ ZoneNext Value=1/ ZoneCorners Vertex x=55.4 y=34.3/ Vertex x=250.5 y=58.3/ /ZoneCorners Classification Category Value=TITLE/ Type Value=/ /Classification Line LineID Value=0/ LineNext Value=1/ LineCorners Vertex x=55.4 y=34.3/ Vertex x=250.5y=58.3/ /LineCorners Word WordID Value=0/ WordNext Value=1/ WordCorners Vertex x=55.4 y=34.3/ Vertex x=115.3 y=58.3/ /WordCorners Character CharacterID Value=0/ CharacterNext Value=1/ CharacterCorners Vertex x=55.4 y=34.3/ Vertex x=74.1 y=58.3/ /CharacterCorners GT_Text Value=B/ /Character[...] /Word[...] /Line[...] /Zone[...] /Page[...] /Document D.Tkaczyk et al. (ICM UW) CERMINE WOSP 12 September 2014 9 / 23
  • 20. TrueViz format Document Page PageID Value=0/ PageNext Value=1/ Zone ZoneID Value=0/ ZoneNext Value=1/ ZoneCorners Vertex x=55.4 y=34.3/ Vertex x=250.5 y=58.3/ /ZoneCorners Classification Category Value=TITLE/ Type Value=/ /Classification Line LineID Value=0/ LineNext Value=1/ LineCorners Vertex x=55.4 y=34.3/ Vertex x=250.5y=58.3/ /LineCorners Word WordID Value=0/ WordNext Value=1/ WordCorners Vertex x=55.4 y=34.3/ Vertex x=115.3 y=58.3/ /WordCorners Character CharacterID Value=0/ CharacterNext Value=1/ CharacterCorners Vertex x=55.4 y=34.3/ Vertex x=74.1 y=58.3/ /CharacterCorners GT_Text Value=B/ /Character[...] /Word[...] /Line[...] /Zone[...] /Page[...] /Document D.Tkaczyk et al. (ICM UW) CERMINE WOSP 12 September 2014 9 / 23
  • 21. TrueViz format Document Page PageID Value=0/ PageNext Value=1/ Zone ZoneID Value=0/ ZoneNext Value=1/ ZoneCorners Vertex x=55.4 y=34.3/ Vertex x=250.5 y=58.3/ /ZoneCorners Classification Category Value=TITLE/ Type Value=/ /Classification Line LineID Value=0/ LineNext Value=1/ LineCorners Vertex x=55.4 y=34.3/ Vertex x=250.5y=58.3/ /LineCorners Word WordID Value=0/ WordNext Value=1/ WordCorners Vertex x=55.4 y=34.3/ Vertex x=115.3 y=58.3/ /WordCorners Character CharacterID Value=0/ CharacterNext Value=1/ CharacterCorners Vertex x=55.4 y=34.3/ Vertex x=74.1 y=58.3/ /CharacterCorners GT_Text Value=B/ /Character[...] /Word[...] /Line[...] /Zone[...] /Page[...] /Document D.Tkaczyk et al. (ICM UW) CERMINE WOSP 12 September 2014 9 / 23
  • 22. TrueViz format Document Page PageID Value=0/ PageNext Value=1/ Zone ZoneID Value=0/ ZoneNext Value=1/ ZoneCorners Vertex x=55.4 y=34.3/ Vertex x=250.5 y=58.3/ /ZoneCorners Classification Category Value=TITLE/ Type Value=/ /Classification Line LineID Value=0/ LineNext Value=1/ LineCorners Vertex x=55.4 y=34.3/ Vertex x=250.5y=58.3/ /LineCorners Word WordID Value=0/ WordNext Value=1/ WordCorners Vertex x=55.4 y=34.3/ Vertex x=115.3 y=58.3/ /WordCorners Character CharacterID Value=0/ CharacterNext Value=1/ CharacterCorners Vertex x=55.4 y=34.3/ Vertex x=74.1 y=58.3/ /CharacterCorners GT_Text Value=B/ /Character[...] /Word[...] /Line[...] /Zone[...] /Page[...] /Document D.Tkaczyk et al. (ICM UW) CERMINE WOSP 12 September 2014 9 / 23
  • 23. TrueViz format Document Page PageID Value=0/ PageNext Value=1/ Zone ZoneID Value=0/ ZoneNext Value=1/ ZoneCorners Vertex x=55.4 y=34.3/ Vertex x=250.5 y=58.3/ /ZoneCorners Classification Category Value=TITLE/ Type Value=/ /Classification Line LineID Value=0/ LineNext Value=1/ LineCorners Vertex x=55.4 y=34.3/ Vertex x=250.5y=58.3/ /LineCorners Word WordID Value=0/ WordNext Value=1/ WordCorners Vertex x=55.4 y=34.3/ Vertex x=115.3 y=58.3/ /WordCorners Character CharacterID Value=0/ CharacterNext Value=1/ CharacterCorners Vertex x=55.4 y=34.3/ Vertex x=74.1 y=58.3/ /CharacterCorners GT_Text Value=B/ /Character[...] /Word[...] /Line[...] /Zone[...] /Page[...] /Document D.Tkaczyk et al. (ICM UW) CERMINE WOSP 12 September 2014 9 / 23
  • 24. TrueViz format Document Page PageID Value=0/ PageNext Value=1/ Zone ZoneID Value=0/ ZoneNext Value=1/ ZoneCorners Vertex x=55.4 y=34.3/ Vertex x=250.5 y=58.3/ /ZoneCorners Classification Category Value=TITLE/ Type Value=/ /Classification Line LineID Value=0/ LineNext Value=1/ LineCorners Vertex x=55.4 y=34.3/ Vertex x=250.5y=58.3/ /LineCorners Word WordID Value=0/ WordNext Value=1/ WordCorners Vertex x=55.4 y=34.3/ Vertex x=115.3 y=58.3/ /WordCorners Character CharacterID Value=0/ CharacterNext Value=1/ CharacterCorners Vertex x=55.4 y=34.3/ Vertex x=74.1 y=58.3/ /CharacterCorners GT_Text Value=B/ /Character[...] /Word[...] /Line[...] /Zone[...] /Page[...] /Document D.Tkaczyk et al. (ICM UW) CERMINE WOSP 12 September 2014 9 / 23
  • 25. TrueViz format Document Page PageID Value=0/ PageNext Value=1/ Zone ZoneID Value=0/ ZoneNext Value=1/ ZoneCorners Vertex x=55.4 y=34.3/ Vertex x=250.5 y=58.3/ /ZoneCorners Classification Category Value=TITLE/ Type Value=/ /Classification Line LineID Value=0/ LineNext Value=1/ LineCorners Vertex x=55.4 y=34.3/ Vertex x=250.5y=58.3/ /LineCorners Word WordID Value=0/ WordNext Value=1/ WordCorners Vertex x=55.4 y=34.3/ Vertex x=115.3 y=58.3/ /WordCorners Character CharacterID Value=0/ CharacterNext Value=1/ CharacterCorners Vertex x=55.4 y=34.3/ Vertex x=74.1 y=58.3/ /CharacterCorners GT_Text Value=B/ /Character[...] /Word[...] /Line[...] /Zone[...] /Page[...] /Document D.Tkaczyk et al. (ICM UW) CERMINE WOSP 12 September 2014 9 / 23
  • 32. The method [Pipeline diagram; labels: PubMed Central, PDF, NLM, CERMINE tools, zone text matching, rules]
  • 34. Structure extraction. CERMINE tools were used to: extract individual characters and their bounding boxes from PDF files; group individual characters into words, lines and zones; compute the reading order of all the elements.
  • 37. Zone text matching. Labels were assigned to zones as follows: the text content of zones was matched with the corresponding NLM files; the Smith-Waterman sequence alignment algorithm was used to measure string similarity; the label was chosen by selecting the string with the highest similarity score above a threshold; an additional attempt was made to assign a label to every unknown zone based on the labels of the neighbouring zones.
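The matching step can be sketched as a local alignment score followed by a thresholded choice. This is only an illustration under stated assumptions: the scoring weights (match +2, mismatch -1, gap -1), the normalisation by zone length, the 0.7 threshold and the function names are all hypothetical, since the slides do not give the values CERMINE used.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-1):
    """Best local alignment score between strings a and b (linear gap penalty)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Local alignment: a cell never drops below zero.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def similarity(zone_text, fragment, match=2):
    """Score normalised by the best score the zone text could reach (assumption)."""
    if not zone_text or not fragment:
        return 0.0
    return smith_waterman_score(zone_text, fragment, match=match) / (match * len(zone_text))


def choose_label(zone_text, labelled_fragments, threshold=0.7):
    """Return the label of the most similar fragment above the threshold, else None."""
    best_label, best_score = None, threshold
    for label, fragment in labelled_fragments:
        score = similarity(zone_text, fragment)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


fragments = [
    ("title", "Mining scientific publications"),
    ("abstract", "We describe a ground truth dataset."),
]
label = choose_label("Mining scientific publications", fragments)
```

A zone that returns `None` here would correspond to an unknown zone, which the slide says is then revisited using the labels of neighbouring zones.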
  • 40. Filtering: 43% of all processed documents have at least 90% of zones labelled. [Histogram: x-axis "Percentage of labelled zones" (0 to 100); y-axis "Fraction of documents in bin" (0.00 to 0.07)]
  • 41. Distribution similarity. The publisher distribution similarity of two datasets A and B can be calculated as $\mathrm{sim}(A, B) = \sum_{p \in P} \min(d_A(p), d_B(p))$, where $P$ is the set of all publishers in $A \cup B$, and $d_A(p)$ and $d_B(p)$ are the percentage shares of publisher $p$ in A and B, respectively (expressed as fractions, so identical distributions give 1.0). Some examples: sim({60% X, 40% Y}, {60% X, 40% Y}) = 1.0; sim({60% X, 40% Y}, {40% X, 60% Y}) = 0.8; sim(entire processed set, selected set) = 0.78; sim({30% X, 70% Y}, {100% Z}) = 0.0.
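The similarity measure above can be checked against the slide's own examples with a few lines of code. This is a minimal sketch; the function names and the dictionary representation of a distribution are assumptions made for illustration.

```python
from collections import Counter


def publisher_distribution(publishers):
    """Map each publisher name to its share (0..1) in a list of documents' publishers."""
    counts = Counter(publishers)
    total = sum(counts.values())
    return {name: count / total for name, count in counts.items()}


def publisher_similarity(dist_a, dist_b):
    """Sum, over all publishers in either set, of the smaller of the two shares."""
    publishers = set(dist_a) | set(dist_b)
    return sum(min(dist_a.get(p, 0.0), dist_b.get(p, 0.0)) for p in publishers)


same = publisher_similarity({"X": 0.6, "Y": 0.4}, {"X": 0.6, "Y": 0.4})
swapped = publisher_similarity({"X": 0.6, "Y": 0.4}, {"X": 0.4, "Y": 0.6})
disjoint = publisher_similarity({"X": 0.3, "Y": 0.7}, {"Z": 1.0})
```

The three calls correspond to the toy examples on the slide (1.0, 0.8 and 0.0); the 0.78 figure depends on the real datasets and is not reproduced here.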
  • 43. Rules: a zone containing both title and authors → title author; page numbers from range 1–n → page number; [...] → figure; table captions → table; small zones lying in the close neighbourhood of table zones → table; zones that occur on every page or every odd/even page and are placed close to the top or bottom of the page → bib info.
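One of the rules above, relabelling zones that recur on every page (or every odd/even page) near the top or bottom edge, could look roughly like the sketch below. The `Zone` class, the 10% margin, the exact-text comparison and the requirement of at least two occurrences are all assumptions for illustration; the slides do not describe how the rule was actually implemented.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Zone:
    page: int          # zero-based page index (hypothetical representation)
    y_center: float    # vertical centre as a fraction of page height, 0 = top
    text: str
    label: str = "unknown"


def near_edge(zone, margin):
    """True when the zone sits within `margin` of the top or bottom of the page."""
    return zone.y_center <= margin or zone.y_center >= 1.0 - margin


def relabel_running_zones(zones, page_count, margin=0.1, label="bib info"):
    """Relabel edge zones whose text repeats on all, all even or all odd pages."""
    pages_by_text = defaultdict(set)
    for zone in zones:
        if near_edge(zone, margin):
            pages_by_text[zone.text].add(zone.page)

    all_pages = set(range(page_count))
    even_pages = {p for p in all_pages if p % 2 == 0}
    odd_pages = all_pages - even_pages
    recurring = {
        text for text, pages in pages_by_text.items()
        if len(pages) >= 2 and pages in (all_pages, even_pages, odd_pages)
    }
    for zone in zones:
        if near_edge(zone, margin) and zone.text in recurring:
            zone.label = label
    return zones


zones = [Zone(p, 0.03, "J. Example 2014") for p in range(4)]
zones += [Zone(p, 0.97, "Author et al.") for p in (1, 3)]
zones += [Zone(0, 0.08, "A title near the top"), Zone(0, 0.5, "Introduction")]
relabel_running_zones(zones, page_count=4)
```

Real running headers often differ slightly from page to page (for example, an embedded page number), so a practical version would likely compare texts approximately rather than exactly.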
  • 47. The evaluation: manual evaluation, using a small random sample of documents; indirect evaluation, evaluating the performance of CERMINE trained on GROTOAP2.
  • 48. Manual evaluation (precision, recall and F-score per zone label, without and with rules):

      label               without rules           with rules
                          prec.   recall  F-score prec.   recall  F-score
      abstract            0.93    0.96    0.94    0.98    0.98    0.98
      acknowledgement     0.98    0.67    0.80    1.0     0.90    0.95
      affiliation         0.77    0.90    0.83    0.95    0.95    0.95
      author              0.85    0.95    0.90    1.0     0.98    0.99
      bib info            0.95    0.45    0.62    0.96    0.94    0.95
      body content        0.65    0.98    0.79    0.88    0.99    0.93
      conflict statement  0.63    0.24    0.35    0.82    0.89    0.85
      copyright           0.71    0.94    0.81    0.93    0.78    0.85
      correspondence      1.0     0.72    0.84    1.0     0.97    0.99
      dates               0.28    1.0     0.44    0.94    1.0     0.97
      editor              -       0       -       1.0     1.0     1.0
      equation            -       -       -       -       -       -
      figure              0.99    0.36    0.53    0.99    0.46    0.63
      glossary            1.0     1.0     1.0     1.0     1.0     1.0
      keywords            0.94    0.94    0.94    1.0     0.94    0.97
      page number         0.99    0.53    0.69    0.98    0.97    0.98
      references          0.91    0.95    0.93    0.99    0.95    0.97
      table               0.98    0.83    0.90    0.98    0.96    0.97
      title               0.51    1.0     0.67    1.0     1.0     1.0
      title author        -       0       -       1.0     1.0     1.0
      type                0.76    0.46    0.57    0.89    0.47    0.62
      unknown             0.22    0.46    0.30    0.62    0.94    0.75
      average             0.79    0.68    0.73    0.95    0.91    0.92
  • 50. CERMINE-based evaluation (metadata extraction by CERMINE):

      field         precision  recall     F-score
      title         93.05%     88.40%     90.67%
      author        94.38%     90.01%     92.14%
      affiliation   84.20%     78.03%     81.00%
      abstract      85.24%     83.67%     84.45%
      keywords      87.98%     65.30%     74.96%
      journal name  71.88%     63.40%     67.38%
      volume        96.28%     93.20%     94.72%
      issue         49.12%     55.67%     52.19%
      pages         47.41%     45.79%     46.59%
      year          99.79%     97.80%     98.29%
      DOI           96.12%     85.34%     90.41%
      average       82.22%     76.96%     79.34%
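The F-score column in these tables is presumably the usual harmonic mean of precision and recall (the slides do not state the formula). A minimal sketch, with `f_score` as a hypothetical helper name:

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values for the "title" and "author" rows, in percent.
title_f = round(f_score(93.05, 88.40), 2)
author_f = round(f_score(94.38, 90.01), 2)
```

Under that assumption, the title and author rows come out as 90.67 and 92.14, matching the table. Rows whose inputs were rounded before publication may differ in the last digit.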
  • 51. CERMINE-based evaluation (CERMINE trained on each dataset):

                  GROTOAP    GROTOAP2        GROTOAP2
                             without rules   with rules
      Precision   77.13%     81.88%          82.22%
      Recall      55.99%     70.94%          76.96%
      F-score     62.41%     75.38%          79.34%
  • 52. Future work: enriching the ground truth files with the names of the fonts; assigning more specific body labels, e.g. section titles; generating a dataset of parsed bibliographic references in a similar way.
  • 55. Links. GROTOAP2: http://cermine.ceon.pl/grotoap2/ CERMINE web service: http://cermine.ceon.pl
  • 56. Thank you! Questions? Dominika Tkaczyk, d.tkaczyk@icm.edu.pl. © 2014 Dominika Tkaczyk. This document is distributed under the Creative Commons Attribution 3.0 license. The complete text of the license can be seen here: http://creativecommons.org/licenses/by/3.0/