GROTOAP2 - The methodology of creating a large ground truth dataset of scientific articles
1. GROTOAP2 - The methodology of creating a large ground truth dataset of scientific articles
Dominika Tkaczyk, Paweł Szostek and Łukasz Bolikowski
Interdisciplinary Centre for Mathematical and Computational Modelling
University of Warsaw
3rd International Workshop on Mining Scientific Publications
12 September 2014
D.Tkaczyk et al. (ICM UW) CERMINE WOSP 12 September 2014 1 / 23
4. Background
CERMINE extracts:
document's metadata,
bibliographic references,
structured full text.
CERMINE needs a training set for its zone classifiers.
9. GROTOAP
GROTOAP dataset:
113 documents
1,031 pages
20,121 zones
20 zone labels
12 publishers
created by automatic tools + manual correction of every document = non-scalable
100% accurate
10. GROTOAP vs. GROTOAP2
GROTOAP2 dataset:
13,210 documents
119,334 pages
1,640,973 zones
22 zone labels
208 publishers
created by automatic tools + manually developed correction rules = scalable
93% accurate
... files from PMC repository.
16. The model
The document's model in GROTOAP2 contains:
geometric hierarchical structure: pages, zones, lines, words and characters,
the text content of all the objects,
the dimensions and positions,
the reading order,
zone labels.
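The hierarchy listed above can be sketched as a few nested data classes. This is only an illustration, not the storage format used by GROTOAP2: the class names, the `Box` fields and the `union_box` helper are hypothetical, and reading order is assumed to be represented simply by the order of items in each list.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Box:
    """Position (x, y) and dimensions (width, height) of an object."""
    x: float
    y: float
    width: float
    height: float


def union_box(boxes: List[Box]) -> Box:
    """Smallest box that encloses all the given boxes (hypothetical helper)."""
    x0 = min(b.x for b in boxes)
    y0 = min(b.y for b in boxes)
    x1 = max(b.x + b.width for b in boxes)
    y1 = max(b.y + b.height for b in boxes)
    return Box(x0, y0, x1 - x0, y1 - y0)


@dataclass
class Char:
    text: str
    box: Box


@dataclass
class Word:
    chars: List[Char] = field(default_factory=list)  # list order = reading order

    @property
    def text(self) -> str:
        return "".join(c.text for c in self.chars)

    @property
    def box(self) -> Box:
        return union_box([c.box for c in self.chars])


@dataclass
class Line:
    words: List[Word] = field(default_factory=list)

    @property
    def text(self) -> str:
        return " ".join(w.text for w in self.words)


@dataclass
class Zone:
    label: str  # e.g. "title" or "body content"
    lines: List[Line] = field(default_factory=list)

    @property
    def text(self) -> str:
        return "\n".join(line.text for line in self.lines)


@dataclass
class Page:
    zones: List[Zone] = field(default_factory=list)
```

Only `Word` derives its box here; lines, zones and pages could derive theirs in the same way from their children.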
17. Zone labels
front: type, title, author, title author, editor, affiliation, abstract, keywords, bib info, dates, correspondence, glossary, copyright
body: body content, figure, table, equation
back: references, acknowledgment, conflict statement
other: page number, unknown
[Figure: bar chart of the % of documents (axis from 0 to 100) per zone label. Labels shown: BIB_INFO, BODY_CONTENT, REFERENCES, AFFILIATION, PAGE_NUMBER, ABSTRACT, TITLE, COPYRIGHT, ACKNOWLEDGMENT, AUTHOR, DATES, UNKNOWN, TABLE, TYPE, KEYWORDS, FIGURE, CORRESPONDENCE, CONFLICT_STATEMENT, EDITOR, TITLE_AUTHOR, GLOSSARY, EQUATION]
32. The method
[Diagram: processing pipeline with the elements PubMed Central (PDF and NLM files), CERMINE tools, zone text matching, rules]
34. Structure extraction
CERMINE tools were used to:
extract individual characters and their bounding boxes from PDF files,
group individual characters into words, lines and zones,
compute the reading order of all the elements.
37. Zone text matching
Labels were assigned to zones:
the text content of zones was matched with the corresponding NLM files,
the Smith-Waterman sequence alignment algorithm was used to measure string similarity,
the label was chosen by selecting the string with the highest similarity score above a threshold,
an additional attempt was made to assign a label to every unknown zone based on the labels of the neighbouring zones.
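The matching step above can be illustrated with a minimal sketch of Smith-Waterman local alignment used as a string similarity. The scoring values (`match`, `mismatch`, `gap`), the normalisation by the shorter string, the `threshold` of 0.7 and the `choose_label` helper are all assumptions made for illustration; the slides do not state the values or the normalisation actually used.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-1):
    """Best local alignment score between strings a and b (linear gap penalty)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def similarity(a, b, match=2):
    """Score scaled to [0, 1] by the best score the shorter string could reach
    (an assumed normalisation)."""
    if not a or not b:
        return 0.0
    return smith_waterman_score(a, b, match=match) / (match * min(len(a), len(b)))


def choose_label(zone_text, candidates, threshold=0.7):
    """Pick the label whose reference text is most similar to the zone text.

    candidates maps a label to a text fragment taken from the NLM file.
    Returns "unknown" when no candidate scores above the threshold.
    """
    best_label, best_sim = "unknown", threshold
    for label, text in candidates.items():
        sim = similarity(zone_text, text)
        if sim > best_sim:
            best_label, best_sim = label, sim
    return best_label
```

For example, `choose_label("Deep learning for PDFs", {"title": "Deep learning for PDFs"})` returns `"title"`, while a zone whose text shares no characters with any candidate stays `"unknown"`.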
40. Filtering
43% of all processed documents have at least 90% of zones labelled.
[Figure: histogram of the fraction of documents in bin (0.00 to 0.07) against the percentage of labelled zones (0 to 100)]
41. Distribution similarity
Publisher distribution similarity of two datasets A and B can be calculated as:
$$\mathrm{sim}(A, B) = \sum_{p \in P} \min(d_A(p), d_B(p))$$
where $P$ is the set of all publishers in $A \cup B$, and $d_A(p)$ and $d_B(p)$ are the percentage shares of a given publisher in sets $A$ and $B$, respectively.
Some examples:
sim({60% X, 40% Y}, {60% X, 40% Y}) = 1.0
sim({60% X, 40% Y}, {40% X, 60% Y}) = 0.8
sim(entire processed set, selected set) = 0.78
sim({30% X, 70% Y}, {100% Z}) = 0.0
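The similarity measure can be checked against the toy examples with a short sketch. The function names and the dictionary representation (publisher name mapped to its share, expressed as a fraction between 0 and 1) are assumptions made for illustration.

```python
from collections import Counter


def publisher_shares(publishers):
    """Turn a list with one publisher name per document into fractional shares."""
    counts = Counter(publishers)
    total = sum(counts.values())
    return {p: n / total for p, n in counts.items()}


def publisher_similarity(shares_a, shares_b):
    """Sum, over all publishers in either set, of the smaller of the two shares."""
    all_publishers = set(shares_a) | set(shares_b)
    return sum(
        min(shares_a.get(p, 0.0), shares_b.get(p, 0.0)) for p in all_publishers
    )


a = {"X": 0.6, "Y": 0.4}
b = {"X": 0.4, "Y": 0.6}
print(round(publisher_similarity(a, a), 2))  # identical distributions
print(round(publisher_similarity(a, b), 2))  # partially overlapping
print(round(publisher_similarity({"X": 0.3, "Y": 0.7}, {"Z": 1.0}), 2))  # disjoint
```

A publisher missing from one set contributes a share of 0 there, so disjoint sets score 0 and identical distributions score 1.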
43. Rules
a zone containing both title and authors → title author
page numbers from range 1–n → page number
... → figure
table captions → table
small zones lying in the close neighbourhood of table zones → table
zones that occur on every page or every odd/even page and are placed close to the top or bottom of the page → bib info
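As an illustration of how one such rule might be expressed, here is a minimal sketch of the last rule (repeated zones near the page margins become bib info). It is not the authors' implementation: the `Zone` fields, the exact-text comparison, the `margin` fraction of 0.1 and the requirement that the text occurs on at least two pages are all assumptions.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Zone:
    page: int      # 0-based page index
    text: str
    y: float       # top edge of the zone, 0 = top of the page
    height: float
    label: str = "unknown"


def relabel_repeated_margin_zones(zones, n_pages, page_height, margin=0.1):
    """Label as "bib info" the zones that sit near the top or bottom of a page
    and whose text repeats on every page, every odd page or every even page."""

    def in_margin(z):
        near_top = z.y <= margin * page_height
        near_bottom = z.y + z.height >= (1 - margin) * page_height
        return near_top or near_bottom

    # Collect, for each margin text, the set of pages it appears on.
    pages_by_text = defaultdict(set)
    for z in zones:
        if in_margin(z):
            pages_by_text[z.text.strip()].add(z.page)

    all_pages = set(range(n_pages))
    odd = {p for p in all_pages if p % 2 == 1}
    even = all_pages - odd

    for z in zones:
        if not in_margin(z):
            continue
        pages = pages_by_text[z.text.strip()]
        repeats = len(pages) >= 2 and (
            pages >= all_pages or pages >= odd or pages >= even
        )
        if repeats:
            z.label = "bib info"
    return zones
```

Real running headers often differ slightly between pages, so a practical version would probably compare texts approximately rather than exactly.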
47. The evaluation
manual evaluation: using a small random sample of documents
indirect evaluation: evaluating the performance of CERMINE trained on GROTOAP2
... files with the names of the fonts,
assigning more specific body labels, e.g. section titles,
generating a dataset of parsed bibliographic references in a similar way.
56. Thank you
Thank you!
Questions?
Dominika Tkaczyk
d.tkaczyk@icm.edu.pl
© 2014 Dominika Tkaczyk. This document is distributed under the Creative Commons Attribution 3.0 license.
The complete text of the license can be seen here: http://creativecommons.org/licenses/by/3.0/