GROTOAP2 - The methodology of creating a large ground truth dataset of scientific articles

1. GROTOAP2: The methodology of creating a large ground truth dataset of scientific articles. Dominika Tkaczyk, Paweł Szostek and Łukasz Bolikowski, Interdisciplinary Centre for Mathematical and Computational Modelling, University of Warsaw. 3rd International Workshop on Mining Scientific Publications, 12 September 2014. D.Tkaczyk et al. (ICM UW) CERMINE WOSP 12 September 2014 1 / 23

4. Background. CERMINE extracts a document's metadata, bibliographic references, and structured full text. CERMINE needs a training set for its zone classifiers! [Diagram: CERMINE workflow from a PDF file through basic structure extraction, metadata extraction, references extraction and text extraction to XML/JATS output.]

6. Requirements. A good dataset for document region classification should be: large; diverse; preserving document text and the way text is displayed; with fine-grained labels; open.

9. GROTOAP. The GROTOAP dataset: 113 documents, 1,031 pages, 20,121 zones, 20 zone labels, 12 publishers. Created by automatic tools plus manual correction of every document: non-scalable, 100% accurate.

10. GROTOAP vs. GROTOAP2
               GROTOAP                                                 GROTOAP2
documents      113                                                     13,210
pages          1,031                                                   119,334
zones          20,121                                                  1,640,973
zone labels    20                                                      22
publishers     12                                                      208
created by     automatic tools + manual correction of every document   automatic tools + manually developed correction rules
scalability    non-scalable                                            scalable
accuracy       100%                                                    93%

11. The content. GROTOAP2 is composed of: 13,210 ground-truth files in XML format storing the content of scientific publications from PubMed Central; a list of URLs to the corresponding PDF files; a bash script for downloading the PDF files from the PMC repository.

16. The model. The document model in GROTOAP2 contains: a geometric hierarchical structure (pages, zones, lines, words and characters); the text content of all the objects; their dimensions and positions; the reading order; zone labels.
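The hierarchy listed on this slide can be pictured as nested records. The sketch below is only an illustration: the class and field names (`Box`, `Char`, `Word`, `Line`, `Zone`, `Page`) are assumptions made for this example and are not taken from CERMINE or the GROTOAP2 files. List order stands in for the reading order.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Box:
    """Axis-aligned bounding box given by two opposite corners (hypothetical type)."""
    x0: float
    y0: float
    x1: float
    y1: float

    @property
    def width(self) -> float:
        return self.x1 - self.x0


@dataclass
class Char:
    text: str
    box: Box


@dataclass
class Word:
    box: Box
    chars: List[Char] = field(default_factory=list)  # stored in reading order

    @property
    def text(self) -> str:
        return "".join(c.text for c in self.chars)


@dataclass
class Line:
    box: Box
    words: List[Word] = field(default_factory=list)

    @property
    def text(self) -> str:
        return " ".join(w.text for w in self.words)


@dataclass
class Zone:
    box: Box
    label: str  # e.g. "TITLE" or "BODY_CONTENT"
    lines: List[Line] = field(default_factory=list)

    @property
    def text(self) -> str:
        return "\n".join(line.text for line in self.lines)


@dataclass
class Page:
    zones: List[Zone] = field(default_factory=list)


# A one-character title zone, using the coordinates shown on the TrueViz slide.
char = Char("B", Box(55.4, 34.3, 74.1, 58.3))
word = Word(Box(55.4, 34.3, 115.3, 58.3), [char])
line = Line(Box(55.4, 34.3, 250.5, 58.3), [word])
zone = Zone(Box(55.4, 34.3, 250.5, 58.3), "TITLE", [line])
page = Page([zone])
```

Text is derived from the characters rather than stored at every level, so the geometric tree stays the single source of truth.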

17. Zone labels. front: type, title, author, title author, editor, affiliation, abstract, keywords, bib info, dates, correspondence, glossary, copyright. body: body content, figure, table, equation. back: references, acknowledgment, conflict statement. other: page number, unknown. [Bar chart: % of documents (0 to 100) containing each zone label; labels as listed on the chart: BIB_INFO, BODY_CONTENT, REFERENCES, AFFILIATION, PAGE_NUMBER, ABSTRACT, TITLE, COPYRIGHT, ACKNOWLEDGMENT, AUTHOR, DATES, UNKNOWN, TABLE, TYPE, KEYWORDS, FIGURE, CORRESPONDENCE, CONFLICT_STATEMENT, EDITOR, TITLE_AUTHOR, GLOSSARY, EQUATION.]

19. TrueViz format
<Document>
  <Page>
    <PageID Value="0"/>
    <PageNext Value="1"/>
    <Zone>
      <ZoneID Value="0"/>
      <ZoneNext Value="1"/>
      <ZoneCorners>
        <Vertex x="55.4" y="34.3"/>
        <Vertex x="250.5" y="58.3"/>
      </ZoneCorners>
      <Classification>
        <Category Value="TITLE"/>
        <Type Value=""/>
      </Classification>
      <Line>
        <LineID Value="0"/>
        <LineNext Value="1"/>
        <LineCorners>
          <Vertex x="55.4" y="34.3"/>
          <Vertex x="250.5" y="58.3"/>
        </LineCorners>
        <Word>
          <WordID Value="0"/>
          <WordNext Value="1"/>
          <WordCorners>
            <Vertex x="55.4" y="34.3"/>
            <Vertex x="115.3" y="58.3"/>
          </WordCorners>
          <Character>
            <CharacterID Value="0"/>
            <CharacterNext Value="1"/>
            <CharacterCorners>
              <Vertex x="55.4" y="34.3"/>
              <Vertex x="74.1" y="58.3"/>
            </CharacterCorners>
            <GT_Text Value="B"/>
          </Character>
          [...]
        </Word>
        [...]
      </Line>
      [...]
    </Zone>
    [...]
  </Page>
  [...]
</Document>
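A file in this format can be read with a standard XML parser. The sketch below uses Python's `xml.etree.ElementTree` on a shortened sample assembled from the element names on the slide. The helper name `read_zones` and the returned dictionary layout are assumptions for illustration, and real GROTOAP2 files may carry details not shown on the slide.

```python
import xml.etree.ElementTree as ET

# Shortened sample built from the element names shown on the slide.
SAMPLE = """<Document>
  <Page>
    <PageID Value="0"/>
    <Zone>
      <ZoneID Value="0"/>
      <ZoneCorners>
        <Vertex x="55.4" y="34.3"/>
        <Vertex x="250.5" y="58.3"/>
      </ZoneCorners>
      <Classification>
        <Category Value="TITLE"/>
        <Type Value=""/>
      </Classification>
      <Line>
        <Word>
          <Character><GT_Text Value="B"/></Character>
          <Character><GT_Text Value="y"/></Character>
        </Word>
      </Line>
    </Zone>
  </Page>
</Document>"""


def read_zones(xml_text):
    """Return one dict per zone with its label, corner points and text."""
    root = ET.fromstring(xml_text)
    zones = []
    for zone in root.iter("Zone"):
        label = zone.find("Classification/Category").get("Value")
        corners = [
            (float(v.get("x")), float(v.get("y")))
            for v in zone.findall("ZoneCorners/Vertex")
        ]
        lines = []
        for line in zone.findall("Line"):
            # A word's text is the concatenation of its characters' GT_Text values.
            words = [
                "".join(ch.find("GT_Text").get("Value") for ch in word.findall("Character"))
                for word in line.findall("Word")
            ]
            lines.append(" ".join(words))
        zones.append({"label": label, "corners": corners, "text": "\n".join(lines)})
    return zones


zones = read_zones(SAMPLE)
```

Joining words with spaces and lines with newlines is a simplification chosen here; the format itself stores only characters and their boxes.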


32. The method. [Diagram: processing pipeline. PDF and NLM files from PubMed Central pass through CERMINE tools, zone text matching, and rules.]

34. Structure extraction. CERMINE tools were used to: extract individual characters and their bounding boxes from PDF files; group individual characters into words, lines and zones; compute the reading order of all the elements.

37. Zone text matching. Labels were assigned to zones as follows: the text content of each zone was matched against the corresponding NLM files; the Smith-Waterman sequence alignment algorithm was used to measure string similarity; the label was chosen by selecting the string with the highest similarity score above a threshold; an additional attempt was made to assign a label to every unknown zone based on the labels of the neighbouring zones.
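The matching step can be sketched with a plain Smith-Waterman local alignment score. The slide does not state the scoring scheme, the normalisation or the threshold, so the values below (match +2, mismatch -1, gap -1, normalisation by the shorter string, threshold 0.7) and the function names are assumptions chosen only to make the example concrete.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    """Best local alignment score between strings a and b (linear gap penalty)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Local alignment: a cell never drops below zero.
            cur[j] = max(0, diag, prev[j] + gap, cur[j - 1] + gap)
            best = max(best, cur[j])
        prev = cur
    return best


def similarity(a, b, match=2):
    """Alignment score scaled to [0, 1] by the best score the shorter string allows."""
    shorter = min(len(a), len(b))
    if shorter == 0:
        return 0.0
    return smith_waterman(a, b, match=match) / (match * shorter)


def assign_label(zone_text, candidates, threshold=0.7):
    """Pick the label whose reference text scores highest above the threshold."""
    best_label, best_score = "UNKNOWN", threshold
    for label, reference in candidates.items():
        score = similarity(zone_text, reference)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical reference strings standing in for fields of an NLM record.
nlm_fields = {"TITLE": "A study of zones", "ABSTRACT": "We present a dataset"}
label = assign_label("A study of zones", nlm_fields)
```

Normalising by the shorter string means a short zone fully contained in a long reference text scores 1.0, which suits zones that hold only part of a field but can favour very short strings.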

39. Document filtering. 43% of all processed documents have at least 90% of zones labelled. [Histogram: x-axis, percentage of labelled zones (0 to 100); y-axis, fraction of documents in bin (0.00 to 0.07).]
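A filter of this kind can be sketched as below, assuming that a document is kept when at least 90% of its zones carry a label other than unknown. The function names and the representation of a document as a list of zone labels are hypothetical.

```python
def labelled_fraction(zone_labels):
    """Fraction of zones whose label is not UNKNOWN."""
    if not zone_labels:
        return 0.0
    return sum(label != "UNKNOWN" for label in zone_labels) / len(zone_labels)


def filter_documents(documents, min_fraction=0.9):
    """Keep documents whose labelled fraction reaches min_fraction."""
    return {
        doc_id: labels
        for doc_id, labels in documents.items()
        if labelled_fraction(labels) >= min_fraction
    }


# Two toy documents: 9 of 10 zones labelled, and 1 of 2 zones labelled.
documents = {
    "doc-a": ["BODY_CONTENT"] * 9 + ["UNKNOWN"],
    "doc-b": ["TITLE", "UNKNOWN"],
}
selected = filter_documents(documents)
```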

41. Distribution similarity. The publisher distribution similarity of two datasets A and B can be calculated as:
$$\mathrm{sim}(A, B) = \sum_{p \in P} \min\bigl(d_A(p), d_B(p)\bigr)$$
where $P$ is the set of all publishers in $A \cup B$, and $d_A(p)$ and $d_B(p)$ are the percentage shares of a given publisher $p$ in sets $A$ and $B$, respectively. Some examples:
sim({60% X, 40% Y}, {60% X, 40% Y}) = 1.0
sim({60% X, 40% Y}, {40% X, 60% Y}) = 0.8
sim(entire processed set, selected set) = 0.78
sim({30% X, 70% Y}, {100% Z}) = 0.0
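The measure is short enough to write out directly. In this sketch the shares are fractions between 0 and 1 rather than percentages, and the function names are chosen for illustration only.

```python
from collections import Counter


def publisher_shares(publishers):
    """Map each publisher to its share of the documents (one list entry per document)."""
    counts = Counter(publishers)
    total = sum(counts.values())
    return {publisher: n / total for publisher, n in counts.items()}


def distribution_similarity(shares_a, shares_b):
    """Sum, over all publishers in either set, of the smaller of the two shares."""
    publishers = set(shares_a) | set(shares_b)
    return sum(
        min(shares_a.get(p, 0.0), shares_b.get(p, 0.0)) for p in publishers
    )


# The second example from the slide: the two sets overlap by 0.4 on X and 0.4 on Y.
swapped = distribution_similarity({"X": 0.6, "Y": 0.4}, {"X": 0.4, "Y": 0.6})
# The last example: no publisher in common.
disjoint = distribution_similarity({"X": 0.3, "Y": 0.7}, {"Z": 1.0})
```

A publisher missing from one set contributes a share of zero, so it adds nothing to the sum.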

43. Rules. A zone containing both title and authors → title author; page numbers from the range 1–n → page number; figure captions → figure; table captions → table; small zones lying in the close neighbourhood of table zones → table; zones that occur on every page or every odd/even page and are placed close to the top or bottom of the page → bib info.
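Two of these rules are simple enough to sketch: the page number rule and the repeated header/footer rule. The slide gives no implementation details, so everything concrete below is an assumption: zones are plain dictionaries, n is taken to be the page count, "close to the top or bottom" means within 10% of the page height, and repetition is detected by identical text.

```python
def apply_rules(zones, n_pages, page_height, margin=0.1):
    """Return copies of the zones with labels corrected by two example rules."""
    # For zones near the top or bottom edge, record the pages each text occurs on.
    edge_pages = {}
    for z in zones:
        near_edge = (
            z["y"] < margin * page_height or z["y"] > (1 - margin) * page_height
        )
        if near_edge:
            edge_pages.setdefault(z["text"], set()).add(z["page"])

    all_pages = set(range(1, n_pages + 1))
    odd_pages = {p for p in all_pages if p % 2 == 1}
    even_pages = all_pages - odd_pages

    corrected = []
    for z in zones:
        label = z["label"]
        text = z["text"].strip()
        if text.isascii() and text.isdigit() and 1 <= int(text) <= n_pages:
            # Rule: a number from the range 1..n becomes a page number.
            label = "PAGE_NUMBER"
        elif edge_pages.get(z["text"]) in (all_pages, odd_pages, even_pages):
            # Rule: edge zones repeated on every (odd/even) page become bib info.
            label = "BIB_INFO"
        corrected.append({**z, "label": label})
    return corrected


# Toy three-page document; "y" is the distance of the zone from the top of the page.
zones = [
    {"page": 1, "y": 20.0, "text": "J. Example 2014", "label": "UNKNOWN"},
    {"page": 2, "y": 20.0, "text": "J. Example 2014", "label": "UNKNOWN"},
    {"page": 3, "y": 20.0, "text": "J. Example 2014", "label": "UNKNOWN"},
    {"page": 2, "y": 780.0, "text": "2", "label": "BODY_CONTENT"},
    {"page": 2, "y": 400.0, "text": "Some paragraph.", "label": "BODY_CONTENT"},
]
fixed = apply_rules(zones, n_pages=3, page_height=800.0)
```

The function returns new dictionaries instead of mutating its input, which makes it easy to compare labels before and after the rules.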

47. The evaluation. Manual evaluation: using a small random sample of documents. Indirect evaluation: evaluating the performance of CERMINE trained on GROTOAP2.

48. Manual evaluation
                      without rules               with rules
label                 prec.   recall  F-score     prec.   recall  F-score
abstract              0.93    0.96    0.94        0.98    0.98    0.98
acknowledgement       0.98    0.67    0.80        1.0     0.90    0.95
affiliation           0.77    0.90    0.83        0.95    0.95    0.95
author                0.85    0.95    0.90        1.0     0.98    0.99
bib info              0.95    0.45    0.62        0.96    0.94    0.95
body content          0.65    0.98    0.79        0.88    0.99    0.93
conflict statement    0.63    0.24    0.35        0.82    0.89    0.85
copyright             0.71    0.94    0.81        0.93    0.78    0.85
correspondence        1.0     0.72    0.84        1.0     0.97    0.99
dates                 0.28    1.0     0.44        0.94    1.0     0.97
editor                -       0       -           1.0     1.0     1.0
equation              -       -       -           -       -       -
figure                0.99    0.36    0.53        0.99    0.46    0.63
glossary              1.0     1.0     1.0         1.0     1.0     1.0
keywords              0.94    0.94    0.94        1.0     0.94    0.97
page number           0.99    0.53    0.69        0.98    0.97    0.98
references            0.91    0.95    0.93        0.99    0.95    0.97
table                 0.98    0.83    0.90        0.98    0.96    0.97
title                 0.51    1.0     0.67        1.0     1.0     1.0
title author          -       0       -           1.0     1.0     1.0
type                  0.76    0.46    0.57        0.89    0.47    0.62
unknown               0.22    0.46    0.30        0.62    0.94    0.75
average               0.79    0.68    0.73        0.95    0.91    0.92
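Per-label precision, recall and F-score of this kind can be computed from two parallel lists of zone labels, one from the manual ground truth and one from the dataset. The sketch below assumes the standard definitions and returns `None` where a value is undefined, which corresponds to the "-" entries in the table; the function name is hypothetical.

```python
def precision_recall_f(gold, predicted, label):
    """Precision, recall and F-score for one label; None where undefined."""
    pairs = list(zip(gold, predicted))
    tp = sum(g == label and p == label for g, p in pairs)
    fp = sum(g != label and p == label for g, p in pairs)
    fn = sum(g == label and p != label for g, p in pairs)

    precision = tp / (tp + fp) if tp + fp else None  # label never predicted
    recall = tp / (tp + fn) if tp + fn else None     # label never in ground truth
    if precision is None or recall is None or precision + recall == 0:
        f_score = None
    else:
        # Harmonic mean of precision and recall.
        f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


# Toy data: one TITLE zone was mislabelled as AUTHOR.
gold = ["TITLE", "TITLE", "AUTHOR", "AUTHOR"]
predicted = ["TITLE", "AUTHOR", "AUTHOR", "AUTHOR"]
author_scores = precision_recall_f(gold, predicted, "AUTHOR")
```

A label that is present in the ground truth but never predicted yields an undefined precision, a recall of 0 and an undefined F-score, the same pattern as the editor row without rules.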

50. CERMINE-based evaluation
                precision   recall    F-score
title           93.05%      88.40%    90.67%
author          94.38%      90.01%    92.14%
affiliation     84.20%      78.03%    81.00%
abstract        85.24%      83.67%    84.45%
keywords        87.98%      65.30%    74.96%
journal name    71.88%      63.40%    67.38%
volume          96.28%      93.20%    94.72%
issue           49.12%      55.67%    52.19%
pages           47.41%      45.79%    46.59%
year            99.79%      97.80%    98.29%
DOI             96.12%      85.34%    90.41%
average         82.22%      76.96%    79.34%

51. CERMINE-based evaluation
             GROTOAP    GROTOAP2 without rules    GROTOAP2 with rules
Precision    77.13%     81.88%                    82.22%
Recall       55.99%     70.94%                    76.96%
F-score      62.41%     75.38%                    79.34%

52. Future work. Enriching the ground truth files with the names of the fonts; assigning more specific body labels, e.g. section titles; generating a dataset of parsed bibliographic references in a similar way.

55. Links. GROTOAP2: http://cermine.ceon.pl/grotoap2/ CERMINE web service: http://cermine.ceon.pl

56. Thank you! Questions? Dominika Tkaczyk, d.tkaczyk@icm.edu.pl. © 2014 Dominika Tkaczyk. This document is distributed under the Creative Commons Attribution 3.0 license. The complete text of the license can be seen here: http://creativecommons.org/licenses/by/3.0/
