
A Generic Approach for Reference Extraction from PDF Documents


My invited talk at the Workshop on Open Citation, held in Bologna, Italy



  1. A Generic Approach for Reference Extraction from PDF Documents. Zeyd Boukhers (boukhers@uni-koblenz.de), WP1 Status. Bologna, September 04, 2018
  2. Reference Extraction and Segmentation
      EXParser: http://excite.west.uni-koblenz.de:8081/excite
      Code: https://github.com/exciteproject/Exparser
      • Reference String Extraction
      • Reference String Segmentation
  4. Introduction: Motivation
      • About 2.5 million scholarly articles were published worldwide in 2015.
      • Elsevier publications from 2009 to 2014 were cited 11.5 million times in the same period.
      Source: https://www.elsevier.com/connect/elsevier-publishing-a-look-at-the-numbers-and-more
  5. Standard Pipeline For Reference Extraction
      [*] Dominika Tkaczyk et al. 2018. Machine Learning vs. Rules and Out-of-the-Box vs. Retrained: An Evaluation of Open-Source Bibliographic Reference and Citation Parsers. In Proc. of JCDL '18.
  8. Introduction: Motivation
      • Different styles of references (i.e. intrinsic and extrinsic differences).
      • More than one section containing the references.
      • Different representations of references (e.g. abbreviated contents).
      • Other types of references (e.g. grey literature, databases and websites).
      • Other languages (e.g. German, French).
  9. Problem 1/5: Example of Intrinsic Differences [Figure: examples labelled P26, P27, P28, P29]
  12. Problem 1/5: Example of Extrinsic Differences
  14. Problem 2/5: Multi-Reference Sections [Figure: examples labelled P14, P40, P101]
  22. Problem 3/5: Different Representations
  24. Problem 4/5: Other Types of References
  26. Problem 5/5: Different Languages
  27. Generic Pipeline For Reference Extraction
      Problems:
      • Error accumulation.
      • Intrinsic style differences.
      • Extrinsic style differences.
      • Different locations of references.
      Objectives:
      • Optimized pipeline.
      • Generic features.
      • More correlation among the pipeline phases.
  29. Generic Pipeline For Reference Extraction
      Each line is classified as one of:
      • 0: non-reference line.
      • 1: first line of a reference.
      • 2: intermediate line of a reference.
      • 3: last line of a reference.
      Lines are then combined and segmented simultaneously until they form a consistent reference.
  30. Line Classification
      • Number of tokens.
      • Number of digits.
      • Amount of punctuation.
      • …
      • Whether it starts with a capital letter.
      • Whether it contains a year format.
      • …
      • Whether it contains a city (from a large data-table).
      • Whether it contains an author name (from a large data-table).
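
The line-level features listed above could be computed roughly as follows. This is a minimal sketch, not EXParser's actual code: the function name `line_features`, the year pattern and the tiny `CITIES` set (a stand-in for the large data-table) are assumptions made for illustration.

```python
import re
import string

# Hypothetical stand-in for the large city data-table mentioned on the slide.
CITIES = {"berlin", "bologna", "koblenz", "frankfurt"}

# Assumed year format: a four-digit year between 1800 and 2099.
YEAR_RE = re.compile(r"\b(18|19|20)\d{2}\b")


def line_features(line: str) -> dict:
    """Return a few of the line-level features named on the slide."""
    tokens = line.split()
    # Lower-cased tokens with surrounding punctuation removed, for table lookups.
    words = {t.strip(string.punctuation).lower() for t in tokens}
    stripped = line.lstrip()
    return {
        "n_tokens": len(tokens),
        "n_digits": sum(ch.isdigit() for ch in line),
        "n_punct": sum(ch in string.punctuation for ch in line),
        "starts_capital": bool(stripped) and stripped[0].isupper(),
        "has_year": YEAR_RE.search(line) is not None,
        "has_city": not words.isdisjoint(CITIES),
    }


# Made-up example line, used only to exercise the function.
features = line_features("Boukhers, Z. (2018). Reference extraction. Koblenz: WeST.")
```

Each feature is cheap and language-independent, which fits the stated goal of generic features; the table lookups are the only part that depends on external resources.
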
  31. Example of Generic Characteristics [Figure: bar chart for the words "und", "der", "hrsg", "das", "verlag", "unter", "eds"; series: Freq in Ref, Freq in non-Ref x0.1. Caption: Frequency of the most frequent words in reference strings and their frequency in non-reference strings.]
  33. Classification: Training
      • The features extracted from the training dataset are used to train a Random Forest model.
  34. Classification: Testing
      • The model is employed to classify each line as a non-ref line (0), first-ref line (1), intermediate-ref line (2) or last-ref line (3).
  35. Classification: Filtering
      • The irrelevant lines are discarded with a filtering process.
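
To make the four line labels concrete, the sketch below groups an already-labelled sequence of lines into candidate reference strings. It is a deliberately simplified, deterministic illustration: the talk combines lines probabilistically, and the function name `group_reference_lines` and the rule that a first-line label always opens a new reference are assumptions, not EXParser's behaviour.

```python
from typing import List, Tuple

NON_REF, FIRST, INTERMEDIATE, LAST = 0, 1, 2, 3


def group_reference_lines(labelled: List[Tuple[str, int]]) -> List[str]:
    """Join consecutive reference lines into candidate reference strings."""
    references: List[str] = []
    current: List[str] = []
    for text, label in labelled:
        if label == NON_REF:
            # A non-reference line closes any reference that is still open.
            if current:
                references.append(" ".join(current))
                current = []
            continue
        if label == FIRST and current:
            # A first-line label starts a new reference.
            references.append(" ".join(current))
            current = []
        current.append(text.strip())
        if label == LAST:
            references.append(" ".join(current))
            current = []
    if current:
        references.append(" ".join(current))
    return references


# Toy input with made-up text; the labels would come from the classifier.
lines = [
    ("References", NON_REF),
    ("Author A. 2001. Title one,", FIRST),
    ("Journal X 1(2).", LAST),
    ("Author B. 2002. Title two.", FIRST),
]
refs = group_reference_lines(lines)
```
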
  36. Reference Segmentation
      • Number of characters.
      • Ratio of capital letters.
      • Whether it contains digits.
      • Whether it is followed by a comma.
      • Whether it is between parentheses.
      • Whether it is a city.
      • Whether it is a stop word.
      • etc.
      [*] For more details about the considered features: https://github.com/exciteproject/Exparser
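
The token-level features above can be illustrated in the same spirit. The sketch assumes whitespace tokenisation, so "followed by a comma" becomes "the token ends with a comma" and "between parentheses" only covers a single token wrapped in parentheses; `token_features`, `CITIES` and `STOP_WORDS` are hypothetical names, and the real feature set is documented in the repository linked on the slide.

```python
import string

# Hypothetical stand-ins for the city and stop-word tables.
CITIES = {"berlin", "bologna", "koblenz"}
STOP_WORDS = {"and", "the", "of", "in", "und", "der", "das"}


def token_features(token: str) -> dict:
    """Return a few token-level features for one whitespace-separated token."""
    core = token.strip(string.punctuation)  # token without surrounding punctuation
    letters = [ch for ch in core if ch.isalpha()]
    capital_ratio = sum(ch.isupper() for ch in letters) / len(letters) if letters else 0.0
    return {
        "n_chars": len(core),
        "capital_ratio": capital_ratio,
        "has_digit": any(ch.isdigit() for ch in core),
        "followed_by_comma": token.endswith(","),
        "in_parentheses": token.startswith("(") and token.rstrip(".,;:").endswith(")"),
        "is_city": core.lower() in CITIES,
        "is_stop_word": core.lower() in STOP_WORDS,
    }


year_token = token_features("(2018),")
city_token = token_features("Koblenz:")
```
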
  37. Reference Segmentation & Identification
      • Start with the line that has the highest reference probability.
      • Compute the acceptance ratio 𝑎 (segmentation probability, completeness probability, line-extraction probability).
      • Randomly add a neighbouring line (up or down) and compute 𝑎 again (𝑎 old vs. 𝑎 new).
      • The new sample is accepted if it is better, and rejected otherwise.
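
The steps above can be sketched as a small search loop. This is an illustration under stated assumptions, not the EXParser implementation: `grow_reference` and `toy_ratio` are hypothetical names, "better" is read as an acceptance ratio above 1, and the toy ratio simply uses the odds that the newly added line is a reference line in place of the real three-factor ratio.

```python
import random
from typing import Callable, List, Tuple

Span = Tuple[int, int]  # inclusive (first_line, last_line) indices


def grow_reference(
    ref_prob: List[float],
    acceptance_ratio: Callable[[Span, Span], float],
    n_steps: int = 50,
    seed: int = 0,
) -> Span:
    """Grow a span of lines around the most reference-like line."""
    rng = random.Random(seed)
    # Start with the line that has the highest reference probability.
    start = max(range(len(ref_prob)), key=ref_prob.__getitem__)
    span: Span = (start, start)
    for _ in range(n_steps):
        lo, hi = span
        # Randomly propose the neighbouring line above or below.
        proposal = (lo - 1, hi) if rng.random() < 0.5 else (lo, hi + 1)
        if proposal[0] < 0 or proposal[1] >= len(ref_prob):
            continue  # the proposal would run past the available lines
        # Keep the proposal only if it is better than the current span.
        if acceptance_ratio(span, proposal) > 1.0:
            span = proposal
    return span


# Toy per-line reference probabilities (made up for this example).
REF_PROB = [0.1, 0.8, 0.9, 0.7, 0.05]


def toy_ratio(old: Span, new: Span) -> float:
    """Stand-in ratio: odds that the newly added line is a reference line."""
    added = new[0] if new[0] < old[0] else new[1]
    p = REF_PROB[added]
    return p / (1.0 - p)


span = grow_reference(REF_PROB, toy_ratio)
```

With the toy probabilities the search starts at line 2 and ends on the span (1, 3), because lines 0 and 4 lower the score and are rejected.
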
  51. Acceptance Ratio
      $a = \frac{P_o(l \mid r_{j+1})}{P_o(l \mid r_j)} \cdot \frac{P_c(r_{j+1})}{P_c(r_j)} \cdot \frac{P_b(r_{j+1})}{P_b(r_j)}$
      • $P_o(l \mid r_{j+1})$: the product of the components' probabilities of the initial line given the new line combination.
      • $P_o(l \mid r_j)$: the product of the components' probabilities of the initial line given the previous line combination.
      • $P_c(r_{j+1})$: the probability that the new reference string is complete.
      • $P_c(r_j)$: the probability that the previous reference string is complete.
      • $P_b(r_{j+1})$: the probability that the new reference string is determined with borderlines.
      • $P_b(r_j)$: the probability that the previous reference string is determined with borderlines.
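
As a worked example of the ratio, the sketch below multiplies the three new-over-previous factors. The function name, its parameters and all probability values are made up for illustration; only the shape of the formula comes from the slide.

```python
from math import prod
from typing import Sequence


def acceptance_ratio(
    comp_new: Sequence[float],
    comp_old: Sequence[float],
    complete_new: float,
    complete_old: float,
    border_new: float,
    border_old: float,
) -> float:
    """Product of the three new-over-previous probability ratios."""
    p_o_new = prod(comp_new)  # P_o(l | r_{j+1}): product of component probabilities
    p_o_old = prod(comp_old)  # P_o(l | r_j)
    return (p_o_new / p_o_old) * (complete_new / complete_old) * (border_new / border_old)


# Made-up probabilities: adding the line improves segmentation and completeness.
a = acceptance_ratio(
    comp_new=[0.9, 0.8],
    comp_old=[0.9, 0.6],
    complete_new=0.6,
    complete_old=0.3,
    border_new=0.5,
    border_old=0.5,
)
```

Here the factors are 0.72 / 0.54, 0.6 / 0.3 and 0.5 / 0.5, so `a` is about 2.67 and the new line combination would be accepted.
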
  55. Probability of Completeness [Figure: bar chart over component combinations Comb.1 … Comb.n; label: Existence of Components]
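
One possible reading of this chart is that the completeness probability of a candidate reference is estimated from how often its combination of components occurs among known references. The sketch below follows that reading only; the interpretation, the name `completeness_model` and the toy training data are all assumptions rather than the method described in the talk.

```python
from collections import Counter
from typing import Callable, Iterable


def completeness_model(training: Iterable[Iterable[str]]) -> Callable[[Iterable[str]], float]:
    """Estimate completeness as the relative frequency of a component combination."""
    counts = Counter(frozenset(components) for components in training)
    total = sum(counts.values())

    def p_complete(components: Iterable[str]) -> float:
        # Combinations never seen in training get probability 0.
        return counts[frozenset(components)] / total if total else 0.0

    return p_complete


# Toy training data: the set of components found in each annotated reference.
training = [
    {"author", "year", "title", "source"},
    {"author", "year", "title", "source"},
    {"author", "year", "title", "publisher"},
    {"author", "title"},
]
p_complete = completeness_model(training)
```
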
  63. Results: Reference Line Extraction
      Table 1. Evaluation of reference string extraction using 10-fold cross-validation for the proposed and baseline methods.
      | Metric    | CER-D | CER-T | Pars-D | Pars-M | GRO-D | GRO-T | RefExt-T | Proposed |
      |-----------|-------|-------|--------|--------|-------|-------|----------|----------|
      | Precision | 0.296 | 0.303 | 0.558  | 0.617  | 0.627 | 0.847 | 0.879    | 0.874    |
      | Recall    | 0.233 | 0.220 | 0.552  | 0.595  | 0.718 | 0.839 | 0.906    | 0.973    |
      | F1-Score  | 0.245 | 0.235 | 0.542  | 0.590  | 0.650 | 0.837 | 0.885    | 0.921    |
      Table 2. Evaluation of reference string extraction using 10-fold cross-validation for different classifiers.
      | Metric    | SVM (C=100) | SVM (Default Parameters) | Random Forest | Gaussian Naive Bayes |
      |-----------|-------------|--------------------------|---------------|----------------------|
      | Precision | 0.713       | 0.624                    | 0.874         | 0.809                |
      | Recall    | 0.925       | 0.898                    | 0.973         | 0.8                  |
      | F1-Score  | 0.805       | 0.736                    | 0.921         | 0.804                |
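
As a quick sanity check on the classifier comparison, the F1-scores in Table 2 can be recomputed from the reported precision and recall as their harmonic mean. The helper name `f1_score` is ours; the numbers are copied from Table 2, and the check is applied to that table only.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) per classifier, as listed in Table 2.
table2 = {
    "SVM (C=100)": (0.713, 0.925, 0.805),
    "SVM (Default Parameters)": (0.624, 0.898, 0.736),
    "Random Forest": (0.874, 0.973, 0.921),
    "Gaussian Naive Bayes": (0.809, 0.8, 0.804),
}

# Recompute each F1 and round to the three decimals used in the table.
recomputed = {name: round(f1_score(p, r), 3) for name, (p, r, _) in table2.items()}
```

For all four classifiers the recomputed value matches the reported F1-score to three decimals.
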
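  64. Results: Reference Segmentation
      Table 3. Evaluation of reference parsing on 304 references (Cermine with default training).
      | Tag                | Mean Precision (Proposed) | Mean Precision (Cermine) | Mean Recall (Proposed) | Mean Recall (Cermine) | F score (Proposed) | F score (Cermine) |
      |--------------------|---------------------------|--------------------------|------------------------|-----------------------|--------------------|-------------------|
      | Publisher          | 0.805                     |                          | 0.455                  |                       | 0.581              |                   |
      | Editor             | 0.902                     |                          | 0.711                  |                       | 0.795              |                   |
      | Page (inc FP & LP) | 0.959                     | 0.765                    | 0.932                  | 0.890                 | 0.946              | 0.823             |
      | Volume             | 0.806                     | 0.871                    | 0.830                  | 0.675                 | 0.818              | 0.761             |
      | First Name         | 0.865                     | 0.216                    | 0.824                  | 0.761                 | 0.844              | 0.336             |
      | Last Name          | 0.869                     | 0.596                    | 0.917                  | 0.955                 | 0.892              | 0.734             |
      | Source             | 0.631                     | 0.669                    | 0.783                  | 0.543                 | 0.699              | 0.6               |
      | Year               | 0.903                     | 0.862                    | 0.980                  | 0.884                 | 0.940              | 0.873             |
      | Title              | 0.942                     | 0.872                    | 0.901                  | 0.856                 | 0.921              | 0.864             |
      | Other              | 0.770                     |                          | 0.789                  |                       | 0.779              |                   |
      | Average / Total    | 0.85357143                | 0.693                    | 0.881                  | 0.79485714            | 0.86571429         | 0.713             |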
  65. Results: Reference Segmentation
      Table 4. Evaluation of reference parsing on 2023 references (Cermine is trained with the same training set as the proposed method) using 10-fold cross-validation.
      | Tag                  | Mean Precision (Proposed) | Mean Precision (Cermine) | Mean Recall (Proposed) | Mean Recall (Cermine) |
      |----------------------|---------------------------|--------------------------|------------------------|-----------------------|
      | Article Title        | 0.8367                    | 0.8415                   | 0.8805                 | 0.6879                |
      | Editor               | 0.6722                    |                          | 0.5683                 |                       |
      | Author (inc FN & LN) | 0.8611                    | 0.7792                   | 0.7410                 | 0.6260                |
      | Page (inc FP & LP)   | 0.8489                    | 0.8072                   | 0.5616                 | 0.5915                |
      | Issue                | 0.4688                    | 0.3833                   | 0.6511                 | 0.2164                |
      | Other                | 0.6872                    |                          | 0.7951                 |                       |
      | Publisher            | 0.7459                    |                          | 0.8578                 |                       |
      | Source               | 0.5957                    | 0.3198                   | 0.6906                 | 0.3012                |
      | URL                  | 0.6370                    |                          | 0.3350                 |                       |
      | Volume               | 0.6611                    | 0.6199                   | 0.7891                 | 0.3130                |
      | Year                 | 0.8649                    | 0.7832                   | 0.8315                 | 0.8959                |
      | Average / Total      | 0.73388571                | 0.64772857               | 0.73505714             | 0.51884286            |
  66. Results: Reference Segmentation
      Table 5. Evaluation of reference parsing on 100 English articles [2860 references] (Cermine is trained with the same training set as the proposed method) using 10-fold cross-validation.
      | Tag                  | Mean Precision (Proposed) | Mean Precision (Cermine) | Mean Recall (Proposed) | Mean Recall (Cermine) |
      |----------------------|---------------------------|--------------------------|------------------------|-----------------------|
      | Article Title        | 0.85953517                | 0.79936523               | 0.85837475             | 0.86318882            |
      | Editor               | 0.61669282                |                          | 0.66614764             |                       |
      | Author (inc FN & LN) | 0.81999131                | 0.71916667               | 0.62271211             | 0.82                  |
      | Issue                | 0.69521568                | 0.66815163               | 0.81991421             | 0.58327425            |
      | Publisher            | 0.63557433                |                          | 0.86028141             |                       |
      | Source               | 0.56182764                | 0.53182803               | 0.78004004             | 0.61664321            |
      | URL                  | 0.56915911                |                          | 0.24519389             |                       |
      | Volume               | 0.72311321                | 0.63454301               | 0.79538527             | 0.8476872             |
      | Year                 | 0.8031926                 | 0.79916696               | 0.86452723             | 0.90456963            |
      | Average / Total      | 0.7438126                 | 0.69203692               | 0.79015893             | 0.77256052            |
  67. Conclusion
      • A generic approach to extract and parse references.
      • The approach can be standardized as long as similar training data is available.
      • The approach uses a coherent mechanism that avoids error accumulation.
      • The output of each phase comes with confidence scores that are used to improve the subsequent phase.
  68. Thank you for your attention! Questions?
      Contact: Zeyd Boukhers, Institute for Web Science and Technologies, University of Koblenz-Landau
      boukhers@uni-koblenz.de or excite@uni-koblenz.de
