WP1 Status, boukhers@uni-koblenz.de
A Generic Approach for Reference
Extraction from PDF Documents
Zeyd Boukhers
Bologna, September 04, 2018
Reference Extraction and Segmentation
EXParser: http://excite.west.uni-koblenz.de:8081/excite
Code: https://github.com/exciteproject/Exparser
Reference String Extraction
Reference String Segmentation
Introduction: Motivation
• About 2.5 million scholarly articles were published worldwide in 2015.
• Elsevier publications from 2009 to 2014 were cited 11.5 million times in the same period.
Source: https://www.elsevier.com/connect/elsevier-publishing-a-look-at-the-numbers-and-more
Standard Pipeline For Reference Extraction
[*] Dominika Tkaczyk et al. 2018. Machine Learning vs. Rules and Out-of-the-Box vs. Retrained: An Evaluation of Open-Source
Bibliographic Reference and Citation Parsers. In Proc of JCDL '18.
Introduction: Motivation
• Different styles of references (i.e. intrinsic and extrinsic
differences).
• More than one section containing the references.
• Different representations of references (e.g. abbreviated
contents).
• Other types of references (e.g. grey literature, databases and
websites).
• Other languages (e.g. German, French).
Problem 1/5: Example of Intrinsic Differences
P26 P27 P28 P29
Problem 1/5: Example of Extrinsic Differences
Problem 2/5: Multi-Reference Sections
P14 P40 P101
Problem 3/5: Different Representations
Problem 4/5: Other Types of References
Problem 5/5: Different Languages
Generic Pipeline For Reference Extraction
Problems
• Error accumulation.
• Intrinsic style differences.
• Extrinsic style differences.
• Different locations of references.
Objectives
• Optimized pipeline.
• Generic features.
• More correlation among the pipeline phases.
Generic Pipeline For Reference Extraction
Each line is classified as one of:
• 0: non-reference line.
• 1: first line of a reference.
• 2: intermediate line of a reference.
• 3: last line of a reference.
Lines are combined and segmented simultaneously until they form a consistent reference.
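As a rough illustration of the 0 to 3 labelling scheme, the sketch below groups lines that already carry a label into reference strings. It is a deterministic simplification and not the EXParser procedure: the slides describe combining and segmenting lines jointly, which this does not attempt. The function name `group_lines` and the sample lines are hypothetical.

```python
from typing import List, Tuple

NON_REF, FIRST, INTERMEDIATE, LAST = 0, 1, 2, 3


def group_lines(labelled: List[Tuple[str, int]]) -> List[str]:
    """Join consecutive labelled lines into reference strings."""
    references: List[str] = []
    current: List[str] = []
    for text, label in labelled:
        if label == NON_REF:
            # A non-reference line closes any reference being built.
            if current:
                references.append(" ".join(current))
                current = []
            continue
        if label == FIRST and current:
            # A new first line starts a new reference.
            references.append(" ".join(current))
            current = []
        current.append(text.strip())
        if label == LAST:
            references.append(" ".join(current))
            current = []
    if current:
        references.append(" ".join(current))
    return references


# Made-up lines, labelled by hand for the example.
sample = [
    ("References", 0),
    ("Doe, J. (2015). A hypothetical title about", 1),
    ("reference extraction. Journal of Examples,", 2),
    ("12(3), 45-67.", 3),
    ("Roe, R. (2016). Another made-up entry. Example Press.", 1),
]
refs = group_lines(sample)
for ref in refs:
    print(ref)
```

A reference that fits on a single line has no separate last line here, so the sketch closes it when the next first line or non-reference line appears.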
Line Classification
• Number of tokens.
• Number of digits.
• Number of punctuation marks.
….
• Whether it starts with capital letter.
• Whether it contains a year format.
….
• Whether it contains a city (from a large data-table).
• Whether it contains an author name (from a large data-table).
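A minimal sketch of how such line-level features might be computed is shown here. The feature names, the year pattern and the two tiny look-up sets are assumptions made for the example; the real system draws on large data tables and a longer feature list (see the project repository).

```python
import re
import string
from typing import Dict, Set

# Tiny hypothetical stand-ins for the large look-up tables.
CITIES: Set[str] = {"bologna", "koblenz", "berlin"}
AUTHOR_NAMES: Set[str] = {"boukhers", "tkaczyk"}

# Assumed year format: four digits between 1800 and 2099.
YEAR_PATTERN = re.compile(r"\b(1[89]\d{2}|20\d{2})\b")


def line_features(line: str) -> Dict[str, float]:
    """Compute a few of the listed features for one text line."""
    tokens = line.split()
    words = {t.strip(string.punctuation).lower() for t in tokens}
    stripped = line.lstrip()
    return {
        "n_tokens": len(tokens),
        "n_digits": sum(ch.isdigit() for ch in line),
        "n_punctuation": sum(ch in string.punctuation for ch in line),
        "starts_with_capital": float(stripped[:1].isupper()),
        "has_year": float(bool(YEAR_PATTERN.search(line))),
        "has_city": float(bool(words & CITIES)),
        "has_author_name": float(bool(words & AUTHOR_NAMES)),
    }


features = line_features("Boukhers, Z. (2018). A generic approach. Bologna.")
print(features)
```

Each line yields one such feature dictionary, which can then be turned into a numeric vector for a classifier.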
Example of Generic Characteristics
[Figure: bar chart for the words "und", "der", "hrsg", "das", "verlag", "unter" and "eds"; series "Freq in Ref" and "Freq in non-Ref x0.1"; vertical axis from 0 to 4000.]
Frequency of most frequent words in reference strings and their frequency in non-reference strings.
Classification: Training
• The features extracted from the training dataset are used to train a Random Forest model.
Classification: Testing
• The model is employed to classify each line as one of:
– Non-ref line (0), First-ref line (1),
– Intermediate-ref line (2) and Last-ref line (3).
Classification: Filtering
• The irrelevant lines are discarded with a filtering process.
Reference Segmentation
• Number of characters.
• Ratio of capital letters.
• Whether it contains digits.
• Whether it is followed by a comma.
• Whether it is between parentheses.
• Whether it is a city.
• Whether it is a stop word.
• etc.
[*] For more details about the considered features: https://github.com/exciteproject/Exparser
Reference Segmentation & Identification
Starting with the line having the highest reference probability:
• Compute the acceptance ratio 𝑎, which combines the segmentation probability, the completeness probability and the line-extraction probability.
• Randomly add a neighbour line (up or down) and compute 𝑎 again (𝑎 new, compared with 𝑎 old).
• The new sample is accepted if it is better, rejected otherwise.
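The accept/reject procedure can be sketched as a small randomised search. This is an illustration under stated assumptions, not the authors' implementation: "better" is read as an acceptance ratio above 1, the loop runs for a fixed number of steps, and `toy_score` is a hypothetical stand-in for the combined segmentation, completeness and line-extraction probabilities.

```python
import random
from typing import Callable, List, Tuple

Span = Tuple[int, int]  # inclusive (first_line, last_line) indices


def grow_reference(
    line_probs: List[float],
    score: Callable[[Span], float],
    n_steps: int = 50,
    seed: int = 0,
) -> Span:
    """Grow a span of lines around the most reference-like line."""
    rng = random.Random(seed)
    # Start with the line having the highest reference probability.
    start = max(range(len(line_probs)), key=line_probs.__getitem__)
    span: Span = (start, start)
    for _ in range(n_steps):
        first, last = span
        # Randomly propose adding the neighbour line above or below.
        if rng.random() < 0.5:
            candidate = (first - 1, last)
        else:
            candidate = (first, last + 1)
        if candidate[0] < 0 or candidate[1] >= len(line_probs):
            continue  # the proposal falls outside the document
        # Acceptance ratio: new combination relative to the previous one.
        a = score(candidate) / score(span)
        if a > 1.0:  # assumption: "better" means a ratio above 1
            span = candidate
    return span


# Made-up per-line reference probabilities.
line_probs = [0.1, 0.8, 0.9, 0.7, 0.05]


def toy_score(span: Span) -> float:
    """Hypothetical score: product of the odds p / (1 - p) of each line."""
    first, last = span
    result = 1.0
    for p in line_probs[first:last + 1]:
        result *= p / (1.0 - p)
    return result


best = grow_reference(line_probs, toy_score)
print(best)
```

With these made-up probabilities the span grows from line 2 to cover lines 1 to 3, and the two unlikely lines at either end are rejected.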
Acceptance Ratio
$$a = \frac{P_o(l \mid r_{j+1})}{P_o(l \mid r_j)} \cdot \frac{P_c(r_{j+1})}{P_c(r_j)} \cdot \frac{P_b(r_{j+1})}{P_b(r_j)}$$
• $P_o(l \mid r_{j+1})$: the product of the components' probabilities of the initial line given the new line combination.
• $P_o(l \mid r_j)$: the product of the components' probabilities of the initial line given the previous line combination.
• $P_c(r_{j+1})$: the probability that the new reference string is complete.
• $P_c(r_j)$: the probability that the previous reference string is complete.
• $P_b(r_{j+1})$: the probability that the new reference string is determined with borderlines.
• $P_b(r_j)$: the probability that the previous reference string is determined with borderlines.
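As a small worked example, the ratio is the product of three new-over-previous quotients. The class and field names are hypothetical, the probability values are invented for illustration, and the threshold of 1 is an assumed reading of "accepted if it is better".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CombinationScores:
    p_o: float  # product of the components' probabilities of the initial line
    p_c: float  # probability that the reference string is complete
    p_b: float  # probability that the string is determined with borderlines


def acceptance_ratio(new: CombinationScores, old: CombinationScores) -> float:
    """Product of the three new/previous probability quotients."""
    return (new.p_o / old.p_o) * (new.p_c / old.p_c) * (new.p_b / old.p_b)


# Invented values: adding a line doubles completeness and improves borders.
old = CombinationScores(p_o=0.5, p_c=0.25, p_b=0.5)
new = CombinationScores(p_o=0.5, p_c=0.5, p_b=0.75)

a = acceptance_ratio(new, old)
accepted = a > 1.0  # assumed acceptance rule
print(a, accepted)
```

Because the three factors multiply, a gain in one probability can be cancelled by a loss in another.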
Probability of Completeness
[Figure: bar chart "Existence of Components" over component combinations Comb.1, Comb.2, …, Comb.n; vertical axis from 0 to 14.]
Results: Reference Line Extraction
Metric     CER-D  CER-T  Pars-D  Pars-M  GRO-D  GRO-T  RefExt-T  Proposed
Precision  0.296  0.303  0.558   0.617   0.627  0.847  0.879     0.874
Recall     0.233  0.220  0.552   0.595   0.718  0.839  0.906     0.973
F1-Score   0.245  0.235  0.542   0.590   0.650  0.837  0.885     0.921
Table 1. Evaluation of reference string extraction using 10-fold cross-validation for Proposed and baseline methods.

Metric     SVM (C=100)  SVM (default parameters)  Random Forest  Gaussian Naive Bayes
Precision  0.713        0.624                     0.874          0.809
Recall     0.925        0.898                     0.973          0.8
F1-Score   0.805        0.736                     0.921          0.804
Table 2. Evaluation of reference string extraction using 10-fold cross-validation for different classifiers.
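As a quick consistency check, the F1 score is the harmonic mean of precision and recall, and the Random Forest (Proposed) figures reproduce under that formula. The helper name is illustrative. Scores averaged per fold need not equal the harmonic mean of the averaged precision and recall, so other columns may not reproduce exactly.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall reported for the proposed Random Forest model.
proposed_f1 = round(f1_score(0.874, 0.973), 3)
print(proposed_f1)
```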
Results: Reference Segmentation
                     Mean Precision      Mean Recall         F score
Tag                  Proposed  Cermine   Proposed  Cermine   Proposed  Cermine
Publisher            0.805     -         0.455     -         0.581     -
Editor               0.902     -         0.711     -         0.795     -
Page (inc FP & LP)   0.959     0.765     0.932     0.890     0.946     0.823
Volume               0.806     0.871     0.830     0.675     0.818     0.761
First Name           0.865     0.216     0.824     0.761     0.844     0.336
Last Name            0.869     0.596     0.917     0.955     0.892     0.734
Source               0.631     0.669     0.783     0.543     0.699     0.600
Year                 0.903     0.862     0.980     0.884     0.940     0.873
Title                0.942     0.872     0.901     0.856     0.921     0.864
Other                0.770     -         0.789     -         0.779     -
Average / Total      0.854     0.693     0.881     0.795     0.866     0.713
Table 3. Evaluation of reference parsing on 304 references (Cermine with default training).
Results: Reference Segmentation
                       Mean Precision      Mean Recall
Tag                    Proposed  Cermine   Proposed  Cermine
Article Title          0.8367    0.8415    0.8805    0.6879
Editor                 0.6722    -         0.5683    -
Author (inc FN & LN)   0.8611    0.7792    0.7410    0.6260
Page (inc FP & LP)     0.8489    0.8072    0.5616    0.5915
Issue                  0.4688    0.3833    0.6511    0.2164
Other                  0.6872    -         0.7951    -
Publisher              0.7459    -         0.8578    -
Source                 0.5957    0.3198    0.6906    0.3012
URL                    0.6370    -         0.3350    -
Volume                 0.6611    0.6199    0.7891    0.3130
Year                   0.8649    0.7832    0.8315    0.8959
Average / Total        0.7339    0.6477    0.7351    0.5188
Table 4. Evaluation of reference parsing on 2023 references (Cermine is trained with the same training set as Proposed) using 10-fold cross-validation.
Results: Reference Segmentation
                       Mean Precision      Mean Recall
Tag                    Proposed  Cermine   Proposed  Cermine
Article Title          0.8595    0.7994    0.8584    0.8632
Editor                 0.6167    -         0.6661    -
Author (inc FN & LN)   0.8200    0.7192    0.6227    0.8200
Issue                  0.6952    0.6682    0.8199    0.5833
Publisher              0.6356    -         0.8603    -
Source                 0.5618    0.5318    0.7800    0.6166
URL                    0.5692    -         0.2452    -
Volume                 0.7231    0.6345    0.7954    0.8477
Year                   0.8032    0.7992    0.8645    0.9046
Average / Total        0.7438    0.6920    0.7902    0.7726
Table 5. Evaluation of reference parsing on 100 English articles [2860 references] (Cermine is trained with the same training set as Proposed) using 10-fold cross-validation.
Conclusion
• A generic approach to extract and parse references.
• The approach can be applied in the same way wherever similar training data is available.
• The phases work together as one coherent mechanism, which avoids error accumulation.
• Each phase outputs confidence scores that are used to improve the subsequent phase.
Thank you for your attention!
Questions?
Contact:
Zeyd Boukhers
Institute for Web Science and Technologies, University of Koblenz-Landau
boukhers@uni-koblenz.de or excite@uni-koblenz.de


More Related Content

Similar to A Generic Approach for Reference Extraction from PDF Documents

The Why and What of Pattern Lab
The Why and What of Pattern LabThe Why and What of Pattern Lab
The Why and What of Pattern LabDave Olsen
 
RDA serials cataloging
RDA serials catalogingRDA serials cataloging
RDA serials catalogingJennifer Young
 
CJUS 745Quantitative Analysis Report Multiple Regression
CJUS 745Quantitative Analysis Report Multiple Regression CJUS 745Quantitative Analysis Report Multiple Regression
CJUS 745Quantitative Analysis Report Multiple Regression VinaOconner450
 
Dissertations 5 ref, plagiarism, own crit-analysis [handout]
Dissertations 5   ref, plagiarism, own crit-analysis [handout]Dissertations 5   ref, plagiarism, own crit-analysis [handout]
Dissertations 5 ref, plagiarism, own crit-analysis [handout]Study Hub
 
EndNote X2 Workshop for FASS Graduate Students
EndNote X2 Workshop for FASS Graduate StudentsEndNote X2 Workshop for FASS Graduate Students
EndNote X2 Workshop for FASS Graduate StudentsNUS Libraries
 
محاضرة برنامج Nails لتحليل الدراسات السابقة د.شروق المقرن
محاضرة برنامج Nails  لتحليل الدراسات السابقة د.شروق المقرنمحاضرة برنامج Nails  لتحليل الدراسات السابقة د.شروق المقرن
محاضرة برنامج Nails لتحليل الدراسات السابقة د.شروق المقرنمركز البحوث الأقسام العلمية
 
Kaplan university cm 220 homework help 2
Kaplan university cm 220 homework help 2Kaplan university cm 220 homework help 2
Kaplan university cm 220 homework help 2Olivia Fournier
 
System and Problem for a Library Management System .docx
System and Problem for a  Library Management System  .docxSystem and Problem for a  Library Management System  .docx
System and Problem for a Library Management System .docxmattinsonjanel
 
TopNotch: Systematically Quality Controlling Big Data by David Durst
TopNotch: Systematically Quality Controlling Big Data by David DurstTopNotch: Systematically Quality Controlling Big Data by David Durst
TopNotch: Systematically Quality Controlling Big Data by David DurstSpark Summit
 
RDA Implementation at Edinburgh University Library, 2014 / Alasdair MacDonald...
RDA Implementation at Edinburgh University Library, 2014/ Alasdair MacDonald...RDA Implementation at Edinburgh University Library, 2014/ Alasdair MacDonald...
RDA Implementation at Edinburgh University Library, 2014 / Alasdair MacDonald...CIGScotland
 
Lecture: Summarization
Lecture: SummarizationLecture: Summarization
Lecture: SummarizationMarina Santini
 
Parquet - Data I/O - Philadelphia 2013
Parquet - Data I/O - Philadelphia 2013Parquet - Data I/O - Philadelphia 2013
Parquet - Data I/O - Philadelphia 2013larsgeorge
 
How to submit this assignment Your submission should be .docx
How to submit this assignment Your submission should be .docxHow to submit this assignment Your submission should be .docx
How to submit this assignment Your submission should be .docxpooleavelina
 
MBA 504 Module Four Power BI Assignment User Manual M
MBA 504 Module Four Power BI Assignment User Manual  MMBA 504 Module Four Power BI Assignment User Manual  M
MBA 504 Module Four Power BI Assignment User Manual MAbramMartino96
 
Patterns of the Lambda Architecture -- 2015 April - Hadoop Summit, Europe
Patterns of the Lambda Architecture -- 2015 April - Hadoop Summit, EuropePatterns of the Lambda Architecture -- 2015 April - Hadoop Summit, Europe
Patterns of the Lambda Architecture -- 2015 April - Hadoop Summit, EuropeFlip Kromer
 

Similar to A Generic Approach for Reference Extraction from PDF Documents (19)

The Why and What of Pattern Lab
The Why and What of Pattern LabThe Why and What of Pattern Lab
The Why and What of Pattern Lab
 
RDA serials cataloging
RDA serials catalogingRDA serials cataloging
RDA serials cataloging
 
presentation
presentationpresentation
presentation
 
CJUS 745Quantitative Analysis Report Multiple Regression
CJUS 745Quantitative Analysis Report Multiple Regression CJUS 745Quantitative Analysis Report Multiple Regression
CJUS 745Quantitative Analysis Report Multiple Regression
 
Dissertations 5 ref, plagiarism, own crit-analysis [handout]
Dissertations 5   ref, plagiarism, own crit-analysis [handout]Dissertations 5   ref, plagiarism, own crit-analysis [handout]
Dissertations 5 ref, plagiarism, own crit-analysis [handout]
 
EndNote X2 Workshop for FASS Graduate Students
EndNote X2 Workshop for FASS Graduate StudentsEndNote X2 Workshop for FASS Graduate Students
EndNote X2 Workshop for FASS Graduate Students
 
محاضرة برنامج Nails لتحليل الدراسات السابقة د.شروق المقرن
محاضرة برنامج Nails  لتحليل الدراسات السابقة د.شروق المقرنمحاضرة برنامج Nails  لتحليل الدراسات السابقة د.شروق المقرن
محاضرة برنامج Nails لتحليل الدراسات السابقة د.شروق المقرن
 
Kaplan university cm 220 homework help 2
Kaplan university cm 220 homework help 2Kaplan university cm 220 homework help 2
Kaplan university cm 220 homework help 2
 
System and Problem for a Library Management System .docx
System and Problem for a  Library Management System  .docxSystem and Problem for a  Library Management System  .docx
System and Problem for a Library Management System .docx
 
TopNotch: Systematically Quality Controlling Big Data by David Durst
TopNotch: Systematically Quality Controlling Big Data by David DurstTopNotch: Systematically Quality Controlling Big Data by David Durst
TopNotch: Systematically Quality Controlling Big Data by David Durst
 
COMM 1180 - Mechanical Techniques
COMM 1180 - Mechanical TechniquesCOMM 1180 - Mechanical Techniques
COMM 1180 - Mechanical Techniques
 
RDA Implementation at Edinburgh University Library, 2014 / Alasdair MacDonald...
RDA Implementation at Edinburgh University Library, 2014/ Alasdair MacDonald...RDA Implementation at Edinburgh University Library, 2014/ Alasdair MacDonald...
RDA Implementation at Edinburgh University Library, 2014 / Alasdair MacDonald...
 
Lecture: Summarization
Lecture: SummarizationLecture: Summarization
Lecture: Summarization
 
Parquet - Data I/O - Philadelphia 2013
Parquet - Data I/O - Philadelphia 2013Parquet - Data I/O - Philadelphia 2013
Parquet - Data I/O - Philadelphia 2013
 
How to submit this assignment Your submission should be .docx
How to submit this assignment Your submission should be .docxHow to submit this assignment Your submission should be .docx
How to submit this assignment Your submission should be .docx
 
Organize your research with EndNote
Organize your research with EndNoteOrganize your research with EndNote
Organize your research with EndNote
 
MBA 504 Module Four Power BI Assignment User Manual M
MBA 504 Module Four Power BI Assignment User Manual  MMBA 504 Module Four Power BI Assignment User Manual  M
MBA 504 Module Four Power BI Assignment User Manual M
 
Patterns of the Lambda Architecture -- 2015 April - Hadoop Summit, Europe
Patterns of the Lambda Architecture -- 2015 April - Hadoop Summit, EuropePatterns of the Lambda Architecture -- 2015 April - Hadoop Summit, Europe
Patterns of the Lambda Architecture -- 2015 April - Hadoop Summit, Europe
 
Ref works 2011
Ref works 2011Ref works 2011
Ref works 2011
 

Recently uploaded

Deciding The Topic of our Magazine.pptx.
Deciding The Topic of our Magazine.pptx.Deciding The Topic of our Magazine.pptx.
Deciding The Topic of our Magazine.pptx.bazilnaeem7
 
Understanding Poverty: A Community Questionnaire
Understanding Poverty: A Community QuestionnaireUnderstanding Poverty: A Community Questionnaire
Understanding Poverty: A Community Questionnairebazilnaeem7
 
ServiceNow CIS-Discovery Exam Dumps 2024
ServiceNow CIS-Discovery Exam Dumps 2024ServiceNow CIS-Discovery Exam Dumps 2024
ServiceNow CIS-Discovery Exam Dumps 2024SkillCertProExams
 
DAY 0 8 A Revelation 05-19-2024 PPT.pptx
DAY 0 8 A Revelation 05-19-2024 PPT.pptxDAY 0 8 A Revelation 05-19-2024 PPT.pptx
DAY 0 8 A Revelation 05-19-2024 PPT.pptxFamilyWorshipCenterD
 
The Influence and Evolution of Mogul Press in Contemporary Public Relations.docx
The Influence and Evolution of Mogul Press in Contemporary Public Relations.docxThe Influence and Evolution of Mogul Press in Contemporary Public Relations.docx
The Influence and Evolution of Mogul Press in Contemporary Public Relations.docxMogul Press
 
Breathing in New Life_ Part 3 05 22 2024.pptx
Breathing in New Life_ Part 3 05 22 2024.pptxBreathing in New Life_ Part 3 05 22 2024.pptx
Breathing in New Life_ Part 3 05 22 2024.pptxFamilyWorshipCenterD
 
ACM CHT Best Inspection Practices Kinben Innovation MIC Slideshare.pdf
ACM CHT Best Inspection Practices Kinben Innovation MIC Slideshare.pdfACM CHT Best Inspection Practices Kinben Innovation MIC Slideshare.pdf
ACM CHT Best Inspection Practices Kinben Innovation MIC Slideshare.pdfKinben Innovation Private Limited
 
Microsoft Fabric Analytics Engineer (DP-600) Exam Dumps 2024.pdf
Microsoft Fabric Analytics Engineer (DP-600) Exam Dumps 2024.pdfMicrosoft Fabric Analytics Engineer (DP-600) Exam Dumps 2024.pdf
Microsoft Fabric Analytics Engineer (DP-600) Exam Dumps 2024.pdfSkillCertProExams
 
Oracle Database Administration I (1Z0-082) Exam Dumps 2024.pdf
Oracle Database Administration I (1Z0-082) Exam Dumps 2024.pdfOracle Database Administration I (1Z0-082) Exam Dumps 2024.pdf
Oracle Database Administration I (1Z0-082) Exam Dumps 2024.pdfSkillCertProExams
 

Recently uploaded (9)

Deciding The Topic of our Magazine.pptx.
Deciding The Topic of our Magazine.pptx.Deciding The Topic of our Magazine.pptx.
Deciding The Topic of our Magazine.pptx.
 
Understanding Poverty: A Community Questionnaire
Understanding Poverty: A Community QuestionnaireUnderstanding Poverty: A Community Questionnaire
Understanding Poverty: A Community Questionnaire
 
ServiceNow CIS-Discovery Exam Dumps 2024
ServiceNow CIS-Discovery Exam Dumps 2024ServiceNow CIS-Discovery Exam Dumps 2024
ServiceNow CIS-Discovery Exam Dumps 2024
 
DAY 0 8 A Revelation 05-19-2024 PPT.pptx
DAY 0 8 A Revelation 05-19-2024 PPT.pptxDAY 0 8 A Revelation 05-19-2024 PPT.pptx
DAY 0 8 A Revelation 05-19-2024 PPT.pptx
 
The Influence and Evolution of Mogul Press in Contemporary Public Relations.docx
The Influence and Evolution of Mogul Press in Contemporary Public Relations.docxThe Influence and Evolution of Mogul Press in Contemporary Public Relations.docx
The Influence and Evolution of Mogul Press in Contemporary Public Relations.docx
 
Breathing in New Life_ Part 3 05 22 2024.pptx
Breathing in New Life_ Part 3 05 22 2024.pptxBreathing in New Life_ Part 3 05 22 2024.pptx
Breathing in New Life_ Part 3 05 22 2024.pptx
 
ACM CHT Best Inspection Practices Kinben Innovation MIC Slideshare.pdf
ACM CHT Best Inspection Practices Kinben Innovation MIC Slideshare.pdfACM CHT Best Inspection Practices Kinben Innovation MIC Slideshare.pdf
ACM CHT Best Inspection Practices Kinben Innovation MIC Slideshare.pdf
 
Microsoft Fabric Analytics Engineer (DP-600) Exam Dumps 2024.pdf
Microsoft Fabric Analytics Engineer (DP-600) Exam Dumps 2024.pdfMicrosoft Fabric Analytics Engineer (DP-600) Exam Dumps 2024.pdf
Microsoft Fabric Analytics Engineer (DP-600) Exam Dumps 2024.pdf
 
Oracle Database Administration I (1Z0-082) Exam Dumps 2024.pdf
Oracle Database Administration I (1Z0-082) Exam Dumps 2024.pdfOracle Database Administration I (1Z0-082) Exam Dumps 2024.pdf
Oracle Database Administration I (1Z0-082) Exam Dumps 2024.pdf
 

A Generic Approach for Reference Extraction from PDF Documents

  • 1. WP1 Statusboukhers@uni-koblenz.de A Generic Approach for Reference Extraction from PDF Documents Zeyd Boukhers Bologna, September 04, 2018
  • 2. WP1 Statusboukhers@uni-koblenz.de Reference Extraction and Segmentation EXParser: http://excite.west.uni-koblenz.de:8081/excite Code: https://github.com/exciteproject/Exparser 2
  • 3. WP1 Statusboukhers@uni-koblenz.de Reference Extraction and Segmentation EXParser: http://excite.west.uni-koblenz.de:8081/excite Code: https://github.com/exciteproject/Exparser Reference String Extraction Reference String Segmentation 2
  • 4. WP1 Statusboukhers@uni-koblenz.deboukhers@uni-koblenz.de In 2015: • About 2,5 million scholarly articles published worldwide in 2015. • The publications in Elsevier from 2009 to 2014 were cited 11.5 million times in the same period. Introduction: Motivation Source: https://www.elsevier.com/connect/elsevier-publishing-a-look-at-the-numbers-and-more 3
  • 5. WP1 Statusboukhers@uni-koblenz.de Standard Pipeline For Reference Extraction [*] Dominika Tkaczyk et al. 2018. Machine Learning vs. Rules and Out-of-the-Box vs. Retrained: An Evaluation of Open-Source Bibliographic Reference and Citation Parsers. In Proc of JCDL '18. 4
  • 6. WP1 Statusboukhers@uni-koblenz.de Standard Pipeline For Reference Extraction [*] Dominika Tkaczyk et al. 2018. Machine Learning vs. Rules and Out-of-the-Box vs. Retrained: An Evaluation of Open-Source Bibliographic Reference and Citation Parsers. In Proc of JCDL '18. 4
  • 7. WP1 Statusboukhers@uni-koblenz.de Standard Pipeline For Reference Extraction [*] Dominika Tkaczyk et al. 2018. Machine Learning vs. Rules and Out-of-the-Box vs. Retrained: An Evaluation of Open-Source Bibliographic Reference and Citation Parsers. In Proc of JCDL '18. 4
  • 8. WP1 Statusboukhers@uni-koblenz.de Introduction: Motivation • Different styles of references (i.e. intrinsic and extrinsic differences). • More than one section containing the references. • Different representations of references (e.g. abbreviated contents). • Other types of references (e.g. grey literature, databases and websites). • Other languages (e.g. German, French). 5
  • 9. WP1 Statusboukhers@uni-koblenz.de Problem 1/5: Example of Intrinsic Differences P26 P27 P28 P29 6
  • 10. WP1 Statusboukhers@uni-koblenz.de Problem 1/5: Example of Intrinsic Differences P26 P27 P28 P29 6
  • 11. WP1 Statusboukhers@uni-koblenz.de Problem 1/5: Example of Intrinsic Differences P26 P27 P28 P29 6
  • 12. WP1 Statusboukhers@uni-koblenz.de Problem 1/5: Example of Extrinsic Differences 7
  • 13. WP1 Statusboukhers@uni-koblenz.de Introduction: Motivation • Different styles of references (i.e. intrinsic and extrinsic differences). • More than one section containing the references. • Different representations of references (e.g. abbreviated contents). • Other types of references (e.g. grey literature, databases and websites). • Other languages (e.g. German, French). 8
  • 17. WP1 Statusboukhers@uni-koblenz.de Problem 2/5: Multi-Reference Sections 10
  • 18. WP1 Statusboukhers@uni-koblenz.de Problem 2/5: Multi-Reference Sections 10
  • 19. WP1 Statusboukhers@uni-koblenz.de Problem 2/5: Multi-Reference Sections 10
  • 20. Problem 2/5: Multi-Reference Sections [Figure: examples labelled P14, P40 and P101]
  • 22. Problem 3/5: Different Representations
  • 24. Problem 4/5: Other Types of References
  • 28. Generic Pipeline For Reference Extraction
      Problems: • Error accumulation. • Intrinsic style differences. • Extrinsic style differences. • Different locations of references.
      Objectives: • Optimized pipeline. • Generic features. • More correlation among the pipeline phases.
  • 29. Generic Pipeline For Reference Extraction. Each line is classified as one of: • 0: non-reference line. • 1: first line of a reference. • 2: intermediate line of a reference. • 3: last line of a reference. Lines are combined and segmented simultaneously until they form a consistent reference.
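To illustrate how the four line classes relate to whole reference strings, here is a minimal sketch that greedily joins already-labelled lines. It is a deliberate simplification and not the probabilistic combination described in the talk; the function name `group_lines` and the constants are hypothetical.

```python
from typing import List

# Line classes as listed on the slide.
NON_REF, FIRST, INTERMEDIATE, LAST = 0, 1, 2, 3


def group_lines(lines: List[str], labels: List[int]) -> List[str]:
    """Greedily join labelled lines into reference strings (illustrative only)."""
    references: List[str] = []
    current: List[str] = []

    def flush() -> None:
        # Close the reference that is currently being assembled, if any.
        if current:
            references.append(" ".join(current))
            current.clear()

    for text, label in zip(lines, labels):
        if label == NON_REF:
            flush()
            continue
        if label == FIRST:
            flush()  # a new first line closes any unfinished reference
        current.append(text.strip())
        if label == LAST:
            flush()
    flush()
    return references


lines = ["Intro text", "Smith, J. (2010).", "A title.", "Journal 5.", "Doe, A. (2012). Book."]
labels = [NON_REF, FIRST, INTERMEDIATE, LAST, FIRST]
refs = group_lines(lines, labels)
```

A hard grouping like this propagates every misclassified line into the output, which is the error accumulation the generic pipeline is meant to avoid.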
  • 30. Line Classification • Number of tokens. • Number of digits. • Amount of punctuation. … • Whether it starts with a capital letter. • Whether it contains a year format. … • Whether it contains a city (from a large data table). • Whether it contains an author name (from a large data table).
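The listed line features could be computed roughly as follows. This is a sketch under assumptions: the year pattern, the feature names and the tiny lookup sets stand in for the real definitions and the large data tables, which are not given on the slide.

```python
import re
import string

# Assumed year format: 1500-2099, optionally followed by a letter (e.g. 2010a).
YEAR_RE = re.compile(r"\b(1[5-9]\d{2}|20\d{2})[a-z]?\b")


def line_features(line, cities=frozenset(), author_names=frozenset()):
    """Return a dict of simple per-line features (hypothetical names)."""
    tokens = line.split()
    # Lower-cased tokens without surrounding punctuation, for table lookups.
    words = {t.strip(string.punctuation).lower() for t in tokens}
    stripped = line.lstrip()
    return {
        "n_tokens": len(tokens),
        "n_digits": sum(ch.isdigit() for ch in line),
        "n_punct": sum(ch in string.punctuation for ch in line),
        "starts_capital": bool(stripped) and stripped[0].isupper(),
        "has_year": YEAR_RE.search(line) is not None,
        "has_city": bool(words & cities),
        "has_author": bool(words & author_names),
    }


feats = line_features(
    "Müller, H. (2010): Titel. Berlin: Verlag.",
    cities={"berlin"},
    author_names={"müller"},
)
```

Single-word lookups are used here for brevity; multi-word city or author names would need n-gram matching.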
  • 31. Example of Generic Characteristics [Figure: bar chart for the words "und", "der", "hrsg", "das", "verlag", "unter" and "eds"; legend: Freq in Ref, Freq in non-Ref ×0.1; value axis from 0 to 4000.] Frequency of the most frequent words in reference strings and their frequency in non-reference strings.
  • 33. Classification: Training • The features extracted from the training dataset are used to train a Random Forest model.
  • 34. Classification: Testing • The model is employed to classify each line as: non-reference line (0), first reference line (1), intermediate reference line (2) or last reference line (3).
  • 35. Classification: Filtering • The irrelevant lines are discarded with a filtering process.
  • 36. Reference Segmentation • Number of characters. • Ratio of capital letters. • Whether it contains digits. • Whether it is followed by a comma. • Whether it is between parentheses. • Whether it is a city. • Whether it is a stop word. • etc. [*] For more details about the considered features: https://github.com/exciteproject/Exparser
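As a rough illustration of these segmentation features, the sketch below computes them for a single whitespace-separated token. That the features are computed per token, the stop-word list, and the simple parenthesis test are all assumptions; the repository linked above remains the authoritative source.

```python
import string

# Tiny illustrative stop-word list (assumption, not the project's list).
STOP_WORDS = frozenset({"and", "und", "der", "das", "the", "of", "in"})


def token_features(token, cities=frozenset()):
    """Return simple features for one token (hypothetical names)."""
    core = token.strip(string.punctuation)  # token without surrounding punctuation
    letters = [c for c in core if c.isalpha()]
    capital_ratio = sum(c.isupper() for c in letters) / len(letters) if letters else 0.0
    return {
        "n_chars": len(core),
        "capital_ratio": capital_ratio,
        "has_digit": any(c.isdigit() for c in core),
        "followed_by_comma": token.endswith(","),
        # Simplification: only detects a single token wrapped in parentheses.
        "in_parentheses": token.startswith("(") and ")" in token,
        "is_city": core.lower() in cities,
        "is_stop_word": core.lower() in STOP_WORDS,
    }


year_tok = token_features("(2010),")
city_tok = token_features("Berlin:", cities={"berlin"})
```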
  • 37. Reference Segmentation & Identification
      1. Start with the line that has the highest reference probability.
      2. Compute the acceptance ratio $a$, which combines the segmentation probability, the completeness probability and the line-extraction probability.
      3. Randomly add a neighbouring line (above or below) and compute $a$ again.
      4. Accept the new sample if it is better ($a_{new}$ versus $a_{old}$); reject it otherwise.
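The iterative procedure on these slides can be sketched as a small stochastic search over a contiguous span of lines. This is a sketch under assumptions: the `score` callable stands in for the combined probabilities, "better" is read as a score ratio above 1, and all names are hypothetical.

```python
import random
from typing import Callable, List, Optional, Tuple


def grow_reference(
    ref_prob: List[float],
    score: Callable[[int, int], float],
    n_steps: int = 50,
    rng: Optional[random.Random] = None,
) -> Tuple[int, int]:
    """Grow a line span [lo, hi] around the most reference-like line."""
    rng = rng or random.Random(0)
    # Start with the line having the highest reference probability.
    start = max(range(len(ref_prob)), key=ref_prob.__getitem__)
    lo = hi = start
    for _ in range(n_steps):
        # Randomly propose adding the neighbouring line above or below.
        cand = (lo - 1, hi) if rng.random() < 0.5 else (lo, hi + 1)
        if cand[0] < 0 or cand[1] >= len(ref_prob):
            continue  # proposal falls outside the document
        old, new = score(lo, hi), score(*cand)
        a = new / old if old > 0 else float("inf")
        if a > 1.0:  # accept only if the new span is better
            lo, hi = cand
    return lo, hi


# Toy score that peaks when the span covers lines 2..4 (purely illustrative).
def toy_score(lo: int, hi: int) -> float:
    return 1.0 / (1 + abs(lo - 2) + abs(hi - 4))


span = grow_reference([0.1, 0.2, 0.9, 0.8, 0.7, 0.1], toy_score)
```

Because rejected proposals leave the span unchanged, the loop settles once neither neighbour improves the score.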
  • 51. Acceptance Ratio
      $$a = \frac{P_o(l \mid r_{j+1})}{P_o(l \mid r_j)} \cdot \frac{P_c(r_{j+1})}{P_c(r_j)} \cdot \frac{P_b(r_{j+1})}{P_b(r_j)}$$
      • $P_o(l \mid r_{j+1})$ and $P_o(l \mid r_j)$: the product of the components' probabilities of the initial line $l$ given the new and the previous line combination, respectively.
      • $P_c(r_{j+1})$ and $P_c(r_j)$: the probability that the new and the previous reference string, respectively, is complete.
      • $P_b(r_{j+1})$ and $P_b(r_j)$: the probability that the new and the previous reference string, respectively, is determined with borderlines.
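Read as a product of three new-to-previous ratios, the acceptance ratio could be computed as in the sketch below. The `Candidate` container, its field names and the small `eps` guard against division by zero are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """Probabilities attached to one line combination r (hypothetical container)."""
    p_o: float  # product of the components' probabilities of the initial line
    p_c: float  # probability that the reference string is complete
    p_b: float  # probability that the string is determined with borderlines


def acceptance_ratio(new: Candidate, old: Candidate, eps: float = 1e-12) -> float:
    """Product of the three new/previous probability ratios."""
    return (
        (new.p_o / max(old.p_o, eps))
        * (new.p_c / max(old.p_c, eps))
        * (new.p_b / max(old.p_b, eps))
    )


old = Candidate(p_o=0.5, p_c=0.4, p_b=0.5)
new = Candidate(p_o=0.6, p_c=0.8, p_b=0.5)
a = acceptance_ratio(new, old)
```

A value above 1 means the new line combination scores higher than the previous one on the combined criteria.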
  • 55. Probability of Completeness [Figure: bar chart over the component combinations Comb.1, Comb.2, …, Comb.n; value axis from 0 to 14; label: Existence of Components]
  • 63. Results: Reference Line Extraction
      Metric     CER-D  CER-T  Pars-D  Pars-M  GRO-D  GRO-T  RefExt-T  Proposed
      Precision  0.296  0.303  0.558   0.617   0.627  0.847  0.879     0.874
      Recall     0.233  0.220  0.552   0.595   0.718  0.839  0.906     0.973
      F1-Score   0.245  0.235  0.542   0.590   0.650  0.837  0.885     0.921
      Table 1. Evaluation of reference string extraction using 10-fold cross-validation for the proposed and baseline methods.
      Metric     SVM (C=100)  SVM (default parameters)  Random Forest  Gaussian Naive Bayes
      Precision  0.713        0.624                     0.874          0.809
      Recall     0.925        0.898                     0.973          0.8
      F1-Score   0.805        0.736                     0.921          0.804
      Table 2. Evaluation of reference string extraction using 10-fold cross-validation for different classifiers.
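For readers comparing the rows, the F1-score is the harmonic mean of precision and recall. The helper below is a minimal sketch with a hypothetical name. Because the tables report values from 10-fold cross-validation, recomputing F1 from the reported precision and recall need not reproduce every table entry exactly.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall of the proposed method from Table 1.
proposed_f1 = round(f1_score(0.874, 0.973), 3)  # 0.921
```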
  • 64. Results: Reference Segmentation
                            Mean Precision          Mean Recall             F score
      Tag                   Proposed    Cermine     Proposed    Cermine     Proposed    Cermine
      Publisher             0.805       -           0.455       -           0.581       -
      Editor                0.902       -           0.711       -           0.795       -
      Page (inc FP & LP)    0.959       0.765       0.932       0.890       0.946       0.823
      Volume                0.806       0.871       0.830       0.675       0.818       0.761
      First Name            0.865       0.216       0.824       0.761       0.844       0.336
      Last Name             0.869       0.596       0.917       0.955       0.892       0.734
      Source                0.631       0.669       0.783       0.543       0.699       0.6
      Year                  0.903       0.862       0.980       0.884       0.940       0.873
      Title                 0.942       0.872       0.901       0.856       0.921       0.864
      Other                 0.770       -           0.789       -           0.779       -
      Average / Total       0.85357143  0.693       0.881       0.79485714  0.86571429  0.713
      Table 3. Evaluation of reference parsing on 304 references (Cermine with default training). A dash marks a value that is not reported.
  • 65. Results: Reference Segmentation
                            Mean Precision          Mean Recall
      Tag                   Proposed    Cermine     Proposed    Cermine
      Article Title         0.8367      0.8415      0.8805      0.6879
      Editor                0.6722      -           0.5683      -
      Author (inc FN & LN)  0.8611      0.7792      0.7410      0.6260
      Page (inc FP & LP)    0.8489      0.8072      0.5616      0.5915
      Issue                 0.4688      0.3833      0.6511      0.2164
      Other                 0.6872      -           0.7951      -
      Publisher             0.7459      -           0.8578      -
      Source                0.5957      0.3198      0.6906      0.3012
      URL                   0.6370      -           0.3350      -
      Volume                0.6611      0.6199      0.7891      0.3130
      Year                  0.8649      0.7832      0.8315      0.8959
      Average / Total       0.73388571  0.64772857  0.73505714  0.51884286
      Table 4. Evaluation of reference parsing on 2023 references (Cermine is trained with the same training set as the proposed method) using 10-fold cross-validation. A dash marks a value that is not reported.
  • 66. Results: Reference Segmentation
                            Mean Precision          Mean Recall
      Tag                   Proposed    Cermine     Proposed    Cermine
      Article Title         0.85953517  0.79936523  0.85837475  0.86318882
      Editor                0.61669282  -           0.66614764  -
      Author (inc FN & LN)  0.81999131  0.71916667  0.62271211  0.82
      Issue                 0.69521568  0.66815163  0.81991421  0.58327425
      Publisher             0.63557433  -           0.86028141  -
      Source                0.56182764  0.53182803  0.78004004  0.61664321
      URL                   0.56915911  -           0.24519389  -
      Volume                0.72311321  0.63454301  0.79538527  0.8476872
      Year                  0.8031926   0.79916696  0.86452723  0.90456963
      Average / Total       0.7438126   0.69203692  0.79015893  0.77256052
      Table 5. Evaluation of reference parsing on 100 English articles [2860 references] (Cermine is trained with the same training set as the proposed method) using 10-fold cross-validation. A dash marks a value that is not reported.
  • 67. Conclusion • A generic approach to extract and parse references. • The approach is standardized as long as similar training data is available. • The phases work together coherently to avoid error accumulation. • Each phase outputs confidence scores, which the subsequent phase uses to improve its result.
  • 68. Thank you for your attention! Questions? Contact: Zeyd Boukhers, Institute for Web Science and Technologies, University of Koblenz-Landau, boukhers@uni-koblenz.de Or excite@uni-koblenz.de Or 