WP1 Status · boukhers@uni-koblenz.de
A Generic Approach for Reference Extraction from PDF Documents
Zeyd Boukhers
Bologna, September 04, 2018
Reference Extraction and Segmentation
EXParser: http://excite.west.uni-koblenz.de:8081/excite
Code: https://github.com/exciteproject/Exparser
Reference String Extraction
Reference String Segmentation
Introduction: Motivation
• About 2.5 million scholarly articles were published worldwide in 2015.
• Elsevier publications from 2009 to 2014 were cited 11.5 million times in the same period.
Source: https://www.elsevier.com/connect/elsevier-publishing-a-look-at-the-numbers-and-more
Standard Pipeline For Reference Extraction
[*] Dominika Tkaczyk et al. 2018. Machine Learning vs. Rules and Out-of-the-Box vs. Retrained: An Evaluation of Open-Source
Bibliographic Reference and Citation Parsers. In Proc of JCDL '18.
Introduction: Motivation
• Different styles of references (i.e. intrinsic and extrinsic differences).
• More than one section containing the references.
• Different representations of references (e.g. abbreviated contents).
• Other types of references (e.g. grey literature, databases and websites).
• Other languages (e.g. German, French).
Problem 1/5: Example of Intrinsic Differences
[Figure: examples labelled P26, P27, P28 and P29]
Problem 1/5: Example of Extrinsic Differences
Problem 2/5: Multi-Reference Sections
[Figure: examples labelled P14, P40 and P101]
Problem 3/5: Different Representations
Problem 4/5: Other Types of References
Problem 5/5: Different Languages
Generic Pipeline For Reference Extraction
Problems:
• Error accumulation.
• Intrinsic style differences.
• Extrinsic style differences.
• Different locations of references.
Objectives:
• Optimized pipeline.
• Generic features.
• More correlation among the pipeline phases.
Generic Pipeline For Reference Extraction
Each line is classified as one of:
• 0: non-reference line.
• 1: first line of a reference.
• 2: intermediate line of a reference.
• 3: last line of a reference.
Lines are combined and segmented simultaneously until they form a consistent reference.
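To make the four labels concrete, here is a minimal Python sketch that joins already-labelled lines into reference strings. It is a naive, deterministic illustration only: the function name `group_reference_lines` and the toy input are assumptions, and the approach in the talk combines lines probabilistically instead of trusting the labels outright.

```python
def group_reference_lines(lines, labels):
    """Join lines into reference strings using the labels 0-3 (hypothetical helper)."""
    references, current = [], []
    for text, label in zip(lines, labels):
        if label == 0:
            # Non-reference line: skip it.
            continue
        if label == 1 and current:
            # A new first line closes any reference that is still open.
            references.append(" ".join(current))
            current = []
        current.append(text.strip())
        if label == 3:
            # Last line: the reference is complete.
            references.append(" ".join(current))
            current = []
    if current:
        # Flush a reference that never received an explicit last line.
        references.append(" ".join(current))
    return references


# Toy input; the strings are made up for illustration.
lines = [
    "1 Introduction",
    "Doe, J. (2015). A title that",
    "continues on a second line,",
    "pp. 1-10.",
    "Roe, R. (2016). A one-line entry.",
]
labels = [0, 1, 2, 3, 1]
references = group_reference_lines(lines, labels)
```

With hard labels a single misclassified line breaks the grouping, which is the error accumulation the generic pipeline is meant to reduce.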
Line Classification
Features computed for each line include:
• Number of tokens.
• Number of digits.
• Amount of punctuation.
• ...
• Whether it starts with a capital letter.
• Whether it contains a year format.
• ...
• Whether it contains a city (from a large data table).
• Whether it contains an author name (from a large data table).
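A few of the line features listed above could be computed as in the following sketch. The function name `line_features`, the year pattern and the two-entry `CITIES` set are assumptions made for illustration; the talk refers to a large data table, and the actual feature set is in the Exparser repository.

```python
import re
import string

# Assumed year format: four digits between 1800 and 2099.
YEAR_RE = re.compile(r"\b(1[89]\d{2}|20\d{2})\b")
# Hypothetical stand-in for the large city table mentioned on the slide.
CITIES = {"bologna", "koblenz"}


def line_features(line):
    """Return a small dictionary of layout-independent features for one line."""
    tokens = line.split()
    return {
        "n_tokens": len(tokens),
        "n_digits": sum(ch.isdigit() for ch in line),
        "n_punct": sum(ch in string.punctuation for ch in line),
        "starts_capital": line[:1].isupper(),
        "has_year": bool(YEAR_RE.search(line)),
        "has_city": any(
            tok.strip(string.punctuation).lower() in CITIES for tok in tokens
        ),
    }


features = line_features("Doe, J. (2015). Some title. Bologna.")
```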
Example of Generic Characteristics
[Figure: bar chart for the words und, der, hrsg, das, verlag, unter and eds, with series "Freq in Ref" and "Freq in non-Ref x0.1"; y-axis from 0 to 4000.]
Frequency of the most frequent words in reference strings and their frequency in non-reference strings.
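A comparison like the one in this chart can be produced with a simple word count over the two groups of lines. The sketch below is only an illustration: `word_frequencies` is a hypothetical helper and the sample lines are made up.

```python
from collections import Counter


def word_frequencies(lines):
    """Count lower-cased words after stripping surrounding punctuation."""
    counts = Counter()
    for line in lines:
        counts.update(word.strip(".,;:()").lower() for word in line.split())
    counts.pop("", None)  # drop tokens that were pure punctuation
    return counts


# Made-up sample lines, for illustration only.
ref_lines = [
    "Mueller, A. (Hrsg.) und Schmidt, B. Verlag X.",
    "Meyer, C. (Eds.) Verlag Y.",
]
non_ref_lines = [
    "Der Ansatz und das Ergebnis werden unter Abschnitt 3 beschrieben.",
]

ref_freq = word_frequencies(ref_lines)
non_ref_freq = word_frequencies(non_ref_lines)
```

Words such as "verlag" or "hrsg" that are common in reference lines but rare elsewhere are the kind of generic cue the chart highlights.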
Classification: Training
• The features extracted from the training dataset are used to train a Random Forest model.
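The training step might look like the following sketch, assuming scikit-learn is used. The three-column feature vectors, their values and the hyperparameters are made up for illustration and are not taken from the talk.

```python
from sklearn.ensemble import RandomForestClassifier

# Toy feature vectors [n_tokens, n_digits, has_year]; all values are made up.
X_train = [
    [12, 0, 0], [15, 1, 0],  # label 0: non-reference lines
    [9, 4, 1], [10, 4, 1],   # label 1: first lines of a reference
    [8, 0, 0], [7, 1, 0],    # label 2: intermediate lines
    [5, 6, 0], [4, 7, 0],    # label 3: last lines
]
y_train = [0, 0, 1, 1, 2, 2, 3, 3]

# Hyperparameters are assumptions; random_state only makes the run repeatable.
model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit(X_train, y_train)

# One probability per label for each line; columns follow model.classes_.
proba = model.predict_proba([[9, 4, 1], [4, 7, 0]])
```

Keeping the per-class probabilities instead of only the hard label is what allows a later phase to use them as confidence scores.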
Classification: Testing
• The model is employed to classify each line as:
– Non-ref line (0), First-ref line (1),
– Intermediate-ref line (2) or Last-ref line (3).
Classification: Filtering
• The irrelevant lines are discarded by a filtering process.
Reference Segmentation
Features of each token include:
• Number of characters.
• Ratio of capital letters.
• Whether it contains digits.
• Whether it is followed by a comma.
• Whether it is between parentheses.
• Whether it is a city.
• Whether it is a stop word.
• etc.
[*] For more details about the considered features: https://github.com/exciteproject/Exparser
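Some of the segmentation features above can be sketched per token as follows. The helper name `token_features`, the punctuation handling and the tiny `STOP_WORDS` set are assumptions; the Exparser repository is the authoritative source for the real features.

```python
# Hypothetical small stop-word list, for illustration only.
STOP_WORDS = {"and", "the", "of", "in", "und", "der"}


def token_features(token):
    """Return simple features for one whitespace-separated token."""
    core = token.strip("(),.;:")  # token without surrounding punctuation
    letters = [ch for ch in core if ch.isalpha()]
    return {
        "n_chars": len(core),
        "capital_ratio": (
            sum(ch.isupper() for ch in letters) / len(letters) if letters else 0.0
        ),
        "has_digit": any(ch.isdigit() for ch in core),
        "followed_by_comma": token.endswith(","),
        "in_parentheses": token.startswith("(")
        and token.rstrip(".,;:").endswith(")"),
        "is_stop_word": core.lower() in STOP_WORDS,
    }


year_token = token_features("(2015).")
name_token = token_features("Doe,")
```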
Reference Segmentation & Identification
• Start with the line having the highest reference probability.
• Compute the acceptance ratio 𝑎, which combines the segmentation probability, the completeness probability and the line-extraction probability.
• Randomly add a neighbouring line (above or below) and compute 𝑎 again (𝑎 new versus 𝑎 old).
• The new sample is accepted if it is better, rejected otherwise.
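The procedure on this slide can be sketched as a small randomized search. Everything named here is an assumption for illustration: `grow_reference`, the `score` callback standing in for the combined probabilities, the fixed number of steps and the toy odds-based score. The acceptance rule follows the slide: a proposal is kept only if it is better.

```python
import random


def grow_reference(line_probs, score, steps=50, seed=0):
    """Grow a span of lines around the most reference-like line.

    line_probs: per-line probability of belonging to a reference.
    score(lo, hi): hypothetical positive quality of the span lo..hi (inclusive).
    """
    rng = random.Random(seed)
    start = max(range(len(line_probs)), key=line_probs.__getitem__)
    lo = hi = start
    for _ in range(steps):
        # Randomly propose the neighbouring line above or below the span.
        if rng.random() < 0.5:
            new_lo, new_hi = lo - 1, hi
        else:
            new_lo, new_hi = lo, hi + 1
        if new_lo < 0 or new_hi >= len(line_probs):
            continue  # no neighbour in that direction
        a = score(new_lo, new_hi) / score(lo, hi)  # acceptance ratio
        if a > 1:
            # Accept only if the new sample is better.
            lo, hi = new_lo, new_hi
    return lo, hi


def toy_score(probs):
    """Toy span score: product of per-line odds p / (1 - p); requires p < 1."""
    def score(lo, hi):
        result = 1.0
        for p in probs[lo:hi + 1]:
            result *= p / (1.0 - p)
        return result
    return score


probs = [0.1, 0.8, 0.9, 0.7, 0.05]
span = grow_reference(probs, toy_score(probs))
```

With this toy score, adding a line multiplies the span score by that line's odds, so lines with probability above 0.5 are absorbed and the others are rejected.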
Acceptance Ratio
$$a = \frac{P_o(l \mid r_{j+1})}{P_o(l \mid r_j)} \cdot \frac{P_c(r_{j+1})}{P_c(r_j)} \cdot \frac{P_b(r_{j+1})}{P_b(r_j)}$$
• $P_o(l \mid r_{j+1})$: the product of the components' probabilities of the initial line given the new line combination.
• $P_o(l \mid r_j)$: the product of the components' probabilities of the initial line given the previous line combination.
• $P_c(r_{j+1})$: the probability that the new reference string is complete.
• $P_c(r_j)$: the probability that the previous reference string is complete.
• $P_b(r_{j+1})$: the probability that the new reference string is determined with borderlines.
• $P_b(r_j)$: the probability that the previous reference string is determined with borderlines.
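As a worked illustration of the acceptance ratio, the sketch below multiplies the three new-to-old probability ratios. The dictionary keys `Po`, `Pc` and `Pb` mirror the symbols on the slide, while the function name and the numeric values are made up.

```python
def acceptance_ratio(new, old):
    """Product of the three new/old probability ratios.

    new, old: dicts with keys 'Po', 'Pc' and 'Pb' for the new and the
    previous line combination (hypothetical data layout).
    """
    a = 1.0
    for key in ("Po", "Pc", "Pb"):
        a *= new[key] / old[key]
    return a


# Made-up probabilities: the new combination is more complete and better delimited.
old = {"Po": 0.5, "Pc": 0.25, "Pb": 0.5}
new = {"Po": 0.5, "Pc": 0.5, "Pb": 0.75}
a = acceptance_ratio(new, old)  # 1.0 * 2.0 * 1.5
```

A value above 1 means the new line combination scores higher overall than the previous one.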
Probability of Completeness
[Figure: bar chart over component combinations (Comb.1, Comb.2, ..., Comb.n), labelled "Existence of Components"; y-axis from 0 to 14.]
Results: Reference Line Extraction
Metric      CER-D   CER-T   Pars-D  Pars-M  GRO-D   GRO-T   RefExt-T  Proposed
Precision   0.296   0.303   0.558   0.617   0.627   0.847   0.879     0.874
Recall      0.233   0.220   0.552   0.595   0.718   0.839   0.906     0.973
F1-Score    0.245   0.235   0.542   0.590   0.650   0.837   0.885     0.921
Table 1. Evaluation of reference string extraction using 10-fold cross-validation for Proposed and baseline methods.
Metric      SVM (C=100)  SVM (Default Parameters)  Random Forest  Gaussian Naive Bayes
Precision   0.713        0.624                     0.874          0.809
Recall      0.925        0.898                     0.973          0.8
F1-Score    0.805        0.736                     0.921          0.804
Table 2. Evaluation of reference string extraction using 10-fold cross-validation for different classifiers.
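For readers checking the tables, the F1 score is the harmonic mean of precision and recall. The short sketch below (the helper name `f1_score` is ours) reproduces the reported value for the Proposed method, which uses Random Forest. Scores averaged over folds need not satisfy the formula exactly, so other columns may differ slightly.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall of the Proposed method as reported in the table.
proposed_f1 = round(f1_score(0.874, 0.973), 3)
```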
Results: Reference Segmentation
                      Mean Precision          Mean Recall             F score
Tag                   Proposed    Cermine     Proposed    Cermine     Proposed    Cermine
Publisher             0.805       -           0.455       -           0.581       -
Editor                0.902       -           0.711       -           0.795       -
Page (inc FP & LP)    0.959       0.765       0.932       0.890       0.946       0.823
Volume                0.806       0.871       0.830       0.675       0.818       0.761
First Name            0.865       0.216       0.824       0.761       0.844       0.336
Last Name             0.869       0.596       0.917       0.955       0.892       0.734
Source                0.631       0.669       0.783       0.543       0.699       0.6
Year                  0.903       0.862       0.980       0.884       0.940       0.873
Title                 0.942       0.872       0.901       0.856       0.921       0.864
Other                 0.770       -           0.789       -           0.779       -
Average / Total       0.85357143  0.693       0.881       0.79485714  0.86571429  0.713
(-: no value reported.)
Table 3. Evaluation of reference parsing on 304 references (Cermine with default training).
Results: Reference Segmentation
                        Mean Precision          Mean Recall
Tag                     Proposed    Cermine     Proposed    Cermine
Article Title           0.8367      0.8415      0.8805      0.6879
Editor                  0.6722      -           0.5683      -
Author (inc FN & LN)    0.8611      0.7792      0.7410      0.6260
Page (inc FP & LP)      0.8489      0.8072      0.5616      0.5915
Issue                   0.4688      0.3833      0.6511      0.2164
Other                   0.6872      -           0.7951      -
Publisher               0.7459      -           0.8578      -
Source                  0.5957      0.3198      0.6906      0.3012
URL                     0.6370      -           0.3350      -
Volume                  0.6611      0.6199      0.7891      0.3130
Year                    0.8649      0.7832      0.8315      0.8959
Average / Total         0.73388571  0.64772857  0.73505714  0.51884286
(-: no value reported.)
Table 4. Evaluation of reference parsing on 2023 references (Cermine is trained with the same training set as proposed) using 10-fold cross-validation.
Results: Reference Segmentation
                        Mean Precision          Mean Recall
Tag                     Proposed    Cermine     Proposed    Cermine
Article Title           0.85953517  0.79936523  0.85837475  0.86318882
Editor                  0.61669282  -           0.66614764  -
Author (inc FN & LN)    0.81999131  0.71916667  0.62271211  0.82
Issue                   0.69521568  0.66815163  0.81991421  0.58327425
Publisher               0.63557433  -           0.86028141  -
Source                  0.56182764  0.53182803  0.78004004  0.61664321
URL                     0.56915911  -           0.24519389  -
Volume                  0.72311321  0.63454301  0.79538527  0.8476872
Year                    0.8031926   0.79916696  0.86452723  0.90456963
Average / Total         0.7438126   0.69203692  0.79015893  0.77256052
(-: no value reported.)
Table 5. Evaluation of reference parsing on 100 English articles [2860 references] (Cermine is trained with the same training set as proposed) using 10-fold cross-validation.
Conclusion
• A generic approach to extract and parse references.
• The approach is standardized as long as similar training data is available.
• The approach works as a coherent mechanism that avoids error accumulation.
• The output of each phase comes with confidence scores that improve the subsequent phase.
Thank you for your attention!
Questions?
Contact:
Zeyd Boukhers
Institute for Web Science and Technologies, University of Koblenz-Landau
boukhers@uni-koblenz.de or excite@uni-koblenz.de


More Related Content

Similar to A Generic Approach for Reference Extraction from PDF Documents

The Why and What of Pattern Lab
The Why and What of Pattern LabThe Why and What of Pattern Lab
The Why and What of Pattern LabDave Olsen
 
RDA serials cataloging
RDA serials catalogingRDA serials cataloging
RDA serials catalogingJennifer Young
 
CJUS 745Quantitative Analysis Report Multiple Regression
CJUS 745Quantitative Analysis Report Multiple Regression CJUS 745Quantitative Analysis Report Multiple Regression
CJUS 745Quantitative Analysis Report Multiple Regression VinaOconner450
 
Dissertations 5 ref, plagiarism, own crit-analysis [handout]
Dissertations 5   ref, plagiarism, own crit-analysis [handout]Dissertations 5   ref, plagiarism, own crit-analysis [handout]
Dissertations 5 ref, plagiarism, own crit-analysis [handout]Study Hub
 
EndNote X2 Workshop for FASS Graduate Students
EndNote X2 Workshop for FASS Graduate StudentsEndNote X2 Workshop for FASS Graduate Students
EndNote X2 Workshop for FASS Graduate StudentsNUS Libraries
 
محاضرة برنامج Nails لتحليل الدراسات السابقة د.شروق المقرن
محاضرة برنامج Nails  لتحليل الدراسات السابقة د.شروق المقرنمحاضرة برنامج Nails  لتحليل الدراسات السابقة د.شروق المقرن
محاضرة برنامج Nails لتحليل الدراسات السابقة د.شروق المقرنمركز البحوث الأقسام العلمية
 
Kaplan university cm 220 homework help 2
Kaplan university cm 220 homework help 2Kaplan university cm 220 homework help 2
Kaplan university cm 220 homework help 2Olivia Fournier
 
System and Problem for a Library Management System .docx
System and Problem for a  Library Management System  .docxSystem and Problem for a  Library Management System  .docx
System and Problem for a Library Management System .docxmattinsonjanel
 
TopNotch: Systematically Quality Controlling Big Data by David Durst
TopNotch: Systematically Quality Controlling Big Data by David DurstTopNotch: Systematically Quality Controlling Big Data by David Durst
TopNotch: Systematically Quality Controlling Big Data by David DurstSpark Summit
 
RDA Implementation at Edinburgh University Library, 2014 / Alasdair MacDonald...
RDA Implementation at Edinburgh University Library, 2014/ Alasdair MacDonald...RDA Implementation at Edinburgh University Library, 2014/ Alasdair MacDonald...
RDA Implementation at Edinburgh University Library, 2014 / Alasdair MacDonald...CIGScotland
 
Lecture: Summarization
Lecture: SummarizationLecture: Summarization
Lecture: SummarizationMarina Santini
 
Parquet - Data I/O - Philadelphia 2013
Parquet - Data I/O - Philadelphia 2013Parquet - Data I/O - Philadelphia 2013
Parquet - Data I/O - Philadelphia 2013larsgeorge
 
How to submit this assignment Your submission should be .docx
How to submit this assignment Your submission should be .docxHow to submit this assignment Your submission should be .docx
How to submit this assignment Your submission should be .docxpooleavelina
 
MBA 504 Module Four Power BI Assignment User Manual M
MBA 504 Module Four Power BI Assignment User Manual  MMBA 504 Module Four Power BI Assignment User Manual  M
MBA 504 Module Four Power BI Assignment User Manual MAbramMartino96
 
Patterns of the Lambda Architecture -- 2015 April - Hadoop Summit, Europe
Patterns of the Lambda Architecture -- 2015 April - Hadoop Summit, EuropePatterns of the Lambda Architecture -- 2015 April - Hadoop Summit, Europe
Patterns of the Lambda Architecture -- 2015 April - Hadoop Summit, EuropeFlip Kromer
 

Similar to A Generic Approach for Reference Extraction from PDF Documents (19)

The Why and What of Pattern Lab
The Why and What of Pattern LabThe Why and What of Pattern Lab
The Why and What of Pattern Lab
 
RDA serials cataloging
RDA serials catalogingRDA serials cataloging
RDA serials cataloging
 
presentation
presentationpresentation
presentation
 
CJUS 745Quantitative Analysis Report Multiple Regression
CJUS 745Quantitative Analysis Report Multiple Regression CJUS 745Quantitative Analysis Report Multiple Regression
CJUS 745Quantitative Analysis Report Multiple Regression
 
Dissertations 5 ref, plagiarism, own crit-analysis [handout]
Dissertations 5   ref, plagiarism, own crit-analysis [handout]Dissertations 5   ref, plagiarism, own crit-analysis [handout]
Dissertations 5 ref, plagiarism, own crit-analysis [handout]
 
EndNote X2 Workshop for FASS Graduate Students
EndNote X2 Workshop for FASS Graduate StudentsEndNote X2 Workshop for FASS Graduate Students
EndNote X2 Workshop for FASS Graduate Students
 
محاضرة برنامج Nails لتحليل الدراسات السابقة د.شروق المقرن
محاضرة برنامج Nails  لتحليل الدراسات السابقة د.شروق المقرنمحاضرة برنامج Nails  لتحليل الدراسات السابقة د.شروق المقرن
محاضرة برنامج Nails لتحليل الدراسات السابقة د.شروق المقرن
 
Kaplan university cm 220 homework help 2
Kaplan university cm 220 homework help 2Kaplan university cm 220 homework help 2
Kaplan university cm 220 homework help 2
 
System and Problem for a Library Management System .docx
System and Problem for a  Library Management System  .docxSystem and Problem for a  Library Management System  .docx
System and Problem for a Library Management System .docx
 
TopNotch: Systematically Quality Controlling Big Data by David Durst
TopNotch: Systematically Quality Controlling Big Data by David DurstTopNotch: Systematically Quality Controlling Big Data by David Durst
TopNotch: Systematically Quality Controlling Big Data by David Durst
 
COMM 1180 - Mechanical Techniques
COMM 1180 - Mechanical TechniquesCOMM 1180 - Mechanical Techniques
COMM 1180 - Mechanical Techniques
 
RDA Implementation at Edinburgh University Library, 2014 / Alasdair MacDonald...
RDA Implementation at Edinburgh University Library, 2014/ Alasdair MacDonald...RDA Implementation at Edinburgh University Library, 2014/ Alasdair MacDonald...
RDA Implementation at Edinburgh University Library, 2014 / Alasdair MacDonald...
 
Lecture: Summarization
Lecture: SummarizationLecture: Summarization
Lecture: Summarization
 
Parquet - Data I/O - Philadelphia 2013
Parquet - Data I/O - Philadelphia 2013Parquet - Data I/O - Philadelphia 2013
Parquet - Data I/O - Philadelphia 2013
 
How to submit this assignment Your submission should be .docx
How to submit this assignment Your submission should be .docxHow to submit this assignment Your submission should be .docx
How to submit this assignment Your submission should be .docx
 
Organize your research with EndNote
Organize your research with EndNoteOrganize your research with EndNote
Organize your research with EndNote
 
MBA 504 Module Four Power BI Assignment User Manual M
MBA 504 Module Four Power BI Assignment User Manual  MMBA 504 Module Four Power BI Assignment User Manual  M
MBA 504 Module Four Power BI Assignment User Manual M
 
Patterns of the Lambda Architecture -- 2015 April - Hadoop Summit, Europe
Patterns of the Lambda Architecture -- 2015 April - Hadoop Summit, EuropePatterns of the Lambda Architecture -- 2015 April - Hadoop Summit, Europe
Patterns of the Lambda Architecture -- 2015 April - Hadoop Summit, Europe
 
Ref works 2011
Ref works 2011Ref works 2011
Ref works 2011
 

Recently uploaded

Proofreading- Basics to Artificial Intelligence Integration - Presentation:Sl...
Proofreading- Basics to Artificial Intelligence Integration - Presentation:Sl...Proofreading- Basics to Artificial Intelligence Integration - Presentation:Sl...
Proofreading- Basics to Artificial Intelligence Integration - Presentation:Sl...David Celestin
 
Report Writing Webinar Training
Report Writing Webinar TrainingReport Writing Webinar Training
Report Writing Webinar TrainingKylaCullinane
 
Dreaming Music Video Treatment _ Project & Portfolio III
Dreaming Music Video Treatment _ Project & Portfolio IIIDreaming Music Video Treatment _ Project & Portfolio III
Dreaming Music Video Treatment _ Project & Portfolio IIINhPhngng3
 
Digital collaboration with Microsoft 365 as extension of Drupal
Digital collaboration with Microsoft 365 as extension of DrupalDigital collaboration with Microsoft 365 as extension of Drupal
Digital collaboration with Microsoft 365 as extension of DrupalFabian de Rijk
 
Bring back lost lover in USA, Canada ,Uk ,Australia ,London Lost Love Spell C...
Bring back lost lover in USA, Canada ,Uk ,Australia ,London Lost Love Spell C...Bring back lost lover in USA, Canada ,Uk ,Australia ,London Lost Love Spell C...
Bring back lost lover in USA, Canada ,Uk ,Australia ,London Lost Love Spell C...amilabibi1
 
Dreaming Marissa Sánchez Music Video Treatment
Dreaming Marissa Sánchez Music Video TreatmentDreaming Marissa Sánchez Music Video Treatment
Dreaming Marissa Sánchez Music Video Treatmentnswingard
 
If this Giant Must Walk: A Manifesto for a New Nigeria
If this Giant Must Walk: A Manifesto for a New NigeriaIf this Giant Must Walk: A Manifesto for a New Nigeria
If this Giant Must Walk: A Manifesto for a New NigeriaKayode Fayemi
 
AWS Data Engineer Associate (DEA-C01) Exam Dumps 2024.pdf
AWS Data Engineer Associate (DEA-C01) Exam Dumps 2024.pdfAWS Data Engineer Associate (DEA-C01) Exam Dumps 2024.pdf
AWS Data Engineer Associate (DEA-C01) Exam Dumps 2024.pdfSkillCertProExams
 
Chiulli_Aurora_Oman_Raffaele_Beowulf.pptx
Chiulli_Aurora_Oman_Raffaele_Beowulf.pptxChiulli_Aurora_Oman_Raffaele_Beowulf.pptx
Chiulli_Aurora_Oman_Raffaele_Beowulf.pptxraffaeleoman
 
My Presentation "In Your Hands" by Halle Bailey
My Presentation "In Your Hands" by Halle BaileyMy Presentation "In Your Hands" by Halle Bailey
My Presentation "In Your Hands" by Halle Baileyhlharris
 
lONG QUESTION ANSWER PAKISTAN STUDIES10.
lONG QUESTION ANSWER PAKISTAN STUDIES10.lONG QUESTION ANSWER PAKISTAN STUDIES10.
lONG QUESTION ANSWER PAKISTAN STUDIES10.lodhisaajjda
 
Uncommon Grace The Autobiography of Isaac Folorunso
Uncommon Grace The Autobiography of Isaac FolorunsoUncommon Grace The Autobiography of Isaac Folorunso
Uncommon Grace The Autobiography of Isaac FolorunsoKayode Fayemi
 
The workplace ecosystem of the future 24.4.2024 Fabritius_share ii.pdf
The workplace ecosystem of the future 24.4.2024 Fabritius_share ii.pdfThe workplace ecosystem of the future 24.4.2024 Fabritius_share ii.pdf
The workplace ecosystem of the future 24.4.2024 Fabritius_share ii.pdfSenaatti-kiinteistöt
 
SOLID WASTE MANAGEMENT SYSTEM OF FENI PAURASHAVA, BANGLADESH.pdf
SOLID WASTE MANAGEMENT SYSTEM OF FENI PAURASHAVA, BANGLADESH.pdfSOLID WASTE MANAGEMENT SYSTEM OF FENI PAURASHAVA, BANGLADESH.pdf
SOLID WASTE MANAGEMENT SYSTEM OF FENI PAURASHAVA, BANGLADESH.pdfMahamudul Hasan
 

Recently uploaded (15)

Proofreading- Basics to Artificial Intelligence Integration - Presentation:Sl...
Proofreading- Basics to Artificial Intelligence Integration - Presentation:Sl...Proofreading- Basics to Artificial Intelligence Integration - Presentation:Sl...
Proofreading- Basics to Artificial Intelligence Integration - Presentation:Sl...
 
Report Writing Webinar Training
Report Writing Webinar TrainingReport Writing Webinar Training
Report Writing Webinar Training
 
Dreaming Music Video Treatment _ Project & Portfolio III
Dreaming Music Video Treatment _ Project & Portfolio IIIDreaming Music Video Treatment _ Project & Portfolio III
Dreaming Music Video Treatment _ Project & Portfolio III
 
Digital collaboration with Microsoft 365 as extension of Drupal
Digital collaboration with Microsoft 365 as extension of DrupalDigital collaboration with Microsoft 365 as extension of Drupal
Digital collaboration with Microsoft 365 as extension of Drupal
 
Bring back lost lover in USA, Canada ,Uk ,Australia ,London Lost Love Spell C...
Bring back lost lover in USA, Canada ,Uk ,Australia ,London Lost Love Spell C...Bring back lost lover in USA, Canada ,Uk ,Australia ,London Lost Love Spell C...
Bring back lost lover in USA, Canada ,Uk ,Australia ,London Lost Love Spell C...
 
Dreaming Marissa Sánchez Music Video Treatment
Dreaming Marissa Sánchez Music Video TreatmentDreaming Marissa Sánchez Music Video Treatment
Dreaming Marissa Sánchez Music Video Treatment
 
If this Giant Must Walk: A Manifesto for a New Nigeria
If this Giant Must Walk: A Manifesto for a New NigeriaIf this Giant Must Walk: A Manifesto for a New Nigeria
If this Giant Must Walk: A Manifesto for a New Nigeria
 
AWS Data Engineer Associate (DEA-C01) Exam Dumps 2024.pdf
AWS Data Engineer Associate (DEA-C01) Exam Dumps 2024.pdfAWS Data Engineer Associate (DEA-C01) Exam Dumps 2024.pdf
AWS Data Engineer Associate (DEA-C01) Exam Dumps 2024.pdf
 
Chiulli_Aurora_Oman_Raffaele_Beowulf.pptx
Chiulli_Aurora_Oman_Raffaele_Beowulf.pptxChiulli_Aurora_Oman_Raffaele_Beowulf.pptx
Chiulli_Aurora_Oman_Raffaele_Beowulf.pptx
 
My Presentation "In Your Hands" by Halle Bailey
My Presentation "In Your Hands" by Halle BaileyMy Presentation "In Your Hands" by Halle Bailey
My Presentation "In Your Hands" by Halle Bailey
 
lONG QUESTION ANSWER PAKISTAN STUDIES10.
lONG QUESTION ANSWER PAKISTAN STUDIES10.lONG QUESTION ANSWER PAKISTAN STUDIES10.
lONG QUESTION ANSWER PAKISTAN STUDIES10.
 
ICT role in 21st century education and it's challenges.pdf
ICT role in 21st century education and it's challenges.pdfICT role in 21st century education and it's challenges.pdf
ICT role in 21st century education and it's challenges.pdf
 
Uncommon Grace The Autobiography of Isaac Folorunso
Uncommon Grace The Autobiography of Isaac FolorunsoUncommon Grace The Autobiography of Isaac Folorunso
Uncommon Grace The Autobiography of Isaac Folorunso
 
The workplace ecosystem of the future 24.4.2024 Fabritius_share ii.pdf
The workplace ecosystem of the future 24.4.2024 Fabritius_share ii.pdfThe workplace ecosystem of the future 24.4.2024 Fabritius_share ii.pdf
The workplace ecosystem of the future 24.4.2024 Fabritius_share ii.pdf
 
SOLID WASTE MANAGEMENT SYSTEM OF FENI PAURASHAVA, BANGLADESH.pdf
SOLID WASTE MANAGEMENT SYSTEM OF FENI PAURASHAVA, BANGLADESH.pdfSOLID WASTE MANAGEMENT SYSTEM OF FENI PAURASHAVA, BANGLADESH.pdf
SOLID WASTE MANAGEMENT SYSTEM OF FENI PAURASHAVA, BANGLADESH.pdf
 

A Generic Approach for Reference Extraction from PDF Documents

  • 1. WP1 Statusboukhers@uni-koblenz.de A Generic Approach for Reference Extraction from PDF Documents Zeyd Boukhers Bologna, September 04, 2018
  • 2. WP1 Statusboukhers@uni-koblenz.de Reference Extraction and Segmentation EXParser: http://excite.west.uni-koblenz.de:8081/excite Code: https://github.com/exciteproject/Exparser 2
  • 3. WP1 Statusboukhers@uni-koblenz.de Reference Extraction and Segmentation EXParser: http://excite.west.uni-koblenz.de:8081/excite Code: https://github.com/exciteproject/Exparser Reference String Extraction Reference String Segmentation 2
  • 4. WP1 Statusboukhers@uni-koblenz.deboukhers@uni-koblenz.de In 2015: • About 2,5 million scholarly articles published worldwide in 2015. • The publications in Elsevier from 2009 to 2014 were cited 11.5 million times in the same period. Introduction: Motivation Source: https://www.elsevier.com/connect/elsevier-publishing-a-look-at-the-numbers-and-more 3
  • 5. Standard Pipeline For Reference Extraction
      [*] Dominika Tkaczyk et al. 2018. Machine Learning vs. Rules and Out-of-the-Box vs. Retrained: An Evaluation of Open-Source Bibliographic Reference and Citation Parsers. In Proc. of JCDL '18.
  • 8. Introduction: Motivation
      • Different styles of references (i.e. intrinsic and extrinsic differences).
      • More than one section containing the references.
      • Different representations of references (e.g. abbreviated contents).
      • Other types of references (e.g. grey literature, databases and websites).
      • Other languages (e.g. German, French).
  • 9. Problem 1/5: Example of Intrinsic Differences (figure: examples labelled P26, P27, P28 and P29)
  • 12. Problem 1/5: Example of Extrinsic Differences
  • 17. Problem 2/5: Multi-Reference Sections
  • 20. Problem 2/5: Multi-Reference Sections (figure: examples labelled P14, P40 and P101)
  • 22. Problem 3/5: Different Representations
  • 24. Problem 4/5: Other Types of References
  • 27. Generic Pipeline For Reference Extraction
      Problems:
      • Error accumulation.
      • Intrinsic style differences.
      • Extrinsic style differences.
      • Different locations of references.
      Objectives:
      • Optimized pipeline.
      • Generic features.
      • More correlation among the pipeline phases.
  • 29. Generic Pipeline For Reference Extraction
      Each line is classified into one of:
      • 0: non-reference line.
      • 1: first line of a reference.
      • 2: intermediate line of a reference.
      • 3: last line of a reference.
      Lines are combined and segmented simultaneously until they form a consistent reference.
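The 0/1/2/3 labelling scheme above can be illustrated with a minimal Python sketch. It only shows how per-line labels could be joined into reference strings; it does not reproduce the simultaneous combination and segmentation described on the slide. The function name `group_reference_lines` and the label constants are hypothetical and do not come from EXParser.

```python
from typing import List

# Label values as listed on the slide.
NON_REF, FIRST, INTERMEDIATE, LAST = 0, 1, 2, 3


def group_reference_lines(lines: List[str], labels: List[int]) -> List[str]:
    """Join consecutive lines into reference strings using 0/1/2/3 labels.

    Simplified illustration (an assumption): labels are taken as given,
    whereas the slides describe a probabilistic combination of lines.
    """
    references: List[str] = []
    current: List[str] = []
    for text, label in zip(lines, labels):
        if label == NON_REF:
            # A non-reference line closes any open reference.
            if current:
                references.append(" ".join(current))
                current = []
        elif label == FIRST:
            # A first line always starts a new reference.
            if current:
                references.append(" ".join(current))
            current = [text]
        else:
            # Intermediate or last line: extend the open reference.
            current.append(text)
            if label == LAST:
                references.append(" ".join(current))
                current = []
    if current:
        references.append(" ".join(current))
    return references
```

For example, the labels `[0, 1, 2, 3, 1]` over five lines would yield two reference strings: lines two to four joined together, and line five on its own.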
  • 30. Line Classification
      • Number of tokens.
      • Number of digits.
      • Amount of punctuation.
      • …
      • Whether it starts with a capital letter.
      • Whether it contains a year format.
      • …
      • Whether it contains a city (from a large data table).
      • Whether it contains an author name (from a large data table).
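A minimal sketch of how the line-level features listed above might be computed. The feature names, the year pattern (four-digit years from 1500 to 2099) and the lookup sets `cities` and `author_names` are assumptions made for illustration; the actual feature set is documented in the EXParser repository.

```python
import re
import string
from typing import Dict, Set

# Assumed year format: a standalone four-digit year between 1500 and 2099.
YEAR_PATTERN = re.compile(r"\b(1[5-9]\d{2}|20\d{2})\b")


def line_features(line: str, cities: Set[str], author_names: Set[str]) -> Dict[str, float]:
    """Compute a few of the listed features for one text line.

    `cities` and `author_names` stand in for the large data tables
    mentioned on the slide and are expected to hold lower-case entries.
    """
    tokens = line.split()
    # Lower-cased tokens without surrounding punctuation, for table lookups.
    words = {t.strip(string.punctuation).lower() for t in tokens}
    stripped = line.lstrip()
    return {
        "n_tokens": len(tokens),
        "n_digits": sum(ch.isdigit() for ch in line),
        "n_punctuation": sum(ch in string.punctuation for ch in line),
        "starts_with_capital": float(stripped[:1].isupper()),
        "has_year": float(bool(YEAR_PATTERN.search(line))),
        "has_city": float(bool(words & cities)),
        "has_author_name": float(bool(words & author_names)),
    }
```

Each line then becomes a fixed-length numeric vector, which is the kind of input a classifier such as the Random Forest mentioned later would consume.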
  • 31. Example of Generic Characteristics
      [Figure: bar chart for the words und, der, hrsg, das, verlag, unter, eds; series "Freq in Ref" and "Freq in non-Ref x0.1"; value axis from 0 to 4000.]
      Frequency of most frequent words in reference strings and their frequency in non-reference strings.
  • 33. Classification: Training. The features extracted from the training dataset are used to train a Random Forest model.
  • 34. Classification: Testing. The model is employed to classify each line as a non-reference line (0), a first reference line (1), an intermediate reference line (2) or a last reference line (3).
  • 35. Classification: Filtering. Irrelevant lines are discarded by a filtering process.
  • 36. Reference Segmentation
      • Number of characters.
      • Ratio of capital letters.
      • Whether it contains digits.
      • Whether it is followed by a comma.
      • Whether it is between parentheses.
      • Whether it is a city.
      • Whether it is a stop word.
      • etc.
      [*] For more details about the considered features: https://github.com/exciteproject/Exparser
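The segmentation features above read like per-token properties, so the following sketch computes them for each whitespace-separated token of a reference string. This token-level reading, the feature names and the lookup sets `cities` and `stop_words` are assumptions; the parenthesis check is simplified to single tokens such as "(2018)".

```python
import string
from typing import Dict, List, Set, Union

Feature = Union[str, float]


def token_features(reference: str, cities: Set[str], stop_words: Set[str]) -> List[Dict[str, Feature]]:
    """Compute illustrative features for every token of a reference string."""
    features: List[Dict[str, Feature]] = []
    for raw in reference.split():
        # Token without leading and trailing punctuation.
        core = raw.strip(string.punctuation)
        letters = [ch for ch in core if ch.isalpha()]
        capital_ratio = sum(ch.isupper() for ch in letters) / len(letters) if letters else 0.0
        features.append({
            "token": core,
            "n_chars": float(len(core)),
            "capital_ratio": capital_ratio,
            "has_digit": float(any(ch.isdigit() for ch in core)),
            "followed_by_comma": float(raw.endswith(",")),
            # Simplification: only detects a single token wrapped in parentheses.
            "in_parentheses": float(raw.startswith("(") and raw.rstrip(".,;:").endswith(")")),
            "is_city": float(core.lower() in cities),
            "is_stop_word": float(core.lower() in stop_words),
        })
    return features
```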
  • 37. Reference Segmentation & Identification
      • Start with the line having the highest reference probability.
      • Compute the acceptance ratio a, which combines the segmentation probability, the completeness probability and the line-extraction probability.
      • Randomly add a neighbouring line (up or down) and compute a again (a old versus a new).
      • The new sample is accepted if it is better, and rejected otherwise.
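The iterative procedure on these slides can be sketched as a small greedy loop. This is an illustration under stated assumptions, not the EXParser implementation: `grow_reference`, the `score` callable (any strictly positive quality measure of a line span), the fixed step budget and the 50/50 choice between "up" and "down" are all hypothetical.

```python
import random
from typing import Callable, List, Tuple


def grow_reference(
    ref_prob: List[float],
    score: Callable[[int, int], float],
    max_steps: int = 50,
    seed: int = 0,
) -> Tuple[int, int]:
    """Return the inclusive (start, end) line span of one reference.

    ref_prob[i] is the probability that line i belongs to a reference;
    score(start, end) must be strictly positive for every valid span.
    """
    rng = random.Random(seed)
    # Start with the line having the highest reference probability.
    start = end = max(range(len(ref_prob)), key=ref_prob.__getitem__)
    for _ in range(max_steps):
        # Randomly propose adding the neighbouring line above or below.
        if rng.random() < 0.5:
            cand = (start - 1, end)
        else:
            cand = (start, end + 1)
        if cand[0] < 0 or cand[1] >= len(ref_prob):
            continue  # proposal falls outside the document
        # Acceptance ratio: quality of the new span over the old one.
        a = score(cand[0], cand[1]) / score(start, end)
        if a > 1.0:
            # Accept only if the new sample is better, reject otherwise.
            start, end = cand
    return start, end
```

Accepting only when the ratio exceeds 1 follows the wording "accepted if it is better"; whether the original method also accepts worse samples with some probability is not stated on the slides.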
  • 51. Acceptance Ratio
      $a = \frac{P_o(l \mid r_{j+1})}{P_o(l \mid r_j)} \cdot \frac{P_c(r_{j+1})}{P_c(r_j)} \cdot \frac{P_b(r_{j+1})}{P_b(r_j)}$
      • $P_o(l \mid r_{j+1})$: the product of the components' probabilities of the initial line given the new line combination.
      • $P_o(l \mid r_j)$: the product of the components' probabilities of the initial line given the previous line combination.
      • $P_c(r_{j+1})$: the probability that the new reference string is complete.
      • $P_c(r_j)$: the probability that the previous reference string is complete.
      • $P_b(r_{j+1})$: the probability that the new reference string is delimited by borderlines.
      • $P_b(r_j)$: the probability that the previous reference string is delimited by borderlines.
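As a small worked illustration of the acceptance ratio, the sketch below multiplies the three new-to-previous probability ratios. The container `CombinationScores`, the function name and the epsilon guard against division by zero are assumptions added for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CombinationScores:
    """Three probabilities attached to one line combination r (hypothetical container)."""
    p_o: float  # P_o(l | r): product of the components' probabilities of the initial line
    p_c: float  # P_c(r): probability that the reference string is complete
    p_b: float  # P_b(r): probability that the string is delimited by borderlines


def acceptance_ratio(new: CombinationScores, old: CombinationScores) -> float:
    """Product of the three ratios between the new and previous combination."""
    eps = 1e-12  # assumption: avoids division by zero for degenerate scores
    return (
        (new.p_o / max(old.p_o, eps))
        * (new.p_c / max(old.p_c, eps))
        * (new.p_b / max(old.p_b, eps))
    )
```

With previous scores (0.5, 0.4, 0.5) and new scores (0.6, 0.8, 0.5), the ratio is 1.2 × 2.0 × 1.0 = 2.4, so the new combination would count as better.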
  • 55. Probability of Completeness
      [Figure: bar chart over component combinations (Comb.1, Comb.2, …, Comb.n), labelled "Existence of Components"; value axis from 0 to 14.]
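The slide only shows a histogram over combinations of existing components, so the following is one possible reading rather than the documented method: estimate the completeness probability of a reference as the relative frequency, in annotated training references, of its exact set of components. The function names and the component labels in the usage note are hypothetical.

```python
from collections import Counter
from typing import Iterable


def fit_completeness(training_refs: Iterable[Iterable[str]]) -> Counter:
    """Count how often each set of components occurs in training references."""
    return Counter(frozenset(components) for components in training_refs)


def completeness_probability(counts: Counter, components: Iterable[str]) -> float:
    """Relative frequency of this exact component combination (assumed estimator)."""
    total = sum(counts.values())
    return counts[frozenset(components)] / total if total else 0.0
```

For instance, if two of four training references consist of exactly author, year and title, a candidate with those three components would receive 0.5, while a candidate containing only an author would receive 0.0.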
  • 63. Results: Reference Line Extraction
      Metric | CER-D | CER-T | Pars-D | Pars-M | GRO-D | GRO-T | RefExt-T | Proposed
      Precision | 0.296 | 0.303 | 0.558 | 0.617 | 0.627 | 0.847 | 0.879 | 0.874
      Recall | 0.233 | 0.220 | 0.552 | 0.595 | 0.718 | 0.839 | 0.906 | 0.973
      F1-Score | 0.245 | 0.235 | 0.542 | 0.590 | 0.650 | 0.837 | 0.885 | 0.921
      Table 1. Evaluation of reference string extraction using 10-fold cross-validation for the proposed and baseline methods.

      Metric | SVM (C=100) | SVM (default parameters) | Random Forest | Gaussian Naive Bayes
      Precision | 0.713 | 0.624 | 0.874 | 0.809
      Recall | 0.925 | 0.898 | 0.973 | 0.8
      F1-Score | 0.805 | 0.736 | 0.921 | 0.804
      Table 2. Evaluation of reference string extraction using 10-fold cross-validation for different classifiers.
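For readers who want to relate the three metrics in these tables, here is a minimal sketch of the standard definitions of precision, recall and F1 for a binary "reference line or not" decision. The function names are hypothetical, and the slides do not say how scores were aggregated across the ten folds.

```python
from typing import Sequence, Tuple


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def precision_recall_f1(
    true_is_ref: Sequence[bool], pred_is_ref: Sequence[bool]
) -> Tuple[float, float, float]:
    """Compute precision, recall and F1 from per-line gold and predicted flags."""
    pairs = list(zip(true_is_ref, pred_is_ref))
    tp = sum(1 for t, p in pairs if t and p)        # reference lines found
    fp = sum(1 for t, p in pairs if not t and p)    # non-reference lines kept
    fn = sum(1 for t, p in pairs if t and not p)    # reference lines missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall, f1_score(precision, recall)
```

Plugging in the Random Forest values from Table 2, `f1_score(0.874, 0.973)` rounds to 0.921, which agrees with the reported F1-Score. Other columns need not match exactly if the reported values are averages of per-fold scores.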
  • 64. Results: Reference Segmentation
      Tag | Mean Precision (Proposed) | Mean Precision (Cermine) | Mean Recall (Proposed) | Mean Recall (Cermine) | F score (Proposed) | F score (Cermine)
      Publisher | 0.805 | - | 0.455 | - | 0.581 | -
      Editor | 0.902 | - | 0.711 | - | 0.795 | -
      Page (inc FP & LP) | 0.959 | 0.765 | 0.932 | 0.890 | 0.946 | 0.823
      Volume | 0.806 | 0.871 | 0.830 | 0.675 | 0.818 | 0.761
      First Name | 0.865 | 0.216 | 0.824 | 0.761 | 0.844 | 0.336
      Last Name | 0.869 | 0.596 | 0.917 | 0.955 | 0.892 | 0.734
      Source | 0.631 | 0.669 | 0.783 | 0.543 | 0.699 | 0.6
      Year | 0.903 | 0.862 | 0.980 | 0.884 | 0.940 | 0.873
      Title | 0.942 | 0.872 | 0.901 | 0.856 | 0.921 | 0.864
      Other | 0.770 | - | 0.789 | - | 0.779 | -
      Average / Total | 0.85357143 | 0.693 | 0.881 | 0.79485714 | 0.86571429 | 0.713
      Table 3. Evaluation of reference parsing on 304 references (Cermine with default training).
  • 65. Results: Reference Segmentation
      Tag | Mean Precision (Proposed) | Mean Precision (Cermine) | Mean Recall (Proposed) | Mean Recall (Cermine)
      Article Title | 0.8367 | 0.8415 | 0.8805 | 0.6879
      Editor | 0.6722 | - | 0.5683 | -
      Author (inc FN & LN) | 0.8611 | 0.7792 | 0.7410 | 0.6260
      Page (inc FP & LP) | 0.8489 | 0.8072 | 0.5616 | 0.5915
      Issue | 0.4688 | 0.3833 | 0.6511 | 0.2164
      Other | 0.6872 | - | 0.7951 | -
      Publisher | 0.7459 | - | 0.8578 | -
      Source | 0.5957 | 0.3198 | 0.6906 | 0.3012
      URL | 0.6370 | - | 0.3350 | -
      Volume | 0.6611 | 0.6199 | 0.7891 | 0.3130
      Year | 0.8649 | 0.7832 | 0.8315 | 0.8959
      Average / Total | 0.73388571 | 0.64772857 | 0.73505714 | 0.51884286
      Table 4. Evaluation of reference parsing on 2023 references (Cermine is trained with the same training set as the proposed method) using 10-fold cross-validation.
  • 66. Results: Reference Segmentation
      Tag | Mean Precision (Proposed) | Mean Precision (Cermine) | Mean Recall (Proposed) | Mean Recall (Cermine)
      Article Title | 0.85953517 | 0.79936523 | 0.85837475 | 0.86318882
      Editor | 0.61669282 | - | 0.66614764 | -
      Author (inc FN & LN) | 0.81999131 | 0.71916667 | 0.62271211 | 0.82
      Issue | 0.69521568 | 0.66815163 | 0.81991421 | 0.58327425
      Publisher | 0.63557433 | - | 0.86028141 | -
      Source | 0.56182764 | 0.53182803 | 0.78004004 | 0.61664321
      URL | 0.56915911 | - | 0.24519389 | -
      Volume | 0.72311321 | 0.63454301 | 0.79538527 | 0.8476872
      Year | 0.8031926 | 0.79916696 | 0.86452723 | 0.90456963
      Average / Total | 0.7438126 | 0.69203692 | 0.79015893 | 0.77256052
      Table 5. Evaluation of reference parsing on 100 English articles [2860 references] (Cermine is trained with the same training set as the proposed method) using 10-fold cross-validation.
  • 67. Conclusion
      • A generic approach to extract and parse references.
      • The approach is standardized as long as similar training data is available.
      • The approach works as a coherent mechanism to avoid error accumulation.
      • The output of each phase is given with confidence scores to improve the subsequent phase.
  • 68. Thank you for your attention! Questions? Contact: Zeyd Boukhers, Institute for Web Science and Technologies, University of Koblenz-Landau, boukhers@uni-koblenz.de Or excite@uni-koblenz.de Or 