Efficient blocking method for a large scale citation matching

Mateusz Fedoryszak & Łukasz Bolikowski
{matfed,bolo}@icm.edu.pl
Interdisciplinary Centre for Mathematical and
Computational Modelling
University of Warsaw

Citation matching
References
[1] I. Newton, Philosophiae naturalis...
[2] N. Copernicus, De revolutionibus...
ID  Title                 Author
11  Στοιχεῖα              Εὐκλείδης
14  De revolutionibus...  Copernicus
• Note: this is an instance of the data linkage problem

Why important?
• Clickable interfaces
• Bibliometrics
(think: H-index)
• Further analysis
(e.g. similarities)
Why difficult?
• Citation extraction errors
(in both digital-born and
retro-born docs)
• Countless citation styles
used inconsistently
• Typos and other human
errors

The Problem
[Figure: a list of references next to a table of documents (ID, Title, Author)]

Naïve approach
For 1.3M documents and 12M citations, comparing every citation with every
document takes 15.6 × 10^12 comparisons
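The figure is simply the product of the two collection sizes. A quick check, using the rounded counts quoted on the slide (the function name is only illustrative):

```python
def naive_comparisons(num_documents: int, num_citations: int) -> int:
    """Number of citation-document pairs when every pair is compared."""
    return num_documents * num_citations


documents = 1_300_000    # 1.3M documents (rounded figure from the slide)
citations = 12_000_000   # 12M citations (rounded figure from the slide)

total = naive_comparisons(documents, citations)
print(f"{total:.3e}")  # 1.560e+13, i.e. 15.6 × 10^12
```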

Select the best candidates
• I'll present a method of candidate selection and how to
implement it using Apache Hadoop

Blocking
[Figure: references and the document table (ID, Title, Author)]

Fingerprints
[Figure: references and documents (ID, Title, Author), each annotated with
fingerprints such as AAAA, BBBB, CCCC, EEEE, FFFF]

Workflow
[Figure: MapReduce workflow]
Map: citation → (hash, citation ID); document → (hash, document ID)
Reduce: records grouped by hash → (citation ID, document ID) pairs
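The workflow can be imitated in memory to see how the join on hashes works. This is a minimal single-process sketch, not the Hadoop implementation: `map_phase`, `reduce_phase` and the placeholder `token_hashes` function are names invented for the illustration, and the sample records are taken from the first slide.

```python
from collections import defaultdict


def token_hashes(text):
    """Placeholder hash function (an assumption): one hash per lowercase token."""
    return set(text.lower().split())


def map_phase(records, kind, hash_fn):
    """Emit (hash, (kind, record ID)) for every hash generated from a record."""
    for record_id, text in records.items():
        for h in hash_fn(text):
            yield h, (kind, record_id)


def reduce_phase(pairs):
    """Group records by hash and pair each citation with each document in a bucket."""
    buckets = defaultdict(lambda: {"citation": set(), "document": set()})
    for h, (kind, record_id) in pairs:
        buckets[h][kind].add(record_id)
    candidates = set()
    for bucket in buckets.values():
        for citation_id in bucket["citation"]:
            for document_id in bucket["document"]:
                candidates.add((citation_id, document_id))
    return candidates


citations = {"1": "I. Newton Philosophiae naturalis",
             "2": "N. Copernicus De revolutionibus"}
documents = {"11": "Στοιχεῖα Εὐκλείδης",
             "14": "De revolutionibus Copernicus"}

emitted = list(map_phase(citations, "citation", token_hashes))
emitted += list(map_phase(documents, "document", token_hashes))
print(sorted(reduce_phase(emitted)))  # [('2', '14')]
```

Only records that share at least one hash ever meet in a bucket, so most of the citation-document pairs are never generated.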

Workflow with tuning
• Before:
  • Compute bucket sizes
  • Reject the ones that are too big
  • Use DistributedCache to disseminate the result
• After:
  • For each citation, choose only the most popular candidates
[Figure: the MapReduce workflow from the previous slide, extended with the
steps listed above]
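Both tuning steps can be sketched in plain Python. Two assumptions are made here that the slides do not spell out: bucket size is counted as the number of documents sharing a hash, and a candidate's "popularity" is the number of hashes it shares with the citation. The function and parameter names are hypothetical.

```python
from collections import Counter, defaultdict


def candidates_with_limits(citation_hashes, document_hashes,
                           max_bucket=1000, max_candidates=30):
    """Return, per citation, the most popular candidate documents."""
    # Before: compute bucket sizes and reject buckets that are too big.
    buckets = defaultdict(set)
    for document_id, hashes in document_hashes.items():
        for h in hashes:
            buckets[h].add(document_id)
    allowed = {h: docs for h, docs in buckets.items() if len(docs) <= max_bucket}

    # After: count shared hashes per document and keep the top candidates.
    result = {}
    for citation_id, hashes in citation_hashes.items():
        votes = Counter()
        for h in hashes:
            for document_id in allowed.get(h, ()):
                votes[document_id] += 1
        result[citation_id] = [doc for doc, _ in votes.most_common(max_candidates)]
    return result


document_hashes = {"d1": {"a", "b", "x"}, "d2": {"a", "x"}, "d3": {"x"}}
citation_hashes = {"c1": {"a", "b", "x"}}

# Hash "x" is shared by three documents, so its bucket is rejected when max_bucket=2.
print(candidates_with_limits(citation_hashes, document_hashes,
                             max_bucket=2, max_candidates=2))
# {'c1': ['d1', 'd2']}
```

In the Hadoop setting the set of rejected hashes would be the small side file shipped to every mapper; here it is just a dictionary filter.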

Normalisation
• Lowercase
• Remove
  • diacritics
  • punctuation marks
• Filter out tokens shorter than 3 characters (except numbers)

Normalisation
Pawlak, Zdzisław (1982). "Rough sets".
Internat. J. Comput. Inform. Sci. 11 (5): 341–356.
pawlak zdzislaw 1982 rough sets
internat comput inform sci 11 5 341 356
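A minimal sketch of these normalisation rules, assuming Unicode decomposition is an acceptable way to remove diacritics. Letters such as "ł" have no Unicode decomposition, so the small `EXTRA_LETTERS` table is an assumption added to make the example citation come out as on the slide.

```python
import re
import unicodedata

# Letters that NFKD does not decompose (illustrative, not exhaustive).
EXTRA_LETTERS = str.maketrans({"ł": "l", "Ł": "L", "ø": "o", "Ø": "O"})


def normalise(text, min_length=3):
    """Lowercase, strip diacritics and punctuation, drop short non-numeric tokens."""
    text = text.translate(EXTRA_LETTERS)
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    tokens = re.findall(r"[^\W_]+", stripped.lower())  # runs of letters/digits
    return [t for t in tokens if t.isdigit() or len(t) >= min_length]


citation = ('Pawlak, Zdzisław (1982). "Rough sets". '
            'Internat. J. Comput. Inform. Sci. 11 (5): 341–356.')
print(" ".join(normalise(citation)))
# pawlak zdzislaw 1982 rough sets internat comput inform sci 11 5 341 356
```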

Examples
Pawlak, Zdzisław (1982).
"Rough sets".
Internat. J. Comput. Inform. Sci.
11 (5): 341–356.
{
author: "Zdzisław Pawlak",
year: "1982",
title: "Rough sets",
journal: "International Journal of Computer & Information Sciences",
volume: "11",
issue: "5",
pages: "341–356"
}

Baseline
Citation hashes:
pawlak
zdzislaw
1982
rough
...
internat
...
Document hashes:
zdzislaw
pawlak
1982
rough
...
international
journal
...

Bigrams
• For a document, we use only the author and title fields
Citation hashes:
pawlak zdzislaw
zdzislaw 1982
1982 rough
rough sets
...
Document hashes:
zdzislaw pawlak
rough sets
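Bigram hashes are pairs of adjacent normalised tokens. In the sketch below, document bigrams are generated within each field separately; this is an inference from the example, which lists "zdzislaw pawlak" and "rough sets" but no bigram spanning the two fields. The function names are illustrative.

```python
def bigram_hashes(tokens):
    """Join every pair of adjacent tokens into one hash."""
    return [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]


def document_bigram_hashes(fields):
    """Bigrams per field (assumed: no bigrams across field boundaries)."""
    hashes = []
    for tokens in fields:
        hashes.extend(bigram_hashes(tokens))
    return hashes


citation_tokens = ["pawlak", "zdzislaw", "1982", "rough", "sets"]
document_fields = [["zdzislaw", "pawlak"], ["rough", "sets"]]  # author, title

citation_hashes = bigram_hashes(citation_tokens)
document_hashes = document_bigram_hashes(document_fields)
print(citation_hashes)  # ['pawlak zdzislaw', 'zdzislaw 1982', '1982 rough', 'rough sets']
print(document_hashes)  # ['zdzislaw pawlak', 'rough sets']
print(set(citation_hashes) & set(document_hashes))  # {'rough sets'}
```

Because word order matters, the author name contributes no shared hash in this example; the match comes from the title bigram.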

name-year
• For a citation:
  • name: any of the first 4 distinct text tokens
  • year: any number between 1900 and 2050
Citation hashes:
pawlak#1982
zdzislaw#1982
rough#1982
sets#1982
Document hashes:
zdzislaw#1982
pawlak#1982
+ approximate variant:
zdzislaw#1981
pawlak#1981
zdzislaw#1983
pawlak#1983
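A sketch of name-year hash generation from normalised tokens. The `approximate` flag adds the neighbouring years, as in the 1981 and 1983 hashes listed above; which side of the match generates them is not stated, so here it is simply a parameter. All names are hypothetical.

```python
def name_year_hashes(tokens, max_names=4, approximate=False):
    """Combine each of the first distinct text tokens with each plausible year."""
    names = []
    for token in tokens:
        if len(names) == max_names:
            break
        if not token.isdigit() and token not in names:
            names.append(token)

    years = [int(t) for t in tokens if t.isdigit() and 1900 <= int(t) <= 2050]
    if approximate:
        years = [year + delta for year in years for delta in (-1, 0, 1)]
    years = list(dict.fromkeys(years))  # drop duplicates, keep order

    return [f"{name}#{year}" for year in years for name in names]


citation_tokens = "pawlak zdzislaw 1982 rough sets internat comput inform sci 11 5 341 356".split()
print(name_year_hashes(citation_tokens))
# ['pawlak#1982', 'zdzislaw#1982', 'rough#1982', 'sets#1982']

document_tokens = ["zdzislaw", "pawlak", "1982"]  # author and year fields
print(name_year_hashes(document_tokens, approximate=True))
# ['zdzislaw#1981', 'pawlak#1981', 'zdzislaw#1982', 'pawlak#1982', 'zdzislaw#1983', 'pawlak#1983']
```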

name-year-pages
• For a citation:
  • pages: any sorted pair of numbers, excluding the year
Citation hashes:
pawlak#1982#5#11
pawlak#1982#5#341
pawlak#1982#...
pawlak#1982#341#356
zdzislaw#...
zdzislaw#1982#341#356
rough#...
sets#...
Document hashes:
zdzislaw#1982#341#356
pawlak#1982#341#356
+ approximate & optimistic variant

Intermezzo: citation parsing
Token:  Pawlak  ,      Zdzisław  (      1982  )      .
Label:  author  other  author    other  year  other  other
...
...
Pawlak, Zdzisław (1982). "Rough sets".
Internat. J. Comput. Inform. Sci. 11 (5): 341–356.

name-year-num^n
• n = 1..3
• For a citation:
  • num^n: any sorted tuple of n numbers, excluding the year
Citation hashes:
pawlak#1982#5#11#341
pawlak#1982#5#341#356
pawlak#1982#5#11#356
pawlak#1982#11#341#356
zdzislaw#...
rough#...
sets#...
Document hashes:
pawlak#1982#5#11#341
pawlak#1982#5#341#356
pawlak#1982#5#11#356
pawlak#1982#11#341#356
zdzislaw#...
+ approximate variant
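The tuples can be enumerated with `itertools.combinations` over the sorted numbers. This sketch assumes "not year" means that the number used as the year is excluded from the tuple; the function name and signature are illustrative.

```python
from itertools import combinations


def name_year_num_hashes(names, year, numbers, n):
    """Hashes of the form name#year#num1#...#numn with numbers in ascending order."""
    usable = sorted({x for x in numbers if x != year})
    return [f"{name}#{year}#" + "#".join(str(x) for x in combo)
            for name in names
            for combo in combinations(usable, n)]


numbers = [1982, 11, 5, 341, 356]  # every number found in the example citation
for h in name_year_num_hashes(["pawlak"], 1982, numbers, n=3):
    print(h)
# pawlak#1982#5#11#341
# pawlak#1982#5#11#356
# pawlak#1982#5#341#356
# pawlak#1982#11#341#356
```

With n = 2 the same function yields sorted pairs such as pawlak#1982#341#356, the form used by the name-year-pages hash. The number of hashes grows combinatorially with the count of numbers in a citation, which is one reason the intermediate data for these hashes is large.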

Test dataset
<ref id="pone.0052832-Jemal1"><label>2</label><mixed-citation publication-type="journal">
<name><surname>Jemal</surname><given-names>A</given-names></name>,
<name><surname>Bray</surname><given-names>F</given-names></name>,
<name><surname>Center</surname><given-names>MM</given-names></name>,
<name><surname>Ferlay</surname><given-names>J</given-names></name>,
<name><surname>Ward</surname><given-names>E</given-names></name>,
<etal>et al</etal>
(<year>2011</year>)
<article-title>Global cancer statistics</article-title>.
<source>CA Cancer J Clin</source>
<volume>61</volume>: <fpage>69</fpage>–<lpage>90</lpage>
<pub-id pub-id-type="pmid">21296855</pub-id>
</mixed-citation></ref>

Test dataset
2 Jemal A, Bray F, Center MM, Ferlay J, Ward E, et al (2011) Global
cancer statistics. CA Cancer J Clin 61: 69–90

Test dataset
• Based on Open Access Subset of PMC
• Only citations preserving original formatting
• Only citations with PMID assigned
• 528k documents
• 3.6M citations, of which 321k are resolvable

Metrics
• Recall: the percentage of true citation → document links that are
preserved by the heuristic
• Precision: the percentage of citation → document links returned by the
algorithm that are correct
• Intermediate data: total number of hashes and pairs generated (before
selecting the most popular ones)
• Candidate pairs: number of pairs returned by the heuristic for further
assessment
• F-measure intentionally not included
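Precision and recall as defined above can be computed from two sets of (citation ID, document ID) pairs. A minimal sketch with made-up identifiers:

```python
def precision_recall(candidate_pairs, true_links):
    """Precision and recall of candidate pairs against ground-truth links."""
    candidate_pairs = set(candidate_pairs)
    true_links = set(true_links)
    correct = candidate_pairs & true_links
    precision = len(correct) / len(candidate_pairs) if candidate_pairs else 0.0
    recall = len(correct) / len(true_links) if true_links else 0.0
    return precision, recall


# Illustrative data only: two true links, four candidates returned by a heuristic.
true_links = {("c1", "d1"), ("c2", "d2")}
candidates = {("c1", "d1"), ("c1", "d3"), ("c1", "d4"), ("c2", "d5")}

print(precision_recall(candidates, true_links))  # (0.25, 0.5)
```

For a blocking step, low precision is acceptable as long as recall stays high, because a more expensive assessment of the candidate pairs follows.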

Limits
• Candidate documents per citation
  • 30
  • no limit
• Bucket size
  • 10
  • 100
  • 1000
  • 10000
  • no limit

Recall
Hash                               Precision  Recall  Intermediate data  To assess
bigrams (10000, 30)                0.4%       98.2%   285,908,900        79,329,459
baseline (10000, 30)               0.3%       97.9%   221,212,080        114,223,777
bigrams (100, 30)                  2.9%       92.7%   94,693,721         10,446,883
name-year (approx.)                0.0%       92.4%   928,068,651        862,357,212
name-year (strict)                 0.1%       90.2%   322,015,088        290,940,929
baseline (10000, 10)               0.9%       88.7%   221,212,080        49,747,843
name-year-num (approx., 1000, 30)  1.2%       88.5%   170,633,938        23,591,933
name-year-num (strict, 1000, 30)   3.6%       88.3%   85,756,601         7,864,129
name-year (strict, 1000, 30)       2.5%       77.9%   28,463,067         9,940,403
name-year (approx., 1000, 30)      1.4%       75.6%   40,726,102         17,098,080
baseline (1000, 30)                0.9%       73.2%   115,822,141        26,083,677

Precision
Hash                                   Precision  Recall  Intermediate data  To assess
name-year-pages (strict, optimistic)   98.4%      7.3%    4,787,215          23,734
name-year-num^3 (strict)               84.0%      43.4%   257,639,965        166,128
name-year-pages (approx., optimistic)  78.2%      7.8%    42,478,742         32,182
name-year-pages (strict, pessimistic)  53.7%      42.5%   132,809,210        254,208
name-year-num^3 (approx.)              17.6%      47.1%   617,193,035        860,314
name-year-num^2 (strict)               14.8%      66.6%   141,885,270        1,444,074
bigrams (10, 10)                       11.8%      65.6%   84,042,160         1,784,228

Recall/intermediate data
Hash                                            Precision  Recall  Intermediate data  To assess
name-year (strict, 1000, 30)                    2.5%       77.9%   28,463,067         9,940,403
name-year (approx., 1000, 30)                   1.4%       75.6%   40,726,102         17,098,080
name-year-pages (strict, optimistic, 1000, 30)  98.4%      7.3%    4,787,215          23,734
bigrams (100, 30)                               2.9%       92.7%   94,693,721         10,446,883
bigrams (10, 30)                                11.8%      65.6%   84,042,160         1,793,997
baseline (1000, 30)                             0.9%       73.2%   115,822,141        26,083,677
name-year-num (approx., 1000, 30)               1.2%       88.5%   170,633,938        23,591,933
baseline (100, 30)                              3.2%       44.0%   91,175,101         4,458,560
name-year-num^2 (strict, 1000, 30)              18.4%      66.6%   141,553,137        1,165,181
baseline (10000, 30)                            0.3%       97.9%   221,212,080        114,223,777

Recall/to assess
Hash                                              Precision  Recall  Intermediate data  To assess
name-year-pages (strict, optimistic, 1000, 30)    98.4%      7.3%    4,787,215          23,734
name-year-num^3 (strict, 1000, 30)                84.0%      43.4%   257,637,645        165,995
name-year-pages (approx., optimistic, 1000, 30)   78.5%      7.8%    42,478,742         32,042
name-year-pages (strict, pessimistic, 1000, 30)   56.3%      42.5%   132,792,590        242,261
name-year-num^3 (approx., 1000, 30)               19.1%      47.1%   617,046,925        794,284
name-year-num^2 (strict, 1000, 30)                18.4%      66.6%   141,553,137        1,165,181
bigrams (10, 30)                                  11.8%      65.6%   84,042,160         1,793,997
name-year-pages (approx., pessimistic, 1000, 30)  9.9%       45.8%   172,447,469        1,483,980
name-year-num^2 (approx., 1000, 30)               3.2%       69.8%   359,051,798        7,023,337
baseline (100, 30)                                3.2%       44.0%   91,175,101         4,458,560
bigrams (100, 30)                                 2.9%       92.7%   94,693,721         10,446,883

Lost citations
Hash                                              Lost fraction
name-year (approx., 1000, 30)                     12.4%
name-year-num^2 (approx., 1000, 30)               12.3%
name-year (strict, 1000, 30)                      9.8%
name-year-pages (approx., pessimistic, 1000, 30)  9.0%
baseline (10000, 10)                              6.7%
name-year-num (approx., 1000, 30)                 6.0%
name-year (strict)                                5.8%
name-year-num^2 (strict, 1000, 30)                5.6%
name-year (approx.)                               5.1%
name-year-num (strict, 1000, 30)                  4.4%
name-year-num^3 (approx., 1000, 30)               4.2%
baseline (1000, 30)                               3.7%

Results
Hash sequence Recall Intermediate data To assess
bigrams (10000, 30) 98.17% 285,908,900 79,329,459
name-year-pages (strict, optimistic)
87.64% 187,394,452 41,152,278
name-year (strict, 1000, 30)
name-year (strict, 10000, 30)
bigrams (10000, 30)
name-year-pages (strict, pessimistic)
bigrams (100, 30)
bigrams (10000, 30)
96.86% 333,701,109 29,818,635
bigrams (100, 30)
bigrams (10000, 30)
97.76% 202,590,413 30,582,488
name-year-num^3 (strict)
bigrams (10, 10)
bigrams (100, 30)
bigrams (10000, 30)
97.73% 398,895,930 25,123,164

Future work
• Other combinations
• After fine-grained assessment
• Various hash functions at the same time
• Further efficiency tuning
• Limit number of generated hashes

CoAnSys Project
• An open source framework for mining very large
collections of scientific publications
• Contains implementation of the presented
workflow
• http://coansys.ceon.pl/

Thank you! Questions?
Mateusz Fedoryszak
matfed@icm.edu.pl
http://coansys.ceon.pl/
http://adalab.icm.edu.pl/
