Efficient blocking method for a large scale citation matching
1. Efficient blocking method for a large scale citation matching
Mateusz Fedoryszak & Łukasz Bolikowski
{matfed,bolo}@icm.edu.pl
Interdisciplinary Centre for Mathematical and Computational Modelling
University of Warsaw
2. Citation matching
References
[1] I. Newton, Philosophiae naturalis...
[2] N. Copernicus, De revolutionibus...
ID  Title                 Author
11  Στοιχεῖα              Εὐκλείδης
14  De revolutionibus...  Copernicus
• Note: this is an instance of the data linkage problem
3. Why important?
• Clickable interfaces
• Bibliometrics (think: H-index)
• Further analysis (e.g. similarities)
Why difficult?
• Citation extraction errors (in both digital-born and retro-born docs)
• Countless citation styles, used inconsistently
• Typos and other human errors
9. Workflow
[Diagram: MapReduce workflow. Map: each citation and each document is turned into (hash, citation ID) or (hash, document ID) records. Reduce: records sharing a hash are joined into (citation ID, document ID) candidate pairs.]
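The workflow above can be sketched in plain Python, without Hadoop, as a map step that emits (hash, ID) records and a reduce step that pairs citations with documents sharing a hash. This is only an illustration: the function names, the toy data and the toy hash function `lowercase_tokens` are assumptions, not the CoAnSys implementation.

```python
from collections import defaultdict


def map_phase(citations, documents, citation_hashes, document_hashes):
    """Emit one (hash, kind, ID) record per hash of every citation and document."""
    for cit_id, text in citations.items():
        for h in citation_hashes(text):
            yield h, "citation", cit_id
    for doc_id, metadata in documents.items():
        for h in document_hashes(metadata):
            yield h, "document", doc_id


def reduce_phase(records):
    """Group records by hash, then pair each citation with each document in a bucket."""
    buckets = defaultdict(lambda: {"citation": set(), "document": set()})
    for h, kind, entity_id in records:
        buckets[h][kind].add(entity_id)
    pairs = set()
    for bucket in buckets.values():
        for cit_id in bucket["citation"]:
            for doc_id in bucket["document"]:
                pairs.add((cit_id, doc_id))
    return pairs


def lowercase_tokens(text):
    """Toy hash function (hypothetical): every lowercase word is a hash."""
    return set(text.lower().split())


# Toy data, loosely modelled on the example from the "Citation matching" slide.
citations = {"c1": "Copernicus De revolutionibus", "c2": "Newton Principia"}
documents = {14: "De revolutionibus Copernicus", 11: "Elements Euclid"}
candidate_pairs = reduce_phase(
    map_phase(citations, documents, lowercase_tokens, lowercase_tokens)
)
print(candidate_pairs)
```

Only pairs that share at least one hash reach the later, more expensive assessment step, which is the point of blocking.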
10. Workflow with tuning
• Before:
  • Compute bucket sizes
  • Reject the ones that are too big
  • Use DistributedCache to disseminate them
• After:
  • For each citation, choose only the most popular candidates
[Diagram: the same Map and Reduce workflow as on the previous slide, producing (citation ID, document ID) pairs, with the tuning steps applied before and after it.]
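The two tuning steps can be sketched as follows. Several details are assumptions made for illustration: bucket size is taken to be the number of citations plus documents sharing a hash, and the "popularity" of a candidate is taken to be the number of non-rejected hashes it shares with the citation. The sketch runs in memory and does not use Hadoop's DistributedCache; `tuned_candidates` and its parameters are hypothetical names.

```python
from collections import Counter, defaultdict


def tuned_candidates(records, max_bucket_size=None, max_candidates=None):
    """records: iterable of (hash, kind, ID), where kind is "citation" or "document"."""
    buckets = defaultdict(lambda: {"citation": set(), "document": set()})
    for h, kind, entity_id in records:
        buckets[h][kind].add(entity_id)

    # Before: compute bucket sizes and reject the buckets that are too big.
    votes = defaultdict(Counter)
    for bucket in buckets.values():
        size = len(bucket["citation"]) + len(bucket["document"])  # assumed definition
        if max_bucket_size is not None and size > max_bucket_size:
            continue
        for cit_id in bucket["citation"]:
            for doc_id in bucket["document"]:
                votes[cit_id][doc_id] += 1  # one vote per shared hash

    # After: for each citation keep only the most popular candidates.
    return {
        cit_id: [doc_id for doc_id, _ in counter.most_common(max_candidates)]
        for cit_id, counter in votes.items()
    }


records = [
    ("pawlak#1982", "citation", "c1"),
    ("rough#1982", "citation", "c1"),
    ("pawlak#1982", "document", "d1"),
    ("rough#1982", "document", "d1"),
    ("rough#1982", "document", "d2"),
]
print(tuned_candidates(records, max_candidates=1))
print(tuned_candidates(records, max_bucket_size=2))
```

Passing `None` for either limit corresponds to the "no limit" setting.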
16. Bigrams
• For a document, we use only the authors and title fields
Citation bigrams:
pawlak zdzislaw
zdzislaw 1982
1982 rough
rough sets
...
Document bigrams:
zdzislaw pawlak
rough sets
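A minimal sketch of how such bigram hashes could be produced. The normalization (lowercasing, stripping diacritics, keeping only letter and digit runs) is an assumption inferred from the example hashes above, and the function names are hypothetical.

```python
import re
import unicodedata


def normalize_tokens(text):
    """Lowercase, strip diacritics and split into alphanumeric tokens (assumed normalization)."""
    # "ł" has no Unicode decomposition, so it is mapped by hand.
    text = text.replace("ł", "l").replace("Ł", "L")
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return re.findall(r"[a-z0-9]+", stripped.lower())


def bigram_hashes(text):
    """Every pair of adjacent tokens is one hash."""
    tokens = normalize_tokens(text)
    return [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]


def document_bigram_hashes(authors, title):
    """For a document only the authors and title fields are used, each on its own."""
    return bigram_hashes(authors) + bigram_hashes(title)


citation_hashes = bigram_hashes('Pawlak, Zdzisław (1982). "Rough sets".')
document_hashes = document_bigram_hashes("Zdzisław Pawlak", "Rough sets")
print(citation_hashes)
print(document_hashes)
```

In this example the citation and the document share the hash `rough sets`, so they would land in the same bucket.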
17. name-year
• For a citation:
  • name: any of the first 4 distinct text tokens
  • year: any number between 1900 and 2050
Citation hashes:
pawlak#1982
zdzislaw#1982
rough#1982
sets#1982
Document hashes:
zdzislaw#1982
pawlak#1982
+ approximate variant:
zdzislaw#1981
pawlak#1981
zdzislaw#1983
pawlak#1983
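The name-year rules can be sketched as below. The tokenizer is a simplification, and applying the approximate variant (year ± 1) on the document side is an assumption based on the order of the example hashes above; all function names are hypothetical.

```python
import re


def tokens_of(text):
    """Simplified tokenizer (assumption): lowercase, map ł to l, keep letter and digit runs."""
    return re.findall(r"[a-z]+|[0-9]+", text.lower().replace("ł", "l"))


def citation_name_year(citation):
    tokens = tokens_of(citation)
    # name: any of the first 4 distinct text tokens
    names = list(dict.fromkeys(t for t in tokens if t.isalpha()))[:4]
    # year: any number between 1900 and 2050
    years = list(dict.fromkeys(t for t in tokens if t.isdigit() and 1900 <= int(t) <= 2050))
    return [f"{name}#{year}" for year in years for name in names]


def document_name_year(authors, year, approximate=False):
    names = list(dict.fromkeys(tokens_of(authors)))
    # Approximate variant: also emit hashes for the previous and the next year.
    years = [year - 1, year, year + 1] if approximate else [year]
    return [f"{name}#{y}" for y in years for name in names]


citation = ('Pawlak, Zdzisław (1982). "Rough sets". '
            "Internat. J. Comput. Inform. Sci. 11 (5): 341–356.")
print(citation_name_year(citation))
print(document_name_year("Zdzisław Pawlak", 1982))
print(document_name_year("Zdzisław Pawlak", 1982, approximate=True))
```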
18. name-year-pages
• For a citation:
  • pages: any sorted pair of numbers, excluding the year
Citation hashes:
pawlak#1982#5#11
pawlak#1982#5#341
pawlak#1982#...
pawlak#1982#341#356
zdzislaw#...
zdzislaw#1982#341#356
rough#...
sets#...
Document hashes:
zdzislaw#1982#341#356
pawlak#1982#341#356
+ approximate & optimistic variant
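A sketch of the name-year-pages hashes for the baseline variant only; the approximate and optimistic variants are not specified on the slide, so they are left out. The tokenizer, the reading of "not year" as "excluding the number chosen as the year", and the function names are assumptions.

```python
import re
from itertools import combinations


def tokens_of(text):
    """Simplified tokenizer (assumption): lowercase, map ł to l, keep letter and digit runs."""
    return re.findall(r"[a-z]+|[0-9]+", text.lower().replace("ł", "l"))


def citation_name_year_pages(citation):
    tokens = tokens_of(citation)
    names = list(dict.fromkeys(t for t in tokens if t.isalpha()))[:4]
    numbers = [int(t) for t in tokens if t.isdigit()]
    years = list(dict.fromkeys(n for n in numbers if 1900 <= n <= 2050))
    hashes = []
    for year in years:
        # pages: any sorted pair of the remaining numbers
        others = sorted(set(n for n in numbers if n != year))
        for name in names:
            for low, high in combinations(others, 2):
                hashes.append(f"{name}#{year}#{low}#{high}")
    return hashes


def document_name_year_pages(authors, year, first_page, last_page):
    names = list(dict.fromkeys(tokens_of(authors)))
    return [f"{name}#{year}#{first_page}#{last_page}" for name in names]


citation = ('Pawlak, Zdzisław (1982). "Rough sets". '
            "Internat. J. Comput. Inform. Sci. 11 (5): 341–356.")
citation_hashes = citation_name_year_pages(citation)
document_hashes = document_name_year_pages("Zdzisław Pawlak", 1982, 341, 356)
print(len(citation_hashes), citation_hashes[:6])
print(document_hashes)
```

The citation side is deliberately generous: it cannot tell volume and issue numbers from page numbers, so it emits every sorted pair and relies on the document side to supply the correct one.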
19. Intermezzo: citation parsing
Pawlak  ,      Zdzisław  (      1982  )      .
author  other  author    other  year  other  other
...
...
Pawlak, Zdzisław (1982). "Rough sets".
Internat. J. Comput. Inform. Sci. 11 (5): 341–356.
20. name-year-num^n
• n = 1..3
• For a citation:
  • num^n: any sorted n-tuple of numbers, excluding the year
Citation hashes:
pawlak#1982#5#11#341
pawlak#1982#5#341#356
pawlak#1982#5#11#356
pawlak#1982#11#341#356
zdzislaw#...
rough#...
sets#...
Document hashes:
pawlak#1982#5#11#341
pawlak#1982#5#341#356
pawlak#1982#5#11#356
pawlak#1982#11#341#356
zdzislaw#...
+ approximate variant
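The generalisation from pairs to n-tuples can be sketched with `itertools.combinations`. The helper names are hypothetical, and which document fields supply the numbers is not stated on the slide, so the numbers are simply passed in as a list.

```python
from itertools import combinations


def num_tuples(numbers, year, n):
    """All sorted n-tuples of the distinct numbers, excluding the year."""
    others = sorted(set(numbers) - {year})
    return list(combinations(others, n))


def name_year_num_hashes(names, year, numbers, n):
    """Join name, year and each number tuple with '#'."""
    return [
        "#".join([name, str(year), *map(str, combo)])
        for name in names
        for combo in num_tuples(numbers, year, n)
    ]


# Numbers found in the example citation besides the year: 11, 5, 341, 356.
hashes_n3 = name_year_num_hashes(["pawlak"], 1982, [11, 5, 341, 356], 3)
hashes_n2 = name_year_num_hashes(["pawlak"], 1982, [11, 5, 341, 356], 2)
print(hashes_n3)
print(len(hashes_n2))
```

Larger n yields more specific hashes and therefore smaller buckets, at the cost of missing matches when one of the numbers is absent or misextracted.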
25. Test dataset
2 Jemal A, Bray F, Center MM, Ferlay J, Ward E, et al (2011) Global
cancer statistics. CA Cancer J Clin 61: 69–90
26. Test dataset
• Based on Open Access Subset of PMC
• Only citations preserving original formatting
• Only citations with PMID assigned
• 528k documents
• 3.6M citations, of which 321k are resolvable
27. Metrics
• Recall: the percentage of true citation → document links that are maintained by the heuristic
• Precision: the percentage of citation → document links returned by the algorithm that are correct
• Intermediate data: total number of hashes and pairs generated (before selecting the most popular ones)
• Candidate pairs: number of pairs returned by the heuristic for further assessment
• F-measure intentionally not included
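Recall and precision as defined above can be computed from two sets of (citation ID, document ID) pairs. This is a small sketch with made-up IDs; the function name is hypothetical and the values are returned as fractions rather than percentages.

```python
def recall_and_precision(candidate_pairs, true_links):
    """Both arguments are collections of (citation ID, document ID) pairs."""
    candidate_pairs = set(candidate_pairs)
    true_links = set(true_links)
    kept = candidate_pairs & true_links  # true links that survived the heuristic
    recall = len(kept) / len(true_links) if true_links else 0.0
    precision = len(kept) / len(candidate_pairs) if candidate_pairs else 0.0
    return recall, precision


true_links = {("c1", "d1"), ("c2", "d2")}
candidates = {("c1", "d1"), ("c1", "d3"), ("c2", "d4"), ("c3", "d9")}
print(recall_and_precision(candidates, true_links))
```

For a blocking step, recall bounds what the later fine-grained assessment can still recover, while low precision mainly costs extra work.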
28. Limits
• Candidate documents per citation
  • 30
  • no limit
• Bucket size
  • 10
  • 100
  • 1000
  • 10000
  • no limit
38. Future work
• Other combinations
• After fine-grained assessment
• Various hash functions at the same time
• Further efficiency tuning
• Limit number of generated hashes
39. CoAnSys Project
• An open source framework for mining very large collections of scientific publications
• Contains an implementation of the presented workflow
• http://coansys.ceon.pl/