Efficient blocking method for 
a large scale citation matching 
Mateusz Fedoryszak & Łukasz Bolikowski 
{matfed,bolo}@icm.edu.pl 
Interdisciplinary Centre for Mathematical and 
Computational Modelling 
University of Warsaw
Citation matching 
References 
[1] I. Newton, Philosophiae naturalis... 
[2] N. Copernicus, De revolutionibus... 
ID Title Author 
11 Στοιχεῖα Εὐκλείδης 
14 De revolutionibus... Copernicus 
• Note: it's an instance of the data linkage problem
Why important? 
• Clickable interfaces 
• Bibliometrics 
(think: H-index) 
• Further analysis 
(e.g. similarities) 
Why difficult? 
• Citation extraction errors 
(in both digital-born and 
retro-born docs) 
• Countless citation styles 
used inconsistently 
• Typos and other human 
errors
The Problem 
References 
ID Title Author
Naïve approach 
References 
ID Title Author 
For 1.3M documents and 12M citations it's 
15.6 × 10^12 comparisons
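The figure on this slide can be checked with a few lines of arithmetic. This is only a sanity check of the stated numbers; the variable names are illustrative.

```python
# Naive approach: compare every citation with every document.
documents = 1_300_000    # 1.3M documents (from the slide)
citations = 12_000_000   # 12M citations (from the slide)

comparisons = documents * citations
# Express the result in units of 10^12, as on the slide.
print(comparisons / 10**12)
```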
Select the best candidates 
References 
ID Title Author 
• I'll present a method of candidate selection and how to 
implement it using Apache Hadoop
Blocking 
References 
ID Title Author
Fingerprints 
[Figure: each reference and each document (ID, Title, Author) is labelled with one or more fingerprints, e.g. AAAA, BBBB, CCCC, EEEE, FFFF]
Workflow 
• Map: 
• citation → (hash, citation ID) 
• document → (hash, document ID) 
• Reduce: 
• records sharing a hash → (citation ID, document ID) pairs
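The Map and Reduce steps of this workflow can be imitated in plain Python to see how candidate pairs arise. This is a single-process sketch, not the Hadoop implementation: the function names, the toy `token_hashes` hash function and the sample data are all assumptions made for illustration, and the same hash function is used for citations and documents to keep it short.

```python
from collections import defaultdict
from itertools import product


def map_phase(citations, documents, hash_fn):
    """Emit (hash, (kind, id)) records for every citation and document."""
    for cit_id, text in citations.items():
        for h in hash_fn(text):
            yield h, ("citation", cit_id)
    for doc_id, text in documents.items():
        for h in hash_fn(text):
            yield h, ("document", doc_id)


def reduce_phase(records):
    """Group records by hash and pair citations with documents per bucket."""
    buckets = defaultdict(lambda: {"citation": set(), "document": set()})
    for h, (kind, entity_id) in records:
        buckets[h][kind].add(entity_id)
    pairs = set()
    for bucket in buckets.values():
        pairs.update(product(bucket["citation"], bucket["document"]))
    return pairs


def token_hashes(text):
    """Toy hash function (assumption): every lowercased token is a hash."""
    return set(text.lower().split())


citations = {"c1": "Copernicus De revolutionibus",
             "c2": "Newton Philosophiae naturalis"}
documents = {11: "Elements Euclid", 14: "De revolutionibus Copernicus"}

candidate_pairs = reduce_phase(map_phase(citations, documents, token_hashes))
print(sorted(candidate_pairs))
```

Only citation `c1` shares hashes with a document, so it is the only one that produces a candidate pair.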
Workflow with tuning 
• Before: 
• Compute bucket sizes 
• Reject too big ones 
• Use DistributedCache to disseminate them 
• After: 
• For each citation 
choose only the most 
popular candidates 
[Diagram: the Map/Reduce workflow from the previous slide, producing (citation ID, document ID) pairs]
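The two tuning steps (rejecting oversized buckets beforehand, keeping only the most popular candidates afterwards) might look like the sketch below. It rests on two assumptions not spelled out on the slide: bucket size is counted as the number of documents sharing a hash, and "most popular" means the documents that share the most hashes with the citation. `select_candidates` and its parameters are hypothetical names.

```python
from collections import Counter, defaultdict


def select_candidates(cit_hashes, doc_hashes, max_bucket_size, max_candidates):
    """Return, per citation, up to max_candidates documents ranked by shared hashes."""
    # Build buckets: hash -> set of document IDs.
    buckets = defaultdict(set)
    for doc_id, hashes in doc_hashes.items():
        for h in hashes:
            buckets[h].add(doc_id)

    # "Before" step: drop buckets that are too big to be discriminative.
    allowed = {h: docs for h, docs in buckets.items()
               if len(docs) <= max_bucket_size}

    # "After" step: count how often each document co-occurs with the citation
    # and keep only the most frequent ones.
    result = {}
    for cit_id, hashes in cit_hashes.items():
        votes = Counter()
        for h in hashes:
            for doc_id in allowed.get(h, ()):
                votes[doc_id] += 1
        result[cit_id] = [doc_id for doc_id, _ in votes.most_common(max_candidates)]
    return result


doc_hashes = {1: {"a", "b", "x"}, 2: {"a", "x"}, 3: {"x"}}
cit_hashes = {"c": {"a", "b", "x"}}

# Bucket "x" holds 3 documents and is rejected when the limit is 2.
top_one = select_candidates(cit_hashes, doc_hashes, max_bucket_size=2, max_candidates=1)
top_two = select_candidates(cit_hashes, doc_hashes, max_bucket_size=2, max_candidates=2)
print(top_one, top_two)
```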
Hash functions
Normalisation 
• Lowercase 
• Remove 
• diacritics 
• punctuation marks 
• Filter out tokens shorter than 3 
characters 
(except numbers)
Normalisation 
Pawlak, Zdzisław (1982). "Rough sets". 
Internat. J. Comput. Inform. Sci. 11 (5): 341–356. 
pawlak zdzislaw 1982 rough sets 
internat comput inform sci 11 5 341 356
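The normalisation rules can be sketched as a small function. The handling of letters such as "ł", which Unicode decomposition does not split into a base letter plus a mark, is an assumption (the `EXTRA` table is illustrative and incomplete), and restricting tokens to ASCII letters and digits is a simplification that would discard non-Latin scripts.

```python
import re
import unicodedata

# Letters with no Unicode decomposition need an explicit mapping
# (assumed, incomplete table; extend as needed).
EXTRA = str.maketrans({"ł": "l", "Ł": "L", "ø": "o", "Ø": "O", "đ": "d", "Đ": "D"})


def normalise(text):
    """Lowercase, strip diacritics and punctuation, drop short non-numeric tokens."""
    text = text.translate(EXTRA)
    # Decompose accented letters, then drop the combining marks.
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = text.lower()
    # Everything that is not an ASCII letter or digit acts as a separator.
    tokens = re.split(r"[^a-z0-9]+", text)
    # Keep numbers of any length, other tokens only if 3+ characters long.
    return [t for t in tokens if t.isdigit() or len(t) >= 3]


citation = 'Pawlak, Zdzisław (1982). "Rough sets". Internat. J. Comput. Inform. Sci. 11 (5): 341–356.'
normalised = " ".join(normalise(citation))
print(normalised)
```

On the example citation this reproduces the normalised string shown on the slide.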
Examples 
Pawlak, Zdzisław (1982). 
"Rough sets". 
Internat. J. Comput. Inform. Sci. 
11 (5): 341–356. 
{ 
author: "Zdzisław Pawlak", 
year: "1982", 
title: "Rough sets", 
journal: "International Journal 
of Computer & Information 
Sciences", 
volume: "11", 
issue: "5", 
pages: "341–356" 
}
Baseline 
• Citation: pawlak, zdzislaw, 1982, rough, ..., internat, ... 
• Document: zdzislaw, pawlak, 1982, rough, ..., international, journal, ...
Bigrams 
• For document we use only authors and title fields 
• Citation: "pawlak zdzislaw", "zdzislaw 1982", "1982 rough", "rough sets", ... 
• Document: "zdzislaw pawlak", "rough sets"
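A minimal sketch of the bigram hashes, working on already normalised tokens. It assumes that document bigrams are built within the author field and within the title field separately, which matches the example (there is no "pawlak rough" hash); the function names are hypothetical.

```python
def bigrams(tokens):
    """Join each pair of adjacent tokens into one hash."""
    return [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]


def citation_bigrams(tokens):
    """Citation side: bigrams over the whole normalised citation string."""
    return set(bigrams(tokens))


def document_bigrams(author_tokens, title_tokens):
    """Document side: only authors and title, each field handled on its own."""
    return set(bigrams(author_tokens)) | set(bigrams(title_tokens))


cit_tokens = "pawlak zdzislaw 1982 rough sets internat comput inform sci 11 5 341 356".split()
doc_hashes = document_bigrams(["zdzislaw", "pawlak"], ["rough", "sets"])
shared = citation_bigrams(cit_tokens) & doc_hashes
print(shared)
```

Here the citation and the document meet in the "rough sets" bucket, even though the author names appear in a different order.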
name-year 
• For citation: 
• name: any of first 4 distinct text tokens 
• year: any number between 1900 and 2050 
• Citation: pawlak#1982, zdzislaw#1982, rough#1982, sets#1982 
• Document: zdzislaw#1982, pawlak#1982 
• + approximate variant: zdzislaw#1981, pawlak#1981, zdzislaw#1983, pawlak#1983
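The name-year rule can be sketched as follows, again on normalised tokens. Reading the approximate variant as "also emit the neighbouring years on the document side" is an assumption based on the example hashes; the function names and the `approximate` flag are hypothetical.

```python
def citation_name_year(tokens, max_names=4, min_year=1900, max_year=2050):
    """Combine each of the first distinct text tokens with each plausible year."""
    names = []
    for token in tokens:
        if not token.isdigit() and token not in names:
            names.append(token)
    names = names[:max_names]
    years = [t for t in tokens if t.isdigit() and min_year <= int(t) <= max_year]
    return {f"{name}#{year}" for name in names for year in years}


def document_name_year(author_tokens, year, approximate=False):
    """Document side: author tokens and publication year (optionally year +/- 1)."""
    years = [year - 1, year, year + 1] if approximate else [year]
    return {f"{name}#{y}" for name in author_tokens for y in years}


cit_tokens = "pawlak zdzislaw 1982 rough sets internat comput inform sci 11 5 341 356".split()
cit_hashes = citation_name_year(cit_tokens)
strict = document_name_year(["zdzislaw", "pawlak"], 1982)
approx = document_name_year(["zdzislaw", "pawlak"], 1982, approximate=True)
print(sorted(cit_hashes), sorted(strict), sorted(approx))
```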
name-year-pages 
• For citation: 
• pages: any sorted pair of numbers, not year 
• Citation: pawlak#1982#5#11, pawlak#1982#5#341, pawlak#1982#..., pawlak#1982#341#356, zdzislaw#..., zdzislaw#1982#341#356, rough#..., sets#... 
• Document: zdzislaw#1982#341#356, pawlak#1982#341#356 
• + approximate & optimistic variant
Intermezzo: citation parsing 
Pawlak , Zdzisław ( 1982 ) . 
author other author other year other other 
... 
... 
Pawlak, Zdzisław (1982). "Rough sets". 
Internat. J. Comput. Inform. Sci. 11 (5): 341–356.
name-year-num^n 
• n = 1..3 
• For citation: 
• num^n: any sorted tuple of n numbers, not year 
• Citation: pawlak#1982#5#11#341, pawlak#1982#5#341#356, pawlak#1982#5#11#356, pawlak#1982#11#341#356, zdzislaw#..., rough#..., sets#... 
• Document: pawlak#1982#5#11#341, pawlak#1982#5#341#356, pawlak#1982#5#11#356, pawlak#1982#11#341#356, zdzislaw#... 
• + approximate variant
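Citation-side hashes that append a sorted tuple of n numbers to a name and a year can be generated with `itertools.combinations`, as in the sketch below. It assumes "not year" means the number already used as the year is left out of the tuples; the function name and parameters are hypothetical. With n = 2 the same routine yields pair hashes of the kind used by name-year-pages.

```python
from itertools import combinations


def name_year_num_hashes(tokens, n, max_names=4, min_year=1900, max_year=2050):
    """Build name#year#num1#...#numn hashes from normalised citation tokens."""
    # First distinct text tokens serve as candidate names.
    names = []
    for token in tokens:
        if not token.isdigit() and token not in names:
            names.append(token)
    names = names[:max_names]

    numbers = [int(t) for t in tokens if t.isdigit()]
    years = [x for x in numbers if min_year <= x <= max_year]

    hashes = set()
    for year in years:
        # Remaining numbers, sorted so every tuple comes out in ascending order.
        others = sorted({x for x in numbers if x != year})
        for combo in combinations(others, n):
            suffix = "#".join(str(x) for x in combo)
            for name in names:
                hashes.add(f"{name}#{year}#{suffix}")
    return hashes


cit_tokens = "pawlak zdzislaw 1982 rough sets internat comput inform sci 11 5 341 356".split()
triples = name_year_num_hashes(cit_tokens, n=3)
pairs = name_year_num_hashes(cit_tokens, n=2)
print(len(triples), len(pairs))
```

The number of hashes grows combinatorially with the count of numbers in a citation, which is one reason the intermediate data for the num^3 variants is large.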
Evaluation
Test dataset 
<ref id="pone.0052832-Jemal1"><label>2</label><mixed-citation publication-type="journal">
<name><surname>Jemal</surname><given-names>A</given-names></name>,
<name><surname>Bray</surname><given-names>F</given-names></name>,
<name><surname>Center</surname><given-names>MM</given-names></name>,
<name><surname>Ferlay</surname><given-names>J</given-names></name>,
<name><surname>Ward</surname><given-names>E</given-names></name>,
<etal>et al</etal>
(<year>2011</year>)
<article-title>Global cancer statistics</article-title>.
<source>CA Cancer J Clin</source>
<volume>61</volume>: <fpage>69</fpage>–<lpage>90</lpage>
<pub-id pub-id-type="pmid">21296855</pub-id>
</mixed-citation></ref>
Test dataset 
2 Jemal A, Bray F, Center MM, Ferlay J, Ward E, et al (2011) Global 
cancer statistics. CA Cancer J Clin 61: 69–90
Test dataset 
• Based on Open Access Subset of PMC 
• Only citations preserving original formatting 
• Only citations with PMID assigned 
• 528k documents 
• 3.6M citations, of which 321k are resolvable
Metrics 
• Recall — the percentage of true citation → document links 
that are maintained by the heuristic 
• Precision — the percentage of citation → document links 
returned by algorithm that are correct 
• Intermediate data — total number of hashes and pairs 
generated (before selecting the most popular ones) 
• Candidate pairs — number of pairs returned by heuristic for 
further assessment 
• F-measure not included intentionally
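Precision and recall as defined here can be computed directly from sets of (citation ID, document ID) pairs. The sketch below uses made-up IDs; `precision_recall` is an illustrative helper, not part of the presented workflow.

```python
def precision_recall(candidate_pairs, true_links):
    """Precision and recall of candidate pairs against ground-truth links."""
    candidate_pairs = set(candidate_pairs)
    true_links = set(true_links)
    correct = candidate_pairs & true_links
    # Precision: share of returned pairs that are correct.
    precision = len(correct) / len(candidate_pairs) if candidate_pairs else 0.0
    # Recall: share of true links that survive candidate selection.
    recall = len(correct) / len(true_links) if true_links else 0.0
    return precision, recall


candidates = {("c1", 14), ("c1", 11), ("c2", 11)}
truth = {("c1", 14), ("c2", 7)}
precision, recall = precision_recall(candidates, truth)
print(precision, recall)
```

Because the heuristic only selects candidates for a later, finer assessment, low precision is tolerable as long as recall stays high and the number of pairs stays manageable.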
Limits 
• Candidate documents per citation 
• 30 
• no limit 
• Bucket size 
• 10 
• 100 
• 1000 
• 10000 
• no limit
Recall 
hash precision recall intermediate data to assess 
bigrams (10000, 30) 0.4% 98.2% 285,908,900 79,329,459 
baseline (10000, 30) 0.3% 97.9% 221,212,080 114,223,777 
bigrams (100, 30) 2.9% 92.7% 94,693,721 10,446,883 
name-year (approx.) 0.0% 92.4% 928,068,651 862,357,212 
name-year (strict) 0.1% 90.2% 322,015,088 290,940,929 
baseline (10000, 10) 0.9% 88.7% 221,212,080 49,747,843 
name-year-num (approx., 1000, 30) 1.2% 88.5% 170,633,938 23,591,933 
name-year-num (strict., 1000, 30) 3.6% 88.3% 85,756,601 7,864,129 
name-year (strict, 1000, 30) 2.5% 77.9% 28,463,067 9,940,403 
name-year (approx., 1000, 30) 1.4% 75.6% 40,726,102 17,098,080 
baseline (1000, 30) 0.9% 73.2% 115,822,141 26,083,677
Precision 
hash precision recall intermediate data to assess 
name-year-pages (strict, optimistic) 98.4% 7.3% 4,787,215 23,734 
name-year-num^3 (strict) 84.0% 43.4% 257,639,965 166,128 
name-year-pages (approx., optimistic) 78.2% 7.8% 42,478,742 32,182 
name-year-pages (strict, pessimistic) 53.7% 42.5% 132,809,210 254,208 
name-year-num^3 (approx.) 17.6% 47.1% 617,193,035 860,314 
name-year-num^2 (strict) 14.8% 66.6% 141,885,270 1,444,074 
bigrams (10, 10) 11.8% 65.6% 84,042,160 1,784,228
Recall/intermediate data 
hash precision recall intermediate data to assess 
name-year (strict, 1000, 30) 2.5% 77.9% 28,463,067 9,940,403 
name-year (approx., 1000, 30) 1.4% 75.6% 40,726,102 17,098,080 
name-year-pages (strict, optimistic, 1000, 30) 98.4% 7.3% 4,787,215 23,734 
name-year-num (strict., 1000, 30) 3.6% 88.3% 85,756,601 7,864,129 
bigrams (100, 30) 2.9% 92.7% 94,693,721 10,446,883 
bigrams (10, 30) 11.8% 65.6% 84,042,160 1,793,997 
baseline (1000, 30) 0.9% 73.2% 115,822,141 26,083,677 
name-year-num (approx., 1000, 30) 1.2% 88.5% 170,633,938 23,591,933 
baseline (100, 30) 3.2% 44.0% 91,175,101 4,458,560 
name-year-num^2 (strict., 1000, 30) 18.4% 66.6% 141,553,137 1,165,181 
baseline (10000, 30) 0.3% 97.9% 221,212,080 114,223,777
Recall vs. intermediate data
Recall/to assess 
hash precision recall intermediate data to assess 
name-year-pages (strict, optimistic, 1000, 30) 98.4% 7.3% 4,787,215 23,734 
name-year-num^3 (strict., 1000, 30) 84.0% 43.4% 257,637,645 165,995 
name-year-pages (approx., optimistic, 1000, 30) 78.5% 7.8% 42,478,742 32,042 
name-year-pages (strict, pessimistic, 1000, 30) 56.3% 42.5% 132,792,590 242,261 
name-year-num^3 (approx., 1000, 30) 19.1% 47.1% 617,046,925 794,284 
name-year-num^2 (strict., 1000, 30) 18.4% 66.6% 141,553,137 1,165,181 
bigrams (10, 30) 11.8% 65.6% 84,042,160 1,793,997 
name-year-pages (approx., pessimistic, 1000, 30) 9.9% 45.8% 172,447,469 1,483,980 
name-year-num (strict., 1000, 30) 3.6% 88.3% 85,756,601 7,864,129 
name-year-num^2 (approx., 1000, 30) 3.2% 69.8% 359,051,798 7,023,337 
baseline (100, 30) 3.2% 44.0% 91,175,101 4,458,560 
bigrams (100, 30) 2.9% 92.7% 94,693,721 10,446,883
Recall vs. to assess
Combination
Lost citations 
Hash Lost fraction 
name-year (approx., 1000, 30) 12.4% 
name-year-num^2 (approx., 1000, 30) 12.3% 
name-year (strict, 1000, 30) 9.8% 
name-year-pages (approx., pessimistic, 1000, 30) 9.0% 
baseline (10000, 10) 6.7% 
name-year-num (approx., 1000, 30) 6.0% 
name-year (strict) 5.8% 
name-year-num^2 (strict., 1000, 30) 5.6% 
name-year (approx.) 5.1% 
name-year-num (strict., 1000, 30) 4.4% 
name-year-num^3 (approx., 1000, 30) 4.2% 
baseline (1000, 30) 3.7%
Results 
Hash sequence Recall Intermediate data To assess 
bigrams (10000, 30) 98.17% 285,908,900 79,329,459 
name-year-pages (strict, optimistic); name-year (strict, 1000, 30); name-year (strict, 10000, 30); bigrams (10000, 30) 87.64% 187,394,452 41,152,278 
name-year-pages (strict, optimistic); name-year-pages (strict, pessimistic); bigrams (100, 30); bigrams (10000, 30) 96.86% 333,701,109 29,818,635 
name-year-pages (strict, optimistic); bigrams (100, 30); bigrams (10000, 30) 97.76% 202,590,413 30,582,488 
name-year-pages (strict, optimistic); name-year-num^3 (strict); bigrams (10, 10); bigrams (100, 30); bigrams (10000, 30) 97.73% 398,895,930 25,123,164
Future work 
• Other combinations 
• After fine-grained assessment 
• Various hash functions at the same time 
• Further efficiency tuning 
• Limit number of generated hashes
CoAnSys Project 
• An open source framework for mining very large 
collections of scientific publications 
• Contains implementation of the presented 
workflow 
• http://coansys.ceon.pl/
Thank you! Questions? 
Mateusz Fedoryszak 
matfed@icm.edu.pl 
http://coansys.ceon.pl/ 
http://adalab.icm.edu.pl/

Efficient blocking method for a large scale citation matching

  • 1. Efficient blocking method for a large scale citation matching Mateusz Fedoryszak & Łukasz Bolikowski {matfed,bolo}@icm.edu.pl Interdisciplinary Centre for Mathematical and Computational Modelling University of Warsaw
  • 2. Citation matching References [1] I. Newton, Philosophiae naturalis... [2] N. Copernicus, De revolutionibus... ID Title Author 11 Στοιχεῖα Εὐκλείδης 14 De revolutionibus... Copernicus • Note: it's an instance of data linkage problem
  • 3. Why important? • Clickable interfaces • Bibliometrics (think: H-index) • Further analysis (e.g. similarities) Why difficult? • Citation extraction errors (in both digital-born and retro-born docs) • Countless citation styles used inconsistently • Typos and other human errors
  • 4. The Problem References ID Title Author
  • 5. Naïve approach References ID Title Author For 1.3M documents and 12M citations it's 15.6 × 1012 comparisons
  • 6. Select the best candidates References ID Title Author • I'll present a method of candidate selection and how to implement it using Apache Hadoop
  • 7. Blocking References ID Title Author
  • 8. Fingerprints References CCCC ID Title Author AAAA BBBB CCCC AAAA EEEE AAAA FFFF
  • 9. Workflow citation document hash citation ID hash document ID hash document ID citation ID citation ID hash document ID document ID hash citation ID citation ID document ID citation ID document ID citation ID document ID Reduce Map
  • 10. Workflow with tuning • Before: • Compute bucket sizes • Reject too big ones • Use DistributedCache disseminate • After: • For each citation choose only the most popular candidates citation ID document ID Map Reduce citation document hash citation ID hash document ID hash document ID citation ID citation ID hash document ID document ID hash citation ID citation ID document ID citation ID document ID
  • 12. Normalisation • Lowercase • Remove • diacritics • punctuation marks • Filter out tokens shorter than 3 characters (except numbers)
  • 13. Normalisation Pawlak, Zdzisław (1982). "Rough sets". Internat. J. Comput. Inform. Sci. 11 (5): 341–356. pawlak zdzislaw 1982 rough sets internat comput inform sci 11 5 341 356
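The normalisation rules from the previous two slides can be sketched as below. This is an illustrative version, not the CoAnSys code: the function name `normalise` and the `EXTRA_FOLDS` table are assumptions. The table is needed because some letters, such as "ł", have no Unicode decomposition and are not handled by stripping combining marks; it is deliberately incomplete.

```python
import unicodedata

# Letters without a Unicode decomposition (hypothetical, incomplete list).
EXTRA_FOLDS = str.maketrans({"ł": "l", "Ł": "L", "ø": "o", "Ø": "O", "đ": "d", "Đ": "D"})


def normalise(text):
    """Lowercase, strip diacritics and punctuation, drop short non-numeric tokens."""
    text = text.translate(EXTRA_FOLDS).lower()
    chars = []
    for ch in unicodedata.normalize("NFKD", text):
        if unicodedata.combining(ch):
            continue  # drop the diacritic mark, keep the base letter
        chars.append(ch if ch.isalnum() else " ")  # punctuation becomes a separator
    tokens = "".join(chars).split()
    # Keep numbers of any length, and text tokens of at least 3 characters.
    return [t for t in tokens if t.isdigit() or len(t) >= 3]


citation = 'Pawlak, Zdzisław (1982). "Rough sets". Internat. J. Comput. Inform. Sci. 11 (5): 341–356.'
tokens = normalise(citation)
```

Joining `tokens` with spaces reproduces the normalised string shown on the slide.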
  • 14. Examples Pawlak, Zdzisław (1982). "Rough sets". Internat. J. Comput. Inform. Sci. 11 (5): 341–356. { author: "Zdzisław Pawlak", year: "1982", title: "Rough sets", journal: "International Journal of Computer & Information Sciences", volume: "11", issue: "5", pages: "341–356" }
  • 15. Baseline pawlak zdzislaw 1982 rough ... internat ... zdzislaw pawlak 1982 rough ... international journal ...
  • 16. Bigrams • For document we use only authors and title fields pawlak zdzislaw zdzislaw 1982 1982 rough rough sets ... zdzislaw pawlak rough sets
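A bigram hash is simply a pair of adjacent normalised tokens. A minimal sketch, assuming the input is already a list of normalised tokens and that the two tokens are joined with a space (the function name `bigram_hashes` is hypothetical):

```python
def bigram_hashes(tokens):
    """Return one hash per pair of adjacent tokens, in order."""
    return [" ".join(pair) for pair in zip(tokens, tokens[1:])]


citation_tokens = ["pawlak", "zdzislaw", "1982", "rough", "sets"]
document_tokens = ["zdzislaw", "pawlak", "rough", "sets"]  # authors + title only
shared = set(bigram_hashes(citation_tokens)) & set(bigram_hashes(document_tokens))
```

In this example the citation and the document end up in the same bucket through the single shared bigram "rough sets".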
  • 17. name-year • For citation: • name: any of first 4 distinct text tokens • year: any number between 1900 and 2050 pawlak#1982 zdzislaw#1982 rough#1982 sets#1982 zdzislaw#1982 pawlak#1982 +approximate variant zdzislaw#1981 pawlak#1981 zdzislaw#1983 pawlak#1983
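The citation-side rule can be sketched as follows, assuming normalised tokens as input. The function name and parameters are hypothetical. The slide does not say which side receives the extra hashes in the approximate variant, so the `approximate` flag here simply expands whatever years are found by ±1.

```python
def name_year_hashes(tokens, max_names=4, approximate=False):
    """Combine each of the first distinct text tokens with each plausible year."""
    names = []
    for token in tokens:
        if len(names) == max_names:
            break
        if not token.isdecimal() and token not in names:
            names.append(token)
    # A year is any number between 1900 and 2050.
    years = {int(t) for t in tokens if t.isdecimal() and 1900 <= int(t) <= 2050}
    if approximate:
        years = years | {y + d for y in years for d in (-1, 1)}
    return {f"{name}#{year}" for name in names for year in years}


tokens = "pawlak zdzislaw 1982 rough sets internat comput inform sci 11 5 341 356".split()
strict = name_year_hashes(tokens)
approx = name_year_hashes(["zdzislaw", "pawlak", "1982"], approximate=True)
```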
  • 18. name-year-pages • For citation: • pages: any sorted pair of numbers, not year pawlak#1982#5#11 pawlak#1982#5#341 pawlak#1982#... pawlak#1982#341#356 zdzislaw#... zdzislaw#1982#341#356 rough#... sets#... zdzislaw#1982#341#356 pawlak#1982#341#356 +approximate & optimistic variant
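The strict citation-side hash can be sketched like this, again assuming normalised tokens. The function name is hypothetical, and the approximate, optimistic and pessimistic variants are not covered because the slides do not define them.

```python
from itertools import combinations


def name_year_pages_hashes(tokens, max_names=4):
    """name#year#lo#hi for every sorted pair of non-year numbers."""
    names = []
    for token in tokens:
        if len(names) == max_names:
            break
        if not token.isdecimal() and token not in names:
            names.append(token)
    numbers = {int(t) for t in tokens if t.isdecimal()}
    hashes = set()
    for year in sorted(n for n in numbers if 1900 <= n <= 2050):
        # "Pages" are any two other numbers, written in ascending order.
        for lo, hi in combinations(sorted(numbers - {year}), 2):
            for name in names:
                hashes.add(f"{name}#{year}#{lo}#{hi}")
    return hashes


tokens = "pawlak zdzislaw 1982 rough sets internat comput inform sci 11 5 341 356".split()
hashes = name_year_pages_hashes(tokens)
```

For the example citation this gives 4 names × 6 number pairs = 24 hashes, which illustrates why the amount of intermediate data grows quickly for this family of hashes.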
  • 19. Intermezzo: citation parsing Pawlak , Zdzisław ( 1982 ) . author other author other year other other ... ... Pawlak, Zdzisław (1982). "Rough sets". Internat. J. Comput. Inform. Sci. 11 (5): 341–356.
  • 20. name-year-num^n • n = 1..3 • For citation: • num^n: any sorted tuple of n numbers, not year pawlak#1982#5#11#341 pawlak#1982#5#341#356 pawlak#1982#5#11#356 pawlak#1982#11#341#356 zdzislaw#... rough#... sets#... pawlak#1982#5#11#341 pawlak#1982#5#341#356 pawlak#1982#5#11#356 pawlak#1982#11#341#356 zdzislaw#... +approximate variant
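Generalising from pairs to tuples of n numbers is a small change. The sketch below takes the names, the year and the remaining numbers as already extracted inputs; the function name and signature are assumptions made for illustration.

```python
from itertools import combinations


def name_year_num_hashes(names, year, numbers, n):
    """name#year#k1#...#kn for every ascending n-tuple of non-year numbers."""
    tuples = combinations(sorted(set(numbers) - {year}), n)
    return {
        "#".join([name, str(year), *map(str, tup)])
        for tup in tuples
        for name in names
    }


num3 = name_year_num_hashes(["pawlak"], 1982, [11, 5, 341, 356], n=3)
num1 = name_year_num_hashes(["pawlak"], 1982, [11, 5, 341, 356], n=1)
```

Larger n makes a hash more specific: a match requires more numbers to agree, so fewer candidate pairs are produced.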
  • 22. Test dataset <ref id="pone.0052832-Jemal1"><label>2</label><mixed-citation publication-type="journal"> <name><surname>Jemal</surname><given-names>A</given-names></name>, <name><surname>Bray</surname><given-names>F</given-names></name>, <name><surname>Center</surname><given-names>MM</given-names></name>, <name><surname>Ferlay</surname><given-names>J</given-names></name>, <name><surname>Ward</surname><given-names>E</given-names></name>, <etal>et al</etal> (<year>2011</year>) <article-title>Global cancer statistics</article-title>. <source>CA Cancer J Clin</source> <volume>61</volume>: <fpage>69</fpage>–<lpage>90</lpage> <pub-id pub-id-type="pmid">21296855</pub-id> </mixed-citation></ref>
  • 25. Test dataset 2 Jemal A, Bray F, Center MM, Ferlay J, Ward E, et al (2011) Global cancer statistics. CA Cancer J Clin 61: 69–90
  • 26. Test dataset • Based on Open Access Subset of PMC • Only citations preserving original formatting • Only citations with PMID assigned • 528k documents • 3.6M citations, of which 321k are resolvable
  • 27. Metrics • Recall: the percentage of true citation → document links that are maintained by the heuristic • Precision: the percentage of citation → document links returned by the algorithm that are correct • Intermediate data: total number of hashes and pairs generated (before selecting the most popular ones) • Candidate pairs: number of pairs returned by the heuristic for further assessment • F-measure not included intentionally
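Precision and recall as defined on this slide can be computed from two sets of (citation ID, document ID) pairs. A minimal sketch, with a hypothetical function name and made-up toy data:

```python
def precision_recall(returned, truth):
    """returned: pairs produced by the heuristic; truth: the correct links."""
    returned, truth = set(returned), set(truth)
    correct = returned & truth
    precision = len(correct) / len(returned) if returned else 0.0
    recall = len(correct) / len(truth) if truth else 0.0
    return precision, recall


returned = {("c1", 14), ("c1", 11), ("c2", 14)}
truth = {("c1", 14), ("c2", 7)}
precision, recall = precision_recall(returned, truth)
```

Here one of three returned pairs is correct and one of two true links is kept, so precision is 1/3 and recall is 1/2. Low precision is acceptable for a blocking step as long as recall stays high, because the candidate pairs are assessed in detail afterwards.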
  • 28. Limits • Candidate documents per citation • 30 • no limit • Bucket size • 10 • 100 • 1000 • 10000 • no limit
  • 29. Recall
      hash | precision | recall | intermediate data | to assess
      bigrams (10000, 30) | 0.4% | 98.2% | 285,908,900 | 79,329,459
      baseline (10000, 30) | 0.3% | 97.9% | 221,212,080 | 114,223,777
      bigrams (100, 30) | 2.9% | 92.7% | 94,693,721 | 10,446,883
      name-year (approx.) | 0.0% | 92.4% | 928,068,651 | 862,357,212
      name-year (strict) | 0.1% | 90.2% | 322,015,088 | 290,940,929
      baseline (10000, 10) | 0.9% | 88.7% | 221,212,080 | 49,747,843
      name-year-num (approx., 1000, 30) | 1.2% | 88.5% | 170,633,938 | 23,591,933
      name-year-num (strict, 1000, 30) | 3.6% | 88.3% | 85,756,601 | 7,864,129
      name-year (strict, 1000, 30) | 2.5% | 77.9% | 28,463,067 | 9,940,403
      name-year (approx., 1000, 30) | 1.4% | 75.6% | 40,726,102 | 17,098,080
      baseline (1000, 30) | 0.9% | 73.2% | 115,822,141 | 26,083,677
  • 30. Precision
      hash | precision | recall | intermediate data | to assess
      name-year-pages (strict, optimistic) | 98.4% | 7.3% | 4,787,215 | 23,734
      name-year-num^3 (strict) | 84.0% | 43.4% | 257,639,965 | 166,128
      name-year-pages (approx., optimistic) | 78.2% | 7.8% | 42,478,742 | 32,182
      name-year-pages (strict, pessimistic) | 53.7% | 42.5% | 132,809,210 | 254,208
      name-year-num^3 (approx.) | 17.6% | 47.1% | 617,193,035 | 860,314
      name-year-num^2 (strict) | 14.8% | 66.6% | 141,885,270 | 1,444,074
      bigrams (10, 10) | 11.8% | 65.6% | 84,042,160 | 1,784,228
  • 31. Recall/intermediate data
      hash | precision | recall | intermediate data | to assess
      name-year (strict, 1000, 30) | 2.5% | 77.9% | 28,463,067 | 9,940,403
      name-year (approx., 1000, 30) | 1.4% | 75.6% | 40,726,102 | 17,098,080
      name-year-pages (strict, optimistic, 1000, 30) | 98.4% | 7.3% | 4,787,215 | 23,734
      name-year-num (strict, 1000, 30) | 3.6% | 88.3% | 85,756,601 | 7,864,129
      bigrams (100, 30) | 2.9% | 92.7% | 94,693,721 | 10,446,883
      bigrams (10, 30) | 11.8% | 65.6% | 84,042,160 | 1,793,997
      baseline (1000, 30) | 0.9% | 73.2% | 115,822,141 | 26,083,677
      name-year-num (approx., 1000, 30) | 1.2% | 88.5% | 170,633,938 | 23,591,933
      baseline (100, 30) | 3.2% | 44.0% | 91,175,101 | 4,458,560
      name-year-num^2 (strict, 1000, 30) | 18.4% | 66.6% | 141,553,137 | 1,165,181
      baseline (10000, 30) | 0.3% | 97.9% | 221,212,080 | 114,223,777
  • 33. Recall/to assess
      hash | precision | recall | intermediate data | to assess
      name-year-pages (strict, optimistic, 1000, 30) | 98.4% | 7.3% | 4,787,215 | 23,734
      name-year-num^3 (strict, 1000, 30) | 84.0% | 43.4% | 257,637,645 | 165,995
      name-year-pages (approx., optimistic, 1000, 30) | 78.5% | 7.8% | 42,478,742 | 32,042
      name-year-pages (strict, pessimistic, 1000, 30) | 56.3% | 42.5% | 132,792,590 | 242,261
      name-year-num^3 (approx., 1000, 30) | 19.1% | 47.1% | 617,046,925 | 794,284
      name-year-num^2 (strict, 1000, 30) | 18.4% | 66.6% | 141,553,137 | 1,165,181
      bigrams (10, 30) | 11.8% | 65.6% | 84,042,160 | 1,793,997
      name-year-pages (approx., pessimistic, 1000, 30) | 9.9% | 45.8% | 172,447,469 | 1,483,980
      name-year-num (strict, 1000, 30) | 3.6% | 88.3% | 85,756,601 | 7,864,129
      name-year-num^2 (approx., 1000, 30) | 3.2% | 69.8% | 359,051,798 | 7,023,337
      baseline (100, 30) | 3.2% | 44.0% | 91,175,101 | 4,458,560
      bigrams (100, 30) | 2.9% | 92.7% | 94,693,721 | 10,446,883
  • 34. Recall vs. to assess
  • 36. Lost citations
      hash | lost fraction
      name-year (approx., 1000, 30) | 12.4%
      name-year-num^2 (approx., 1000, 30) | 12.3%
      name-year (strict, 1000, 30) | 9.8%
      name-year-pages (approx., pessimistic, 1000, 30) | 9.0%
      baseline (10000, 10) | 6.7%
      name-year-num (approx., 1000, 30) | 6.0%
      name-year (strict) | 5.8%
      name-year-num^2 (strict, 1000, 30) | 5.6%
      name-year (approx.) | 5.1%
      name-year-num (strict, 1000, 30) | 4.4%
      name-year-num^3 (approx., 1000, 30) | 4.2%
      baseline (1000, 30) | 3.7%
  • 37. Results Hash sequence Recall Intermediate data To assess bigrams (10000, 30) 98.17% 285,908,900 79,329,459 name-year-pages (strict, optimistic) 87.64% 187,394,452 41,152,278 name-year (strict, 1000, 30) name-year (strict, 10000, 30) bigrams (10000, 30) name-year-pages (strict, optimistic) name-year-pages (strict, pessimistic) bigrams (100, 30) bigrams (10000, 30) 96.86% 333,701,109 29,818,635 name-year-pages (strict, optimistic) bigrams (100, 30) bigrams (10000, 30) 97.76% 202,590,413 30,582,488 name-year-pages (strict, optimistic) name-year-num^3 (strict) bigrams (10, 10) bigrams (100, 30) bigrams (10000, 30) 97.73% 398,895,930 25,123,164
  • 38. Future work • Other combinations • After fine-grained assessment • Various hash functions at the same time • Further efficiency tuning • Limit number of generated hashes
  • 39. CoAnSys Project • An open source framework for mining very large collections of scientific publications • Contains implementation of the presented workflow • http://coansys.ceon.pl/
  • 40. Thank you! Questions? Mateusz Fedoryszak matfed@icm.edu.pl http://coansys.ceon.pl/ http://adalab.icm.edu.pl/