SlideShare a Scribd company logo
1 of 46
Effective Unsupervised
Author Disambiguation with
Relative Frequencies
Tobias Backes JCDL‘18
Overview
2
Introduction Problem
Setting
Method Agglomerative
Clustering
Probabilistic
Similarity
Feature-types
Parameters
and Variants
Experiments Dataset Evaluation Setup
Results Results Discussion
Conclusion Conclusion
Introduction: Problem Setting
3
John W. Smith, Thomas
Müller
Document I
J. Smith, Zack Chen
Document II
J. Smith
Document III
author mention
same name
same author?
Introduction: Problem Setting
4
The collection is partitioned by
different author names.
John W. Smith, Thomas
Müller
Part of the collection is
annotated with author
identifiers.
E-6522-2011
Introduction: Problem Setting
5
Partition names into blocks.
Smith, J. W., Müller, T.
Gold authors are also partitioned.
Hidden source of error!
Introduction: Problem Setting
6
collection
blocks
each block
system authors
single mentions
Method: Agglomerative Clustering
7
Take a block.Each mention one author cluster.Merge the most similar author clustersTerminate if
stopping critereon
reached.
Otherwise the
entire block
becomes a single
author
Which clusters
are the most
similar?
Method: Probabilistic Similarity
8
Probability of author mention 𝑥 ∈ 𝑋 given mention
𝑥‘:
𝑝 𝑥 𝑥′
=
𝑓
#(𝑓, 𝑥) ∙ #(𝑓, 𝑥′)
#(𝑓) ∙ #(𝑥′)
Probability of author cluster 𝐶 given cluster 𝐶‘:
𝑝 𝐶 𝐶′
=
𝑥,𝑥′ ∈𝐶×𝐶′
𝑝 𝑥 𝑥′
⋅
#(𝑥′)
#(𝐶′)
(Use add-𝜖 smoothing to prevent zero-probabilities)
Method: Feature-types
9
 Terms
 Affiliations
 Categories
 Keywords
 Co-author
names
 Emails
 Years
15
%
20
%
18
%
3%
20
%
10
%
𝑠𝑐𝑜𝑟𝑒 𝐶, 𝐶′
=
𝑓𝑡𝑦𝑝𝑒
𝜆 𝑓𝑡𝑦𝑝𝑒 ⋅ 𝑝 𝑓𝑡𝑦𝑝𝑒(𝐶|𝐶′)
traine
d
𝝀 𝒇𝒕𝒚𝒑𝒆
on
separate
training
data
Method: Parameters and Variants
10
1. Variants
 𝑝 𝐶 𝐶‘ =
 (𝑥,𝑥′) … (prob-variant)
 max
(𝑥,𝑥′)
… (max-variant)
2. Parameters
 𝜆: Feature-type weights
 𝛼 or 𝛽: Stopping parameters
Method: Parameters and Variants
11
Stopping limit:
𝑙 = 𝛼 + 𝑋 ⋅ 𝛽
where 𝑋 is block-size in number of mentions.
 𝛼 = 0 for prob-variant
 𝛽 = 0 for max-variant
There is room for improvement.
Experiments: Dataset
12
 Web of Science (until first half 2014)
 Over 150 million author mentions
 Blocking: Lastname, all initials, i.e. J. W.
Smith
 For each problem size ∈ [1. . 10]:
Draw up to 1000 blocks.
 Size 1-4: > 1000 blocks
 Size 5: 860 blocks
 Size 10: 141 blocks
 Sparsity for large block sizes
Experiments: Evaluation
13
• PairF1 and bCube
• Count the number of pairs
• Get precision, recall and F1
• True-pos.:
Should be and are together
• True-neg.:
Should not be together and aren‘t
• False-pos.:
Should not be together but are
• False-neg.:
Should be together but aren‘t
Experiments: Setup
14
collection
Blocks by
number of
annotated
authors
For each block size:
For each clustering iteration:
1. Current Precision
2. Current Recall
3. Current number of clusters
4. Converged or not?
Results
15
What is the best
we can get?
Results
16
• all ftypes
• trained weights
• prob-variant
Results
17
How does the max-variant
compare to that?
Results
18
• all ftypes
• trained weights
• prob-variant
Results
19
• all ftypes
• trained weights
• max-variant
Results
20
early gains
• size 1
• prob-variant
Results
21
no
early
gains
• size 1
• max-
variant
Results
22
How well works the
stopping criterion?
Results
23
peaks before
correct size
stops before
correct size
• size 5
• prob-variant
Results
24
peaks at
correct size
stops at
correct size
• size 5
• max-variant
Results
25
peaks before
correct size
stops after
correct size
• size 10
• prob-variant
Results
26
peaks at
correct size
stops after
correct size
• size 10
• max-variant
Results
27
How do uniform
weights compare?
Results
28
• all ftypes
• trained weights
• prob-variant
Results
29
• all ftypes
• uniform weights
• prob-variant
practically
the same
Results
30
How about using
only terms?
Results
31
• all ftypes
• uniform weights
• prob-variant
Results
32
• only terms
• uniform weights
• prob-variant
recall drops
considerably
Results
33
How important are
co- and ref-authors?
Results
34
• all ftypes
• uniform weights
• prob-variant
Results
35
• only co-authors
• uniform weights
• prob-variant
recall drops
considerably
Results
36
• all ftypes
• uniform weights
• prob-variant
Results
37
• co+ref-authors
• uniform weights
• prob-variant
only little loss
Results
38
• all ftypes
• uniform weights
• prob-variant
Results
39
• no co-authors
• uniform weights
• prob-variant
recall drops
considerabl
y
Results: Discussion
40
Not possible to compare results directly to other groups.
1. Data
 Web of Science, until half 2014
 Available features in this version
 Available researcherID annotation in this version
2. Blocking
 Surname, all initials
3. Evaluation
 bCube, pairF1 (including pairs (𝑥, 𝑥))
 Recall not measured across blocks
 Macro average
 Each problem size separately
Any variation could change results considerably
Results: Discussion
41
 Suggestion:
Evaluate each problem size separately!
 Say, we have to distinguish i.e. 8 authors,
then
this defines task difficulty
Conclusion
42
 Method:
 Agglomerative Clustering + Probabilistic similarity
 Almost parameter free
 Features:
 Co-authors very important
 Add any ftype you like
 Evaluation:
 Each problem size separately
 Visualize clustering process
a. balance Precision vs. Recall
b. tune stopping criterion
References
43
[1] Aron Culotta, Pallika Kanani, Robert Hall, Michael Wick, and Andrew
McCallum.
2007. Author disambiguation using error-driven machine learning with a
ranking
loss function. In Sixth International Workshop on Information Integration
on the
Web (IIWeb-07), Vancouver, Canada.
[2] Anderson A Ferreira, Marcos André Gonçalves, and Alberto HF
Laender. 2012. A
brief survey of automatic methods for author name disambiguation. Acm
Sigmod
Record 41, 2 (2012), 15–26.
[3] Anderson A Ferreira, Adriano Veloso, Marcos André Gonçalves, and
Alberto HF
Laender. 2010. Effective self-training author name disambiguation in
scholarly
digital libraries. In Proceedings of the 10th annual joint conference on
Digital
libraries. ACM, 39–48.
[4] Thomas Gurney, Edwin Horlings, and Peter Van Den Besselaar. 2012.
Author
disambiguation using multi-aspect similarity indicators. Scientometrics
91, 2
(2012), 435–449.
[5] Hui Han, Wei Xu, Hongyuan Zha, and C Lee Giles. 2005. A hierarchical
naive
Bayes mixture model for name disambiguation in author citations. In
Proceedings
of the 2005 ACM symposium on Applied computing. ACM, 1065–1069.
[6] Anne-Wil Harzing. 2015. Health warning: might contain multiple
personalities
- the problem of homonyms in Thomson Reuters Essential Science
Indicators.
Scientometrics 105, 3 (2015), 2259–2270.
[7] T. Kramer, F. Momeni, and P. Mayr. 2017. Coverage of Author
Identifiers in Web
of Science and Scopus. ArXiv e-prints (March 2017).
arXiv:cs.DL/1703.01319
[8] Michael Levin, Stefan Krawczyk, Steven Bethard, and Dan Jurafsky.
2012. Citation-based bootstrapping for large-scale author
disambiguation. Journal of the American Society for Information Science
and Technology 63, 5 (2012), 1030–1047.
[9] Staša Milojević. 2013. Accuracy of simple, initials-based methods for
author
name disambiguation. Journal of Informetrics 7, 4 (2013), 767–773.
[10] Alan Filipe Santana, Marcos André Gonçalves, Alberto HF Laender, and
Ander-
son A Ferreira. 2017. Incremental author name disambiguation by exploiting
domain-specific heuristics. Journal of the Association for Information
Science and
Technology 68, 4 (2017), 931–945.
[11] Neil R Smalheiser and Vetle I Torvik. 2009. Author name
disambiguation. Annual review of information science and technology 43, 1
(2009), 1–43.
[12] Yang Song, Jian Huang, Isaac G Councill, Jia Li, and C Lee Giles. 2007.
Effi-
cient topic-based unsupervised name disambiguation. In Proceedings of the
7th
ACM/IEEE-CS joint conference on Digital libraries. ACM, 342–351.
[13] Andreas Strotmann and Dangzhi Zhao. 2012. Author name
disambiguation: What difference does it make in author-based citation
analysis? Journal of the American Society for Information Science and
Technology 63, 9 (2012), 1820–1833.
[14] Jie Tang, Alvis CM Fong, Bo Wang, and Jing Zhang. 2012. A unified
probabilistic framework for name disambiguation in digital library. IEEE
Transactions on Knowledge and Data Engineering 24, 6 (2012), 975–987.
[15] Vetle I Torvik and Neil R Smalheiser. 2009. Author name disambiguation
in
MEDLINE. ACM Transactions on Knowledge Discovery from Data (TKDD) 3,
3 (2009), 11.
[16] Vetle I Torvik, Marc Weeber, Don R Swanson, and Neil R Smalheiser.
2005. A
probabilistic similarity metric for Medline records: A model for author name
disambiguation. Journal of the American Society for information science
and
technology 56, 2 (2005), 140–158.
Thank You!
This work was supported by the EU‘s Horizon
2020 programme under grant agreement
H2020-693092, the MOVING project:
www.moving-project.eu
44
Backup Slides
Results: Discussion
45
 Primitive baseline: 1 block = 1 author
REALITY
size-1
clusters
poor results
using
baseline
BASELINE
Conclusion
46
stopping
criterion
development of
Precision +
Recall

More Related Content

Similar to Effective Unsupervised Author Disambiguation with Relative Frequencies

One does not simply crowdsource the Semantic Web: 10 years with people, URIs,...
One does not simply crowdsource the Semantic Web: 10 years with people, URIs,...One does not simply crowdsource the Semantic Web: 10 years with people, URIs,...
One does not simply crowdsource the Semantic Web: 10 years with people, URIs,...Elena Simperl
 
Detecting Multiple Aliases in Social Media
Detecting Multiple Aliases in Social MediaDetecting Multiple Aliases in Social Media
Detecting Multiple Aliases in Social MediaAmendra Shrestha
 
Interactive Browsing and Navigation in Relational Databases
Interactive Browsing and Navigation in Relational DatabasesInteractive Browsing and Navigation in Relational Databases
Interactive Browsing and Navigation in Relational DatabasesMinsuk Kahng
 
[DSC Europe 23] Dmitry Ustalov - Design and Evaluation of Large Language Models
[DSC Europe 23] Dmitry Ustalov - Design and Evaluation of Large Language Models[DSC Europe 23] Dmitry Ustalov - Design and Evaluation of Large Language Models
[DSC Europe 23] Dmitry Ustalov - Design and Evaluation of Large Language ModelsDataScienceConferenc1
 
4_2_Ensemble models and gradient boosting2.pdf
4_2_Ensemble models and gradient boosting2.pdf4_2_Ensemble models and gradient boosting2.pdf
4_2_Ensemble models and gradient boosting2.pdfLeonardo Auslender
 
Heuristics for the Maximal Diversity Selection Problem
Heuristics for the Maximal Diversity Selection ProblemHeuristics for the Maximal Diversity Selection Problem
Heuristics for the Maximal Diversity Selection ProblemIJMER
 
Utility of topic extraction on customer experience data
Utility of topic extraction on customer experience dataUtility of topic extraction on customer experience data
Utility of topic extraction on customer experience dataKiran Karkera
 
Results may vary: Collaborations Workshop, Oxford 2014
Results may vary: Collaborations Workshop, Oxford 2014Results may vary: Collaborations Workshop, Oxford 2014
Results may vary: Collaborations Workshop, Oxford 2014Carole Goble
 
Stareast2008
Stareast2008Stareast2008
Stareast2008JaAe CK
 
Using Deep Learning for Title-Based Semantic Subject Indexing to Reach Compet...
Using Deep Learning for Title-Based Semantic Subject Indexing to Reach Compet...Using Deep Learning for Title-Based Semantic Subject Indexing to Reach Compet...
Using Deep Learning for Title-Based Semantic Subject Indexing to Reach Compet...Florian Mai
 
Learning Content and Usage Factors Simultaneously
Learning Content and Usage Factors SimultaneouslyLearning Content and Usage Factors Simultaneously
Learning Content and Usage Factors SimultaneouslyArnab Bhadury
 
Research Objects for FAIRer Science
Research Objects for FAIRer Science Research Objects for FAIRer Science
Research Objects for FAIRer Science Carole Goble
 
Toxic Comment Classification
Toxic Comment ClassificationToxic Comment Classification
Toxic Comment Classificationijtsrd
 
Data mining and warehousing qb
Data mining and warehousing   qbData mining and warehousing   qb
Data mining and warehousing qbaniprahal
 
Wearable Computing - Part IV: Ensemble classifiers & Insight into ongoing res...
Wearable Computing - Part IV: Ensemble classifiers & Insight into ongoing res...Wearable Computing - Part IV: Ensemble classifiers & Insight into ongoing res...
Wearable Computing - Part IV: Ensemble classifiers & Insight into ongoing res...Daniel Roggen
 
Berlin 6 Open Access Conference: Patrick Vandewalle
Berlin 6 Open Access Conference: Patrick VandewalleBerlin 6 Open Access Conference: Patrick Vandewalle
Berlin 6 Open Access Conference: Patrick VandewalleCornelius Puschmann
 
Crowdsourcing the Quality of Knowledge Graphs: A DBpedia Study
Crowdsourcing the Quality of Knowledge Graphs:A DBpedia StudyCrowdsourcing the Quality of Knowledge Graphs:A DBpedia Study
Crowdsourcing the Quality of Knowledge Graphs: A DBpedia StudyMaribel Acosta Deibe
 

Similar to Effective Unsupervised Author Disambiguation with Relative Frequencies (20)

One does not simply crowdsource the Semantic Web: 10 years with people, URIs,...
One does not simply crowdsource the Semantic Web: 10 years with people, URIs,...One does not simply crowdsource the Semantic Web: 10 years with people, URIs,...
One does not simply crowdsource the Semantic Web: 10 years with people, URIs,...
 
Sybrandt Thesis Proposal Presentation
Sybrandt Thesis Proposal PresentationSybrandt Thesis Proposal Presentation
Sybrandt Thesis Proposal Presentation
 
Detecting Multiple Aliases in Social Media
Detecting Multiple Aliases in Social MediaDetecting Multiple Aliases in Social Media
Detecting Multiple Aliases in Social Media
 
My experiment
My experimentMy experiment
My experiment
 
Interactive Browsing and Navigation in Relational Databases
Interactive Browsing and Navigation in Relational DatabasesInteractive Browsing and Navigation in Relational Databases
Interactive Browsing and Navigation in Relational Databases
 
Embracing Semantic Technology for Better Metadata Authoring in Biomedicine (S...
Embracing Semantic Technology for Better Metadata Authoring in Biomedicine (S...Embracing Semantic Technology for Better Metadata Authoring in Biomedicine (S...
Embracing Semantic Technology for Better Metadata Authoring in Biomedicine (S...
 
[DSC Europe 23] Dmitry Ustalov - Design and Evaluation of Large Language Models
[DSC Europe 23] Dmitry Ustalov - Design and Evaluation of Large Language Models[DSC Europe 23] Dmitry Ustalov - Design and Evaluation of Large Language Models
[DSC Europe 23] Dmitry Ustalov - Design and Evaluation of Large Language Models
 
4_2_Ensemble models and gradient boosting2.pdf
4_2_Ensemble models and gradient boosting2.pdf4_2_Ensemble models and gradient boosting2.pdf
4_2_Ensemble models and gradient boosting2.pdf
 
Heuristics for the Maximal Diversity Selection Problem
Heuristics for the Maximal Diversity Selection ProblemHeuristics for the Maximal Diversity Selection Problem
Heuristics for the Maximal Diversity Selection Problem
 
Utility of topic extraction on customer experience data
Utility of topic extraction on customer experience dataUtility of topic extraction on customer experience data
Utility of topic extraction on customer experience data
 
Results may vary: Collaborations Workshop, Oxford 2014
Results may vary: Collaborations Workshop, Oxford 2014Results may vary: Collaborations Workshop, Oxford 2014
Results may vary: Collaborations Workshop, Oxford 2014
 
Stareast2008
Stareast2008Stareast2008
Stareast2008
 
Using Deep Learning for Title-Based Semantic Subject Indexing to Reach Compet...
Using Deep Learning for Title-Based Semantic Subject Indexing to Reach Compet...Using Deep Learning for Title-Based Semantic Subject Indexing to Reach Compet...
Using Deep Learning for Title-Based Semantic Subject Indexing to Reach Compet...
 
Learning Content and Usage Factors Simultaneously
Learning Content and Usage Factors SimultaneouslyLearning Content and Usage Factors Simultaneously
Learning Content and Usage Factors Simultaneously
 
Research Objects for FAIRer Science
Research Objects for FAIRer Science Research Objects for FAIRer Science
Research Objects for FAIRer Science
 
Toxic Comment Classification
Toxic Comment ClassificationToxic Comment Classification
Toxic Comment Classification
 
Data mining and warehousing qb
Data mining and warehousing   qbData mining and warehousing   qb
Data mining and warehousing qb
 
Wearable Computing - Part IV: Ensemble classifiers & Insight into ongoing res...
Wearable Computing - Part IV: Ensemble classifiers & Insight into ongoing res...Wearable Computing - Part IV: Ensemble classifiers & Insight into ongoing res...
Wearable Computing - Part IV: Ensemble classifiers & Insight into ongoing res...
 
Berlin 6 Open Access Conference: Patrick Vandewalle
Berlin 6 Open Access Conference: Patrick VandewalleBerlin 6 Open Access Conference: Patrick Vandewalle
Berlin 6 Open Access Conference: Patrick Vandewalle
 
Crowdsourcing the Quality of Knowledge Graphs: A DBpedia Study
Crowdsourcing the Quality of Knowledge Graphs:A DBpedia StudyCrowdsourcing the Quality of Knowledge Graphs:A DBpedia Study
Crowdsourcing the Quality of Knowledge Graphs: A DBpedia Study
 

More from MOVING Project

Opening up education through digitization. Remarks on recent developments in ...
Opening up education through digitization. Remarks on recent developments in ...Opening up education through digitization. Remarks on recent developments in ...
Opening up education through digitization. Remarks on recent developments in ...MOVING Project
 
MOVING: Applying digital science methodology for TVET
MOVING: Applying digital science methodology for TVETMOVING: Applying digital science methodology for TVET
MOVING: Applying digital science methodology for TVETMOVING Project
 
Learning analytics for reflective learning
Learning analytics for reflective learningLearning analytics for reflective learning
Learning analytics for reflective learningMOVING Project
 
Challenges in Developing Automatic Learning Guidance in Relation to an Inform...
Challenges in Developing Automatic Learning Guidance in Relation to an Inform...Challenges in Developing Automatic Learning Guidance in Relation to an Inform...
Challenges in Developing Automatic Learning Guidance in Relation to an Inform...MOVING Project
 
Unesco mobileweek 2019_frontier_tech_oer-final
Unesco mobileweek 2019_frontier_tech_oer-finalUnesco mobileweek 2019_frontier_tech_oer-final
Unesco mobileweek 2019_frontier_tech_oer-finalMOVING Project
 
Inferring knowledge acquisition through Web navigation behaviour
Inferring knowledge acquisition through Web navigation behaviourInferring knowledge acquisition through Web navigation behaviour
Inferring knowledge acquisition through Web navigation behaviourMOVING Project
 
ITI-CERTH participation in TRECVID 2018
ITI-CERTH participation in TRECVID 2018ITI-CERTH participation in TRECVID 2018
ITI-CERTH participation in TRECVID 2018MOVING Project
 
Wissenschaft 2.0 und offene Forschungsmethoden vermitteln– Der MOOC "Science ...
Wissenschaft 2.0 und offene Forschungsmethoden vermitteln– Der MOOC "Science ...Wissenschaft 2.0 und offene Forschungsmethoden vermitteln– Der MOOC "Science ...
Wissenschaft 2.0 und offene Forschungsmethoden vermitteln– Der MOOC "Science ...MOVING Project
 
Wissenschaft 2.0 und offene Forschungsmethoden vermitteln: Der MOOC Science 2...
Wissenschaft 2.0 und offene Forschungsmethoden vermitteln: Der MOOC Science 2...Wissenschaft 2.0 und offene Forschungsmethoden vermitteln: Der MOOC Science 2...
Wissenschaft 2.0 und offene Forschungsmethoden vermitteln: Der MOOC Science 2...MOVING Project
 
VERGE: A Multimodal Interactive Search Engine for Video Browsing and Retrieval
VERGE: A Multimodal Interactive Search Engine for Video Browsing and RetrievalVERGE: A Multimodal Interactive Search Engine for Video Browsing and Retrieval
VERGE: A Multimodal Interactive Search Engine for Video Browsing and RetrievalMOVING Project
 
Temporal Lecture Video Fragmentation using Word Embeddings
Temporal Lecture Video Fragmentation using Word EmbeddingsTemporal Lecture Video Fragmentation using Word Embeddings
Temporal Lecture Video Fragmentation using Word EmbeddingsMOVING Project
 
The Impact of Blocking and Name-Matching on Author Disambiguation.
The Impact of Blocking and Name-Matching on Author Disambiguation.The Impact of Blocking and Name-Matching on Author Disambiguation.
The Impact of Blocking and Name-Matching on Author Disambiguation.MOVING Project
 
What to read next? Challenges and Preliminary Results in Selecting Represen...
What to read next? Challenges and  Preliminary Results in Selecting  Represen...What to read next? Challenges and  Preliminary Results in Selecting  Represen...
What to read next? Challenges and Preliminary Results in Selecting Represen...MOVING Project
 
Qualitative Analysis of Vocabulary Evolution on the Linked Open Data Cloud
Qualitative Analysis of Vocabulary Evolution on the Linked Open Data CloudQualitative Analysis of Vocabulary Evolution on the Linked Open Data Cloud
Qualitative Analysis of Vocabulary Evolution on the Linked Open Data CloudMOVING Project
 
Analyzing the Evolution of Vocabulary Terms and Their Impact on the LOD Cloud
Analyzing the Evolution of Vocabulary Terms and Their Impact on the LOD CloudAnalyzing the Evolution of Vocabulary Terms and Their Impact on the LOD Cloud
Analyzing the Evolution of Vocabulary Terms and Their Impact on the LOD CloudMOVING Project
 
Keeping Linked Open Data Caches Up-to-date by Predicting the Life-time of RDF...
Keeping Linked Open Data Caches Up-to-date by Predicting the Life-time of RDF...Keeping Linked Open Data Caches Up-to-date by Predicting the Life-time of RDF...
Keeping Linked Open Data Caches Up-to-date by Predicting the Life-time of RDF...MOVING Project
 
Deep Multi-task Learning with Label Correlation Constraint for Video Concept ...
Deep Multi-task Learning with Label Correlation Constraint for Video Concept ...Deep Multi-task Learning with Label Correlation Constraint for Video Concept ...
Deep Multi-task Learning with Label Correlation Constraint for Video Concept ...MOVING Project
 
Generic to Specific Recognition Models for Membership Analysis in Group Videos
Generic to Specific Recognition Models for Membership Analysis in Group VideosGeneric to Specific Recognition Models for Membership Analysis in Group Videos
Generic to Specific Recognition Models for Membership Analysis in Group VideosMOVING Project
 
MOVING the Industry 4.0
MOVING the Industry 4.0MOVING the Industry 4.0
MOVING the Industry 4.0MOVING Project
 
MOVING presentation at JSI
MOVING presentation at JSIMOVING presentation at JSI
MOVING presentation at JSIMOVING Project
 

More from MOVING Project (20)

Opening up education through digitization. Remarks on recent developments in ...
Opening up education through digitization. Remarks on recent developments in ...Opening up education through digitization. Remarks on recent developments in ...
Opening up education through digitization. Remarks on recent developments in ...
 
MOVING: Applying digital science methodology for TVET
MOVING: Applying digital science methodology for TVETMOVING: Applying digital science methodology for TVET
MOVING: Applying digital science methodology for TVET
 
Learning analytics for reflective learning
Learning analytics for reflective learningLearning analytics for reflective learning
Learning analytics for reflective learning
 
Challenges in Developing Automatic Learning Guidance in Relation to an Inform...
Challenges in Developing Automatic Learning Guidance in Relation to an Inform...Challenges in Developing Automatic Learning Guidance in Relation to an Inform...
Challenges in Developing Automatic Learning Guidance in Relation to an Inform...
 
Unesco mobileweek 2019_frontier_tech_oer-final
Unesco mobileweek 2019_frontier_tech_oer-finalUnesco mobileweek 2019_frontier_tech_oer-final
Unesco mobileweek 2019_frontier_tech_oer-final
 
Inferring knowledge acquisition through Web navigation behaviour
Inferring knowledge acquisition through Web navigation behaviourInferring knowledge acquisition through Web navigation behaviour
Inferring knowledge acquisition through Web navigation behaviour
 
ITI-CERTH participation in TRECVID 2018
ITI-CERTH participation in TRECVID 2018ITI-CERTH participation in TRECVID 2018
ITI-CERTH participation in TRECVID 2018
 
Wissenschaft 2.0 und offene Forschungsmethoden vermitteln– Der MOOC "Science ...
Wissenschaft 2.0 und offene Forschungsmethoden vermitteln– Der MOOC "Science ...Wissenschaft 2.0 und offene Forschungsmethoden vermitteln– Der MOOC "Science ...
Wissenschaft 2.0 und offene Forschungsmethoden vermitteln– Der MOOC "Science ...
 
Wissenschaft 2.0 und offene Forschungsmethoden vermitteln: Der MOOC Science 2...
Wissenschaft 2.0 und offene Forschungsmethoden vermitteln: Der MOOC Science 2...Wissenschaft 2.0 und offene Forschungsmethoden vermitteln: Der MOOC Science 2...
Wissenschaft 2.0 und offene Forschungsmethoden vermitteln: Der MOOC Science 2...
 
VERGE: A Multimodal Interactive Search Engine for Video Browsing and Retrieval
VERGE: A Multimodal Interactive Search Engine for Video Browsing and RetrievalVERGE: A Multimodal Interactive Search Engine for Video Browsing and Retrieval
VERGE: A Multimodal Interactive Search Engine for Video Browsing and Retrieval
 
Temporal Lecture Video Fragmentation using Word Embeddings
Temporal Lecture Video Fragmentation using Word EmbeddingsTemporal Lecture Video Fragmentation using Word Embeddings
Temporal Lecture Video Fragmentation using Word Embeddings
 
The Impact of Blocking and Name-Matching on Author Disambiguation.
The Impact of Blocking and Name-Matching on Author Disambiguation.The Impact of Blocking and Name-Matching on Author Disambiguation.
The Impact of Blocking and Name-Matching on Author Disambiguation.
 
What to read next? Challenges and Preliminary Results in Selecting Represen...
What to read next? Challenges and  Preliminary Results in Selecting  Represen...What to read next? Challenges and  Preliminary Results in Selecting  Represen...
What to read next? Challenges and Preliminary Results in Selecting Represen...
 
Qualitative Analysis of Vocabulary Evolution on the Linked Open Data Cloud
Qualitative Analysis of Vocabulary Evolution on the Linked Open Data CloudQualitative Analysis of Vocabulary Evolution on the Linked Open Data Cloud
Qualitative Analysis of Vocabulary Evolution on the Linked Open Data Cloud
 
Analyzing the Evolution of Vocabulary Terms and Their Impact on the LOD Cloud
Analyzing the Evolution of Vocabulary Terms and Their Impact on the LOD CloudAnalyzing the Evolution of Vocabulary Terms and Their Impact on the LOD Cloud
Analyzing the Evolution of Vocabulary Terms and Their Impact on the LOD Cloud
 
Keeping Linked Open Data Caches Up-to-date by Predicting the Life-time of RDF...
Keeping Linked Open Data Caches Up-to-date by Predicting the Life-time of RDF...Keeping Linked Open Data Caches Up-to-date by Predicting the Life-time of RDF...
Keeping Linked Open Data Caches Up-to-date by Predicting the Life-time of RDF...
 
Deep Multi-task Learning with Label Correlation Constraint for Video Concept ...
Deep Multi-task Learning with Label Correlation Constraint for Video Concept ...Deep Multi-task Learning with Label Correlation Constraint for Video Concept ...
Deep Multi-task Learning with Label Correlation Constraint for Video Concept ...
 
Generic to Specific Recognition Models for Membership Analysis in Group Videos
Generic to Specific Recognition Models for Membership Analysis in Group VideosGeneric to Specific Recognition Models for Membership Analysis in Group Videos
Generic to Specific Recognition Models for Membership Analysis in Group Videos
 
MOVING the Industry 4.0
MOVING the Industry 4.0MOVING the Industry 4.0
MOVING the Industry 4.0
 
MOVING presentation at JSI
MOVING presentation at JSIMOVING presentation at JSI
MOVING presentation at JSI
 

Recently uploaded

Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Allon Mureinik
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitecturePixlogix Infotech
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions
 
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024BookNet Canada
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...shyamraj55
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machinePadma Pradeep
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024BookNet Canada
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024Scott Keck-Warren
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsMemoori
 
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptxMaking_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptxnull - The Open Security Community
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Patryk Bandurski
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 

Recently uploaded (20)

Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC Architecture
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping Elbows
 
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptxMaking_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
 
The transition to renewables in India.pdf
The transition to renewables in India.pdfThe transition to renewables in India.pdf
The transition to renewables in India.pdf
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 

Effective Unsupervised Author Disambiguation with Relative Frequencies

  • 1. Effective Unsupervised Author Disambiguation with Relative Frequencies Tobias Backes JCDL‘18
  • 2. Overview 2 Introduction Problem Setting Method Agglomerative Clustering Probabilistic Similarity Feature-types Parameters and Variants Experiments Dataset Evaluation Setup Results Results Discussion Conclusion Conclusion
  • 3. Introduction: Problem Setting 3 John W. Smith, Thomas Müller Document I J. Smith, Zack Chen Document II J. Smith Document III author mention same name same author?
  • 4. Introduction: Problem Setting 4 The collection is partitioned by different author names. John W. Smith, Thomas Müller Part of the collection is annotated with author identifiers. E-6522-2011
  • 5. Introduction: Problem Setting 5 Partition names into blocks. Smith, J. W., Müller, T. Gold authors are also partitioned. Hidden source of error!
  • 6. Introduction: Problem Setting 6 collection blocks each block system authors single mentions
  • 7. Method: Agglomerative Clustering 7 Take a block.Each mention one author cluster.Merge the most similar author clustersTerminate if stopping critereon reached. Otherwise the entire block becomes a single author Which clusters are the most similar?
  • 8. Method: Probabilistic Similarity 8 Probability of author mention 𝑥 ∈ 𝑋 given mention 𝑥‘: 𝑝 𝑥 𝑥′ = 𝑓 #(𝑓, 𝑥) ∙ #(𝑓, 𝑥′) #(𝑓) ∙ #(𝑥′) Probability of author cluster 𝐶 given cluster 𝐶‘: 𝑝 𝐶 𝐶′ = 𝑥,𝑥′ ∈𝐶×𝐶′ 𝑝 𝑥 𝑥′ ⋅ #(𝑥′) #(𝐶′) (Use add-𝜖 smoothing to prevent zero-probabilities)
  • 9. Method: Feature-types 9  Terms  Affiliations  Categories  Keywords  Co-author names  Emails  Years 15 % 20 % 18 % 3% 20 % 10 % 𝑠𝑐𝑜𝑟𝑒 𝐶, 𝐶′ = 𝑓𝑡𝑦𝑝𝑒 𝜆 𝑓𝑡𝑦𝑝𝑒 ⋅ 𝑝 𝑓𝑡𝑦𝑝𝑒(𝐶|𝐶′) traine d 𝝀 𝒇𝒕𝒚𝒑𝒆 on separate training data
  • 10. Method: Parameters and Variants 10 1. Variants  𝑝 𝐶 𝐶‘ =  (𝑥,𝑥′) … (prob-variant)  max (𝑥,𝑥′) … (max-variant) 2. Parameters  𝜆: Feature-type weights  𝛼 or 𝛽: Stopping parameters
  • 11. Method: Parameters and Variants 11 Stopping limit: 𝑙 = 𝛼 + 𝑋 ⋅ 𝛽 where 𝑋 is block-size in number of mentions.  𝛼 = 0 for prob-variant  𝛽 = 0 for max-variant There is room for improvement.
  • 12. Experiments: Dataset 12  Web of Science (until first half 2014)  Over 150 million author mentions  Blocking: Lastname, all initials, i.e. J. W. Smith  For each problem size ∈ [1. . 10]: Draw up to 1000 blocks.  Size 1-4: > 1000 blocks  Size 5: 860 blocks  Size 10: 141 blocks  Sparsity for large block sizes
  • 13. Experiments: Evaluation 13 • PairF1 and bCube • Count the number of pairs • Get precision, recall and F1 • True-pos.: Should be and are together • True-neg.: Should not be together and aren‘t • False-pos.: Should not be together but are • False-neg.: Should be together but aren‘t
  • 14. Experiments: Setup 14 collection Blocks by number of annotated authors For each block size: For each clustering iteration: 1. Current Precision 2. Current Recall 3. Current number of clusters 4. Converged or not?
  • 15. Results 15 What is the best we can get?
  • 16. Results 16 • all ftypes • trained weights • prob-variant
  • 17. Results 17 How does the max-variant compare to that?
  • 18. Results 18 • all ftypes • trained weights • prob-variant
  • 19. Results 19 • all ftypes • trained weights • max-variant
  • 20. Results 20 early gains • size 1 • prob-variant
  • 22. Results 22 How well works the stopping criterion?
  • 23. Results 23 peaks before correct size stops before correct size • size 5 • prob-variant
  • 24. Results 24 peaks at correct size stops at correct size • size 5 • max-variant
  • 25. Results 25 peaks before correct size stops after correct size • size 10 • prob-variant
  • 26. Results 26 peaks at correct size stops after correct size • size 10 • max-variant
  • 28. Results 28 • all ftypes • trained weights • prob-variant
  • 29. Results 29 • all ftypes • uniform weights • prob-variant practically the same
  • 31. Results 31 • all ftypes • uniform weights • prob-variant
  • 32. Results 32 • only terms • uniform weights • prob-variant recall drops considerably
  • 34. Results 34 • all ftypes • uniform weights • prob-variant
  • 35. Results 35 • only co-authors • uniform weights • prob-variant recall drops considerably
  • 36. Results 36 • all ftypes • uniform weights • prob-variant
  • 37. Results 37 • co+ref-authors • uniform weights • prob-variant only little loss
  • 38. Results 38 • all ftypes • uniform weights • prob-variant
  • 39. Results 39 • no co-authors • uniform weights • prob-variant recall drops considerabl y
  • 40. Results: Discussion 40 Not possible to compare results directly to other groups. 1. Data  Web of Science, until half 2014  Available features in this version  Available researcherID annotation in this version 2. Blocking  Surname, all initials 3. Evaluation  bCube, pairF1 (including pairs (𝑥, 𝑥))  Recall not measured across blocks  Macro average  Each problem size separately Any variation could change results considerably
  • 41. Results: Discussion 41  Suggestion: Evaluate each problem size separately!  Say, we have to distinguish i.e. 8 authors, then this defines task difficulty
  • 42. Conclusion 42  Method:  Agglomerative Clustering + Probabilistic similarity  Almost parameter free  Features:  Co-authors very important  Add any ftype you like  Evaluation:  Each problem size separately  Visualize clustering process a. balance Precision vs. Recall b. tune stopping criterion
  • 43. References 43 [1] Aron Culotta, Pallika Kanani, Robert Hall, Michael Wick, and Andrew McCallum. 2007. Author disambiguation using error-driven machine learning with a ranking loss function. In Sixth International Workshop on Information Integration on the Web (IIWeb-07), Vancouver, Canada. [2] Anderson A Ferreira, Marcos André Gonçalves, and Alberto HF Laender. 2012. A brief survey of automatic methods for author name disambiguation. Acm Sigmod Record 41, 2 (2012), 15–26. [3] Anderson A Ferreira, Adriano Veloso, Marcos André Gonçalves, and Alberto HF Laender. 2010. Effective self-training author name disambiguation in scholarly digital libraries. In Proceedings of the 10th annual joint conference on Digital libraries. ACM, 39–48. [4] Thomas Gurney, Edwin Horlings, and Peter Van Den Besselaar. 2012. Author disambiguation using multi-aspect similarity indicators. Scientometrics 91, 2 (2012), 435–449. [5] Hui Han, Wei Xu, Hongyuan Zha, and C Lee Giles. 2005. A hierarchical naive Bayes mixture model for name disambiguation in author citations. In Proceedings of the 2005 ACM symposium on Applied computing. ACM, 1065–1069. [6] Anne-Wil Harzing. 2015. Health warning: might contain multiple personalities - the problem of homonyms in Thomson Reuters Essential Science Indicators. Scientometrics 105, 3 (2015), 2259–2270. [7] T. Kramer, F. Momeni, and P. Mayr. 2017. Coverage of Author Identifiers in Web of Science and Scopus. ArXiv e-prints (March 2017). arXiv:cs.DL/1703.01319 [8] Michael Levin, Stefan Krawczyk, Steven Bethard, and Dan Jurafsky. 2012. Citation-based bootstrapping for large-scale author disambiguation. Journal of the American Society for Information Science and Technology 63, 5 (2012), 1030–1047. [9] Staša Milojević. 2013. Accuracy of simple, initials-based methods for author name disambiguation. Journal of Informetrics 7, 4 (2013), 767–773. [10] Alan Filipe Santana, Marcos André Gonçalves, Alberto HF Laender, and Ander- son A Ferreira. 2017. Incremental author name disambiguation by exploiting domain-specific heuristics. Journal of the Association for Information Science and Technology 68, 4 (2017), 931–945. [11] Neil R Smalheiser and Vetle I Torvik. 2009. Author name disambiguation. Annual review of information science and technology 43, 1 (2009), 1–43. [12] Yang Song, Jian Huang, Isaac G Councill, Jia Li, and C Lee Giles. 2007. Effi- cient topic-based unsupervised name disambiguation. In Proceedings of the 7th ACM/IEEE-CS joint conference on Digital libraries. ACM, 342–351. [13] Andreas Strotmann and Dangzhi Zhao. 2012. Author name disambiguation: What difference does it make in author-based citation analysis? Journal of the American Society for Information Science and Technology 63, 9 (2012), 1820–1833. [14] Jie Tang, Alvis CM Fong, Bo Wang, and Jing Zhang. 2012. A unified probabilistic framework for name disambiguation in digital library. IEEE Transactions on Knowledge and Data Engineering 24, 6 (2012), 975–987. [15] Vetle I Torvik and Neil R Smalheiser. 2009. Author name disambiguation in MEDLINE. ACM Transactions on Knowledge Discovery from Data (TKDD) 3, 3 (2009), 11. [16] Vetle I Torvik, Marc Weeber, Don R Swanson, and Neil R Smalheiser. 2005. A probabilistic similarity metric for Medline records: A model for author name disambiguation. Journal of the American Society for information science and technology 56, 2 (2005), 140–158. Thank You! This work was supported by the EU‘s Horizon 2020 programme under grant agreement H2020-693092, the MOVING project: www.moving-project.eu
  • 45. Results: Discussion 45  Primitive baseline: 1 block = 1 author REALITY size-1 clusters poor results using baseline BASELINE