3. Introduction: Problem Setting
3
John W. Smith, Thomas
Müller
Document I
J. Smith, Zack Chen
Document II
J. Smith
Document III
author mention
same name
same author?
4. Introduction: Problem Setting
4
The collection is partitioned by
different author names.
John W. Smith, Thomas
Müller
Part of the collection is
annotated with author
identifiers.
E-6522-2011
7. Method: Agglomerative Clustering
7
Take a block.Each mention one author cluster.Merge the most similar author clustersTerminate if
stopping critereon
reached.
Otherwise the
entire block
becomes a single
author
Which clusters
are the most
similar?
8. Method: Probabilistic Similarity
8
Probability of author mention 𝑥 ∈ 𝑋 given mention
𝑥‘:
𝑝 𝑥 𝑥′
=
𝑓
#(𝑓, 𝑥) ∙ #(𝑓, 𝑥′)
#(𝑓) ∙ #(𝑥′)
Probability of author cluster 𝐶 given cluster 𝐶‘:
𝑝 𝐶 𝐶′
=
𝑥,𝑥′ ∈𝐶×𝐶′
𝑝 𝑥 𝑥′
⋅
#(𝑥′)
#(𝐶′)
(Use add-𝜖 smoothing to prevent zero-probabilities)
9. Method: Feature-types
9
Terms
Affiliations
Categories
Keywords
Co-author
names
Emails
Years
15
%
20
%
18
%
3%
20
%
10
%
𝑠𝑐𝑜𝑟𝑒 𝐶, 𝐶′
=
𝑓𝑡𝑦𝑝𝑒
𝜆 𝑓𝑡𝑦𝑝𝑒 ⋅ 𝑝 𝑓𝑡𝑦𝑝𝑒(𝐶|𝐶′)
traine
d
𝝀 𝒇𝒕𝒚𝒑𝒆
on
separate
training
data
11. Method: Parameters and Variants
11
Stopping limit:
𝑙 = 𝛼 + 𝑋 ⋅ 𝛽
where 𝑋 is block-size in number of mentions.
𝛼 = 0 for prob-variant
𝛽 = 0 for max-variant
There is room for improvement.
12. Experiments: Dataset
12
Web of Science (until first half 2014)
Over 150 million author mentions
Blocking: Lastname, all initials, i.e. J. W.
Smith
For each problem size ∈ [1. . 10]:
Draw up to 1000 blocks.
Size 1-4: > 1000 blocks
Size 5: 860 blocks
Size 10: 141 blocks
Sparsity for large block sizes
13. Experiments: Evaluation
13
• PairF1 and bCube
• Count the number of pairs
• Get precision, recall and F1
• True-pos.:
Should be and are together
• True-neg.:
Should not be together and aren‘t
• False-pos.:
Should not be together but are
• False-neg.:
Should be together but aren‘t
14. Experiments: Setup
14
collection
Blocks by
number of
annotated
authors
For each block size:
For each clustering iteration:
1. Current Precision
2. Current Recall
3. Current number of clusters
4. Converged or not?
40. Results: Discussion
40
Not possible to compare results directly to other groups.
1. Data
Web of Science, until half 2014
Available features in this version
Available researcherID annotation in this version
2. Blocking
Surname, all initials
3. Evaluation
bCube, pairF1 (including pairs (𝑥, 𝑥))
Recall not measured across blocks
Macro average
Each problem size separately
Any variation could change results considerably
42. Conclusion
42
Method:
Agglomerative Clustering + Probabilistic similarity
Almost parameter free
Features:
Co-authors very important
Add any ftype you like
Evaluation:
Each problem size separately
Visualize clustering process
a. balance Precision vs. Recall
b. tune stopping criterion
43. References
43
[1] Aron Culotta, Pallika Kanani, Robert Hall, Michael Wick, and Andrew
McCallum.
2007. Author disambiguation using error-driven machine learning with a
ranking
loss function. In Sixth International Workshop on Information Integration
on the
Web (IIWeb-07), Vancouver, Canada.
[2] Anderson A Ferreira, Marcos André Gonçalves, and Alberto HF
Laender. 2012. A
brief survey of automatic methods for author name disambiguation. Acm
Sigmod
Record 41, 2 (2012), 15–26.
[3] Anderson A Ferreira, Adriano Veloso, Marcos André Gonçalves, and
Alberto HF
Laender. 2010. Effective self-training author name disambiguation in
scholarly
digital libraries. In Proceedings of the 10th annual joint conference on
Digital
libraries. ACM, 39–48.
[4] Thomas Gurney, Edwin Horlings, and Peter Van Den Besselaar. 2012.
Author
disambiguation using multi-aspect similarity indicators. Scientometrics
91, 2
(2012), 435–449.
[5] Hui Han, Wei Xu, Hongyuan Zha, and C Lee Giles. 2005. A hierarchical
naive
Bayes mixture model for name disambiguation in author citations. In
Proceedings
of the 2005 ACM symposium on Applied computing. ACM, 1065–1069.
[6] Anne-Wil Harzing. 2015. Health warning: might contain multiple
personalities
- the problem of homonyms in Thomson Reuters Essential Science
Indicators.
Scientometrics 105, 3 (2015), 2259–2270.
[7] T. Kramer, F. Momeni, and P. Mayr. 2017. Coverage of Author
Identifiers in Web
of Science and Scopus. ArXiv e-prints (March 2017).
arXiv:cs.DL/1703.01319
[8] Michael Levin, Stefan Krawczyk, Steven Bethard, and Dan Jurafsky.
2012. Citation-based bootstrapping for large-scale author
disambiguation. Journal of the American Society for Information Science
and Technology 63, 5 (2012), 1030–1047.
[9] Staša Milojević. 2013. Accuracy of simple, initials-based methods for
author
name disambiguation. Journal of Informetrics 7, 4 (2013), 767–773.
[10] Alan Filipe Santana, Marcos André Gonçalves, Alberto HF Laender, and
Ander-
son A Ferreira. 2017. Incremental author name disambiguation by exploiting
domain-specific heuristics. Journal of the Association for Information
Science and
Technology 68, 4 (2017), 931–945.
[11] Neil R Smalheiser and Vetle I Torvik. 2009. Author name
disambiguation. Annual review of information science and technology 43, 1
(2009), 1–43.
[12] Yang Song, Jian Huang, Isaac G Councill, Jia Li, and C Lee Giles. 2007.
Effi-
cient topic-based unsupervised name disambiguation. In Proceedings of the
7th
ACM/IEEE-CS joint conference on Digital libraries. ACM, 342–351.
[13] Andreas Strotmann and Dangzhi Zhao. 2012. Author name
disambiguation: What difference does it make in author-based citation
analysis? Journal of the American Society for Information Science and
Technology 63, 9 (2012), 1820–1833.
[14] Jie Tang, Alvis CM Fong, Bo Wang, and Jing Zhang. 2012. A unified
probabilistic framework for name disambiguation in digital library. IEEE
Transactions on Knowledge and Data Engineering 24, 6 (2012), 975–987.
[15] Vetle I Torvik and Neil R Smalheiser. 2009. Author name disambiguation
in
MEDLINE. ACM Transactions on Knowledge Discovery from Data (TKDD) 3,
3 (2009), 11.
[16] Vetle I Torvik, Marc Weeber, Don R Swanson, and Neil R Smalheiser.
2005. A
probabilistic similarity metric for Medline records: A model for author name
disambiguation. Journal of the American Society for information science
and
technology 56, 2 (2005), 140–158.
Thank You!
This work was supported by the EU‘s Horizon
2020 programme under grant agreement
H2020-693092, the MOVING project:
www.moving-project.eu