Effective Unsupervised Author Disambiguation with Relative Frequencies

Effective Unsupervised
Author Disambiguation with
Relative Frequencies
Tobias Backes JCDL‘18

Overview
2
Introduction Problem
Setting
Method Agglomerative
Clustering
Probabilistic
Similarity
Feature-types
Parameters
and Variants
Experiments Dataset Evaluation Setup
Results Results Discussion
Conclusion Conclusion

Introduction: Problem Setting
3
John W. Smith, Thomas
Müller
Document I
J. Smith, Zack Chen
Document II
J. Smith
Document III
author mention
same name
same author?

4
The collection is partitioned by
different author names.
John W. Smith, Thomas
Müller
Part of the collection is
annotated with author
identifiers.
E-6522-2011

5
Partition names into blocks.
Smith, J. W., Müller, T.
Gold authors are also partitioned.
Hidden source of error!

6
collection
blocks
each block
system authors
single mentions

Method: Agglomerative Clustering
7
Take a block.Each mention one author cluster.Merge the most similar author clustersTerminate if
stopping critereon
reached.
Otherwise the
entire block
becomes a single
author
Which clusters
are the most
similar?

Method: Probabilistic Similarity
8
Probability of author mention 𝑥 ∈ 𝑋 given mention
𝑥‘:
𝑝 𝑥 𝑥′
=
𝑓
#(𝑓, 𝑥) ∙ #(𝑓, 𝑥′)
#(𝑓) ∙ #(𝑥′)
Probability of author cluster 𝐶 given cluster 𝐶‘:
𝑝 𝐶 𝐶′
=
𝑥,𝑥′ ∈𝐶×𝐶′
𝑝 𝑥 𝑥′
⋅
#(𝑥′)
#(𝐶′)
(Use add-𝜖 smoothing to prevent zero-probabilities)

Method: Feature-types
9
 Terms
 Affiliations
 Categories
 Keywords
 Co-author
names
 Emails
 Years
15
%
20
%
18
%
3%
20
%
10
%
𝑠𝑐𝑜𝑟𝑒 𝐶, 𝐶′
=
𝑓𝑡𝑦𝑝𝑒
𝜆 𝑓𝑡𝑦𝑝𝑒 ⋅ 𝑝 𝑓𝑡𝑦𝑝𝑒(𝐶|𝐶′)
traine
d
𝝀 𝒇𝒕𝒚𝒑𝒆
on
separate
training
data

Method: Parameters and Variants
10
1. Variants
 𝑝 𝐶 𝐶‘ =
 (𝑥,𝑥′) … (prob-variant)
 max
(𝑥,𝑥′)
… (max-variant)
2. Parameters
 𝜆: Feature-type weights
 𝛼 or 𝛽: Stopping parameters

Method: Parameters and Variants
11
Stopping limit:
𝑙 = 𝛼 + 𝑋 ⋅ 𝛽
where 𝑋 is block-size in number of mentions.
 𝛼 = 0 for prob-variant
 𝛽 = 0 for max-variant
There is room for improvement.

Experiments: Dataset
12
 Web of Science (until first half 2014)
 Over 150 million author mentions
 Blocking: Lastname, all initials, i.e. J. W.
Smith
 For each problem size ∈ [1. . 10]:
Draw up to 1000 blocks.
 Size 1-4: > 1000 blocks
 Size 5: 860 blocks
 Size 10: 141 blocks
 Sparsity for large block sizes

Experiments: Evaluation
13
• PairF1 and bCube
• Count the number of pairs
• Get precision, recall and F1
• True-pos.:
Should be and are together
• True-neg.:
Should not be together and aren‘t
• False-pos.:
Should not be together but are
• False-neg.:
Should be together but aren‘t

Experiments: Setup
14
collection
Blocks by
number of
annotated
authors
For each block size:
For each clustering iteration:
1. Current Precision
2. Current Recall
3. Current number of clusters
4. Converged or not?

Results
15
What is the best
we can get?

Results
16
• all ftypes
• trained weights
• prob-variant

Results
17
How does the max-variant
compare to that?

Results
18
• all ftypes
• trained weights
• prob-variant

Results
19
• all ftypes
• trained weights
• max-variant

Results
20
early gains
• size 1
• prob-variant

Results
21
no
early
gains
• size 1
• max-
variant

Results
22
How well works the
stopping criterion?

Results
23
peaks before
correct size
stops before
correct size
• size 5
• prob-variant

Results
24
peaks at
correct size
stops at
correct size
• size 5
• max-variant

Results
25
peaks before
correct size
stops after
correct size
• size 10
• prob-variant

Results
26
peaks at
correct size
stops after
correct size
• size 10
• max-variant

Results
27
How do uniform
weights compare?

Results
28
• all ftypes
• trained weights
• prob-variant

Results
29
• all ftypes
• uniform weights
• prob-variant
practically
the same

Results
30
How about using
only terms?

Results
31
• all ftypes
• uniform weights
• prob-variant

Results
32
• only terms
• uniform weights
• prob-variant
recall drops
considerably

Results
33
How important are
co- and ref-authors?

Results
34
• all ftypes
• uniform weights
• prob-variant

Results
35
• only co-authors
• uniform weights
• prob-variant
recall drops
considerably

Results
36
• all ftypes
• uniform weights
• prob-variant

Results
37
• co+ref-authors
• uniform weights
• prob-variant
only little loss

Results
38
• all ftypes
• uniform weights
• prob-variant

Results
39
• no co-authors
• uniform weights
• prob-variant
recall drops
considerabl
y

Results: Discussion
40
Not possible to compare results directly to other groups.
1. Data
 Web of Science, until half 2014
 Available features in this version
 Available researcherID annotation in this version
2. Blocking
 Surname, all initials
3. Evaluation
 bCube, pairF1 (including pairs (𝑥, 𝑥))
 Recall not measured across blocks
 Macro average
 Each problem size separately
Any variation could change results considerably

Results: Discussion
41
 Suggestion:
Evaluate each problem size separately!
 Say, we have to distinguish i.e. 8 authors,
then
this defines task difficulty

Conclusion
42
 Method:
 Agglomerative Clustering + Probabilistic similarity
 Almost parameter free
 Features:
 Co-authors very important
 Add any ftype you like
 Evaluation:
 Each problem size separately
 Visualize clustering process
a. balance Precision vs. Recall
b. tune stopping criterion

References
43
[1] Aron Culotta, Pallika Kanani, Robert Hall, Michael Wick, and Andrew
McCallum.
2007. Author disambiguation using error-driven machine learning with a
ranking
loss function. In Sixth International Workshop on Information Integration
on the
Web (IIWeb-07), Vancouver, Canada.
[2] Anderson A Ferreira, Marcos André Gonçalves, and Alberto HF
Laender. 2012. A
brief survey of automatic methods for author name disambiguation. Acm
Sigmod
Record 41, 2 (2012), 15–26.
[3] Anderson A Ferreira, Adriano Veloso, Marcos André Gonçalves, and
Alberto HF
Laender. 2010. Effective self-training author name disambiguation in
scholarly
digital libraries. In Proceedings of the 10th annual joint conference on
Digital
libraries. ACM, 39–48.
[4] Thomas Gurney, Edwin Horlings, and Peter Van Den Besselaar. 2012.
Author
disambiguation using multi-aspect similarity indicators. Scientometrics
91, 2
(2012), 435–449.
[5] Hui Han, Wei Xu, Hongyuan Zha, and C Lee Giles. 2005. A hierarchical
naive
Bayes mixture model for name disambiguation in author citations. In
Proceedings
of the 2005 ACM symposium on Applied computing. ACM, 1065–1069.
[6] Anne-Wil Harzing. 2015. Health warning: might contain multiple
personalities
- the problem of homonyms in Thomson Reuters Essential Science
Indicators.
Scientometrics 105, 3 (2015), 2259–2270.
[7] T. Kramer, F. Momeni, and P. Mayr. 2017. Coverage of Author
Identifiers in Web
of Science and Scopus. ArXiv e-prints (March 2017).
arXiv:cs.DL/1703.01319
[8] Michael Levin, Stefan Krawczyk, Steven Bethard, and Dan Jurafsky.
2012. Citation-based bootstrapping for large-scale author
disambiguation. Journal of the American Society for Information Science
and Technology 63, 5 (2012), 1030–1047.
[9] Staša Milojević. 2013. Accuracy of simple, initials-based methods for
author
name disambiguation. Journal of Informetrics 7, 4 (2013), 767–773.
[10] Alan Filipe Santana, Marcos André Gonçalves, Alberto HF Laender, and
Ander-
son A Ferreira. 2017. Incremental author name disambiguation by exploiting
domain-specific heuristics. Journal of the Association for Information
Science and
Technology 68, 4 (2017), 931–945.
[11] Neil R Smalheiser and Vetle I Torvik. 2009. Author name
disambiguation. Annual review of information science and technology 43, 1
(2009), 1–43.
[12] Yang Song, Jian Huang, Isaac G Councill, Jia Li, and C Lee Giles. 2007.
Effi-
cient topic-based unsupervised name disambiguation. In Proceedings of the
7th
ACM/IEEE-CS joint conference on Digital libraries. ACM, 342–351.
[13] Andreas Strotmann and Dangzhi Zhao. 2012. Author name
disambiguation: What difference does it make in author-based citation
analysis? Journal of the American Society for Information Science and
Technology 63, 9 (2012), 1820–1833.
[14] Jie Tang, Alvis CM Fong, Bo Wang, and Jing Zhang. 2012. A unified
probabilistic framework for name disambiguation in digital library. IEEE
Transactions on Knowledge and Data Engineering 24, 6 (2012), 975–987.
[15] Vetle I Torvik and Neil R Smalheiser. 2009. Author name disambiguation
in
MEDLINE. ACM Transactions on Knowledge Discovery from Data (TKDD) 3,
3 (2009), 11.
[16] Vetle I Torvik, Marc Weeber, Don R Swanson, and Neil R Smalheiser.
2005. A
probabilistic similarity metric for Medline records: A model for author name
disambiguation. Journal of the American Society for information science
and
technology 56, 2 (2005), 140–158.
Thank You!
This work was supported by the EU‘s Horizon
2020 programme under grant agreement
H2020-693092, the MOVING project:
www.moving-project.eu

Results: Discussion
45
 Primitive baseline: 1 block = 1 author
REALITY
size-1
clusters
poor results
using
baseline
BASELINE

Conclusion
46
stopping
criterion
development of
Precision +
Recall

Effective Unsupervised Author Disambiguation with Relative Frequencies

Recommended

Recommended

More Related Content

Similar to Effective Unsupervised Author Disambiguation with Relative Frequencies

Similar to Effective Unsupervised Author Disambiguation with Relative Frequencies (20)

More from MOVING Project

More from MOVING Project (20)

Recently uploaded

Recently uploaded (20)

Effective Unsupervised Author Disambiguation with Relative Frequencies