On cluster stability

Nees Jan van Eck, Senior Researcher at Centre for Science and Technology Studies
Centre for Science and Technology Studies (CWTS), Leiden University
15th International Conference on Scientometrics & Informetrics
Istanbul, Turkey, June 30, 2015
Introduction
• A clustering technique can be used to obtain highly
detailed clustering results (i.e., a large number of
clusters)
• A clustering technique can be used to force each
publication to be assigned to a cluster
• However, in a highly detailed clustering, is the
assignment of publications to clusters still meaningful?
Example: Waltman and Van Eck (2012)
Cluster stability
• To ensure that publications are assigned to clusters in a
meaningful way, we introduce the notion of stable
clusters
• Essentially, a cluster is stable if it is insensitive to small
changes in the underlying data
• Bootstrapping is used to make small changes in the data
Identification of stable clusters: Step 1
• Collect the citation network of publications
• Create a large number (e.g., 100) of bootstrap citation
networks:
– A bootstrap citation network is a weighted variant of the original citation
network in which each edge has an integer weight drawn from a
Poisson distribution with mean 1 (cf. Rosvall & Bergstrom, 2009)
• In each bootstrap citation network, perform clustering
• For each pair of publications, calculate the proportion of
the bootstrap clustering results in which the publications
are in the same cluster
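Step 1 can be sketched in code as follows. This is a minimal illustration under stated assumptions, not the implementation behind the slides: the function names, the toy network, and in particular `components_clustering` (a deliberately crude stand-in that clusters by connected components over positive-weight edges) are hypothetical, and any real clustering technique can be passed in as `cluster_fn` instead. The Poisson weights with mean 1 approximate ordinary bootstrap resampling of edges, since resampling m edges with replacement picks each edge a Binomial(m, 1/m) number of times.

```python
import math
import random
from itertools import combinations


def poisson(rng, mean=1.0):
    """Draw one Poisson variate using Knuth's multiplication method."""
    limit = math.exp(-mean)
    k, p = 0, rng.random()
    while p > limit:
        k += 1
        p *= rng.random()
    return k


def bootstrap_weights(edges, rng):
    """Give every edge an integer weight drawn from Poisson(1)."""
    return {edge: poisson(rng) for edge in edges}


def components_clustering(nodes, weights):
    """Stand-in clustering (an assumption, not the method in the slides):
    connected components over edges with positive bootstrap weight.
    Returns a dict mapping each node to a cluster label."""
    parent = {n: n for n in nodes}

    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]  # path halving
            n = parent[n]
        return n

    for (a, b), w in weights.items():
        if w > 0:
            parent[find(a)] = find(b)
    return {n: find(n) for n in nodes}


def co_clustering_proportions(nodes, edges, cluster_fn, n_bootstrap=100, seed=0):
    """For each pair of nodes, the share of bootstrap clusterings in which
    both nodes end up in the same cluster. Enumerates all pairs, so it is
    only practical for small illustrative networks."""
    rng = random.Random(seed)
    counts = {pair: 0 for pair in combinations(sorted(nodes), 2)}
    for _ in range(n_bootstrap):
        labels = cluster_fn(nodes, bootstrap_weights(edges, rng))
        for a, b in counts:
            if labels[a] == labels[b]:
                counts[(a, b)] += 1
    return {pair: c / n_bootstrap for pair, c in counts.items()}


# Toy citation network (hypothetical data): a chain A - B - C - D.
nodes = ["A", "B", "C", "D"]
edges = [("A", "B"), ("B", "C"), ("C", "D")]
proportions = co_clustering_proportions(nodes, edges, components_clustering)
```

Passing the clustering technique as a parameter keeps the bootstrap logic independent of the algorithm choice; for networks of realistic size, one would likely count co-assignments only for pairs that share a cluster in at least one run rather than enumerating every pair.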
[Figure: illustration of Step 1. Panels: Original network; Bootstrap networks (integer edge weights); Clustered bootstrap networks; Weighted network (edge weights are co-clustering proportions, here between 0.1 and 1.0).]
Identification of stable clusters: Step 2
• Create a network of publications with an edge between
two publications if the publications are in the same
cluster in at least a certain proportion (e.g., 0.9) of the
bootstrap clustering results
• Identify connected components in the newly created
network
• Each connected component represents a stable cluster
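Step 2 amounts to thresholding the pairwise proportions and finding connected components. The sketch below assumes the proportions are available as a dictionary keyed by publication pairs; the function name `stable_clusters` and the example values are hypothetical. Publications with no sufficiently strong link come out as singleton components, and treating those as "not assigned to a stable cluster" is an interpretation rather than something the slides spell out.

```python
def stable_clusters(proportions, threshold=0.9):
    """Link two publications when their co-clustering proportion is at
    least `threshold`, then return the connected components of that
    binary network as a list of sets."""
    adjacency = {}
    for (a, b), p in proportions.items():
        adjacency.setdefault(a, set())
        adjacency.setdefault(b, set())
        if p >= threshold:
            adjacency[a].add(b)
            adjacency[b].add(a)

    seen, clusters = set(), []
    for start in sorted(adjacency):
        if start in seen:
            continue
        # Depth-first traversal collecting one connected component.
        component, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(adjacency[node] - component)
        seen |= component
        clusters.append(component)
    return clusters


# Hypothetical co-clustering proportions for five publications.
proportions = {
    ("A", "B"): 1.0, ("A", "C"): 0.9, ("B", "C"): 0.9,
    ("C", "D"): 0.4, ("D", "E"): 0.9, ("A", "E"): 0.1,
}
clusters = stable_clusters(proportions)
```

With the default threshold of 0.9, publications A, B, and C form one stable cluster and D and E form another; raising the threshold splits clusters further, so the threshold controls how strict the notion of stability is.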
[Figure: illustration of Step 2. Panels: Weighted network (co-clustering proportions); Binary network (edges at or above the threshold); Connected components; Stable clusters.]
Data
• Library & Information Sciences (LIS):
– Time period: 1996-2013
– Publications: 31,534
– Citation links: 131,266
• Astrophysics (Berlin dataset):
– Time period: 2003-2010
– Publications: 101,828
– Citation links: 924,171
Cluster stability LIS
Stable clusters LIS (resolution 2)
Stable clusters LIS (resolution 2)
Cluster stability Berlin
Cluster stability: LIS and Berlin
Conclusions
• If we want to have an accurate and detailed clustering,
we need to be satisfied with a clustering that doesn’t
comprehensively cover all publications
• Publications that do not clearly belong to one of the main
topics in a field cannot be assigned to a cluster
• Cluster stability analysis can be used to distinguish
between meaningful and non-meaningful assignments of
publications to clusters
Thank you for your attention!
References
Rosvall, M., & Bergstrom, C.T. (2009). Mapping change in large
networks. PLoS ONE, 5(1), e8694.
http://dx.doi.org/10.1371/journal.pone.0008694
Waltman, L., & Van Eck, N.J. (2012). A new methodology for
constructing a publication-level classification system of
science. JASIST, 63(12), 2378-2392.
http://dx.doi.org/10.1002/asi.22748
Waltman, L., & Van Eck, N.J. (2013). A smart local moving
algorithm for large-scale modularity-based community
detection. European Physical Journal B, 86(11), 471.
http://dx.doi.org/10.1140/epjb/e2013-40829-0