# Cluster stability

Senior Researcher at Centre for Science and Technology Studies
Apr. 20, 2015

• 1. Cluster stability
  Nees Jan van Eck and Ludo Waltman
  Centre for Science and Technology Studies (CWTS), Leiden University
  Workshop “Comparison of Algorithms”, Amsterdam, April 20, 2015
• 2. Problem statement
  • A clustering technique can be used to obtain highly detailed clustering results (i.e., a large number of clusters)
  • A clustering technique can be used to force each publication to be assigned to a cluster
  • However, in a highly detailed clustering, is the assignment of publications to clusters still meaningful?
  • The assignment of a publication to a cluster may be based on very little information (e.g., a single citation relation)
• 3. Example: Waltman and Van Eck (2012)
• 4. Cluster stability
  • To ensure that publications are assigned to clusters in a meaningful way, we introduce the notion of stable clusters
  • Essentially, a cluster is stable if it is insensitive to small changes in the underlying data
  • Bootstrapping is used to make small changes in the data
  • There is no formal statistical framework
  • To some extent, this resembles the stability intervals in the CWTS Leiden Ranking
• 5. Identification of stable clusters: Step 1
  • Collect the citation network of publications
  • Create a large number (e.g., 100) of bootstrap citation networks
  • In each bootstrap citation network, perform clustering:
    – Clustering technique of Waltman and Van Eck (2012)
    – User-defined resolution parameter
    – Smart local moving algorithm of Waltman and Van Eck (2013)
  • For each pair of publications, calculate the proportion of the bootstrap clustering results in which the publications are in the same cluster
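
The last bullet of Step 1 can be sketched as follows. This is a minimal illustration, not the authors' implementation: the clustering step itself is not reproduced, and the function name `co_clustering_proportions` and the toy cluster labels are assumptions made for the example. Each bootstrap result is taken as a dictionary mapping a publication to its cluster label.

```python
from collections import Counter, defaultdict
from itertools import combinations


def co_clustering_proportions(clusterings):
    """Return {(a, b): proportion} for pairs that share a cluster at least once.

    `clusterings` is a list with one dict per bootstrap run, mapping
    publication id -> cluster label. Pairs are stored with a < b.
    """
    counts = Counter()
    for assignment in clusterings:
        # Group publications by cluster label within this bootstrap run.
        members = defaultdict(list)
        for pub, label in assignment.items():
            members[label].append(pub)
        # Count every pair of publications that shares a cluster in this run.
        for pubs in members.values():
            for pair in combinations(sorted(pubs), 2):
                counts[pair] += 1
    n_runs = len(clusterings)
    return {pair: count / n_runs for pair, count in counts.items()}


# Hypothetical results of four bootstrap runs on four publications.
clusterings = [
    {"a": 0, "b": 0, "c": 1, "d": 1},
    {"a": 0, "b": 0, "c": 0, "d": 1},
    {"a": 1, "b": 1, "c": 2, "d": 2},
    {"a": 0, "b": 0, "c": 1, "d": 1},
]
proportions = co_clustering_proportions(clusterings)
```

Iterating over pairs within each cluster, rather than over all pairs of publications, avoids storing pairs that never share a cluster. Cluster labels only need to be consistent within a single run, since only co-membership is counted.
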
• 6. Identification of stable clusters: Step 2
  • Create a network of publications with an edge between two publications if the publications are in the same cluster in at least a certain proportion (e.g., 0.9) of the bootstrap clustering results
  • Identify connected components in the newly created network
  • Each connected component represents a stable cluster
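
Step 2 can be sketched with a union-find structure, under the assumption that the co-clustering proportions are already available as a dictionary keyed by publication pairs. The function name `stable_clusters`, the input layout, and the toy values are illustrative and not taken from the slides.

```python
def stable_clusters(publications, proportions, threshold=0.9):
    """Return the connected components of the thresholded co-clustering network.

    `proportions` maps (a, b) -> proportion of bootstrap runs in which
    publications a and b ended up in the same cluster.
    """
    parent = {p: p for p in publications}

    def find(p):
        # Follow parent links to the root, shortening the path on the way.
        while parent[p] != p:
            parent[p] = parent[parent[p]]
            p = parent[p]
        return p

    # An edge exists only if the pair is co-clustered often enough.
    for (a, b), proportion in proportions.items():
        if proportion >= threshold:
            parent[find(a)] = find(b)

    # Collect publications by the root of their component.
    components = {}
    for p in publications:
        components.setdefault(find(p), set()).add(p)
    return list(components.values())


# Hypothetical proportions for six publications.
publications = ["a", "b", "c", "d", "e", "f"]
proportions = {
    ("a", "b"): 1.0,
    ("b", "c"): 0.95,
    ("c", "d"): 0.5,   # below the threshold, so no edge
    ("d", "e"): 0.92,
}
clusters = stable_clusters(publications, proportions, threshold=0.9)
```

In this toy example, publication `f` has no edge above the threshold and ends up as a component of size one.
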
• 7. Non-parametric bootstrapping
  • Sample with replacement from the set of all citation relations between publications
  • Make sure to obtain a sample that is of the same size as the original set of citation relations
  • Some citation relations will occur multiple times in the sample, others won’t occur in it at all
  • Based on the sampled citation relations, create a bootstrap citation network
  • Edges have integer weights in this network
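
A minimal sketch of the resampling described above, using only the Python standard library. The function name `nonparametric_bootstrap` and the toy citation relations are assumptions made for illustration.

```python
import random
from collections import Counter


def nonparametric_bootstrap(edges, rng):
    """Resample citation relations with replacement.

    Returns a Counter mapping each sampled edge to its integer weight,
    i.e. the number of times it was drawn. Edges that were never drawn
    are absent from the result.
    """
    # Draw as many relations as the original set contains.
    sample = rng.choices(edges, k=len(edges))
    return Counter(sample)


# Hypothetical citation relations (citing publication, cited publication).
edges = [("p1", "p2"), ("p2", "p3"), ("p3", "p4"), ("p1", "p4")]
weights = nonparametric_bootstrap(edges, random.Random(42))
```

The total edge weight always equals the number of original citation relations, because the sample size is fixed.
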
• 8. Parametric bootstrapping
  • A bootstrap citation network is a weighted variant of the original citation network, with each edge having an integer weight drawn from a Poisson distribution with mean 1 (cf. Rosvall & Bergstrom, 2009)
  • Total edge weight in the bootstrap citation network will be approximately equal to the number of edges in the original network
  • For large networks, parametric and non-parametric bootstrapping coincide
  • We use parametric bootstrapping
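
A minimal sketch of the Poisson reweighting, using only the Python standard library. The sampler below uses Knuth's multiplication method, which is a choice made for this example and is not specified in the slides; the function names and the toy edges are likewise illustrative.

```python
import math
import random


def poisson_sample(lam, rng):
    """Draw one Poisson(lam) variate using Knuth's multiplication method."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


def parametric_bootstrap(edges, rng):
    """Give every original edge an independent Poisson(1) integer weight.

    Edges that draw weight 0 are dropped from the bootstrap network.
    """
    weights = {}
    for edge in edges:
        w = poisson_sample(1.0, rng)
        if w > 0:
            weights[edge] = w
    return weights


# Hypothetical citation relations (citing publication, cited publication).
edges = [("p1", "p2"), ("p2", "p3"), ("p3", "p4"), ("p1", "p4")]
weights = parametric_bootstrap(edges, random.Random(42))
```

The link to non-parametric bootstrapping is that, when m citation relations are resampled with replacement, the number of times a given relation is drawn follows a Binomial(m, 1/m) distribution, which approaches Poisson(1) as m grows.
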
• 9. Data
  • Library & Information Sciences (LIS):
    – Time period: 1996-2013
    – Publications: 31,534
    – Citation links: 131,266
  • Astrophysics (Berlin dataset):
    – Time period: 2003-2010
    – Publications: 101,828
    – Citation links: 924,171
• 10. Cluster stability: LIS
• 11. Stable clusters: LIS (resolution 2)
• 12. Stable clusters: LIS (resolution 2)
• 13. Cluster stability: Berlin
• 14. Cluster stability (panels: LIS, Berlin)
• 15. Conclusions
  • What is a good clustering of publications?
    – High accuracy: Publications in the same cluster are topically related
    – High level of detail: It is possible to have a large number of clusters
    – Comprehensiveness: The clustering includes all publications
    – Uniformity in cluster size: Clusters are of roughly the same size
  • It seems impossible to obtain a clustering that has all properties listed above
  • At least one property needs to be given up
• 16. Conclusions
  • Why can't we have an accurate and detailed clustering that includes all publications?
    – Consider the field of scientometrics
    – We would expect an accurate and detailed clustering to have clusters dealing with topics such as indicators, science mapping, collaboration, patents, etc.
    – However, many publications in scientometrics (e.g., case studies) do not neatly belong to one of these topics and therefore cannot be accurately assigned to a cluster
  • If we want to have an accurate and detailed clustering, we need to be satisfied with a clustering that doesn’t comprehensively cover all publications
  • The clustering covers only publications related to the main topics in the field
• 17. Conclusions
  • Analysis of cluster stability offers an approach to distinguish between meaningful and non-meaningful assignments of publications to clusters
  • Clustering based on direct citations is computationally attractive but ignores relevant information (e.g., bibliographic coupling)
  • A post-processing procedure can be developed to try to assign ‘isolated publications’ to stable clusters based on additional information
  • Cluster stability is a general idea that can also be applied to other clustering approaches
• 18. References
  • Rosvall, M., & Bergstrom, C.T. (2009). Mapping change in large networks. PLoS ONE, 5(1), e8694. http://dx.doi.org/10.1371/journal.pone.0008694
  • Waltman, L., & Van Eck, N.J. (2012). A new methodology for constructing a publication-level classification system of science. JASIST, 63(12), 2378-2392. http://dx.doi.org/10.1002/asi.22748
  • Waltman, L., & Van Eck, N.J. (2013). A smart local moving algorithm for large-scale modularity-based community detection. European Physical Journal B, 86(11), 471. http://dx.doi.org/10.1140/epjb/e2013-40829-0