

To ensure that publications are assigned to clusters in a meaningful way, we introduce the notion of stable clusters. Essentially, a cluster is stable if it is insensitive to small changes in the underlying data. Bootstrapping is used to make small changes in the data.


- 1. Cluster stability
  Nees Jan van Eck and Ludo Waltman
  Centre for Science and Technology Studies (CWTS), Leiden University
  Workshop “Comparison of Algorithms”, Amsterdam, April 20, 2015
- 2. Problem statement
  • A clustering technique can be used to obtain highly detailed clustering results (i.e., a large number of clusters)
  • A clustering technique can be used to force each publication to be assigned to a cluster
  • However, in a highly detailed clustering, is the assignment of publications to clusters still meaningful?
  • The assignment of a publication to a cluster may be based on very little information (e.g., a single citation relation)
- 3. Example: Waltman and Van Eck (2012)
- 4. Cluster stability
  • To ensure that publications are assigned to clusters in a meaningful way, we introduce the notion of stable clusters
  • Essentially, a cluster is stable if it is insensitive to small changes in the underlying data
  • Bootstrapping is used to make small changes in the data
  • There is no formal statistical framework
  • To some extent, this resembles the stability intervals in the CWTS Leiden Ranking
- 5. Identification of stable clusters: Step 1
  • Collect the citation network of publications
  • Create a large number (e.g., 100) of bootstrap citation networks
  • In each bootstrap citation network, perform clustering:
    – Clustering technique of Waltman and Van Eck (2012)
    – User-defined resolution parameter
    – Smart local moving algorithm of Waltman and Van Eck (2013)
  • For each pair of publications, calculate the proportion of the bootstrap clustering results in which the publications are in the same cluster
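The last bullet of Step 1 can be illustrated with a minimal Python sketch. It assumes each bootstrap clustering result is already available as a dictionary from publication id to cluster label; the function name `co_clustering_proportions` and the toy data are hypothetical and not taken from the slides. The clustering itself (Waltman and Van Eck, 2012, 2013) is not reproduced here.

```python
from collections import Counter
from itertools import combinations


def co_clustering_proportions(clusterings):
    """Return {(u, v): share of clusterings in which u and v share a cluster}.

    `clusterings` is a list of dicts (publication id -> cluster label), one
    dict per bootstrap run. Pairs that never share a cluster are omitted.
    """
    counts = Counter()
    for assignment in clusterings:
        # Group the publications of this run by cluster label.
        members = {}
        for pub, label in assignment.items():
            members.setdefault(label, []).append(pub)
        # Count every within-cluster pair once, in a canonical (sorted) order.
        for pubs in members.values():
            for u, v in combinations(sorted(pubs), 2):
                counts[(u, v)] += 1
    n_runs = len(clusterings)
    return {pair: count / n_runs for pair, count in counts.items()}


# Toy example: four bootstrap runs over three publications (labels are arbitrary).
runs = [
    {"a": 0, "b": 0, "c": 1},
    {"a": 5, "b": 5, "c": 5},
    {"a": 2, "b": 2, "c": 3},
    {"a": 1, "b": 0, "c": 0},
]
proportions = co_clustering_proportions(runs)
```

Enumerating all pairs inside a cluster is quadratic in cluster size, so this sketch is only practical for small clusters or small data sets.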
- 6. Identification of stable clusters: Step 2
  • Create a network of publications with an edge between two publications if the publications are in the same cluster in at least a certain proportion (e.g., 0.9) of the bootstrap clustering results
  • Identify connected components in the newly created network
  • Each connected component represents a stable cluster
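Step 2 amounts to thresholding the pairwise proportions and finding connected components. The following Python sketch uses a union-find structure for that; the function name `stable_clusters`, the input format, and the example values are assumptions made for illustration. Singleton components are returned as well, and whether those count as stable clusters or as isolated publications is left to the caller.

```python
def stable_clusters(publications, proportions, threshold=0.9):
    """Group publications into connected components of the threshold network.

    `proportions` maps a pair (u, v) to the share of bootstrap clusterings in
    which u and v share a cluster. Pairs with share >= threshold become edges.
    """
    parent = {p: p for p in publications}

    def find(p):
        # Follow parent links to the root, shortening the path on the way.
        while parent[p] != p:
            parent[p] = parent[parent[p]]
            p = parent[p]
        return p

    for (u, v), share in proportions.items():
        if share >= threshold:
            root_u, root_v = find(u), find(v)
            if root_u != root_v:
                parent[root_u] = root_v

    components = {}
    for p in publications:
        components.setdefault(find(p), set()).add(p)
    return list(components.values())


# Hypothetical proportions for six publications.
example_proportions = {
    ("a", "b"): 0.95,
    ("b", "c"): 0.92,
    ("c", "d"): 0.40,
    ("d", "e"): 0.90,
}
clusters = stable_clusters(["a", "b", "c", "d", "e", "f"], example_proportions)
```

Note that connected components are transitive: in the example, "a" and "c" end up together through "b" even though no proportion is given for that pair directly.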
- 7. Non-parametric bootstrapping
  • Sample with replacement from the set of all citation relations between publications
  • Make sure to obtain a sample that is of the same size as the original set of citation relations
  • Some citation relations will occur multiple times in the sample, others won’t occur in it at all
  • Based on the sampled citation relations, create a bootstrap citation network
  • Edges have integer weights in this network
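As a minimal sketch of the non-parametric procedure, the Python snippet below resamples a list of citation relations with replacement and counts how often each relation is drawn. Representing a citation relation as a tuple of two publication ids, and the function name `nonparametric_bootstrap`, are assumptions for illustration.

```python
import random
from collections import Counter


def nonparametric_bootstrap(citations, rng):
    """Resample citation relations with replacement, keeping the sample size.

    Returns a Counter mapping each sampled relation to its integer weight,
    i.e. the number of times it was drawn. Relations never drawn are absent.
    """
    sample = rng.choices(citations, k=len(citations))
    return Counter(sample)


# Toy citation relations as (citing, cited) pairs; ids are made up.
citations = [("p1", "p2"), ("p1", "p3"), ("p2", "p3"), ("p4", "p1")]
network = nonparametric_bootstrap(citations, random.Random(42))
```

The total edge weight of the result always equals the number of original citation relations, because exactly that many draws are made.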
- 8. Parametric bootstrapping
  • A bootstrap citation network is a weighted variant of the original citation network, with each edge having an integer weight drawn from a Poisson distribution with mean 1 (cf. Rosvall & Bergstrom, 2009)
  • Total edge weight in the bootstrap citation network will be approximately equal to the number of edges in the original network
  • For large networks, parametric and non-parametric bootstrapping become practically equivalent
  • We use parametric bootstrapping
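A minimal Python sketch of the parametric variant follows. It draws one Poisson(1) weight per edge using Knuth's multiplication method so that only the standard library is needed; the function names are hypothetical, and dropping edges whose weight is 0 is an assumption about how such edges would be handled.

```python
import math
import random


def poisson_mean_one(rng):
    """Draw one value from a Poisson distribution with mean 1 (Knuth's method)."""
    limit = math.exp(-1.0)
    count, product = 0, rng.random()
    while product > limit:
        count += 1
        product *= rng.random()
    return count


def parametric_bootstrap(citations, rng):
    """Give each original edge an independent Poisson(1) integer weight.

    Edges that draw weight 0 are left out of the result (an assumption here).
    """
    weights = {}
    for edge in citations:
        weight = poisson_mean_one(rng)
        if weight > 0:
            weights[edge] = weight
    return weights


# Synthetic chain of 10,000 citation relations, used only to show the totals.
rng = random.Random(0)
citations = [(f"p{i}", f"p{i + 1}") for i in range(10_000)]
network = parametric_bootstrap(citations, rng)
total_weight = sum(network.values())
kept_share = len(network) / len(citations)
```

With m original edges, the non-parametric count for a single edge follows a Binomial(m, 1/m) distribution, which approaches Poisson(1) as m grows. That is why the two schemes behave alike on large networks, and why about a fraction 1 - 1/e (roughly 63%) of the edges is retained in either case.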
- 9. Data
  • Library & Information Sciences (LIS):
    – Time period: 1996-2013
    – Publications: 31,534
    – Citation links: 131,266
  • Astrophysics (Berlin dataset):
    – Time period: 2003-2010
    – Publications: 101,828
    – Citation links: 924,171
- 10. Cluster stability LIS
- 11. Stable clusters LIS (resolution 2)
- 12. Stable clusters LIS (resolution 2)
- 13. Cluster stability Berlin
- 14. Cluster stability (panels: LIS, Berlin)
- 15. Conclusions
  • What is a good clustering of publications?
    – High accuracy: Publications in the same cluster are topically related
    – High level of detail: It is possible to have a large number of clusters
    – Comprehensiveness: The clustering includes all publications
    – Uniformity in cluster size: Clusters are of roughly the same size
  • It seems impossible to obtain a clustering that has all properties listed above
  • At least one property needs to be given up
- 16. Conclusions
  • Why can we not have an accurate and detailed clustering that includes all publications?
    – Consider the field of scientometrics
    – We would expect an accurate and detailed clustering to have clusters dealing with topics such as indicators, science mapping, collaboration, patents, etc.
    – However, many publications in scientometrics (e.g., case studies) do not neatly belong to one of these topics and therefore cannot be accurately assigned to a cluster
  • If we want to have an accurate and detailed clustering, we need to be satisfied with a clustering that doesn’t comprehensively cover all publications
  • The clustering covers only publications related to the main topics in the fields
- 17. Conclusions
  • Analysis of cluster stability offers an approach to distinguish between meaningful and non-meaningful assignments of publications to clusters
  • Clustering based on direct citations is computationally attractive but ignores relevant information (e.g., bibliographic coupling)
  • A post-processing procedure can be developed to try to assign ‘isolated publications’ to stable clusters based on additional information
  • Cluster stability is a general idea that can also be applied to other clustering approaches
- 18. References
  Rosvall, M., & Bergstrom, C.T. (2009). Mapping change in large networks. PLoS ONE, 5(1), e8694. http://dx.doi.org/10.1371/journal.pone.0008694
  Waltman, L., & Van Eck, N.J. (2012). A new methodology for constructing a publication-level classification system of science. JASIST, 63(12), 2378-2392. http://dx.doi.org/10.1002/asi.22748
  Waltman, L., & Van Eck, N.J. (2013). A smart local moving algorithm for large-scale modularity-based community detection. European Physical Journal B, 86(11), 471. http://dx.doi.org/10.1140/epjb/e2013-40829-0