Cluster stability
Nees Jan van Eck and Ludo Waltman
Centre for Science and Technology Studies (CWTS), Leiden University

To ensure that publications are assigned to clusters in a meaningful way, we introduce the notion of stable clusters. Essentially, a cluster is stable if it is insensitive to small changes in the underlying data. Bootstrapping is used to make small changes in the data.
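
The two bootstrapping variants described in the slides (non-parametric resampling of citation relations, and parametric Poisson weights with mean 1) could be sketched in Python roughly as follows. This is a minimal illustration, not the authors' implementation: the function names, the representation of the citation network as a list of `(citing, cited)` tuples, and the use of Knuth's method for the Poisson draws are all assumptions made for the example.

```python
import math
import random
from collections import Counter


def nonparametric_bootstrap(edges, rng):
    """Resample citation relations with replacement (same sample size).

    The weight of an edge is the number of times it was drawn; edges that
    were never drawn are absent from the result.
    """
    sample = rng.choices(edges, k=len(edges))
    return dict(Counter(sample))


def poisson_draw(mean, rng):
    """Draw one Poisson variate using Knuth's multiplication method."""
    limit = math.exp(-mean)
    count, product = 0, rng.random()
    while product > limit:
        count += 1
        product *= rng.random()
    return count


def parametric_bootstrap(edges, rng):
    """Give every original edge an independent Poisson(1) integer weight.

    Edges whose weight is 0 are dropped from the bootstrap network.
    """
    weights = {edge: poisson_draw(1.0, rng) for edge in edges}
    return {edge: w for edge, w in weights.items() if w > 0}


# Hypothetical toy citation network: a chain of 1,000 citation relations.
rng = random.Random(42)
toy_edges = [(i, i + 1) for i in range(1000)]
print(sum(nonparametric_bootstrap(toy_edges, rng).values()))
print(sum(parametric_bootstrap(toy_edges, rng).values()))
```

In the non-parametric variant the total edge weight equals the original number of citation relations exactly; in the parametric variant it does so only approximately, as the slides note.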

  1. Cluster stability
     Nees Jan van Eck and Ludo Waltman
     Centre for Science and Technology Studies (CWTS), Leiden University
     Workshop “Comparison of Algorithms”, Amsterdam, April 20, 2015
  2. Problem statement
     • A clustering technique can be used to obtain highly detailed clustering results (i.e., a large number of clusters)
     • A clustering technique can be used to force each publication to be assigned to a cluster
     • However, in a highly detailed clustering, is the assignment of publications to clusters still meaningful?
     • The assignment of a publication to a cluster may be based on very little information (e.g., a single citation relation)
  3. Example: Waltman and Van Eck (2012)
  4. Cluster stability
     • To ensure that publications are assigned to clusters in a meaningful way, we introduce the notion of stable clusters
     • Essentially, a cluster is stable if it is insensitive to small changes in the underlying data
     • Bootstrapping is used to make small changes in the data
     • There is no formal statistical framework
     • To some extent, this resembles the stability intervals in the CWTS Leiden Ranking
  5. Identification of stable clusters: Step 1
     • Collect the citation network of publications
     • Create a large number (e.g., 100) of bootstrap citation networks
     • In each bootstrap citation network, perform clustering:
       – Clustering technique of Waltman and Van Eck (2012)
       – User-defined resolution parameter
       – Smart local moving algorithm of Waltman and Van Eck (2013)
     • For each pair of publications, calculate the proportion of the bootstrap clustering results in which the publications are in the same cluster
  6. Identification of stable clusters: Step 2
     • Create a network of publications with an edge between two publications if the publications are in the same cluster in at least a certain proportion (e.g., 0.9) of the bootstrap clustering results
     • Identify connected components in the newly created network
     • Each connected component represents a stable cluster
  7. Non-parametric bootstrapping
     • Sample with replacement from the set of all citation relations between publications
     • Make sure to obtain a sample that is of the same size as the original set of citation relations
     • Some citation relations will occur multiple times in the sample, others won’t occur in it at all
     • Based on the sampled citation relations, create a bootstrap citation network
     • Edges have integer weights in this network
  8. Parametric bootstrapping
     • A bootstrap citation network is a weighted variant of the original citation network, with each edge having an integer weight drawn from a Poisson distribution with mean 1 (cf. Rosvall & Bergstrom, 2009)
     • Total edge weight in the bootstrap citation network will be approximately equal to the number of edges in the original network
     • For large networks, parametric and non-parametric bootstrapping coincide
     • We use parametric bootstrapping
  9. Data
     • Library & Information Sciences (LIS):
       – Time period: 1996-2013
       – Publications: 31,534
       – Citation links: 131,266
     • Astrophysics (Berlin dataset):
       – Time period: 2003-2010
       – Publications: 101,828
       – Citation links: 924,171
  10. Cluster stability LIS
  11. Stable clusters LIS (resolution 2)
  12. Stable clusters LIS (resolution 2)
  13. Cluster stability Berlin
  14. Cluster stability (panels: LIS, Berlin)
  15. Conclusions
      • What is a good clustering of publications?
        – High accuracy: Publications in the same cluster are topically related
        – High level of detail: It is possible to have a large number of clusters
        – Comprehensiveness: The clustering includes all publications
        – Uniformity in cluster size: Clusters are of roughly the same size
      • It seems impossible to obtain a clustering that has all properties listed above
      • At least one property needs to be given up
  16. Conclusions
      • Why can we not have an accurate and detailed clustering that includes all publications?
        – Consider the field of scientometrics
        – We would expect an accurate and detailed clustering to have clusters dealing with topics such as indicators, science mapping, collaboration, patents, etc.
        – However, many publications in scientometrics (e.g., case studies) do not neatly belong to one of these topics and therefore cannot be accurately assigned to a cluster
      • If we want to have an accurate and detailed clustering, we need to be satisfied with a clustering that doesn’t comprehensively cover all publications
      • The clustering covers only publications related to the main topics in the field
  17. Conclusions
      • Analysis of cluster stability offers an approach to distinguish between meaningful and non-meaningful assignments of publications to clusters
      • Clustering based on direct citations is computationally attractive but ignores relevant information (e.g., bibliographic coupling)
      • A post-processing procedure can be developed to try to assign ‘isolated publications’ to stable clusters based on additional information
      • Cluster stability is a general idea that can also be applied to other clustering approaches
  18. References
      Rosvall, M., & Bergstrom, C.T. (2009). Mapping change in large networks. PLoS ONE, 5(1), e8694. http://dx.doi.org/10.1371/journal.pone.0008694
      Waltman, L., & Van Eck, N.J. (2012). A new methodology for constructing a publication-level classification system of science. JASIST, 63(12), 2378-2392. http://dx.doi.org/10.1002/asi.22748
      Waltman, L., & Van Eck, N.J. (2013). A smart local moving algorithm for large-scale modularity-based community detection. European Physical Journal B, 86(11), 471. http://dx.doi.org/10.1140/epjb/e2013-40829-0
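
Steps 1 and 2 of the identification procedure (slides 5 and 6), combined with the parametric bootstrapping of slide 8, might be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' code. In particular, `placeholder_clustering` is a hypothetical stand-in that simply takes the connected components of the edges that survived resampling; it is not the clustering technique of Waltman and Van Eck (2012), has no resolution parameter, and ignores edge weights. All function names and the toy network are made up for the example, and the pairwise counting is quadratic in cluster size, so it only suits small inputs.

```python
import math
import random
from collections import defaultdict
from itertools import combinations


def connected_components(nodes, edges):
    """Return the components of an undirected graph as sorted lists."""
    neighbours = defaultdict(set)
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)
    seen, components = set(), []
    for start in nodes:
        if start in seen:
            continue
        seen.add(start)
        stack, component = [start], []
        while stack:
            node = stack.pop()
            component.append(node)
            for other in neighbours[node] - seen:
                seen.add(other)
                stack.append(other)
        components.append(sorted(component))
    return components


def poisson_one(rng):
    """Draw a Poisson variate with mean 1 (Knuth's method)."""
    limit, count, product = math.exp(-1.0), 0, rng.random()
    while product > limit:
        count += 1
        product *= rng.random()
    return count


def placeholder_clustering(nodes, weighted_edges):
    """ASSUMPTION: stand-in clusterer, not the method used in the slides."""
    return connected_components(nodes, weighted_edges.keys())


def stable_clusters(nodes, edges, n_bootstrap=100, threshold=0.9,
                    cluster=placeholder_clustering, seed=0):
    rng = random.Random(seed)
    together = defaultdict(int)  # pair -> number of runs in the same cluster
    for _ in range(n_bootstrap):
        # Step 1: build a bootstrap network and cluster it.
        weights = {edge: poisson_one(rng) for edge in edges}
        network = {edge: w for edge, w in weights.items() if w > 0}
        for members in cluster(nodes, network):
            for pair in combinations(members, 2):
                together[pair] += 1
    # Step 2: link pairs that co-occur often enough, then take components.
    stable_edges = [pair for pair, n in together.items()
                    if n / n_bootstrap >= threshold]
    return connected_components(nodes, stable_edges)


# Hypothetical toy network: two dense groups of 8 publications joined by one
# citation relation, plus publication 16 attached by a single relation.
nodes = list(range(17))
edges = (list(combinations(range(8), 2))
         + list(combinations(range(8, 16), 2))
         + [(7, 8), (15, 16)])
print(stable_clusters(nodes, edges))
```

In this toy setting the two dense groups should come out as separate stable clusters, while publication 16, whose assignment rests on a single citation relation, should remain on its own. Passing a different `cluster` function is how a real clustering technique would be plugged in.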
