Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

Community detection using citation relations and textual similarities in a large set of PubMed publications

17 views

Published on

Presentation at the 17th International Conference on Scientometrics & Informetrics, Rome, Italy, September 4, 2019.

Published in: Science
  • Be the first to comment

  • Be the first to like this

Community detection using citation relations and textual similarities in a large set of PubMed publications

  1. 1. Community detection using citation relations and textual similarities in a large set of PubMed publications Per Ahlgren, Yunwei Chen, Cristian Colliander, and Nees Jan van Eck
  2. 2. Publication-level classification system 2 Social sciences and humanities Biomedical and health sciences Life and earth sciences Mathematics and computer science Physical sciences and engineering
  3. 3. Introduction Purpose of the study: To analyze whether clustering accuracy can be improved by combining direct citations with indirect citation relations or text relations.
  4. 4. Introduction • We compare 6 publication clustering approaches. • The main difference between them is how the relatedness of publications is defined. • We build on, and were inspired by, two studies presented at the ISSI conference in Wuhan 2017: – Chen et. al (2017). A weighted method for citation network community detection. – Waltman et al. (2017). A principled methodology for comparing relatedness measures for clustering publications.
  5. 5. Data and methods • Five-year publication period, 2013-2017. • About 4 million publications were retrieved from MEDLINE, the largest subset of PubMed. • PubMed does not contain citation relations between publications. Therefore, we also used Web of Science (WoS) data. – Each publication was matched to a publication included in the in-house version of the WoS database available at the Centre for Science and Technology Studies (CWTS) at Leiden University.
  6. 6. Data and methods – About 3.5 million publications remained after matching. – From these publications, we selected each publication p such that p satisfies each of the following four conditions: 1. p has a WoS publication year in the period 2013-2017. 2. p is of WoS document type Article or Review. 3. p has both an abstract and a title with respect to its WoS record. 4. p has a citation relation to at least one publication p’ such that p’ satisfies points 1-3 in this list. – About 3 million publications finally obtained.
  7. 7. Data and methods • Investigated relatedness measures – Direct citations (DC). The relatedness of two publications i and j is 1 if there is a direct citation from i to j or such a relation from j to i, otherwise the relatedness is 0. – Bibliographic coupling (BC). The relatedness of i and j is defined as the number of shared cited references in i and j. – Co-citation (CC). The relatedness of i and j is defined as the number of publications that cite both i and j. – BM25. Terms (noun phrases) in the titles and abstracts of the publications are used to represent the publications. The approach involves the BM25 measure, a well- known query-publication similarity measure in information retrieval research. • The value of the measure for i with j is a sum across all unique terms in the dataset, where the number of occurrences of a term in i and j, the inverse document frequency of the term and the length of j are taken into account.
  8. 8. Data and methods – DC-BC-CC. In this approach, direct citations are enhanced by the citation relations corresponding to the approaches BC and CC. We define relatedness of i and j, 𝑟𝑖𝑗 DC−BC−CC , as 𝑟𝑖𝑗 DC−BC−CC = 𝛼𝑟𝑖𝑗 DC + 𝑟𝑖𝑗 BC + 𝑟𝑖𝑗 CC where 𝛼 is a weight of direct citations relative to BC and CC. In our analysis, we use 1 and 5 as values of 𝛼.
  9. 9. Data and methods – DC-BM25. In this approach, direct citations are enhanced by the text relations. We define relatedness of i and j, 𝑟𝑖𝑗 DC−BM25 , as 𝑟𝑖𝑗 DC−BM25 = 𝛼𝑟𝑖𝑗 DC + 𝑟𝑖𝑗 BM25 where 𝛼 is a weight of direct citations relative to BM25. The average across all BM25 relatedness values greater than 0 was calculated, an average that turned out to be equal to 50. By setting 𝛼 to 50, the DC values are put on the same scale as the BM25 relatedness values, in an average sense. By setting 𝛼 to 25 (100), less (more) emphasis would be put on DC. We use all these three 𝛼 values in our analysis.
  10. 10. Data and methods • Clustering of publications – In this study, we use the Leiden algorithm (Traag et al., 2018a, 2018b) to generate a series of clustering solutions for each of the relatedness measures. The Leiden algorithm is used to maximize the Constant Potts Model as quality function (Traag et al., 2011; Waltman & Van Eck, 2012). – Using different values of the resolution parameter (0.000001, 0.000002, 0.000005, 0.00001, 0.00002, 0.00005, 0.0001, 0.0002, 0.0005, 0.001, 0.002, 0.005, 0.01), we obtain 13 clustering solutions for each relatedness measure.
  11. 11. Data and methods • Evaluation of approach performance – We use the evaluation framework of Waltman et al. (2017, 2019). – A relatedness measure based on MeSH terms is used as an independent evaluation measure to compare the accuracy of clustering solutions produced by all approaches. – MeSH is a detailed item-level subject classification scheme. • MeSH descriptors (more than 28 thousand) and subheadings are used to index publications in PubMed. • Approximately 80 subheadings (or qualifiers) can be used by the indexer to qualify a descriptor.
  12. 12. Data and methods – The accuracy of the kth (1 ≤ k ≤ 13) clustering solution for 𝑋 ∈ {DC, BC, CC, BM25, DC-BC-CC, DC-BM25, MeSH}, where the accuracy is based on MeSH cosine similarity, symbolically 𝐴 𝑋 𝑘|MeSH, is defined as follows (Waltman et al., 2017, 2019): 𝐴 𝑋 𝑘|MeSH = 1 𝑁 𝑖,𝑗 𝐼(𝑐𝑖 𝑋 𝑘 = 𝑐𝑗 𝑋 𝑘 )𝑟𝑖𝑗 MeSH where N is the number of publications in the dataset, 𝑐𝑖 𝑋 𝑘 a positive integer denoting the cluster to which publication i belongs with respect to the kth clustering solution for X, 𝐼(𝑐𝑖 𝑋 𝑘 = 𝑐𝑗 𝑋 𝑘 ) is 1 if its condition is true, otherwise 0, and 𝑟𝑖𝑗 MeSH (norm) the normalized MeSH cosine similarity of i with j.
  13. 13. Results • We visualize the evaluation results by using granularity- accuracy (GA) plots (Waltman et al., 2017, 2019). • We present three figures containing GA plots. – DC and the other citation-based approaches – DC and the text-based approaches – DC and best performing approaches
  14. 14. Results
  15. 15. Results
  16. 16. Results
  17. 17. Conclusions and future research • Enhancing direct citations with indirect citation relations (BC- CC) or text relations (BM25) gives rise to substantial performance gains relative to direct citations • Combination of direct citations and text (BM25) performs best • These results assume that MeSH terms serve as an appropriate evaluation measure
  18. 18. Conclusions and future research • An extended version of our paper has been submitted to the journal Quantitative Science Studies. – One more approach is added: extended direct citations (EDC). – EDC shows the best performance.
  19. 19. Conclusions and future research • It does not follow that two cluster solutions with similar accuracy also have similar groupings of publications into clusters. In view of this, in future studies we aim to further compare the clustering solutions to deepen the insight into how the clustering solutions based on different relatedness measures differ from each other.
  20. 20. Thank you for your attention!
  21. 21. Results extended journal paper

×