Evaluating the Use of Clustering for Automatically Organising Digital Library Collections

Presentation given by Mark M. Hall, Mark Stevenson and Paul D. Clough from the Information School / Department of Computer Science, University of Sheffield, UK
24-27 September 2012
TPDL 2012, Cyprus


  1. Evaluating the Use of Clustering for Automatically Organising Digital Library Collections
       Mark M. Hall, Mark Stevenson, Paul D. Clough
       TPDL 2012, Cyprus, 24-27 September 2012
  2. Opening Up Digital Cultural Heritage
       http://www.flickr.com/photos/brokenthoughts/122096903/
       Carl Collins http://www.flickr.com/photos/carlcollins/199792939/
       http://www.flickr.com/photos/usnationalarchives/4069633668/
  3. Exploring Collections
       • Exploring / Browsing as an alternative to Search (where applicable)
       • Requires some kind of structuring of the data
       • Manual structuring ideal
           – Expensive to generate
           – Integration of collections problematic
       • Alternative: Automatic structuring via clustering
  4. Test Collection
       • 28133 photographs provided by the University of St Andrews Library
           – 85% pre-1940
           – 89% black and white
           – Majority UK
           – Title and description tend to be short
       [Image caption: Ottery St Mary Church]
  5. Tested Clustering Strategies
       • Latent Dirichlet Allocation (LDA)
           – 300 & 900 topics
           – With and without Pointwise Mutual Information (PMI) filtering
       • K-Means
           – 900 clusters
           – TFIDF vectors & LDA topic vectors
       • OPTICS
           – 900 clusters
           – TFIDF vectors & LDA topic vectors
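
The slide does not say how the PMI filtering was applied to the LDA topics. As background only, the following is a minimal sketch of pointwise mutual information between two terms, assuming document-level co-occurrence counts. The function name `pmi` and the toy documents are hypothetical and are not taken from the presentation.

```python
import math


def pmi(term_a, term_b, documents):
    """Pointwise mutual information: log( p(a, b) / (p(a) * p(b)) ).

    Probabilities are estimated from document-level co-occurrence,
    which is an assumption made for this sketch.
    """
    docs = [set(doc) for doc in documents]
    n = len(docs)
    n_a = sum(term_a in doc for doc in docs)
    n_b = sum(term_b in doc for doc in docs)
    n_ab = sum(term_a in doc and term_b in doc for doc in docs)
    if n_ab == 0:
        # The terms never co-occur, so the PMI is negative infinity.
        return float("-inf")
    return math.log((n_ab / n) / ((n_a / n) * (n_b / n)))


# Hypothetical toy documents (lists of title/description terms).
toy_docs = [
    ["church", "tower"],
    ["church", "tower"],
    ["harbour", "boat"],
    ["harbour", "boat"],
]
print(pmi("church", "tower", toy_docs))
print(pmi("church", "boat", toy_docs))
```

Terms that tend to appear together score above zero, and terms that never co-occur score negative infinity. A filter built on such scores could drop topic terms that rarely co-occur with the rest of the topic, but that use is an assumption here.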
  6. Processing Time

       Model            Wall-clock Time (hh:mm:ss)
       LDA 300          00:21:48
       LDA 900          00:42:42
       LDA + PMI 300    05:05:13
       LDA + PMI 900    17:26:08
       K-Means TFIDF    09:37:40
       K-Means LDA      03:49:04
       OPTICS TFIDF     12:42:13
       OPTICS LDA       05:12:49
  7. Evaluation Metrics
       • Cluster cohesion
           – Items in a cluster should be similar to each other
           – Items in a cluster should be different from items in other clusters
       • How to test this?
           – “Intruder” test
           – If you insert an intruder into a cluster, can people find it?
  8. Intruder Test
       1. Randomly select one topic
       2. Randomly select four items from the topic
       3. Randomly select a second topic – the “intruder” topic
       4. Randomly select one item from the second topic – the “intruder” item
       5. Scramble the five items and let the user choose which one is the “intruder”
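
The five steps of the intruder test can be sketched as follows. This is a minimal illustration, assuming the clustering is available as a mapping from a cluster id to a list of item identifiers. The function name `build_intruder_unit` and the toy data are hypothetical.

```python
import random


def build_intruder_unit(clusters, rng=None):
    """Build one intruder-test unit: four items from one topic plus one intruder."""
    rng = rng or random.Random()
    # Step 1: randomly select one topic (it needs at least four items).
    eligible = [cid for cid, items in clusters.items() if len(items) >= 4]
    topic = rng.choice(eligible)
    # Step 2: randomly select four items from that topic.
    chosen = rng.sample(clusters[topic], 4)
    # Step 3: randomly select a second, different topic (the "intruder" topic).
    others = [cid for cid, items in clusters.items() if cid != topic and items]
    intruder_topic = rng.choice(others)
    # Step 4: randomly select one item from the second topic (the "intruder" item).
    intruder = rng.choice(clusters[intruder_topic])
    # Step 5: scramble the five items before showing them to the participant.
    unit = chosen + [intruder]
    rng.shuffle(unit)
    return {
        "items": unit,
        "intruder": intruder,
        "topic": topic,
        "intruder_topic": intruder_topic,
    }


# Hypothetical toy clustering: cluster id -> photograph identifiers.
toy_clusters = {
    "churches": ["c1", "c2", "c3", "c4", "c5"],
    "harbours": ["h1", "h2", "h3", "h4"],
    "portraits": ["p1", "p2"],
}
unit = build_intruder_unit(toy_clusters, random.Random(0))
print(unit["items"], "intruder:", unit["intruder"])
```

Passing a seeded `random.Random` makes the generated units reproducible, which helps when the same units must be shown to many participants.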
  9. Cluster Cohesion – Cohesive
  10. Cluster Cohesion – Not Cohesive
  11. Evaluation Metrics
       • Cohesive
           – “Intruder” is chosen significantly more frequently than by chance
           – Choice distribution is significantly different from the uniform distribution
       • Borderline cohesive
           – Two out of five items make up > 95% of the answers
           – “Intruder” is one of those two
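
The slide does not name the statistical tests or the significance level, so the following is only one possible reading of these criteria. It assumes a 5% level, an exact one-sided binomial test for the intruder (chance = 1/5), and a chi-square goodness-of-fit test against the uniform distribution (4 degrees of freedom, critical value 9.488). Checking the cohesive criteria before the borderline ones is also an assumption, as is the function name `classify_unit`.

```python
from math import comb

ALPHA = 0.05                # assumed significance level
CHI2_CRITICAL_DF4 = 9.488   # chi-square critical value, 4 d.f., alpha = 0.05


def binomial_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def classify_unit(counts, intruder_index):
    """Classify one unit from the number of votes each of its five items received."""
    assert len(counts) == 5, "a unit consists of five items"
    n = sum(counts)
    expected = n / 5
    # Goodness of fit against the uniform distribution over the five items.
    chi2 = sum((c - expected) ** 2 / expected for c in counts)
    non_uniform = chi2 > CHI2_CRITICAL_DF4
    # Is the intruder picked more often than the 1-in-5 chance level?
    above_chance = binomial_tail(counts[intruder_index], n, 0.2) < ALPHA
    if non_uniform and above_chance:
        return "cohesive"
    # Borderline: two items take more than 95% of the answers, one being the intruder.
    top_two = sorted(range(5), key=lambda i: counts[i], reverse=True)[:2]
    if sum(counts[i] for i in top_two) > 0.95 * n and intruder_index in top_two:
        return "borderline"
    return "non-cohesive"


# Hypothetical vote counts for three units with 27 ratings each.
print(classify_unit([1, 1, 1, 1, 23], intruder_index=4))
print(classify_unit([20, 7, 0, 0, 0], intruder_index=1))
print(classify_unit([6, 5, 5, 6, 5], intruder_index=0))
```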
  12. Evaluation Bounds
       • Upper bound
           – Manual annotation
               • 936 topics
       • Lower bound
           – 3 cohesive topics
           – <5% likelihood of seeing that number of cohesive topics by chance
       • Control data
           – 10 “really, totally, completely obvious” intruders used to filter participants who randomly select answers
  13. Experiment
       • Crowd-sourced using staff & students at Sheffield University
           – 700 participants
       • 9 clustering strategies
           – 30 units per strategy
           – total of 270 units
       • Results
           – 8840 ratings
           – 21 to 30 ratings per unit (median 27 ratings)
  14. Results

       Model            Cohesive   Borderline   Non-Cohesive
       Upper Bound      27         0            3
       Lower Bound      3          0            27
       LDA 300          15         6            9
       LDA 900          20         4            6
       LDA + PMI 300    16         4            10
       LDA + PMI 900    21         2            7
       K-Means TFIDF    24         3            3
       K-Means LDA      20         0            10
       OPTICS TFIDF     14         2            14
       OPTICS LDA       16         0            14
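
To compare the models more easily, the counts in the results table can be turned into proportions of the 30 units per strategy. The snippet below is a small sketch that only re-expresses the numbers from the slide. The helper name `shares` is hypothetical.

```python
# Counts transcribed from the results slide: (cohesive, borderline, non-cohesive).
RESULTS = {
    "Upper Bound": (27, 0, 3),
    "Lower Bound": (3, 0, 27),
    "LDA 300": (15, 6, 9),
    "LDA 900": (20, 4, 6),
    "LDA + PMI 300": (16, 4, 10),
    "LDA + PMI 900": (21, 2, 7),
    "K-Means TFIDF": (24, 3, 3),
    "K-Means LDA": (20, 0, 10),
    "OPTICS TFIDF": (14, 2, 14),
    "OPTICS LDA": (16, 0, 14),
}


def shares(counts):
    """Return (share cohesive, share cohesive or borderline) for one model."""
    cohesive, borderline, non_cohesive = counts
    total = cohesive + borderline + non_cohesive
    return cohesive / total, (cohesive + borderline) / total


for model, counts in RESULTS.items():
    strict, lenient = shares(counts)
    print(f"{model:<15} cohesive {strict:.0%}, cohesive or borderline {lenient:.0%}")
```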
  15. Conclusions
       • K-means almost as good as the human classification
       • LDA is very fast and approximately two thirds of the topics are acceptably cohesive
       • Future work:
           – Make it hierarchical
           – Create hybrid algorithms
  16. Thank you for listening
       Find out more about the project: http://www.paths-project.eu
       m.mhall@sheffield.ac.uk
       The research leading to these results has received funding from the European Community's Seventh Framework Programme (FP7/2007-2013) under grant agreement no 270082. We acknowledge the contribution of all project partners involved in PATHS (see: http://www.paths-project.eu).