Evaluating the Use of Clustering for Automatically Organising Digital Library Collections
Mark M. Hall, Mark Stevenson, Paul D. Clough


TPDL 2012, Cyprus, 24-27 September 2012
Opening Up Digital Cultural Heritage




Image credits:
http://www.flickr.com/photos/brokenthoughts/122096903/
Carl Collins, http://www.flickr.com/photos/carlcollins/199792939/
http://www.flickr.com/photos/usnationalarchives/4069633668/

Exploring Collections
• Exploring / Browsing as an alternative to
  Search (where applicable)
• Requires some kind of structuring of the
  data
• Manual structuring ideal
    – Expensive to generate
    – Integration of collections problematic
• Alternative: Automatic structuring via
  clustering

Test Collection
• 28133 photographs provided
  by the University of St
  Andrews Library
    – 85% pre 1940
    – 89% black and white
    – Majority UK
    – Title and description tend to be
      short
• [Example photograph: Ottery St Mary Church]


Tested Clustering Strategies
• Latent Dirichlet Allocation (LDA)
    – 300 & 900 topics
    – With and without Pointwise Mutual Information
      (PMI) filtering
• K-Means
    – 900 clusters
    – TFIDF vectors & LDA topic vectors
• OPTICS
    – 900 clusters
    – TFIDF vectors & LDA topic vectors

Processing Time
Model                                     Wall-clock Time (hh:mm:ss)
LDA 300                                   00:21:48
LDA 900                                   00:42:42
LDA + PMI 300                             05:05:13
LDA + PMI 900                             17:26:08
K-Means TFIDF                             09:37:40
K-Means LDA                               03:49:04
OPTICS TFIDF                              12:42:13
OPTICS LDA                                05:12:49



Evaluation Metrics
• Cluster cohesion
    – Items in a cluster should be similar to each
      other
    – Items in a cluster should be different from
      items in other clusters
• How to test this?
    – “Intruder” test
    – If you insert an intruder into a cluster, can
      people find it?

Intruder Test
1. Randomly select one topic
2. Randomly select four items from the topic
3. Randomly select a second topic – the
   “intruder” topic
4. Randomly select one item from the
   second topic – the “intruder” item
5. Scramble the five items and let the user
   choose which one is the “intruder”
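
The five steps above can be sketched in a few lines of Python. This is only an illustration, not the authors' implementation: the `clusters` dictionary layout, the function name `build_intruder_unit` and the assumption that every item belongs to exactly one cluster are all hypothetical.

```python
import random


def build_intruder_unit(clusters, rng=random):
    """Build one intruder-test unit.

    clusters: dict mapping a cluster/topic id to a list of item ids
    (hypothetical layout; each item is assumed to sit in one cluster only).
    """
    # Step 1: pick a topic that has at least four items to draw from.
    eligible = [c for c, items in clusters.items() if len(items) >= 4]
    topic = rng.choice(eligible)
    # Step 2: pick four distinct items from that topic.
    members = rng.sample(clusters[topic], 4)
    # Step 3: pick a different, non-empty topic as the intruder topic.
    others = [c for c in clusters if c != topic and clusters[c]]
    intruder_topic = rng.choice(others)
    # Step 4: pick one item from the intruder topic.
    intruder = rng.choice(clusters[intruder_topic])
    # Step 5: scramble the five items before showing them to the user.
    items = members + [intruder]
    rng.shuffle(items)
    return {
        "topic": topic,
        "intruder_topic": intruder_topic,
        "items": items,
        "intruder": intruder,
    }
```

Passing a seeded `random.Random` instance as `rng` makes the generated units reproducible, which helps when the same units have to be shown to many participants.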

Cluster Cohesion – Cohesive




Cluster Cohesion – Not Cohesive




Evaluation Metrics
• Cohesive
    – “Intruder” is chosen significantly more
      frequently than by chance
    – Choice distribution is significantly different
      from the uniform distribution
• Borderline cohesive
    – Two out of five items make up > 95% of the
      answers
    – “Intruder” is one of those two
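
One possible way to turn these criteria into code is sketched below. The slides do not name the statistical tests, so several choices here are assumptions: a one-sided exact binomial test for "more frequently than by chance", a chi-square goodness-of-fit test against the uniform distribution, a 0.05 significance level, and checking the cohesive criteria before the borderline ones. The names `binom_tail` and `classify_unit` are hypothetical.

```python
from math import comb

# Chi-square critical value for 4 degrees of freedom at alpha = 0.05
# (five answer options). Assumption: the slides do not state the test used.
CHI2_CRIT_DF4_P05 = 9.488


def binom_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), computed exactly."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def classify_unit(counts, intruder_index, alpha=0.05):
    """Classify one unit from its answer counts.

    counts: five integers, how often each item was chosen as the intruder.
    intruder_index: position of the true intruder in counts.
    """
    n = sum(counts)
    expected = n / len(counts)
    # Is the choice distribution different from uniform?
    chi2 = sum((c - expected) ** 2 / expected for c in counts)
    # Is the true intruder chosen more often than chance (1 in 5)?
    p_intruder = binom_tail(counts[intruder_index], n, 1 / len(counts))
    if p_intruder < alpha and chi2 > CHI2_CRIT_DF4_P05:
        return "cohesive"
    # Borderline: two items take > 95% of answers, the intruder is one of them.
    top_two = sorted(range(len(counts)), key=lambda i: counts[i], reverse=True)[:2]
    if sum(counts[i] for i in top_two) > 0.95 * n and intruder_index in top_two:
        return "borderline"
    return "non-cohesive"
```

With 27 ratings per unit (the median reported for the experiment), a unit in which 20 participants pick the true intruder is classified as cohesive, while an evenly spread set of answers is classified as non-cohesive.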

Evaluation Bounds
• Upper bound
    – Manual annotation
         • 936 topics
• Lower bound
    – 3 cohesive topics
    – <5% likelihood of seeing that number of cohesive
      topics by chance
• Control data
    – 10 “really, totally, completely obvious” intruders
      used to filter participants who randomly select
      answers
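
The control-data filter could look roughly like the sketch below. The slides do not say how many of the 10 control intruders a participant had to find, so the `min_correct` threshold, the function name `filter_participants` and the input layout are assumptions made for illustration.

```python
def filter_participants(control_answers, min_correct=0.8):
    """Return the ids of participants who pass the control questions.

    control_answers: dict mapping a participant id to a list of booleans,
    True when the participant found the obvious intruder (hypothetical layout).
    min_correct: required fraction of correct control answers (assumed value).
    """
    return {
        pid
        for pid, answers in control_answers.items()
        if answers and sum(answers) / len(answers) >= min_correct
    }
```

Ratings from participants outside the returned set would then be dropped before the cohesion counts are computed.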


Experiment
• Crowd-sourced using staff & students at
  Sheffield University
    – 700 participants
• 9 clustering strategies
    – 30 units per strategy – total of 270 units
• Results
    – 8840 ratings
    – 21 – 30 ratings per unit (median 27 ratings)


Results
Model                        Cohesive     Borderline   Non-Cohesive
Upper Bound                  27           0            3
Lower Bound                  3            0            27
LDA 300                      15           6            9
LDA 900                      20           4            6
LDA + PMI 300                16           4            10
LDA + PMI 900                21           2            7
K-Means TFIDF                24           3            3
K-Means LDA                  20           0            10
OPTICS TFIDF                 14           2            14
OPTICS LDA                   16           0            14
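
To compare the rows more easily, the counts can be converted into the fraction of cohesive units per model. The short script below copies the numbers from the table; the helper name `cohesive_fraction` is hypothetical and the ranking is just one way to read the table.

```python
# Counts copied from the Results table: (cohesive, borderline, non-cohesive).
RESULTS = {
    "Upper Bound": (27, 0, 3),
    "Lower Bound": (3, 0, 27),
    "LDA 300": (15, 6, 9),
    "LDA 900": (20, 4, 6),
    "LDA + PMI 300": (16, 4, 10),
    "LDA + PMI 900": (21, 2, 7),
    "K-Means TFIDF": (24, 3, 3),
    "K-Means LDA": (20, 0, 10),
    "OPTICS TFIDF": (14, 2, 14),
    "OPTICS LDA": (16, 0, 14),
}


def cohesive_fraction(counts):
    """Fraction of a model's units that were judged cohesive."""
    cohesive, borderline, non_cohesive = counts
    return cohesive / (cohesive + borderline + non_cohesive)


# Print the models from most to least cohesive.
ranked = sorted(RESULTS.items(), key=lambda kv: cohesive_fraction(kv[1]), reverse=True)
for model, counts in ranked:
    print(f"{model:<15} {cohesive_fraction(counts):.0%}")
```

Every row sums to the 30 units evaluated per strategy, so the fractions are directly comparable across models.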

Conclusions
• K-Means on TFIDF vectors is almost as good as
  the manual classification (24 vs. 27 cohesive
  units out of 30)
• LDA is very fast and approximately two
  thirds of the topics are acceptably
  cohesive

• Future work:
    – Make it hierarchical
    – Create hybrid algorithms

Thank you for listening



Find out more about the project:
http://www.paths-project.eu
m.mhall@sheffield.ac.uk



The research leading to these results has received funding from the European Community's Seventh Framework
Programme (FP7/2007-2013) under grant agreement no 270082. We acknowledge the contribution of all project
partners involved in PATHS (see: http://www.paths-project.eu).

GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
 
Climate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing DaysClimate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing Days
 
National Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practicesNational Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practices
 
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
 
Building RAG with self-deployed Milvus vector database and Snowpark Container...
Building RAG with self-deployed Milvus vector database and Snowpark Container...Building RAG with self-deployed Milvus vector database and Snowpark Container...
Building RAG with self-deployed Milvus vector database and Snowpark Container...
 
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
 
“I’m still / I’m still / Chaining from the Block”
“I’m still / I’m still / Chaining from the Block”“I’m still / I’m still / Chaining from the Block”
“I’m still / I’m still / Chaining from the Block”
 
Introduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - CybersecurityIntroduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - Cybersecurity
 
Full-RAG: A modern architecture for hyper-personalization
Full-RAG: A modern architecture for hyper-personalizationFull-RAG: A modern architecture for hyper-personalization
Full-RAG: A modern architecture for hyper-personalization
 
“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...
“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...
“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...
 
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with SlackLet's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
 
UiPath Test Automation using UiPath Test Suite series, part 5
UiPath Test Automation using UiPath Test Suite series, part 5UiPath Test Automation using UiPath Test Suite series, part 5
UiPath Test Automation using UiPath Test Suite series, part 5
 
Removing Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software FuzzingRemoving Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software Fuzzing
 
Presentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of GermanyPresentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of Germany
 
Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !
 
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
 

Evaluating the Use of Clustering for Automatically Organising Digital Library Collections

  • 1. Evaluating the Use of Clustering for Automatically Organising Digital Library Collections Mark M. Hall, Mark Stevenson, Paul D. Clough TPDL 2012, Cyprus, 24-27 September 2012
  • 2. Opening Up Digital Cultural Heritage
      Photo sources:
      – http://www.flickr.com/photos/brokenthoughts/122096903/
      – Carl Collins, http://www.flickr.com/photos/carlcollins/199792939/
      – http://www.flickr.com/photos/usnationalarchives/4069633668/
  • 3. Exploring Collections
      • Exploring / browsing as an alternative to search (where applicable)
      • Requires some kind of structuring of the data
      • Manual structuring is ideal
          – Expensive to generate
          – Integration of collections is problematic
      • Alternative: automatic structuring via clustering
  • 4. Test Collection
      • 28133 photographs provided by the University of St Andrews Library
          – 85% pre 1940
          – 89% black and white
          – Majority UK
          – Title and description tend to be short
      [Example photograph: Ottery St Mary Church]
  • 5. Tested Clustering Strategies
      • Latent Dirichlet Allocation (LDA)
          – 300 & 900 topics
          – With and without Pairwise Mutual Information (PMI) filtering
      • K-Means
          – 900 clusters
          – TFIDF vectors & LDA topic vectors
      • OPTICS
          – 900 clusters
          – TFIDF vectors & LDA topic vectors
  • 6. Processing Time
      Model            Wall-clock Time (hh:mm:ss)
      LDA 300          00:21:48
      LDA 900          00:42:42
      LDA + PMI 300    05:05:13
      LDA + PMI 900    17:26:08
      K-Means TFIDF    09:37:40
      K-Means LDA      03:49:04
      OPTICS TFIDF     12:42:13
      OPTICS LDA       05:12:49
  • 7. Evaluation Metrics
      • Cluster cohesion
          – Items in a cluster should be similar to each other
          – Items in a cluster should be different from items in other clusters
      • How to test this?
          – “Intruder” test
          – If you insert an intruder into a cluster, can people find it?
  • 8. Intruder Test
      1. Randomly select one topic
      2. Randomly select four items from the topic
      3. Randomly select a second topic, the “intruder” topic
      4. Randomly select one item from the second topic, the “intruder” item
      5. Scramble the five items and let the user choose which one is the “intruder”
  • 9. Cluster Cohesion – Cohesive
  • 10. Cluster Cohesion – Not Cohesive
  • 11. Evaluation Metrics
      • Cohesive
          – “Intruder” is chosen significantly more frequently than by chance
          – Choice distribution is significantly different from the uniform distribution
      • Borderline cohesive
          – Two out of five items make up > 95% of the answers
          – “Intruder” is one of those two
  • 12. Evaluation Bounds
      • Upper bound
          – Manual annotation (936 topics)
      • Lower bound
          – 3 cohesive topics
          – <5% likelihood of seeing that number of cohesive topics by chance
      • Control data
          – 10 “really, totally, completely obvious” intruders used to filter participants who randomly select answers
  • 13. Experiment
      • Crowd-sourced using staff & students at Sheffield University
          – 700 participants
      • 9 clustering strategies
          – 30 units per strategy
          – Total of 270 units
      • Results
          – 8840 ratings
          – 21 to 30 ratings per unit (median 27 ratings)
  • 14. Results
      Model            Cohesive   Borderline   Non-Cohesive
      Upper Bound      27         0            3
      Lower Bound      3          0            27
      LDA 300          15         6            9
      LDA 900          20         4            6
      LDA + PMI 300    16         4            10
      LDA + PMI 900    21         2            7
      K-Means TFIDF    24         3            3
      K-Means LDA      20         0            10
      OPTICS TFIDF     14         2            14
      OPTICS LDA       16         0            14
  • 15. Conclusions
      • K-Means is almost as good as the human classification
      • LDA is very fast, and approximately two thirds of its topics are acceptably cohesive
      • Future work:
          – Make it hierarchical
          – Create hybrid algorithms
  • 16. Thank you for listening Find out more about the project: http://www.paths-project.eu m.mhall@sheffield.ac.uk The research leading to these results has received funding from the European Community's Seventh Framework Programme (FP7/2007-2013) under grant agreement no 270082. We acknowledge the contribution of all project partners involved in PATHS (see: http://www.paths-project.eu).
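
As a rough illustration of the intruder test (slide 8) and the cohesion criteria (slide 11), the following Python sketch builds one test unit and classifies the answers collected for it. It is not the authors' implementation. The slides do not name the significance tests, so the one-sided binomial test, the chi-square goodness-of-fit test, the 0.05 threshold, and the function names `make_unit` and `classify` are all assumptions made here. Clusters are also assumed not to share items.

```python
import math
import random


def make_unit(clusters, rng):
    """Build one intruder-test unit from a dict: cluster id -> list of items."""
    # Steps 1 and 3: pick the main topic and a different "intruder" topic.
    eligible = [c for c, items in clusters.items() if len(items) >= 4]
    main = rng.choice(eligible)
    other = rng.choice([c for c in clusters if c != main and clusters[c]])
    # Steps 2 and 4: four items from the main topic, one from the intruder topic.
    members = rng.sample(clusters[main], 4)
    intruder = rng.choice(clusters[other])
    # Step 5: scramble the five items before showing them to a participant.
    shown = members + [intruder]
    rng.shuffle(shown)
    return shown, intruder


def binomial_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def chi2_sf_df4(x):
    """Survival function of the chi-square distribution with 4 degrees of freedom."""
    return math.exp(-x / 2) * (1 + x / 2)


def classify(counts, intruder_index, alpha=0.05):
    """Classify a unit from the number of votes each of the five items received."""
    assert len(counts) == 5, "chi2_sf_df4 is only valid for five items"
    n = sum(counts)
    expected = n / 5
    chi2 = sum((c - expected) ** 2 / expected for c in counts)
    # Assumed tests: intruder picked more often than chance (p = 1/5),
    # and the vote distribution differs from uniform.
    above_chance = binomial_tail(counts[intruder_index], n, 1 / 5) < alpha
    non_uniform = chi2_sf_df4(chi2) < alpha
    if above_chance and non_uniform:
        return "cohesive"
    # Borderline: two items attract > 95% of the votes and the intruder is one of them.
    top_two = sorted(range(5), key=lambda i: counts[i], reverse=True)[:2]
    if intruder_index in top_two and sum(counts[i] for i in top_two) > 0.95 * n:
        return "borderline"
    return "non-cohesive"
```

For example, `classify([1, 1, 1, 1, 23], intruder_index=4)` yields `"cohesive"`, because nearly every participant picked the intruder. Votes spread evenly across the five items yield `"non-cohesive"`.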