Evaluating the Use of Clustering for Automatically Organising Digital Library Collections

Presentation given by Mark M. Hall, Mark Stevenson and Paul D. Clough (Information School / Department of Computer Science, University of Sheffield, UK) at TPDL 2012, Cyprus, 24-27 September 2012.

Evaluating the Use of Clustering
for Automatically Organising
Digital Library Collections
Mark M. Hall, Mark Stevenson,
Paul D. Clough

TPDL 2012, Cyprus, 24-27 September 2012

Opening Up Digital Cultural Heritage

http://www.flickr.com/photos/brokenthoughts/122096903/
Carl Collins
http://www.flickr.com/photos/carlcollins/199792939/

http://www.flickr.com/photos/usnationalarchives/4069633668/

Exploring Collections
• Exploring / Browsing as an alternative to Search (where applicable)
• Requires some kind of structuring of the data
• Manual structuring ideal
– Expensive to generate
– Integration of collections problematic
• Alternative: Automatic structuring via clustering

Test Collection
• 28133 photographs provided by the University of St Andrews Library
– 85% pre 1940
– 89% black and white
– Majority UK
– Title and description tend to be short
[Image: Ottery St Mary Church]

Tested Clustering Strategies
• Latent Dirichlet Allocation (LDA)
– 300 & 900 topics
– With and without Pointwise Mutual Information (PMI) filtering
• K-Means
– 900 clusters
– TFIDF vectors & LDA topic vectors
• OPTICS
– 900 clusters
– TFIDF vectors & LDA topic vectors
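
To make the "K-Means on TFIDF vectors" configuration more concrete, the following is a minimal pure-Python sketch. Everything beyond the idea itself is an assumption for illustration: the toy titles, the whitespace tokeniser, the cosine similarity, k=2 instead of 900, and the naive "first k documents" initialisation. The slides do not say which implementation or parameters the study used.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (term -> weight) per document."""
    tokenised = [doc.lower().split() for doc in docs]
    n_docs = len(tokenised)
    # Document frequency: number of documents each term occurs in.
    df = Counter(term for tokens in tokenised for term in set(tokens))
    vectors = []
    for tokens in tokenised:
        tf = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of sparse vectors."""
    total = {}
    for vec in vectors:
        for term, weight in vec.items():
            total[term] = total.get(term, 0.0) + weight
    return {term: weight / len(vectors) for term, weight in total.items()}


def kmeans(vectors, k, iterations=10):
    """Assign each vector to one of k clusters; returns a list of cluster ids."""
    # Hypothetical initialisation: the first k documents act as centroids.
    centroids = [dict(vec) for vec in vectors[:k]]
    assignment = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: pick the most similar centroid for each item.
        assignment = [
            max(range(k), key=lambda c: cosine(vec, centroids[c]))
            for vec in vectors
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, a in zip(vectors, assignment) if a == c]
            if members:
                centroids[c] = mean_vector(members)
    return assignment


# Toy stand-ins for short photograph titles (not taken from the collection).
titles = [
    "church tower ottery",
    "harbour boats fishing",
    "church interior nave",
    "fishing boats at sea",
]
labels = kmeans(tfidf_vectors(titles), k=2)
print(labels)
```

Swapping the TF-IDF vectors for per-item LDA topic distributions would give the "K-Means LDA" variant; only the input to `kmeans` changes.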

Processing Time
Model            Wall-clock Time
LDA 300          00:21:48
LDA 900          00:42:42
LDA + PMI 300    05:05:13
LDA + PMI 900    17:26:08
K-Means TFIDF    09:37:40
K-Means LDA      03:49:04
OPTICS TFIDF     12:42:13
OPTICS LDA       05:12:49
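
The timings are easier to compare as ratios. The sketch below assumes the values are in hh:mm:ss format (the slide only labels them "Wall-clock Time") and expresses each run relative to the fastest one, LDA 300. The `to_seconds` helper is a hypothetical name introduced here.

```python
def to_seconds(hms):
    """Convert an 'hh:mm:ss' string to seconds (format is an assumption)."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


# Values copied from the Processing Time slide.
wall_clock = {
    "LDA 300": "00:21:48",
    "LDA 900": "00:42:42",
    "LDA + PMI 300": "05:05:13",
    "LDA + PMI 900": "17:26:08",
    "K-Means TFIDF": "09:37:40",
    "K-Means LDA": "03:49:04",
    "OPTICS TFIDF": "12:42:13",
    "OPTICS LDA": "05:12:49",
}

baseline = to_seconds(wall_clock["LDA 300"])
for model, hms in sorted(wall_clock.items(), key=lambda kv: to_seconds(kv[1])):
    ratio = to_seconds(hms) / baseline
    print(f"{model:<15} {hms}  {ratio:5.1f}x LDA 300")
```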

Evaluation Metrics
• Cluster cohesion
– Items in a cluster should be similar to each other
– Items in a cluster should be different from items in other clusters
• How to test this?
– “Intruder” test
– If you insert an intruder into a cluster, can people find it?

Intruder Test
1. Randomly select one topic
2. Randomly select four items from the topic
3. Randomly select a second topic – the “intruder” topic
4. Randomly select one item from the second topic – the “intruder” item
5. Scramble the five items and let the user choose which one is the “intruder”
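
The five steps above can be sketched as a small function. This is an illustration under stated assumptions, not the study's code: `build_intruder_unit` and the toy `clusters` mapping are hypothetical, each item is assumed to belong to exactly one cluster, and the first topic is restricted to clusters with at least four items (the slides do not say how smaller clusters were handled).

```python
import random


def build_intruder_unit(clusters, rng):
    """Build one intruder-test unit.

    clusters: dict mapping a cluster/topic id to a list of item ids.
    rng: a random.Random instance, passed in so runs can be reproduced.
    Returns (five shuffled items, the intruder item).
    """
    # Steps 1-2: pick a topic and four of its items.
    # Assumption: only clusters with at least four items are eligible.
    eligible = [cid for cid, items in clusters.items() if len(items) >= 4]
    topic = rng.choice(eligible)
    regular_items = rng.sample(clusters[topic], 4)

    # Steps 3-4: pick a different, non-empty topic and one item from it.
    other_topics = [cid for cid, items in clusters.items()
                    if cid != topic and items]
    intruder_topic = rng.choice(other_topics)
    intruder = rng.choice(clusters[intruder_topic])

    # Step 5: scramble the five items before showing them to a participant.
    unit = regular_items + [intruder]
    rng.shuffle(unit)
    return unit, intruder


# Toy clustering with made-up item ids.
clusters = {
    "churches": [1, 2, 3, 4, 5],
    "harbours": [6, 7],
    "portraits": [8, 9, 10, 11],
}
unit, intruder = build_intruder_unit(clusters, random.Random(0))
print(unit, "intruder:", intruder)
```

Passing the random generator in explicitly keeps the selection reproducible, which helps when the same units must be shown to many participants.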

Cluster Cohesion – Cohesive

Cluster Cohesion – Not Cohesive

Evaluation Metrics
• Cohesive
– “Intruder” is chosen significantly more frequently than by chance
– Choice distribution is significantly different from the uniform distribution
• Borderline cohesive
– Two out of five items make up > 95% of the answers
– “Intruder” is one of those two
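
One way to turn these criteria into a decision rule is sketched below. The slides do not name the statistical tests, so the choices here are assumptions: a one-sided exact binomial test for "more frequently than by chance" (chance = 1/5), a chi-square goodness-of-fit test against the uniform distribution (4 degrees of freedom, critical value 9.488 at the 0.05 level), and checking "cohesive" before "borderline". The function names are hypothetical.

```python
from math import comb

# Chi-square critical value for 4 degrees of freedom at alpha = 0.05.
CHI2_CRITICAL_DF4_P05 = 9.488


def binomial_upper_tail(successes, trials, p):
    """P(X >= successes) for X ~ Binomial(trials, p)."""
    return sum(
        comb(trials, k) * p ** k * (1 - p) ** (trials - k)
        for k in range(successes, trials + 1)
    )


def classify_unit(votes, intruder_index, alpha=0.05):
    """Classify one unit from its vote counts.

    votes: five counts, one per displayed item.
    intruder_index: position of the true intruder in `votes`.
    Note: the hard-coded chi-square threshold only matches alpha = 0.05.
    """
    total = sum(votes)
    expected = total / len(votes)

    # Assumed test 1: is the intruder picked more often than chance (1 in 5)?
    intruder_p = binomial_upper_tail(votes[intruder_index], total, 1 / len(votes))
    # Assumed test 2: does the vote distribution differ from uniform?
    chi2 = sum((v - expected) ** 2 / expected for v in votes)
    if intruder_p < alpha and chi2 > CHI2_CRITICAL_DF4_P05:
        return "cohesive"

    # Borderline: two items take > 95% of answers and one is the intruder.
    top_two = sorted(range(len(votes)), key=lambda i: votes[i], reverse=True)[:2]
    top_share = (votes[top_two[0]] + votes[top_two[1]]) / total
    if intruder_index in top_two and top_share > 0.95:
        return "borderline"
    return "non-cohesive"


print(classify_unit([1, 1, 23, 1, 1], intruder_index=2))
print(classify_unit([20, 0, 6, 0, 0], intruder_index=2))
print(classify_unit([6, 5, 5, 6, 5], intruder_index=0))
```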

Evaluation Bounds
• Upper bound
– Manual annotation
  • 936 topics
• Lower bound
– 3 cohesive topics
– <5% likelihood of seeing that number of cohesive topics by chance
• Control data
– 10 “really, totally, completely obvious” intruders used to filter participants who randomly select answers

Experiment
• Crowd-sourced using staff & students at Sheffield University
– 700 participants
• 9 clustering strategies
– 30 units per strategy – total of 270 units
• Results
– 8840 ratings
– 21 – 30 ratings per unit (median 27 ratings)

Results
Model            Cohesive   Borderline   Non-Cohesive
Upper Bound      27         0            3
Lower Bound      3          0            27
LDA 300          15         6            9
LDA 900          20         4            6
LDA + PMI 300    16         4            10
LDA + PMI 900    21         2            7
K-Means TFIDF    24         3            3
K-Means LDA      20         0            10
OPTICS TFIDF     14         2            14
OPTICS LDA       16         0            14
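
Since every row sums to 30 units, the counts convert directly into proportions. The short sketch below does that arithmetic on the figures from the table; `cohesive_share` is a hypothetical helper name, and no numbers are used beyond those on the slide.

```python
# (cohesive, borderline, non-cohesive) counts copied from the Results slide.
results = {
    "Upper Bound": (27, 0, 3),
    "Lower Bound": (3, 0, 27),
    "LDA 300": (15, 6, 9),
    "LDA 900": (20, 4, 6),
    "LDA + PMI 300": (16, 4, 10),
    "LDA + PMI 900": (21, 2, 7),
    "K-Means TFIDF": (24, 3, 3),
    "K-Means LDA": (20, 0, 10),
    "OPTICS TFIDF": (14, 2, 14),
    "OPTICS LDA": (16, 0, 14),
}


def cohesive_share(model):
    """Fraction of a model's units that were judged cohesive."""
    cohesive, borderline, non_cohesive = results[model]
    return cohesive / (cohesive + borderline + non_cohesive)


# Print models from most to least cohesive.
for model in sorted(results, key=cohesive_share, reverse=True):
    print(f"{model:<15} {cohesive_share(model):.0%} cohesive")
```

Read this way, K-Means on TFIDF vectors reaches 24 of 30 cohesive units against 27 of 30 for the manual upper bound, and LDA with 900 topics reaches 20 of 30.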

Conclusions
• K-Means on TFIDF vectors comes close to the human classification (24 vs. 27 cohesive units out of 30)
• LDA is very fast and approximately two thirds of the topics are acceptably cohesive

• Future work:
– Make it hierarchical
– Create hybrid algorithms

Thank you for listening

Find out more about the project:

http://www.paths-project.eu

m.mhall@sheffield.ac.uk

The research leading to these results has received funding from the European Community's Seventh Framework
Programme (FP7/2007-2013) under grant agreement no 270082. We acknowledge the contribution of all project
partners involved in PATHS (see: http://www.paths-project.eu).

Review the use to the use of the Outlook built-in tool named - Outlook Test E-mail AutoConfiguration for - viewing the content of the Autodiscover session between a client and a server. This is the first article for a series of Three articles, in which we review different tools for “Autodiscover Troubleshooting scenarios”. http://o365info.com/outlook-test-e-mail-autoconfiguration-autodiscover-troubleshooting-tools-part-1-of-4-part-21-of-36

Should i use a single namespace for exchange infrastructure part 1#2 part ...

Eyal Doron

Should I use a single namespace for Exchange Infrastructure? | Part 1#2 | Part 17#36 Description of a scenario in which we use a single namespace for Exchange infrastructure. The meaning of the term –“single namespace” – is a scenario in which Exchange infrastructure use the same namespace for internal and external Exchange client described as – single or unified namespace. http://o365info.com/should-i-use-a-single-namespace-for-exchange-infrastructure-part-1-of-2-part-17-of-36 Eyal Doron | o365info.com

Outlook autodiscover decision process choosing the right autodiscover method ...

Eyal Doron

PATHS: Personalised Access to Cultural Heritage Spaces

pathsproject

IND-2012-277 St.Xavier’s High School -Zero Garbage Campaign

designforchangechallenge

Aletras, Nikolaos and Stevenson, Mark (2013) "Evaluating Topic Coherence Us...

pathsproject

This document introduces distributional semantic similarity methods for automatically measuring the coherence of topics generated by topic models. It constructs semantic spaces to represent topic words using Wikipedia as a reference corpus. Relatedness between topic words and context features is measured using variants of Pointwise Mutual Information. Topic coherence is determined by measuring the distance between word vectors. Evaluation on three datasets shows distributional measures outperform the state-of-the-art approach, with performance improving using a reduced semantic space.

Presentación Drupal Commerce en OpenExpo Ecommerce

OpenExpo

Este documento introduce Drupal como un framework y CMS, explicando sus capas, hooks y permisos de roles. También describe la comunidad Drupal y recursos como módulos, temas y ejemplos de sitios como Kickstarter que usan Drupal Commerce para tiendas en línea. En resumen, presenta las características principales de Drupal, su arquitectura, administración de contenido y comercialización a través de Drupal Commerce.

PATHS Demo: Exploring Digital Cultural Heritage Spaces

pathsproject

Students in a village conducted a survey and found that the local community lacked awareness about health and hygiene. To address this, the students organized a rally and role plays to educate the community about health and hygiene. They also visited homes to raise awareness and help the community change their practices. As a result of these efforts, the community became more aware of health and hygiene, kept their surroundings cleaner, and helped others do the same.

Client protocol connectivity flow in Exchange 2013/2007 coexistence | Introdu...

Eyal Doron

Client protocol connectivity flow in Exchange 2013/2007 coexistence | Introduction and basic concepts| 1/4 | 16#23 http://o365info.com/client-protocol-connectivity-flow-in-exchange-2013-2007-coexistence-environment-introduction-and-basic-concepts-14/ Reviewing the subject of – client protocol connectivity flow, in an Exchange 2013/2007 coexistence environment (this is the first article, in a series of four articles). Eyal Doron | o365info.com

Plivo OSDC FR 2012

mricordeau

Think before you speak

Desi Puspitasariku

3 Dec 2013 Integrated computational materials CDE themed competition presenta...

Defence and Security Accelerator

Facing the data challenge: Developing data policy & services

Marieke Guy

DM2E Data Model

Steffen Hennicke

Henry&Hobbs, 'Developing long-term agro-ecological trial datasets for C and N...

TERN Australia

The document summarizes a project that aims to develop long-term datasets from agricultural trials in Australia to model carbon and nitrogen dynamics. The project involves collating climate, soil, management and crop yield data from several long-term trial sites. Software is being developed to extract and transform this data to calibrate and validate carbon and nitrogen models. Preliminary results using several models show they produce similar soil carbon stock estimates when given the same input data, but further refinement is needed. The project aims to make the data and software publicly available to improve modeling of carbon and nutrient dynamics in agricultural systems.

Dr. alex bartzas

innovation_workshop2013

The Microlab experience in bridging the gap between academia and industry through collaborations on research projects and technology clusters. Microlab is a member of three technology clusters supported by Corallia focused on nano/microelectronics, space technologies, and gaming/creative content. Microlab provides services to cluster members and works with industry on applications in domains including multimedia, trusted computing, medical devices, microelectronics, space, and energy through European Commission funded projects.

Kaggle's WISE 2014 challenge

Eleftherios Spyromitros-Xioufis

UKRepNet presentation at Pure UK User Group Meeting Dundee

euroCRIS - Current Research Information Systems

Business case and cost modelling for an end-to-end RDM service

Jisc RDM

Search and Hyperlinking Overview @MediaEval2014

Maria Eskevich

The document summarizes the Search and Hyperlinking task at the 2014 MediaEval benchmarking initiative. It provides an overview of the task and datasets used from 2012-2014. It also reports the results of various submissions on the search and hyperlinking sub-tasks based on evaluation metrics like MAP, P@5/10 and discusses lessons learned like the effect of prosodic features and metadata on performance. Finally, it acknowledges the contributions of the BBC and others in preparing the datasets and hosting user trials.

(11) INTERACTION Final event - Wrap-up

Interaction-FP7

This document summarizes a project that studied driver behavior with in-vehicle technologies. It collected and analyzed data from over 100 vehicles across several European countries over 12 months. The project produced novel insights into how drivers interact with in-vehicle technologies by applying various techniques. It also compared the study methodology and findings to the 100-car naturalistic driving study in the United States.

UK RepositoryNet+ Project: New Services for the Institutional Repository Netw...

EDINA, University of Edinburgh

The document discusses the UK RepositoryNet+ Project which aims to enhance institutional repository networks in the UK. It describes some of the complex landscape of actors, projects, and stakeholders involved. It also outlines a joint venture between RepositoryNet+, the University of St Andrews, and the Software Development Life Cycle group to enhance St Andrews' CRIS/IR system according to RepositoryNet's worklines, including implementing various interoperability standards and services.

IASSIST 2012 - DDI-RDF - Trouble with Triples

Dr.-Ing. Thomas Hartmann

This document summarizes work being done to express the Data Documentation Initiative (DDI) metadata standard in Resource Description Framework (RDF) format to improve discovery and linking of microdata on the Web of Linked Data. It describes background on the DDI to RDF mapping effort, the goals of making microdata more accessible and interoperable online, and examples of how the RDF representation would support common discovery use cases. It also provides information on tools and next steps for the ongoing work, acknowledging contributions from participants in workshops where this effort was discussed.

Icsm12.ppt

Yann-Gaël Guéhéneuc

The document presents an empirical study on improving requirements traceability techniques using insights from an eye-tracking study. The eye-tracking study found that developers spent the most time viewing method names, comments, and variable names when verifying traceability links. This suggests that requirements traceability techniques should weight source code entities differently based on developer attention. The study then proposed two improved weighting schemes: SE/IDF, which weights entities based on developer attention; and DOI/IDF, which separately weights domain and implementation terms. Evaluating these schemes on two case studies found they achieved statistically higher accuracy than a baseline TF/IDF approach, supporting the hypothesis that incorporating developer insights can improve traceability recovery.

Improving the Performance of the DL-Learner SPARQL Component for Semantic We...

Sebastian Hellmann

Open Access & sharing research data: a Dutch workshop for phd in economics

Esther Hoorn

This document discusses a workshop on open access and sharing research data. It introduces open access, defines it as digital works that are free online and free of copyright restrictions. It discusses funder mandates requiring open access publication and sharing of research data. It also addresses issues around research integrity and transparency when publicly sharing data and retaining copyright of published work. The document provides information on open access policies and initiatives in various fields and journals.

Viewers also liked

My E-mail appears as spam - troubleshooting path - part 11 of 17

Eyal Doron

The autodiscover algorithm for locating the source of information part 05#36

Eyal Doron

Word pressで情報を得るのに役立つwebサイトの紹介Akinori Tateyama

DFC2012 India: Health & Hygiene

designforchangechallenge

Client protocol connectivity flow in Exchange 2013/2007 coexistence | Introdu...

Eyal Doron

Plivo OSDC FR 2012

mricordeau

Think before you speak

Desi Puspitasariku

Viewers also liked (7)

My E-mail appears as spam - troubleshooting path - part 11 of 17

The autodiscover algorithm for locating the source of information part 05#36

Word pressで情報を得るのに役立つwebサイトの紹介

DFC2012 India: Health & Hygiene

Client protocol connectivity flow in Exchange 2013/2007 coexistence | Introdu...

Plivo OSDC FR 2012

Think before you speak

Similar to Evaluating the Use of Clustering for Automatically Organising Digital Library Collections

3 Dec 2013 Integrated computational materials CDE themed competition presenta...

Defence and Security Accelerator

Facing the data challenge: Developing data policy & services

Marieke Guy

DM2E Data Model

Steffen Hennicke

Henry&Hobbs, 'Developing long-term agro-ecological trial datasets for C and N...

TERN Australia

Dr. alex bartzas

innovation_workshop2013

Kaggle's WISE 2014 challenge

Eleftherios Spyromitros-Xioufis

UKRepNet presentation at Pure UK User Group Meeting Dundee

euroCRIS - Current Research Information Systems

Business case and cost modelling for an end-to-end RDM service

Jisc RDM

Search and Hyperlinking Overview @MediaEval2014

Maria Eskevich

(11) INTERACTION Final event - Wrap-up

Interaction-FP7

UK RepositoryNet+ Project: New Services for the Institutional Repository Netw...

EDINA, University of Edinburgh

IASSIST 2012 - DDI-RDF - Trouble with Triples

Dr.-Ing. Thomas Hartmann

Icsm12.ppt

Yann-Gaël Guéhéneuc

Improving the Performance of the DL-Learner SPARQL Component for Semantic We...

Sebastian Hellmann

Open Access & sharing research data: a Dutch workshop for phd in economics

Esther Hoorn

Empirical Evaluation of ETD-ms Compliance for ETDs Harvested by the NDLTD Uni...

Lighton Phiri

Research Data Management at Imperial College London

Sarah Anna Stewart

Linked Data for Knowledge Discovery: Introduction

Mathieu d'Aquin

This document summarizes the LD4KD 2015 workshop, which brought together researchers from the linked data and knowledge discovery communities. The workshop included two paper presentations, a demo session, and discussions on opportunities and challenges at the intersection of linked data and knowledge discovery. Some opportunities discussed were using linked data as input for knowledge discovery due to its large, global scale and ability to be extended and enriched. Challenges discussed included dealing with linked data as a graph structure, its distributed and incomplete nature, and ensuring its quality and reducing bias. The goal of the workshop was to further understanding and develop practical tools to address these challenges.

DLF Fall Forum 2012, Tales from the Cloud

DuraSpace

The Texas Digital Library moved all of its infrastructure and services to Amazon Web Services in 2011. This allowed TDL to deploy services more easily and elastically without being constrained by physical hardware limitations. While the transition was successful overall, TDL continues to evaluate AWS costs and considers options for certain services like digital preservation that may be better suited to on-campus infrastructure. The outage experienced by many AWS users in late 2012 highlighted the benefits of AWS's communication during such incidents but also reinforced the importance of TDL's ongoing evaluation of cloud strategy and risks.

Orcid implementations-140929-jonasgilbert

jonas_gilbert

This document discusses Chalmers University of Technology's implementation of ORCID identifiers. It notes that Chalmers has 10,000 students and 2,000 researchers annually publishing 2,500 scholarly works. Chalmers obtained an ORCID institutional membership in 2013 and conducted a pilot project batch creating ORCIDs for one department. Now, Chalmers has developed a "Create & Connect" service, implemented in collaboration with the library and IT/HR departments, to generate ORCIDs that are stored in the central HR system and linked across university research and publication databases. The service aims to provide researchers a one-stop solution for all things concerning their ORCID at Chalmers.

Similar to Evaluating the Use of Clustering for Automatically Organising Digital Library Collections (20)

3 Dec 2013 Integrated computational materials CDE themed competition presenta...

Facing the data challenge: Developing data policy & services

DM2E Data Model

Henry&Hobbs, 'Developing long-term agro-ecological trial datasets for C and N...

Dr. alex bartzas

Kaggle's WISE 2014 challenge

UKRepNet presentation at Pure UK User Group Meeting Dundee

Business case and cost modelling for an end-to-end RDM service

Search and Hyperlinking Overview @MediaEval2014

(11) INTERACTION Final event - Wrap-up

UK RepositoryNet+ Project: New Services for the Institutional Repository Netw...

IASSIST 2012 - DDI-RDF - Trouble with Triples

Icsm12.ppt

Improving the Performance of the DL-Learner SPARQL Component for Semantic We...

Open Access & sharing research data: a Dutch workshop for phd in economics

Empirical Evaluation of ETD-ms Compliance for ETDs Harvested by the NDLTD Uni...

Research Data Management at Imperial College London

Linked Data for Knowledge Discovery: Introduction

DLF Fall Forum 2012, Tales from the Cloud

Orcid implementations-140929-jonasgilbert

More from pathsproject

Roadmap from ESEPaths to EDMPaths: a note on representing annotations resulti...

pathsproject

Roadmap from ESEPaths to EDMPaths: a note on representing annotations resulting from automatic enrichment - Aitor Soroa, Eneko Agirre, Arantxa Otegi and Antoine Isaac This document is a case study on using the Europeana Data Model (EDM) [Doerr et al., 2010] for representing annotations of Cultural Heritage Objects (CHO). One of the main goals of the PATHS project is to augment CHOs (items) with information that will enrich the user’s experience. The additional information includes links between items in cultural collections and from items to external sources like Wikipedia. With this goal, the PATHS project has applied Natural Language Processing (NLP) techniques on a subset of the items in Europeana.

PATHSenrich: A Web Service Prototype for Automatic Cultural Heritage Item Enr...

pathsproject

PATHSenrich: A Web Service Prototype for Automatic Cultural Heritage Item Enrichment, Eneko Agirre, Ander Barrena, Kike Fernandez, Esther Miranda, Arantxa Otegi, and Aitor Soroa, paper presented the international conference on Theory and Practice in Digital Libraries, TPDL 2013 Large amounts of cultural heritage material are nowadays available through online digital library portals. Most of these cultural items have short descriptions and lack rich contextual information. The PATHS project has developed experimental enrichment services. As a proof of concept, this paper presents a web service prototype which allows independent content providers to enrich cultural heritage items with a subset of the full functionality: links to related items in the collection and links to related Wikipedia articles. In the future we plan to provide more advanced functionality, as available offline for PATHS.

Implementing Recommendations in the PATHS system, SUEDL 2013

pathsproject

Implementing Recommendations in the PATHS system, Paul Clough, Arantxa Otegi, Eneko Agirre and Mark Hall, paper presented at the Supporting Users Exploration of Digital Libraries, SUEDL 2013 workshop, during TPDL 2013 in Valetta, Malta In this paper we describe the design and implementation of nonpersonalized recommendations in the PATHS system. This system allows users to explore items from Europeana in new ways. Recommendations of the type “people who viewed this item also viewed this item” are powered by pairs of viewed items mined from Europeana. However, due to limited usage data only 10.3% of items in the PATHS dataset have recommendations (4.3% of item pairs visited more than once). Therefore, “related items”, a form of contentbased recommendation, are offered to users based on identifying similar items. We discuss some of the problems with implementing recommendations and highlight areas for future work in the PATHS project.

User-Centred Design to Support Exploration and Path Creation in Cultural Her...

pathsproject

This document describes research on developing a prototype system to enhance user interaction with cultural heritage collections through a pathway metaphor. It involved gathering user requirements through surveys and interviews. Key findings include: 1) Existing online paths tend to be linear and static, limiting exploration, though users preferred more flexible, theme-based paths that allowed branching. 2) Interviews found the path metaphor could represent search histories, journeys of discovery, linked metadata, guides into collections, routes through collections, and more. 3) An interaction model was developed involving consuming, collecting, creating and communicating about paths to support exploration, learning and engagement. 4) The prototype aims to integrate path creation, use and sharing to better support

Generating Paths through Cultural Heritage Collections Latech2013 paper

pathsproject

Generating Paths through Cultural Heritage Collections, Samuel Fernando, Paula Goodale, Paul Clough, Mark Stevenson, Mark Hall and Eneko Agirre. Paper presented at Latech 2013 Cultural heritage collections usually organise sets of items into exhibitions or guided tours. These items are often accompanied by text that describes the theme and topic of the exhibition and provides background context and details of connections with other items. The PATHS project brings the idea of guided tours to digital library collections where a tool to create virtual paths are used to assist with navigation and provide guides on particular subjects and topics. In this paper we characterise and analyse paths of items created by users of our online system.

Supporting User's Exploration of Digital Libraries, Suedl 2012 workshop proce...

pathsproject

Workshop proceedings from the International workshop on Supporting Users Exploration of Digital Libraries, SUEDL 2012 which was held at TPDL 2012 (the international conference on Theory and Practice in Digital Libraries), Paphos, Cyprus, September 2012. The aim of the workshop was to stimulate collaboration from experts and stakeholders in Digital Libraries, Cultural Heritage, Natural Language Processing and Information Retrieval in order to explore methods and strategies to support exploration of Digital Libraries, beyond the white box paradigm of search and click. The proceedings includes: "Browsing Europeana - Opportunities and Challenges', David Haskiya "Query re-writing using shallow language processing effects', Anna Mastora and Sarantos Kapidakis "Visualising Television Heritage" Johan Ooman et al, "Providing suitable information access for new users of Digital Libraries", Rike Brecht et al "Exploring Pelagios: a Visual Browser for Geo-tagged datasets" Rainer Simon et al

PATHS state of the art monitoring report

pathsproject

Recommendations for the automatic enrichment of digital library content using...

pathsproject

Recommendations for the enrichment of digital library content using open source software, PATHS report by Eneko Agirre and Arantxa Otegi The goal of this document is to present an overall set of recommendations for the automatic enrichment of Digital Library content using open source software. It is intended to be useful for third-parties who would like to offer enrichment services. Note that this is not a step-by-step guide for reimplementation, but an overall view of the software required and the programming effort involved.

Semantic Enrichment of Cultural Heritage content in PATHS

pathsproject

Semantic Enrichment of Cultural Heritage content in PATHS, report by Mark Stevenson and Arantxa Otegi with Eneko Agirre, Nikos Aletras, Paul Clough, Samuel Fernando and Aitor Saroa. The aim of the PATHS project is to enable exploration and discovery within cultural heritage collections. In order to support this the project developed a range of enrichment techniques which augmented these collections with additional information to enhance the users’ browsing experience. One of the demonstration systems developed in PATHS makes use of content from Europeana. This document summarises the semantic enrichment techniques developed in PATHS, with particular reference to their application to the Europeana data.

Generating Paths through Cultural Heritage Collections, LATECH 2013 paper

pathsproject

Generating Paths through Cultural Heritage Collections Samuel Fernando, Paula Goodale, Paul Clough, Mark Stevenson, Mark Hall and Eneko Agirre. The PATHS project brings the idea of guided tours to digital library collections where a tool to create virtual paths are used to assist with navigation and provide guides on particular subjects and topics. In this paper we characterise and analyse paths of items created by users of our online system.

PATHS @ LATECH 2013

pathsproject

PATHS at the eChallenges conference

pathsproject

The PATHS project is a 3-year EU-funded project involving 6 partners across 5 countries. The project aims to introduce personalized paths into digital cultural heritage collections to provide more engaging access to large volumes of online material. The PATHS system enriches metadata through natural language processing and links items within collections and to external resources. It provides various tools for browsing, searching and creating paths. Two rounds of user evaluations found the path creation tools and search mechanisms were well received. Outcomes include the PATHS API and potential commercialization of components and consultancy services.

PATHS at the EAA conference 2013

pathsproject

This document summarizes the PATHS project, which developed tools for exploring digital cultural heritage collections. The project involved 6 partners across 5 countries. It researched methods for navigating collections, including user-created paths and natural language processing of metadata. Users can browse collections through a thesaurus, tag cloud, or topic map. The system allows users to create and publish nonlinear paths through the collection with descriptions. The tools have potential for classroom activities, curated collections, and research.

PATHS at the eCult dialogue day 2013

pathsproject

Comparing taxonomies for organising collections of documents presentation

pathsproject

This document compares different taxonomies for organizing large collections of documents. It evaluates taxonomies that were either manually created (LCSH, WordNet domains, Wikipedia taxonomy, DBpedia ontology) or automatically derived from document data using LDA topic modeling or Wikipedia link frequencies. The document describes applying these taxonomies to a collection of over 550,000 items from Europeana, a digital library. It then evaluates the taxonomies based on how cohesive the groupings are and how accurately the relationships between parent and child nodes are classified.

SemEval-2012 Task 6: A Pilot on Semantic Textual Similarity

pathsproject

This document describes the SemEval-2012 Task 6 on semantic textual similarity. The task involved measuring the semantic equivalence of sentence pairs on a scale from 0 to 5. The training data consisted of 2000 sentence pairs from existing paraphrase and machine translation datasets. The test data also had 2000 sentence pairs from these datasets as well as surprise datasets. Systems were evaluated based on their Pearson correlation with human annotations. 35 teams participated and the best systems achieved a Pearson correlation over 80%. This pilot task established semantic textual similarity as an area for further exploration.

A pilot on Semantic Textual Similarity

pathsproject

Evaluating the Use of Clustering for Automatically Organising Digital Library Collections

1. Evaluating the Use of Clustering for Automatically Organising Digital Library Collections
Mark M. Hall, Mark Stevenson, Paul D. Clough
TPDL 2012, Cyprus, 24-27 September 2012

2. Opening Up Digital Cultural Heritage
http://www.flickr.com/photos/brokenthoughts/122096903/
Carl Collins
http://www.flickr.com/photos/carlcollins/199792939/
http://www.flickr.com/photos/usnationalarchives/4069633668/

3. Exploring Collections
• Exploring / Browsing as an alternative to Search (where applicable)
• Requires some kind of structuring of the data
• Manual structuring ideal
– Expensive to generate
– Integration of collections problematic
• Alternative: Automatic structuring via clustering

4. Test Collection
• 28133 photographs provided by the University of St Andrews Library
– 85% pre 1940
– 89% black and white
– Majority UK
– Title and description tend to be short
[Example photograph: Ottery St Mary Church]

5. Tested Clustering Strategies
• Latent Dirichlet Allocation (LDA)
– 300 & 900 topics
– With and without Pairwise Mutual Information (PMI) filtering
• K-Means
– 900 clusters
– TFIDF vectors & LDA topic vectors
• OPTICS
– 900 clusters
– TFIDF vectors & LDA topic vectors
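As a rough illustration of the TFIDF vector representation named above, the sketch below builds TF-IDF vectors from a few short titles and compares them with cosine similarity. The slides do not state the weighting scheme or tokenisation that was used, so the formulas (length-normalised term frequency, idf = log(N/df)), the whitespace tokeniser and the sample titles are all assumptions made for illustration.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one {term: weight} dict per document (assumed weighting scheme)."""
    tokenised = [doc.lower().split() for doc in docs]
    n_docs = len(tokenised)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenised for term in set(tokens))
    vectors = []
    for tokens in tokenised:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors

def cosine(a, b):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical titles, not taken from the St Andrews collection.
titles = [
    "parish church west front",
    "parish church interior",
    "fishing boats in harbour",
]
vecs = tfidf_vectors(titles)
print(cosine(vecs[0], vecs[1]), cosine(vecs[0], vecs[2]))
```

Vectors of this kind could then be passed to a clustering algorithm such as K-Means; with short titles and descriptions the vectors are very sparse, which is one plausible reason for also trying LDA topic vectors as input.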

6. Processing Time
Model            Wall-clock Time (hh:mm:ss)
LDA 300          00:21:48
LDA 900          00:42:42
LDA + PMI 300    05:05:13
LDA + PMI 900    17:26:08
K-Means TFIDF    09:37:40
K-Means LDA      03:49:04
Optics TFIDF     12:42:13
Optics LDA       05:12:49
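To make the timings easier to compare, the sketch below converts them to seconds and expresses each as a multiple of the fastest run. The values are copied from the slide; reading them as hh:mm:ss is an assumption, since the slide does not label the unit.

```python
# Wall-clock times copied from the slide; the hh:mm:ss format is assumed.
TIMES = {
    "LDA 300": "00:21:48",
    "LDA 900": "00:42:42",
    "LDA + PMI 300": "05:05:13",
    "LDA + PMI 900": "17:26:08",
    "K-Means TFIDF": "09:37:40",
    "K-Means LDA": "03:49:04",
    "Optics TFIDF": "12:42:13",
    "Optics LDA": "05:12:49",
}

def to_seconds(stamp):
    """Convert an 'hh:mm:ss' string to a number of seconds."""
    hours, minutes, seconds = (int(part) for part in stamp.split(":"))
    return hours * 3600 + minutes * 60 + seconds

seconds = {model: to_seconds(stamp) for model, stamp in TIMES.items()}
fastest = min(seconds, key=seconds.get)
slowest = max(seconds, key=seconds.get)

# List the models from fastest to slowest, relative to the fastest run.
for model, secs in sorted(seconds.items(), key=lambda kv: kv[1]):
    print(f"{model:15s} {secs:6d} s  {secs / seconds[fastest]:5.1f}x")
```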

7. Evaluation Metrics
• Cluster cohesion
– Items in a cluster should be similar to each other
– Items in a cluster should be different from items in other clusters
• How to test this?
– “Intruder” test
– If you insert an intruder into a cluster, can people find it?

8. Intruder Test
   1. Randomly select one topic
   2. Randomly select four items from the topic
   3. Randomly select a second topic – the “intruder” topic
   4. Randomly select one item from the second topic – the “intruder” item
   5. Scramble the five items and let the user choose which one is the “intruder”
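The five steps above can be sketched in a few lines of Python. This is only an illustration, not the authors' implementation: the function name `make_intruder_unit`, the dictionary layout and the sample cluster contents are hypothetical, and the rule that the main topic needs at least four items is an assumption added so that step 2 is always possible.

```python
import random

def make_intruder_unit(clusters, rng):
    """Build one test unit from `clusters`, a dict of cluster id -> list of item ids."""
    # Step 1: pick a topic that has enough items for step 2 (assumed constraint).
    eligible = [cid for cid, items in clusters.items() if len(items) >= 4]
    topic = rng.choice(eligible)
    # Step 2: pick four distinct items from that topic.
    items = rng.sample(clusters[topic], 4)
    # Step 3: pick a different, non-empty topic as the intruder topic.
    others = [cid for cid, members in clusters.items() if cid != topic and members]
    intruder_topic = rng.choice(others)
    # Step 4: pick one item from the intruder topic.
    intruder = rng.choice(clusters[intruder_topic])
    # Step 5: scramble the five items before showing them to a participant.
    unit = items + [intruder]
    rng.shuffle(unit)
    return {
        "topic": topic,
        "intruder_topic": intruder_topic,
        "intruder": intruder,
        "items": unit,
    }

# Hypothetical clusters with made-up item ids.
clusters = {
    "churches": ["c1", "c2", "c3", "c4", "c5"],
    "harbours": ["h1", "h2", "h3", "h4"],
    "portraits": ["p1", "p2"],
}
unit = make_intruder_unit(clusters, random.Random(0))
print(unit)
```

Passing in a seeded `random.Random` keeps the generated units reproducible, which helps when the same unit has to be shown to many participants.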

9. Cluster Cohesion – Cohesive

10. Cluster Cohesion – Not Cohesive

11. Evaluation Metrics
• Cohesive
– “Intruder” is chosen significantly more frequently than by chance
– Choice distribution is significantly different from the uniform distribution
• Borderline cohesive
– Two out of five items make up > 95% of the answers
– “Intruder” is one of those two
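One way to turn these criteria into a decision rule is sketched below. The slides do not name the statistical tests or the significance level, so several choices here are assumptions: a one-sided exact binomial test against a chance rate of 1/5 for the intruder, a chi-square goodness-of-fit test against the uniform distribution, a 0.05 threshold, the requirement that both tests pass for "cohesive", and the order in which the categories are checked.

```python
from math import comb, exp

def binomial_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), computed exactly."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

def chi2_sf_df4(x):
    """Survival function of the chi-square distribution with 4 degrees of freedom."""
    return exp(-x / 2) * (1 + x / 2)

def classify_unit(votes, intruder_index, alpha=0.05):
    """Classify one unit from its per-item vote counts (exactly five items)."""
    if len(votes) != 5:
        raise ValueError("the closed-form chi-square tail assumes five items")
    n = sum(votes)
    expected = n / 5
    chi2 = sum((v - expected) ** 2 / expected for v in votes)
    p_intruder = binomial_tail(votes[intruder_index], n, 1 / 5)
    p_uniform = chi2_sf_df4(chi2)
    # Cohesive: intruder chosen above chance AND distribution is non-uniform.
    if p_intruder < alpha and p_uniform < alpha:
        return "cohesive"
    # Borderline: the two most-chosen items hold > 95% of votes, intruder among them.
    top_two = sorted(range(5), key=lambda i: votes[i], reverse=True)[:2]
    if intruder_index in top_two and sum(votes[i] for i in top_two) > 0.95 * n:
        return "borderline"
    return "non-cohesive"

# Hypothetical vote counts for units with 27 ratings each.
print(classify_unit([22, 2, 1, 1, 1], intruder_index=0))
print(classify_unit([5, 21, 1, 0, 0], intruder_index=0))
print(classify_unit([5, 6, 5, 6, 5], intruder_index=0))
```

With the made-up counts above, the first unit is classified as cohesive, the second as borderline (participants split between the intruder and one other item) and the third as non-cohesive.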

12. Evaluation Bounds
• Upper bound
– Manual annotation
   • 936 topics
• Lower bound
– 3 cohesive topics
– <5% likelihood of seeing that number of cohesive topics by chance
• Control data
– 10 “really, totally, completely obvious” intruders used to filter participants who randomly select answers

13. Experiment
• Crowd-sourced using staff & students at Sheffield University
– 700 participants
• 9 clustering strategies
– 30 units per strategy
– total of 270 units
• Results
– 8840 ratings
– 21 – 30 ratings per unit (median 27 ratings)

14. Results
Model            Cohesive  Borderline  Non-Cohesive
Upper Bound      27        0           3
Lower Bound      3         0           27
LDA 300          15        6           9
LDA 900          20        4           6
LDA + PMI 300    16        4           10
LDA + PMI 900    21        2           7
K-Means TFIDF    24        3           3
K-Means LDA      20        0           10
Optics TFIDF     14        2           14
Optics LDA       16        0           14
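The counts above are easier to compare as proportions of the 30 units evaluated per model. The short sketch below uses the figures from the table as given; the helper name `cohesive_share` is made up for this example.

```python
# (cohesive, borderline, non-cohesive) counts copied from the results table.
RESULTS = {
    "Upper Bound": (27, 0, 3),
    "Lower Bound": (3, 0, 27),
    "LDA 300": (15, 6, 9),
    "LDA 900": (20, 4, 6),
    "LDA + PMI 300": (16, 4, 10),
    "LDA + PMI 900": (21, 2, 7),
    "K-Means TFIDF": (24, 3, 3),
    "K-Means LDA": (20, 0, 10),
    "Optics TFIDF": (14, 2, 14),
    "Optics LDA": (16, 0, 14),
}

def cohesive_share(model):
    """Fraction of a model's units that were judged cohesive."""
    cohesive, borderline, non_cohesive = RESULTS[model]
    return cohesive / (cohesive + borderline + non_cohesive)

# Rank the automatic strategies, leaving out the two bounds.
automatic = [m for m in RESULTS if not m.endswith("Bound")]
best_automatic = max(automatic, key=cohesive_share)

for model in RESULTS:
    print(f"{model:15s} {cohesive_share(model):.0%}")
```

On these figures K-Means over TFIDF vectors reaches 24 of 30 cohesive units against 27 of 30 for the manual upper bound, while LDA with 900 topics reaches 20 of 30, or two thirds.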

15. Conclusions
• K-means almost as good as the human classification
• LDA is very fast and approximately two thirds of the topics are acceptably cohesive
• Future work:
– Make it hierarchical
– Create hybrid algorithms

16. Thank you for listening
Find out more about the project: http://www.paths-project.eu
m.mhall@sheffield.ac.uk
The research leading to these results has received funding from the European Community's Seventh Framework Programme (FP7/2007-2013) under grant agreement no 270082. We acknowledge the contribution of all project partners involved in PATHS (see: http://www.paths-project.eu).
