This document compares different taxonomies for organizing large collections of documents. It examines four existing manually created taxonomies (Library of Congress Subject Headings, WordNet Domains, Wikipedia Taxonomy, DBpedia) and two methods for automatically deriving taxonomies (WikiFreq and LDA topics) for organizing a large online cultural heritage collection from Europeana. It then presents two human evaluations of the taxonomies, measuring cohesion and analyzing concept relations, and finds that the manual taxonomies have high-quality relations while the novel automatic method generates very high cohesion.
Global Library of Life: The Biodiversity Heritage Library | Martin Kalfatovic
Global Library of Life: The Biodiversity Heritage Library. Martin R. Kalfatovic. Boston Library Consortium Meeting. Boston Public Library. 18 March 2008. Boston, MA.
The document discusses several key points about library classification systems:
1) Shiyali Ramamrita Ranganathan introduced the Colon Classification system which classifies knowledge into broad concepts called facets, distinguished using colons.
2) Specialized libraries may develop their own customized systems, like the Mathematics Subject Classification.
3) The National Library of Medicine classification was adopted by most US medical libraries, using the letter W for medicine and parts of class Q for science.
4) Subject headings allow libraries to assign multiple terms to works covering multiple topics, improving searchability.
Knowledge Organisation Systems in Digital Libraries: A Comparative Study | Bhojaraju Gunjal
The document presents a study that compares the different Knowledge Organization Systems (KOS) used in major digital libraries. It finds that while traditional libraries used standardized systems like classification schemes, digital libraries employ various KOS tools including thesauri, ontologies, and subject headings. The study analyzes the specific KOS used in different digital libraries and summarizes the current state of KOS in these libraries.
Modelling Knowledge Organization Systems and Structures | Marcia Zeng
In this paper FRSAD (as a conceptual model) is compared to SKOS and SKOS XL (as data models), with implementation examples. ISKO-UK 2011 Conference, London, July 2011.
ABSTRACT. There is an emergent, global trend toward integrating bibliometrics and text-mining practices and tools into new academic library services. In this poster, we present three recent local initiatives, their targets and first results, undertaken in order to support scientific research activities at universities. The cases of three university libraries in South America are presented: Universidade de São Paulo (USP) in Brazil, and two in Colombia, Universidad Nacional in Medellín and Universidad Santo Tomás in Bucaramanga. According to data collected through a survey, there is evidence of growth, after library training programmes, in the use of structured electronic information databases. This is especially so for sources indexed in citation databases such as Scopus and Web of Science, and in the volume of records downloaded. There is also evidence of mutually beneficial engagement between researchers and librarians in the development of state-of-the-art technology and research trend reports. Finally, it is too early to know the extent and impact of these new activities in redefining the role of libraries in the knowledge transfer processes of academic communities.
Poster presented at 4th Global Tech Mining Conference (GTM 2014), Leiden, The Netherlands, Sept. 2, 2014. www.gtmconference.org/
Knowledge Organisation Systems in ETDs: A Comparative Study | Bhojaraju Gunjal
Gunjal, Bhojaraju & Urs, Shalini R (2010): Knowledge Organisation Systems in ETDs: A Comparative Study; In Proceedings of 13th International Symposium on Electronic Theses and Dissertations (ETD 2010); June 16 -18, 2010, Austin, TX, USA. https://conferences.tdl.org/utlibraries/etd2010/paper/view/28
Wikipedia Categories on Research: Towards a Qualitative Review of Uses and A... | Jesús Tramullas
This document summarizes a literature review on the uses and applications of Wikipedia categories in research. It finds that researchers use Wikipedia categories in two main ways: to analyze the category system itself, and to use the categories in information processing studies on other document corpora. The categories have been used for information retrieval, entity processing, indexing, classification, creating taxonomies and ontologies, and semantic treatment. The researchers conclude that Wikipedia provides an important research area and tool to study and validate information processing and classification approaches.
RAPID INDUCTION OF MULTIPLE TAXONOMIES FOR ENHANCED FACETED TEXT BROWSING | ijaia
In this paper we present and compare two methodologies for rapidly inducing multiple subject-specific taxonomies from crawled data. The first method uses sentence-level word co-occurrence frequencies to build the taxonomy, while the second bootstraps a Word2Vec-based algorithm with a directed crawler. We exploit the multilingual open-content directory of the World Wide Web, DMOZ, to seed the crawl, and the domain name to direct the crawl. This domain corpus is then input to our algorithm, which can automatically induce taxonomies. The induced taxonomies provide hierarchical semantic dimensions for the purposes of faceted browsing. As part of an ongoing personal semantics project, we applied the resulting taxonomies to personal social media data (Twitter, Gmail, Facebook, Instagram, Flickr) with the objective of enhancing an individual's exploration of their personal information through faceted searching. We also perform a comprehensive corpus-based evaluation of the algorithms on many datasets drawn from the fields of medicine (diseases) and leisure (hobbies) and show that the induced taxonomies are of high quality.
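The sentence-level co-occurrence approach described in this abstract can be sketched roughly as follows. This is an illustrative reconstruction, not the paper's implementation: the whitespace tokeniser, the toy corpus, and the heuristic of treating the more frequent co-occurring term as the more general (parent) term are all assumptions.

```python
from collections import Counter
from itertools import combinations

def induce_taxonomy(sentences):
    """Induce a simple term taxonomy from sentence-level co-occurrence.

    Assumed heuristic: a term's parent is the more frequent term it
    co-occurs with most often; frequency is a proxy for generality.
    """
    term_freq = Counter()
    cooc = Counter()
    for sent in sentences:
        terms = set(sent.lower().split())
        term_freq.update(terms)
        for a, b in combinations(sorted(terms), 2):
            cooc[(a, b)] += 1
            cooc[(b, a)] += 1
    parent = {}
    for t in term_freq:
        # Candidate parents: strictly more frequent terms seen with t.
        candidates = [(cooc[(t, u)], term_freq[u], u)
                      for u in term_freq
                      if u != t and term_freq[u] > term_freq[t] and cooc[(t, u)] > 0]
        if candidates:
            parent[t] = max(candidates)[2]
    return parent

# Invented toy corpus standing in for the crawled domain corpus.
corpus = [
    "diabetes is a metabolic disease",
    "hypertension is a cardiovascular disease",
    "disease research covers diabetes and hypertension",
]
print(induce_taxonomy(corpus))
```

On this toy corpus the frequent term "disease" is chosen as the parent of both "diabetes" and "hypertension", which is the kind of hierarchical semantic dimension faceted browsing needs.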
Mail migration to Office 365 | Optimizing the mail migration throughput | Par... | Eyal Doron
This document provides recommendations for optimizing the throughput of migrating email from an on-premises Exchange server to Office 365. It discusses factors like available bandwidth, network infrastructure tuning, Exchange server performance, and using multiple Exchange sites in parallel. Specific tips include using non-work hours for migration, checking for bottlenecks, adjusting the MRSProxy resource allocation, and taking advantage of hybrid migration methods to reduce migrated data size. The overall goal is to understand limitations and make adjustments to complete large-scale email migrations within tight windows.
This document summarizes the services provided by sriServices, a UK-based firm that aims to grow sustainable, responsible, and ethical investments. It provides online tools and support to help financial advisors integrate consideration of environmental, social, and governance issues into their practices. The document highlights the growing diversity and demand in the sustainable investment field. It also provides an overview of the different approaches funds take and illustrates the variety of sustainable investing styles using a fund classification system. The goal is to help advisors better understand options and match client needs and values to appropriate fund choices.
Semantic Enrichment of Cultural Heritage content in PATHSpathsproject
Semantic Enrichment of Cultural Heritage content in PATHS, report by Mark Stevenson and Arantxa Otegi with Eneko Agirre, Nikos Aletras, Paul Clough, Samuel Fernando and Aitor Saroa.
The aim of the PATHS project is to enable exploration and discovery within cultural heritage collections. In order to support this the project developed a range of enrichment techniques which augmented these collections with additional information to enhance the users’ browsing experience. One of the demonstration systems developed in PATHS makes use of content from Europeana. This document summarises the semantic enrichment techniques developed in PATHS, with particular reference to their application to the Europeana data.
Drupal Commerce Presentation at OpenExpo Ecommerce | OpenExpo
This document introduces Drupal as a framework and CMS, explaining its layers, hooks, and role permissions. It also describes the Drupal community and resources such as modules, themes, and example sites like Kickstarter that use Drupal Commerce for online stores. In summary, it presents Drupal's main features, its architecture, content administration, and commerce through Drupal Commerce.
Autodiscover flow in an Office 365 environment | Part 3#3 | Part 31#36 | Eyal Doron
Autodiscover flow in an Office 365 environment | Part 3#3 | Part 31#36
Detailed description of the Autodiscover flow implemented between an Autodiscover client and its Autodiscover endpoint (Exchange server) in a scenario in which the mail infrastructure is an Office 365 environment (Exchange Online).
This is the third article, in a series of three articles.
http://o365info.com/autodiscover-flow-in-an-office-365-environment-part-3-of-3-part-31-of-36
Eyal Doron | o365info.com
Public SAN certificate | Deprecated support in the internal server name | Part ... | Eyal Doron
The document discusses changes to the standards for public SAN certificates by the CA/Browser Forum. As of November 1, 2015, certificates will no longer be able to include non-unique names, also called internal server names, that cannot be externally verified. This is because such names pose a security risk, as they are not truly unique and could be falsely represented. The changes will affect services that currently use certificates for internal server names, requiring administrators to prepare by obtaining new certificates without such names before the deadline. Failure to do so could result in issues ranging from annoyances to catastrophic consequences for their systems and services.
PATHS at Digital Humanities Congress 2012pathsproject
A presentation about the project given by N. Aletras, P.D. Clough, S. Fernando, N. Ford, P. Goodale, M.M. Hall and M. Stevenson from the Information School / Department of Computer Science, University of Sheffield
Sheffield, 6th – 8th September 2012
Exchange In-Place eDiscovery & Hold | Introduction | 5#7 | Eyal Doron
Exchange In-Place eDiscovery & Hold | Introduction | 5#7
http://o365info.com/exchange-in-place-ediscovery-hold-introduction-part-5-7
The Exchange In-Place Hold & eDiscovery is a very powerful tool that can help us accomplish three main tasks:
1. Search for information (mail items) in single or multiple mailboxes
2. Put specific information on “hold” (preserving the information for an unlimited time period)
3. Recover deleted mail items
In this article, we will review the logic and the concepts of the Exchange In-Place Hold & eDiscovery tool.
In the next article, we will demonstrate how to use the Exchange In-Place Hold & eDiscovery tool for recovering deleted mail items.
Eyal Doron | o365info.com
The document summarises the activities carried out in the school libraries of the Agrupamento de Escolas Dr. Júlio Martins in January 2016, including a training session on reading aloud, visits by first-year pupils to the municipal library, oral health sessions for pupils and parents, and the first phase of the Concurso Nacional de Leitura (National Reading Contest).
The document discusses personal information management (PIM) tools and strategies. It describes how PIM has been an issue since information became available and outlines some common PIM tools like email, calendars, computer desktop organization, and websites. It also discusses the implications of increased digital information storage, such as challenges around saving, organizing, and retrieving personal information across multiple tools and locations.
Processing an Archival Collection of Scientific Papers | Sean Ryan
The six articles discuss issues related to processing a scientific archival collection. They provide insights into establishing context through surveying, adding value by digitizing items, conducting research on the science and researchers, balancing original order with usability, criteria for retention, and reflecting on other experiences processing collections. The articles reinforce the importance of understanding the scientific environment and community that would use the materials. They support following guidelines while using some creativity to organize the mycology collection in a way that is useful for researchers.
A Science Mapping Analysis of Blood Donation Behaviour | Bria Davis
This study analyzed 963 scholarly articles on blood donation behavior published between 1957 and 2017. It used bibliometric methods including keyword co-occurrence analysis and science mapping to identify the major topics, influential authors, journals, and countries contributing to research in this area. The analysis found that research output has significantly increased over time, with the most publications in recent years. The authors publishing the most papers were Christopher France (44 papers), Blaine Ditto (24), and Eamonn Ferguson (23). The most influential journal was Transfusion, which published 12.36% of the papers analyzed. The study provides a comprehensive overview of the structure and evolution of scientific research on blood donation behavior.
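The keyword co-occurrence analysis at the heart of such science-mapping studies can be sketched as below. This is a minimal illustration of the general technique, not the study's actual pipeline; the toy article keyword lists are invented stand-ins for Scopus/Web of Science author-keyword fields.

```python
from collections import Counter
from itertools import combinations

def keyword_cooccurrence(records):
    """Count keyword pairs that co-occur in the same article.

    Each record is one article's keyword list; a pair is counted
    once per article, regardless of repetition within the record.
    """
    pairs = Counter()
    for keywords in records:
        for a, b in combinations(sorted(set(keywords)), 2):
            pairs[(a, b)] += 1
    return pairs

# Invented toy records standing in for indexed keyword fields.
articles = [
    ["blood donation", "motivation", "questionnaire"],
    ["blood donation", "vasovagal reaction", "anxiety"],
    ["blood donation", "motivation", "donor retention"],
]
cooc = keyword_cooccurrence(articles)
print(cooc[("blood donation", "motivation")])  # 2
```

The resulting pair counts form the weighted co-occurrence network that science-mapping tools then cluster and visualise to reveal major research topics.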
SWSN UNIT-3.pptx | gowthamnaidu0986
Ontology engineering involves constructing ontologies through various methods. It begins with defining the scope and evaluating existing ontologies for reuse. Terms are enumerated and organized in a taxonomy with defined properties, facets, and instances. The ontology is checked for anomalies and refined iteratively. Popular tools for ontology development include Protégé and WebOnto. Methods such as METHONTOLOGY and the On-To-Knowledge methodology provide processes for building ontologies from scratch or reusing existing ones. Ontology sharing requires mapping between ontologies to allow interoperability, and libraries exist for storing and accessing ontologies.
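The enumerate-terms, build-taxonomy, add-properties, add-instances, check-for-anomalies pipeline described above can be illustrated with a deliberately tiny sketch. The domain terms, relations, and the two anomaly checks are invented for the example; real ontology work would use a tool like Protégé and a language like OWL.

```python
# Minimal toy ontology: terms organized in a taxonomy with
# properties and instances, as in the engineering process above.
ontology = {
    "classes": {"Publication": None,          # term -> superclass
                "Thesis": "Publication",
                "Article": "Publication"},
    "properties": {"hasAuthor": ("Publication", "str")},  # domain, range
    "instances": {"etd42": "Thesis"},
}

def check_anomalies(onto):
    """Detect two simple anomalies: dangling superclasses and cycles."""
    classes = onto["classes"]
    problems = []
    for cls, sup in classes.items():
        if sup is not None and sup not in classes:
            problems.append(f"dangling superclass: {cls} -> {sup}")
        seen, cur = {cls}, sup
        while cur is not None:               # walk up; a revisit is a cycle
            if cur in seen:
                problems.append(f"cycle at {cls}")
                break
            seen.add(cur)
            cur = classes.get(cur)
    return problems

print(check_anomalies(ontology))  # [] -- this toy ontology is consistent
```

Iterative refinement then repeats: add or reorganize terms, rerun the checks, and fix whatever they report before publishing or sharing the ontology.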
This document discusses syntactic variation research and provides examples of syntactic doubling. It notes four major developments in the field: 1) increased focus on dialect data; 2) incorporation of sociolinguistic methodology; 3) improved data accessibility through technology; 4) the Minimalist hypothesis that syntactic variation arises through interaction between universal principles and extra-syntactic factors. Large dialect syntax databases now allow investigation of correlations and patterns across languages. Examples are given of syntactic doubling phenomena like Wh-doubling and negative concord in various dialects.
RELATIONSHIP OF LIBRARY SCIENCE WITH INFORMATION SCIENCE | Libcorpio
Keywords: Library Science, Information Science, Library and Information Science (LIS), relationship of Library Science with Information Science, Library Science vs Information Science, similarities and differences.
This document discusses using scientometric techniques to measure knowledge codification and abstraction in scientific disciplines. It proposes applying these techniques to analyze the ATLAS experiment at CERN. The techniques include mapping knowledge domains using bibliographic coupling, co-citation analysis, and co-word analysis to identify research topics, fronts, and relationships over time. This would help evaluate the degree of codification and abstraction in a field on a high, medium, or low scale. The document recommends initially using co-word co-occurrence analysis on ISI records before exploring other literature-based techniques to further analyze knowledge use and transmission.
The repository ecology: an approach to understanding repository and service i... | R. John Robertson
An increasing number of university institutions and other organisations are deciding to deploy repositories and a growing number of formal and informal distributed services are supporting or capitalising on the information these repositories provide. Despite reasonably well understood technical architectures, early majority adopters may struggle to articulate their place within the actualities of a wider information environment. The idea of a repository ecology provides developers and administrators with a useful way of articulating and analysing their place in the information environment, and the technical and organisational interactions they have, or are developing, with other parts of such an environment. This presentation will provide an overview of the concept of a repository ecology and examine some examples from the domains of scholarly communications and elearning.
ONTOLOGY SERVICE CENTER: A DATAHUB FOR ONTOLOGY APPLICATION | IJwest
With the growth of data-oriented research in the humanities, a large number of research datasets have been created and published through web services. However, discovering, integrating and reusing these distributed, heterogeneous research datasets is a challenging task. Ontologies act as the connective tissue between digital humanities resources, providing a good way for people to discover and understand these datasets. With the release of more and more linked open data and knowledge bases, a large number of ontologies have been produced at the same time. These ontologies have different publishing formats, consumption patterns and interaction styles, which are not conducive to users' understanding of the datasets or to reuse of the ontologies. The Ontology Service Center (OSC) platform consists of an Ontology Query Center and an Ontology Validation Center, built mainly on linked data and ontology-based technologies. The Ontology Query Center realizes ontology publishing, querying, data interaction and online browsing, while the Ontology Validation Center can verify how particular ontologies are used in linked datasets. The empirical part of the paper uses a Confucius portrait as an example of how OSC can be used in the semantic annotation of images. In short, the purpose of the paper is to construct an applied ecology of ontologies to promote the development of knowledge graphs and the spread of ontologies.
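The validation idea described above — checking which of an ontology's terms a linked dataset actually uses — can be illustrated with plain RDF-style triples. This is a minimal stdlib sketch, not the OSC implementation; the `foaf:` terms and the toy ontology below are placeholders:

```python
# Triples are (subject, predicate, object) tuples, as in RDF.
ONTOLOGY_TERMS = {
    # Classes and properties the ontology defines (hypothetical toy vocabulary).
    "foaf:Person", "foaf:name", "foaf:knows",
}

def terms_used(dataset):
    """Collect every class and property a dataset's triples refer to."""
    used = set()
    for s, p, o in dataset:
        if p == "rdf:type":
            used.add(o)   # the class the resource is typed with
        else:
            used.add(p)   # an ordinary property
    return used

def validate(dataset, ontology_terms=ONTOLOGY_TERMS):
    """Split the dataset's vocabulary into ontology-covered vs. unknown terms."""
    used = terms_used(dataset)
    return sorted(used & ontology_terms), sorted(used - ontology_terms)
```

Running `validate` over a dataset immediately shows which parts of its vocabulary a given ontology accounts for — the kind of usage report a validation service could aggregate across many linked datasets.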
The Computer Science Ontology: A Large-Scale Taxonomy of Research AreasAngelo Salatino
Ontologies of research areas are important tools for characterising, exploring, and analysing the research landscape. Some fields of research are comprehensively described by large-scale taxonomies, e.g., MeSH in Biology and PhySH in Physics. Conversely, current Computer Science taxonomies are coarse-grained and tend to evolve slowly. For instance, the ACM classification scheme contains only about 2K research topics and the last version dates back to 2012. In this paper, we introduce the Computer Science Ontology (CSO), a large-scale, automatically generated ontology of research areas, which includes about 15K topics and 70K semantic relationships. It was created by applying the Klink-2 algorithm on a very large dataset of 16M scientific articles. CSO presents two main advantages over the alternatives: i) it includes a very large number of topics that do not appear in other classifications, and ii) it can be updated automatically by running Klink-2 on recent corpora of publications. CSO powers several tools adopted by the editorial team at Springer Nature and has been used to enable a variety of solutions, such as classifying research publications, detecting research communities, and predicting research trends. To facilitate the uptake of CSO we have developed the CSO Portal, a web application that enables users to download, explore, and provide granular feedback on CSO at different levels. Users can use the portal to rate topics and relationships, suggest missing relationships, and visualise sections of the ontology. The portal will support the publication of and access to regular new releases of CSO, with the aim of providing a comprehensive resource to the various communities engaged with scholarly data.
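One of the CSO uses mentioned above, classifying research publications against the ontology, can in its simplest syntactic form be sketched as topic-label matching. This is a toy sketch under that assumption — production classifiers for CSO are considerably more sophisticated, and the topic list here is invented:

```python
def classify(text, topics):
    """Return the ontology topics whose label occurs in the text (case-insensitive)."""
    lowered = text.lower()
    return sorted(t for t in topics if t.lower() in lowered)

# Hypothetical slice of an ontology's topic labels.
topics = ["Machine Learning", "Ontology", "Semantic Web"]
print(classify("We use machine learning over semantic web data.", topics))
```

Matching labels syntactically is only a first pass; real systems also need to handle synonyms and topics that are implied rather than mentioned verbatim.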
Adoption Of Online Databases In Public Libraries An Australian Case StudyLori Moore
This document summarizes a case study that evaluated the adoption of online databases in public libraries in Victoria, Australia. Researchers used an interpretivist framework and Rogers' diffusion of innovations theory to understand how both library staff and patrons perceived and used four new online databases. Through focus groups and surveys with staff and patrons across different libraries, the study found that training was key to improving understanding and adoption of the new databases, as both staff and patrons were slow to accept the technologies initially. The researchers believe Rogers' theory can help explain barriers to adoption for new technologies in libraries.
The document summarizes techniques for identifying themes in qualitative research. Themes are abstract constructs that link expressions and can come in various shapes; they can come from the data, the investigator's prior understanding, characteristics of the phenomenon, definitions, common constructs, and personal experiences. It outlines several techniques for identifying themes, including repetitions, indigenous typologies/categories, metaphors and analogies, transitions, similarities and differences, linguistic connectors, missing data, theory-related material, and processing techniques such as cutting and sorting, multi-dimensional scaling, words and keywords in context, and word co-occurrence. It evaluates the techniques based on the type of data, required expertise, labor required, and the number and types of themes to be identified.
The document discusses the basics of ontologies, including their origin in philosophy, definitions, types, benefits and application areas. Some key points are:
- An ontology is a formal specification of a conceptualization used to help humans and programs share knowledge. It establishes a shared vocabulary for exchanging information.
- Ontologies describe domain knowledge and provide an agreed-upon understanding of a domain through concepts and relations. They help solve problems of ambiguity and enable knowledge sharing.
- Ontologies benefit applications like information retrieval, digital libraries, knowledge engineering and natural language processing by facilitating semantic search and integration of data.
Ontology Based Approach for Classifying Biomedical Text AbstractsWaqas Tariq
Classifying biomedical literature is a difficult and challenging task, especially when a large number of biomedical articles must be organized into a hierarchical structure. Various classification methods have been proposed to classify biomedical literature and help users find relevant articles on the web. In this paper, we propose a new approach to classifying a collection of biomedical text abstracts using an ontology alignment algorithm we have developed. We construct the OHSUMED disease hierarchy as the initial training hierarchy and the Medline abstract disease hierarchies as the testing hierarchy. To enrich the training hierarchy, we use relevant features extracted from selected categories in the OHSUMED dataset as feature vectors; these feature vectors are mapped to each node or concept in the OHSUMED disease hierarchy according to their specific category. We then align and match the concepts in both hierarchies using our ontology alignment algorithm to find probable concepts or categories. Subsequently, we compute the cosine similarity score between the feature vectors of probable concepts in the "enriched" OHSUMED disease hierarchy and the Medline abstract disease hierarchy. Finally, we assign a category to each new Medline abstract based on the highest cosine similarity score. The experimental results demonstrate that the proposed approach to hierarchical classification performs slightly better than multi-class flat classification.
This document summarizes key concepts from a paper on the CIDOC Conceptual Reference Model (CRM), an ontology for semantic interoperability of cultural heritage data. It discusses the CRM's property-centric approach and methodology, which aims for read-only integration of heterogeneous cultural data sources. The CRM is designed to reconstruct possible past worlds from loosely correlated historical records in a way that allows for monotonic extension of the ontology over time without revising existing definitions.
1) The document analyzes information practices of teenagers across three spaces: the classroom, school library, and home.
2) It discusses the school library as a "heterotopic place," meaning it is a space that challenges the norms and power structures of the school institution by allowing more informal learning.
3) Using ethnographic research methods like observation and interviews, the document examines how the school library facilitates connections between students and documents in a more open environment compared to the disciplined classroom, making it a paradoxical space within the school.
Use of ontologies in natural language processingATHMAN HAJ-HAMOU
The document discusses the limitations of classical natural language processing approaches that rely solely on text-based search and statistics. It describes how ontologies can help address these limitations by representing knowledge at the semantic level rather than just words. The document outlines how ontologies can solve problems related to vagueness, high-level concepts, semantic relations, and time dimensions. It then summarizes two research papers that explore using ontologies to index and categorize documents as well as provide multiple views and dimensions to explore a document collection.
Similar to Comparing taxonomies for organising collections of documents (20)
Roadmap from ESEPaths to EDMPaths: a note on representing annotations resulti...pathsproject
Roadmap from ESEPaths to EDMPaths: a note on representing annotations resulting from automatic enrichment - Aitor Soroa, Eneko Agirre, Arantxa Otegi and Antoine Isaac
This document is a case study on using the Europeana Data Model (EDM) [Doerr et al., 2010] for representing annotations of Cultural Heritage Objects (CHO). One of the main goals of
the PATHS project is to augment CHOs (items) with information that will enrich the user’s experience. The additional information includes links between items in cultural collections and from items to external sources like Wikipedia. With this goal, the PATHS project has applied Natural Language Processing (NLP) techniques on a subset of the items in Europeana.
Aletras, Nikolaos and Stevenson, Mark (2013) "Evaluating Topic Coherence Us...pathsproject
This document introduces distributional semantic similarity methods for automatically measuring the coherence of topics generated by topic models. It constructs semantic spaces to represent topic words using Wikipedia as a reference corpus. Relatedness between topic words and context features is measured using variants of Pointwise Mutual Information. Topic coherence is determined by measuring the distance between word vectors. Evaluation on three datasets shows distributional measures outperform the state-of-the-art approach, with performance improving using a reduced semantic space.
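The paper above measures coherence over distributional word vectors; a commonly used corpus-based variant instead scores a topic by the average pointwise mutual information (PMI) of its top word pairs over document co-occurrence. A minimal sketch of that variant (the document sets and smoothing constant are illustrative assumptions, not the paper's exact method):

```python
import math
from itertools import combinations

def pmi(w1, w2, doc_sets, n_docs, eps=1e-12):
    """PMI of two words, estimated from document co-occurrence counts."""
    p1 = len(doc_sets[w1]) / n_docs
    p2 = len(doc_sets[w2]) / n_docs
    p12 = len(doc_sets[w1] & doc_sets[w2]) / n_docs
    return math.log((p12 + eps) / (p1 * p2))  # eps avoids log(0)

def topic_coherence(topic_words, doc_sets, n_docs):
    """Average PMI over all pairs of a topic's top words."""
    pairs = list(combinations(topic_words, 2))
    return sum(pmi(a, b, doc_sets, n_docs) for a, b in pairs) / len(pairs)
```

Coherent topics, whose top words frequently co-occur, score above zero; topics whose words never co-occur are pushed heavily negative by the smoothing term.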
PATHSenrich: A Web Service Prototype for Automatic Cultural Heritage Item Enr...pathsproject
PATHSenrich: A Web Service Prototype for Automatic Cultural Heritage Item Enrichment, Eneko Agirre, Ander Barrena, Kike Fernandez, Esther Miranda, Arantxa Otegi, and Aitor Soroa, paper presented the international conference on Theory and Practice in Digital Libraries, TPDL 2013
Large amounts of cultural heritage material are nowadays available through online digital library portals. Most of these cultural items have short descriptions and lack rich contextual information. The PATHS project has developed experimental enrichment services. As a proof of concept, this paper presents a web service prototype which allows independent content providers to enrich cultural heritage items with a subset of the full functionality: links to related items in the collection and links to related Wikipedia articles. In the future we plan to provide more advanced functionality, as available offline for PATHS.
Implementing Recommendations in the PATHS system, SUEDL 2013pathsproject
Implementing Recommendations in the PATHS system, Paul Clough, Arantxa Otegi, Eneko Agirre and Mark Hall, paper presented at the Supporting Users Exploration of Digital Libraries, SUEDL 2013 workshop, during TPDL 2013 in Valetta, Malta
In this paper we describe the design and implementation of non-personalized recommendations in the PATHS system. This system allows users to explore items from Europeana in new ways. Recommendations of the type "people who viewed this item also viewed this item" are powered by pairs of viewed items mined from Europeana. However, due to limited usage data only 10.3% of items in the PATHS dataset have recommendations (4.3% of item pairs visited more than once). Therefore, "related items", a form of content-based recommendation, are offered to users based on identifying similar items. We discuss some of the problems with implementing recommendations and highlight areas for future work in the PATHS project.
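Mining "people who viewed this also viewed that" recommendations from usage logs amounts to counting co-viewed item pairs. A minimal sketch of that idea (the session data and function names are hypothetical, not the PATHS implementation):

```python
from collections import Counter
from itertools import combinations

def mine_coviews(sessions):
    """Count how often each pair of items was viewed in the same session."""
    pair_counts = Counter()
    for session in sessions:
        for a, b in combinations(sorted(set(session)), 2):
            pair_counts[(a, b)] += 1
    return pair_counts

def recommend(item, pair_counts, top_n=3):
    """Items most often co-viewed with the given item, best first."""
    scores = Counter()
    for (a, b), count in pair_counts.items():
        if a == item:
            scores[b] += count
        elif b == item:
            scores[a] += count
    return [other for other, _ in scores.most_common(top_n)]
```

The sparsity problem the paper reports falls directly out of this approach: an item with no co-viewed pairs in the logs simply gets an empty recommendation list, which is why a content-based fallback is needed.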
User-Centred Design to Support Exploration and Path Creation in Cultural Her...pathsproject
This document describes research on developing a prototype system to enhance user interaction with cultural heritage collections through a pathway metaphor. It involved gathering user requirements through surveys and interviews. Key findings include:
1) Existing online paths tend to be linear and static, limiting exploration, though users preferred more flexible, theme-based paths that allowed branching.
2) Interviews found the path metaphor could represent search histories, journeys of discovery, linked metadata, guides into collections, routes through collections, and more.
3) An interaction model was developed involving consuming, collecting, creating and communicating about paths to support exploration, learning and engagement.
4) The prototype aims to integrate path creation, use and sharing to better support
Generating Paths through Cultural Heritage Collections Latech2013 paperpathsproject
Generating Paths through Cultural Heritage Collections, Samuel Fernando, Paula Goodale, Paul Clough, Mark Stevenson, Mark Hall and Eneko Agirre. Paper presented at Latech 2013
Cultural heritage collections usually organise sets of items into exhibitions or guided tours. These items are often accompanied by text that describes the theme and topic of the exhibition and provides background context and details of connections with other items. The PATHS project brings the idea of guided tours to digital library collections, where a tool for creating virtual paths is used to assist navigation and provide guides on particular subjects and topics. In this paper we characterise and analyse paths of items created by users of our online system.
Supporting User's Exploration of Digital Libraries, Suedl 2012 workshop proce...pathsproject
Workshop proceedings from the International workshop on Supporting Users Exploration of Digital Libraries, SUEDL 2012 which was held at TPDL 2012 (the international conference on Theory and Practice in Digital Libraries), Paphos, Cyprus, September 2012.
The aim of the workshop was to stimulate collaboration from experts and stakeholders in Digital Libraries, Cultural Heritage, Natural Language Processing and Information Retrieval in order to explore methods and strategies to support exploration of Digital Libraries, beyond the white box paradigm of search and click.
The proceedings includes:
"Browsing Europeana - Opportunities and Challenges", David Haskiya
"Query re-writing using shallow language processing effects", Anna Mastora and Sarantos Kapidakis
"Visualising Television Heritage", Johan Oomen et al.
"Providing suitable information access for new users of Digital Libraries", Rike Brecht et al.
"Exploring Pelagios: a Visual Browser for Geo-tagged datasets", Rainer Simon et al.
PATHS state of the art monitoring reportpathsproject
This document provides an update to the Initial State of the Art Monitoring report delivered by the project. The report covers the areas of Educational Informatics, Information Retrieval, and Semantic Similarity and Relatedness.
Recommendations for the automatic enrichment of digital library content using...pathsproject
Recommendations for the enrichment of digital library content using open source software, PATHS report by Eneko Agirre and Arantxa Otegi
The goal of this document is to present an overall set of recommendations for the automatic enrichment of Digital Library content using open source software. It is intended to be useful for third-parties who would like to offer enrichment services. Note that this is not a step-by-step guide for reimplementation, but an overall view of the software required and the programming effort involved.
Generating Paths through Cultural Heritage Collections, LATECH 2013 paperpathsproject
Generating Paths through Cultural Heritage Collections Samuel Fernando, Paula Goodale, Paul Clough, Mark Stevenson, Mark Hall and Eneko Agirre.
The PATHS project brings the idea of guided tours to digital library collections, where a tool for creating virtual paths is used to assist navigation and provide guides on particular subjects and topics. In this paper we characterise and analyse paths of items created by users of our online system.
Generating PATHS through Cultural Heritage Collections, Samuel Fernando, Paula Goodale, Paul Clough,
Mark Stevenson, Mark Hall, Eneko Agirre. Presentation given at LaTeCH 2013, ACL Workshop, Sofia, Bulgaria.
The PATHS project is a 3-year EU-funded project involving 6 partners across 5 countries. The project aims to introduce personalized paths into digital cultural heritage collections to provide more engaging access to large volumes of online material. The PATHS system enriches metadata through natural language processing and links items within collections and to external resources. It provides various tools for browsing, searching and creating paths. Two rounds of user evaluations found the path creation tools and search mechanisms were well received. Outcomes include the PATHS API and potential commercialization of components and consultancy services.
This document summarizes the PATHS project, which developed tools for exploring digital cultural heritage collections. The project involved 6 partners across 5 countries. It researched methods for navigating collections, including user-created paths and natural language processing of metadata. Users can browse collections through a thesaurus, tag cloud, or topic map. The system allows users to create and publish nonlinear paths through the collection with descriptions. The tools have potential for classroom activities, curated collections, and research.
Comparing taxonomies for organising collections of documents presentationpathsproject
This document compares different taxonomies for organizing large collections of documents. It evaluates taxonomies that were either manually created (LCSH, WordNet domains, Wikipedia taxonomy, DBpedia ontology) or automatically derived from document data using LDA topic modeling or Wikipedia link frequencies. The document describes applying these taxonomies to a collection of over 550,000 items from Europeana, a digital library. It then evaluates the taxonomies based on how cohesive the groupings are and how accurately the relationships between parent and child nodes are classified.
SemEval-2012 Task 6: A Pilot on Semantic Textual Similaritypathsproject
This document describes the SemEval-2012 Task 6 on semantic textual similarity. The task involved measuring the semantic equivalence of sentence pairs on a scale from 0 to 5. The training data consisted of 2000 sentence pairs from existing paraphrase and machine translation datasets. The test data also had 2000 sentence pairs from these datasets as well as surprise datasets. Systems were evaluated based on their Pearson correlation with human annotations. 35 teams participated and the best systems achieved a Pearson correlation over 80%. This pilot task established semantic textual similarity as an area for further exploration.
A pilot on Semantic Textual Similaritypathsproject
This document summarizes the SemEval 2012 task on semantic textual similarity. It describes the motivation for the task as measuring similarity between text fragments on a graded scale. It then outlines the datasets used, including the MSR paraphrase corpus, MSR video corpus, WMT evaluation data, and OntoNotes word sense data. It also discusses the annotation process, which involved a pilot with authors and crowdsourcing through Mechanical Turk. The results showed most systems performed better than baselines and the best systems achieved correlations over 0.8 with human judgments.
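Systems in the task were scored by the Pearson correlation between their similarity scores and the human ratings. The statistic itself is straightforward to compute from first principles:

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)
```

A value of 1.0 means a system's scores rise and fall perfectly linearly with the human judgments; the best systems in the task reached correlations above 0.8 on this scale.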
PATHS Final prototype interface design v1.0pathsproject
This document summarizes the design methodology and current status of the interface design for the second prototype of the PATHS project. It begins with a three-stage design methodology that includes: evaluating the first prototype design process, creating low-fidelity storyboards, and developing high-fidelity interaction designs. It then reviews lessons learned from developing the first prototype interface. The document introduces new user interface components and presents preliminary high-fidelity designs for key pages like the landing page, path editing, and item pages. Expert evaluation of the designs is planned along with user evaluation of a working prototype. The goal is to address issues identified in prior evaluations and create an intuitive interface for the PATHS cultural heritage system.
PATHS Evaluation of the 1st paths prototypepathsproject
This document summarizes the evaluation of the first prototype of the PATHS project. It describes the evaluation methodology, which included field-based demonstrations and laboratory evaluations. Results are presented from both types of evaluations, including participant demographics, user feedback on ease of use and usefulness of PATHS, suggested improvements, and results from structured tasks conducted in the laboratory evaluations. The document also reviews how well the first PATHS prototype met its functional specifications.
PATHS Second prototype-functional-specpathsproject
This document provides the functional specifications for the second prototype of the PATHS project. It outlines functions to be implemented based on priorities - functions that resolve critical user interface issues are highest priority. The second prototype will address issues identified in testing the first prototype and provide enhanced functionality such as improved search, user accounts/permissions, and workspace customization. The specifications draw on recommendations from evaluating the first prototype to improve the PATHS system.
Comparing taxonomies for organising collections of documents
1. Comparing taxonomies for organising collections of documents
Samuel Fernando1, Mark Hall1,2, Eneko Agirre3,
Aitor Soroa3, Paul Clough2, Mark Stevenson1
(1) Department of Computer Science, Regent Court, 211 Portobello, Sheffield, S1 4DP, UK
(2) Information School, Regent Court, 211 Portobello, Sheffield, S1 4DP, UK
(3) Universidad del País Vasco, Barrio Sarriena s/n, 48940 Leioa, Bizkaia
{s.fernando,m.mhall,p.d.clough,m.stevenson}@sheffield.ac.uk
{e.agirre, a.soroa}@ehu.es
ABSTRACT
There is a demand for taxonomies to organise large collections of documents into categories
for browsing and exploration. This paper examines four existing taxonomies that have been
manually created, along with two methods for deriving taxonomies automatically from data
items. We use these taxonomies to organise items from a large online cultural heritage collection.
We then present two human evaluations of the taxonomies. The first measures the cohesion
of the taxonomies to determine how well they group together similar items under the same
concept node. The second analyses the concept relations in the taxonomies. The results show
that the manual taxonomies have high-quality, well-defined relations; however, the novel
automatic method is found to generate very high cohesion.
TITLE AND ABSTRACT IN BASQUE (English translation)
Comparing taxonomies for organising collections of documents
Classifying document collections into categories is very useful, as it offers new possibilities for browsing and exploring documents. As a result, there is a great need for taxonomies that can organise large document collections. This article examines four manually created taxonomies, together with two methods that generate taxonomies automatically. We used these taxonomies to organise large cultural heritage collections. We evaluated the taxonomies using two questionnaire-based methods. The first measures the cohesion of a taxonomy, that is, how well similar items are grouped under the same concept. The second analyses the relations between the concepts in the taxonomy. The results show the quality of the relations in the manually created taxonomies, but a novel automatic method achieves the highest cohesion.
KEYWORDS: Semantic network, taxonomy, hierarchy, Wikipedia, WordNet, ontology.
KEYWORDS IN BASQUE (translated): Semantic network, taxonomy, hierarchy, Wikipedia, WordNet, ontology.
1 Introduction
With increasingly large sets of diverse collections of documents available online a key challenge
is organising and presenting these items effectively for information access. To enable the
navigation and exploration of collections, content providers typically offer users free-text search functionality, along with some form of browsable subject categories or taxonomy, which are also useful for organising documents. Providing multiple mechanisms for accessing documents
enables users to conduct various modes of information seeking activity, from locating specific
documents to more exploratory forms of searching and browsing behaviour (Hearst, 2009;
Marchionini, 2006; Wilson et al., 2010).
In this paper we focus on evaluating different taxonomies that could be used to organise
and navigate content from Europeana1, an online cultural heritage collection. This collection comprises many subcollections taken from different providers, and thus contains a very diverse set of cultural heritage items2. Some of the subcollections are linked to bespoke taxonomies;
however, many are not. This therefore represents a very challenging dataset to organise in a
consistent and uniform manner. We focus on two main approaches for organising content: the
first is to map items from Europeana onto existing manually-created taxonomies; the second
is to use data-driven approaches to automatically derive taxonomies from the collection. This
requires being able to successfully group items into categories and generate suitable category
labels. A note on definitions: the term ‘taxonomy’ is used in this paper as a general term
meaning a conceptual hierarchy. A taxonomy does not necessarily have to be a subsumption
hierarchy (where each child concept is subsumed by its parent concept). Some of the taxonomies
described here are subsumption hierarchies and some are not.
There are many different ways of evaluating taxonomies (Snow et al., 2004; Malaisé et al.,
2006; Yi, 2008; Nikolova et al., 2010; Ponzetto and Strube, 2011). Here we focus on two
approaches which capture different qualities of the taxonomies. The first evaluation measures
the cohesion of the taxonomies; how well they group together similar items into the same
concept node. The second analyses the relationships between concept nodes in the taxonomies
and whether people can understand the concept labels. We believe this is the first time that
such evaluations have been applied to such taxonomies over large collections of data items.
This paper provides three main contributions: (1) the systematic comparison of different
taxonomies for organising a large cultural heritage collection; (2) a novel data-driven approach
based on using Wikipedia article links as concept nodes in the taxonomy; and (3) the evaluation
of cohesion and relationship type between concepts using an approach based on crowdsourcing.
The rest of the paper comprises the following. Section 2 describes related work in this area.
Section 3 gives a brief overview of the key data resources and tools referenced and Section 4
describes the taxonomies and item-to-resource mapping approaches used in the experiments.
Section 5 describes the cohesion and relatedness experiments and the results obtained.
2 Related work
There are many category systems or taxonomies available; some are domain specific, while others aim to cover more general subjects. For example, one of the most popular and
commonly used resources is the Library of Congress Subject Headings (LCSH)3 . LCSH provides
1. http://www.europeana.eu
2. An item is defined here as an online record of a cultural heritage artifact (usually an image), together with associated metadata, such as title, subject, description, etc.
3. http://www.loc.gov/aba/cataloging/subject/
a controlled vocabulary of keywords (or subject headings), which are widely used in libraries to
catalogue materials and facilitate information access. Similarly, in the medical domain MeSH4 ,
created and maintained by the National Library of Medicine, provides a controlled vocabulary
of medical subject headings. In computational linguistics, WordNet (Fellbaum, 1998) is a
commonly used lexical knowledge base that links concepts in various ways. WordNet has been
expanded with WordNet domain labels which group together words from different syntactic
categories and different senses (Bentivogli et al., 2004). These domain labels are organised
into a hierarchical structure.
There is a body of previous work on automatically deriving taxonomies and relations from free
text. Hearst (1992) was perhaps the earliest significant effort to derive hyponym-hypernym
relations from free text using the now eponymous Hearst patterns which code common syntactic
forms of the hyponymy pattern (e.g. ‘Vehicles such as Cars’). These patterns were hand-crafted.
More recently Snow et al. (2004) developed on this work by using an existing knowledge base
to automatically derive lexico-syntactic patterns containing the hyponym-hypernym pairs.
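The 'such as' construction described above can be sketched with a simple regular expression. This is an illustrative toy, not Hearst's actual implementation: the regex handles only single-word terms, whereas real Hearst patterns cover several lexico-syntactic variants and full noun phrases.

```python
import re

# One hand-crafted Hearst pattern: "X such as Y" yields (hypernym, hyponym).
# Single-word terms only; a real system would use a chunker for noun phrases.
SUCH_AS = re.compile(r"(\w+) such as (\w+)")

def hearst_pairs(text):
    """Return (hypernym, hyponym) pairs matched by the 'such as' pattern."""
    return [(m.group(1), m.group(2)) for m in SUCH_AS.finditer(text)]
```

For example, `hearst_pairs("Vehicles such as Cars are common.")` yields the pair `("Vehicles", "Cars")`.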
An alternative to creating hierarchies of concepts from the pattern-based methods is to use
statistical methods. Sanderson and Croft (1999) used an approach to automatically build
a hierarchy of terms or concepts (nouns and noun phrases) based on term co-occurrences
within a set of documents. To order the concepts a statistical relation called subsumption was
used to determine which of a co-occurring pair of concepts was most likely to be the parent.
Griffiths and Tenenbaum (2004) uses Bayesian methods and LDA (Latent Dirichlet Allocation)
to derive topic clusters and create a topic hierarchy. Recent work has exploited the information
in Wikipedia to create taxonomies. Ponzetto and Strube (2011) uses the category hierarchy
along with lexical matching methods to create a taxonomy which compares well with manually
created resources. DBpedia (Auer et al., 2007) also uses the Wikipedia category hierarchy but
also additionally links in the articles into the hierarchy.
One key problem is how to evaluate taxonomies effectively. This is a complex problem since
there are many aspects to consider. We can consider how users would rate various aspects of
their experience using a questionnaire. However this subjective study can be misleading since
people can underrate or overrate their experience. A more objective approach is to log user interactions and then infer from these how effective the taxonomy is, e.g. time spent on a task, how much of the domain was covered, and so on. There have been user studies of taxonomies
which have used combinations of both of these approaches (Malaisé et al., 2006; Yi, 2008;
Nikolova et al., 2010). There is also an important distinction to be made between the data
content of a taxonomy and the methods used for visualisation. There have been user studies
which have focused on evaluating different visualisations while keeping the data constant
(Katifori et al., 2007). In this paper we are considering only the data content, the concepts and
relations, so visualisations are kept constant.
User studies are certainly important. However these studies often do not answer certain questions about the taxonomy itself. We can measure aspects of the user experience, but we might not gain a fine-grained understanding of which aspects of the taxonomy were positively or negatively
perceived. Also such studies require users to be physically present at a machine specially
set up for logging. This can make such studies expensive and time-consuming. A different
approach is to use intrinsic evaluations. Yu et al. (2007) present a wide range of ontology
evaluation approaches as applied to variations of the Wikipedia category structure such as
4. http://www.nlm.nih.gov/mesh/
fanout, tangledness, relationship richness, class richness, importance connectivity and cohesion.
Cohesion attempts to measure the degree to which similar items are clustered together under a
single node in the taxonomy. This was initially proposed by Boyd-Graber et al. (2009) and has
also recently been applied to cultural heritage (Hall et al., 2012).
Another evaluation is to measure the accuracy of the child-parent pairs from the taxonomies by
asking evaluators to classify them as either isa or notisa relations (Snow et al., 2004; Ponzetto
and Strube, 2011). A similar method is used here but expanded to give a more detailed
understanding of the types of the relations found in the different taxonomies.
3 Resources and tools
This section lists some key resources and tools that are used in this paper.
3.1 Wikipedia Miner
The Wikipedia Miner (Milne and Witten, 2008) tool is used in this paper both as a tool to help
map items into existing taxonomies and as a way to generate a novel taxonomy from scratch.
Wikipedia Miner is a Wikification tool which adds inline links to Wikipedia articles into free text.
The software is trained on Wikipedia articles, and thus learns to disambiguate and detect links
in the same way as Wikipedia editors. Disambiguation of terms within the text is performed
first. A machine-learning classifier is used with several features. The main features used are
commonness and relatedness, as in Medelyan et al. (2008). The commonness of a target sense is
defined by the number of times it is used as a destination from some anchor text, e.g. the anchor
text ‘Tree’ links to the article about the plant more often than the mathematical concept and is
thus more common. Relatedness gives a measure of the similarity of two articles by comparing
their incoming and outgoing links. The performance achieved using their approach is currently
state of the art for this task. The Wikipedia Miner software is freely available5 .
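As a rough sketch of the commonness feature, the following assumes a hypothetical `link_counts` table harvested from Wikipedia anchor texts; this is not Wikipedia Miner's actual API.

```python
def commonness(anchor, sense, link_counts):
    """Fraction of Wikipedia links with the given anchor text that point
    at `sense`. `link_counts` is an assumed {anchor: {article: count}}
    table built from Wikipedia's internal links."""
    counts = link_counts.get(anchor, {})
    total = sum(counts.values())
    return counts.get(sense, 0) / total if total else 0.0
```

With a table where the anchor 'Tree' links to the plant article 90 times out of 100, `commonness` returns 0.9 for that sense, making it the preferred disambiguation.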
3.2 Europeana cultural heritage collection
Cultural heritage items from Europeana are used for the evaluation. Europeana is a large
online aggregation of cultural heritage collections from across Europe. In this paper a snapshot
of the English subset of the data from March 2011 is used. This comprises 547780 items in
total. Each item consists of an XML metadata record. This comprises a number of fields, the most informative of which are dc:title, dc:subject and dc:description, which contain the
title, subject keywords and a textual description of the item. About 74% of the items have an
associated image which is displayed on the portal website. A difficulty with this collection is
that a significant number of the items have very little associated metadata. In the worst case
some items have only a one-word title, with no subject or description. This problem is dealt
with implicitly in the methods described below, where such sparse records are effectively filtered
out in the mapping and taxonomy generation stages and are not included in the evaluations.
4 Taxonomies and mappings
Six taxonomies were tested in these experiments. Four of these were based on existing
taxonomies which have been mostly manually created: the Library of Congress Subject Headings,
WordNet Domains, Wikipedia Taxonomy and DBpedia. The remaining two taxonomies were
automatically derived from the metadata present for the items in the collections: WikiFreq and
5. http://wikipedia-miner.cms.waikato.ac.nz/
LDA topics. This section gives a description of each of these taxonomies and how the items in
the collection were mapped into them. Statistics for each taxonomy are presented at the end of
this section.
4.1 Manually created taxonomies
4.1.1 Library of Congress Subject Headings (LCSH)
The LCSH comprises a controlled vocabulary maintained by the United States Library of
Congress for use in bibliographic records. They are used in many libraries to organise their
collections as well as for organising materials online.
The text from the dc:subject field in the Europeana item is used for the mapping. The text
is lemmatized using Freeling (Padró, 2011). The text is compared to the category labels for the
LCSH concepts. If the text contains any of the category labels then the item is matched to these
categories. If more than one matching label is found, then the longest matching label is used
for the mapping.
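The longest-match mapping described above might be sketched as follows. The helper name is hypothetical, and lowercasing stands in for Freeling lemmatization; `lcsh_labels` are assumed to be pre-lemmatized label strings.

```python
def map_to_lcsh(subject_text, lcsh_labels, lemmatize=str.lower):
    """Map a dc:subject string to an LCSH category label.

    Collects every label contained in the lemmatized subject text and
    returns the longest one (None if nothing matches), mirroring the
    longest-match rule described in the text."""
    text = lemmatize(subject_text)
    matches = [label for label in lcsh_labels if label in text]
    return max(matches, key=len) if matches else None
```

Given labels 'church' and 'church architecture', the subject 'Medieval church architecture' matches both, and the longer label wins.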
4.1.2 WordNet domains
The WordNet hierarchy is a fine-grained classification which is too detailed for browsing, with
more than a hundred thousand nodes and concepts like entity, natural phenomenon and body of
water. Instead WordNet domains organises the WordNet concepts into a smaller hierarchy, with
only 164 domain labels which are easily understood by a general user. The domain labels have
been semi-automatically applied to each of the synsets in WordNet. Each synset is annotated
with one label from a set of about two hundred. The information provided by the domain
labels is complementary to the data existing already in WordNet. The domain labels group
together words from different syntactic categories (e.g. nouns and verbs), and also may group
together different senses of the same word and thus reduce polysemy.
WordNet domains lists the domain labels for all open class words in WordNet, but it only
contains a few proper nouns. Given the large concentration of proper nouns in Europeana, we
extend the list of words using Yago2. Yago2 (Hoffart et al., 2011) is a knowledge base derived
from Wikipedia with more than 10 million entities, and each entity in Yago2 is linked to a
WordNet 3.0 synset. We also used a mapping from WordNet 3.0 synsets to WordNet Domain
labels as provided by the Multilingual Central Repository (MCR) (Atserias et al., 2004). To
perform the mapping, the first step is again to use the dc:subject field to link Europeana
items to Yago2 entities (using lemmatization and finding the longest possible match). These
are then mapped to the WordNet Domain labels via the Yago2 entity-to-synset and the MCR
synset-to-WordNetDomain mappings.
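The chained lookup described above (Yago2 entity to WordNet 3.0 synset to WordNet Domain label) can be sketched with two dictionaries; the function name and dictionary shapes are illustrative, not the actual Yago2 or MCR data formats.

```python
def subjects_to_domains(entities, entity_to_synset, synset_to_domain):
    """Map Yago2 entities found in an item's dc:subject field to WordNet
    Domain labels via the entity-to-synset and synset-to-domain tables.
    Entities missing from either table are silently dropped."""
    domains = set()
    for entity in entities:
        synset = entity_to_synset.get(entity)
        if synset in synset_to_domain:
            domains.add(synset_to_domain[synset])
    return domains
```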
4.1.3 Wikipedia Taxonomy
Wikipedia Taxonomy (Ponzetto and Strube, 2011) is a taxonomy derived from Wikipedia
categories. The authors create the Wikipedia Taxonomy by keeping the isa relations between
Wikipedia categories and discarding the rest. We first apply Wikipedia Miner (see Section 3.1)
over the Europeana items to find the relevant Wikipedia entities in the dc:subject field.
Then, we link the Europeana item to all Wikipedia Taxonomy categories which are related to
these entities.
6. 4.1.4 DBpedia ontology
The DBpedia ontology (Auer et al., 2007) is a small, shallow ontology manually created based
on information derived from Wikipedia. Unlike the vocabularies described above,
the DBpedia ontology is a formalised ontology, including inference capabilities. The authors
provide the instances of each ontology class, i.e. the set of Wikipedia entities pertaining to
this class. For mapping Europeana items to DBpedia ontology classes, we first apply Wikipedia Miner to find the Wikipedia entities relevant to the item, and then link the item to the classes these entities belong to.
4.2 Automatically created data-driven taxonomies
4.2.1 LDA topic modelling
Latent Dirichlet Allocation (LDA) is a state-of-the-art topic modelling algorithm that creates a
mapping between a set of topics T and a set of items I, where each item i ∈ I is linked to one
or more topics t ∈ T . Each item is input into LDA as a bag-of-words and then represented as a
probabilistic mixture of topics. The LDA model consists of a multinomial distribution of items
over topics where each topic is itself a multinomial distribution over words. The item-topic and
topic-word distributions are learned simultaneously using collapsed Gibbs sampling based on
the item - word distributions observed in the source collection (Griffiths and Steyvers, 2004).
LDA has been used to successfully improve result quality in Information Retrieval (Azzopardi
et al., 2004; Wei and Croft, 2006) tasks and is thus well suited to support exploration in digital
libraries.
To turn the flat LDA topic model into a navigable hierarchy, Griffiths and Tenenbaum (2004)
describe a hierarchical LDA approach. However this was found to be prohibitively time-consuming given our large data-set. Instead a recursive divide-and-conquer approach was
used, which was much more efficient. The number of topic groups at each stage was limited
to a maximum of 9 to make the hierarchy manageable for users to navigate. The algorithm is
outlined below.
1) Run LDA over the corpus to determine the document-topic probabilities. The number of
topics topic_n to generate is automatically determined using this equation:

    topic_n = min(9, documents_in_corpus / 30)    (1)
2) For each topic, use the document-topic probabilities that LDA outputs to identify the set
of documents associated with that topic. Each document is assigned only to its highest-
probability topic. While this removes some of the power inherent in the LDA topic model,
we believe that from a navigational perspective it is better if each document is located at
only one point in the hierarchy and not at multiple points. To give each topic a label, we
simply selected the highest-probability word from each topic’s topic-word distribution.
3) If a topic set has fewer than 60 items then stop. Otherwise go back to 1) using the set of items
identified in 2) as the corpus.
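A minimal sketch of the recursive procedure above, with a pluggable `assign_topics` function standing in for an actual LDA run (in practice a library such as gensim would supply the document-topic probabilities). The names and the leaf representation are illustrative; note the sketch assumes `assign_topics` keeps shrinking the subsets, otherwise the recursion would not terminate.

```python
MAX_TOPICS = 9   # cap on the branching factor, from Equation (1)
MIN_SPLIT = 60   # topic sets smaller than this become leaves (step 3)

def topic_count(n_docs):
    # Equation (1): at most 9 topics, roughly one per 30 documents.
    return min(MAX_TOPICS, max(1, n_docs // 30))

def build_hierarchy(docs, assign_topics):
    """Recursively split `docs` into a topic tree.

    `assign_topics(docs, n)` stands in for one LDA run: it must return
    (label, doc_subset) pairs, assigning each document to its single
    highest-probability topic, labelled by its top word."""
    node = {}
    for label, subset in assign_topics(docs, topic_count(len(docs))):
        if len(subset) < MIN_SPLIT:
            node[label] = list(subset)              # leaf: stop recursing
        else:
            node[label] = build_hierarchy(subset, assign_topics)
    return node
```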
4.2.2 Wikipedia link frequencies
This is a novel method for taxonomy creation which uses Wikipedia article links as the concept
nodes in the taxonomy. The first step is to add inline article links to all the item texts in the
collection using Wikipedia Miner (see Section 3.1). A confidence threshold of 0.5 was used to
help ensure the links were of high quality, that is, correctly disambiguated and relevant
to the topic of the item.
The next step is to find the frequency counts of all article links that occur in the items. Let L be the set of all links found in the items. Then the frequency function F : L → ℕ gives the global frequency count for occurrences of the link in all items.
The following procedure is then used for each item to create and populate the taxonomy. Let
S ⊂ L be the set of links found in that item. The links are ordered in S by order of frequency
according to the F function (most frequently occurring first) to give an ordered list of links
a1 , a2 , a3 · · · an . The item is then inserted into the tree under the branch a1 → a2 → a3 · · · → an ,
with a1 at the top level in the tree and the item appearing under the node an . If this branch
does not already exist in the tree then it is created.
It was found that using this approach the branching factor was very high at some levels so the
number of child nodes at each level was limited to at most 20. Furthermore only items with at
least 2 links were used to prevent a large number of single-linked items appearing at the top
level. Concepts with fewer than 20 items were also filtered out.
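The insertion procedure might be sketched as follows. The function name and the `"_items"` key are illustrative; the branching-factor cap (20 children) and the minimum-size pruning (20 items) described above are omitted for brevity, while the at-least-2-links filter is kept.

```python
from collections import Counter

def build_wikifreq_tree(item_links):
    """item_links maps item id -> set of Wikipedia article links found in
    that item. Each item is inserted under the branch formed by its links,
    ordered from globally most to least frequent (F function)."""
    freq = Counter(link for links in item_links.values() for link in links)
    tree = {}
    for item, links in item_links.items():
        if len(links) < 2:              # filter out sparse items
            continue
        ordered = sorted(links, key=lambda l: (-freq[l], l))
        node = tree
        for link in ordered:            # a1 -> a2 -> ... -> an
            node = node.setdefault(link, {})
        node.setdefault("_items", []).append(item)
    return tree
```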
This method is labelled WikiFreq in the remainder of the paper.
4.3 Taxonomy statistics
Table 1 shows some statistics for each taxonomy:
• The number of items that are mapped into the taxonomy.
• The average number of parents for each item.
• The average depth from the root node to an item.
• The number of top level nodes in the taxonomy.
A problem with some of the manual taxonomies is the very high number of top level nodes,
which makes it difficult for users to browse. However there is no obvious way to select suitable
top level nodes in these taxonomies. Additionally some of the taxonomies assign items to many
parent nodes - this means that the data is repeated across the taxonomy. This is not a problem
in itself, but is likely to mean that items may often be assigned to incorrect nodes.
5 Experiments
Two evaluations were performed on the taxonomies. The first measured the cohesion of the
item clustering, and the second gathered human judgements of the relations that were found
between child-parent concept pairs in the taxonomy. For both evaluations online surveys were
created using an in-house crowdsourcing interface. Links to the surveys were sent out to a
mailing list comprising thousands of students and staff members at the University of Sheffield.
Type       Taxonomy       Items    Nodes    Avg. parents  Avg. depth  Top nodes
Manual     LCSH            99259   285238   1.8           1.97        28901
Manual     DBpedia        178312      273   4.2           2           30
Manual     Wiki Taxonomy  275359   121359   11.7          1.13        10417
Manual     WN domains     308687      170   7.1           7.1         6
Automatic  LDA topics     545896    22494   1             7.3         9
Automatic  Wiki Freq       66558      502   1             3.39        24
Table 1: Statistics for each taxonomy
5.1 Cohesion
A cohesive cluster is defined as one in which the items are similar while at the same time clearly
distinguishable from items in other clusters (Tan et al., 2006). To measure the cohesiveness
of the taxonomies we use the intruder detection task originally devised in Boyd-Graber et al.
(2009) and recently used for cultural heritage items in Hall et al. (2012). The idea of this is to
present 5 items to an evaluator. Four of these are taken from one concept node in the taxonomy
and the other (the intruder) is randomly picked from elsewhere in the taxonomy. The more
cohesive the concept in the taxonomy the more obvious it should be which is the intruder item.
Each unit was displayed as a list of five images along with the titles of the items. An example6
of a cohesive unit is shown in Figure 1. To generate good quality units for the evaluation the
informativeness of items was calculated as follows:
    informativeness(item) = length(item_title)/avgL_title * log(N/count(item_title))
                          + length(item_desc)/avgL_desc * log(N/count(item_desc))
                          + length(item_subj)/avgL_subj * log(N/count(item_subj))

where title, desc, subj refer to the title, description and subject fields of the metadata, avgL_X is the average length of field X over the whole collection, length(item_X) is the length of that item's field text, and count(item_X) gives the frequency of that item's field text over the whole collection.
collection. The higher the resulting value the more informative the item is. Note that as well as
taking into account the length of the fields, this also weights by the inverse document frequency
(idf) value, so very frequently occurring terms will be downweighted.
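The informativeness score can be sketched directly from the formula; the function name and the dictionary shapes for the precomputed collection statistics are assumptions for illustration.

```python
import math

def informativeness(item, avg_len, field_count, n_items):
    """Score an item by the formula above.

    `item` maps a field name to its text; `avg_len[f]` is the mean length
    of field f over the collection; `field_count[f][text]` counts how
    often that exact field text occurs; `n_items` is the collection size N."""
    score = 0.0
    for field in ("title", "desc", "subj"):
        text = item.get(field, "")
        if not text:
            continue        # missing fields contribute nothing
        # length ratio weighted by the idf of the exact field text
        score += (len(text) / avg_len[field]) * math.log(n_items / field_count[field][text])
    return score
```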
The most informative items were selected for each category. This helped to ensure that the users
were presented with informative items, allowing them to have enough information to decide
which was most likely to be the intruder. The same also applies to the taxonomy mappings
being evaluated; it is difficult to correctly map items with very little information in the metadata.
The procedure for selecting the sample units was as follows:
6. Images reproduced from Wikipedia and subject to relevant licenses:
http://en.wikipedia.org/wiki/File:York_Minster_close.jpg
http://en.wikipedia.org/wiki/File:Wells_Cathedral,_Wells,_Somerset.jpg
http://en.wikipedia.org/wiki/File:West_Side_of_Westminster_Abbey,_London_-_geograph.org.uk_-_1406999.jpg
http://en.wikipedia.org/wiki/File:Goatfell_from_Brodick_Harbour.jpg
http://en.wikipedia.org/wiki/File:Sfec-durham-cathedral-2007-263.JPG
[Figure 1 shows five item images with their titles: (1) York Minster, (2) Westminster Abbey, (3) Wells Cathedral, (4) Isle of Arran, (5) Durham Cathedral]
Figure 1: Example of a cohesive unit. Here the intruder item is number 4.
1. Select categories at random that have at least 4 items.
2. For each category:
(a) Return the 4 items in the category which were most informative.
(b) To find the intruder item, select 100 items at random from the whole collection and
return the most informative item.
Six units were shown on each page, one of which was always a control unit. The control units
were manually chosen to be examples where the intruder was very obvious. The purpose of the
control units was to ensure the quality of the judgements, since if a participant got the control
unit wrong it was an indication that they were not taking the task seriously.
Thirty non-control units were created from each taxonomy. Altogether 134 people attempted
the survey. 23 of the users answered at least one control unit incorrectly, or evaluated fewer than 5
units in total, and so their answers were excluded. The remaining 111 participants contributed
a total of 1255 answers. Each unit received a minimum of 5 answers and an average of 6.97
answers. A unit was judged as cohesive if more than 80% of the annotators agreed on the same
intruder.
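The cohesion criterion above reduces to a one-line check; the function name is illustrative.

```python
from collections import Counter

def is_cohesive(answers, threshold=0.8):
    """A unit counts as cohesive when more than `threshold` of the
    annotators agreed on the same intruder item. `answers` is the list
    of item indices the annotators picked."""
    top_votes = Counter(answers).most_common(1)[0][1]
    return top_votes / len(answers) > threshold
```

For example, five out of six annotators agreeing (about 83%) clears the 80% bar, while a 2-2-1 split does not.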
Type       Taxonomy       Coherent units  Percentage
Manual     LCSH           19              63.3
Manual     DBpedia        17              56.7
Manual     Wiki Taxonomy  18              60.0
Manual     WN domains     15              50.0
Automatic  LDA topics     17              56.7
Automatic  Wiki Freq      29              96.7
Table 2: Number of coherent units (out of 30) for each of the taxonomies.
The results (Table 2) show that most of the taxonomies achieved roughly the same level of
cohesion for the clusters, roughly between 50 and 63%. However the WikiFreq taxonomy
performed far better, with only one unit of the 30 judged as not coherent. This success shows
that the Wikipedia links are very effective as a means of grouping together similar items. This
might be explained by considering that items grouped together under the same node will share
a number of keywords which link to the same Wikipedia articles which would ensure that the
items are very similar. In contrast the Wikipedia taxonomy and DBpedia ontology use categories
rather than articles in Wikipedia as the concept nodes. These are much more loosely defined;
each article in Wikipedia can belong to many categories and each category contains many
articles. The results also indicate that Wikipedia articles as entities are much more clearly
defined than the LDA topic keywords and thus work much better at grouping together the
similar items.
5.2 Relation classification
Previous work has evaluated taxonomies by presenting child-parent pairs of concept nodes
to evaluators and asking them a simple boolean question: does the pair represent a valid
hypernymic relation, i.e. is it true that "ChildNode isa ParentNode"? (Ponzetto and Strube,
2011; Snow et al., 2004). We would expect the manually created taxonomies to perform well
here. The automatic methods also aim to create a hierarchical structure, with more general concepts at the top nodes and more specific concepts in the lower nodes.
Here we conduct a slightly deeper analysis of what kinds of relations were present in these
taxonomies. Instead of a simple boolean question a two-part question was used. Given a
child-parent pair A, B the evaluators were asked two questions:
1. Are the two concepts A and B related? (Yes/No/I don’t know) The evaluators were
asked to judge the relation within the context of the cultural heritage taxonomy. A
positive example was presented: Westminster and London, which were related because
Westminster is in London. A negative example was Fish and Bicycle which were unrelated
and would not be a useful pair to include in a taxonomy.
2. If Yes, then how would you best define the relationship? Is A more specific than B, less
specific than B, neither, or don’t know? Examples were also given to help with this
question. Westminster is more specific than London since Westminster is within London.
The term Scientist is less specific than Physicist, since while all Physicists are Scientists,
not all Scientists are Physicists (they could be biologists or chemists for example). For
the ‘neither’ option consider Physicist and Biologist. The concepts are related (both are scientists) but neither is more specific than the other.

Type       Taxonomy       Child (A)             Parent (B)
Manual     LCSH           Work                  Human Behaviour
Manual     LCSH           Braid                 Weaving
Manual     LCSH           Time                  Geodetic Astronomy
Manual     DBpedia        Mountain Range        Place
Manual     DBpedia        Fern                  Plant
Manual     DBpedia        Congressman           Politician
Manual     Wiki Taxonomy  Mammals of Africa     Wildlife of Africa
Manual     Wiki Taxonomy  Schools in Wiltshire  Schools in England
Manual     Wiki Taxonomy  British Culture       European Culture
Manual     WN domains     vehicles              transport
Manual     WN domains     mechanics             engineering
Manual     WN domains     home                  applied science
Automatic  LDA topics     earthenware           dish
Automatic  LDA topics     view                  church
Automatic  LDA topics     tunnel                chapel
Automatic  Wiki Freq      Corrosion             Coin
Automatic  Wiki Freq      Interior Design       Industrial Design
Automatic  Wiki Freq      Towpath               Waterscape
Table 3: Some examples of child-parent pairs from each taxonomy.
Forty non-control pairs from each taxonomy were presented to the evaluators giving a total
of 240 pairs. Examples of concept pairs from each taxonomy are shown in Table 3. As for
the previous experiment control pairs were manually identified where the answer should be
obvious. Five pairs were shown on each page of which one was always a control pair.
Altogether 270 people attempted this survey. 97 people answered more than half the control pairs incorrectly or evaluated fewer than 5 pairs in total, and so their answers were excluded. The 173 remaining participants made a total of 3826 evaluations. A minimum of 8 evaluations and an average of 15.94 evaluations were made for each pair.
Type       Taxonomy       Yes   No    Don’t know  Agreement
Manual     LCSH           74.2  8.8   17.0        79.1
Manual     DBpedia        86.6  11.2  2.2         88.4
Manual     Wiki Taxonomy  96.1  1.7   2.3         95.9
Manual     WN domains     77.1  14.5  8.4         83.9
Automatic  LDA topics     30.3  50.3  19.3        71.6
Automatic  Wiki Freq      47.6  16.5  35.8        70.9
Table 4: Are A and B related?
The results for the relatedness question (Table 4) show a clear pattern. The manually created
taxonomies are markedly more likely to contain clearly related pairs of concepts. The Wikipedia
taxonomy hierarchy is the highest performing in this regard, which suggests that the user-created
category hierarchy is of high quality and has easily understood concepts and relations. DBpedia
scores slightly lower. This difference might be explained due to DBpedia containing article
entities as well as categories. The lower score suggests either that articles are not always placed
in the best categories, or that it is harder for users to identify article-category relationships.
WordNet domains scores lower. This may be due to the domain concepts being sometimes quite
general and possibly harder for general users to understand (for example one pair of concepts
was ‘color’ and ‘factotum’). LCSH scored surprisingly low considering that it is a manually
created taxonomy. This suggests that the concepts and relations in this hierarchy are even
harder for users to understand or identify. Finally the two data-derived taxonomies score lower
still. For the WikiFreq taxonomy a high percentage of the relations were classified as ‘Don’t
know’. This may be because a high number of the article links are about quite obscure concepts
which most people would not know about. Finally the LDA topics produced the highest number
of definite ‘No’ judgements, which suggests that the taxonomy may be difficult or confusing for
users to navigate.
Type       Taxonomy       A < B  A > B  Neither  Don’t know  Agreement
Manual     LCSH           65.4   8.7    23.4     2.5         68.7
Manual     DBpedia        76.2   4.9    18.1     0.7         78.9
Manual     Wiki Taxonomy  78.3   4.7    16.0     0.9         82.8
Manual     WN domains     63.6   6.3    28.0     2.0         67.6
Automatic  LDA topics     21.4   14.8   62.1     1.6         61.0
Automatic  Wiki Freq      30.9   22.6   43.6     2.9         67.0
Table 5: Specificity of the pairs, with A the child node and B the parent node. A < B means A is more specific than B.
The results for the specificity question are shown in Table 5. These follow a roughly similar
pattern to the relatedness. The A < B case is the most desirable for the taxonomies, since we
would prefer the most general concepts at the top of the hierarchy, narrowing down into more
specific concepts. The Wikipedia taxonomy and DBpedia both score relatively highly here,
although both contain a surprisingly high number of cases where neither A nor B was identified
as more specific than the other (16.0% and 18.1% respectively). For the Wikipedia taxonomy this
shows that although almost all child-parent pairs are considered to be related concepts, the
child is not always easily identified as more specific than the parent. Both WordNet
domains and LCSH fare worse, again with more relations identified as ‘neither’. The WikiFreq
taxonomy contains a more mixed set of results, with quite a high proportion of relations the
‘wrong way round’ (A deemed to be less specific than B), although the highest number falls
into the ‘neither’ category. This result is a reflection of the nature of the links within the items.
The taxonomy is ordered with the most frequently occurring links at the top, going down to
the least frequent. Clearly this is not enough to create the kind of general-to-specific relationships which
are desirable. Finally, the results for the LDA topics show that the majority are judged as
‘neither’: the concepts are topically related but mostly without any specificity ordering.
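The Table 5 percentages can be derived from raw pairwise annotator judgements by a simple aggregation. The sketch below is illustrative only: the label names, majority-vote scheme, and raw pairwise agreement measure are assumptions, since the exact aggregation procedure is not specified in this section.

```python
from collections import Counter

def aggregate_judgements(judgements):
    """Aggregate per-pair annotator labels into Table-5-style percentages.

    judgements: dict mapping (child, parent) -> list of labels, where each
    label is one of 'A<B', 'A>B', 'neither', "don't know" (names assumed).
    Returns (percentage of pairs per majority label, mean raw agreement).
    """
    label_counts = Counter()
    agreements = []
    for pair, labels in judgements.items():
        counts = Counter(labels)
        majority_label, _ = counts.most_common(1)[0]
        label_counts[majority_label] += 1
        # Raw agreement: fraction of annotator pairs giving the same label.
        n = len(labels)
        same = sum(c * (c - 1) for c in counts.values())
        agreements.append(same / (n * (n - 1)))
    total = sum(label_counts.values())
    percentages = {lab: 100.0 * c / total for lab, c in label_counts.items()}
    return percentages, sum(agreements) / len(agreements)
```

Under this scheme a taxonomy with many ‘neither’ majorities but high agreement (as with the LDA topics) indicates annotators consistently see no specificity ordering, rather than simply disagreeing.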
Conclusion and perspectives
Developing effective taxonomies for the purpose of organising large numbers of data items is a
complex task. Existing manually created taxonomies might be accurate and well structured, but
may not be adequate for a specified domain or may be hard for users to navigate. Automatic
methods for deriving taxonomies have the advantage of closely reflecting the nuances of the
data, but organising the derived concepts into meaningful relations remains a problem.
The experiments in this paper show some surprising results. The LCSH taxonomy has been
manually created for the purpose of organising library collections and so might be the obvious
choice to organise CH data online. However, the results show that the relations within LCSH are
defined less clearly than those of the Wikipedia-derived taxonomies. WordNet domains performs
at a similar level to LCSH in terms of the quality of the relations. The LDA topic hierarchy
gave poor results in terms of the identified topics. The topic pairs were often unrelated, and
had none of the general-to-specific structure that would be desirable for this application.
The WikiFreq hierarchy performed slightly better in this regard. Just over half the concept pairs
were judged to be related. Just under a third were labelled as ‘Don’t know’, which may reflect
the obscurity of the concept nodes identified. It was hoped that ordering the concepts by link
frequency would give the hierarchy a general-to-specific structure. This was not
achieved, although the hierarchy does have the benefit of providing the user with an overview
of the collection, since the most prevalent kinds of items are immediately visible at the top.
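The frequency ordering behind WikiFreq can be sketched as follows. This is a minimal illustration under assumed inputs: each collection item carries a list of Wikipedia link targets, and the function name and level-size parameter are hypothetical, not from the paper.

```python
from collections import Counter

def frequency_ordered_levels(items, level_sizes):
    """Rank Wikipedia link concepts by how often they occur across items,
    then slice the ranking into hierarchy levels, most frequent first.

    items: iterable of lists of Wikipedia link targets (one list per item).
    level_sizes: how many concepts to place at each level, top to bottom.
    """
    counts = Counter(link for links in items for link in links)
    ranked = [concept for concept, _ in counts.most_common()]
    levels, start = [], 0
    for size in level_sizes:
        levels.append(ranked[start:start + size])
        start += size
    return levels
```

As the specificity results show, frequency alone places the most prevalent concepts at the top but gives no guarantee that a frequent parent is semantically more general than its less frequent children.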
In terms of cohesion all the taxonomies achieved similar results except for the WikiFreq
taxonomy which achieved almost perfect cohesion. This shows how effective the Wikipedia
links are in grouping together similar items.
It is also worth noting that WikiFreq and LCSH map significantly fewer items into the taxonomy.
Future work will continue with different evaluation approaches, such as domain/task coverage
and accuracy of the mappings. We also aim to expand the evaluations to include user studies; a
key question is how well these taxonomies assist users when used for browsing large collections,
such as Europeana. The aim is to see if there is a correlation between the intrinsic results
found here and the extrinsic quality judgements obtained in real-life applications. A
promising line of work will be to build on the WikiFreq approach by integrating it with the
high-quality Wikipedia taxonomy knowledge base. The hope is that this approach will generate
highly coherent units along with a well structured conceptual tree.
Acknowledgements
The research leading to these results was supported by the PATHS project
(http://paths-project.eu) funded by the European Community’s Seventh Frame-
work Programme (FP7/2007-2013) under grant agreement no. 270082. This research was also
partially funded by the Ministry of Economy under grant TIN2009-14715-C04-01 (KNOW2
project).
References
Atserias, J., Villarejo, L., Rigau, G., Agirre, E., Carroll, J., Magnini, B., and Vossen, P. (2004).
The MEANING multilingual central repository. In Proceedings of GWC, pages 23–30.
Auer, S., Bizer, C., Kobilarov, G., Lehmann, J., Cyganiak, R., and Ives, Z. (2007). DBpedia: A
nucleus for a web of open data. The Semantic Web, pages 722–735.
Azzopardi, L., Girolami, M., and Van Rijsbergen, C. (2004). Topic based language models for
ad hoc information retrieval. In Proceedings of the 2004 IEEE International Joint Conference on
Neural Networks, volume 4, pages 3281–3286. IEEE.
Bentivogli, L., Forner, P., Magnini, B., and Pianta, E. (2004). Revising the WordNet Domains
hierarchy: semantics, coverage and balancing. In Proceedings of the Workshop on Multilingual
Linguistic Resources, pages 101–108. Association for Computational Linguistics.
Boyd-Graber, J., Chang, J., Gerrish, S., Wang, C., and Blei, D. (2009). Reading tea leaves:
How humans interpret topic models. In Proceedings of the 23rd Annual Conference on Neural
Information Processing Systems.
Fellbaum, C., editor (1998). WordNet: An electronic lexical database. MIT Press.
Blei, D., Griffiths, T., Jordan, M., and Tenenbaum, J. (2004). Hierarchical topic models and the
nested Chinese restaurant process. In Advances in Neural Information Processing Systems 16:
Proceedings of the 2003 Conference, volume 16, page 17. The MIT Press.
Griffiths, T. and Steyvers, M. (2004). Finding scientific topics. Proceedings of the National
Academy of Sciences of the United States of America, 101(Suppl 1):5228.
Hall, M. M., Clough, P. D., and Stevenson, M. (2012). Evaluating the use of clustering for
automatically organising digital library collections. In Theory and Practice of Digital Libraries
2012.
Hearst, M. (1992). Automatic acquisition of hyponyms from large text corpora. In Proceedings
of the 14th conference on Computational linguistics-Volume 2, pages 539–545. Association for
Computational Linguistics.
Hearst, M. (2009). Search User Interfaces. Cambridge University Press.
Hoffart, J., Suchanek, F., Berberich, K., Lewis-Kelham, E., De Melo, G., and Weikum, G.
(2011). YAGO2: exploring and querying world knowledge in time, space, context, and many
languages. In Proceedings of the 20th International Conference Companion on World Wide Web,
pages 229–232. ACM.
Katifori, A., Halatsis, C., Lepouras, G., Vassilakis, C., and Giannopoulou, E. (2007). Ontology
visualization methods - a survey. ACM Comput. Surv., 39(4).
Malaisé, V., Aroyo, L., Brugman, H., Gazendam, L., de Jong, A., Negru, C., and Schreiber, G.
(2006). Evaluating a thesaurus browser for an audio-visual archive. In Staab, S. and Svátek, V.,
editors, Managing Knowledge in a World of Networks, volume 4248 of Lecture Notes in Computer
Science, pages 272–286. Springer Berlin / Heidelberg. 10.1007/11891451_25.
Marchionini, G. (2006). Exploratory search: from finding to understanding. Communications
of the ACM, 49(4):41–46.
Medelyan, O., Witten, I. H., and Milne, D. (2008). Topic Indexing with Wikipedia. In
Proceedings of the Association for the Advancement of Artificial Intelligence (AAAI) WikiAI
workshop.
Milne, D. and Witten, I. H. (2008). Learning to Link with Wikipedia. In Proceeding of the 17th
ACM conference on Information and Knowledge Management, pages 509–518.
Nikolova, S., Ma, X., Tremaine, M., and Cook, P. (2010). Vocabulary navigation made easier.
In Proceedings of the 15th International Conference on Intelligent User Interfaces, IUI ’10, pages
361–364, New York, NY, USA. ACM.
Padró, L. (2011). Analizadores multilingües en FreeLing. Linguamatica, 3(2):13–20.
Ponzetto, S. and Strube, M. (2011). Taxonomy induction based on a collaboratively built
knowledge repository. Artificial Intelligence, 175(9-10):1737–1756.
Sanderson, M. and Croft, B. (1999). Deriving concept hierarchies from text. In Proceedings of
the 22nd annual international ACM SIGIR conference on Research and development in information
retrieval, pages 206–213. ACM.
Snow, R., Jurafsky, D., and Ng, A. (2004). Learning syntactic patterns for automatic hypernym
discovery. Advances in Neural Information Processing Systems 17.
Tan, P., Steinbach, M., Kumar, V., et al. (2006). Introduction to Data Mining. Pearson Addison
Wesley, Boston.
Wei, X. and Croft, W. (2006). LDA-based document models for ad-hoc retrieval. In Proceedings of
the 29th Annual International ACM SIGIR Conference on Research and Development in Information
Retrieval, pages 178–185. ACM.
Wilson, M. L., Kules, B., schraefel, m. c., and Shneiderman, B. (2010). From keyword search to
exploration: Designing future search interfaces for the web. Foundations and Trends in Web
Science, 2(1):1–97.
Yi, M. (2008). Information organization and retrieval using a topic maps-based ontology:
Results of a task-based evaluation. Journal of the American Society for Information Science and
Technology, 59(12):1898–1911.
Yu, J., Thom, J., and Tam, A. (2007). Ontology evaluation using wikipedia categories for
browsing. In Proceedings of the sixteenth ACM conference on Conference on information and
knowledge management, pages 223–232. ACM.