How do scientists tag their data? Do they invent their own topical terms, consult controlled vocabularies, or draw on some other source? The goal of this project is to write a memo or guide that helps Dryad librarians advise scientists on describing their data as well as possible by submitting meaningful subject/topical headings with their data.
This presentation was provided by Chris Erdmann of Library Carpentries and by Judy Ruttenberg of ARL during the NISO virtual conference, Open Data Projects, held on Wednesday, June 13, 2018.
This presentation was provided by Tim McGeary of Duke University during the NISO virtual conference, Open Data Projects, held on Wednesday, June 13, 2018.
Semantic web technologies offer a potential mechanism for the representation and integration of thousands of biomedical databases. Many of these databases offer cross-references to other data sources, but these are generally incomplete and prone to error. In this paper, we conduct an empirical analysis of the link structure of life science Linked Data, obtained from the Bio2RDF project. Three different link graphs for datasets, entities and terms are characterized by degree, connectivity, and clustering metrics, and their correlation is measured as well. Furthermore, we utilize the symmetry and transitivity of entity links to build a benchmark and evaluate several popular entity matching approaches. Our findings indicate that the life science data network can help find hidden links, can be used to validate links, and may offer a mechanism to integrate a wider set of resources to support biomedical knowledge discovery.
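The degree and clustering metrics used in the analysis above can be computed directly on a small adjacency-set representation of a link graph. A minimal sketch (the node labels and graph are illustrative, not Bio2RDF data):

```python
def degree(adj, node):
    """Number of links incident to a node in an undirected link graph."""
    return len(adj[node])

def clustering(adj, node):
    """Local clustering coefficient: the fraction of a node's
    neighbour pairs that are themselves linked."""
    nbrs = list(adj[node])
    k = len(nbrs)
    if k < 2:
        return 0.0
    linked = sum(1 for i in range(k) for j in range(i + 1, k)
                 if nbrs[j] in adj[nbrs[i]])
    return 2 * linked / (k * (k - 1))

# Toy link graph: a triangle (a, b, c) plus a pendant node d.
adj = {"a": {"b", "c", "d"}, "b": {"a", "c"}, "c": {"a", "b"}, "d": {"a"}}
```

Here `degree(adj, "a")` is 3 and `clustering(adj, "a")` is 1/3, since only one of the three neighbour pairs of `a` (namely `b` and `c`) is itself linked.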
This whitepaper addresses the challenges facing data-driven organizations, medical research, and health care. It summarizes how context-enabled semantic enrichment can transform traditional approaches to finding the right data. 3RDi offers advanced content enrichment with named entity recognition, semantic similarity, content classification, and content summarization, helping medical researchers and health care practitioners get the right data at the right time.
This document summarizes a session from the Force 11 Scholarly Communications Institute Summer School on data discovery. The session covered metadata, including what it is, types of metadata, and standards. It discussed how people search for and find data through various sources. The session also explored the FAIR data principles of findable, accessible, interoperable and reusable data and had breakout groups discuss applying these principles in practice.
This document discusses research challenges in data mining for science and engineering. It covers challenges related to information network analysis, discovery and understanding of patterns from data, stream data mining, and other topics. Key points discussed include analyzing complex networks in scientific domains, developing methods for mining long and approximate patterns, and designing algorithms that can handle streaming data.
DataTags, The Tags Toolset, and Dataverse Integration (Michael Bar-Sinai)
This presentation describes the concept of DataTags, which simplifies handling of sensitive datasets. It then shows the Tags toolset, and how it is integrated with Dataverse, Harvard's popular dataset repository.
This document summarizes and compares different methods for performing keyword searches in relational databases. It discusses candidate network-based methods, Steiner-tree based algorithms, and backward expanding keyword search approaches. It also evaluates methods that aim to improve search efficiency and accuracy, such as integrating multiple related tuple units and developing structure-aware indexes. The overall goal is to find an effective and efficient approach to keyword search over relational database structures.
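As a baseline for the methods compared above, a brute-force keyword search over a relational database simply scans every table for rows containing all keywords; candidate-network and Steiner-tree methods exist precisely to avoid this full scan. A sketch using SQLite (table and column names are illustrative):

```python
import sqlite3

def keyword_search(conn, tables, keywords):
    """Naive baseline: return (table, row) pairs whose values contain
    every keyword, scanning each table in full."""
    keywords = [k.lower() for k in keywords]
    hits = []
    for table in tables:
        cur = conn.execute(f"SELECT * FROM {table}")
        cols = [d[0] for d in cur.description]
        for row in cur:
            text = " ".join(str(v) for v in row).lower()
            if all(k in text for k in keywords):
                hits.append((table, dict(zip(cols, row))))
    return hits

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE papers (title TEXT)")
conn.execute("INSERT INTO papers VALUES ('Keyword Search in Databases')")
conn.execute("INSERT INTO papers VALUES ('Steiner Trees')")
```

`keyword_search(conn, ["papers"], ["keyword", "databases"])` returns only the first row; the structure-aware indexes discussed in the document aim to answer the same query without touching every tuple.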
This document discusses semantic search and how thesauri can improve search experiences. It describes different types of semantic searches and demands for smarter searches. PoolParty Semantic Search is presented as a solution that leverages thesauri to provide auto-complete, query expansion, faceted search, and integration of linked data from multiple sources. A live demo of PoolParty Semantic Search is available online.
Dataverse, Cloud Dataverse, and DataTags (Merce Crosas)
Talk given at Two Sigma:
The Dataverse project, developed at Harvard's Institute for Quantitative Social Science since 2006, is a widely used software platform to share and archive data for research. There are currently more than 20 Dataverse repository installations worldwide, with the Harvard Dataverse repository alone hosting more than 60,000 datasets. Dataverse provides incentives to researchers to share their data, giving them credit through data citation and control over terms of use and access. In this talk, I'll discuss the Dataverse project, as well as related projects such as DataTags to share sensitive data and Cloud Dataverse to share Big Data.
Semantic search uses language processing to analyze the meaning of content and search queries to return more relevant results. It involves classifying content using taxonomies, identifying named entities, extracting relationships between entities, and matching these based on meaning. Implementing semantic search requires preparing content through classification, metadata, and information architecture, as well as technologies for semantic tagging, entity extraction, triple stores, and integrating these capabilities with existing search and content management systems.
This document provides a full syllabus with questions and answers related to the course "Information Retrieval" including definitions of key concepts, the historical development of the field, comparisons between information retrieval and web search, applications of IR, components of an IR system, and issues in IR systems. It also lists examples of open source search frameworks and performance measures for search engines.
This document provides an overview of information retrieval models. It begins with definitions of information retrieval and how it differs from data retrieval. It then discusses the retrieval process and logical representations of documents. A taxonomy of IR models is presented including classic, structured, and browsing models. Boolean, vector, and probabilistic models are explained as examples of classic models. The document concludes with descriptions of ad-hoc retrieval and filtering tasks and formal characteristics of IR models.
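The vector model named above can be illustrated in a few lines: documents become tf-idf weight vectors and are ranked by cosine similarity. A minimal sketch with a toy corpus (real systems add proper tokenization, normalization, and smoothing):

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Classic vector-model weighting: term frequency times inverse
    document frequency, per term, per document."""
    N = len(docs)
    tokenized = [d.lower().split() for d in docs]
    df = Counter(t for toks in tokenized for t in set(toks))
    return [{t: tf * math.log(N / df[t]) for t, tf in Counter(toks).items()}
            for toks in tokenized]

def cosine(u, v):
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

docs = ["boolean retrieval model", "probabilistic retrieval model",
        "cooking recipes"]
vecs = tfidf_vectors(docs)
```

The first two documents share weighted terms and score above zero against each other, while the third shares nothing and scores zero; the Boolean model, by contrast, would only report set membership, not a ranking.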
Connecting Dataverse with the Research Life Cycle (Merce Crosas)
This document outlines how data repositories can connect with the research lifecycle. It discusses Dataverse, an open source data repository platform developed by IQSS at Harvard University. It describes several application programming interfaces (APIs) that can be used for interoperating with Dataverse, including OAI-PMH, SWORD, search, and data access APIs. It also lists several integrations that are implemented or in progress to connect Dataverse with other systems like the Open Science Framework (OSF) and support functions like depositing data management plans, data curation, assigning DOIs, and more.
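The search API mentioned above is served over plain HTTP. The sketch below only constructs the request URL (the endpoint and parameter names follow the Dataverse Search API, but check your installation's documentation; no network call is made here):

```python
from urllib.parse import urlencode

def dataverse_search_url(base_url, query, type_="dataset", per_page=10):
    """Build a Dataverse Search API URL, ready for a GET request
    with urllib or requests."""
    params = urlencode({"q": query, "type": type_, "per_page": per_page})
    return f"{base_url}/api/search?{params}"

url = dataverse_search_url("https://demo.dataverse.org", "climate")
```

The same pattern applies to the data access API; OAI-PMH and SWORD use their own protocol-specific request formats.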
Challenging Issues and Similarity Measures for Web Document Clustering (IOSR Journals)
This document discusses challenging issues and similarity measures for web document clustering. It begins with an introduction to text mining and document clustering. It then reviews related work on similarity approaches and measures. Some key challenging issues in web document clustering are discussed, such as measuring semantic similarity between words and evaluating cluster validity. Various types of similarity measures are also described, including string-based measures like Jaro-Winkler distance and corpus-based measures like latent semantic analysis. The conclusion states that accurate clustering requires a precise definition of similarity between document pairs and discusses different similarity measures that can be used.
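The Jaro-Winkler distance named above is a string-based similarity that adds a common-prefix bonus on top of the Jaro score. A self-contained sketch:

```python
def jaro(s1, s2):
    """Jaro similarity: characters match within a sliding window, and
    matched characters out of order count as transpositions."""
    if s1 == s2:
        return 1.0
    n1, n2 = len(s1), len(s2)
    if not n1 or not n2:
        return 0.0
    window = max(n1, n2) // 2 - 1
    used = [False] * n2
    matched1 = []
    for i, c in enumerate(s1):
        for j in range(max(0, i - window), min(n2, i + window + 1)):
            if not used[j] and s2[j] == c:
                used[j] = True
                matched1.append(c)
                break
    m = len(matched1)
    if m == 0:
        return 0.0
    matched2 = [s2[j] for j in range(n2) if used[j]]
    t = sum(a != b for a, b in zip(matched1, matched2)) / 2
    return (m / n1 + m / n2 + (m - t) / m) / 3

def jaro_winkler(s1, s2, p=0.1):
    """Jaro score plus a bonus for a shared prefix of up to 4 characters."""
    score = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == 4:
            break
        prefix += 1
    return score + prefix * p * (1 - score)
```

On the classic example pair "MARTHA"/"MARHTA" the Jaro score is about 0.944 and the Winkler prefix bonus lifts it to about 0.961.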
Relational Navigation: A Taxonomy-Based Approach to Information Access and Di... (Bradley Allen)
1. Relational navigation is an approach to information discovery and access that leverages taxonomies and ontologies to navigate through related information objects, rather than just searching.
2. Taxonomies provide the navigational structure for relational navigation, representing relationships between different types of information objects.
3. Relational navigation allows users to zero in on relevant information, follow relationships between objects, and share discoveries - helping users master discovery more effectively than just search.
BioPharma and FAIR Data, a Collaborative Advantage (Tom Plasterer)
The concept of FAIR (Findable, Accessible, Interoperable and Reusable) data is becoming a reality as stakeholders from industry, academia, funding agencies and publishers are embracing this approach. For BioPharma being able to effectively share and reuse data is a tremendous competitive advantage, within a company, with peer organizations, key opinion leaders and regulatory agencies. A few key drivers, success stories and preliminary results of an industry data stewardship survey are presented.
Information retrieval is concerned with searching for documents and metadata about documents. Documents contain information to be retrieved. There is overlap between terms like data retrieval, document retrieval, information retrieval, and text retrieval. Automated information retrieval systems are used to reduce information overload. Libraries and universities use IR systems to provide access to materials. Web search engines are a prominent example of IR applications. The idea of using computers for information retrieval was popularized in 1945. Early automated systems emerged in the 1950s and large-scale systems in the 1970s such as the Lockheed Dialog system. Many measures exist for evaluating IR system performance including precision, recall, and precision-recall curves.
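The evaluation measures named above are simple set ratios over the retrieved and relevant document sets. A minimal sketch (document IDs are illustrative):

```python
def precision_recall(retrieved, relevant):
    """Precision: fraction of retrieved docs that are relevant.
    Recall: fraction of relevant docs that were retrieved."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# 4 docs retrieved, 3 relevant overall, 2 of them among the retrieved.
p, r = precision_recall(["d1", "d2", "d3", "d4"], ["d1", "d3", "d5"])
```

Here `p` is 0.5 and `r` is 2/3; a precision-recall curve repeats this computation at each cut-off of the ranked result list.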
This document discusses using semantics and semantic triples to represent relationships between concepts and entities. Semantic triples use subjects, predicates, and objects to make assertions about relationships. These triples can be represented in various formats like RDF/XML, Turtle, and TriX. Semantic schemas like RDFS and OWL can define additional rules and relationships. SPARQL is then used to query the semantic graph. Combining semantics with search in MarkLogic allows documents to be flexibly joined and queried based on both their content and semantic relationships.
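The subject-predicate-object triples described above can be queried by simple pattern matching even without a full triple store. A minimal in-memory sketch (the triples are illustrative, and `None` stands in for a SPARQL variable):

```python
# A tiny graph of assertions as (subject, predicate, object) tuples.
triples = [
    ("Dataverse", "developedBy", "IQSS"),
    ("IQSS", "partOf", "Harvard"),
    ("Dataverse", "type", "Repository"),
]

def match(pattern, store=triples):
    """Return every triple matching an (s, p, o) pattern, where None
    matches anything -- the core operation behind SPARQL basic patterns."""
    return [t for t in store
            if all(p is None or p == v for p, v in zip(pattern, t))]
```

`match(("Dataverse", None, None))` returns both Dataverse triples; joining such patterns on shared variables is what lets a SPARQL engine combine document content with semantic relationships.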
International Journal of Engineering Research and Applications (IJERA) is an open access online peer reviewed international journal that publishes research and review articles in the fields of Computer Science, Neural Networks, Electrical Engineering, Software Engineering, Information Technology, Mechanical Engineering, Chemical Engineering, Plastic Engineering, Food Technology, Textile Engineering, Nano Technology & science, Power Electronics, Electronics & Communication Engineering, Computational mathematics, Image processing, Civil Engineering, Structural Engineering, Environmental Engineering, VLSI Testing & Low Power VLSI Design etc.
bioCADDIE Webinar: The NIDDK Information Network (dkNET) - A Community Resear... (dkNET)
dkNET provides a single portal for discovering over 3,500 biomedical research resources and datasets. It aims to make these resources findable, accessible, interoperable, and reusable in accordance with the FAIR principles. The portal contains three main sections for browsing community resources, additional resources, and literature. It utilizes faceted searching and provides analytics and notifications to help users track changes to resources over time.
The Nuclear Receptor Signaling Atlas (NURSA) is partnering with dkNET (NIDDK Information Network) to host a dataset challenge, and we invite you to join! Everyone is talking about Big Data. How can we ensure that the impact of individual scientists, working on a myriad of small, focused studies that discover and probe new phenomena, is not lost in the Big Data world? In fact, there is more than one way to generate big data, and we would like your help in creating and expanding "big data" for NIDDK! In this 30-minute webinar, the dkNET team will present an overview of the challenge task, how to use dkNET to find research resources, and top tips!
Metadata: Towards Machine-Enabled Intelligence (dannyijwest)
The World Wide Web has revolutionized data availability, but with its current structural model it is becoming increasingly difficult to retrieve relevant information with reasonable precision and recall using the major search engines. Metadata, combined with improved searching techniques, helps enhance relevant information retrieval: structured descriptions of Web resources enable greater search precision and more accurate relevance ranking of retrieved information. One effort towards standardization is the Dublin Core metadata standard, alongside other standards that enhance retrieval of a wide range of information resources. This paper discusses the importance of metadata, various metadata schemas and elements, and the need for metadata standardization. It further discusses how metadata can be generated using various tools that assist intelligent agents in efficient retrieval.
Doing for Data what PubMed did for literature: DATS, a model for dataset description, dataset indexing, and data discovery.
Googleslides [https://goo.gl/cd5KKa] or Slideshare [https://goo.gl/c8DH5N]
This document discusses mining heterogeneous information networks. It begins by defining heterogeneous information networks as information networks containing multiple object and link types. It then discusses how heterogeneous networks are richer than homogeneous networks derived from them by projection. Several examples of heterogeneous networks are given, such as bibliographic, social media, and healthcare networks. The document outlines principles for mining heterogeneous networks, including using meta-paths to explore network structures and relationships. It introduces methods for ranking, clustering, and classifying nodes in heterogeneous networks, such as the RankClus and NetClus algorithms, which integrate ranking and clustering.
This document discusses using semantic web technologies to help make sense of big data by linking and integrating heterogeneous data sources. It presents a self-adaptive natural language interface model that takes a natural language query as input, considers possible concept annotations and SPARQL query patterns, runs the queries, and returns results to a reasoner to identify the correct query and answer. The model was tested on geography and Quran ontologies and was able to correctly answer questions with different SPARQL patterns. The conclusion discusses how semantic web and linked data can help analyze big data and create more personalized applications.
Metadata is information embedded in a file that describes the file's contents. Handling primary evidence with a metadata-based approach still involves a great deal of manual searching for correlated files when uncovering computer crimes. When correlated files sit in separate locations (folders), the sheer number of files poses a formidable challenge for forensic investigators analyzing the evidence. In this study, we build a prototype that uses a metadata-based approach to automatically analyze the correlation between the main evidence file and files associated with it or deemed relevant to the investigation, based on the metadata parameters Author, Size, File Type, and Date. The analysis reads the metadata characteristics of Jpg, Docx, Pdf, Mp3, and Mp4 files and correlates digital evidence using the specified parameters, so it can multiply the findings of evidence and facilitate its analysis. The correlation analysis found that using the parameters Author, Size, File Type, and Date yields fewer correlated files, while omitting Size and File Type yields more correlated files, owing to the variety of file extensions and sizes.
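The parameter-based correlation this abstract describes can be sketched with filesystem metadata alone (Author is format-specific, e.g. stored in Office or EXIF fields, so this illustrative sketch uses only File Type, Size, and Date):

```python
import os

def file_metadata(path):
    """Collect the correlation parameters available from the filesystem."""
    st = os.stat(path)
    return {"file_type": os.path.splitext(path)[1].lstrip(".").lower(),
            "size": st.st_size,
            "date": st.st_mtime}

def correlated(meta_a, meta_b, params=("file_type",)):
    """Two files correlate when they agree on every selected parameter.
    As the abstract notes, selecting fewer parameters yields more matches."""
    return all(meta_a[p] == meta_b[p] for p in params)
```

For example, two `.docx` files of different sizes correlate on `("file_type",)` but not on `("file_type", "size")`.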
Simon Cox (Researcher @ CSIRO) and I presented on outcomes of the CSIRO Summer of Vocabularies.
This project focused on:
- Examining the state of the management of various controlled vocabularies and developing reusable processes to clean & standardise those vocabularies.
- Developing standards & technologies for vocabulary management, curation & visualisation.
We presented on these Summer of Vocabularies activities and how they are being used now to further the work of the Vocram project and the improvement and development of ANDS vocabulary services.
Toward linked data: Library of Congress Medium of Performance Thesaurus Jane Frazier
This document discusses the Library of Congress Medium of Performance Thesaurus (LCMPT), a controlled vocabulary for describing the medium of performance for musical works. It provides background on the development of LCMPT, compares it to Library of Congress Subject Headings, and details its current statistics and mapping to authority data elements and SKOS. Finally, it outlines several issues for consideration in further developing LCMPT, such as handling foreign language terms, bi-level relationships, and options for its data storage platform.
A Bibliographic Overview of English-Language Opera TranslationJane Frazier
This document discusses opera translation, providing resources and guidelines. It summarizes key points about translating opera, including principles translators follow like suitability for singing and maintaining meaning. It also outlines problems translators face, such as differences between languages. Ideal translator characteristics are musical understanding, poetic skill, and comprehending each element's contribution. Resources provided include libretti, supertitles, and writings examining translation.
Flying solo: data librarians working outside (traditional) librariesJane Frazier
I used these slides for my portion of the "Flying Solo" ANDS webinar:
Did you know there are data librarians who work outside of (traditional) libraries? For some, being a data librarian means leaving the relative comfort of the library behind and ‘flying solo’ into unchartered territory. These are new and demanding roles that require a steep learning curve with minimal support. In this webinar, three data librarians working outside of libraries will share their experience of going it alone, reflecting on these challenging yet rewarding roles that push the boundaries of librarianship and open new opportunities for the profession.
Siobhann McCafferty is based at QUT’s Institute for Future Environments in Brisbane and is the Research Data Coordinator for the National Agricultural Nitrous Oxide Research Program (NANORP). She is embedded in the Healthy Ecosystems and Environmental Management group at IFE and works with researchers from across Australia to store program data and make it discoverable and reusable.
Jane Frazier is a Data Librarian at ANDS. She has previously worked in the University of North Carolina Music Library cataloging 20th century American vocal sheet music, as curatorial assistant at the Dryad digital repository, and at the UNC Metadata Research Center exploring automatic subject indexing processes for Dryad. From 2013 to 2014 Jane led the research and development of a new web-based cataloging system for collectible items with Stanley Gibbons, one of the world’s oldest stamp collecting firms.
-- Michelle Teis has more than 25 years’ industry experience, senior consultant
Michelle Teis is an enterprise information management expert specialising in content, data and knowledge management, and information privacy. http://www.glentworth.com/about-us/our-key-people/michelle-teis/
Intro to metadata, RIF-CS, RDA records [ANU 2016]Jane Frazier
The document summarizes a training workshop on using RIF-CS (Registry Interchange Format - Collections and Services) to create quality metadata records in RDA (Resource Description and Access). The workshop included introductions to metadata and metadata standards, and RIF-CS. It featured an activity using RIF-CS to create metadata records for a meteorological data collection in accordance with RDA.
Document Retrieval System, a Case StudyIJERA Editor
In this work we have proposed a method for automatic indexing and retrieval. This method will provide as a
result the most likelihood document which is related to the input query. The technique used in this project is
known as singular-value decomposition, in this method a large term by document matrix is analyzed and
decomposed into 100 factors. Documents are represented by 100 item vector of factor weights. On the other
hand queries are represented as pseudo-document vectors, which are formed from weighed combinations of
terms.
Interpreting the Semantics of Anomalies Based on Mutual Information in Link M...ijdms
This paper aims to show how mutual information can help provide a semantic interpretation of anomalies in data, characterize the anomalies, and how mutual information can help measure the information that object item X shares with another object item Y. Whilst most link mining approaches focus on predicting link type, link based object classification or object identification, this research focused on using link mining to detect anomalies and discovering links/objects among anomalies. This paper attempts to demonstrate the contribution of mutual information to interpret anomalies using a case study.
This document is a thesis that proposes using word embeddings to improve information retrieval by addressing term mismatch issues. It discusses word2vec, a technique for learning word embeddings from large text corpora that capture semantic relationships between words. The thesis proposes two approaches: 1) incorporating word embedding similarities into a probabilistic language model for retrieval and 2) a vector space model. Due to time constraints, only the first approach is implemented, which integrates word embeddings into ALMasri and Chevallet's probabilistic language model. Experiments are conducted to evaluate the impact of using semantic features from word embeddings on retrieval effectiveness.
An improved technique for ranking semantic associationst07IJwest
The primary focus of the search techniques in the first generation of the Web is accessing relevant
documents from the Web. Though it satisfies user requirements, but it is insufficient as the user sometimes
wishes to access actionable information involvin
g complex relationships between two given entities.
Finding such complex relationships (also known as semantic associations) is especially useful in
applications
such as
National Security, Pharmacy, Business Intelligence etc. Therefore the next frontier is
discovering relevant semantic associations between two entities present in large semantic metadata
repositories. Given two entities, there exist a huge number of semantic associations between two entities.
Hence ranking of these associations is required i
n order to find more relevant associations. For this
Aleman Meza et al. proposed a method involving six metrics viz. context, subsumption, rarity, popularity,
association length and trust. To compute the overall rank of the associations this method compute
s
context, subsumption, rarity and popularity values for each component of the association and for all the
associations. However it is obvious that, many components appears repeatedly in many associations
therefore it is not necessary to compute context, s
ubsumption, rarity
,
popularity
,
and
trust
values of the
components every time for each association rather the previously computed values may be used while
computing the overall rank of the associations. Thi
s paper proposes a method to re
use the previously
computed values using a hash data structure thus reduce the execution time. To demonstrate the
effectiveness of the proposed method, experiments were conducted on SWETO ontology. Results show
that the proposed method is more efficient than the other existi
ng methods
.
Next-Generation Search Engines for Information RetrievalWaqas Tariq
In the recent years, there have been significant advancements in the areas of scientific data management and retrieval techniques, particularly in terms of standards and protocols for archiving data and metadata. Scientific data is generally rich, not easy to understand, and spread across different places. In order to integrate these pieces together, a data archive and associated metadata should be generated. This data should be stored in a format that can be locatable, retrievable and understandable, more importantly it should be in a form that will continue to be accessible as technology changes, such as XML. New search technologies are being implemented around these protocols, which makes searching easy, fast and yet robust. One such system, Mercury, a metadata harvesting, data discovery, and access system, built for researchers to search to, share and obtain spatiotemporal data used across a range of climate and ecological sciences.
This document discusses community approaches to open data at scale. It describes the Metadata 2020 collaboration, which aims to promote richer, connected, reusable open metadata. It outlines several projects undertaken by Metadata 2020 working groups to address challenges around metadata quality, standards, and incentives. The document also summarizes two talks on improving metadata pipelines for SHARE and the evolution of metadata curation at Dryad data repository. It discusses Dryad's integration of manuscript and data submission as well as efforts to enhance interoperability, data citation, and the proposed Data Curation Network model.
The document discusses how search alone is no longer sufficient to meet people's information needs, as they want to find and follow relationships between information. It argues that using metadata expressed through Semantic Web standards can provide more powerful and cost-effective solutions than traditional search, by enabling relational navigation between heterogeneous objects. The document presents several case studies of companies that implemented navigation systems using Siderean Software's products to improve information discovery for customers, partners and employees.
DRI Introductory Training: Introduction to Metadatadri_ireland
This presentation was delivered by National Open Research Coordinator Dr Daniel Bangert as part of a Digital Repository of Ireland (DRI) Introductory Training seminar aimed at the University College Cork (UCC) research community on 14 June 2021. The presentation covers an introduction to the basics of metadata.
What Are Links in Linked Open Data? A Characterization and Evaluation of Link...Armin Haller
Linked Open Data promises to provide guiding principles to publish interlinked knowledge graphs on the Web in the form of findable, accessible, interoperable, and reusable datasets. In this talk I argue that while as such, Linked Data may be viewed as a basis for instantiating the FAIR principles, there are still a number of open issues that cause significant data quality issues even when knowledge graphs are published as Linked Data. In this talk I will first define the boundaries of what constitutes a single coherent knowledge graph within Linked Data, i.e., present a principled notion of what a dataset is and what links within and between datasets are. I will also define different link types for data in Linked datasets and present the results of our empirical analysis of linkage among the datasets of the Linked Open Data cloud. Recent results from our analysis of Wikidata, which has not been part of the Linked Open Data Cloud, will also be presented.
Presentación del Dr. Getaneh Alemu (Solent University, Reino Unido), en el II Congreso de Información, Comunicación e Investigación (CICI 2018) “Metadatos y Organización de la Información”. Facultad de Filosofía y Letras de la Universidad Autónoma de Chihuahua, México. Evento organizado por el Cuerpo Académico 'Estudios de la Información' y el Grupo Disciplinar ‘Información, Lenguaje, Comunicación y Desarrollo Sostenible’. 29 de octubre de 2018.
TOWARDS A MULTI-FEATURE ENABLED APPROACH FOR OPTIMIZED EXPERT SEEKINGcsandit
With the enormous growth of data, retrieving information from the Web became more desirable
and even more challenging because of the Big Data issues (e.g. noise, corruption, bad
quality…etc.). Expert seeking, defined as returning a ranked list of expert researchers given a
topic, has been a real concern in the last 15 years. This kind of task comes in handy when
building scientific committees, requiring to identify the scholars’ experience to assign them the
most suitable roles in addition to other factors as well. Due to the fact the Web is drowning with
plenty of data, this opens up the opportunity to collect different kinds of expertise evidence. In
this paper, we propose an expert seeking approach with specifying the most desirable features
(i.e. criteria on which researcher’s evaluation is done) along with their estimation techniques.
We utilized some machine learning techniques in our system and we aim at verifying the
effectiveness of incorporating influential features that go beyond publications
The document summarizes research on students' ability to critically evaluate information found online. It discusses how students struggle with skills like determining a website's credibility and relevance. The study aimed to compare the strategies used by students who were successful versus less successful on a task evaluating the reliability of websites. By analyzing think-aloud protocols, researchers hoped to better understand the relationship between offline and online evaluation skills and identify common markers used to judge a site's reliability.
The document discusses methodologies for sharing long-tail data and what has been learned. It notes that unique identifiers (PIDs) are important for identifying entities across contexts. Standards like MINI and common data elements (CDEs) help ensure data is findable, accessible, and reusable. The Neuroscience Information Framework (NIF) aggregates ontologies and searches over 200 data sources to organize information. What we have learned is that data should be in repositories, not personal servers; people are key to these efforts; and resources should be comprehensive and support each other to advance open data sharing.
The document describes a summer institute on discovering big data held in San Diego from August 5-9, 2013. It discusses several topics related to big data in neuroscience including available resources, how to find and connect relevant information, challenges around data integration from disparate sources, and using ontologies and machine learning for tasks like data tagging.
This document discusses metadata and its importance for digital libraries and humanities. It defines metadata as "data about data" that describes resources to help users find, identify and select them. Metadata plays a crucial role in managing the huge amount of digital information and data available. The document advocates for an approach of enriching metadata by allowing both experts and users to contribute, and filtering it through customizable interfaces to meet diverse user needs.
This document discusses metadata and controlled vocabularies. It defines metadata as structured data about an object that supports functions associated with that object. Metadata includes descriptive information about an object's content, context, and structure. Different types of metadata serve administrative, descriptive, preservation, technical, and use functions. Key metadata schemas discussed include Dublin Core, TEI, METS, and MODS. Controlled vocabularies are important for consistency, accuracy, and efficient searching when assigning subject terms in metadata. Quality metadata should be shareable, consistent, and conform to standards.
The document provides an outline for a presentation on the wicked problem of data literacy. It discusses emerging findings from a review of literature on data literacy, including evolving concepts of data in society and differing definitions of data literacy. Conceptual frameworks used in the research are described, such as theories of radical change, stakeholder identification, and wicked problems. Key stakeholders in advancing data literacy are identified.
Data Mining for Education
Ryan S.J.d. Baker, Carnegie Mellon University, Pittsburgh, Pennsylvania, USA
rsbaker@cmu.edu
Article to appear as
Baker, R.S.J.d. (in press) Data Mining for Education. To appear in McGaw, B., Peterson, P.,
Baker, E. (Eds.) International Encyclopedia of Education (3rd edition). Oxford, UK: Elsevier.
This is a pre-print draft. Final article may involve minor changes and different formatting.
This document describes a framework for representing and reasoning with geographic semantics. The framework enables visualizing knowledge discovery processes, automating tool selection to solve user-defined geographic problems, and evaluating semantic changes. Methods, data, and human experts are described using ontologies. An expert system then selects the optimal combination of resources to solve a given problem based on semantic descriptions. The resulting solution represents the formation history of any new information products and the semantic changes involved.
Similar to Subject tagging: Recommendations for Dryad curators and scientists (20)
Structured data:
- What is it?
- Data modelling standards
- Semantic Web
Knowledge Organisation Systems
- Natural language → controlled language
- Controlled vocabulary → ontology
- Design considerations & user impact
VALA Tech Camp 2017: Intro to Wikidata & SPARQLJane Frazier
A hands-on introduction to interrogation of Wikidata content using SPARQL, the query language used to query data represented in RDF, SKOS, OWL, and other Semantic Web standards.
Presented by myself and Peter Neish, Research Data Specialist @ University of Melbourne.
Research Vocabularies Australia: vocabularies as a national serviceJane Frazier
Research Vocabularies Australia (RVA) is a service that provides controlled vocabularies to support research, data reuse, and data linking. It includes vocabularies related to food and crop research like the World Vegetable Authority, International Grain Organisation, Australian Bureau of Fruit Vegetable Vocabulary, Wheat Vocabulary, and Fruit Vocabulary. RVA connects these related vocabularies to help students, researchers, and data portals better organize and integrate information about vegetables and grains.
Day in the life of a data librarian [presentation for ANU 23Things group]Jane Frazier
This document summarizes the job responsibilities and career path of a data librarian. It describes how the librarian draws on skills from traditional librarianship, metadata work, digital curation, software development and research to support data management and sharing. The librarian's current role involves developing metadata standards, providing training and consultancy to researchers, and engaging with colleagues both within and outside their organization to improve data services. The document suggests aspiring data librarians learn new technologies, describe their skills to potential employers, and stay active developing their expertise through conferences and online resources.
Controlled vocabularies for health & medicineJane Frazier
During this workshop, given at the Health Libraries Australia Professional Development Day 9 July in Brisbane, we discussed maximising value in the use of controlled vocabularies for health & medicine.
Music Information Retrieval: A Literature ReviewJane Frazier
Music information retrieval is a small subfield within traditional information retrieval. For the purposes of this piece, I selected publications which are overviews of MIR as a field and publications having to do with user discovery of music.
Honeypots Unveiled: Proactive Defense Tactics for Cyber Security, Phoenix Sum...APNIC
Adli Wahid, Senior Internet Security Specialist at APNIC, delivered a presentation titled 'Honeypots Unveiled: Proactive Defense Tactics for Cyber Security' at the Phoenix Summit held in Dhaka, Bangladesh from 23 to 24 May 2024.
HijackLoader Evolution: Interactive Process HollowingDonato Onofri
CrowdStrike researchers have identified a HijackLoader (aka IDAT Loader) sample that employs sophisticated evasion techniques to enhance the complexity of the threat. HijackLoader, an increasingly popular tool among adversaries for deploying additional payloads and tooling, continues to evolve as its developers experiment and enhance its capabilities.
In their analysis of a recent HijackLoader sample, CrowdStrike researchers discovered new techniques designed to increase the defense evasion capabilities of the loader. The malware developer used a standard process hollowing technique coupled with an additional trigger that was activated by the parent process writing to a pipe. This new approach, called "Interactive Process Hollowing", has the potential to make defense evasion stealthier.
Securing BGP: Operational Strategies and Best Practices for Network Defenders...APNIC
Md. Zobair Khan,
Network Analyst and Technical Trainer at APNIC, presented 'Securing BGP: Operational Strategies and Best Practices for Network Defenders' at the Phoenix Summit held in Dhaka, Bangladesh from 23 to 24 May 2024.
SUBJECT TAGGING: RECOMMENDATIONS FOR DRYAD CURATORS AND SCIENTISTS

CONTENTS

PROBLEM SPECIFICATION
THE VALUE OF QUALITY DESCRIPTIVE METADATA
METRICS FOR METADATA QUALITY
HUMAN METADATA GENERATION
MODELS FOR CREATING METADATA
CONTROLLED TERMS
WHAT IS A CONTROLLED VOCABULARY?
THE VALUE OF CONTROLLED TERMS
CONTROLLED VOCABULARIES USEFUL IN DESCRIBING DRYAD DATA
UNCONTROLLED TERMS
THE VALUE OF UNCONTROLLED TERMS
FOLKSONOMIES
RECOMMENDATIONS
HOW TO USE THE DESCRIPTIVE METADATA FIELDS
SUBJECT KEYWORDS
TEMPORAL KEYWORDS
SPATIAL KEYWORDS
TAXONOMIC KEYWORDS
BIBLIOGRAPHY
APPENDIX 1: FURTHER RESOURCES FOR AUTHORS
PROBLEM SPECIFICATION
The Dryad project is an online repository for the data which underlie publications in
the biosciences. Submissions to Dryad consist of the data that is used to create scientific
publications, for example phylogenetic trees, tables, spreadsheets, images, maps, gene
alignments, matrices, and the like. When authors submit their data to Dryad, they have the
opportunity to enter topical headings/subject headings that will be used to categorise and
retrieve their data in the future.
There are currently four types of topical metadata that Dryad accepts: subject,
temporal, spatial, and taxonomic. Because these headings are not controlled in any way,
scientists often do not submit any headings (leave the fields blank) or submit them in ways
that might not benefit future users of Dryad (poor formatting, abbreviations, etc.). The data
is much more accessible to users of Dryad - other scientists, the public, and Dryad
curators - when it is described in a meaningful and thorough way.
Studying this topic has allowed me to gain a greater understanding of how scientists
tag their data. Do they come up with their own topical terms, do they consult controlled
vocabularies, or do they use some other source for these terms? The goal of this project
is to write a memo or guide which will help Dryad librarians guide scientists to describe
their data in the best way possible by submitting meaningful subject headings/topical
headings with their data. The benefits of scientific data archiving are best summed up in
Whitlock’s 2009 article Data archiving in ecology and evolution: best practices:
“Data archives serve science in a variety of ways. Publicly archived
data enable more transparent science, with better error checking and
verification of results. Archiving also enables data to be re-used for
broader meta-analyses and to address new questions. Available data
can serve a powerful educational role, both in teaching the statistical
and technical aspects of research and to engage students in the
process of science. Public data archiving is also a powerful mechanism
for data security, providing a mechanism by which data can be saved
and re-accessed by the original authors and others even after hard disk
failure or other catastrophes.”
"
"
4. 4
THE VALUE OF QUALITY DESCRIPTIVE METADATA!
"
METRICS FOR METADATA QUALITY
"
Metadata are data that provide information about other data. This may include the
means by which the data was created, the purpose of the data, the time and date of
creation, the name of the creator or author of the data, the location where the data was
created, or descriptions of the data’s aboutness.
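To make the definition concrete, a descriptive metadata record can be thought of as a set of named fields. The record below is a minimal sketch; the field names and values are invented for illustration and are not Dryad's actual schema.

```python
# A hypothetical descriptive metadata record for a data submission.
# Field names and values are invented examples, not Dryad's schema.
record = {
    "title": "Phylogenetic trees for an island bird radiation",
    "creator": "J. Smith",
    "date_created": "2011-04-02",
    "spatial": "Galapagos Islands",
    "subject": ["phylogenetics", "adaptive radiation"],
    "description": "Gene alignments and trees underlying the published analysis.",
}

# Each field answers one of the questions above: who created the data,
# when and where, and what the data are about.
print(sorted(record))
```

Each field maps onto one of the aspects listed above: creator, date, location, purpose, and aboutness.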
There has been much research on criteria which can and should be used to
measure the quality of this metadata. Rothenberg (1996) identifies correctness and
appropriateness as two main criteria for data evaluation. In their 1997 article The Role of
Content Analysis in Evaluating Metadata for the US Government Information Locator
Service (GILS): Results from an exploratory study, Moen et al. describe 23 evaluation
criteria with which to analyse metadata quality. From these 23 criteria, the researchers
distilled four main criteria: accuracy, consistency, completeness, and currency. These four
criteria partially overlap with Tozer’s (1999) data quality measures of accuracy,
consistency, completeness, timeliness, and intelligibility. Bruce and Hillman (2004) refine
the previously mentioned criteria and modify them for the library community, suggesting
completeness, accuracy, provenance, conformance to expectation, logical consistency,
coherence, timeliness, and accessibility. In her 2009 article, Park evaluates Moen et al.
(1997) and Bruce and Hillman’s (2004) criteria in addition to the criteria determined by
seven other research teams, and determines that accuracy, completeness, and
consistency are the most commonly used criteria in measuring metadata quality.
Accuracy (or correctness) of metadata indicates the accurate description and input
of data. According to Park (2009), this comprises three elements: the accuracy of the
content of the data element, the correctness of the intellectual property, and the
correctness of the instantiation (or particular instance). Park includes errors in spelling,
date format, capitalisation, and punctuation, as well as non-authoritative forms of terms,
typographical errors, and incorrect data values, as potential problems with the accuracy of
metadata.
According to Park, completeness can be measured by full access capacity to
individual resources and connection to the collections in which they are housed.
Completeness points to resource discovery as the functional purpose of metadata, but
does not necessarily indicate that all elements in a particular metadata scheme must be
used. Because completeness is directly affected by policies, best practices, and
application profiles for specific domains, the completeness of metadata records may vary
depending on the environment in which they are housed. Completeness can be achieved
in a metadata record if the given resource type, its relation to the local collection and the
local metadata guidelines are met satisfactorily.
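Completeness lends itself to a simple mechanical check. The sketch below scores a record as the fraction of expected elements that are filled; the expected-field list is an assumption chosen for illustration, and a real profile would come from the local metadata guidelines mentioned above.

```python
# Score completeness as the fraction of expected fields that are non-empty.
# EXPECTED_FIELDS is an illustrative assumption, not a published profile.
EXPECTED_FIELDS = ["title", "creator", "date_created", "spatial", "subject"]

def completeness(record):
    filled = sum(1 for field in EXPECTED_FIELDS if record.get(field))
    return filled / len(EXPECTED_FIELDS)

partial = {"title": "Gene alignments", "creator": "J. Smith", "subject": ["genetics"]}
print(completeness(partial))  # 3 of 5 expected fields filled -> 0.6
```

As the text notes, a low score does not automatically mean a poor record: which elements are expected depends on the resource type and the local application profile.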
Consistency (or comparability) can be measured by examining the values of the
metadata and the format of the metadata. Park states that metadata values must be
examined on the conceptual level by measuring the degree to which the same data values
are used for delivering similar concepts in the description of a resource, and the data
format must be examined on a structural level by measuring the extent to which the same
structure or format is used for presenting similar attributes of a resource. For example,
differences between the encoding of a date element (e.g., MM-DD-YYYY versus DD-MM-
YY) are structurally inconsistent, and may cause problems for future users of the data.
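Structural consistency of the kind described above can also be enforced mechanically, by normalising every date to a single format on ingest. The sketch below assumes ISO 8601 (YYYY-MM-DD) as the target; the list of accepted input formats is an invented example.

```python
from datetime import datetime

# Accepted input formats are an illustrative assumption; ambiguous values
# (is 02-03-2011 February 3rd or March 2nd?) are exactly why requiring a
# single format up front is preferable to guessing afterwards.
INPUT_FORMATS = ["%Y-%m-%d", "%m-%d-%Y", "%d-%m-%y"]

def normalise_date(value):
    """Return the value as ISO 8601 (YYYY-MM-DD), trying each known format."""
    for fmt in INPUT_FORMATS:
        try:
            return datetime.strptime(value, fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    raise ValueError("unrecognised date format: " + value)

print(normalise_date("04-02-2011"))  # MM-DD-YYYY input -> '2011-04-02'
```

Note that the order of `INPUT_FORMATS` decides how ambiguous values are read, which is one more argument for a repository mandating one structure rather than accepting many.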
HUMAN METADATA GENERATION
"
According to Greenberg and Robertson (2002), human metadata generation takes
place when an individual is responsible for the identification and assignment or recording
of resource materials. This type of metadata generation may take place in a number of
ways by different types of individuals. The three types of individuals that have been
identified in the literature are professional metadata creators, resource authors, and social
taggers. Professional metadata creators may be catalogers, indexers or curators who have
formal training and are proficient in the use of descriptive standards (Greenberg et al.
2002). According to Lu et al. (2002), social taggers apply their own descriptors to sources
that interest them. Resource authors, like the submitters of data to Dryad, are the
individuals responsible for the creation of the intellectual content of a work. Most
importantly, they are “intimate with their creations and have knowledge of unrecorded
information for producing descriptive metadata,” giving them a unique ability to describe
their data with the highest accuracy (Greenberg and Robertson 2002).
The fact that resource authors may lack the knowledge of indexing and cataloging
principles that professional metadata creators possess is well documented (Greenberg et
al. 2003). However, attitudes toward resource authors as metadata creators are somewhat
split in the literature. Wilson, in her 2007 article Toward Releasing the Metadata Bottleneck,
states that resource authors “seldom provide sufficient metadata for their digital resources”
and Greenberg et al. (2003) state that authors have “reported confusion or uncertainty
regarding specific fields and have requested greater assistance in determining appropriate
inputs, especially for subject fields.” However, Greenberg et al. also report that resource
authors state a desire for better understanding of the metadata record and its purpose.
Currier et al. (2003) discuss the debate between allowing metadata professionals or
resource authors to create metadata for resources:
“How may this difficult and complex task best be carried out for
maximum resource discoverability by a heterogeneous population of
searchers? Should the resource author, who may know their subject
area and its terminology well, create the subject metadata? Or should it
be a metadata specialist, who may know the specific area less well, but
may be better placed to step back and think about all the potential users
of a resource, and about consistency of key words and classifications
across a repository or network?”
"
MODELS FOR CREATING METADATA
"
Currier et al. (2003) recommend three models for metadata creation: creation by a
resource author only, creation by a metadata specialist only, or creation by collaboration
between a resource author and a metadata specialist. A data collection centre (like Dryad)
would be well advised to provide user support and training if it is to rely on metadata
creation by resource authors alone. Conversely, metadata specialists, who already possess
the skills needed to create a quality metadata record, may lack knowledge of the context,
history, or subject area of the resource needed to best record its metadata. Currently,
Dryad uses a semi-collaborative approach to metadata creation. Although resource authors
do not consult with the curators (metadata specialists) while they are submitting data,
the curators spend time checking the author-
created metadata for quality issues. Greenberg and Robertson (2002) recommend this
model, stating that “...the integration of expert and author generated descriptive metadata
can advance and improve the quality of metadata for web content, which in turn could
provide useful data for intelligent web agents, ultimately supporting the development of the
Semantic Web. [...] If such partnerships are well planned and evaluated, they could make
a significant contribution to achieving the Semantic Web.” Along with the three models for
metadata creation that Currier et al. (2003) present, social tagging and the rise of
folksonomies should be mentioned as a fourth model. Folksonomies will be discussed in a
later section of this paper.
"
CONTROLLED TERMS!
"
WHAT IS A CONTROLLED VOCABULARY?
"
A controlled vocabulary organises content, or knowledge, in a way that allows it to be
easily retrieved at a later time. Vocabularies are ‘controlled’ in that they make use of
authorised descriptions of the content they contain. These groupings of concepts are
carefully selected and described so that the information they contain can be retrieved in
the most efficient way possible.
THE VALUE OF CONTROLLED TERMS
Natural language, or the way that humans speak in everyday life, is messy. We use
multiple terms and phrases to describe the same things, and there are fine (or grey) lines
between one meaning and another. The organisation, categorisation, and labelling of our
knowledge can be achieved by way of controlled vocabularies. A controlled vocabulary
allows for all concepts to be consistently labeled using language that is unambiguous and
is familiar to its users. More importantly, controlled vocabularies allow us to search for
concepts and achieve successful, quality results.
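The idea can be sketched with a toy example. The vocabulary below is invented for illustration and is not drawn from any real thesaurus: it routes several natural-language variants to a single authorised term, so that all of them retrieve the same concept.

```python
# A toy controlled vocabulary: natural-language variants mapped to one
# authorised (preferred) term. All terms here are invented examples.
PREFERRED_TERM = {
    "heart attack": "Myocardial Infarction",
    "cardiac infarction": "Myocardial Infarction",
    "myocardial infarction": "Myocardial Infarction",
}

def authorise(term):
    """Return the authorised form of a term, or the term unchanged if unknown."""
    return PREFERRED_TERM.get(term.strip().lower(), term.strip())

# Different everyday phrasings resolve to the same concept:
assert authorise("Heart attack") == authorise("cardiac infarction")
```

A search system built on such a mapping can index and query by the authorised term only, which is what makes retrieval consistent.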
CONTROLLED VOCABULARIES USEFUL IN DESCRIBING DRYAD DATA
Medical Subject Headings (MeSH)[1] is the controlled vocabulary of the United
States National Library of Medicine. Currently consisting of more than 177,000 terms
situated in a twelve-level hierarchy, MeSH allows for the indexing of articles from
biomedical journals for the MEDLINE/PubMed database. MeSH comprises three
main types of terms: descriptors (main headings), qualifiers (subheadings), and
supplementary concept records (SCRs). Descriptors indicate the subject of citations
indexed in MEDLINE/PubMed, and the 83 existing topical qualifiers allow for the grouping
together of citations concerned with a particular aspect of a subject. SCRs index chemicals
and drugs and are searchable by substance name in PubMed.

[1] http://www.ncbi.nlm.nih.gov/mesh
The Getty Thesaurus of Geographic Names (TGN)[2] is a controlled vocabulary
provided by the Getty Vocabulary Program of the J. Paul Getty Trust. TGN currently
includes approximately 1,106,000 hierarchically arranged terms that describe names and
associated information about places, including current and historical physical features and
political entities. Each term entry includes a unique identification number, known as a
subject ID, text description about the place, geographical coordinates, associated place-
names, dates referring to the usage of those names, position of the entry in the TGN
hierarchy, information about related places, information about the type of place described
in the entry, and information about the data source.
Developed by the White House Subcommittee on Biodiversity and Ecosystem
Dynamics, the Integrated Taxonomic Information System (ITIS)[3] is a controlled
vocabulary for describing ecosystem management and biodiversity conservation.
Information about each species includes an authoritative scientific title, a taxonomic rank
and serial number, associated synonyms and vernacular names, information about the
data source, and data quality indicators.
The BIOSIS Controlled Vocabulary contains multiple lists of terms used in the
BIOSIS Previews and Biological Abstracts databases. The vocabulary is organised into
several categories including concepts, organism classifiers, and geopolitical locations,
among others. The vocabulary includes 168 “major concepts” and 562 “concept codes”
used for subject or topical indexing; 77 “organism classifiers” and 957 “super taxa” used
for taxonomic data; and 316 geopolitical locations.
The National Biological Information Infrastructure (NBII) was a program coordinated
by the United States Geological Survey's Biological Informatics Program Office.[4] Its
purpose was to facilitate access to data and information on the biological resources of the
United States, utilising government agencies, academic institutions, non-government
organisations, and private industry. The NBII Biocomplexity Thesaurus, an online
thesaurus of scientifically reviewed biological terms, was initially created through a merger
[2] http://www.getty.edu/vow/TGNSearchPage.jsp
[3] http://www.itis.gov/
[4] Development and web hosting of the NBII was terminated 15 January 2012.
of several individual thesauri, including the CSA Aquatic Sciences and Fisheries
Thesaurus, the Cambridge Scientific Abstracts (CSA) Life Sciences Thesaurus, the CSA
Pollution Thesaurus, the CSA Sociological Thesaurus, the CERES/NBII Thesaurus, and
the CSA Ecotourism Thesaurus. The thesaurus includes over 15,000 terms on subjects
such as aquatic sciences, life sciences, social sciences, ecotourism, and pollution.
The Library of Congress Subject Headings (LCSH)[5] is a controlled vocabulary
for use in subject cataloging and indexing. First published in 1898, LCSH was designed for
and is maintained by the Library of Congress, but the system has been adopted by many
other libraries. LCSH covers all subjects generally. Subject headings can consist of single
words or phrases and are divided into two types: main headings and subheadings. LCSH
uses four categories of subdivisions to further distinguish main heading topics: form
subdivisions, geographical subdivisions, chronological subdivisions, and topical
subdivisions.
AGROVOC[6] is a multilingual controlled vocabulary covering all areas of interest to
the Food and Agricultural Organisation of the United Nations (FAO), including food,
nutrition, agriculture, fisheries, forestry, and the environment. AGROVOC contains over
30,000 concepts organised in a hierarchy, and concepts may have labels in up to 22
languages.
Wilson and Reeder’s Mammal Species of the World[7] is an online database of
mammalian taxonomy. Use of the Mammal Species of the World, through search or
taxonomic browsing, allows users to verify recognised scientific names and conduct
taxonomic research.
The CAB Thesaurus[8] is the research tool for users of the CAB Abstracts™ and
Global Health databases. The thesaurus includes over 200,000 terms and covers broad
topics in the applied life sciences, technology, and social sciences.
[5] http://id.loc.gov/authorities/subjects.html
[6] http://aims.fao.org/standards/agrovoc/functionalities/search
[7] http://www.vertebrates.si.edu/msw/mswcfapp/msw/index.cfm
[8] http://www.cabi.org/cabthesaurus/
The GeoRef Thesaurus[9] contains 23,065 valid and 7,740 invalid terms, of which
about 1780 are newly added. The Thesaurus is a guide to the index terms used in GeoRef,
a database consisting of bibliographic citations and abstracts covering the field of geology
and its allied environmental sciences. For each term, the Thesaurus includes hierarchical
and other relationships, usage notes, dates of addition, indexing rules, geographic
coordinates, and guidelines for searching. Cross-references from invalid to valid terms are
included.
The National Agricultural Library’s NAL Agricultural Thesaurus[10] includes
terminology which supports biological, physical and social sciences. Biological
nomenclature comprises a majority of the terms in the thesaurus and is located in the
“Taxonomic Classification of Organisms” Subject Category. Political geography is also
included, and is mainly described at the country level.
uBio[11] is an initiative within the science library community to join international
efforts to create and utilise a comprehensive and collaborative catalog of known names of
all living (and once-living) organisms. uBio’s Taxonomic Name Server (TNS) catalogs
names and classifications to enable tools that can help users find information on living
things using any of the names that may be related to an organism.
"
UNCONTROLLED TERMS!
"
Uncontrolled terms, or tags, are taken directly from natural language. Because they
do not share the characteristics of controlled vocabulary terms, they offer both
advantages and disadvantages relative to controlled vocabulary terms in the
submission of data to Dryad.
[9] http://www.agiweb.org/georef/lists.html (the entire GeoRef Thesaurus may also be found in this PDF document: http://www.agiweb.org/georef/PDF/Introduction.pdf)
[10] http://agclass.nal.usda.gov/dne/search_sc.shtml
[11] http://www.ubio.org/
"
THE VALUE OF UNCONTROLLED TERMS
"
According to Noruzi (2006), uncontrolled terms, or tags, are words or phrases users
attach to resources that may help in later retrieval of that resource. These tags have no
fixed categories, syntaxes, or standards. However, because no time is spent developing
standards or categorisations for these tags, there is little overhead in their
creation. They are created precisely at the point of submission, and
potentially cost little time and effort on the part of the user, be it a resource author or a
social tagger. Lu et al. (2010) present several compelling advantages to the use of tags.
First, they state that tags may help to bridge the gap between professional and public
discourse by providing a source of terms not included in controlled vocabularies. Second,
they mention that tags not only allow users to search resources in their own language, but
also provide a window for the libraries to understand and learn more about user
information needs and interests. Third, their findings show that social taggers may help
enhance subject access to collections by describing resources with terms different from
those used by experts (this final sentiment is echoed in Rolla 2009).
Along with these many advantages, uncontrolled terms also pose several important
and challenging disadvantages. Because tags are not controlled in any way, one
individual’s tags may conflict with another’s. These conflicts may manifest
themselves as polysemy (words that have several meanings), synonymy (different words
with similar or identical meanings), plurality (inconsistencies in the use of plurals), or
granularity (inconsistencies in the depth or specificity of tags). Any of these problems may
lead to low precision in searching.
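A small sketch makes this retrieval cost concrete. The datasets and tags below are invented for illustration; exact-match search over free-form tags misses a resource tagged with a plural or spelling variant:

```python
# Invented example: two datasets tagged in uncontrolled natural language.
tags_by_resource = {
    "dataset-a": {"mouse", "behavior"},
    "dataset-b": {"mice", "behaviour"},   # plural form + British spelling
}

def naive_search(tag):
    """Exact-match tag search; returns matching resources, sorted."""
    return sorted(r for r, tags in tags_by_resource.items() if tag in tags)

naive_search("mouse")  # finds only dataset-a; the variant "mice" hides dataset-b
```

A controlled vocabulary (or tag normalisation) would map "mouse"/"mice" and "behavior"/"behaviour" to single terms, so both datasets would be found.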
FOLKSONOMIES
A portmanteau of the words folks and taxonomy, folksonomy is an internet-based
information retrieval methodology consisting of collaboratively generated, open-ended
labels that categorise content such as web resources, online photographs, and web links
(Noruzi 2009). Folksonomies are created by social taggers, not information professionals,
and these taggers assign one or more tags to each resource for their own individual use,
which is then shared with a community.
Much research has been done to study the use of user tags versus the use of
controlled vocabulary terms. Lu et al. (2010) found that only a fraction of tag vocabulary
terms overlap with LCSH terms, and even those overlapping terms might be used by
social taggers and information professionals in different ways. Rolla (2009) reports that
users of the LibraryThing[12] social cataloging web application assign tags that range in
depth from general to specific, whereas LCSH terms assigned to corresponding
bibliographic records are more general in nature. In addition, cataloger-assigned LCSH
terms in approximately 55% of bibliographic records brought out topics or concepts that
LibraryThing tags did not, and approximately 75% of the time catalogers and taggers
agreed on at least a portion of what a book is ‘about’. Finally, the Library of
Congress Working Group on the Future of Bibliographic Control reported in 2008 that
“allowing user-supplied data in online catalogs will make the catalogs more relevant to
users accustomed to the internet and also will improve access to the materials in the
library collection.”[13]
"
RECOMMENDATIONS!
"
HOW TO USE THE DESCRIPTIVE METADATA FIELDS
"
In recommending best practices for providing terms for the descriptive metadata
fields in Dryad, authors are urged to consider the accuracy, consistency, and
completeness of their chosen terms before submission. In addition, the use of controlled
vocabulary terms is suggested, but not required. Authors are encouraged to weigh
the benefits and disadvantages of submitting their own tags versus authorised controlled
vocabulary terms. The following sections highlight the four individual keyword fields and
give specific instructions on how best to provide terms.
"
"
"
http://www.librarything.com/12
Library of Congress Working Group on the Future of Bibliographic Control (2008) On the Record:13
Report of the Library of Congress Working Group on the Future of Bibliographic Control.
13. 13
"
"
SUBJECT KEYWORDS
"
"
"
Submitters of data to Dryad are required to include at least one subject keyword
with their data submission. Within the Dryad submission system this field is repeatable,
meaning that a submitter may include as many subject keywords as they choose. The
submission of multiple subject keywords may be achieved by separating individual
keywords by a comma. Using semicolons, dashes, periods, or any other type of
punctuation will result in the keywords being concatenated into a single keyword. For
example, if the two subject keywords Facial structure and Testosterone are entered into
the subject field as [Facial structure; Testosterone], the two will be recognised in Dryad’s
system as a single entity and will be shown as one subject keyword “Facial structure;
Testosterone,” not as two separate subject keywords “Facial structure” and “Testosterone.”
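This splitting rule can be sketched in a few lines of Python. The function below is only an illustration of the behaviour described above, not Dryad’s actual parsing code:

```python
def split_keywords(field):
    """Split a keyword field on commas only, as described above.

    Any other punctuation (semicolons, dashes, periods) is left inside
    the keyword, so it does not create a new keyword.
    """
    return [kw.strip() for kw in field.split(",") if kw.strip()]

split_keywords("Facial structure, Testosterone")
# two keywords: ['Facial structure', 'Testosterone']
split_keywords("Facial structure; Testosterone")
# one concatenated keyword: ['Facial structure; Testosterone']
```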
According to Dryad’s metadata schema, the subject keyword field is associated with
the Dublin Core metadata term Subject.[14] According to the Dublin Core Metadata Initiative
(DCMI), a subject “will be represented using keywords, key phrases, or classification
codes.” DCMI also recommends the use of controlled vocabulary terms for use in the
Subject field. Because Dryad does not support the use of integrated controlled
vocabularies, data submitters may either create their own subject keywords or they may
draw from any controlled vocabulary they choose.
[14] The DC term Subject is located at the namespace http://purl.org/dc/elements/1.1/subject

RECOMMENDED CONTROLLED VOCABULARIES

The following controlled vocabularies are recommended for reference in adding
subject keywords to Dryad data submissions:
• AGROVOC
• BIOSIS
• CAB Thesaurus
• GeoRef
• LCSH
• Mammal Species of the World
• MeSH
• NAL Agricultural Thesaurus
• uBio
"
TEMPORAL KEYWORDS
"
!
"
Submitters of data to Dryad are not required to include temporal keywords with their
data submission, but are urged to do so if the field is applicable to the nature of the data.
This field in the submission system is also repeatable, meaning that a submitter may
include as many temporal keywords as they choose. Again, the submission of multiple
temporal keywords may be achieved by separating individual keywords by a comma, and
using semicolons, dashes, periods, or any other type of punctuation will result in the
keywords being concatenated into a single keyword.
15. 15
According to Dryad’s metadata schema, the temporal keyword field is associated
with the Dublin Core metadata term Temporal.[15] According to the DCMI, a temporal
keyword should be used to describe “temporal characteristics of the resource.”
RECOMMENDED CONTROLLED VOCABULARIES
The following controlled vocabularies are recommended for reference in adding temporal
keywords to Dryad data submissions:
• LCSH
• MeSH
• TGN
"
SPATIAL KEYWORDS
"
!
Submitters of data to Dryad are not required to include spatial keywords with their
data submission, but are urged to do so if the field is applicable to the nature of the data.
This field in the submission system is also repeatable, meaning that a submitter may
include as many spatial keywords as they choose. Again, the submission of multiple
spatial keywords may be achieved by separating individual keywords by a comma, and
using semicolons, dashes, periods, or any other type of punctuation will result in the
keywords being concatenated into a single keyword. Submitters of data to Dryad should
note that locations with multi-part names, such as Los Angeles, California, will be
automatically split into two terms.

[15] The DC term Temporal is located at the namespace http://purl.org/dc/elements/1.1/temporal
According to Dryad’s metadata schema, the spatial keyword field is associated with
the Dublin Core metadata term Spatial.[16] According to the DCMI, a spatial keyword should
be used to describe “spatial description of the dataset specified by a geographic
description and geographic coordinates.” The instructions given for entering data into this
field in the Dryad submission system indicate that “locations may include names of cities,
regions, or coordinates.” Like the DCMI, the Dryad curation team recommends using terms
from standard taxonomies or controlled vocabularies for use in the spatial keyword field.
RECOMMENDED CONTROLLED VOCABULARIES
The following controlled vocabularies are recommended for reference in adding
spatial keywords to Dryad data submissions:
• BIOSIS
• LCSH
• TGN
"
TAXONOMIC KEYWORDS
"
"
"
Submitters of data to Dryad are not required to include taxonomic keywords with
their data submission, but are urged to do so if the field is applicable to the nature of the
data. This field in the submission system is also repeatable, meaning that a submitter may
include as many taxonomic keywords as they choose. Again, the submission of multiple
taxonomic keywords may be achieved by separating individual keywords by a comma, and
using semicolons, dashes, periods, or any other type of punctuation will result in the
keywords being concatenated into a single keyword.

[16] The DC term Spatial is located at the namespace http://purl.org/dc/elements/1.1/spatial
According to Dryad’s metadata schema, the taxonomic keyword field is associated
with the Darwin Core metadata term Specific Epithet.[17] According to the Biodiversity
Information Standards (TDWG), a taxonomic keyword should be used to describe “The
specific epithet of the scientific name applied to the organism.” The instructions given for
entering data into this field in the Dryad submission system indicate that taxonomic
keywords should be used to describe “the full name of the lowest level taxon to which the
organism has been identified in the most recent accepted determination, specified as
precisely as possible.” Like the DCMI, the Dryad curation team recommends using terms
from standard taxonomies or controlled vocabularies for use in the taxonomic keyword field.
The following quotation, taken from the Borer et al. (2009) article Some Simple Guidelines
for Effective Data Management, discusses the problems associated with taxonomic
keywords and suggested solutions.
“Over time, the names of taxa often are changed as their evolutionary
relationships are clarified. The same taxonomic name can actually refer
to two or more different concepts of a species. However, scientific
names in ecological data are fixed as originally recorded, and so it is
critical for long-term preservation to document which taxonomic
descriptions were intended by each taxon name used in a data set. This
becomes particularly important when comparing species information
from data collected at different times, as the names used in the data
sets can be ambiguous, which affects calculations of diversity and
richness, among other issues. The best way to clarify a taxonomic
name is to document the taxonomic authority you are using for the
name. For example, Homo sapiens Linn. clarifies that the authority for
this binomial is Linnaeus. Unfortunately, there are several formats to
choose from for specifying taxonomic authority information, but any
reference information is better than none.”[18]
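In practice, this recommendation amounts to storing the authority alongside the recorded name. A minimal sketch, with invented field names, might look like:

```python
# Minimal sketch of a taxon record that documents its taxonomic authority,
# as the quoted guidance recommends. The field names are invented.
taxon = {
    "scientific_name": "Homo sapiens",
    "authority": "Linnaeus",                # authority for this binomial
    "name_as_recorded": "Homo sapiens Linn.",
}

def full_name(t):
    """Render the name with its authority, for unambiguous citation."""
    return f'{t["scientific_name"]} {t["authority"]}'

full_name(taxon)  # 'Homo sapiens Linnaeus'
```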
"
RECOMMENDED CONTROLLED VOCABULARIES!
"
The following controlled vocabularies are recommended for reference in adding taxonomic
keywords to Dryad data submissions:
[17] The Darwin Core term Specific Epithet is located at the namespace http://wiki.tdwg.org/twiki/bin/view/DarwinCore/SpecificEpithet
[18] Borer, E.T., Seabloom, E.W., Jones, M.B., and Schildhauer, M. (2009) Some Simple Guidelines for Effective Data Management. Bulletin of the Ecological Society of America 90(2): 205-214.
• BIOSIS
• ITIS
• LCSH
• Mammal Species of the World
• MeSH
• NAL Agricultural Thesaurus
• uBio
"
"
19. 19
BIBLIOGRAPHY!
"
Babinec, M. and Mercer, H. (2009) Introduction: Metadata and digital repositories.
Cataloging & Classification Quarterly. 47: 209-212.
Borer, E.T., Seabloom, E.W., Jones, M.B., and Schildhauer, M. (2009) Some Simple Guidelines
for Effective Data Management. Bulletin of the Ecological Society of America 90(2):
205-214. Retrieved from http://www.esajournals.org/doi/pdf/10.1890/0012-9623-90.2.205
CAB International. (2013). CAB Thesaurus. Retrieved from http://www.cabi.org/cabthesaurus/
Carrier, Sarah W. (2008) The Dryad Repository Application Profile: Process, Development,
and Refinement. A Master’s paper for the M.S. in I.S. degree.
Currier, Sarah and Barton, Jane. (2003) Quality Assurance for Digital Learning Object
Repositories: How Should Metadata Be Created? Communities of Practice. ALT-C
2003 Research Proceedings.
Greenberg, Jane and Robertson, W. Davenport. (2002) Semantic Web construction: An
inquiry of authors’ views on collaborative metadata generation. Proc. Int. Conf. on
Dublin Core and Metadata for e-Communities. 45-52.
Greenberg, Jane et al. (2002) Author-generated Dublin Core Metadata for Web
Resources: A Baseline Study in an Organization. Journal of Digital Information 2(2).
Greenberg, Jane et al. (2003) Iterative Design of Metadata Creation Tools for Resource
Authors. DC-2003--Seattle Proceedings.
ITIS Integrated Taxonomic Information System. (2013). Retrieved from http://www.itis.gov/
American Geosciences Institute. (2013). GeoRef Thesaurus Lists. Retrieved from http://www.agiweb.org/georef/lists.html
Bates, Marcia J. Encyclopedia of Library and Information Sciences. 3rd ed, eds M.J. Bates
and M.N. Maack. Boca Raton, FL: CRC Press, 2010. Print.
Bruce, T.R. and Hillman, D. (2004). The Continuum of Metadata Quality: Defining,
Expressing, Exploiting. In Metadata in Practice, eds. D. Hillman and E.L.
Westbrooks (Chicago: American Library Association).
Food and Agriculture Organization of the United Nations. (2013). AGROVOC. Retrieved
from http://aims.fao.org/standards/agrovoc/about
J. Paul Getty Trust . (2013). Getty Thesaurus of Geographic Names® Online. Retrieved
from http://www.getty.edu/vow/TGNSearchPage.jsp
Library of Congress. (2013). Library of Congress Subject Headings. Retrieved from http://id.loc.gov/authorities/subjects.html
Library of Congress Working Group on the Future of Bibliographic Control (2008) On the
Record: Report of the Library of Congress Working Group on the Future of
Bibliographic Control. Retrieved from www.loc.gov/bibliographic-future/news/lcwg-ontherecord-jan08-final.pdf
Lu, C., Park, J. and Hu, X. (2010) User tags versus expert-assigned subject terms: A
comparison of LibraryThing tags and Library of Congress Subject Headings.
Journal of Information Science 36(6): 763-779.
Moen, W.E., Stewart, E.L., and McClure, C.R. (1997) The Role of Content Analysis in
Evaluating Metadata for the US Government Information Locator Service (GILS):
results from an exploratory study.
Noruzi, Alireza. (2006) Folksonomies: (Un)Controlled Vocabulary? Knowledge
Organization 33(4): 199-203.
Page, Roderic. (2006) Taxonomic names, metadata, and the Semantic Web. Biodiversity
Informatics 3.
Park, Jung-Ran. (2009) Metadata quality in digital repositories: A survey of the current
state of the art. Cataloging & Classification Quarterly 47: 213-228.
Rolla, Peter J. (2009) User tags versus subject headings: Can user-supplied data improve
subject access to library collections? Library Resources & Technical Services 53(3):
174-184.
Rothenberg, J. (1996) Metadata to Support Data Quality and Longevity. 1st IEEE
Metadata Conference, Silver Spring, Maryland.
Strader, C. Rockelle. (2009) Author-Assigned Keywords versus Library of Congress
Subject Headings. Library Resources & Technical Services 53(4): 243-250.
Tozer, G. (1999) Metadata Management for Information Control and Business Success.
Boston: Artech House.
Trant, Jennifer. (2009) Studying Social Tagging and Folksonomy: A Review and
Framework. Journal of Digital Information 10(1).
Marine Biological Laboratory. (2013). uBio Universal Biological Indexer and Organizer.
Retrieved from http://www.ubio.org/
National Center for Biotechnology Information, U.S. National Library of Medicine. (2013).
MeSH. Retrieved from http://www.ncbi.nlm.nih.gov/mesh
Smithsonian Institution. (2013). Wilson & Reeder’s Mammal Species of the World.
Retrieved from http://www.vertebrates.si.edu/msw/mswcfapp/msw/index.cfm
Thomson Reuters. (2013). BIOSIS. Retrieved from http://thomsonreuters.com/products_services/science/science_products/a-z/biosis/
U.S. Department of Agriculture National Agricultural Library. (2013) Thesaurus: Browse by
Subject Category. Retrieved from http://agclass.nal.usda.gov/dne/search_sc.shtml
Whitlock, M.C. (2011) Data archiving in ecology and evolution: best practices. Trends in
Ecology & Evolution 26(2): 61–65. Retrieved from http://dx.doi.org/10.1016/j.tree.2010.11.006
Wilson, A.J. (2007) Toward Releasing the Metadata Bottleneck: A Baseline Evaluation of
Contributor-supplied Metadata. Library Resources & Technical Services 51(1):
16-27.
"
22. 22
APPENDIX 1!
"
FURTHER RESOURCES FOR AUTHORS
"
The following resources are intended to provide authors with further assistance in the
use of controlled and uncontrolled terms in Dryad.
HIVE Browser
http://hive.nescent.org/ConceptBrowser.html
"
LibraryThing Concepts
http://www.librarything.com/concepts
"
The Encyclopaedia of Life
http://eol.org/
"
Depositing data in Dryad frequently asked questions
http://datadryad.org/pages/depositing
"
How to submit data in Dryad (YouTube video)
http://www.youtube.com/watch?feature=player_embedded&v=RP33cl8tL28