Sanjeet Mann conducted a study measuring the availability of electronic resources at the University of Redlands Armacost Library. He tested 400 citations from 10 databases and found an overall availability of 62% with a 38% error rate. The types of errors were categorized, with the most common being proxy errors, source errors, and knowledge base errors. Mann discussed solutions like updating the proxy, customizing the knowledge base, and simplifying interfaces. He noted strengths in collecting both quantitative and qualitative data but weaknesses in not accounting for user issues. Mann proposed expanding the study to test availability through live student searches and evaluations.
Lei Zheng has over 15 years of experience in areas such as machine learning, data mining, and software development. He currently works as a Senior Software Engineer at Yahoo, where he develops algorithms for spam filtering and detection of abusive behavior. Previously he held research positions at the University of Pittsburgh and JustSystems Evans Research, where he implemented algorithms and systems for information retrieval, natural language processing, and data mining.
Research data management for medical data with pyradigm.
Python data structure for biomedical data to manage multiple tables linked via patient identifiers or other hashable IDs. By allowing continuous validation, this data structure would improve both ease of use and the integrity of the dataset.
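As a rough illustration of the idea (not pyradigm's actual API), a minimal Python sketch might link several tables on a shared subject ID and validate them continuously; the class, method, and field names below are invented.

```python
# Minimal sketch (not pyradigm's actual API): several tables keyed by a
# shared subject ID, validated so the dataset stays internally consistent.
class LinkedTables:
    def __init__(self):
        self.tables = {}          # table name -> {subject_id: row dict}
        self.subjects = set()     # IDs seen so far across all tables

    def add_row(self, table, subject_id, row):
        if not isinstance(row, dict):
            raise TypeError("row must be a dict of feature -> value")
        self.tables.setdefault(table, {})[subject_id] = row
        self.subjects.add(subject_id)

    def validate(self):
        """Report, per table, any subject IDs that are missing from it."""
        missing = {name: self.subjects - set(rows)
                   for name, rows in self.tables.items()}
        return {name: ids for name, ids in missing.items() if ids}

# Usage: two linked tables; validate() reports the subject missing from 'mri'.
ds = LinkedTables()
ds.add_row("clinical", "sub-01", {"age": 64, "diagnosis": "MCI"})
ds.add_row("clinical", "sub-02", {"age": 71, "diagnosis": "AD"})
ds.add_row("mri", "sub-01", {"hippocampus_vol": 3.1})
print(ds.validate())   # {'mri': {'sub-02'}}
```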
This document provides an introduction to bio-ontologies and the semantic web. It discusses what ontologies are and how they are used in the bio domain through initiatives like the OBO Foundry. It introduces key semantic web technologies like RDF, URIs, Turtle syntax, and SPARQL query language. It provides examples of ontologies like the Gene Ontology and how ontologies can be represented and queried using these semantic web standards.
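To make the RDF/SPARQL part concrete, here is a small illustrative example using the rdflib Python library; the Turtle snippet uses two real Gene Ontology identifiers but is simplified and not drawn from the document itself.

```python
# Illustrative only: a tiny Turtle snippet loaded and queried with rdflib
# (pip install rdflib). The triples are a simplified Gene Ontology fragment.
from rdflib import Graph

turtle = """
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix obo:  <http://purl.obolibrary.org/obo/> .

obo:GO_0006915 rdfs:label "apoptotic process" ;
               rdfs:subClassOf obo:GO_0012501 .
obo:GO_0012501 rdfs:label "programmed cell death" .
"""

g = Graph()
g.parse(data=turtle, format="turtle")

# SPARQL: find each term's label and the label of its parent class.
query = """
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?childLabel ?parentLabel WHERE {
    ?child rdfs:subClassOf ?parent .
    ?child rdfs:label ?childLabel .
    ?parent rdfs:label ?parentLabel .
}
"""
for row in g.query(query):
    print(f"{row.childLabel} is a kind of {row.parentLabel}")
```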
The document discusses searching for answers to keyword queries in linked data. It presents the problem of keyword query routing, which aims to identify a valid set of data sources that can produce non-empty answers to a keyword query. It proposes using keyword-element relationship graphs at the element, schema, and data source levels to model relationships between keywords and data elements or sources. Experiments on a subset of the Billion Triple Challenge dataset indicate that considering relationships between elements within a maximum path length outperforms considering only direct relationships and identifies valid plans for multi-source queries.
1) The document discusses EBI's efforts to facilitate semantic alignment of its resources through building ontologies and annotating data with ontologies.
2) It describes EBI's work developing ontologies like the Experiment Factor Ontology and using ontologies to enhance search, data visualization, and data integration.
3) The challenges of representing EBI data in RDF are discussed, and future directions are outlined that could make RDF deployment simpler and enable more interesting queries over EBI data.
The document discusses using ontologies and Schema.org properties to connect biomedical data to ontology terms and concepts. Over 200 biomedical ontologies are in active use by life science databases at EMBL-EBI. Schema.org properties like MedicalCode and CreativeWork can be used to mark up ontology terms, data resources, and their relationships. This would allow questions about which ontologies and terms are used in specific data, and enable richer searching and discovery across data and ontologies.
The document discusses various database concepts including normalization, which is used to design optimal relation schemas by removing redundant data. It also covers transaction processing, which involves executing logical database operations as transactions to maintain data integrity. Database systems use techniques like logging and concurrency control to prevent transaction anomalies and ensure failures can be recovered from.
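A minimal sketch of the transaction idea, using Python's built-in sqlite3 module and an invented accounts table: either both updates are committed together or, on any error, both are rolled back.

```python
# Hypothetical two-step transfer: both updates succeed together or not at all.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                 [("alice", 100), ("bob", 50)])
conn.commit()

try:
    conn.execute("UPDATE accounts SET balance = balance - 70 WHERE name = 'alice'")
    conn.execute("UPDATE accounts SET balance = balance + 70 WHERE name = 'bob'")
    conn.commit()          # both changes become durable together
except sqlite3.Error:
    conn.rollback()        # on any failure, neither change is applied

print(dict(conn.execute("SELECT name, balance FROM accounts")))
# {'alice': 30, 'bob': 120}
```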
Data integration is intrinsic to how modern research is undertaken in areas such as genomics, drug development and personalised medicine. To better enable this integration a large number of biomedical ontologies have been developed to provide standard semantics for describing metadata. There are now several hundred biomedical ontologies in widespread use that describe concepts such as genes, molecules, drugs and diseases. This amounts to millions of terms that are interconnected via relationships that naturally form a graph of biomedical terminology.
The Ontology Lookup Service (OLS) (http://www.ebi.ac.uk/ols) integrates over 160 ontologies and provides a central point for the biomedical community to query and visualise ontologies. OLS also provides a RESTful API over the ontologies that is used in high-throughput data annotation pipelines. OLS is built on top of a Neo4j database that provides efficient indexes for extracting ontological relationships. We have developed generic tools for loading RDF/OWL ontologies into Neo4j where the indexes are optimised for serving common ontology queries. We are now moving to adopt graph databases more widely in applications relating to ontology mapping prediction and recommendation systems for data annotation.
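As a sketch of how an annotation pipeline might call the OLS REST API from Python: the endpoint path, query parameters and response fields shown are assumptions based on the description above, so check the current OLS documentation before relying on them.

```python
# Sketch of an annotation-pipeline lookup against the OLS REST API.
# Endpoint path and response structure are assumptions; check the OLS docs.
import requests

OLS = "https://www.ebi.ac.uk/ols/api"

def search_term(label, ontology="efo"):
    resp = requests.get(f"{OLS}/search",
                        params={"q": label, "ontology": ontology},
                        timeout=10)
    resp.raise_for_status()
    docs = resp.json().get("response", {}).get("docs", [])
    return [(d.get("label"), d.get("iri")) for d in docs]

for label, iri in search_term("diabetes mellitus"):
    print(label, iri)
```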
Setting Up a Qualitative or Mixed Methods Research Project in NVivo 10 to Cod... (Shalin Hai-Jew)
This document summarizes a presentation on using NVivo 10 software to code and analyze qualitative and mixed methods research data. It introduces NVivo 10 as a data management and analysis tool, demonstrates how to import and code data from various sources, and shows how to visualize and analyze coded data through matrices, models, and queries. The goals are to introduce NVivo 10's capabilities and to demonstrate the process of setting up a project for qualitative or mixed methods research.
This document provides an overview of bioinformatics and biological databases. It discusses how bioinformatics draws from fields like biology, computer science, statistics, and machine learning. Biological databases are important resources for bioinformatics that can be searched and analyzed to answer questions, find similar sequences, locate patterns, and make predictions. The document also outlines common uses of biological databases, such as annotation searches, homology searches, pattern searches, and predictive analyses.
Connecting life sciences data at the European Bioinformatics Institute (Connected Data World)
Tony Burdett's slides from his talk at Connected Data London. Tony is a Senior Software Engineer at the European Bioinformatics Institute. He presented the complexity of data at EMBL-EBI and the institute's approach to making sense of all this data.
Elsevier aims to construct knowledge graphs to help address challenges in research and medicine. Knowledge graphs link entities like people, concepts, and events to provide answers. Elsevier analyzes text and data to build knowledge graphs using techniques like information extraction, machine learning, and predictive modeling. Their knowledge graph integrates data from publications, clinical records, and other sources to power applications that help researchers, medical professionals, and patients. Knowledge graphs are a critical component for delivering value, especially as data volumes and needs accelerate.
This document discusses next generation DNA sequencing technologies. It begins by describing some of the limitations of traditional Sanger sequencing, such as read lengths of 500-1000 bases and throughput of 57,000 bases per run. It then introduces some key next generation sequencing technologies, such as 454 sequencing, which uses emulsion PCR and pyrosequencing to achieve read lengths of 20-100 bases but higher throughput of 20-100 Mb per run. Illumina/Solexa sequencing is also discussed, which uses sequencing by synthesis with reversible terminators and laser-based detection. Finally, third generation sequencing technologies are mentioned, such as Pacific Biosciences' single molecule real time sequencing and nanopore sequencing. In summary, the document provides a high-level overview of next generation sequencing technologies.
This document discusses reproducible research and provides guidance on how to conduct research in a reproducible manner. It covers:
1. The importance of reproducible research due to large datasets, computational analyses, and the potential for human error. Ensuring reproducibility requires new expertise and infrastructure.
2. Key aspects of reproducible research include data management plans, version control, use of file formats and software/tools that allow reproducibility, and publishing data and code to allow others to replicate results.
3. Reproducible research benefits the scientific community by increasing transparency and allows researchers to re-analyze their own data in the future. Journals and funders are increasingly requiring reproducibility.
NeXML is a proposed data exchange standard for phylogenetics that addresses issues with the current NEXUS format. It defines an XML schema for representing phylogenetic data like trees, networks, and character data. The schema is designed to be extensible, reuse prior standards, and take advantage of existing XML tools. An implementation includes XML parsers and writers in multiple programming languages and experiments with semantic annotation and web services.
A Reuse-based Lightweight Method for Developing Linked Data Ontologies and Vo... (María Poveda Villalón)
The document proposes a lightweight methodology called LOT (Linked Open Terms) for developing Linked Data ontologies and vocabularies in a reusable way. The methodology is data-driven and focuses on ontology search, selection, integration, completion and evaluation activities. It provides guidelines for reusing existing terms and linking them according to Linked Data principles while keeping the processes lightweight. The methodology is intended to help domain experts create ontologies and vocabularies for publishing data on the semantic web in an interoperable way without requiring extensive knowledge engineering expertise. Future work involves providing more detailed guidelines, examples, and connecting existing tools to support each step of the methodology.
Improving Semantic Search Using Query Log Analysis (Stuart Wrigley)
Despite the attention Semantic Search is continuously gaining, several challenges affecting tool performance and user experience remain unsolved. Among these are matching user terms with the search space, adopting view-based interfaces in the Open Web, and supporting users while building their queries. This paper proposes an approach to move a step forward towards tackling these challenges by creating models of usage of Linked Data concepts and properties, extracted from semantic query logs, as a source of collaborative knowledge. We use two sets of query logs from the USEWOD workshops to create our models and show the potential of using them in the mentioned areas.
This document discusses the Biological Databases project being conducted by a group of students. The project involves using the video game Minecraft to visualize protein structures retrieved from the Protein Data Bank (PDB). Python scripts are used to import PDB data files and place blocks in Minecraft to represent atoms, with different block colors used to distinguish atom types. SPARQL queries are also employed to search the RDF version of the PDB for protein entries. The goal is to build 3D protein models inside Minecraft for educational and visualization purposes.
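The summary does not include the students' actual scripts; the sketch below only illustrates the general approach (parse ATOM records from a PDB file, scale the coordinates, place one block per atom). It assumes the mcpi Minecraft Pi / RaspberryJuice API, and the element-to-block mapping is arbitrary.

```python
# Sketch of the core idea: read ATOM records from a PDB file, scale the
# coordinates, and place one block per atom. The mcpi calls assume the
# Minecraft Pi / RaspberryJuice API; block IDs per element are arbitrary.
from mcpi.minecraft import Minecraft

BLOCK_FOR_ELEMENT = {"C": 35, "N": 57, "O": 152, "S": 41}   # wool, diamond, ...

def atoms(pdb_path):
    with open(pdb_path) as f:
        for line in f:
            if line.startswith(("ATOM", "HETATM")):
                x, y, z = (float(line[30:38]), float(line[38:46]),
                           float(line[46:54]))               # fixed-column PDB fields
                element = line[76:78].strip() or line[12:16].strip()[0]
                yield element, x, y, z

def build(pdb_path, scale=0.5, origin=(0, 80, 0)):
    mc = Minecraft.create()                                  # connect to the game
    ox, oy, oz = origin
    for element, x, y, z in atoms(pdb_path):
        block = BLOCK_FOR_ELEMENT.get(element, 1)            # default: stone
        mc.setBlock(int(ox + x * scale), int(oy + y * scale),
                    int(oz + z * scale), block)

# build("1crn.pdb")
```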
Matrix Queries and Matrix Data Representations in NVivo 11 Plus (Shalin Hai-Jew)
This slideshow, "Matrix Queries and Matrix Data Representations in NVivo 11 Plus," covers the following points:
Matrices and their basic structures
Types of elements (variables) for matrix comparisons
Setting up matrix queries in NVivo 11
Specific matrix “use cases” in qualitative and mixed methods research
Wrap-up
This document provides a summary of 25 electronic resources for classicists in 25 minutes. It outlines in-house resources available through the Faculty of Classics website and library catalogues. It then describes bibliographical databases, full text databases, dictionaries, encyclopedias, image databases, and a referencing tool. The resources cover topics such as Greek and Latin texts, dictionaries, encyclopedias, inscriptions, artworks, and free online courses relevant to classics. The presentation aims to introduce classicists to key online resources available through the library.
This document provides an introduction to academic e-resources and how to use them. It outlines the services available at the Learning Resource Centre, including its hours and borrowing policies. It defines what an e-resource is, such as e-books and e-journals, and explains why students need to use them as they contain up-to-date peer-reviewed research. It provides steps to find e-books and e-journals through the university library website and search individual databases. It offers tips for choosing effective keywords and search techniques using Boolean operators and wildcards. Students are given tasks to practice these skills and evaluate sources. Contact information is provided for getting help from library staff.
Connect Your Resources, Save Time, Save Money: Connecting library electron... (Richard Bernier)
The document discusses how linking a library's electronic resources like databases and catalogs can reduce redundant searching and save time and money. It provides examples of databases like EBSCOhost, ProQuest, and OPAC systems that have features to dynamically link full text articles to local holdings information. Setting up these links requires coordinating with database vendors and ensuring compatible search features between systems.
E-LIS: Disciplinary Repository For Library and Information Sciences (sanat kumar behera)
E-LIS is a global digital archive for library and information science established in 2003. It aims to provide open access to documents in the field and currently contains over 12,000 papers in 37 languages. E-LIS uses the OAI-PMH protocol to allow metadata harvesting and supports depositing of various document types from researchers, librarians, and information professionals. It has an international editorial team that oversees operations and works to promote open access scholarship globally.
This document summarizes a presentation about service learning and the work of Librarians Without Borders (LWB). It introduces service learning and LWB, discussing two case studies of LWB initiatives in Costa Rica and Guatemala. In Costa Rica, LWB students helped build a school library, developing its collection and setting it up. In Guatemala, LWB has partnered with a school to implement a library through ongoing fundraising, service trips, and support. The presentation previews LWB's future plans and takes questions from the audience.
The document outlines plans for a National Digital Library in Finland to aggregate and provide access to the digital collections of libraries, archives, and museums. The goals are to [1] create a common user interface by 2011 for searching across these collections, [2] digitize important cultural heritage materials, and [3] develop long-term preservation solutions. It will work with Europeana to increase the visibility and impact of Finnish cultural collections internationally. Realizing this vision requires national coordination, common standards, and sustainable funding and resources.
Discover - e: Tips and Tricks for Connecting Users to Library-provided Electr... (St. Petersburg College)
OCLC events at ALA Annual 2009 (July 12).
A panel will share advice about helping library users connect with library-provided electronic resources and discuss current innovations in information discovery.
Access and Ownership Issues of Electronic Resources in the Library (Fe Angela Verzosa)
Presented by Fe Angela M. Verzosa at the Conference sponsored by the Central Luzon Librarians Association, held at Holy Angel University, Angeles City, Philippines on 7 December 2009
Tutorial presented at 2012 ACM SIGHIT International Health Informatics Symposium (IHI 2012), January 28-30, 2012. http://sites.google.com/site/web2011ihi/participants/tutorials
This tutorial weaves together three themes and the associated topics:
[1] The role of biomedical ontologies
[2] Key Semantic Web technologies with focus on Semantic provenance and integration
[3] In-practice tools and real world use cases built to serve the needs of sleep medicine researchers, cardiologists involved in clinical practice, and work on vaccine development for human pathogens.
The MIAPA ontology: An annotation ontology for validating minimum metadata re... (Hilmar Lapp)
This document describes the MIAPA (Minimum Information About a Phylogenetic Analysis) ontology, which was developed to standardize the annotation and reporting of metadata for phylogenetic analyses. The MIAPA ontology reuses terms from existing ontologies and is designed according to OBO Foundry best practices. It provides a standard way to annotate key information about phylogenetic tree topologies, operational taxonomic units, branch lengths, character matrices, alignment and tree inference methods. The goal is to facilitate increased access to and reuse of phylogenetic data through consistent annotation of published trees according to the MIAPA standard.
Royal society of chemistry activities to develop a data repository for chemis... (Ken Karapetyan)
The Royal Society of Chemistry publishes many thousands of articles per year, the majority of them containing rich chemistry data that, in general, is limited in its value when isolated in the HTML or PDF form of the articles commonly consumed by readers. RSC also has an archive of over 300,000 articles containing rich chemistry data, especially in the form of chemicals, reactions, property data and analytical spectra. RSC is developing a platform integrating these various forms of chemistry data. The data will be aggregated both during the manuscript deposition process and as the result of text-mining and extraction of data from across the RSC archive. This presentation will report on the development of the platform, including our success in extracting compounds, reactions and spectral data from articles. We will also discuss our developing process for handling data at manuscript deposition and the integration and support of electronic lab notebooks (ELNs) in terms of facilitating data deposition and sourcing data. Each of these processes is intended to ensure long-term access to research data with the intention of facilitating improved discovery.
Resource Description Framework Approach to Data Publication and Federation (Pistoia Alliance)
Bob Stanley, CEO, IO Informatics, explains the utility of RDF as a standard way of defining and redefining data, which could have utility in managing life science information.
Curation-Friendly Tools for the Scientific Researcher (bwestra)
Presentation for Online Northwest Conference, in Corvallis Oregon, February 10, 2012.
Highlights electronic lab notebooks (ELN) and OMERO (Open Microscopy Environment) as two tools that enable researchers to better manage their research data.
NSF Workshop Data and Software Citation, 6-7 June 2016, Boston USA, Software Panel
FIndable, Accessible, Interoperable, Reusable Software and Data Citation: Europe, Research Objects, and BioSchemas.org
How Much do Availability Studies Increase Full Text Success? (Sanjeet Mann)
Availability Studies are a systems research technique that academic libraries can use to identify errors affecting access to electronic resources. Comparing two availability studies conducted before and after troubleshooting showed a statistically significant decrease in errors from 38% to 13%.
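The significance of such a drop can be checked with a standard two-proportion z-test. The sketch below uses the 400-citation sample from the first study and assumes, for illustration only, the same size for the follow-up study (the actual follow-up sample size is not given here).

```python
# Two-proportion z-test for the drop in error rate (38% -> 13%).
# n1 = 400 comes from the first study; n2 = 400 is an assumed follow-up size.
from math import sqrt
from statistics import NormalDist

n1, e1 = 400, 0.38
n2, e2 = 400, 0.13

pooled = (n1 * e1 + n2 * e2) / (n1 + n2)
se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
z = (e1 - e2) / se
p_value = 2 * (1 - NormalDist().cdf(z))

print(f"z = {z:.2f}, p = {p_value:.2g}")   # a large z => the decrease is significant
```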
Preserving the Inputs and Outputs of Scholarship (tsbbbu)
Tim Babbitt discusses the changing context of research and scholarship due to digitization and the internet. The inputs and outputs of research are increasingly digital and complex, including data, code, presentations, and more. ProQuest has a history of preserving scholarship through microfilming and is exploring how to preserve the full range of digital scholarly outputs and their linkages in a sustainable way. Key questions include balancing new and old preservation methods and moving beyond preserving individual objects to also preserving networks and linkages between scholarly works.
The document provides guidelines for designing effective e-learning objects and asynchronous instruction. It discusses best practices from sources like the Association of College and Research Libraries (ACRL) and Project Information Literacy. These include establishing learning outcomes, developing content that limits cognitive load, and ensuring accessibility for all students regardless of location. The document then outlines steps for instructional design using the ADDIE model of analysis, design, development, implementation and evaluation. Examples are provided for each step, with a focus on incorporating principles of multimedia learning and usability testing.
Case Study Life Sciences Data: Central for Integrative Systems Biology and Bi... (sesrdm)
This document discusses the characteristics and challenges of managing life sciences data. It notes that biological data often lacks structure and grows rapidly, in heterogeneous formats and file sizes. Data goes through multiple analysis stages and is associated with evolving metadata standards. Ensuring data is properly stored, shared and preserved requires significant effort in describing formats, preparing submissions to various specialized public repositories, and developing data management plans. Integrating data from different sources also poses major challenges.
Yde de Jong & Dave Roberts - ZooBank and EDIT: Towards a business model for Z... (ICZN)
This document discusses developing a business model for ZooBank, a proposed online registry of zoological nomenclature. It outlines elements to consider for the business model, including the scientific, technical, social, and financial models. It also discusses how ZooBank could operate within the EDIT network to establish a prototype web taxonomy and help coordinate taxonomic data infrastructure. Funding opportunities that could support ZooBank are also mentioned.
Profile-based Dataset Recommendation for RDF Data Linking (Mohamed BEN ELLEFI)
This document summarizes Mohamed Ben Ellefi's PhD thesis defense on profile-based dataset recommendation for RDF data linking. The thesis proposes two approaches: a topic profile-based approach and an intensional profile-based approach. The topic profile-based approach models datasets as topics and recommends target datasets based on similarity between source and target topic profiles, achieving an average recall of 81% and reducing the search space by 86%. The approach shows better performance than baselines but needs improvement on precision.
1) The document discusses research objects (ROs) which aim to document the full scientific process in a digital environment, including workflows, data, software, and provenance.
2) ROs in the Wf4Ever project contain detailed semantic annotations and can be aggregated into templates to help complete the scientific record.
3) Incentives for using ROs include improved reproducibility, credit for researchers, and increased citations of papers that link to their underlying data and methods.
Research Objects: more than the sum of the parts (Carole Goble)
Workshop on Managing Digital Research Objects in an Expanding Science Ecosystem, 15 Nov 2017, Bethesda, USA
https://www.rd-alliance.org/managing-digital-research-objects-expanding-science-ecosystem
Research output is more than just the rhetorical narrative. The experimental methods, computational codes, data, algorithms, workflows, Standard Operating Procedures, samples and so on are the objects of research that enable reuse and reproduction of scientific experiments, and they too need to be examined and exchanged as research knowledge.
A first step is to think of Digital Research Objects as a broadening out to embrace these artefacts or assets of research. The next is to recognise that investigations use multiple, interlinked, evolving artefacts. Multiple datasets and multiple models support a study; each model is associated with datasets for construction, validation and prediction; an analytic pipeline has multiple codes and may be made up of nested sub-pipelines, and so on. Research Objects (http://researchobject.org/) is a framework by which the many, nested and contributed components of research can be packaged together in a systematic way, and their context, provenance and relationships richly described.
This session covers topics related to data archiving and sharing. This includes data formats, metadata, controlled vocabularies, preservation, archiving and repositories.
The document provides an overview of semantic technologies and discusses their increasing mainstream adoption. It notes that Microsoft purchased Powerset in 2008, Apple purchased Siri in 2010, and Google bought Metaweb and released semantic search in 2013. It discusses how semantic technologies allow for interoperability through shared representations and reasoning. Examples are given of early semantic search applications from 1999-2002 and an operational semantic electronic medical record application deployed in 2006.
5. Availability studies
Sample of items
Available? Yes/No
Error?
Order encountered
Probabilities
Prioritize fixes
6. Development of the availability technique
• Print material availability: card catalog user surveys (Reviewed in Mansbridge 1986, Nisonger 2007)
• Linear sequence (De Prospo 1973)
• Branching model (Kantor 1976)
• Applied to e-resources: 500 articles from 50 high impact journals (Nisonger 2009)
7. OpenURL performance
• OpenURL-based reasons for availability error (Wakimoto et al. 1998)
• “Digging into the Data” on link resolver failure (Trainor and Price 2010)
• NISO Initiatives: KBART, IOTA, PIE-J (Chandler et al. 2011, Glasser 2012, Kasprowski 2012)
8. Usability studies focusing on e-resources
• Database link pages (Fry 2011, Ponsford et al. 2011b)
• Resolver menus (O’Neill 2009, Imler & Eichelberger 2011, Ponsford et al. 2011a)
• Discovery services (Williams & Foster 2011, Fagan et al. 2012)
• Entire process
9. Methodology
400 citations = 4 questions × 10 databases × 10 results

Reference chat excerpt shown on the slide (identifying a student research topic):
[18:11] redlandsreference: what is your research topic?
[18:11] meeboguest59808: Oral Motor Activity
[18:11] redlandsreference: Is this for a Communicative Disorders class?

Databases searched, by discipline:
Arts & Humanities: RILM, MLA, Philosopher’s Index
Social Sciences: America: History & Life, EconLit, Sociological Index
Sciences: Biological Abstracts, ComDisDome,
15. Error details 3: Knowledge base errors
• Title not selected in knowledge base
• Title selected, but in poorly chosen collection
• Knowledge base holdings do not reflect access entitlement (embargo, back issues, etc.)
16. Error details 4: Link resolver error
• Confusion between two similar titles
• Unusual OpenURL syntax
17. Error details 5: Target errors
• Content not loaded (supplement, embargo)
• Records concatenated from full text and non-full-text databases
• Server downtime
18. Error details 6: ILLIAD errors
• Unicode metadata not displayed properly
• rft.title used for both book title and article title; affects chapters and dissertations
20. Sampling
Necessary sample size for a yes/no condition is determined by: n = p(1 - p) × (Zc / E)²
To use this, you need:
•Availability rate from a small pre-test
•Choose acceptable % confidence (95%)
•Choose acceptable margin of error (+/- 5%)
Plug values into the formula…
•p = 0.625 (250 / 400 successes)
•1-p = 0.375 (150 / 400 errors)
•C = 0.95 (95% confidence)
•Zc = 1.96 (statistical textbook or
http://www.measuringusability.com/pcalcz.php)
•E = 0.05 (5% error)
I could have just used 360 citations…
21. Confidence
Your confidence in a study of a particular sample size is given by: Zc = E × √(n / (p(1 - p)))
I could have just used 360 citations…
29. Summary
• 400 citations obtained through likely keyword searches of 10 A&I databases
• 62% availability / 38% error rate (98% confidence, +/- 5%)
• 26% downloadable full text
• Responses include fixing proxy, kb holdings, interfaces, upgrading systems
• Strengths: quant + qual data, very flexible (n=100 allows 85% confidence)
• Weaknesses: does not account for issues with interfaces, searching or evaluation faced by actual users
30. Towards availability testing with live students
• More barriers:
  o confusing interfaces
  o difficulty formulating searches and evaluating sources
  o login errors
• How to test:
  o cognitive walkthrough + recorded task protocols
  o analysis informs information literacy and interface design
• Deliverables:
  o availability %
  o branching model
  o usability report
This is the online version of my presentation given March 5, 2013 at SCELC Research Day, Loyola Marymount University.
This diagram presents an overview of Armacost Library’s e-resource discovery infrastructure. Five systems (proxy server, source database, knowledge base/link resolver, target database and ILL system) must work together using common standards for students and faculty to be able to discover full text.
Electronic resource errors cost libraries in terms of unrealized value on paid-for content that cannot be accessed, and in terms of staff time spent on troubleshooting. Unnecessary ILL requests also add staff costs and IFM/copyright charges. Errors frustrate student and faculty expectations and undermine library staff confidence in the accuracy of their own systems for day-to-day use. Scarce physical and fiscal resources are already compelling libraries to justify their relevance to their campuses; unavailable e-resources only fuel skeptics’ concerns. Errors also require instruction librarians to take precious course time away from higher-order thinking skills to explain technical workarounds and search mechanics in greater detail.
My research study asks the question: how often can Armacost Library users get to the full text of sources they find in abstracting and indexing databases? My study includes, but is not limited to, investigation of OpenURL linking. I operationalized “availability” as two separate factors: students’ ability to download the full text of a source, and the likelihood that users would receive an error as opposed to finding that a source was available in any way (via download, in the physical library, or via ILL).
Availability studies are a systems analysis research method designed to find out why libraries are unable to supply materials to readers, and to prioritize troubleshooting efforts. The method was first used in an academic library in 1934 (Gaskill). Investigators generate a sample of items and attempt to retrieve them from the stacks, or download them online. All unavailable items are classified according to the reason why they could not be obtained. Problems can be sorted in the order that a student would encounter them, and assigned probabilities of occurring based on their frequency in the sample. Ideally, librarians would then fix the most frequent problems first.
Nisonger and Mansbridge’s review articles give a succinct overview of the availability technique and findings from numerous studies. De Prospo, Kantor and Nisonger have also contributed significantly to our knowledge of this research method.
In addition to the literature on availability studies, research on OpenURL performance was also relevant to my study. These investigations focus on one source of error – the library’s knowledge base. Researchers tested samples of OpenURL links to determine proportions of available and erroneous items. Problems frequently involved the metadata “supply chain” linking publishers, database vendors and knowledge base providers. Several NISO initiatives have sought to improve the quality of e-resource metadata to reduce the frequency of metadata-related errors.
Many library website usability studies have focused on how students access electronic resources. These studies focus on interface design and vocabulary issues that affect electronic resource availability. Researchers have used a variety of usability methods, including task protocols and cognitive walkthroughs. Studies have either isolated parts of the library’s online presence or examined the entire process a student would use, as in Kress’s study of the reasons why students might place an unnecessary ILL request for an article contained in a subscribed e-journal.
I collected a sample of 400 citations by identifying 4 actual student research topics (mentioned in our reference transactions) and searching the topic keywords in each of 10 A&I databases covering a variety of subject areas. I attempted to retrieve the full text of the first 10 search results from each database (I did not modify the default sort order or page to subsequent result screens in order to more accurately simulate student research behavior)
For each of the 400 items tested, I recorded bibliographic metadata in an Excel spreadsheet (see Google spreadsheet link). I also collected “incoming” (from source A&I database to link resolver, see yellow “find full text” link in screenshot) and “outbound” (from link resolver to target full text database, see red circles in screenshot) OpenURLs for each item and pasted them into the spreadsheet. Finally, I recorded ability to download full text and availability as two separate yes/no parameters. (An item could be either available or erroneous; not all available items were available via full-text download.) After testing all items, I went back and assigned a category of error to each unavailable item.
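For readers unfamiliar with OpenURLs, the sketch below shows how the key-value metadata in links like those collected above can be pulled apart programmatically; the resolver host and citation values are invented for illustration.

```python
# Illustration of pulling citation metadata out of an OpenURL for a
# spreadsheet; the sample link and resolver host are invented.
from urllib.parse import urlsplit, parse_qs

openurl = ("http://resolver.example.edu/openurl?url_ver=Z39.88-2004"
           "&rft.genre=article&rft.jtitle=Notes&rft.atitle=Review"
           "&rft.volume=68&rft.date=2011&rft.spage=85")

fields = {k: v[0] for k, v in parse_qs(urlsplit(openurl).query).items()}
row = {key: fields.get(f"rft.{key}", "") for key in
       ("genre", "jtitle", "atitle", "volume", "date", "spage")}
print(row)
# {'genre': 'article', 'jtitle': 'Notes', 'atitle': 'Review', ...}
```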
Error categories require judgment calls on the part of the investigator. Many errors (such as incorrect publisher metadata) are not evident at their point of origin, and are only detectable through problems that occur later in the retrieval process. I developed six error categories roughly matching the five systems involved in e-resource retrieval. The next seven slides present an overview of the categories and examples of common errors of each type. The availability and openURL-testing literature contains some discussion of what constitutes an unavailable or erroneous resource. I chose to treat ILL requests as a normal part of the retrieval process, rather than as a failure of the library to obtain all items a user might need (something which is no longer possible even for libraries with the most comprehensive collection development policies). I also specified that each item must link directly to a screen providing HTML or PDF full text; screens such as the one pictured here, where the link leads to a list of items, could confuse students, so I counted them as errors.
A side-by-side comparison of causes for error in the print and online environments demonstrates the additional complexity of conducting research in an online environment.
Failure can be localized to the proxy server because the full text target database’s domain is missing from the proxy server forward table, because the proxy server SSL certificate does not contain the full text target database domain, or because the proxy server slowed the connection significantly, causing the web browser to time out. User logins are another source of error (not tested in this study)
The source A&I database can cause problems due to its interface design (not tested in this study) or because metadata are missing or erroneous. This can cause the link resolver to fail or delay the processing of ILL requests. Our ILL staff frequently need to verify requests with duplicate information in the article and journal title fields or nonsensical dates (e.g. “0001”). Libraries that have configured their ILL system to automatically send requests (e.g. ILLIAD Direct Request) will experience slower performance as erroneous requests are flagged by the system for human intervention.
Library staff are responsible for selecting titles and collections in knowledge bases that reflect their subscription entitlements. Errors occur if the entire title, or the starting and ending range of the library’s holdings of that title, are either selected when they shouldn’t be (“false positive” error) or not selected when they should be (“false negative” error). Sometimes the same title is listed in multiple collections; library staff must choose the collection with the most complete metadata or risk errors such as the missing article-level link illustrated here (the “SCELC Wiley-Blackwell Collection” lacked information necessary to achieve article-level linking throughout the collection, while a different collection with the same titles contained that information). Publishers and knowledge base vendors can also contribute to problems at this stage, when a publisher does not notify the knowledge base vendor in a timely manner of publication changes, or when a knowledge base vendor does not accurately reflect the publisher’s embargo or other information pertaining to access.
Link resolvers and target databases contributed relatively few errors in my study. Most link resolver errors involved a failure to draw a match between the requested citation and the item in the target resource. This could be due to idiosyncratic metadata or even to variation in libraries’ cataloging practices. In this example, an article from Costerus, a journal not held online, was then run as a catalog title search, which matched on an issue that had been cataloged as a serial monograph. (This problem is likely incomprehensible to our undergraduates.)
Target databases most commonly generated errors because of missing content (either because their publisher agreement forbids loading that content or because they had not notified the link resolver of an embargo). Interface issues represent another source of error not tested in my study. One provider’s tendency to concatenate records for the same item from multiple databases (one containing full text, one containing only an abstract) created problems when the full text record was consistently “hidden” in favor of the abstract-only record (which was consistently targeted by the link resolver)
Many errors manifested at the point of submitting an ILL request. Articles with foreign-language characters in the title did not display properly because the then-current version of ILLIAD did not support Unicode. (A subsequent upgrade fixed the problem). Also, when ILLIAD received an OpenURL that only used rft.title, it listed the field twice, in both journal name and article name. Our ILL staff frequently referred these issues to me because they were not sure which was the journal title (used to select the correct OCLC record to request)
These sample findings can be generalized to the entire population of all e-resources at Armacost Library with over 95% confidence and +/-5% margin of error (see next slide). Out of every 100 citations, I would expect to find: 38 errors significant enough to prevent a student from obtaining full text or successfully placing an ILL request; 34 potentially successful ILL requests; 2 items available from the physical collection; and 26 full text downloads.
Full-text availability and the presence of error are yes/no (Bernoulli or binomial) outcomes. Statistical textbooks give the formula for determining the sample size for a binomial population. You will need to conduct a pre-test first to obtain values for the proportion of successful and unsuccessful outcomes. You can choose the confidence (c) and margin of error (E) values arbitrarily. The lower the confidence and the wider the margin of error you accept, the weaker your study, but the easier it is to conduct because you can use a smaller sample. The value Zc is found in a table online or in a statistical textbook. It is related to your confidence: the higher your confidence, the greater Zc becomes.
Rewriting the equation to solve for Zc gives you this equation, which lets you state your level of confidence in a study of a particular sample size. Look up the Zc value in the table of standard normal distributions or online to determine the confidence probability. Note that small, convenient samples can still obtain a reasonably high confidence probability.
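The two formulas can be checked with a few lines of Python using the values from the slides (p = 0.625, E = 0.05, Zc = 1.96, and n = 100 for the smaller sample).

```python
# Plugging the slide values into the two formulas.
from math import sqrt

p, E = 0.625, 0.05          # pre-test availability rate, margin of error

# Sample size needed at 95% confidence (Zc = 1.96):
Zc = 1.96
n = p * (1 - p) * (Zc / E) ** 2
print(round(n))             # ~360 citations

# Zc obtainable from a smaller, convenient sample (n = 100):
n_small = 100
Zc_small = E * sqrt(n_small / (p * (1 - p)))
print(round(Zc_small, 2))   # ~1.03; look it up in a Z table for the confidence level
```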
Most of the sources I tested were articles, but dissertations and book chapters generated a disproportionately great number of errors.
Most of my errors occurred at the source database, knowledge base or ILL stages.
There was considerable variation by discipline. The music searches produced results that were rarely downloadable full-text and frequently triggered errors, while history searches frequently led me to a seamless full-text download. Several factors could be influencing these results. Different databases may have different metadata standards and publishers and are distributed by different vendors. Some databases index a lot of hard-to-obtain items like conference proceedings, while others mostly consist of journal articles. Some vendors augment their A&I search results with “linked full text” PDFs from another database on the same platform. This type of direct comparison is not useful for assigning responsibility for error, but is interesting for subject librarians who need to know the challenges students in their liaison areas may be facing.
I attacked some simpler solutions immediately after finishing my study. I added missing domains to the proxy forward table…
Upgraded ILLIAD to get Unicode support…
Corrected holdings in Serials Solutions…
… and worked with our web team to make the link resolver result screen easier to understand.
Availability studies are a flexible technique to get quantifiable information about students’ access to full text. My study was a “simulated” study because I tested the access myself, rather than using actual library patrons.
No studies have attempted an electronic resource availability study with library patrons so far. Such a study would add several potential causes for error. The study would need to incorporate usability methods, for example, conducting a cognitive walkthrough of the path Armacost Library users could follow to research a particular topic, then observing students trying to search that same topic.
Follow these links to view my literature review and dataset. Email me with questions at sanjeet_mann@redlands.edu