Presentation on how research data can be divided into categories and how this can help data management for both service providers and researchers. The paper will be published in the journal Informaatiotutkimus in December 2018.
IC-SDV 2019: Competitive Intelligence: how to optimize the analysis of pipeli... – Dr. Haxel Consult
The document discusses two methods for optimizing data visualization for competitive intelligence analysis of pipeline and clinical trials data:
1) Using BizInt and VantagePoint solutions to combine pipeline, clinical trial, and other data sources and generate customizable reports and visualizations.
2) Developing an in-house tool called VALEM360 to provide a 360-degree view of pancreatic cancer competitive landscape data through multiple interactive visualization dashboards and a treatment decision tree.
The VALEM360 tool demonstrated the potential of a data-driven approach but would benefit from improved data quality, automated updates, and application of the methodology to other disease areas. Overall, data visualization is very useful for competitive intelligence analysis but requires expertise in the relevant topic areas.
The OntoChem IT Solutions GmbH ...
... was founded in 2015 as a purely IT-oriented offshoot of OntoChem GmbH. Even before then we had many years of experience, and it has always been our mission to provide added value to our customers by helping them navigate today’s complex information world: developing cognitive computing solutions, indexing intranet and internet data, and applying semantic search solutions for pharmaceutical, materials science and technology-driven businesses.
We strive to support our customers with the most useful tools for knowledge discovery possible, encompassing up-to-date data sources, optimized ontologies and high-throughput semantic document processing and annotation techniques.
We create new knowledge from structured and unstructured data by extracting relationships, thereby exploiting the full potential of full-text documents & databases while also scanning social media and news flows and analyzing web pages.
We aim at an unprecedented machine understanding of text and subsequent knowledge extraction and inference. Applying our methods to chemical compounds and their properties supports our customers in generating intellectual property and in using those compounds as novel therapeutics, agrochemical products, nutraceuticals, cosmetics and novel materials.
It's our mission to provide added value to customers by:
developing and applying cognitive computing solutions
creating intranet and internet data indexing and semantic search solutions
applying Big Data analytics for technology-driven businesses
supporting product development and surveillance.
We deliver useful tools for knowledge discovery for:
creating background knowledge ontologies
high-throughput semantic document processing and annotation
knowledge mining by extracting relationships
exploiting the full potential of full-text documents & databases while also scanning social media and news flows and analyzing web pages.
dkNET Webinar: Creating and Sustaining a FAIR Biomedical Data Ecosystem 10/09... – dkNET
Abstract
In this presentation, Susan Gregurick, Ph.D., Associate Director of Data Science and Director, Office of Data Science Strategy at the National Institutes of Health, will share the NIH’s vision for a modernized, integrated FAIR biomedical data ecosystem and the strategic roadmap that NIH is following to achieve this vision. Dr. Gregurick will highlight projects being implemented by team members across the NIH’s 27 institutes and centers and will discuss ways that industry, academia, and other communities can help NIH enable a FAIR data ecosystem. Finally, she will weave in how this strategy is being leveraged to address the COVID-19 pandemic.
Presenter: Susan Gregurick, Ph.D., Associate Director of Data Science and Director, Office of Data Science Strategy at the National Institutes of Health
dkNET Webinar Information: https://dknet.org/about/webinar
Investigating plant systems using data integration and network analysis – Catherine Canevet
The document discusses challenges in integrating plant data from multiple sources and proposes solutions. It notes that plant data is sparse, distributed across many databases in various formats, and focused primarily on the model plant Arabidopsis. Data integration is necessary to address key biological questions by consolidating information from pathway databases, gene annotations, protein interactions, and more. The document outlines approaches to data integration including controlled vocabularies, ontologies, data standards, and integration applications specifically designed to combine data sources like Ondex. Effective integration is important to fully leverage available plant data.
The document discusses data sharing requirements for publishing in PLOS journals. PLOS requires authors to share all underlying data without restriction. Acceptable methods include depositing data in public repositories like Dryad or Figshare. The document also discusses other data repositories and journals, as well as obtaining identifiers like Data DOIs from services like Cite My Data. It provides an example of how the Hawkesbury Institute for the Environment publishes data using their HIEv application and Figshare to obtain DOIs for datasets associated with journal publications.
PA webinar on benefits & costs of FAIR implementation in life sciences – Pistoia Alliance
Slides from the Pistoia Alliance Debates webinar, where a panel of experts from technology support providers and the biopharma industry were invited to share their views on the "Benefits and costs of FAIR implementation for the life science industry".
Research data sharing enables validation and new analyses of results, ensures efficient use of public funds, and counters misconduct. Funding agencies can encourage open data practices by requiring long-term storage, promoting data publication, and helping make data findable through catalogs. They should work with research communities to understand infrastructure needs, partner with libraries on preservation, and consider discipline-specific approaches rather than one-size-fits-all solutions.
Harnessing Edge Informatics to Accelerate Collaboration in BioPharma (Bio-IT ... – Tom Plasterer
As scientists in the life sciences we are trained to pursue singular goals around a publication or a validated target or a drug submission. Our failure rates are exceedingly high especially as we move closer to patients in the attempt to collect sufficient clinical evidence to demonstrate the value of novel therapeutics. This wastes resources as well as time for patients depending upon us for the next breakthrough.
Edge Informatics is an approach to ameliorate these failures. By using technical and social solutions together, knowledge can be shared and leveraged across the drug development process. This is accomplished by making data assets discoverable, accessible, self-described, reusable and annotatable. The Open PHACTS project pioneered this approach and has provided a number of the technical and social solutions that enable Edge Informatics. A number of pre-competitive consortia and some content providers have also embraced this approach, facilitating networks of collaborators within and outside a given organization. Taken together, these measures foster more accurate, timely and inclusive decision-making.
Application of recently developed FAIR metrics to the ELIXIR Core Data Resources – Pistoia Alliance
The FAIR (Findable, Accessible, Interoperable and Reusable) principles aim to maximize the discovery and reuse of digital resources. Using recently developed software and metrics to assess FAIRness and supported through an ELIXIR Implementation Study, Michel worked with a subset of ELIXIR Core Data Resources to apply these technologies. In this webinar, he will discuss their approach, findings, and lessons learned towards the understanding and promotion of the FAIR principles.
II-SDV 2016 Stefan Geißler Navigating complex information landscapes – Semant... – Dr. Haxel Consult
Information that is relevant for researchers and decision makers in the life sciences comes from many different backgrounds: scientific publications, patents, news, clinical reports and user-generated content may all be required to understand trends, opportunities and threats. A key to providing a quick and comprehensive overview is having information from the various sources in one place, semantically enriched and normalized, and related to one another.
We present the key principles of a platform that serves that purpose and that provides users with insights into the scientific, clinical and competitive intelligence landscape of their respective area of interest. Forged in close collaboration with industry practitioners, the Luxid Biopharma Navigator is today used in production by hundreds of experts.
Is that a scientific report or just some cool pictures from the lab? Reproduc... – Greg Landrum
Requirements for reproducibility in computational chemistry publications include making available the data, code or algorithms, and results from the study. Authors should provide all data necessary to understand and assess their conclusions. Source code or detailed algorithm descriptions should also be included to allow independent reproduction of the work. Finally, publications must contain the actual results from applying the method rather than just describing results. Adopting these standards of transparency helps ensure others can evaluate and build upon published research claims.
Slides to be presented at a webinar arranged by Metasolution as part of a Vinnova project http://metasolutions.se/2014/03/webbinarium-med-kerstin-forsberg-om-lankade-data-i-lakemedelsforskningen/
The document proposes a federated in-memory database system for life sciences that addresses the needs of patients, clinicians, and researchers by enabling real-time analysis of big medical data while maintaining data privacy and locality. It describes key actors and a use case in cancer treatment. The proposed solution incorporates local compute resources through a federated in-memory database with a cloud service provider managing shared algorithms and master data, while sensitive patient data resides locally.
International perspective for sharing publicly funded medical research data – ARDC
Presentation by Olivier Salvado, CSIRO, to the 'Unlocking value from publicly funded Clinical Research Data' workshop, cohosted by ARDC and CSIRO at ANU on 6 March 2019.
Presentation by Hugo Leroux and Liming Zhu, CSIRO, to the 'Unlocking value from publicly funded Clinical Research Data' workshop, cohosted by ARDC and CSIRO at ANU on 6 March 2019.
This document provides guidance on creating a Data Management Plan (DMP). It discusses the key elements that should be included in a DMP, such as the types of data that will be collected, metadata standards, data sharing and access policies, plans for reusing and redistributing data, and archiving data for long-term preservation. It also notes that costs for implementing the DMP may be included in the proposal budget and that the DMP will be reviewed as part of the NSF proposal process. Template codes for elements like variable names and labels that could be included in a DMP are also provided.
This document discusses data management requirements for predictive modeling using large datasets from multiple clinical, specimen, and lab repositories. It notes the need to assemble complete and up-to-date datasets while maintaining quality assurance and transparency. Over time, data storage systems experience problems with exponential data growth, manual data curation difficulties, and challenges integrating heterogeneous databases across different research groups. The document examines a spectrum of potential data management approaches and highlights collaborative networks and use of open source platforms as ways to address these issues.
A presentation I gave at the 2018 Molecular Med Tri-Con in San Francisco, February 2018. It addresses the general challenge of biomedical data management and some of the things to consider when evaluating solutions in this space, and concludes with a brief summary of some of the tools and platforms in this space.
This document summarizes the work of developing a Data Discovery Index prototype that helps users find and access shared biomedical data from various repositories. It ingests metadata from different standards and sources using ElasticSearch. It was presented at the Alan Turing Institute Symposium in April 2016. The project aims to organize data through an aggregator framework and portal. It involves mapping various metadata standards to have maximum coverage of use cases with minimal data elements. More information can be found at the listed websites.
This document summarizes Catriona MacCallum's presentation on data publishing at PLOS. The key points are:
1) PLOS requires authors to make all underlying data openly available without restriction, with rare exceptions. Authors must provide a Data Availability Statement describing compliance.
2) Over 47,000 PLOS papers have included a data statement. Most data is found within submission files or repositories like Dryad and Figshare. PLOS checks data accessibility and ensures anonymity of clinical datasets.
3) PLOS supports initiatives like CRediT for attributing research contributions and data citation principles for giving credit to data producers. PLOS is also involved in projects beyond traditional publishing, such as preprints and experimental publishing platforms.
Clinical Data Models - The Hyve - Bio IT World April 2019 – Kees van Bochove
Population genetics and genomics is an emerging topic for the application of machine learning methods in healthcare and biomedical sciences. Currently, several large genomics initiatives, such as Genomics England, UK Biobank, the All of Us Project, and Europe's 1 Million Genomes Initiative, are all in the process of making both clinical and genomics data available from large numbers of patients to benefit biomedical research. However, a key challenge in these initiatives is the standardization of the clinical and outcomes data in such a way that machine learning methods can be effectively trained to discover useful medical and scientific insights. In this talk, we will look at what data is available at scale, and review some examples of applying common data and evidence models such as OMOP, FHIR and GA4GH to achieve this, based on projects which The Hyve has executed with some of these initiatives to harmonize their clinical, genomics, imaging and wearables data and make it FAIR.
Presentation by Dr Steve McEachern, ADA, to the 'Unlocking value from publicly funded Clinical Research Data' workshop, cohosted by ARDC and CSIRO at ANU on 6 March 2019.
Is one enough? Data warehousing for biomedical research – Greg Landrum
The document discusses challenges in storing and managing real-world biomedical data from multiple sources for analysis. It describes three different data warehouse case studies used at Novartis - Avalon, MAGMA, and the Entity Warehouse. The Entity Warehouse takes a novel approach of modeling data as entities that can be linked together, with results stored in tables by type. It is designed to integrate both internal and external data while allowing broad access. However, the document concludes that no single warehouse fits all needs, and multiple solutions may be required to fully enable data analysis.
Open science and medical evidence generation - Kees van Bochove - The Hyve – Kees van Bochove
Presentation about open science, the FAIR principles, and medical evidence generation with the OHDSI COVID-19 study-a-thon as an example. I've used variations on this deck in a couple of classroom and online courses for PhD and master students early 2020.
Enabling Precise Identification and Citability of Dynamic Data: Recommendatio... – LEARN Project
Enabling Precise Identification and Citability of Dynamic Data: Recommendations of the RDA Working Group, by Andreas Rauber – 2nd LEARN Workshop, Vienna, 6th April 2016
Data Science Provenance: From Drug Discovery to Fake Fans – Jameel Syed
Knowledge work adds value to raw data; how this activity is performed is critical for how reliably results can be reproduced and scrutinized. With a brief diversion into epistemology, the presentation will outline the challenges for practitioners and consumers of Big Data analysis, and demonstrate how these were tackled at Inforsense (life sciences workflow analytics platform) and Musicmetric (social media analytics for music).
The talk covers the following issues with concrete examples:
- Representations of provenance
- Considerations to allow analysis computation to be recreated
- Reliable collection of noisy data from the internet
- Archiving of data and accommodating retrospective changes
- Using linked data to direct Big Data analytics
FAIR data in trustworthy repositories: the basics – OpenAIRE
This video illustrates how certified digital repositories contribute to making and keeping research data findable, accessible, interoperable and reusable (FAIR). Trustworthy repositories support Open Access to data, as well as Restricted Access when necessary, and they offer support for metadata, sustainable and interoperable file formats, and persistent identifiers for future citation. Presented by Marjan Grootveld (DANS, OpenAIRE).
Main references
• Core Trust Seal for trustworthy digital repositories: https://www.coretrustseal.org/
• EUDAT FAIR checklist: https://doi.org/10.5281/zenodo.1065991
• European Commission’s Guidelines on FAIR data management: http://ec.europa.eu/research/participants/data/ref/h2020/grants_manual/hi/oa_pilot/h2020-hi-oa-data-mgt_en.pdf
• FAIR data principles: www.force11.org/group/fairgroup/fairprinciples
• Overview of metadata standards and tools: https://rdamsc.dcc.ac.uk/
How to best manage your data to make the most of it for your research with the ODAM framework (Open Data for Access and Mining): give open access to your data and make them ready to be mined.
Managing and Sharing Research Data - Workshop at UiO - December 04, 2017 – Michel Heeremans
These slides were presented during a workshop on Research Data Management, given at the University of Oslo, Department of Geosciences on December 04, 2017
This document provides biographical and contact information for Professor Aboul Ella Hassanien, including that he is the founder and chair of the Scientific Research Group in Egypt and formerly served as dean of the faculty of computers and information at Beni-Suef University. It announces an upcoming presentation by Professor Hassanien on sharing scientific data, ethics, and consent taking place on January 20, 2018 at Cairo University.
The document discusses sharing research data through open data platforms. It describes the CGIAR as uniquely positioned to collect agricultural data worldwide and argues that most CGIAR data should be archived and shared to increase its value. However, data archiving across CGIAR centers is currently poor. The document then discusses using the Dataverse platform to improve data sharing. Dataverse allows researchers to publish, share, cite, and analyze data. It also facilitates making data available while giving credit to data authors and institutions.
The document provides an introduction to data management, defining data and describing requirements for data sharing from federal funding agencies. It discusses best practices for data management, such as developing data management plans and file organization, as well as options for data preservation, sharing, and archiving. Resources for data management assistance at Northwestern University are also outlined.
Presentation by Ruth Wilson on Nature Publishing Group's Scientific Data journal given at the Now and Future of Data Publishing Symposium, 22 May 2013, Oxford, UK
The document summarizes the Jisc Managing Research Data Programme which aims to support universities in improving research data management. It discusses why managing research data is important, highlighting funder policies and the benefits of open data. It provides an overview of Jisc's activities including training projects, guidance resources, and funding for institutional infrastructure services and repositories. The presentation emphasizes the importance of institutional policies, support services, skills development and cultural change to effectively manage research data in line with funder expectations.
Research data can be categorized as observational, experimental, simulation, derived or compiled, and reference or canonical. A highly effective data pyramid outlines key aspects for research data: being stored, preserved, accessible, discoverable, citable, comprehensible, reviewed, reproducible, reusable, and integrated. A data-driven company is one where decision makers have independent access to data when needed and the company continuously measures business metrics. Properties of data-driven companies include being comfortable with uncertainty, adapting culture, being agile, making forward-looking technology acquisitions, updating processes, having CEO leadership, removing organizational barriers, allocating resources differently, and productizing data.
Pine Biotech conducts monthly informational workshops on the topics related to high-throughput data analysis, interpretation and integration. The workshops highlight our research tools and educational resources developed with collaborators in the US and across the world.
A Generic Scientific Data Model and Ontology for Representation of Chemical DataStuart Chalk
The current movement toward openness and sharing of data is likely to have a profound effect on the speed of scientific research and the complexity of questions we can answer. However, a fundamental problem with currently available datasets (and their metadata) is heterogeneity in terms of implementation, organization, and representation.
To address this issue we have developed a generic scientific data model (SDM) to organize and annotate raw and processed data, and the associated metadata. This paper will present the current status of the SDM, implementation of the SDM in JSON-LD, and the associated scientific data model ontology (SDMO). Example usage of the SDM to store data from a variety of sources will be discussed along with future plans for the work.
Big Data – Shining the Light on Enterprise Dark Data – Hitachi Vantara
Content stored for a business purpose often lacks the structure or metadata required to determine its original purpose. With Hitachi Data Discovery Suite and Hitachi Content Platform, businesses can uncover dark data that could be leveraged for better business insight and surface compliance issues, preventing business risks. View this session and learn: What is enterprise dark data? How can enterprise dark data impact business decisions? How can you augment your underutilized data and deliver more value? How can you decrease the headache and challenges created by dark data? For more information please visit: http://www.hds.com/products/file-and-content/
I o dav data workshop prof wafula final 19.9.17 – Tom Nyongesa
The document summarizes an iODaV Data Workshop held at JKUAT in Kenya on open data and the JORD policy. It discusses why open data is important for reproducibility, innovation and scientific discovery. It outlines the FAIR principles for open data and metadata to make data findable, accessible, interoperable and reusable. It also discusses opportunities and challenges of open data for universities, including developing skills and infrastructure. Finally, it provides examples of open data initiatives at JKUAT including developing an open data policy, the iODaV program, contributions to national ICT policies, and the digital health applied research centre.
DRIVE CENTRAL STUDY PLATFORM: Data flow, data quality and statistical analysi... – DRIVE research
DRIVE annual forum 2019, Helsinki, Finland, 17th-18th September
Development of Robust and Innovative Vaccine Effectiveness
Increasing understanding of influenza vaccine effectiveness in Europe
This presentation was provided by Dr. Paul Burton of the University of Bristol during the NISO Symposium, Privacy Implications of Research Data, held on September 11, 2016, in conjunction with the International Data Week in Denver, Colorado.
Similar to Supporting FAIR data principles with data categorization
The document outlines a road map for PID Forum Finland with 3 key steps: 1) Creating engagement around PIDs by raising awareness and building skills and trust. 2) Organizing management and funding by describing use cases, creating proofs of concept, and defining requirements. 3) Creating infrastructure by ensuring interoperability, building a resolver, and organizing support services. The overall goal is to make information traceable across different channels now and in the future.
Presentation at the Library Network Days (Kirjastoverkkopäivät) in October 2021. I spoke about research datasets as objects of description, persistent identifiers, and some other topics related to the special characteristics of research data.
Presentation at the National Library of Finland's online event Kulttuuriperintöaineistot ja tutkimusdata – yhteistyön rajapintoja (Cultural heritage materials and research data – interfaces of collaboration) on 4 March 2021. In this presentation I discussed research data management and how the Fairdata services enable implementing the FAIR data principles in research data publication.
Presentation at Digital Humanities in the Nordics 2020 conference in panel: Towards deterioration, disappearance or destruction? Discussing the critical issue of long-term sustainability of digital humanities projects
In an expert webinar on April 15th 2020 we discussed (in Finnish) how the FAIR data principles affect service development in RDM services. I presented some relevant outputs from the FAIRsFAIR project. These are the slides (in English). The webinar will be published on the fairdata.fi service site https://www.fairdata.fi/koulutus/koulutuksen-tallenteet/
1) The document summarizes a report on requirements for FAIR (Findable, Accessible, Interoperable, Reusable) data persistence and interoperability.
2) It describes a 36-month, 10 million euro project involving 22 partners from 8 EU member states working on practical implementations of semantic interoperability across research infrastructures.
3) The report analyzes the current landscape of FAIR technologies, semantic artifacts, and infrastructure initiatives; identifies challenges around scope, terminology, and rapid development; and concludes that solutions must be user-friendly, context-sensitive, and transparent while promoting adoption of standards and registries.
Collections meet the researcher. Digitalization, disintegration and disillusi... – Jessica Parland-von Essen
Presentation at the LAM3 seminar in Uppsala, 9th of October 2019. On digitalization, researchers and data in the context of cultural heritage collections. The slides mostly contain headings, but the two last slides include a list of relevant reading on the subject.
This document discusses best practices for organizing, managing, and publishing research data. It recommends using standardized file naming and folder structures, documenting data through code books and metadata, selecting open formats, and considering issues like data security, versions, and citations. FAIR principles of findable, accessible, interoperable and reusable data are presented. Options in Finland for publishing and archiving research data include repositories like FSD Tietoarkisto and Zenodo. Adopting these practices helps ensure well-organized, documented data that can enable reproducibility and reuse.
This document discusses making data Findable, Accessible, Interoperable and Reusable (FAIR). It provides principles for each component and examples of metadata standards and repositories that help achieve FAIR data. Resources referenced include guidelines for assigning persistent identifiers to data and metadata, describing data with rich metadata using shared vocabularies, and indexing metadata in searchable resources to enable discovery and access.
The document discusses open science and how it has changed research practices. It defines open science as making research data, notes, and processes openly available for collaboration and reuse. It outlines benefits like increasing quality, impact and innovation. Barriers like publishing costs are mentioned. The document recommends openly licensing data and publications, using open peer review and platforms, and sharing materials like code and presentations. Proper data management is important for openness, reproducibility and ensuring research integrity.
This document discusses data management practices in research. It defines research data and emphasizes the importance of good data management for ensuring integrity, reproducibility and excellence in science. Key aspects of data management include planning, documentation, metadata, sustainability, and publication. Funders increasingly require and support open access to publications and research data. The document provides guidance and considerations for implementing responsible data management and open science practices.
Discovering Digital Process Twins for What-if Analysis: a Process Mining Appr... – Marlon Dumas
This webinar discusses the limitations of traditional approaches to business process simulation based on hand-crafted models with restrictive assumptions. It shows how process mining techniques can be assembled to discover high-fidelity digital twins of end-to-end processes from event data.
Discover the cutting-edge telemetry solution implemented for Alan Wake 2 by Remedy Entertainment in collaboration with AWS. This comprehensive presentation dives into our objectives, detailing how we utilized advanced analytics to drive gameplay improvements and player engagement.
Key highlights include:
Primary Goals: Implementing gameplay and technical telemetry to capture detailed player behavior and game performance data, fostering data-driven decision-making.
Tech Stack: Leveraging AWS services such as EKS for hosting, WAF for security, Karpenter for instance optimization, S3 for data storage, and OpenTelemetry Collector for data collection. EventBridge and Lambda were used for data compression, while Glue ETL and Athena facilitated data transformation and preparation.
Data Utilization: Transforming raw data into actionable insights with technologies like Glue ETL (PySpark scripts), Glue Crawler, and Athena, culminating in detailed visualizations with Tableau.
Achievements: Successfully managing 700 million to 1 billion events per month at a cost-effective rate, with significant savings compared to commercial solutions. This approach has enabled simplified scaling and substantial improvements in game design, reducing player churn through targeted adjustments.
Community Engagement: Enhanced ability to engage with player communities by leveraging precise data insights, despite having a small community management team.
This presentation is an invaluable resource for professionals in game development, data analytics, and cloud computing, offering insights into how telemetry and analytics can revolutionize player experience and game performance optimization.
PyData London 2024: Mistakes were made (Dr. Rebecca Bilbro) – Rebecca Bilbro
To honor ten years of PyData London, join Dr. Rebecca Bilbro as she takes us back in time to reflect on a little over ten years working as a data scientist. One of the many renegade PhDs who joined the fledgling field of data science in the 2010s, Rebecca will share lessons learned the hard way, often from watching data science projects go sideways and learning to fix broken things. Through the lens of these canon events, she'll identify some of the anti-patterns and red flags she's learned to steer around.
Supporting FAIR data principles with data categorization
1. CSC – Finnish ICT center of expertise for research, education, culture and public administration
Supporting FAIR data. Categorization of research data as a tool in data management
Jessica Parland-von Essen https://orcid.org/0000-0003-4460-3906, Katja Fält https://orcid.org/0000-0002-6172-5377, Zubair Maalick https://orcid.org/0000-0002-0975-1471, Miika Alonen https://orcid.org/0000-0002-0065-0017, Eduardo Gonzalez https://orcid.org/0000-0003-1400-0995
3. Persistent identifiers
a) Cite a specific slice or subset (the set of updates to the dataset made during a particular period of time or to a particular area of the dataset).
b) Cite a specific snapshot (a copy of the entire dataset made at a specific time).
c) Cite the continuously updated dataset, but add Access Date and Time to the citation. (Does not necessarily ensure reproducibility.)
d) Cite a query, time-stamped for re-execution against a versioned database.
DYNAMIC DATASETS
IMMUTABLE DATASETS
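Option d) is the approach generally associated with the RDA Working Group recommendations for dynamic data. As an illustration only, here is a minimal Python sketch of what such a time-stamped query citation record could look like; the identifier, field names and query are hypothetical assumptions, not part of the original slides.

```python
import hashlib
import json
from datetime import datetime, timezone

def make_query_citation(query: str, dataset_pid: str) -> dict:
    """Build a citable record for option d): a time-stamped query that can
    later be re-executed against a versioned database."""
    # Normalize whitespace and case so semantically identical queries get
    # the same fingerprint (useful for detecting query drift on re-execution).
    normalized = " ".join(query.split()).lower()
    return {
        "dataset_pid": dataset_pid,                             # PID of the versioned dataset
        "query": query,                                         # exact query to re-execute
        "executed_at": datetime.now(timezone.utc).isoformat(),  # as-of time for versioned rows
        "query_fingerprint": hashlib.sha256(normalized.encode()).hexdigest(),
    }

if __name__ == "__main__":
    citation = make_query_citation(
        "SELECT * FROM observations WHERE station = 'FI-0001'",
        "urn:nbn:fi:csc-example-dataset",  # hypothetical identifier
    )
    print(json.dumps(citation, indent=2))
```

Re-executing the stored query at the stored timestamp against a versioned store would then reproduce the cited slice without minting a new identifier for every state of the data.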
4. Maybe we need to be more specific and find common ground in concepts?
CHUNKING UP RESEARCH DATA
5. Categorization according to technical properties
• Modality, DCMI types
  o Dublin Core type of thinking
• Format, DCMI format
  o MIME types
  o Software related
• Language, coding
  o Human interpretation
By Lin Kristensen from New Jersey, USA (Timeless Books) [CC BY 2.0 (https://creativecommons.org/licenses/by/2.0)], via Wikimedia Commons
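To make the technical categorization concrete, the following small Python sketch derives DCMI-style format and type metadata from a file name using the standard library's mimetypes module. The MIME-to-DCMI mapping is an illustrative assumption, not an official crosswalk.

```python
import mimetypes

# Rough mapping from MIME top-level type to a DCMI Type vocabulary term.
# This mapping is an illustrative assumption, not an official crosswalk.
MIME_TO_DCMI = {
    "text": "Text",
    "image": "StillImage",
    "audio": "Sound",
    "video": "MovingImage",
    "application": "Dataset",
}

def describe_file(filename: str) -> dict:
    """Guess technical-category metadata (DCMI format and type) for a file."""
    mime, _encoding = mimetypes.guess_type(filename)
    top_level = mime.split("/")[0] if mime else None
    return {
        "filename": filename,
        "format": mime or "unknown",                     # DCMI format (MIME type)
        "type": MIME_TO_DCMI.get(top_level, "Dataset"),  # DCMI type (modality)
    }

# Example: a CSV of observations is text/csv, which maps here to DCMI 'Text'.
print(describe_file("observations.csv"))
```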
6. Categorization according to contextual traits
• Origin
  o Observational, experimental, simulation, derived etc.
• Use category
  o Source, output, method
• Provenance, lifecycle
  o Primary, secondary, data levels, qualitative, quantitative
By David Monniaux CC-BY-SA-3.0 (http://creativecommons.org/licenses/by-sa/3.0/), from Wikimedia Commons
7. Categorization according to inherent traits
• Access type (availability)
  o Open data, sensitive data
• Semantic structure
  o Coherence, levels of measurement, groupings, classifications
• Research data type (stability)
  o Generic data, generic research data, research data publications
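Slides 6 and 7 describe categorization axes that can be combined in a single metadata record. The Python sketch below shows one possible way to model them; the field names and enum values simply mirror the slides and are not an established schema.

```python
from dataclasses import dataclass
from enum import Enum

class Origin(Enum):
    # Contextual trait: how the data came to be (slide 6).
    OBSERVATIONAL = "observational"
    EXPERIMENTAL = "experimental"
    SIMULATION = "simulation"
    DERIVED = "derived"

class AccessType(Enum):
    # Inherent trait: availability (slide 7).
    OPEN = "open data"
    SENSITIVE = "sensitive data"

@dataclass
class DatasetCategory:
    """One record combining contextual and inherent categorization traits."""
    origin: Origin           # contextual: observational, experimental, ...
    use_category: str        # contextual: source, output or method
    access_type: AccessType  # inherent: open vs. sensitive
    stability: str           # inherent: generic data / generic research data / research data publication

# Example record for a clinical survey dataset (values are illustrative).
survey = DatasetCategory(
    origin=Origin.OBSERVATIONAL,
    use_category="source",
    access_type=AccessType.SENSITIVE,
    stability="generic research data",
)
print(survey)
```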
9. Dynamic and growing datasets
• URN allows use of fragments
• Avoid PID inflation
• Consider costs and sustainability
• Ad hoc creation rather than automatic minting and allocation?
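One way to read the URN point: a single persistent identifier can cover a growing dataset, with fragments identifying slices, so each update does not need its own PID. A minimal sketch, assuming a hypothetical URN and a monthly slicing scheme:

```python
# One PID for the whole growing dataset; fragments identify slices,
# which helps avoid "PID inflation" from minting a PID per update.
BASE_URN = "urn:nbn:fi:csc-example-growing-dataset"  # hypothetical URN

def cite_slice(year: int, month: int) -> str:
    """Return a citable reference to one monthly slice of the dataset."""
    return f"{BASE_URN}#slice-{year:04d}-{month:02d}"

print(cite_slice(2018, 11))  # urn:nbn:fi:csc-example-growing-dataset#slice-2018-11
```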
10. Operational data / Generic research data / Research dataset

Operational data
- Description: Data for any use, private or government owned; might fall within PSI.
- Format: May be dynamic; mature solutions; active or even hot data.
- Examples: weather data; data catalogue; big data from social media.

Generic research data
- Description: Produced by/with/for researchers; validated, good quality, well documented; might be raw or processed.
- Format: Coherent and well documented formats. Data should be quite stable, with versioning. Should be possible to cite and to enable reproducible research.
- Examples: corpora; time series of experimental or observational data from technical instruments; similar social or clinical surveys.

Research dataset
- Description: Dataset produced for a certain research question. Might be highly processed; reuse difficult unless the field is mature. The main purpose is assessment and reproducibility.
- Format: Usually in files, but might also be a database with applications. Citation does not require a date. Two-tier resolver for identifier and landing page, with metadata available even after the data is gone. Might have a defined lifespan.
- Examples: data paper; data cited in an article and published in Zenodo, EUDAT B2Share, or another general or journal repository.
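Because the three types differ in stability and citation needs, a service could derive its handling rules directly from the type. The sketch below is one hypothetical encoding of the table's citation and landing-page rules in Python, not an implemented CSC service.

```python
from dataclasses import dataclass

@dataclass
class DataTypePolicy:
    """Hypothetical service-side policy derived from the research data type."""
    citation_needs_access_date: bool  # dynamic content: cite with access date
    versioned: bool                   # stable content: cite a version/snapshot
    landing_page_outlives_data: bool  # keep landing-page metadata after data is gone

POLICIES = {
    "operational data": DataTypePolicy(True, False, False),
    "generic research data": DataTypePolicy(False, True, False),
    "research dataset": DataTypePolicy(False, True, True),
}

def citation_hint(data_type: str) -> str:
    """Suggest how to cite a dataset of the given type."""
    policy = POLICIES[data_type]
    if policy.citation_needs_access_date:
        return "Cite with access date and time; content may change."
    return "Cite the versioned snapshot via its persistent identifier."

print(citation_hint("operational data"))
print(citation_hint("research dataset"))
```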
11. Using research data types …
… makes it easier to describe services
… makes it easier for researchers to plan data life cycle
… makes developing solutions for citation and FAIR data creation and use easier
… makes it easier to describe and manage research data