Big Data and the Health domain (vis-a-vis the respective H2020 Societal Challenge) - Opportunities, Challenges and Requirements. As presented and discussed in the public launch of the BigDataEurope project.
Edge Informatics and FAIR (Findable, Accessible, Interoperable and Reusable) ... - Tom Plasterer
Edge Informatics is an approach to accelerate collaboration in the BioPharma pipeline. By combining technical and social solutions, knowledge can be shared and leveraged across the multiple internal and external silos participating in the drug development process. This is accomplished by making data assets findable, accessible, interoperable and reusable (FAIR). Public consortia and internal efforts embracing FAIR data and Edge Informatics are highlighted, in both preclinical and clinical domains.
This talk was presented at the Molecular Medicine Tri-Conference in San Francisco, CA on February 20, 2017
Presented by Richard Kidd at "The Future Information Needs of Pharmaceutical & Medicinal Chemistry", Monday 28 November 2011 at The Linnean Society, Burlington House, London, run by the RSC CICAG group.
In this presentation, you will learn how to transform a Big Data initiative into a realized, measurable ROI:
• Understand the complex mix of business expectation, hype, reality, and new information source opportunities in the Big Data space
• Use the Business Case process to help you identify what you can achieve and what is not yet ready
• Build communities of interest around prototypes and plan for success for your company’s advantage
• Learn how to industrialize your Big Data innovations to achieve measurable, sustainable benefits
Harnessing Edge Informatics to Accelerate Collaboration in BioPharma (Bio-IT ... - Tom Plasterer
As scientists in the life sciences we are trained to pursue singular goals around a publication, a validated target or a drug submission. Our failure rates are exceedingly high, especially as we move closer to patients in the attempt to collect sufficient clinical evidence to demonstrate the value of novel therapeutics. This wastes resources as well as the time of patients depending upon us for the next breakthrough.
Edge Informatics is an approach to ameliorate these failures. Using technical and social solutions together, knowledge can be shared and leveraged across the drug development process. This is accomplished by making data assets discoverable, accessible, self-described, reusable and annotatable. The Open PHACTS project pioneered this approach and has provided a number of the technical and social solutions that enable Edge Informatics. A number of pre-competitive consortia and some content providers have also embraced this approach, facilitating networks of collaborators within and outside a given organization. Taken together, these efforts foster more accurate, timely and inclusive decision-making.
BioPharma and FAIR Data, a Collaborative Advantage - Tom Plasterer
The concept of FAIR (Findable, Accessible, Interoperable and Reusable) data is becoming a reality as stakeholders from industry, academia, funding agencies and publishers embrace this approach. For BioPharma, being able to effectively share and reuse data is a tremendous competitive advantage: within a company, with peer organizations, with key opinion leaders and with regulatory agencies. A few key drivers, success stories and preliminary results of an industry data stewardship survey are presented.
Slides to be presented at a webinar arranged by Metasolution as part of a Vinnova project http://metasolutions.se/2014/03/webbinarium-med-kerstin-forsberg-om-lankade-data-i-lakemedelsforskningen/
FAIR Data Knowledge Graphs – from Theory to Practice - Tom Plasterer
FAIR data has flown up the hype curve without a clear sense of return from the required data stewardship investment. The killer use case for FAIR data is a science knowledge graph. It enables you to richly address novel questions across your own and the world's data. We started with data catalogues (findability) which exploited linked/referenced data using a few focused vocabularies (interoperability), for credentialed users (accessibility), with provenance and attribution (reusability) to make this happen. Our processes enable simple creation of dataset records and linking to source data, providing a seamless federated knowledge graph for novice and advanced users alike.
Presented May 7th, 2019 at the Knowledge Graph Conference, Columbia University.
Making Data FAIR (Findable, Accessible, Interoperable, Reusable) - Tom Plasterer
What to do About FAIR…
In the experience of most pharma professionals, FAIR remains fairly abstract, bordering on inconclusive. This session will outline specific case studies – real problems with real data – and address opportunities and real concerns.
· Why making data Findable, Accessible, Interoperable and Reusable is important.
Talk presented at the Data Driven Drug Development (D4) conference on March 20th, 2019.
FAIR data has flown up the hype curve without a clear sense of return from the required data stewardship investment. The killer use case for FAIR data is a science knowledge graph. It enables you to richly address novel questions across your own and the world's data. We started with data catalogues (findability) which exploited linked/referenced data using a few focused vocabularies (interoperability), for credentialed users (accessibility), with provenance and attribution (reusability) to make this happen.
This talk was presented at The Molecular Medicine Tri-Conference/Bio-IT West on March 11, 2019.
Dataset Catalogs as a Foundation for FAIR* Data - Tom Plasterer
BioPharma and the broader research community are faced with the challenge of simply finding the appropriate internal and external datasets for downstream analytics, knowledge generation and collaboration. With datasets as the core asset, we wanted to promote both human and machine exploitability, using web-centric data cataloguing principles as described in the W3C Data on the Web Best Practices. To do so, we adopted DCAT (Data CATalog Vocabulary) and VoID (Vocabulary of Interlinked Datasets) for both RDF and non-RDF datasets at summary, version and distribution levels. Further, we’ve described datasets using a limited set of well-vetted public vocabularies, focused on cross-omics analytes and clinical features of the catalogued datasets.
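As a concrete illustration of the cataloguing approach described above, a DCAT dataset record with one distribution can be sketched as JSON-LD. This is a minimal, hypothetical sketch in plain Python; the dataset URIs, title and field selection are illustrative assumptions, not the actual catalogue schema from the talk.

```python
import json

def make_dataset_record(uri: str, title: str, distribution_url: str) -> dict:
    """Build a minimal DCAT dataset record as JSON-LD (illustrative fields only)."""
    return {
        "@context": {
            "dcat": "http://www.w3.org/ns/dcat#",
            "dct": "http://purl.org/dc/terms/",
        },
        "@id": uri,
        "@type": "dcat:Dataset",
        "dct:title": title,
        # One distribution per record here; real catalogues describe each
        # version and distribution separately, as the abstract notes.
        "dcat:distribution": {
            "@id": distribution_url,
            "@type": "dcat:Distribution",
            "dcat:downloadURL": distribution_url,
        },
    }

# Hypothetical dataset identifiers, for illustration only.
record = make_dataset_record(
    "https://example.org/datasets/study-42",
    "Cross-omics study 42 (summary level)",
    "https://example.org/datasets/study-42/v1.csv",
)
print(json.dumps(record, indent=2))
```

In practice such records would also carry versioning, licensing and provenance terms (for example dct:issued and dct:license) and, for RDF datasets, the corresponding VoID descriptors.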
As BioPharma adapts to incorporate nimble networks of suppliers, collaborators, and regulators, the ability to link data is critical for dynamic interoperability. Adoption of the linked data paradigm allows BioPharma to focus on its core business: delivering valuable therapeutics in a timely manner.
The internet now offers access to a myriad of online resources that can be of value to chemists working in the Life Sciences. While finding information online is, in many cases, a simple search away, the accuracy and validity of the associated data and information should be questioned. As more databases and resources are introduced online, commonly without integration with other resources, a scientist must perform multiple searches and then undertake the task of meshing and merging data. ChemSpider is a freely accessible online database that has taken on the challenge of meshing together distributed resources across the internet to provide a structure-based hub. It is a crowdsourcing environment hosting over 26 million unique compounds linked out to over 400 data sources. With well-defined programming interfaces, ChemSpider has been integrated into many commercial and open software packages and presently serves as the chemistry foundation for the IMI Open PHACTS project.
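The "meshing and merging" across sources that the abstract describes is typically keyed on a canonical structure identifier such as the InChIKey, so that the same compound from different vendors collapses to one record. The sketch below is a hypothetical illustration of that idea, not ChemSpider's actual pipeline; the record layout and source names are assumptions.

```python
from collections import defaultdict

def merge_by_inchikey(records):
    """Group per-source compound records into unique compounds keyed on InChIKey.

    Each record is a dict with 'inchikey', 'source' and 'name' fields.
    Illustrative sketch only; a real aggregator also standardizes and
    curates structures before merging.
    """
    compounds = defaultdict(lambda: {"names": set(), "sources": set()})
    for rec in records:
        entry = compounds[rec["inchikey"]]
        entry["names"].add(rec["name"])
        entry["sources"].add(rec["source"])
    return dict(compounds)

# Hypothetical records: two sources describe the same compound (aspirin).
records = [
    {"inchikey": "BSYNRYMUTXBXSQ-UHFFFAOYSA-N", "source": "VendorA", "name": "aspirin"},
    {"inchikey": "BSYNRYMUTXBXSQ-UHFFFAOYSA-N", "source": "WikiX", "name": "acetylsalicylic acid"},
    {"inchikey": "RYYVLZVUVIJVGH-UHFFFAOYSA-N", "source": "VendorA", "name": "caffeine"},
]
merged = merge_by_inchikey(records)
print(len(merged))  # two unique compounds from three source records
```

Merging on a hashed identifier rather than on names is what lets synonyms like "aspirin" and "acetylsalicylic acid" land on the same hub record.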
How to Create a Big Data Culture in Pharma - Chris Waller
A talk presented at the Big Data and Analytics conference in Boston on January 28, 2014. Emphasis on data and information sharing cultures in companies.
The original abstract for the talk is below, but the talk changed based on strong interest in InChI and the possibilities of using it in a Semantic Web for Chemistry.
The increasing availability of free and open access resources for scientists on the internet presents us with a revolution in data availability. However, freedom costs, and in many cases the cost is quality. ChemSpider is a free-access website for chemists built with the intention of providing a structure-centric community for chemists. As an aggregator of chemistry-related information from many sources (at present over 21.5 million unique chemical entities from over 150 separate data sources), ChemSpider has taken on the task of both robotically and manually curating publicly available data sources. This presentation will provide an overview of how a curated platform can become the centralized hub for resourcing information about chemical entities. We will also present ChemMantis, an entity extraction platform for extracting chemical names and scientific terms from documents and providing a platform for structure-based searching of Open Access chemistry literature.
My talk at the final Open PHACTS project meeting in Vienna in 2016, where I was asked to talk about the challenges we addressed in Open PHACTS with Semantic Web technology and what still needed to be done.
Medical innovation calls for new collaboration models that bring together government, academia and industry.
Barriers to research and ultimate commercialization will be lowered by bringing best practices from industry and academic settings.
The Hippocrates platform facilitates early drug development, extending from basic research to drug invention and commercialization, significantly saving time and money.
The platform is designed to facilitate collaboration among stakeholders and to take advantage of the vast resources currently available on the web to generate and aggregate content based on the end-user's research needs.
The Pistoia Alliance Biology Domain Strategy April 2011 - Pistoia Alliance
Michael Braxenthaler (Roche and external liaison officer for Pistoia) describes the Pistoia Alliance biology domain strategy at the first Pistoia Alliance Conference in April 2011.
Accelerate Pharmaceutical R&D with Big Data and MongoDB - MongoDB
Introduction of disruptive technologies, including use of unstructured data, is critical to Pharmaceutical R&D. We will explore how MongoDB can be used to accelerate this. We will also have an open discussion with panel members who are using MongoDB in this space. This session will be 30 minutes and will be followed by a 20 minute panel discussion led by Jason Tetrault and Deniz Kural.
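The appeal of MongoDB for the unstructured R&D data mentioned above is its schema-flexible document model: records with different shapes live in one collection and are retrieved with structured queries. The sketch below imitates that behaviour with plain Python dicts so it runs without a server; the field names and the tiny query matcher are illustrative assumptions. In real pymongo the equivalent call would be collection.find({"target": "EGFR", "ic50_nm": {"$gt": 100}}).

```python
def matches(doc, query):
    """Tiny MongoDB-style matcher: supports equality and {'$gt': x} only."""
    for field, cond in query.items():
        if field not in doc:
            return False
        if isinstance(cond, dict) and "$gt" in cond:
            if not doc[field] > cond["$gt"]:
                return False
        elif doc[field] != cond:
            return False
    return True

# Heterogeneous R&D records in one "collection" -- no shared schema required.
collection = [
    {"type": "assay", "target": "EGFR", "ic50_nm": 12.5},
    {"type": "assay", "target": "EGFR", "ic50_nm": 890.0, "notes": "weak binder"},
    {"type": "document", "target": "EGFR", "text": "free-text study report"},
]
hits = [d for d in collection if matches(d, {"target": "EGFR", "ic50_nm": {"$gt": 100}})]
print(len(hits))  # 1: only the weak binder matches; the document lacks ic50_nm
```

The point of the document model is visible in the data itself: the free-text report coexists with structured assay records, yet is simply skipped by queries on fields it does not have.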
This lecture outlines the different strategies for finding a fragment hit and the subsequent elaboration strategies used in order to increase potency to develop a lead compound in drug discovery.
As the author of “Big Data in Healthcare Hype and Hope,” Dr. Feldman has interviewed over 180 emerging tech and healthcare companies, always asking, “How can your new approach help patients?” Her research shows that data, as an enabling tool, has the power to give us critical new insights into not only what causes disease, but what comprises normal. Despite this promise, few patients have reaped the benefits of personalized medicine. A panel of leading big data innovators will discuss the evolving health data ecosystem and how big data is being leveraged for research, discovery, clinical trials, genomics, and cancer care. Case studies and real-life examples of what’s working, what’s not working, and how we can help speed up progress to get patients the right care at the right time will be explored and debated.
• Bonnie Feldman, DDS, MBA - Chief Growth Officer, @DrBonnie360
• Colin Hill - CEO, GNS Healthcare
• Jonathan Hirsch - Founder & President, Syapse
• Andrew Kasarskis, PhD - Co-Director, Icahn Institute for Genomics & Multiscale Biology; Associate Professor, Genetics & Genomic Studies, Icahn School of Medicine at Mt. Sinai
• William King - CEO, Zephyr Health
New York eHealth Collaborative Digital Health Conference
November 18, 2014
Insight into AstraZeneca's Technology Services - Nick Brown
Presentation given at the Big Data in Pharma Europe conference, London February 19th 2014 (http://bigdatapharma-europe.com/). Updated for Enterprise Search Europe Summit April 29th (http://www.enterprisesearcheurope.com/2014/Tuesday.aspx).
Overview of the innovation approach taken within Technology Services at AstraZeneca, showcasing the approach, six examples of pilots and proofs-of-concept, and a case study of how to implement a revolution in search analytics, using R&D as a springboard for the enterprise.
The internet continues to offer increased access to chemistry data that may be of value to scientists interested in populating systems containing reference toxicology data, as well as to provide data for the development of predictive models. This presentation will give an overview of some of the various sources of data available via the internet, describe some of the challenges associated with gathering high-quality data and discuss methods by which to mesh together disparate data sources.
ChemSpider was developed with the intention of aggregating and indexing available sources of chemical structures and their associated information into a single searchable repository and making it available to everybody, at no charge. There are many tens of chemical structure databases such as literature data, chemical vendor catalogs, molecular properties, environmental data, toxicity data, analytical data etc., and no single way to search across them. Despite the diversity of databases available online, their inherent quality, accuracy and completeness is lacking in many regards. ChemSpider was established to provide a platform whereby the chemistry community could contribute to cleaning up the data, improving the quality of data online and expanding the information available to include data such as reaction syntheses, analytical data and experimental properties. ChemSpider has now grown into a database of well over 20 million chemical substances integrated with over 300 disparate data sources, many of these directly supporting the Life Sciences. This presentation will provide an overview of our efforts to improve the quality of data online, to provide a foundation for the semantic web for chemistry and to provide access to a set of online tools and services to support access to these data. I will also discuss how ChemSpider is being used to enhance Semantic Publishing in Chemistry at RSC.
RSC|ChemSpider is one of the world’s largest online resources for chemistry related data and services. Developed with the intention of delivering access to structure-based chemistry data via the internet, the ChemSpider platform hosts over 26 million unique chemical compounds aggregated from over 400 data sources and provides an environment for the community to both annotate and curate these existing data as well as deposit new data to the system. The search system delivers flexible querying capabilities together with links to external sites for publication and patent data. ChemSpider has spawned a number of projects, including ChemSpider SyntheticPages for hosting openly peer-reviewed chemical synthesis articles. This presentation will review the present capabilities of the ChemSpider system, providing direct examples of how to use the system to source high-quality data of value to pharmaceutical companies. We will discuss some of the challenges associated with validating data quality, examine how ChemSpider is a part of the semantic web for chemistry and investigate approaches to using ChemSpider integrated with analytical instrumentation.
The ChemSpider database is an online resource containing >26 million chemicals sourced from over 400 data sources. As a result, the database is a rich resource supporting the verification and elucidation of chemical structures and is utilized by mass spectrometrists around the world using the online user interface as well as the application programming interface. This presentation will provide an overview of how ChemSpider can be used for the purpose of structure identification and will include (1) direct interaction with the online interface; (2) integration with mass spectrometry vendor software; (3) applications to the identification of “known unknowns” and a comparison with the capabilities of CAS SciFinder; and (4) the hosting of online mass spectral data.
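Identifying a "known unknown" typically begins with an accurate-mass lookup: the observed monoisotopic mass is matched against database entries within a parts-per-million tolerance. The sketch below shows that single step in isolation against a hypothetical three-compound mini-database; the real workflow queries ChemSpider's far larger collection through its web interface or API and then ranks the candidate structures.

```python
def ppm_window(mass, ppm=5.0):
    """Return the (low, high) mass window for a given ppm tolerance."""
    delta = mass * ppm / 1e6
    return mass - delta, mass + delta

def candidates_by_mass(observed_mass, database, ppm=5.0):
    """Return names of database entries whose monoisotopic mass falls in the window."""
    low, high = ppm_window(observed_mass, ppm)
    return [name for name, mass in database.items() if low <= mass <= high]

# Hypothetical mini-database of monoisotopic masses (Da).
db = {
    "caffeine": 194.08038,
    "theophylline": 180.06473,
    "aspirin": 180.04226,
}
print(candidates_by_mass(194.0804, db, ppm=5.0))  # ['caffeine']
```

Note how the two 180 Da compounds are still distinguishable at 5 ppm: their exact masses differ by far more than the tolerance window, which is why accurate mass alone can already prune candidate lists sharply.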
This is a presentation given at the Opal Events meeting "Drug Discovery Partnerships: Filling the Pipeline". I was speaking in a session with Jean-Claude Bradley regarding "Pre-competitive Collaboration: Sharing Data to Increase Predictability". This presentation discussed some of the work we are doing on Open PHACTS. My thanks especially to Carole Goble, Lee Harland and Sean Ekins for their comments.
The increasing availability of free and open access resources for scientists on the internet presents us with a revolution in data availability. The Royal Society of Chemistry hosts ChemSpider, a free-access website built with the intention of building a community for chemists (http://www.chemspider.com/).
ChemSpider is an aggregator of chemistry-related information, at present hosting over 20 million unique chemical entities linked out to over 300 separate data sources, and has taken on the task of both robotically and manually curating publicly available data sources. It is also a public deposition platform where chemists can deposit their own data, including novel structures, analytical data and synthesis procedures, and host data associated with the growing activities around Open Notebook Science.
This presentation will examine chemistry on the internet, the dubious quality of what is available and how the ChemSpider crowdsourced curation platform is fast becoming one of the centralized hubs for resourcing information about chemical entities.
We will also review our efforts to provide free resources for synthesis procedures, spectral data and structure-based searching of the chemistry literature and how chemists can contribute directly to each of these projects.
Our access to scientific information has changed in ways that were hardly imagined even by the early pioneers of the internet. The immense quantities of data and the array of tools available to search and analyze online content continue to expand, while the pace of change does not appear to be slowing. ChemSpider is one of the chemistry community’s primary online public compound databases. Containing tens of millions of chemical compounds and their associated data, ChemSpider serves data to tens of thousands of chemists every day, and it serves as the foundation for many important international projects to integrate chemistry and biology data, facilitate drug discovery efforts and help to identify new chemicals from under the ocean. This presentation will provide an overview of the expanding reach of the ChemSpider platform and the nature of the solutions that it helps to enable. We will also discuss the possibilities it offers in the domain of crowdsourcing and open data sharing. The future of scientific information and communication will be underpinned by these efforts, influenced by increasing participation from the scientific community, and will facilitate collaboration and ultimately accelerate scientific progress.
dkNET Webinar: The Collaborative Microbial Metabolite Center – Democratizing ... - dkNET
Presenter: Pieter Dorrestein, PhD, Professor, Skaggs School of Pharmacy and Pharmaceutical Sciences, Department of Pharmacology and Pediatrics, University of California San Diego
Abstract
In the analysis of organs, the volatilome, or biofluids, the microbiome influences 15-70% of detectable mass spectrometry molecules. Typically, only 10% of human untargeted metabolomics data can be assigned a molecular structure, with merely 1-2% traceable to microbial origins. Human microbiomes contribute metabolites through the microbial metabolism of host-derived substances, digestion of food and beverage molecules, and de novo assembly using proteins encoded by genetic elements. Despite the significance of microbiome-derived metabolites to human health, there is no centralized knowledge base for community access. To address this, the "Collaborative Microbial Metabolite Center" (CMMC) leverages expertise in mass spectrometry, microbiome innovation, and the GNPS ecosystem to build a knowledge base. It aims to create a user-accessible microbiome resource, enrich bioactivity knowledge, and facilitate data deposition. The CMMC includes the construction of a knowledge base, the MicrobeMASST tool, and health phenotype enrichment workflows; their construction and use will be discussed in this presentation. The use of this ecosystem will be exemplified by the discovery of 20,000 bile acids, many of which were shown to be of microbial origin and linked to diet and IBD.
The top 3 key questions that this resource can answer:
1. How can we leverage the thousands of public metabolomics studies to discover microbial metabolites and their organ distributions, as well as their phenotypic, including health, associations?
2. If one has an unknown molecule, how can one assess which microbes make a molecule without a known structure?
3. How can one contribute to the expansion of the knowledgebase on microbial metabolites?
Upcoming webinars schedule: https://dknet.org/about/webinar
RSC|ChemSpider is one of the world’s largest online resources for chemistry related data and services. Developed with the intention of delivering access to structure-based chemistry data via the internet, the ChemSpider platform hosts over 26 million unique chemical compounds aggregated from over 400 data sources and provides an environment for the community to both annotate and curate these existing data as well as deposit new data to the system. The search system delivers flexible querying capabilities together with links to external sites for publication and patent data. This presentation will review the present capabilities of the ChemSpider system, providing direct examples of how to use the system to source high-quality data of value to chemists. We will discuss some of the challenges associated with validating data quality and examine how ChemSpider is a part of the new “semantic web for chemistry”. ChemSpider has also spawned a number of additional projects, including ChemSpider SyntheticPages for hosting openly peer-reviewed chemical synthesis articles, the Learn Chemistry Wiki for students learning chemistry, and SpectraSchool for learning spectroscopy.
The internet has provided access to unprecedented quantities of data. In the domain of chemistry specifically, over the past decade the web has become populated with tens of millions of chemical structures and related property and assay data, together with tens of thousands of spectra and syntheses. The data have, to a large extent, remained disparate and disconnected. In recent years, with the wave of Web 2.0 participation, any chemist can contribute to both the sharing and validation of chemistry-related data, whether via Wikipedia, the online encyclopedia, or one of the multiple public compound databases. The presentation will offer a perspective on what is available today, our experiences of building a public compound database to link chemistry across the internet, and a suggested path forward for enabling even greater integration and connectivity of chemistry data for the masses to both use and participate in developing.
ChemSpider is a structure-centric database hosted by the Royal Society of Chemistry, linking over 25 million chemical compounds to over 400 internet-based resources including many public domain databases, Wikipedia, chemical vendors, patents, publications and other web-based services. The intention is for ChemSpider to become one of the primary online hubs for chemists to source chemistry-related data. During the development of the ChemSpider database we have utilized numerous approaches to standardizing, curating and validating the data supplied to us for hosting and integration. This presentation will provide an overview of our initial development of the ChemSpider database and of our present processes and procedures for handling incoming data depositions. We will also discuss how crowdsourcing can help to expand, curate and validate the data in the ChemSpider database.
Building a semantic chemistry platform with the Royal Society of ChemistryValery Tkachenko
We live in an exponentially expanding world of “big data”. Social networks, global portals and other distributed systems have been attempting to deal with the problem for a few years now. Scientific applications are commonly lagging behind the mainstream trends due to the complexity of the scientific domain. The Royal Society of Chemistry is building the Global Chemistry Network connecting a variety of resources both in-house and external, bridging gaps and advancing the chemical sciences. One of the main issues connected to the world of big data is the ease of navigation and comprehensiveness of the search capabilities. This is where the approach of the semantic web meets the world of big data. We will present our approaches in building a global federated chemistry platform connecting multiple domains of chemistry using semantic web technologies.
With the intention of providing a high quality, free internet resource of chemistry-related data for the community, ChemSpider has aggregated almost 25 million compounds linked out to over 400 data sources and provided a platform for the community to both deposit and curate data. This experiment in crowdsourcing for chemistry has now been running for over three years. This presentation will review a number of aspects of the project including (a) the level of community participation in depositing and curating data; (b) the nature of data and content supplied by the community; (c) how ChemSpider is used by the community; (d) using game-based systems to assist in data curation; (e) algorithmic approaches to data validation and filtering; and (f) sharing data curation efforts with other online databases.
FAIRness Assessment of the Library of Integrated Network-based Cellular Signa...Kathleen Jagodnik
The FAIR Guiding Principles facilitate the Findability, Accessibility, Interoperability, and Reusability of digital resources. The Library of Integrated Network-based Cellular Signatures (LINCS) Project has sought to implement the FAIR principles in the provision of its resources in order to optimize usability. We have surveyed the FAIR principles and are implementing specific facets within the LINCS resources. Subsequently, with reference to the literature and other efforts to measure FAIRness, we are developing quantitative metrics to assess the FAIRness of each dataset and resource in order to provide users with objective measures of the characteristics of the LINCS project. Assessing and improving the FAIRness of LINCS is an ongoing effort by our team that will benefit from community input to ensure that all LINCS users are optimally engaged with this resource.
Current advances to bridge the usability-expressivity gap in biomedical seman...Maulik Kamdar
I presented a talk at the Protege research meeting on the 'Current advances to bridge the usability-expressivity gap in biomedical semantic search (and visualizing linked data)' https://sites.google.com/site/protegeresearchmeeting/meeting-materials/current-advances-to-bridge-the-usability-expressivity-gap-in-semantic-search
Today ChemSpider (www.chemspider.com) is one of the community's primary online resources for chemists. Now hosting over 28 million unique chemical compounds linked to over 400 data sources, ChemSpider offers its users a structure-centric platform facilitating access to publications and patents, experimental and predicted property data, spectral data and many other forms of data and information that can benefit a chemist. ChemSpider is a crowdsourcing platform allowing the community to contribute data directly to the database through the deposition and sharing of structure data, properties, spectra and reaction syntheses. The crowdsourcing also allows for the annotation and curation of existing data, thereby letting the community assist in the much-needed curation and validation of chemistry data on the internet. This work is imperative in order to provide the chemistry underpinnings to semantic web projects such as Open PHACTS (www.openphacts.org), from which Merck is sure to benefit when it is released to the community. This presentation will provide an overview of the ChemSpider platform and will also examine the challenges of dealing with heterogeneous data quality when attempting to provide a rich resource of data for the community. If you use the internet to research chemistry-based data, this presentation will be an essential guide to sourcing high quality data.
This is a presentation given at the European Bioinformatics Institute (EMBL-EBI) in Cambridge on December 1st 2010, at an EMBL-EBI Industry Programme Workshop on "Chemical Structure Resources". This is where I unveiled details of the intra/inter-validation studies validating drug structures across multiple public domain chemistry databases. I also unveiled early results from the SurveyMonkey study of the 'trust' that the community has in public domain chemistry resources.
Scientists commonly find themselves overwhelmed by the amount of information accessible to them. The distribution of resources now includes the entire space of the worldwide web, access to primary databases such as CAS and, commonly, a plethora of internally developed systems. While the web has provided improved access to chemistry-related information, there has not been an online central resource allowing integrated chemical structure searching of chemistry databases, chemistry articles, patents and web pages such as blogs and wikis. ChemSpider has built a structure-centric community for chemists by providing free access to an online database and collaboration tool. The online database offers an environment for curating the data on ChemSpider as well as for depositing chemical structures, analytical data and associated information, and provides a significant knowledge base and resource for chemists working in different domains. An overview of present and future capabilities is given.
This is a presentation I gave at the FDA on December 1st 2009 in Washington, DC as part of a symposium involving PubChem, ChemIDplus, Pillbox, DailyMed and other related systems. The focus was, as usual, on the quality of data online and how to clean up the information, with a specific focus on the quality of data on the FDA's DailyMed and our efforts to apply semantic markup to the DailyMed articles.
Evolution of public chemistry databases: past and the futureValery Tkachenko
Over the last few years we have seen tremendous growth in chemical databases. As a result we now have a variety of scientific resources, combined into a broad network and indexed through directories like BioSharing and re3data. Such a network, while growing quickly, is still in the early days of adopting semantic web standards and does not yet support deep data indexing and discoverability; mechanisms for intellectual property protection are, at best, as simple as making data public or private. The lack of standards and well-defined models to describe the structure of scientific information further inhibits the free information flow that is essential for scientific discovery.
In this talk we will share our experience, spanning decades, of building chemical databases such as PubChem, ChemSpider, Open PHACTS and the National Database Services, and will outline fundamental problems associated with chemical databases as such, as well as data quality issues and approaches to the modern architecture of large-scale chemical databases.
Materials design is a grand challenge of materials science, and the main approach to it is still intuition-based, demanding substantial time and financial resources, with months to years spent on experiments and characterization. Therefore, any model that can be applied at the very first stage of materials design to narrow the selection area is a helpful tool for the synthetic chemist. An automated search for materials with human-defined target properties across the entire chemical space, i.e. inverse materials design, is likewise a highly desired tool for exploring the materials design space.
De novo design is, moreover, not an entirely new task in the development of organic molecules with target properties: many generative approaches are already used alongside screening libraries of existing molecules, searching for drugs against a particular target, or generating new molecules from a very simple initial structure.
Here we would like to present a new approach for generating new materials with desired properties. We used an autoencoder neural network architecture to encode materials composition and crystal structure as a vector in a latent space. In this setting, any Quantitative Structure-Property Relationship (QSPR) model based on that vector can be interpreted as a function in the latent space and used to predict properties of existing materials as well as prophetic ones. The approach has accuracy comparable to classic computational methods such as DFT for predicting energies or charges, but significantly outperforms them in computational time.
The proposed method was tested on generating superhard materials, but can easily be extended to any target property, provided a database of materials properties is available for training.
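As a sketch of the latent-space idea described above: the toy example below uses PCA as a linear stand-in for the autoencoder's encoder/decoder pair and Ridge regression as the QSPR function in the latent space. The descriptors and target property are synthetic, not real materials data.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)

# Synthetic stand-in for featurized composition/structure descriptors.
X = rng.normal(size=(200, 30))
y = X[:, :3].sum(axis=1) + 0.1 * rng.normal(size=200)  # toy target property

# PCA plays the role of the autoencoder here (linear encode/decode).
encoder = PCA(n_components=5).fit(X)
Z = encoder.transform(X)

# The QSPR model is now a function in the latent space.
qspr = Ridge().fit(Z, y)

# Inverse-design sketch: nudge a latent vector along the property gradient
# (for a linear model the gradient is just the coefficient vector), then
# decode it back to descriptor space as a candidate "prophetic" material.
z0 = Z[0]
z_new = z0 + 0.5 * qspr.coef_ / np.linalg.norm(qspr.coef_)
candidate = encoder.inverse_transform(z_new.reshape(1, -1))

delta = qspr.predict(z_new.reshape(1, -1))[0] - qspr.predict(z0.reshape(1, -1))[0]
print(candidate.shape)  # (1, 30)
print(delta > 0)        # moving along the gradient raises the predicted property
```

With a trained nonlinear autoencoder the decode step is the same in spirit; only the encoder/decoder and the latent-space optimizer change.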
Metal-organic frameworks: from database to supramolecular effects in complexa...Valery Tkachenko
Metal-organic frameworks (MOFs) attract a lot of interest due to their unique structure-dependent properties. Their internal pores, comparable in size to small molecules, are naturally suited to various adsorption effects. These properties underpin multiple applications, such as catalysis, gas storage/separation and, especially, clean-energy technologies.
Theoretical calculations are a usual way of decreasing experimental costs while investigating the properties of new materials, especially at the design stage. Electronic structure calculations such as density functional theory (DFT) in most cases provide appropriate accuracy in matching experimentally measured data such as adsorbate interaction energies. However, as with experimental studies, large-scale materials screening with DFT calculations is rather time-consuming and can be carried out only for structures with relatively small unit cells.
Here we present theoretical and experimental results describing the calculation of electron density in metal-organic frameworks. We built a model trained to predict partial charges on MOF atoms based on DFT calculations. The relative error of the model allows us to conclude that the models do not decrease the level of accuracy and do not introduce additional error compared to DFT, while their computational cost is several orders of magnitude lower. The models also demonstrated transferability, allowing predictions to be made, for example, for MOFs containing metals not present in the training set.
We have also built a force field (FF) of two-centered and three-centered interatomic potentials constructed using the predicted charges. The FF proved able to reproduce MOF crystal structures. As a final test, we applied the developed model and FF to newly synthesized lanthanide-containing MOFs to estimate the influence of supramolecular effects on metal complexation selectivity.
As a result, we have built a model that predicts one of the basic MOF properties at relatively small computational cost, and tested it on experimental data both taken from literature sources and measured ourselves.
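A minimal sketch of the charge-prediction step: a regressor is trained to reproduce reference charges from per-atom descriptors and then evaluated on held-out atoms. The descriptors, the toy "DFT" charges, and the choice of Random Forest are all illustrative assumptions; the actual model and features are not specified in the abstract.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)

# Synthetic per-atom descriptors (imagine electronegativity, coordination
# number, mean neighbour distance, ...) standing in for real MOF features.
X = rng.uniform(size=(1000, 6))
# Toy "DFT" partial charge: a smooth function of the descriptors plus noise.
q = 0.8 * X[:, 0] - 0.5 * X[:, 1] * X[:, 2] + 0.05 * rng.normal(size=1000)

X_train, X_test, q_train, q_test = train_test_split(X, q, random_state=0)
model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_train, q_train)

# Mean absolute error on held-out atoms; inference is orders of magnitude
# cheaper than rerunning DFT per structure.
mae = np.abs(model.predict(X_test) - q_test).mean()
print(f"MAE: {mae:.3f} e")
```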
Public repositories containing diverse chemical and biological data are one of the main sources of knowledge for further biomedical research. Unfortunately, extracting and transforming these data into a well-interpretable form is a complex exercise. Ongoing community efforts are mainly focused on the analysis of term co-occurrences, text annotation based on term similarity, and related tasks [1].
Here we present an approach based on natural-language processing techniques, intended to shift the search for similar texts on chemical topics from the word to the document level. PubMed records were used to train word2vec and doc2vec models. The generated text representations can be used to search for similar abstracts; similarity depends more on this representation than on the co-presence of certain terms (neighboring compounds, similar publication dates, etc.).
Document-level clustering was also implemented to provide insight into the PubMed text corpus structure. This approach can serve as an alternative to standard topic modeling techniques for the discovery of hidden semantic features in an unsupervised manner.
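The document-level search and clustering described above can be sketched as follows. TF-IDF vectors stand in here for the word2vec/doc2vec embeddings (a simpler representation, chosen so the sketch stays self-contained), and the corpus is a toy one.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.cluster import KMeans

# Tiny corpus standing in for PubMed abstracts.
abstracts = [
    "microbial metabolites detected by mass spectrometry in the gut",
    "gut microbial bile acid metabolites and mass spectrometry",
    "deep learning models predict chemical toxicity endpoints",
    "neural networks estimate toxicity of industrial chemicals",
]

# Document vectors; doc2vec embeddings would be dropped in at this point.
vectors = TfidfVectorizer().fit_transform(abstracts)

# Document-level similarity search: nearest abstract to the first one.
sims = cosine_similarity(vectors[0], vectors).ravel()
nearest = sims[1:].argmax() + 1  # skip the query document itself

# Document-level clustering to expose corpus structure.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors)

print(nearest)  # 1: the other gut-metabolite abstract
print(labels)
```

Swapping TF-IDF for doc2vec changes only the vectorization step; the similarity search and clustering code stays the same.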
Machine learning methods for chemical properties and toxicity based endpointsValery Tkachenko
In the last decade there has been increasing interest in using in silico tools for the potential risk assessment of newly released chemicals, due to the large number of chemicals entering the market yearly and the great uncertainty about their possible hazardous effects. Different tools and methods based on machine learning already exist and have been used in a wide range of applications, starting from quantitative structure-property relationships and expanding into predictive toxicology. A lot of historical data has accumulated across multiple publicly available databases and can be used with novel machine learning methods. Unfortunately, owing to differing datasets, metrics and validation strategies, significant gaps remain in both the quantity and quality of available data, coupled with a lack of optimal predictive methods. This work is an attempt to develop a multitask system that serves as a searchable, curated collection of multiple chemical datasets together with ready-to-use machine learning methods, built solely on open source frameworks and libraries. We have implemented a set of traditional ("shallow") machine learning methods, self-tuned using grid search and k-fold validation, such as Naïve Bayes, k-Nearest Neighbors, Random Forest, Boosted Decision Trees, Regularized Logistic Regression, and Support Vector Machines, based on the open source scikit-learn (http://scikit-learn.org/stable/). Deep Neural Network models of different complexity have also been implemented using Keras (https://keras.io/), an open deep learning library, with TensorFlow (www.tensorflow.org) as a backend. The machine learning models were trained and evaluated to predict measures of toxicity from the physical characteristics of chemical structures, using the same datasets as the Toxicity Estimation Software Tool (https://www.epa.gov/chemical-research/toxicity-estimation-software-tool-test).
The Deep Learning models showed very good performance characteristics and were found to be useful in predicting toxicological and physicochemical endpoints. The results of this work support an optimistic view that some current obstacles in cheminformatics can be overcome by using Deep Learning methods.
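The self-tuning of the shallow models via grid search with k-fold validation can be sketched with scikit-learn. The dataset below is synthetic (not the TEST datasets), and the model and parameter grid are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic binary "toxic / non-toxic" endpoint standing in for real data.
X, y = make_classification(n_samples=300, n_features=20, random_state=0)

# Grid search over hyperparameters with stratified k-fold cross-validation;
# each shallow method in the text gets its own grid in the same pattern.
grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [50, 100], "max_depth": [3, None]},
    cv=StratifiedKFold(n_splits=5),
    scoring="roc_auc",
)
grid.fit(X, y)

print(grid.best_params_)
print(round(grid.best_score_, 3))
```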
Chemical workflows supporting automated research data collectionValery Tkachenko
Acquisition of data from public sources is inefficient, time consuming and limited in scope. The NIH has recently posted its intention to financially support data deposition by investigators through the 'data sharing plan' for each funded proposal. However, this plan also points to a current weakness of centralized data sharing and acquisition: all laboratories use different data collection and formatting approaches. These inconsistencies in data formatting by individual labs lead to the need to invest significant resources in data curation and interpretation by the technical staff maintaining the centralized data collection resource, such as caNanoLab or the Nanomaterial Registry. It would be far more efficient and useful if there were a standardized data collection and deposition template with standard key terms (such as Minimal Information About Nanomaterials, MIAN) that could be modified to add new or important additional data or parameters for each investigator. These new features could ultimately be adopted in the classification scheme and guide the scope of the expanding database. This approach would be a win-win, as it would bring structure to the investigator's laboratory, consistency in data reporting, and a means of transmitting data to the database in parallel with publication, eliminating the acquisition step from the process. In this talk we will outline our experience building the Open Science Data Repository, a federated database system for direct acquisition, curation and management of research data, including nanomaterial data capture, transformation, and streamlined submission to nanomaterial knowledgebases. The key part of the system is a microservices-based architecture which exposes a RESTful API suitable for direct integration into Workflow Management Systems, as well as built-in modules facilitating and enforcing various lab-specific standard operating procedures.
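The standardized-template idea can be sketched as a simple validator: required key terms with expected types, plus freedom for investigators to extend the record. The field names below are purely illustrative, not an actual MIAN schema.

```python
# Hypothetical minimal deposition template; field names are illustrative.
REQUIRED_FIELDS = {
    "material_name": str,
    "core_composition": str,
    "mean_diameter_nm": float,
    "zeta_potential_mv": float,
}

def validate_deposition(record: dict) -> list[str]:
    """Return a list of problems; an empty list means the record conforms."""
    problems = []
    for field, expected in REQUIRED_FIELDS.items():
        if field not in record:
            problems.append(f"missing required field: {field}")
        elif not isinstance(record[field], expected):
            problems.append(f"{field}: expected {expected.__name__}")
    return problems

record = {
    "material_name": "AuNP-15",
    "core_composition": "Au",
    "mean_diameter_nm": 15.2,
    "zeta_potential_mv": -32.0,
    "extra_parameter": "allowed",  # investigators may extend the template
}
print(validate_deposition(record))  # []
```

In a microservices setting, such a check would run behind the deposition endpoint of the RESTful API before a record is accepted.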
Deep learning methods applied to physicochemical and toxicological endpointsValery Tkachenko
Chemical and pharmaceutical companies, and the government agencies regulating both chemical and biological compounds, all strive to develop new methods to provide efficient prioritization, evaluation and safety assessment for the hundreds of new chemicals that enter the market annually. While a lot of historical data is available within the various agencies, organizations and companies, significant gaps remain in both the quantity and quality of the data, coupled with a lack of optimal predictive methods. Traditional QSAR methods are based on sets of features (fingerprints) representing the functional characteristics of chemicals. Unfortunately, due to both data gaps and limitations in the development of QSAR models, read-across approaches have become a popular area of research. Successes in the application of Artificial Neural Networks, and specifically Deep Learning Neural Networks, have delivered new optimism that the lack of data and limited feature sets can be overcome by using Deep Learning methods. In this poster we will present a comparison of various machine learning methods applied to several toxicological and physicochemical endpoints. This abstract does not reflect U.S. EPA policy.
Deep Learning on nVidia GPUs for QSAR, QSPR and QNAR predictionsValery Tkachenko
While we have seen tremendous growth in machine learning methods over the last two decades, there is still no one-size-fits-all solution. The next era of cheminformatics, and of pharmaceutical research in general, is focused on mining heterogeneous big data, which is accumulating at an ever-growing pace, and this will likely use more sophisticated algorithms such as Deep Learning (DL). There has been increasing use of DL recently, which has shown powerful advantages in learning from images and language as well as many other areas. However, the accessibility of this technique for cheminformatics is hindered, as it is not readily available to non-experts. It was therefore our goal to develop a DL framework embedded into a general research data management platform (the Open Science Data Repository) which can be used as an API, as a standalone tool, or integrated into new software as an autonomous module. In this poster we will present results comparing the performance of classic machine learning methods (Naïve Bayes, logistic regression, Support Vector Machines, etc.) with Deep Learning, and will discuss challenges associated with Deep Learning Neural Networks (DNN). DNN learning models of different complexity (up to 6 hidden layers) were built and tuned (varying the number of hidden units per layer, activation functions, optimizers, dropout fraction, regularization parameters, and learning rate) using Keras (https://keras.io/) and TensorFlow (www.tensorflow.org), and applied to various use cases connected to the prediction of physicochemical properties, ADME, toxicity, and materials properties. It was also shown that using NVIDIA GPUs significantly accelerates calculations, although memory consumption puts some limits on the performance and applicability of standard toolkits 'as is'.
Using publicly available resources to build a comprehensive knowledgebase of ...Valery Tkachenko
There is a variety of public resources on the Internet containing information about various aspects of the chemical, biological and pharmaceutical domains. The quality, maturity, hosting organizations and team sizes behind these data resources vary widely; as a consequence, content cannot always be trusted, and the effort of extracting information and preparing it for reuse is repeated again and again at various levels. This problem is especially serious in applications for QSAR, QSPR and QNAR modeling. The authors of this poster believe, based on their own extensive experience of building various types of chemical, analytical and biological databases over decades, that the process of building such a knowledgebase can be systematically described and automated. This poster will outline the work performed on text- and data-mining various public resources on the Web, the data curation process, and making this information publicly available through a portal and a RESTful API. We will also demonstrate how such a knowledgebase can be used for real-time QSAR and QSPR predictions.
Need and benefits for structure standardization to facilitate integration and...Valery Tkachenko
There are a large number of US government databases housing diverse collections of chemical data, including bioassay data (PubChem), toxicity data (CompTox Chemistry Dashboard) and environmental data (a large collection of EPA databases), to name just a few. In many cases integration between the databases, at the chemical structure level, is via alphanumeric text identifiers such as CAS Registry Numbers, or via InChIs (International Chemical Identifiers). Structure-based integration is critically dependent on the initial inputs providing the chemical structures to the InChI generation algorithm. To ensure optimal integration between databases, community standards and agreement regarding the standardization of chemical structures would be beneficial, not only for the integration of US government databases and resources but also for the international scientific community and hosts of online databases. This presentation will discuss our progress toward delivering a fully open source chemical standardization platform as an exemplar for the community to build on and enhance. The system utilizes the CDK (Chemistry Development Kit), RDKit and other open source components. The resource expands on our previous work on the Chemical Validation and Standardization Platform and has been tested using the open data collection provided by the EPA CompTox Chemistry Dashboard.
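A standardize-then-identify step of the kind such a platform performs might look like this with RDKit, one of the open source toolkits named above. This is a minimal sketch assuming RDKit is installed; a real pipeline applies far richer rules than `Cleanup` alone.

```python
# Minimal standardize-then-identify pipeline using RDKit (assumed installed).
from rdkit import Chem
from rdkit.Chem.MolStandardize import rdMolStandardize

def standard_inchikey(smiles: str) -> str:
    """Parse, standardize, and return the InChIKey used for integration."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"unparsable structure: {smiles}")
    mol = rdMolStandardize.Cleanup(mol)  # normalize and reionize the structure
    return Chem.MolToInchiKey(mol)

# Two input conventions for ethanol converge on one identifier.
print(standard_inchikey("CCO"))
print(standard_inchikey("OCC"))
```

Running both inputs through the same standardization before InChI generation is exactly what makes the resulting keys usable as cross-database join points.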
Development and comparison of deep learning toolkit with other machine learni...Valery Tkachenko
The next era of cheminformatics, and of pharmaceutical research in general, is focused on mining heterogeneous big data, which is accumulating at an ever-growing pace, and this will likely use more sophisticated algorithms such as deep learning. There has been increasing use of deep learning, which has shown powerful advantages in learning from images and language as well as many other areas. However, the accessibility of this technique for cheminformatics is hindered, as it is not readily available to non-experts and is currently not in any of the major cheminformatics tools. It is therefore our goal to develop a deep learning algorithm and toolkit which can be used standalone or integrated into new software being developed by us, such as the Open Science Data Repository (OSDR). We will show how classic machine learning (CML) methods (Naïve Bayes, logistic regression, Support Vector Machines, etc.) compare to cutting-edge deep learning, and discuss challenges associated with deep neural network (DNN) learning models. The open source scikit-learn (http://scikit-learn.org/stable/) Python library was used for building, tuning, and validating all CML models. The DNN learning models of different complexity (up to 6 hidden layers) were built and tuned (varying the number of hidden units per layer, activation functions, optimizers, dropout fraction, regularization parameters, and learning rate) using Keras (https://keras.io/), a deep learning library, with TensorFlow (www.tensorflow.org) as a backend. All the developed pipelines begin with stratified splitting of the input dataset into train (80%) and test (20%) sets. The receiver operating characteristic (ROC) curve and the area under the curve (AUC) were computed for each model for ADME/Tox and other physicochemical properties. DNN learning models were found to be very good at predicting activities and can outperform most of the CML models.
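The evaluation pipeline described above (stratified 80/20 split, then ROC AUC on the test set) can be sketched with scikit-learn. A small multilayer perceptron stands in here for the Keras/TensorFlow DNNs so the sketch stays self-contained, and the endpoint data are synthetic.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.metrics import roc_auc_score

# Synthetic ADME/Tox-style binary endpoint.
X, y = make_classification(n_samples=500, n_features=30, random_state=0)

# Stratified 80/20 train/test split, as in the pipelines described above.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Small multilayer network standing in for the Keras DNNs; alpha is the
# L2 regularization strength (one of the tuned parameters listed above).
dnn = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(64, 32), alpha=1e-3,
                  max_iter=500, random_state=0),
)
dnn.fit(X_tr, y_tr)

# ROC AUC on the held-out 20%.
auc = roc_auc_score(y_te, dnn.predict_proba(X_te)[:, 1])
print(f"test ROC AUC: {auc:.3f}")
```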
Living in a world of federated knowledge challenges, principles, tools and ...Valery Tkachenko
Over the years a multitude of chemical formats and approaches have been created to address various aspects of handling chemical information and building databases of chemical knowledge. As a result, the current state of this landscape is severely affected by the lack of well-accepted, community-recognized formats, protocols, metadata standards, validation routines and standards for handling, storing and representing chemical information; the lack of open toolkits that conform to the same standards; and the lack of platforms that allow interactive, collaborative work to solve all the above problems. While organizations such as the RDA and IUPAC, as well as some government agencies and institutes, are concerned and trying to address the problem, it remains a severe pain point. In this presentation we will talk about our experience building a federated knowledgebase called the Open Science Data Repository, which supports deposition of raw and structured chemical and analytical data in various formats, runs validation and standardization protocols, is built in a highly modular way that allows both its API and its components to be used in the cloud or deployed on premises behind firewalls, supports a variety of use cases including collaborative data curation, rich analytics and visualization, real-time machine learning, format conversion, and the preparation of depositions into PubChem and ChemSpider from a variety of sources, and fully supports the FAIR principles for research data.
Open chemistry registry and mapping platform based on open source cheminforma...Valery Tkachenko
The Open PHACTS project (openphacts.org) is a European initiative, constituting a public–private partnership to enable easier, cheaper and faster drug discovery. The project is supported by the Open PHACTS Foundation (www.openphactsfoundation.org) and funded by contributions from several pharmaceutical companies. As part of Open PHACTS, a 'Chemical Registration Service' was created to register chemicals of interest to the project, allowing compound linkage between data sets. A key concept is the support for 'scientific lenses', which allow hierarchical mapping of chemical entities, including supporting characteristics such as charge state, tautomerism and stereochemistry. Open PHACTS aggregated various databases, including ChEMBL, ChEBI, HMDB, DrugBank, PDB, MeSH, and WikiPathways. A new project builds on the Chemical Registration Service to establish an open chemistry registry and mapping service for general data set linkage. This expansion requires support for multiple cheminformatics formats, the conversion and mapping of various identifiers, harmonized but configurable standardization, validation of the chemical structures, and the creation of new identifiers, to produce scientific lenses, or 'link sets'. Furthermore, these identifiers will be related to the compounds' chemical names (IUPAC and trivial) and related chemical structures. This presentation will describe our ongoing work to create a fully open source, easy-to-install platform which supports the ideas introduced by the Open PHACTS project and expands them with community data including, for example, the data now available from the EPA CompTox Chemistry Dashboard (comptox.epa.gov). This new platform supports multiple chemical formats and provides identifier conversion and cross-validation between datasets. The project is completely based on open source cheminformatics toolkits and is available as a set of libraries, Docker images and a web frontend based on FAIR and Open Data principles.
The openness of this platform will allow for scientists to process their own datasets, and make them interoperable with other online chemical databases.
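One very simple stereochemistry-insensitive lens can be sketched by grouping records on the first (skeleton) block of the InChIKey, which ignores the stereochemistry layer. The dataset identifiers below are hypothetical; the InChIKeys are intended to be those of L-alanine, D-alanine and ethanol.

```python
from collections import defaultdict

# Hypothetical records from two datasets: (identifier, InChIKey).
records = [
    ("dataset_A:cmpd-1", "QNAYBMKLOCPYGJ-REOHCLBHSA-N"),  # L-alanine
    ("dataset_B:cmpd-9", "QNAYBMKLOCPYGJ-UWTATZPHSA-N"),  # D-alanine
    ("dataset_A:cmpd-2", "LFQSCWFLJHTTHZ-UHFFFAOYSA-N"),  # ethanol
]

# Under a stereochemistry-insensitive lens, entities sharing the first
# (skeleton) block of the InChIKey fall into the same link set.
linksets = defaultdict(list)
for identifier, inchikey in records:
    skeleton = inchikey.split("-")[0]
    linksets[skeleton].append(identifier)

for skeleton, members in sorted(linksets.items()):
    print(skeleton, members)
```

A stricter lens would simply group on the full InChIKey instead, keeping the two alanine enantiomers apart; the lens choice is a one-line change to the grouping key.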
Using the structured product labeling format to index versatile chemical dataValery Tkachenko
Structured Product Labeling (SPL) is a document markup standard approved by the Health Level Seven (HL7) standards organization and adopted by the FDA as a mechanism for exchanging product and facility information. Product information provided by companies in SPL format may be accessed from the FDA Online Label Repository (labels.fda.gov) and the National Library of Medicine DailyMed web site (dailymed.nlm.nih.gov). FDA also maintains and publishes SPL Indexing Files for Pharmacologic Class, Substance, Product Concept, Biological Drug Substance, and Billing Units. Data from the Indexing Files can be linked to data in both SPL resources and external resources via chemical and non-chemical identifiers. In this talk we will present the latest addition to SPL, which allows indexing data on proteins, polymers and structurally diverse substances. We will also discuss the potential value of SPL to the integration between public chemistry databases, especially those hosted by the United States Government.
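The identifier-based linkage described above can be illustrated on a deliberately simplified XML fragment. The element names below are schematic, not the real HL7 SPL schema; the UNII codes are, to the best of our knowledge, those of aspirin and acetaminophen.

```python
import xml.etree.ElementTree as ET

# Schematic fragment in the spirit of an SPL substance indexing file;
# element names are illustrative, not the actual HL7 SPL schema.
spl_fragment = """
<document>
  <substance>
    <code codeSystem="FDA-UNII" code="R16CO5Y76E"/>
    <name>aspirin</name>
  </substance>
  <substance>
    <code codeSystem="FDA-UNII" code="362O9ITL9D"/>
    <name>acetaminophen</name>
  </substance>
</document>
"""

# Build a mapping from the non-chemical identifier (UNII) to the name;
# such identifiers are the join points to external chemistry databases.
root = ET.fromstring(spl_fragment)
index = {
    s.find("code").attrib["code"]: s.findtext("name")
    for s in root.iter("substance")
}
print(index)
```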
Tools and approaches for data deposition into nanomaterial databases - Valery Tkachenko
Sustainable research progress in many scientific disciplines critically depends on the existence of robust specialized databases that integrate and structure all available experimental information in the respective fields. The need for such a reference database is especially critical for nanoscience and nanomaterial research, given the significant diversity of shapes, sizes, and properties of engineered nanomaterials and the difficulty of synthesizing engineered nanoparticles with controlled properties. The acquisition of data from public sources is inefficient, time consuming and limited in scope. Moreover, it is not clear where the resources would come from to support this activity on a perpetual basis. The NIH has recently announced its intention to provide special funds toward data deposition by experimental investigators through the 'data sharing plan' for each proposal. However, this points to a current weakness: all laboratories use different data collection approaches, each of which requires interpretation by the staff hosting the database. It would be far more efficient and useful if each investigator could work from a template with key terms that could be modified to add new or important additional data or parameters. We will discuss tools and approaches to facilitate collection and direct deposition of experimental data into the Nanomaterial Registry (https://www.nanomaterialregistry.org/), a versatile, semantically enriched, template-based platform for registering diverse data pertaining to nanomaterials research.
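The template idea argued for above can be made concrete with a small validator: a shared base template of required fields that each investigator extends with lab-specific parameters, so deposited records arrive machine-checkable rather than needing manual interpretation. Field names here are illustrative, not the actual Nanomaterial Registry schema.

```python
# Sketch: a minimal deposition-template check. A shared base template
# defines required fields and types; each lab may declare extra fields
# instead of inventing an ad-hoc spreadsheet layout.
# Field names are illustrative only.

BASE_TEMPLATE = {"material_name": str, "core_composition": str,
                 "mean_diameter_nm": float, "synthesis_method": str}

def validate(record, template=BASE_TEMPLATE, extra_fields=()):
    """Report missing, wrongly typed, and undeclared fields."""
    missing = [k for k in template if k not in record]
    bad_type = [k for k, t in template.items()
                if k in record and not isinstance(record[k], t)]
    unknown = [k for k in record
               if k not in template and k not in extra_fields]
    return {"missing": missing, "bad_type": bad_type, "unknown": unknown}

rec = {"material_name": "AuNP-15", "core_composition": "Au",
       "mean_diameter_nm": 15.2, "synthesis_method": "citrate reduction",
       "zeta_potential_mV": -32.0}  # lab-specific extension
print(validate(rec, extra_fields=("zeta_potential_mV",)))
# {'missing': [], 'bad_type': [], 'unknown': []}
```

A record that fails the check can be bounced back to the depositor automatically, rather than consuming curation staff time.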
Chemistry Validation and Standardization Platform v2.0 - Valery Tkachenko
In recent years there has been explosive growth in the number of public chemical databases available online, a number of these containing tens of millions of chemical structures. Examples include PubChem, ChemSpider and ChEMBL, and users of these databases have become increasingly aware of the issue of data quality associated with these public resources. Seamless integration and mapping between databases, even for some common chemicals, is challenged by differing approaches to chemical standardization prior to registration into a database. The lack of standards in representing and handling chemical information certainly contributes to aspects of this problem. The Chemistry Validation and Standardization Platform (CVSP), originally developed to support the European Innovative Medicines Initiative project known as OpenPHACTS, was developed with the intention of providing an open platform for processing and standardizing chemical compounds. The system has been used to process millions of chemical compounds for dissemination through public websites and, unlike other validation and standardization systems, provides support for both standard and custom rulesets. We will provide an overview of CVSP 2.0, the next generation of the platform, extending support to new cheminformatics toolkits and adding capabilities such as collaborative rules authoring.
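The standard-plus-custom ruleset design can be sketched as follows. Real CVSP rules operate on full chemical structures via cheminformatics toolkits; the string-level checks below are toy stand-ins meant only to show how a depositor's custom rules compose with a shared standard set.

```python
# Sketch: a standard ruleset plus depositor-supplied custom rules,
# each rule returning an issue message or None. The checks themselves
# are deliberately trivial string tests, not real structure validation.

def no_empty_structure(smiles):
    return "empty structure" if not smiles.strip() else None

def no_unbalanced_brackets(smiles):
    return ("unbalanced brackets"
            if smiles.count("[") != smiles.count("]") else None)

STANDARD_RULES = [no_empty_structure, no_unbalanced_brackets]

def validate(smiles, custom_rules=()):
    """Run standard rules first, then any custom rules; collect issues."""
    rules = list(STANDARD_RULES) + list(custom_rules)
    return [msg for rule in rules if (msg := rule(smiles)) is not None]

# A custom rule a depositor might add for their own collection:
def no_isotopes(smiles):
    return "isotope label present" if "[13C" in smiles else None

print(validate("CC(=O)Oc1ccccc1C(=O)O"))               # []
print(validate("[13CH4", custom_rules=(no_isotopes,)))
# ['unbalanced brackets', 'isotope label present']
```

Keeping rules as independent, composable functions is what makes collaborative rules authoring tractable: a new rule can be reviewed and added without touching the pipeline.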
Open Science Data Repository - the platform for materials research - Valery Tkachenko
Over the last few years we have seen tremendous growth in data repositories pushed and supported by funding bodies and various data preservation initiatives. As a result we now have a variety of scientific resources, combined into a broad network and indexed through directories such as BioSharing and re3data. This network, while growing quickly, is still in the early days of adopting semantic web standards and does not yet support deep data indexing and discoverability, let alone mechanisms for intellectual property protection, which at best amount to making data public or private. The lack of standards and well-defined models for describing the structure of scientific information further inhibits the free flow of information that is essential for scientific discovery. Not surprisingly, one of the most affected areas is materials science, where the inherent complexity of the field makes the situation even more severe. In this talk we present a chemistry information platform designed to support a variety of data formats along with metadata, sophisticated ways of collaborating, and secure data exchange. We will discuss the challenges we faced in developing such a platform as well as the solutions we came up with.
Opportunities in chemical structure standardization - Valery Tkachenko
This talk was given at EBI's Wellcome Trust Genome Campus and is dedicated to outlining problems with chemical information standardization and various efforts to tackle this problem.
zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex Proofs - Alex Pruden
This paper presents Reef, a system for generating publicly verifiable succinct non-interactive zero-knowledge proofs that a committed document matches or does not match a regular expression. We describe applications such as proving the strength of passwords, the provenance of email despite redactions, the validity of oblivious DNS queries, and the existence of mutations in DNA. Reef supports the Perl Compatible Regular Expression syntax, including wildcards, alternation, ranges, capture groups, Kleene star, negations, and lookarounds. Reef introduces a new type of automata, Skipping Alternating Finite Automata (SAFA), that skips irrelevant parts of a document when producing proofs without undermining soundness, and instantiates SAFA with a lookup argument. Our experimental evaluation confirms that Reef can generate proofs for documents with 32M characters; the proofs are small and cheap to verify (under a second).
Paper: https://eprint.iacr.org/2023/1886
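The skipping idea can be illustrated without any cryptography: for a pattern like ".*secret", a conventional automaton touches every character of the document, while a skipping matcher jumps straight to candidate match positions. The toy matcher below counts steps to make the difference visible; it is an illustration of the SAFA intuition, not Reef's construction.

```python
# Sketch: why skipping irrelevant parts of a document helps.
# Both matchers look for a literal substring (the core of ".*secret.*"),
# but count "steps" differently: the plain matcher advances one position
# at a time, the skipping matcher jumps between candidate positions.

def dfa_match(doc, literal):
    steps = 0
    for i in range(len(doc) - len(literal) + 1):
        steps += 1
        if doc[i:i + len(literal)] == literal:
            return True, steps
    return False, steps

def skipping_match(doc, literal):
    # One "step" per jump to the next occurrence of the first character.
    steps, i = 0, doc.find(literal[0])
    while i != -1:
        steps += 1
        if doc[i:i + len(literal)] == literal:
            return True, steps
        i = doc.find(literal[0], i + 1)
    return False, steps

doc = "x" * 10_000 + "secret"
print(dfa_match(doc, "secret")[1])       # 10001 steps
print(skipping_match(doc, "secret")[1])  # 1 step
```

In Reef the analogous saving shows up in proof generation cost: the prover only pays for the automaton transitions it actually takes, and the lookup argument keeps the skips sound.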
Threats to mobile devices are more prevalent and increasing in scope and complexity. Users of mobile devices want to take full advantage of the features available on those devices, but many features trade security for convenience and capability. This best-practices guide outlines steps users can take to better protect their personal devices and information.
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024 - Neo4j
Neha Bajwa, Vice President of Product Marketing, Neo4j
Join us as we explore breakthrough innovations enabled by interconnected data and AI. Discover firsthand how organizations use relationships in data to uncover contextual insights and solve our most pressing challenges – from optimizing supply chains, detecting fraud, and improving customer experiences to accelerating drug discoveries.
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo... - James Anderson
Effective Application Security in Software Delivery lifecycle using Deployment Firewall and DBOM
The modern software delivery process (or the CI/CD process) includes many tools, distributed teams, open-source code, and cloud platforms. Constant focus on speed to release software to market, along with the traditional slow and manual security checks has caused gaps in continuous security as an important piece in the software supply chain. Today organizations feel more susceptible to external and internal cyber threats due to the vast attack surface in their applications supply chain and the lack of end-to-end governance and risk management.
The software team must secure its software delivery process to avoid vulnerability and security breaches. This needs to be achieved with existing tool chains and without extensive rework of the delivery processes. This talk will present strategies and techniques for providing visibility into the true risk of the existing vulnerabilities, preventing the introduction of security issues in the software, resolving vulnerabilities in production environments quickly, and capturing the deployment bill of materials (DBOM).
Speakers:
Bob Boule
Robert Boule is a technology enthusiast with a passion for making things work, along with a knack for helping others understand how things work. He brings around 20 years of solution engineering experience in application security, software continuous delivery, and SaaS platforms, and is known for his dynamic presentations on CI/CD and application security integrated into the software delivery lifecycle.
Gopinath Rebala
Gopinath Rebala is the CTO of OpsMx, where he has overall responsibility for the machine learning and data processing architectures for Secure Software Delivery. Gopi also has a strong connection with our customers, leading design and architecture for strategic implementations. Gopi is a frequent speaker and well-known leader in continuous delivery and integrating security into software delivery.
Generative AI Deep Dive: Advancing from Proof of Concept to Production - Aggregage
Join Maher Hanafi, VP of Engineering at Betterworks, in this new session where he'll share a practical framework to transform Gen AI prototypes into impactful products! He'll delve into the complexities of data collection and management, model selection and optimization, and ensuring security, scalability, and responsible use.
The Art of the Pitch: WordPress Relationships and Sales - Laura Byrne
Clients don’t know what they don’t know. What web solutions are right for them? How does WordPress come into the picture? How do you make sure you understand scope and timeline? What do you do if something changes?
All these questions and more will be explored as we talk about matching clients’ needs with what your agency offers without pulling teeth or pulling your hair out. Practical tips, and strategies for successful relationship building that leads to closing the deal.
Communications Mining Series - Zero to Hero - Session 1 - DianaGray10
This session provides an introduction to UiPath Communication Mining, its importance, and a platform overview. You will acquire a good understanding of the phases in Communication Mining as we go over the platform with you. Topics covered:
• Communication Mining Overview
• Why is it important?
• How can it help today’s business and the benefits
• Phases in Communication Mining
• Demo on Platform overview
• Q/A
Removing Uninteresting Bytes in Software Fuzzing - Aftab Hussain
Imagine a world where software fuzzing, the process of mutating bytes in test seeds to uncover hidden and erroneous program behaviors, becomes faster and more effective. A lot depends on the initial seeds, which can significantly dictate the trajectory of a fuzzing campaign, particularly in terms of how long it takes to uncover interesting behaviour in your code. We introduce DIAR, a technique designed to speed up fuzzing campaigns by pinpointing and eliminating uninteresting bytes in the seeds. Picture this: instead of wasting valuable resources on meaningless mutations in large, bloated seeds, DIAR removes the unnecessary bytes, streamlining the entire process.
In this work, we equipped AFL, a popular fuzzer, with DIAR and examined two critical Linux libraries -- Libxml's xmllint, a tool for parsing XML documents, and Binutils' readelf, an essential debugging and security analysis command-line tool used to display detailed information about ELF (Executable and Linkable Format) files. Our preliminary results show that AFL+DIAR not only discovers new paths more quickly but also achieves higher coverage overall. This work thus showcases how starting with lean and optimized seeds can lead to faster, more comprehensive fuzzing campaigns -- and DIAR helps you find such seeds.
- These are slides of the talk given at IEEE International Conference on Software Testing Verification and Validation Workshop, ICSTW 2022.
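The heuristic at the heart of the technique can be sketched with a toy target: a byte is uninteresting if mutating it never changes the program's observed behaviour. The "program" and trial values below are illustrative; DIAR itself works against real coverage feedback from an instrumented binary.

```python
# Sketch of the DIAR intuition: probe each seed byte with a few trial
# values and mark it "dead" if the program's behaviour signature never
# changes. The toy program only inspects a 4-byte header, so every byte
# after the header is dead weight that a fuzzer would mutate in vain.

def toy_program(data):
    # Behaviour signature: which (toy) branches fire.
    return (data[:4] == b"XML!", len(data) > 4)

def uninteresting_bytes(seed, program, trials=(0x00, 0xFF)):
    """Indices whose mutation never changes the behaviour signature."""
    base = program(seed)
    dead = []
    for i in range(len(seed)):
        if all(program(seed[:i] + bytes([v]) + seed[i + 1:]) == base
               for v in trials):
            dead.append(i)
    return dead

seed = b"XML!" + b"padding-padding"
dead = uninteresting_bytes(seed, toy_program)
print(dead)  # every index past the 4-byte header
```

Dropping the dead bytes yields a leaner seed, so subsequent mutations concentrate on the bytes the program actually inspects.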
Maruthi Prithivirajan, Head of ASEAN & IN Solution Architecture, Neo4j
Get an inside look at the latest Neo4j innovations that enable relationship-driven intelligence at scale. Learn more about the newest cloud integrations and product enhancements that make Neo4j an essential choice for developers building apps with interconnected data and generative AI.
Sudheer Mechineni, Head of Application Frameworks, Standard Chartered Bank
Discover how Standard Chartered Bank harnessed the power of Neo4j to transform complex data access challenges into a dynamic, scalable graph database solution. This keynote will cover their journey from initial adoption to deploying a fully automated, enterprise-grade causal cluster, highlighting key strategies for modelling organisational changes and ensuring robust disaster recovery. Learn how these innovations have not only enhanced Standard Chartered Bank’s data infrastructure but also positioned them as pioneers in the banking sector’s adoption of graph technology.
Enhancing adoption of Open Source Libraries. A case study on Albumentations.AI - Vladimir Iglovikov, Ph.D.
Presented by Vladimir Iglovikov:
- https://www.linkedin.com/in/iglovikov/
- https://x.com/viglovikov
- https://www.instagram.com/ternaus/
This presentation delves into the journey of Albumentations.ai, a highly successful open-source library for data augmentation.
Created out of a necessity for superior performance in Kaggle competitions, Albumentations has grown to become a widely used tool among data scientists and machine learning practitioners.
This case study covers various aspects, including:
People: The contributors and community that have supported Albumentations.
Metrics: The success indicators such as downloads, daily active users, GitHub stars, and financial contributions.
Challenges: The hurdles in monetizing open-source projects and measuring user engagement.
Development Practices: Best practices for creating, maintaining, and scaling open-source libraries, including code hygiene, CI/CD, and fast iteration.
Community Building: Strategies for making adoption easy, iterating quickly, and fostering a vibrant, engaged community.
Marketing: Both online and offline marketing tactics, focusing on real, impactful interactions and collaborations.
Mental Health: Maintaining balance and not feeling pressured by user demands.
Key insights include the importance of automation, making the adoption process seamless, and leveraging offline interactions for marketing. The presentation also emphasizes the need for continuous small improvements and building a friendly, inclusive community that contributes to the project's growth.
Vladimir Iglovikov brings his extensive experience as a Kaggle Grandmaster and former Staff ML Engineer at Lyft, sharing valuable lessons and practical advice for anyone looking to enhance the adoption of their open-source projects.
Explore more about Albumentations and join the community at:
GitHub: https://github.com/albumentations-team/albumentations
Website: https://albumentations.ai/
LinkedIn: https://www.linkedin.com/company/100504475
Twitter: https://x.com/albumentations
A tale of scale & speed: How the US Navy is enabling software delivery from l... - sonjaschweigert1
Rapid and secure feature delivery is a goal across every application team and every branch of the DoD. The Navy’s DevSecOps platform, Party Barge, has achieved:
- Reduction in onboarding time from 5 weeks to 1 day
- Improved developer experience and productivity through actionable findings and reduction of false positives
- Maintenance of superior security standards and inherent policy enforcement with Authorization to Operate (ATO)
Development teams can ship efficiently and ensure applications are cyber ready for Navy Authorizing Officials (AOs). In this webinar, Sigma Defense and Anchore will give attendees a look behind the scenes and demo secure pipeline automation and security artifacts that speed up application ATO and time to production.
We will cover:
- How to remove silos in DevSecOps
- How to build efficient development pipeline roles and component templates
- How to deliver security artifacts that matter for ATO’s (SBOMs, vulnerability reports, and policy evidence)
- How to streamline operations with automated policy checks on container images
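An automated policy check of the kind listed above typically reduces to gating on a scanner's report. The sketch below assumes a simplified report shape (real SBOM and scanner output, e.g. from grype- or trivy-style tools, is much richer) and an illustrative severity policy.

```python
# Sketch: a policy gate over a container-image vulnerability report.
# The build fails when any finding meets or exceeds the configured
# severity threshold. Report shape and thresholds are illustrative.

SEVERITY_RANK = {"negligible": 0, "low": 1, "medium": 2,
                 "high": 3, "critical": 4}

def policy_check(report, fail_on="high"):
    """Return pass/fail plus the findings that violate the policy."""
    threshold = SEVERITY_RANK[fail_on]
    violations = [v for v in report["matches"]
                  if SEVERITY_RANK[v["severity"]] >= threshold]
    return {"passed": not violations, "violations": violations}

report = {"matches": [
    {"id": "CVE-2024-0001", "severity": "medium"},
    {"id": "CVE-2024-0002", "severity": "critical"},
]}
result = policy_check(report, fail_on="high")
print(result["passed"])                         # False
print([v["id"] for v in result["violations"]])  # ['CVE-2024-0002']
```

Running such a check in the pipeline on every image, and archiving the report alongside the SBOM, is what turns scan output into the kind of reusable security artifact an ATO review can consume.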
UiPath Test Automation using UiPath Test Suite series, part 6 - DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series, part 6. In this session, we will cover Test Automation with generative AI and OpenAI.
This webinar offers an in-depth exploration of leveraging cutting-edge technologies for test automation within the UiPath platform. Attendees will delve into integrating generative AI into a test automation solution using OpenAI's advanced natural language processing capabilities.
Throughout the session, participants will discover how this synergy empowers testers to automate repetitive tasks, enhance testing accuracy, and expedite the software testing life cycle. Topics covered include the seamless integration process, practical use cases, and the benefits of harnessing AI-driven automation for UiPath testing initiatives. By attending this webinar, testers and automation professionals can gain valuable insights into harnessing the power of AI to optimize their test automation workflows within the UiPath ecosystem, ultimately driving efficiency and quality in software development processes.
What will you get from this session?
1. Insights into integrating generative AI.
2. Understanding how this integration enhances test automation within the UiPath platform
3. Practical demonstrations
4. Exploration of real-world use cases illustrating the benefits of AI-driven test automation for UiPath
Topics covered:
What is generative AI?
Test Automation with generative AI and OpenAI
UiPath integration with generative AI
Speaker:
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
UiPath Test Automation using UiPath Test Suite series, part 5 - DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 5. In this session, we will cover CI/CD with devops.
Topics covered:
CI/CD within UiPath
End-to-end overview of CI/CD pipeline with Azure devops
Speaker:
Lyndsey Byblow, Test Suite Sales Engineer @ UiPath, Inc.
Big data supporting drug discovery - cautionary tales from the world of chemistry for translational informatics
1. Big Data Supporting Drug Discovery: Cautionary Tales from the World of Chemistry for Translational Informatics
Valery Tkachenko
RSC-CSIR/OSDD meeting, Pune, India, February 3rd, 2014
2. Big Data
Chemical Space
Drug Discovery pipeline
Machine learning
Training sets
RSC/ChemSpider platforms
RSC/Archive
Research data management
Data quality, crowdsourcing and AltMetrics
Building Global Chemistry Network
18.
• ~30 million chemicals and growing
• Data sourced from >500 different sources
• Crowdsourced curation and annotation
• Ongoing deposition of data from our journals and our collaborators
• A structure-centric hub for web-searching
39. Research data inflow (architecture diagram). A unified Deposition Gateway, fed through a web UI for depositions, APIs, FTP, DropBox/Google Drive/SkyDrive, and LabTrove and other templated data, routes compounds, reactions, spectra, materials, raw data and text-mining content through per-domain modules and staging databases; validated data lands in the Compounds, Reactions, Spectra, Materials, Documents and Articles/CSSP stores. All databases are sliced by data source/collection and have a simple security model in which each slice is private, public or embargoed.
40. Research data outflow (architecture diagram). A data tier of Compounds, Reactions, Spectra, Materials and Documents stores is exposed through a data access tier of per-domain APIs and a tier of reusable UI widget components. Example user-interface applications built on these include an Electronic Laboratory Notebook, an Analytical Laboratory application, a Chemical Inventory application, and paid third-party integrations on various platforms (SharePoint, Google, etc.).
45. It is so difficult to navigate…
IP? What’s the structure? Are they in our file? What’s similar? Pharmacology data? What’s the target? Known pathways? Competitors? Connections to disease? Working on now? Expressed in the right cell type?
46. Data quality issue and CVSP
– Robochemistry
– Proliferation of errors in public and
private databases
– Automated quality control system
49. Research data management (architecture diagram). Data hubs and workstations at universities and companies feed a central Data Repository offering indexed storage and chemically intelligent services; the repository in turn serves scientists, funding bodies, external clients, publishers and indexes.
53. RSC/Rewards and Recognition. The First Step badge is awarded when a user submits (and has published) their first CSSP article: "Congratulations! Your 1st CSSP article has been published. Philosopher Lao Tzu said 'A journey of a thousand miles begins with a single step'. In the same way we hope that this will be the first of many submissions that you make to CSSP."
54. Visualization and navigation
62. http://www.openphacts.org
Open PHACTS is an Innovative Medicines Initiative (IMI) project, aiming to reduce the barriers to drug discovery in industry, academia and for small businesses. The semantic web is one of its cornerstones.