This document discusses data integration challenges in a big data context using the Open PHACTS case study. Open PHACTS aims to integrate multiple biomedical data resources into a single, openly accessible platform. It has developed a cloud-based, production-level system that provides semantic-web-based APIs to access integrated data on diseases, tissues, targets, compounds and pathways. The system addresses issues such as identity resolution, data quality, provenance and licensing to enable complex queries across diverse data sources.
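To make the access route concrete, the sketch below shows how a client might call the platform's API in Python. It assumes the 2.x API base URL at beta.openphacts.org and an app_id/app_key pair issued on registration; treat the endpoint and parameter names as an illustration of the documented pattern, not a definitive reference.

```python
# A sketch of calling the Open PHACTS Linked Data API. The base URL,
# endpoint and parameter names follow the public 2.x API pattern; the
# app_id/app_key values are placeholders issued on registration.
import requests

BASE = "https://beta.openphacts.org/2.1"         # assumed API root
APP_ID, APP_KEY = "your-app-id", "your-app-key"  # placeholders

def compound_info(concept_uri):
    """Fetch the integrated record for a compound identified by URI."""
    resp = requests.get(
        f"{BASE}/compound",
        params={
            "uri": concept_uri,   # e.g. a ConceptWiki or ChEMBL URI
            "app_id": APP_ID,
            "app_key": APP_KEY,
            "_format": "json",
        },
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()
```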
Presentation given at the Open PHACTS project symposium.
The slides give an overview of the data in the 2.0 Open PHACTS drug discovery platform and the challenges that have been faced in the Open PHACTS project to reach this stage.
Data is being generated all around us – from our smartphones tracking our movement through a city to the city itself sensing various properties and reacting to various conditions. However, to maximise the potential of all this data, it needs to be combined and coerced into models that enable analysis and interpretation. In this talk I will give an overview of the techniques that I have developed for data integration: integrating streams of sensor data with background contextual data and supporting multiple interpretations when linking data together. At the end of the talk I will give an overview of the work I will be conducting in the Administrative Data Research Centre for Scotland.
The Synthetically Accessible Virtual Inventory (SAVI) project is an international collaboration between partners in government laboratories, small companies, not-for-profits, and large corporations to computationally generate a very large number of reliably and inexpensively synthesizable novel screening sample structures. SAVI does not handle reactions by simply applying SMIRKS transforms to a set of building blocks of unknown availability. Instead, it combines a set of transforms richly annotated with chemical context, drawn from, or newly developed in the mold of, the original LHASA project knowledge base, with a set of highly annotated, reliably available, purchasable starting materials. These components are tied together for SAVI product generation using the chemoinformatics toolkit CACTVS, extended with custom developments for this project. Each product is annotated with a number of computed properties seen as important in current drug design, including rules for identifying potentially reactive or promiscuous compounds. Having produced and made publicly available the first (beta) set of 283 million SAVI products annotated with proposed one-step syntheses, we will report on the second full production run, aimed at creating a database of one billion high-quality, easily synthesizable screening samples. We will present the current status and ongoing developments, as well as the scientific and technical challenges of the project.
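For contrast, here is a minimal sketch (Python with RDKit) of the plain "SMIRKS over building blocks" enumeration that the abstract says SAVI goes beyond; the amide-coupling transform and the building blocks are invented for illustration and are not SAVI transforms.

```python
# Naive library enumeration: apply one reaction SMARTS to all pairs of
# building blocks, with no check of availability or chemical context.
from rdkit import Chem
from rdkit.Chem import AllChem

# acid + amine -> amide; atom maps carry atoms into the product
amide = AllChem.ReactionFromSmarts(
    "[C:1](=[O:2])[OX2H1].[NX3;H2,H1:3]>>[C:1](=[O:2])[N:3]"
)

acids = [Chem.MolFromSmiles(s) for s in ("CC(=O)O", "OC(=O)c1ccccc1")]
amines = [Chem.MolFromSmiles(s) for s in ("NCC", "NC1CCCCC1")]

products = set()
for acid in acids:
    for amine in amines:
        for prods in amide.RunReactants((acid, amine)):
            products.add(Chem.MolToSmiles(prods[0]))  # canonical SMILES

print(sorted(products))
# SAVI additionally checks chemical context (interfering groups, likely
# yield) and restricts building blocks to reliably purchasable compounds.
```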
CINF 29: Visualization and manipulation of Matched Molecular Series for decis... - NextMove Software
ACS National Meeting Boston Fall 2015
A Matched Molecular Series (MMS) is a set of molecules that differ by substitution at the same scaffold location [1]. For two molecules, this is equivalent to a Matched Molecular Pair.
We present a graphical interface for querying a database of bioactivity or physicochemical property data using a matched series. Using the database, predictions are made with the Matsy method [2], which suggests which R groups will improve the particular property value of interest.
An interesting aspect of our approach is that the interface treats the distinct R groups attached to a particular scaffold as first-class entities that can be manipulated and rearranged to see the effect on the predictions. This makes it easy, for example, to compare predictions based simply on matched-pair information with those based on longer series.
References:
[1] Wawer, M.; Bajorath, J. J. Med. Chem. 2011, 54, 2944.
[2] O’Boyle, N.M.; Boström, J.; Sayle, R.A.; Gill, A. J. Med. Chem. 2014, 57, 2704.
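A simplified sketch of the Matsy-style lookup described above: given a query series of R groups ordered by increasing activity, find database series that show the same ordering and rank the remaining R groups by how often they beat the query's best. The data and scoring details are invented for illustration; see reference [2] for the actual method.

```python
from collections import Counter

# database: each series maps R-group SMILES -> measured activity (pIC50)
database = [
    {"[H]": 5.0, "C": 5.6, "OC": 6.1, "F": 6.8},
    {"C": 4.9, "OC": 5.5, "N": 5.2, "F": 6.0},
    {"[H]": 6.1, "C": 6.5, "OC": 7.0, "Cl": 6.2},
]

def suggest(query_order):
    """query_order: R groups sorted by increasing activity (worst..best)."""
    best = query_order[-1]
    hits, beats = Counter(), Counter()
    for series in database:
        if not all(r in series for r in query_order):
            continue
        acts = [series[r] for r in query_order]
        if acts != sorted(acts):          # ordering must match the query
            continue
        for r, act in series.items():
            if r in query_order:
                continue
            hits[r] += 1
            if act > series[best]:        # does R beat the current best?
                beats[r] += 1
    return sorted(((beats[r] / hits[r], r) for r in hits), reverse=True)

print(suggest(["[H]", "C", "OC"]))  # -> [(1.0, 'F'), (0.0, 'Cl')]
```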
ChemSpider was developed with the intention of aggregating and indexing available sources of chemical structures and their associated information into a single searchable repository, and making it available to everybody at no charge. There are many tens of chemical structure databases covering literature data, chemical vendor catalogs, molecular properties, environmental data, toxicity data, analytical data, etc., and no single way to search across them. Despite the diversity of databases available online, their inherent quality, accuracy and completeness is lacking in many regards. ChemSpider was established to provide a platform whereby the chemistry community could contribute to cleaning up the data, improving the quality of data online and expanding the information available to include data such as reaction syntheses, analytical data and experimental properties. ChemSpider has now grown into a database of well over 20 million chemical substances integrated with over 300 disparate data sources, many of them directly supporting the life sciences. This presentation will provide an overview of our efforts to improve the quality of data online, to provide a foundation for the semantic web for chemistry, and to provide access to a set of online tools and services supporting access to these data. I will also discuss how ChemSpider is being used to enhance semantic publishing in chemistry at the RSC.
This presentation highlights known challenges in the production of high-quality chemical databases and outlines recent efforts made to address these challenges. Specific examples will be provided illustrating these challenges within the U.S. Environmental Protection Agency (EPA) Computational Toxicology Program. This includes consolidating EPA’s ACToR and DSSTox databases, augmenting computed properties and list search features, and introducing quality metrics to assess confidence in chemical structure assignments across hundreds of thousands of chemical substance records. The past decade has seen enormous investments in the generation and release of data from studies of chemicals and their toxicological effects. There is, however, commonly little concern given to provenance and, more generally, to the quality of the data. The presentation will emphasize the importance of rigorous data review procedures, progress in web-based public access to accurate chemical data sets for use in predictive modeling, and the benefits that these efforts will deliver to toxicologists as they embrace the “Big Data” era.
This abstract does not necessarily represent the views of the U.S. Environmental Protection Agency.
CINF 170: Regioselectivity: An application of expert systems and ontologies t... - NextMove Software
Prediction is much harder than analysis. Consider hurricanes and tornadoes: it is much easier to follow the path of destruction by locating devastated neighborhoods than to forecast the paths of such weather systems in advance. Likewise for many chemical reactions, such as nitration (by refluxing with nitric acid and sulfuric acid): the appearance of one or more nitro groups indicates a nitration reaction, but predicting where on a non-trivial organic molecule this functional group will appear is a much harder challenge. In this sense, reaction analysis is much simpler than (either forward or retrosynthetic) synthesis planning.
NextMove Software's namerxn is an expert system for classifying reactions (from reaction SMILES, MDL connection tables or ChemDraw sketches), typically assigning each reaction instance to a leaf classification in the Royal Society of Chemistry's RXNO ontology. Such tools can be helpful in analysing the regioselectivity preferences of reactions.
This talk consists of two parts: a technical part describing recent algorithmic and methodological improvements to the namerxn software, including some of the more challenging of the 1000+ reaction types it currently identifies; and a scientific part investigating the regioselective preferences of some of these reactions.
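As a toy illustration of why analysis is the easier direction, the sketch below (Python with RDKit, emphatically not namerxn itself) flags a nitration simply by checking whether the product side has gained a nitro group; namerxn's real rule set is far richer.

```python
from rdkit import Chem

NITRO = Chem.MolFromSmarts("[N+](=O)[O-]")  # charged nitro group

def nitro_count(smiles):
    mol = Chem.MolFromSmiles(smiles)
    return len(mol.GetSubstructMatches(NITRO))

def is_nitration(rxn_smiles):
    """Classify by functional-group delta: reactants>agents>products."""
    reactants, _, products = rxn_smiles.split(">")
    gained = sum(map(nitro_count, products.split("."))) \
           - sum(map(nitro_count, reactants.split(".")))
    return gained > 0

# benzene -> nitrobenzene (reagents omitted for brevity)
print(is_nitration("c1ccccc1>>[O-][N+](=O)c1ccccc1"))  # True
```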
BDE SC1 Workshop 3 - Open PHACTS Pilot (Kiera McNeice) - BigData_Europe
Overview of Open PHACTS, the BDE Pilot project in SC1, presented at BDE SC1 Workshop 3, 13 December, 2017.
https://www.big-data-europe.eu/the-final-big-data-europe-workshop/
In this session we will explore how Google's Cloud services (CloudML, Vision, Genomics API) can be used to process genomic and phenotypic data and solve problems in healthcare and agriculture.
The information revolution has transformed many business sectors over the last decade and the pharmaceutical industry is no exception. Developments in scientific and information technologies have unleashed an avalanche of content on research scientists, who are struggling to access and filter it in an efficient manner. Furthermore, this domain has traditionally suffered from a lack of standards in how entities, processes and experimental results are described, leading to difficulties in determining whether results from two different sources can be reliably compared. The need to transform the way the life-science industry uses information has led to new thinking about how companies should work beyond their firewalls. In this talk we will provide an overview of the traditional approaches major pharmaceutical companies have taken to knowledge management and describe the business reasons why pre-competitive, cross-industry and public-private partnerships have gained much traction in recent years. We will consider the scientific challenges concerning the integration of biomedical knowledge, highlighting the complexities in representing everyday scientific objects in computerised form. This leads us to discuss how the semantic web might provide a long-overdue solution. The talk will be illustrated by focusing on the EU Open PHACTS initiative (openphacts.org), established to provide a unique public-private infrastructure for pharmaceutical discovery. We will describe the aims of this work and how technologies such as just-in-time identity resolution, nanopublication and interactive visualisations are helping to build a powerful software platform designed to appeal directly to scientific users across the public and private sectors.
Researchers at EPA’s National Center for Computational Toxicology integrate advances in biology, chemistry, and computer science to examine the toxicity of chemicals and help prioritize chemicals for further research based on potential human health risks. The goal of this research program is to quickly evaluate thousands of chemicals at much reduced cost and in a shorter time frame relative to traditional approaches. The data generated by the Center include characterization of thousands of chemicals across hundreds of high-throughput screening assays, consumer use and production information, pharmacokinetic properties, literature data, and physical-chemical properties, as well as predictive computational modeling of toxicity and exposure. We have developed a number of databases and applications to deliver the data to the public, academic community, industry stakeholders, and regulators. This presentation will provide an overview of our work to develop an architecture that integrates diverse large-scale data from the chemical and biological domains, our approaches to disseminate these data, and the delivery of models supporting predictive computational toxicology. In particular, this presentation will review our new publicly accessible CompTox Dashboard as the first application built on our newly developed architecture. This abstract does not reflect U.S. EPA policy.
With the unprecedented growth of chemical databases incorporating up to several hundred billion synthetically feasible chemicals, modelers face no shortage of chemicals to process. Importantly, such "Big Chemical Data" offers enormous opportunities for discovering novel bioactive molecules. However, the current generation of cheminformatics software tools is not capable of handling, characterizing, and processing such extremely large chemical libraries. In this presentation, we will discuss the rationale and the main challenges (theoretical and technical) of screening very large repositories of compounds in the current context of drug discovery. We will present several proof-of-concept studies on the screening of extremely large libraries (1+ billion compounds) using our novel GPU-accelerated cheminformatics platform to identify molecules with defined bioactivity. Overall, we will show that GPU computing represents an effective and inexpensive architecture for developing, employing, and validating a new generation of cheminformatics methods and tools ready to process billions of compounds.
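The kernel such platforms accelerate is mostly bulk similarity search. The NumPy sketch below shows the underlying computation, Tanimoto similarity over bit fingerprints, on the CPU; a GPU implementation parallelizes the same popcount arithmetic across thousands of threads. The fingerprints here are random stand-ins for real 1024-bit chemical fingerprints.

```python
import numpy as np

rng = np.random.default_rng(0)
library = rng.integers(0, 2, size=(100_000, 1024), dtype=np.uint8)
query = rng.integers(0, 2, size=1024, dtype=np.uint8)

def tanimoto(query, library):
    common = (library & query).sum(axis=1)     # |A AND B| per molecule
    total = library.sum(axis=1) + query.sum()  # |A| + |B|
    return common / (total - common)           # = |A AND B| / |A OR B|

scores = tanimoto(query, library)
print(scores.max(), int(scores.argmax()))      # best hit and its index
```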
How can you access PubChem programmatically? - Sunghwan Kim
Presented at the 255th American Chemical Society (ACS) National Meeting in New Orleans, LA (March 19, 2018).
Building automated workflows that exploit the vast amount of data contained in PubChem requires programmatic access to the data through application programming interfaces (APIs). PubChem provides several programmatic access routes to its data, including Entrez Utilities (E-Utilities or E-Utils), PubChem Power User Gateway (PUG), PUG-SOAP, PUG-REST, PUG-View, and a REST-ful interface to PubChemRDF. This presentation provides an overview of these programmatic access tools, including recent updates, limitations, usage policies, and best practices.
*References*
(1) PUG-SOAP and PUG-REST: web services for programmatic access to chemical information in PubChem, Nucleic Acids Research, 2015, 43(W1):W605–W611. https://doi.org/10.1093/nar/gkv396
(2) An update on PUG-REST: RESTful interface for programmatic access to PubChem, Nucleic Acids Research, 2018, 46(W1):gky294. https://doi.org/10.1093/nar/gky294
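As a concrete example of the PUG-REST URL pattern (input domain / identifier / operation / output format), the following Python snippet retrieves computed properties for aspirin; see the references above for the full interface description.

```python
import requests

# compound domain / name namespace / "aspirin" / property operation / JSON
url = ("https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/name/aspirin"
       "/property/MolecularFormula,MolecularWeight,CanonicalSMILES/JSON")
data = requests.get(url, timeout=30).json()
print(data["PropertyTable"]["Properties"][0])
# -> {'CID': 2244, 'MolecularFormula': 'C9H8O4', ...}
# Usage policy: no more than 5 requests per second; batch where possible.
```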
Domains such as drug discovery, data science, and policy studies increasingly rely on the combination of complex analysis pipelines with integrated data sources to come to conclusions. A key question then arises: what are these conclusions based upon? Thus, there is a tension between integrating data for analysis and understanding where that data comes from (its provenance). In this talk, I describe recent work that attempts to facilitate transparency by combining provenance tracked within databases with the data integration and analytics pipelines that feed them. I discuss this with respect to use cases from public policy as well as drug discovery.
Given at: http://ccct.uva.nl/content/ccct-seminar-21-february-2014
The EPA CompTox Chemistry Dashboard provides access to data associated with ~760,000 chemical substances. The available data includes experimental and predicted physicochemical properties, environmental fate and transport data, in vivo and in silico toxicity data, in vitro bioassay data, exposure data and a variety of other types of information. The data are under continuous expansion and curation and the experimental data have been used to develop QSAR and QSPR models. A number of these models are available via a web interface so that users can submit a chemical structure and predict properties in real time. The dashboard also provides access to pre-compiled chemical lists and categories, including pesticides, and chemicals detected in the environment via non-targeted mass spectrometry analysis. The data are searchable using chemical identifiers (systematic names, trade names, CAS Registry Numbers), by structure, mass and formula. Batch searches allow for data associated with thousands of chemicals to be obtained in a few seconds, with just a few button clicks, and downloaded to the desktop. This presentation will provide an overview of the Dashboard and its applications to accessing source data associated with agriculturally related chemicals. This abstract does not necessarily represent the views or policies of the U.S. Environmental Protection Agency.
Building an Information Infrastructure to Support Microbial Metagenomic Sciences - Larry Smarr
06.01.14
Presentation for the Microbe Project Interagency Team
Title: Building an Information Infrastructure to Support Microbial Metagenomic Sciences
La Jolla, CA
AI for All: Biology is eating the world & AI is eating Biology - Intel® Software
Advances in cell biology, and the immense amounts of data they create, are converging with advances in machine learning to analyze these data. Biology is experiencing its AI moment, driving the massive computation involved in understanding biological mechanisms and developing interventions. Learn how cutting-edge technologies such as Software Guard Extensions (SGX) in the latest Intel Xeon processors and Open Federated Learning (OpenFL), an open framework for federated learning developed by Intel, are helping advance AI in gene therapy, drug design, disease identification and more.
Using a Jupyter Notebook to perform a reproducible scientific analysis over s... - Alasdair Gray
In recent years there has been a reproducibility crisis in science. Computational notebooks, such as Jupyter, have been touted as one solution to this problem. However, when executing analyses over live SPARQL endpoints, we get different answers depending upon when the analysis in the notebook was executed. In this paper, we identify some of the issues discovered in trying to develop a reproducible analysis over a collection of biomedical data sources and suggest some best practices to overcome these issues.
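A minimal sketch of the problem in Python with SPARQLWrapper: the same cell, run on different days against a live endpoint (UniProt's public endpoint is used here purely for illustration), can return different counts, so recording the execution date alongside the result is one of the practices suggested.

```python
from datetime import date
from SPARQLWrapper import SPARQLWrapper, JSON

endpoint = SPARQLWrapper("https://sparql.uniprot.org/sparql")
endpoint.setQuery("""
    PREFIX up: <http://purl.uniprot.org/core/>
    SELECT (COUNT(?p) AS ?n) WHERE { ?p a up:Protein }
""")
endpoint.setReturnFormat(JSON)

result = endpoint.query().convert()
count = result["results"]["bindings"][0]["n"]["value"]
# record when the live data was queried, as provenance for the result
print(f"{date.today().isoformat()}: {count} proteins")
```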
Bioschemas Community: Developing profiles over Schema.org to make life scienc... - Alasdair Gray
The Bioschemas community (http://bioschemas.org) is a loose collaboration formed by a wide range of life science resource providers and informaticians. The community is developing profiles over Schema.org to enable life science resources such as data about a specific protein, sample, or training event, to be more discoverable on the web. While the content of well-known resources such as Uniprot (for protein data) are easily discoverable, there is a long tail of specialist resources that would benefit from embedding Schema.org markup in a standardised approach.
The community have developed twelve profiles for specific types of life science resources (http://bioschemas.org/specifications/), with another six at an early draft stage. For each profile, a set of use cases has been identified. These typically focus on search, but several facilitate lightweight data exchange to support data aggregators such as Identifiers.org, FAIRsharing.org, and BioSamples. The next stage of the development of a profile consists of mapping the terms used in the use cases to existing properties in Schema.org and domain ontologies. The properties are then prioritised in order to support the use cases, with a minimal set of about six properties identified, along with a larger set of recommended and optional properties. For each property, an expected cardinality is defined and, where appropriate, object values are specified from controlled vocabularies. Before a profile is finalised, it must first be demonstrated that resources can deploy the markup.
In this talk, we will outline the progress that has been made by the Bioschemas Community in a single year through three hackathon events. We will discuss the processes followed by the Bioschemas Community to foster collaboration, and highlight the benefits and drawbacks of using open Google documents and spreadsheets to support the community develop the profiles. We will conclude by summarising future opportunities and directions for the community.
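For illustration, the snippet below builds the kind of JSON-LD markup a protein page might embed in a script tag; the property selection is a guess at a minimal set, and the Protein type comes from the Bioschemas profile rather than core Schema.org.

```python
import json

markup = {
    "@context": "https://schema.org",
    "@type": "Protein",  # type defined by the Bioschemas Protein profile
    "@id": "https://www.uniprot.org/uniprot/P05067",
    "name": "Amyloid-beta precursor protein",
    "identifier": "P05067",
    "url": "https://www.uniprot.org/uniprot/P05067",
}

# Embedded in a web page as:
# <script type="application/ld+json"> ... </script>
print(json.dumps(markup, indent=2))
```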
Similar to Data Integration in a Big Data Context: An Open PHACTS Case Study
An Identifier Scheme for the Digitising Scotland Project - Alasdair Gray
The Digitising Scotland project is having the vital records of Scotland transcribed from images of the original handwritten civil registers. Linking the resulting dataset of 24 million vital records covering the lives of 18 million people is a major challenge requiring improved record linkage techniques. Discussions within the multidisciplinary, widely distributed Digitising Scotland project team have been hampered by the teams in each of the institutions using their own identification schemes. To enable fruitful discussions within the Digitising Scotland team, we required a mechanism for uniquely identifying each individual represented on the certificates. From the identifier it should be possible to determine the type of certificate and the role each person played. We have devised a protocol that generates a unique identifier for any individual on a certificate, without using a computer, by exploiting the National Records of Scotland's registration districts. Importantly, the approach does not rely on the handwritten content of the certificates, which reduces the risk of content being misread and an incorrect identifier produced. The resulting identifier scheme has improved the internal discussions within the project. This paper discusses the rationale behind the chosen identifier scheme and presents the format of the different identifiers. The work reported in the paper was supported by the British ESRC under grants ES/K00574X/1 (Digitising Scotland) and ES/L007487/1 (Administrative Data Research Center - Scotland).
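As an illustration only (the paper defines the actual format), an identifier along the lines described might be composed as follows; the field layout below is an assumption, not the project's scheme.

```python
# Hypothetical identifier composed from registration district, year,
# certificate type, entry number, and the person's role, so it can be
# written down by hand without reading any handwritten certificate content.
def certificate_person_id(district, year, entry, cert_type, role):
    """e.g. cert_type in {'B', 'M', 'D'}; role in {'child', 'mother', ...}."""
    return f"{cert_type}/{district}/{year}/{entry}/{role}"

print(certificate_person_id("644-1", 1881, 123, "B", "mother"))
# -> B/644-1/1881/123/mother
```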
Supporting Dataset Descriptions in the Life Sciences - Alasdair Gray
Machine processable descriptions of datasets can help make data more FAIR; that is Findable, Accessible, Interoperable, and Reusable. However, there are a variety of metadata profiles for describing datasets, some specific to the life sciences and others more generic in their focus. Each profile has its own set of properties and requirements as to which must be provided and which are more optional. Developing a dataset description for a given dataset to conform to a specific metadata profile is a challenging process.
In this talk, I will give an overview of some of the dataset description specifications that are available. I will discuss the difficulties in writing a dataset description that conforms to a profile, and the tooling that I've developed to support dataset publishers in creating metadata descriptions and validating them against a chosen specification.
Seminar talk given at the EBI on 5 April 2017
Tutorial: Describing Datasets with the Health Care and Life Sciences Communit... - Alasdair Gray
Access to consistent, high-quality metadata is critical to finding, understanding, and reusing scientific data. However, while there are many relevant vocabularies for the annotation of a dataset, none sufficiently captures all the necessary metadata. This prevents uniform indexing and querying of dataset repositories. Towards providing a practical guide for producing a high quality description of biomedical datasets, the W3C Semantic Web for Health Care and the Life Sciences Interest Group (HCLSIG) identified Resource Description Framework (RDF) vocabularies that could be used to specify common metadata elements and their value sets. The resulting HCLS community profile covers elements of description, identification, attribution, versioning, provenance, and content summarization. The HCLS community profile reuses existing vocabularies, and is intended to meet key functional requirements including indexing, discovery, exchange, query, and retrieval of datasets, thereby enabling the publication of FAIR data. The resulting metadata profile is generic and could be used by other domains with an interest in providing machine readable descriptions of versioned datasets. The goal of this tutorial is to explain elements of the HCLS community profile and to enable users to craft and validate descriptions for datasets of interest.
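A fragment of what such a description might look like, built with Python's rdflib using the Dublin Core, DCAT, and PAV vocabularies the profile draws on; the dataset URIs and literal values below are placeholders, not part of the specification.

```python
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import DCAT, DCTERMS, RDF

PAV = Namespace("http://purl.org/pav/")

g = Graph()
version = URIRef("http://example.org/dataset/chembl/20")  # placeholder URI
g.add((version, RDF.type, DCAT.Dataset))
g.add((version, DCTERMS.title, Literal("ChEMBL", lang="en")))
g.add((version, DCTERMS.license,
       URIRef("https://creativecommons.org/licenses/by-sa/3.0/")))
g.add((version, PAV.version, Literal("20")))            # version-level tier
g.add((version, PAV.previousVersion,
       URIRef("http://example.org/dataset/chembl/19")))

print(g.serialize(format="turtle"))
```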
Validata: A tool for testing profile conformance - Alasdair Gray
Validata (http://hw-swel.github.io/Validata/) is an online web application for validating a dataset description expressed in RDF against a community profile expressed as a Shape Expression (ShEx). Additionally it provides an API for programmatic access to the validator. Validata is capable of being used for multiple community agreed standards, e.g. DCAT, the HCLS community profile, or the Open PHACTS guidelines, and there are currently deployments to support each of these. Validata can be easily repurposed for different deployments by providing it with a new ShEx schema. The Validata code is available from GitHub (https://github.com/HW-SWeL/Validata).
Presentation given at SDSVoc https://www.w3.org/2016/11/sdsvoc
The HCLS Community Profile: Describing Datasets, Versions, and Distributions - Alasdair Gray
Access to consistent, high-quality metadata is critical to finding, understanding, and reusing scientific data. However, while there are many relevant vocabularies for the annotation of a dataset, none sufficiently captures all the necessary metadata. This prevents uniform indexing and querying of dataset repositories. Towards providing a practical guide for producing a high quality description of biomedical datasets, the W3C Semantic Web for Health Care and the Life Sciences Interest Group (HCLSIG) identified Resource Description Framework (RDF) vocabularies that could be used to specify common metadata elements and their value sets. The resulting HCLS community profile covers elements of description, identification, attribution, versioning, provenance, and content summarization. The HCLS community profile reuses existing vocabularies, and is intended to meet key functional requirements including indexing, discovery, exchange, query, and retrieval of datasets, thereby enabling the publication of FAIR data. The resulting metadata profile is generic and could be used by other domains with an interest in providing machine readable descriptions of versioned datasets.
The goal of this presentation is to give an overview of the HCLS Community Profile and explain how it extends and builds upon other approaches.
Presentation given at SDSVoc (https://www.w3.org/2016/11/sdsvoc/)
This presentation was prepared for my faculty Christmas conference.
Abstract: For the last 11 months I have been working on a top secret project with a world renowned Scandinavian industry partner. We are now moving into the exciting operational phase of this project. I have been granted an early lifting of the embargo that has stopped me talking about this work up until now. I will talk about the data science behind this big data project and how semantic web technology has enabled the delivery of Project X.
Many areas of scientific discovery rely on combining data from multiple data sources. However, there are many challenges in linking data. This presentation highlights these challenges in the context of using Linked Data for environmental and social science databases.
Scientific lenses to support multiple views over linked chemistry data - Alasdair Gray
When are two entries about a small molecule in different datasets the same? If they have the same drug name, chemical structure, or some other criteria? The choice depends upon the application to which the data will be put. However, existing Linked Data approaches provide a single global view over the data with no way of varying the notion of equivalence to be applied.
In this paper, we present an approach that enables applications to choose the equivalence criteria to apply between datasets, thus supporting multiple dynamic views over the Linked Data. For chemical data, we show that multiple sets of links can be automatically generated according to different equivalence criteria and published with semantic descriptions capturing their context and interpretation. This approach has been applied within a large-scale public-private data integration platform for drug discovery. To cater for different use cases, the platform allows the application of different lenses which vary the equivalence rules to be applied based on the context and interpretation of the links.
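A minimal sketch of the lens idea: the same pair of datasets carries several link sets, each generated under a different equivalence criterion, and the application activates one at query time. The link sets and criteria below are invented for illustration.

```python
linksets = {
    # strict: identical InChI (same structure, charge, stereochemistry)
    "structure": {("chembl:123", "drugbank:DB01")},
    # looser: same molecular skeleton, ignoring charge/stereo differences
    "parent-compound": {("chembl:123", "drugbank:DB01"),
                        ("chembl:124", "drugbank:DB01")},
    # loosest: same preferred drug name
    "drug-name": {("chembl:123", "drugbank:DB01"),
                  ("chembl:125", "drugbank:DB01")},
}

def equivalents(uri, lens):
    """All URIs co-referent with `uri` under the chosen lens."""
    pairs = linksets[lens]
    return {b for a, b in pairs if a == uri} | \
           {a for a, b in pairs if b == uri} | {uri}

print(equivalents("drugbank:DB01", "parent-compound"))
```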
Scientific Lenses over Linked Data: An approach to support multiple integrate... - Alasdair Gray
When are two entries about a concept in different datasets the same? If they have the same name, properties, or some other criteria? The choice depends upon the application to which the data will be put. However, existing Linked Data approaches provide a single global view over the data with no way of varying the notion of equivalence to be applied.
In this presentation, I will introduce Scientific lenses, an approach that enables applications to vary the equivalence conditions between linked datasets. They have been deployed in the Open PHACTS Discovery Platform – a large scale data integration platform for drug discovery. To cater for different use cases, the platform allows the application of different lenses which vary the equivalence rules to be applied based on the context and interpretation of the links.
Describing Scientific Datasets: The HCLS Community Profile - Alasdair Gray
Big Data presents an exciting opportunity to pursue large-scale analyses over collections of data in order to uncover valuable insights across a myriad of fields and disciplines. Yet, as more and more data is made available, researchers are finding it increasingly difficult to discover and reuse these data. One problem is that data are insufficiently described to understand what they are or how they were produced. A second issue is that no single vocabulary provides all key metadata fields required to support basic scientific use cases. A third issue is that data catalogs and data repositories all use different metadata standards, if they use any standard at all, and this prevents easy search and aggregation of data. Therefore, we need a community profile to indicate what are the essential metadata, and the manner in which we can express it.
The W3C Health Care and Life Sciences Interest Group have developed such a community profile that defines the required properties to provide high-quality dataset descriptions that support finding, understanding, and reusing scientific data, i.e. making the data FAIR (Findable, Accessible, Interoperable and Re-usable – http://datafairport.org). The specification reuses many notions and vocabulary terms from Dublin Core, DCAT and VoID, with provenance and versioning information being provided by PROV-O and PAV. The community profile is based around a three tier model; the summary description captures catalogue style metadata about the dataset, each version of the dataset is described separately as are the various distribution formats of these versions. The resulting community profile is generic and applicable to a wide variety of scientific data.
Tools are being developed to help with the creation and validation of these descriptions. Several datasets including those from Bio2RDF, EBI and IntegBio are already moving to release descriptions conforming to the community profile.
SensorBench is a benchmark suite for wireless sensor networks. The design of wireless sensor network systems sits within a multi-dimensional design space, where it can be difficult to understand the implications of specific decisions and to identify optimal solutions. SensorBench enables the systematic analysis and comparison of different techniques and platforms, enabling both development and user communities to make well informed choices. The benchmark identifies key variables and performance metrics, and specifies experiments that explore how different types of task perform under different metrics for the controlled variables. The benchmark is demonstrated by its application on representative platforms.
Full details of the benchmark are available from http://dl.acm.org/citation.cfm?id=2618252 (DOI: 10.1145/2618243.2618252)
What are the research and technical challenges of linked data that are relevant to data science?
This presentation introduces the ideas of linked data using the BBC sport web site as an example. It then identifies several research challenges that remain to be addressed.
Dataset Descriptions in Open PHACTS and HCLS - Alasdair Gray
This presentation gives an overview of the dataset description specification developed in the Open PHACTS project (http://www.openphacts.org/). The creation of the specification was driven by a real need within the project to track the datasets used.
Details of the dataset metadata captured and the vocabularies used to model this metadata are given together with the tools developed to enable the specification's uptake.
Over the course of the last 12 months, the W3C Healthcare and Life Science Interest Group have been developing a community profile for dataset descriptions. This has drawn on the ideas developed in the Open PHACTS specification. A brief overview of the forthcoming community profile is given in the presentation.
This presentation was given to the Network Data Exchange project http://www.ndexbio.org/ on 2 April 2014.
Computing Identity Co-Reference Across Drug Discovery Datasets - Alasdair Gray
This paper presents the rules used within the Open PHACTS (http://www.openphacts.org) Identity Management Service to compute co-reference chains across multiple datasets. The web of (linked) data has encouraged a proliferation of identifiers for the concepts captured in datasets, with each dataset using its own identifiers. A key data integration challenge is linking the co-referent identifiers, i.e. identifying and linking the equivalent concept in every dataset. Exacerbating this challenge, the datasets model the data differently, so when is one representation truly the same as another? Finally, different users have their own task- and domain-specific notions of equivalence that are driven by their operational knowledge. Consumers of the data need to be able to choose the notion of operational equivalence to be applied in the context of their application. We highlight the challenges of automatically computing co-references and the need for capturing the context of the equivalence. This context is then used to control the co-reference computation. Ultimately, the context will enable data consumers to decide which co-references to include in their applications.
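Computing co-reference chains from pairwise links amounts to a transitive closure over the selected link sets, which a union-find pass captures compactly; the context (lens) then controls which link sets are merged. The sketch below is illustrative and not the Identity Management Service's actual implementation.

```python
from itertools import chain

def coreference_chains(linksets):
    """Merge pairwise links from the chosen link sets into chains."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:            # path-halving union-find
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in chain.from_iterable(linksets):
        parent[find(a)] = find(b)        # union the two chains

    groups = {}
    for x in parent:
        groups.setdefault(find(x), set()).add(x)
    return list(groups.values())

exact = [("chembl:123", "ops:1"), ("ops:1", "drugbank:DB01")]
by_name = [("drugbank:DB01", "cw:aspirin")]
print(coreference_chains([exact, by_name]))  # one chain of four identifiers
```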
Incorporating Commercial and Private Data into an Open Linked Data Platform f... - Alasdair Gray
The Open PHACTS Discovery Platform aims to provide an integrated information space to advance pharmacological research in the area of drug discovery. Effective drug discovery requires comprehensive data coverage, i.e. integrating all available sources of pharmacology data. While many relevant data sources are available on the linked open data cloud, their content needs to be combined with that of commercial datasets and the licensing of these commercial datasets respected when providing access to the data. Additionally, pharmaceutical companies have built up their own extensive private data collections that they require to be included in their pharmacological dataspace. In this paper we discuss the challenges of incorporating private and commercial data into a linked dataspace: focusing on the modelling of these datasets and their interlinking. We also present the graph-based access control mechanism that ensures commercial and private datasets are only available to authorized users.
http://link.springer.com/chapter/10.1007/978-3-642-41338-4_5
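A sketch of the graph-based access control idea: each dataset is loaded into its own named graph, and a user's SPARQL query is scoped to the graphs their licences cover. The graph URIs and licence grouping below are invented for illustration.

```python
PUBLIC = {"http://example.org/graph/chembl",
          "http://example.org/graph/drugbank"}
# commercial licence holders additionally see the licensed vendor graph
COMMERCIAL = PUBLIC | {"http://example.org/graph/vendor-catalog"}

def scoped_query(select_body, user_graphs):
    """Restrict a SELECT query to the named graphs a user may read."""
    from_clauses = "\n".join(f"FROM <{g}>" for g in sorted(user_graphs))
    return f"SELECT *\n{from_clauses}\nWHERE {{ {select_body} }}"

print(scoped_query("?s ?p ?o", PUBLIC))
```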
Including Co-Referent URIs in a SPARQL Query - Alasdair Gray
Linked data relies on instance level links between potentially differing representations of concepts in multiple datasets. However, in large complex domains, such as pharmacology, the inter-relationship of data instances needs to consider the context (e.g. task, role) of the user and the assumptions they want to apply to the data. Such context is not taken into account in most linked data integration procedures. In this paper we argue that dataset links should be stored in a stand-off fashion, thus enabling different assumptions to be applied to the data links during query execution. We present the infrastructure developed for the Open PHACTS Discovery Platform to enable this and show through evaluation that the incurred performance cost is below the threshold of user perception.
http://ceur-ws.org/Vol-1034/BrenninkmeijerEtAl_COLD2013.pdf
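One way to apply stand-off links at query time, sketched below, is to expand the query URI into its co-referent set under the active lens and inject the set with a SPARQL VALUES clause, leaving the stored links untouched; the vocabulary and URIs are placeholders.

```python
def expand_query(uri, equivalents):
    """equivalents: function returning the co-referent URI set for a lens."""
    uris = " ".join(f"<{u}>" for u in sorted(equivalents(uri)))
    return f"""
PREFIX ex: <http://example.org/vocab/>
SELECT ?compound ?assay ?activity WHERE {{
  VALUES ?compound {{ {uris} }}
  ?activity ex:forCompound ?compound ;
            ex:inAssay ?assay .
}}"""

print(expand_query("http://example.org/chembl/123",
                   lambda u: {u, "http://example.org/drugbank/DB01"}))
```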
Data Integration in a Big Data Context: An Open PHACTS Case Study
1. Data Integration in a Big Data Context: Open PHACTS Case Study
Alasdair J G Gray
A.J.G.Gray@hw.ac.uk
alasdairjggray.co.uk
@gray_alasdair
2. Big Data
@gray_alasdair Big Data Integration 2
Volume, Velocity, Variety, Veracity
http://i.kinja-img.com/gawker-media/image/upload/lvzm0afp8kik5dctxiya.jpg
3. Open PHACTS Use Case
“Let me compare MW, logP and PSA for launched inhibitors of human & mouse oxidoreductases”
Chemical Properties (ChemSpider)
Launched drugs (DrugBank)
Human => Mouse (Homologene)
Protein Families (Enzyme)
Bioactivity Data (ChEMBL)
… other info (UniProt/Entrez etc.)
@gray_alasdair Big Data Integration 3
4. Open PHACTS Mission: Integrate Multiple Research Biomedical Data Resources Into A Single Open & Free Access Point
@gray_alasdair Big Data Integration 4
8. OPS Discovery Platform
@gray_alasdair Big Data Integration 8
[Diagram: Apps make method calls to a Domain API and receive interactive responses; the Drug Discovery Platform beneath is a production-quality integration platform built on standard web technologies.]
9. App Ecosystem
@gray_alasdair Big Data Integration 9
An “App Store”?
Explorer Explorer2 ChemBioNavigator Target Dossier Pharmatrek Helium
MOE Collector Cytophacts Utopia Garfield SciBite
KNIME Mol. Data Sheets PipelinePilot scinav.it Taverna
https://www.openphacts.org/2/sci/apps.html
13. API Hits
@gray_alasdair Big Data Integration 13
[Chart: No. of hits (millions) per month, Jan 2013 to June 2015. Annotations mark the public launch of the 1.2 API and the subsequent 1.3, 1.4, and 1.5 API releases.]
14. OPS Discovery Platform
@gray_alasdair Big Data Integration 14
[Architecture diagram: Apps call a Linked Data API (RDF/XML, TTL, JSON) exposing domain-specific services. The core platform comprises a semantic workflow engine, an Identity Resolution Service, an Identifier Management Service, chemistry registration with normalisation & Q/C, indexing, and a data cache (Virtuoso triple store). Public content, commercial data, public ontologies, and user annotations are each described by VoID and nanopublication databases. Example co-referent identifiers: P12374, EC2.43.4, CS4532, “Adenosine receptor 2a”.]
16. Data Licensing (Or Lack Of!)
John Wilbanks consulted for us: a framework built around standard, well-understood Creative Commons licences and how they interoperate.
Deal with the problems by:
Interoperable licences
Appropriate terms
Declare expectations to users and data publishers
One size won't fit all requirements
20. Identity Mapping
@gray_alasdair Big Data Integration 20
[Diagram: the same entity carries different identifiers in different sources, e.g. P12047, X31045, GB:29384.]
Andy Law's Third Law: “The number of unique identifiers assigned to an individual is never less than the number of Institutions involved in the study”
http://bioinformatics.roslin.ac.uk/lawslaws/
22. Gleevec®: Imatinib Mesylate
@gray_alasdair Big Data Integration 22
[Diagram: the same drug looked up in ChemSpider, Drugbank, and PubChem; records labelled “Imatinib” / “Imatinib Mesylate”, with InChIKey YLMAHDNUQAMNNX-UHFFFAOYSA-N.]
Are these records the same?
It depends upon your task!
23. Structure Lens
@gray_alasdair Big Data Integration 23
[Diagram: lenses sit on a spectrum from strict (analysing) to relaxed (browsing); the Structure Lens is at the strict end, following only skos:exactMatch links justified by InChI.]
“I need to perform an analysis, give me details of the active compound in Gleevec.”
24. Name Lens
@gray_alasdair Big Data Integration 24
[Diagram: the Name Lens sits towards the relaxed end, additionally following skos:closeMatch links justified by drug name alongside skos:exactMatch (InChI).]
“Which targets are known to interact with Gleevec?”
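In query terms, a lens determines which link predicates are followed when expanding an identifier. A rough sketch with invented example.org URIs (two alternative queries, not the platform's actual API): the strict Structure Lens traverses only skos:exactMatch links, while the relaxed Name Lens also traverses skos:closeMatch.

# Structure Lens (strict): follow only InChI-justified exactMatch links.
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
SELECT ?record WHERE {
  <http://example.org/drug/gleevec> skos:exactMatch+ ?record .
}

# Name Lens (relaxed): additionally follow drug-name closeMatch links.
SELECT ?record WHERE {
  <http://example.org/drug/gleevec> (skos:exactMatch|skos:closeMatch)+ ?record .
}

Because the links are stored stand-off, swapping lenses only changes the property path; neither the cached data nor the rest of the query is touched.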
29. Open PHACTS Approach
1. Know your audience: web developers
2. Understand your use cases: prioritised business questions
3. Identify access pathways: identify data, identify connections, implement API
@gray_alasdair Big Data Integration 31
30. Questions
Alasdair J G Gray
A.J.G.Gray@hw.ac.uk
alasdairjggray.co.uk
@gray_alasdair
Open PHACTS
contact@openphacts.org
openphacts.org
@open_phacts
@gray_alasdair Big Data Integration 32
Editor's Notes
Deriving value from the data
Volume: More data than you can process – relative term; complexity of processing
Velocity: Data constantly being generated
Variety: Multiple sources, formats, models
Veracity: Accuracy of the data
Open PHACTS has not dealt with Velocity, although it is a challenge for us
1 of 83 business driver questions
Took a team of 5 experienced researchers 6 hours to manually gather the answer
At the start of the project the question could not be answered by a computer system
Six months in, the prototype answered it in 30 seconds
Now subsecond
Pharma companies are all accessing, processing, storing & re-processing external research data; a big waste of resources
No competitive advantage
OPS: 29 partners including many major pharma
83 questions ranked and top 20 taken as target
18 of top 20
A platform for integrated pharmacology data
Relied upon by pharma companies
Public domain, commercial, and private data sources
Provides domain specific API
Making it easy to build multiple drug discovery applications: examples developed in the project
Not just in-house apps
Actively being used for different purposes
Public launch April 2013
Averaging 20 million hits a month from the start of 2015
38 million in the last 30 days
Heavy usage from pharma, academia, and biotech
500+ registered users
Import data into cache
Integration approach
Data kept in original model but cached centrally
API call translated to SPARQL query
Query expressed in terms of original data
Queries expanded by IMS to cover URIs of original datasets
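As a hedged illustration of that expansion step (all URIs invented): the API's SPARQL query is phrased against a single compound URI, and the IMS rewrites it to range over the co-referent URIs from the other cached datasets.

# After IMS expansion, the single compound URI becomes a VALUES block
# listing its co-referent URIs (hypothetical examples below).
SELECT ?property ?value WHERE {
  VALUES ?compound {
    <http://example.org/chemspider/CS4532>
    <http://example.org/drugbank/DB00619>
    <http://example.org/pubchem/CID123>
  }
  ?compound ?property ?value .
}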
Data provided by many publishers
Originally in many formats: relational, SD files and RDF
Worked closely with publishers
Data licensing was a major issue
Over 3 billion triples – 12 datasets
Hosted on beefy hardware; data in memory (aim)
Extensive memcaching
Pose complex queries to extract data
Interactions needed to satisfy use cases
Gradually added additional types of data and interactions
No standard units
Even in curated sources!
Feedback issues to data providers
Validation & Standardization Platform
Developed by Royal Society of Chemistry
http://bit.ly/NZF5VB
Example drug: Gleevec, a cancer drug for leukemia
Looking it up in three popular public chemical databases gives different results
Chemistry is complicated, often simplified for convenience
Data is messy!
Are these records the same? It depends on what you are doing with the data!
Each captures a subtly different view of the world
Interested in the physicochemical properties of Gleevec
Interested in biomedical and pharmacological properties
sameAs != sameAs: it depends on your point of view
Links relate individual data instances: source, target, predicate, reason.
Links are grouped into linksets, each with a VoID header providing provenance and a justification for the links.
Open for anybody
API grouped into theme areas
Two-phase interaction:
Resolve thing to identifier
Retrieve data about the identifier
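In SPARQL terms the two phases might look like this (a sketch only; the real API wraps these behind separate method calls, and the URIs and labels are invented):

# Phase 1: resolve a thing (here, a name) to an identifier.
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
SELECT ?concept WHERE {
  ?concept skos:prefLabel "Gleevec"@en .
}

# Phase 2: retrieve data about the resolved identifier
# (issued as a second, separate query).
DESCRIBE <http://example.org/drug/gleevec>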