Living in a world of federated knowledge:
Challenges, principles, tools and solutions
Fall ACS 2017, Washington, DC
Rick Zakharov¹, Valery Tkachenko¹
¹ Science Data Software, Rockville, MD, United States
We live in a hyperconnected world
Data repositories
Dimensions and complexity of scientific data
Standards and authorities
Traditional data – relational
Chemical data[base]
Why is it so hard to…
● Competitors?
● What’s the structure?
● Are they in our file?
● What’s similar?
● What’s the target?
● Pharmacology data?
● Known pathways?
● Working on now?
● Connections to disease?
● Expressed in right cell type?
● IP?
Big Data Integration
OpenPHACTS
FAIR Data Principles
Virtual Standard FAIR Data Bus
[Diagram labels: Other Registries (three instances), Data]
Open Data Science Platform
[Diagram labels: Data Lake, Social Media, Electronic Notebooks, Databases, Sensor, Med Dev, IoT, Curated Repository, Models, Curation & Integration, Validation, Decision Support, Analysis & Modeling, Mining, Users]
Model-driven experimental studies
Organize your data in a natural way
● Now-natural folder structure
● Organize your data into collections
● You can download anything to your local drive, as long as the security context allows
Chemical processing
● Support for chemical formats
● Chemistry validation and standardization
● Automatic processing and visualization
OSDR - documents
• Integrated text-mining
Other formats
Convert between formats
● Integrated format transformation
● 50+ data formats
OSDR - mapping and conversion
OSDR - import
OSDR - export
Predefined or custom metadata
[Diagram labels: Tagging, Attributes, Taxonomies, Ontologies, Metadata Harvesting, Industry Standards, Metadata]
Collaborative data authoring and curation
● DataCite (datacite.org) support
● Other formats
● Audit trail
● Notifications
Extensive search options
● Search language
● Elasticsearch technology
● Domain-specific search modules
● Search ranking
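The slide names Elasticsearch as the search technology. As a rough sketch of what a combined full-text and filtered search could look like, the snippet below builds a request body in the Elasticsearch query DSL (`bool`, `match`, `term`). The function name and the field names `name` and `tags` are hypothetical and are not taken from OSDR.

```python
import json


def build_search_body(text, tag, size=10):
    """Build a query-DSL body: full-text match plus an exact tag filter.

    The field names "name" and "tags" are hypothetical placeholders.
    """
    return {
        "size": size,
        "query": {
            "bool": {
                # "must" clauses contribute to the relevance score (ranking).
                "must": [{"match": {"name": text}}],
                # "filter" clauses only include or exclude documents.
                "filter": [{"term": {"tags": tag}}],
            }
        },
    }


body = build_search_body("aspirin", "solubility")
print(json.dumps(body, indent=2))
```

Keeping the ranked part (`must`) separate from the exact part (`filter`) is one way a domain-specific search module could add its own constraints without affecting ranking.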
Built-in Machine Learning
● Automated ML pipeline
● Pre-built ML modules
● Comparison between different ML algorithms
● NB, NN, RF, SVM, LR
● DNN
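Naive Bayes (NB) is the first algorithm in the list above. To make the idea concrete, here is a minimal sketch of a Bernoulli Naive Bayes classifier over binary feature vectors, such as fingerprint bits. It is an illustration written from the textbook definition, not the OSDR implementation, and the 4-bit toy data is invented for the example.

```python
import math


class BernoulliNB:
    """Tiny Bernoulli Naive Bayes for binary feature vectors."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha  # Laplace smoothing strength (must be > 0)

    def fit(self, X, y):
        self.classes = sorted(set(y))
        n_features = len(X[0])
        self.log_prior = {}
        self.p_on = {}  # per class: smoothed P(bit = 1 | class) per feature
        for c in self.classes:
            rows = [x for x, label in zip(X, y) if label == c]
            self.log_prior[c] = math.log(len(rows) / len(X))
            self.p_on[c] = [
                (sum(r[j] for r in rows) + self.alpha)
                / (len(rows) + 2 * self.alpha)
                for j in range(n_features)
            ]
        return self

    def log_score(self, x, c):
        # log P(c) + sum over bits of log P(bit value | c)
        s = self.log_prior[c]
        for bit, p in zip(x, self.p_on[c]):
            s += math.log(p if bit else 1.0 - p)
        return s

    def predict(self, x):
        return max(self.classes, key=lambda c: self.log_score(x, c))


# Invented toy data: 4-bit "fingerprints", label 1 = active, 0 = inactive.
X = [[1, 1, 0, 0], [1, 0, 0, 0], [1, 1, 1, 0],
     [0, 0, 1, 1], [0, 0, 0, 1], [0, 1, 1, 1]]
y = [1, 1, 1, 0, 0, 0]

model = BernoulliNB().fit(X, y)
print(model.predict([1, 1, 0, 0]), model.predict([0, 0, 1, 1]))
```

Working in log space avoids numeric underflow when the vectors have many bits, which matters for fingerprints with around a thousand positions.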
Model Training Pipeline
Datasets used for evaluating multiple computational methods for activity and chemical property prediction
Model | Datasets used and references | Cutoff for active | Number of molecules and ratio
solubility | Huuskonen J. J Chem Inf Comput Sci 2000 | Log solubility = −5 | 1144 active, 155 inactive, ratio 7.38
probe-like | Litterman N. et al. J Chem Inf Model 2014 | described in reference | 253 active, 69 inactive, ratio 3.67
hERG | Wang S. et al. Mol Pharm 2012 | described in reference | 373 active, 433 inactive, ratio 0.86
KCNQ1 | PubChem BioAssay: AID 2642 [98] | using actives assigned in PubChem | 301,737 active, 3878 inactive, ratio 77.81
Bubonic plague (Yersinia pestis) | PubChem single-point screen BioAssay: AID 898 | active when inhibition ≥50% | 223 active, 139,710 inactive, ratio 0.0016
Chagas disease (Trypanosoma cruzi) | PubChem BioAssay: AID 2044 | EC50 <1 μM and >10-fold difference in cytotoxicity counted as active | 1692 active, 2363 inactive, ratio 0.72
TB (Mycobacterium tuberculosis) | in vitro bioactivity and cytotoxicity data from MLSMR, CB2, kinase, and ARRA datasets | Mtb activity and acceptable Vero cell cytotoxicity, selectivity index = (MIC or IC90)/CC50 ≥10 | 1434 active, 5789 inactive, ratio 0.25
malaria (Plasmodium falciparum) | CDD Public datasets (MMV, St. Jude, Novartis, and TCAMS) | 3D7 EC50 <10 nM | 175 active, 19,604 inactive, ratio 0.0089
Note: the active/inactive ratios for hERG and KCNQ1 are reversed because we are trying to obtain compounds that are more desirable (active = non-inhibitors).
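The ratio column in the dataset table is simply the number of actives divided by the number of inactives. The short sketch below recomputes it from the counts listed in the table and also prints the fraction of actives, which makes the class imbalance easier to see. The function and variable names are chosen for this illustration only.

```python
# (active, inactive) counts copied from the dataset table.
DATASETS = {
    "solubility": (1144, 155),
    "probe-like": (253, 69),
    "hERG": (373, 433),
    "KCNQ1": (301737, 3878),
    "Bubonic plague": (223, 139710),
    "Chagas disease": (1692, 2363),
    "TB": (1434, 5789),
    "malaria": (175, 19604),
}


def active_inactive_ratio(active, inactive):
    """Ratio reported in the table: actives per inactive."""
    return active / inactive


def active_fraction(active, inactive):
    """Share of the whole dataset that is labeled active."""
    return active / (active + inactive)


for name, (active, inactive) in DATASETS.items():
    ratio = active_inactive_ratio(active, inactive)
    frac = active_fraction(active, inactive)
    print(f"{name:15s} ratio={ratio:9.4f}  active fraction={frac:.4f}")
```

The ratios span several orders of magnitude, from about 0.0016 (Bubonic plague) to about 77.81 (KCNQ1), so the datasets differ widely in how balanced the two classes are.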
Solubility dataset: selected ROC
Solubility dataset: polar plots of the model evaluation metrics
BNB - Bernoulli Naive Bayes, LLR - Logistic linear regression, ABDT - AdaBoost Decision Trees, RF - Random Forest, SVM - Support Vector Machines, DNN-N - deep neural network with N hidden layers.
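A ROC curve is commonly summarized by the area under it (AUC). One standard way to compute AUC without drawing the curve is as the probability that a randomly chosen active receives a higher score than a randomly chosen inactive, with ties counted as one half. The sketch below implements that definition directly; the labels and scores in the usage lines are invented, and this is not the evaluation code used for the slides.

```python
def roc_auc(labels, scores):
    """ROC AUC via pairwise comparison of active and inactive scores.

    labels: 1 for active, 0 for inactive. Ties count as half a win.
    Runs in O(n_active * n_inactive), which is fine for a small illustration.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one active and one inactive")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Invented example: two inactives and two actives with model scores.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(roc_auc(labels, scores))
```

An AUC of 1.0 means every active is ranked above every inactive, and 0.5 corresponds to a ranking no better than chance.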
AUC for all tested datasets (FCFP6, 1024)
Last column: Clark et al. J Chem Inf Model 2015
| AUC values | BNB | LLR | ABDT | RF | SVM | DNN-2 | DNN-3 | DNN-4 | DNN-5 | Clark et al. |
| solubility train | 0.959 | 0.991 | 0.996 | 0.934 | 0.983 | 1.000 | 1.000 | 1.000 | 1.000 | 0.866 |
| solubility test | 0.862 | 0.938 | 0.932 | 0.874 | 0.927 | 0.935 | 0.934 | 0.934 | 0.933 | |
| probe-like train | 0.989 | 0.932 | 1.000 | 0.984 | 0.995 | 1.000 | 1.000 | 1.000 | 1.000 | 0.757 |
| probe-like test | 0.636 | 0.662 | 0.658 | 0.571 | 0.665 | 0.559 | 0.563 | 0.565 | 0.563 | |
| hERG train | 0.930 | 0.916 | 0.992 | 0.922 | 0.960 | 1.000 | 1.000 | 1.000 | 1.000 | 0.849 |
| hERG test | 0.842 | 0.853 | 0.844 | 0.834 | 0.864 | 0.840 | 0.841 | 0.841 | 0.840 | |
| KCNQ1 train | 0.795 | 0.864 | 0.809 | 0.764 | 0.864 | 1.000 | 1.000 | 1.000 | 1.000 | 0.842 |
| KCNQ1 test | 0.786 | 0.826 | 0.801 | 0.732 | 0.832 | 0.861 | 0.856 | 0.852 | 0.848 | |
| Bubonic plague train | 0.956 | 0.946 | 0.985 | 0.895 | 0.992 | 1.000 | 1.000 | 1.000 | 1.000 | 0.810 |
| Bubonic plague test | 0.681 | 0.767 | 0.643 | 0.706 | 0.758 | 0.754 | 0.752 | 0.753 | 0.753 | |
| Chagas disease train | 0.812 | 0.847 | 0.865 | 0.815 | 0.926 | 1.000 | 1.000 | 1.000 | 1.000 | 0.800 |
| Chagas disease test | 0.731 | 0.763 | 0.768 | 0.732 | 0.789 | 0.790 | 0.791 | 0.790 | 0.789 | |
| Tuberculosis train | 0.721 | 0.737 | 0.760 | 0.735 | 0.800 | 1.000 | 1.000 | 1.000 | 1.000 | 0.727 |
| Tuberculosis test | 0.671 | 0.681 | 0.676 | 0.679 | 0.695 | 0.687 | 0.684 | 0.688 | 0.685 | |
| Malaria train | 0.994 | 0.993 | 0.999 | 0.979 | 0.998 | 1.000 | 1.000 | 1.000 | 1.000 | 0.977 |
| Malaria test | 0.984 | 0.982 | 0.966 | 0.953 | 0.975 | 0.975 | 0.975 | 0.974 | 0.974 | |
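One way to read the AUC table is to compare each method's train and test values: a large gap between the two can indicate that a model fits the training set much better than it generalizes. The sketch below does this for two of the datasets, using values copied from the table. The helper names are chosen for this illustration, and no conclusion beyond the arithmetic is implied.

```python
METHODS = ["BNB", "LLR", "ABDT", "RF", "SVM", "DNN-2", "DNN-3", "DNN-4", "DNN-5"]

# Train and test AUC values copied from the table for two datasets.
AUC = {
    "solubility": {
        "train": [0.959, 0.991, 0.996, 0.934, 0.983, 1.000, 1.000, 1.000, 1.000],
        "test": [0.862, 0.938, 0.932, 0.874, 0.927, 0.935, 0.934, 0.934, 0.933],
    },
    "probe-like": {
        "train": [0.989, 0.932, 1.000, 0.984, 0.995, 1.000, 1.000, 1.000, 1.000],
        "test": [0.636, 0.662, 0.658, 0.571, 0.665, 0.559, 0.563, 0.565, 0.563],
    },
}


def train_test_gap(dataset):
    """Per-method difference between train AUC and test AUC."""
    rows = AUC[dataset]
    return {
        method: round(train - test, 3)
        for method, train, test in zip(METHODS, rows["train"], rows["test"])
    }


def best_on_test(dataset):
    """Method with the highest test AUC for the given dataset."""
    test = AUC[dataset]["test"]
    return METHODS[max(range(len(METHODS)), key=lambda i: test[i])]


for name in AUC:
    print(name, "best on test:", best_on_test(name))
    print("  gaps:", train_test_gap(name))
```

For these two datasets the highest test AUC belongs to LLR (solubility) and SVM (probe-like), while the DNN rows reach 1.000 on train in both cases.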
Prediction pipeline
Extensible micro-service based architecture
Micro-service
● Single responsibility
● Simple API
● One-pizza size team
● Independent development
● Independent deployment and scaling
● Different services can be implemented using different technologies
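To illustrate "single responsibility" and "simple API", here is a minimal sketch of a service that does one thing only: it maps a file name to a format label and returns the result as JSON. Everything in it is hypothetical (the endpoint `/detect/<filename>`, the port, and the small extension table); it uses only the Python standard library and is not how OSDR services are built.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical extension-to-format table for the illustration.
FORMATS = {
    ".mol": "MDL Molfile",
    ".sdf": "SD file",
    ".cif": "CIF",
    ".csv": "CSV",
}


def detect_format(filename):
    """The service's single responsibility: file name in, format label out."""
    dot = filename.rfind(".")
    ext = filename[dot:].lower() if dot != -1 else ""
    return {"filename": filename, "format": FORMATS.get(ext, "unknown")}


class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Simple API: GET /detect/<filename> returns a small JSON document.
        prefix = "/detect/"
        if not self.path.startswith(prefix):
            self.send_error(404, "unknown endpoint")
            return
        result = detect_format(self.path[len(prefix):])
        payload = json.dumps(result).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


def serve(port=8080):
    """Start the service; blocks until interrupted."""
    HTTPServer(("127.0.0.1", port), Handler).serve_forever()


print(detect_format("aspirin.MOL"))
```

Calling `serve()` starts the service locally. Because the logic lives in one plain function behind one endpoint, the service can be developed, deployed, and scaled on its own, or rewritten in another technology without changing its callers.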
Technologies
● Mix of technologies connected through microservices architecture
● Open source toolkits and libraries with permissive licenses
● NoSQL Databases
● Containerization
● Leading practices in CI/CD
● Automated testing, rapid development
Summary
• OSDR is a chemistry data platform
• Supports FAIR data principles
• Can handle specific use cases via modules
• Integrated Machine Learning
• Remove proprietary software barriers
• Uses open source toolkits
• Evolve and improve continuously
Thank you!
On Web: scidatasoft.com
Slides: https://www.slideshare.net/valerytkachenko16
Contact us: info@scidatasoft.com

Animal Communication- Auditory and Visual.pptxAnimal Communication- Auditory and Visual.pptx
Animal Communication- Auditory and Visual.pptx
 
Behavioral Disorder: Schizophrenia & it's Case Study.pdf
Behavioral Disorder: Schizophrenia & it's Case Study.pdfBehavioral Disorder: Schizophrenia & it's Case Study.pdf
Behavioral Disorder: Schizophrenia & it's Case Study.pdf
 
PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...
PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...
PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...
 
Analytical Profile of Coleus Forskohlii | Forskolin .pdf
Analytical Profile of Coleus Forskohlii | Forskolin .pdfAnalytical Profile of Coleus Forskohlii | Forskolin .pdf
Analytical Profile of Coleus Forskohlii | Forskolin .pdf
 
Bentham & Hooker's Classification. along with the merits and demerits of the ...
Bentham & Hooker's Classification. along with the merits and demerits of the ...Bentham & Hooker's Classification. along with the merits and demerits of the ...
Bentham & Hooker's Classification. along with the merits and demerits of the ...
 
A relative description on Sonoporation.pdf
A relative description on Sonoporation.pdfA relative description on Sonoporation.pdf
A relative description on Sonoporation.pdf
 
Physiochemical properties of nanomaterials and its nanotoxicity.pptx
Physiochemical properties of nanomaterials and its nanotoxicity.pptxPhysiochemical properties of nanomaterials and its nanotoxicity.pptx
Physiochemical properties of nanomaterials and its nanotoxicity.pptx
 
Is RISC-V ready for HPC workload? Maybe?
Is RISC-V ready for HPC workload? Maybe?Is RISC-V ready for HPC workload? Maybe?
Is RISC-V ready for HPC workload? Maybe?
 
Traditional Agroforestry System in India- Shifting Cultivation, Taungya, Home...
Traditional Agroforestry System in India- Shifting Cultivation, Taungya, Home...Traditional Agroforestry System in India- Shifting Cultivation, Taungya, Home...
Traditional Agroforestry System in India- Shifting Cultivation, Taungya, Home...
 
Nanoparticles synthesis and characterization​ ​
Nanoparticles synthesis and characterization​  ​Nanoparticles synthesis and characterization​  ​
Nanoparticles synthesis and characterization​ ​
 
Hubble Asteroid Hunter III. Physical properties of newly found asteroids
Hubble Asteroid Hunter III. Physical properties of newly found asteroidsHubble Asteroid Hunter III. Physical properties of newly found asteroids
Hubble Asteroid Hunter III. Physical properties of newly found asteroids
 

Living in a world of federated knowledge: challenges, principles, tools and solutions

  • 1. Living in a world of federated knowledge: Challenges, principles, tools and solutions. Fall ACS 2017, Washington, DC. Rick Zakharov (1), Valery Tkachenko (1). (1) Science Data Software, Rockville, MD, United States
  • 2. We live in a hyperconnected World
  • 4. Dimensions and complexity of scientific data
  • 6. Traditional data – relational
  • 8. Why is it so hard to… Competitors? What’s the structure? Are they in our file? What’s similar? What’s the target? Pharmacology data? Known pathways? Working on now? Connections to disease? Expressed in right cell type? IP?
  • 9. Big Data Integration: OpenPHACTS
  • 13. Open Data Science Platform (diagram). Labels: Data; Data Lake; Social Media; Electronic Notebooks; Databases; Sensor; Med Dev; IoT; Curated Repository; Models; Curation & Integration; Validation; Decision Support; Analysis & Modeling; Mining; Users; Model-driven experimental studies
  • 14. Organize your data in a natural way ● Now-natural folder structure ● Organize your data into collections ● You can download anything to your local drive, as long as the security context allows
  • 15. Chemical processing ● Support for chemical formats ● Chemistry validation and standardization ● Automatic processing and visualization
  • 16. OSDR - documents • Integrated text-mining
  • 18. Convert between formats ● Integrated format transformation ● 50+ various data formats
  • 19. OSDR - mapping and conversion
  • 22. Predefined or custom metadata Tagging Attributes Taxonomies Ontologies Metadata Harvesting Industry Standards Metadata
  • 23. Collaborative data authoring and curation ● Datacite.org support ● Other formats ● Audit trail ● Notifications
  • 24. Extensive search options ● Search language ● Elasticsearch technology ● Domain-specific search modules ● Search ranking
  • 25. Built-in Machine Learning ● Automated ML pipeline ● Pre-built ML modules ● Comparison between different ML algorithms ● NB, NN, RF, SVM, LR ● DNN
  • 27. Datasets used for evaluating multiple computational methods for activity and chemical property prediction

    | Model | Datasets used and references | Cutoff for active | Number of molecules and ratio |
    |---|---|---|---|
    | solubility | Huuskonen J. J Chem Inf Comput Sci 2000 | Log solubility = −5 | 1144 active, 155 inactive, ratio 7.38 |
    | probe-like | Litterman N. et al. J Chem Inf Model 2014 | described in reference | 253 active, 69 inactive, ratio 3.67 |
    | hERG | Wang S. et al. Mol Pharm 2012 | described in reference | 373 active, 433 inactive, ratio 0.86 |
    | KCNQ1 | PubChem BioAssay: AID 2642 98 | using actives assigned in PubChem | 301,737 active, 3878 inactive, ratio 77.81 |
    | Bubonic plague (Yersinia pestis) | PubChem single-point screen BioAssay: AID 898 | active when inhibition ≥50% | 223 active, 139,710 inactive, ratio 0.0016 |
    | Chagas disease (Trypanosoma cruzi) | PubChem BioAssay: AID 2044 | EC50 <1 μM and >10-fold difference in cytotoxicity as active | 1692 active, 2363 inactive, ratio 0.72 |
    | TB (Mycobacterium tuberculosis) | in vitro bioactivity and cytotoxicity data from MLSMR, CB2, kinase, and ARRA datasets | Mtb activity and acceptable Vero cell cytotoxicity; selectivity index = (MIC or IC90)/CC50 ≥10 | 1434 active, 5789 inactive, ratio 0.25 |
    | malaria (Plasmodium falciparum) | CDD Public datasets (MMV, St. Jude, Novartis, and TCAMS) | 3D7 EC50 <10 nM | 175 active, 19,604 inactive, ratio 0.0089 |

    Note: the active/inactive ratios for hERG and KCNQ1 are reversed because the aim is to obtain the more desirable compounds (active = non-inhibitors).
  • 29. Solubility dataset: polar plots of the model evaluation metrics BNB - Bernoulli Naive Bayes, LLR - Logistic linear regression, ABDT - AdaBoost Decision Trees, RF - Random Forest, SVM - Support Vector Machines, DNN-N (N is number of hidden layers).
  • 30. AUC for all tested datasets (FCFP6, 1024). Clark et al. J Chem Inf Model 2015

    AUC values (the Clark et al. column gives one value per dataset):

    | Dataset | Split | BNB | LLR | ABDT | RF | SVM | DNN-2 | DNN-3 | DNN-4 | DNN-5 | Clark et al. |
    |---|---|---|---|---|---|---|---|---|---|---|---|
    | solubility | train | 0.959 | 0.991 | 0.996 | 0.934 | 0.983 | 1.000 | 1.000 | 1.000 | 1.000 | 0.866 |
    | solubility | test | 0.862 | 0.938 | 0.932 | 0.874 | 0.927 | 0.935 | 0.934 | 0.934 | 0.933 | |
    | probe-like | train | 0.989 | 0.932 | 1.000 | 0.984 | 0.995 | 1.000 | 1.000 | 1.000 | 1.000 | 0.757 |
    | probe-like | test | 0.636 | 0.662 | 0.658 | 0.571 | 0.665 | 0.559 | 0.563 | 0.565 | 0.563 | |
    | hERG | train | 0.930 | 0.916 | 0.992 | 0.922 | 0.960 | 1.000 | 1.000 | 1.000 | 1.000 | 0.849 |
    | hERG | test | 0.842 | 0.853 | 0.844 | 0.834 | 0.864 | 0.840 | 0.841 | 0.841 | 0.840 | |
    | KCNQ | train | 0.795 | 0.864 | 0.809 | 0.764 | 0.864 | 1.000 | 1.000 | 1.000 | 1.000 | 0.842 |
    | KCNQ | test | 0.786 | 0.826 | 0.801 | 0.732 | 0.832 | 0.861 | 0.856 | 0.852 | 0.848 | |
    | Bubonic plague | train | 0.956 | 0.946 | 0.985 | 0.895 | 0.992 | 1.000 | 1.000 | 1.000 | 1.000 | 0.810 |
    | Bubonic plague | test | 0.681 | 0.767 | 0.643 | 0.706 | 0.758 | 0.754 | 0.752 | 0.753 | 0.753 | |
    | Chagas disease | train | 0.812 | 0.847 | 0.865 | 0.815 | 0.926 | 1.000 | 1.000 | 1.000 | 1.000 | 0.800 |
    | Chagas disease | test | 0.731 | 0.763 | 0.768 | 0.732 | 0.789 | 0.790 | 0.791 | 0.790 | 0.789 | |
    | Tuberculosis | train | 0.721 | 0.737 | 0.760 | 0.735 | 0.800 | 1.000 | 1.000 | 1.000 | 1.000 | 0.727 |
    | Tuberculosis | test | 0.671 | 0.681 | 0.676 | 0.679 | 0.695 | 0.687 | 0.684 | 0.688 | 0.685 | |
    | Malaria | train | 0.994 | 0.993 | 0.999 | 0.979 | 0.998 | 1.000 | 1.000 | 1.000 | 1.000 | 0.977 |
    | Malaria | test | 0.984 | 0.982 | 0.966 | 0.953 | 0.975 | 0.975 | 0.975 | 0.974 | 0.974 | |
  • 33. Micro-service ● Single responsibility ● Simple API ● One-pizza size team ● Independent development ● Independent deployment and scaling ● Different services can be implemented using different technologies
  • 34. Technologies ● Mix of technologies connected through microservices architecture ● Open source toolkits and libraries with permissive licenses ● NoSQL Databases ● Containerization ● Leading practices in CI/CD ● Automated testing, rapid development
  • 35. Summary • OSDR is a chemistry data platform • Supports FAIR data principles • Can handle specific use cases via modules • Integrated Machine Learning • Remove proprietary software barriers • Uses open source toolkits • Evolve and improve continuously
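
The "ratio" figures on slide 27 appear to be the number of actives divided by the number of inactives. The sketch below recomputes them from the counts on that slide. The dictionary name `DATASETS` and the helper `active_ratio` are illustrative names, not part of OSDR.

```python
# (active, inactive) molecule counts copied from the slide 27 table.
DATASETS = {
    "solubility": (1144, 155),
    "probe-like": (253, 69),
    "hERG": (373, 433),
    "KCNQ1": (301_737, 3878),
    "Bubonic plague": (223, 139_710),
    "Chagas disease": (1692, 2363),
    "TB": (1434, 5789),
    "malaria": (175, 19_604),
}


def active_ratio(active: int, inactive: int) -> float:
    """Return the active/inactive class ratio (assumed definition of 'ratio')."""
    if inactive == 0:
        raise ValueError("inactive count must be non-zero")
    return active / inactive


if __name__ == "__main__":
    for name, (active, inactive) in DATASETS.items():
        print(f"{name:15s} {active_ratio(active, inactive):.4f}")
```

The ratios span several orders of magnitude (0.0016 for bubonic plague up to 77.81 for KCNQ1). Class skew of that size is one common reason to report a ranking metric such as AUC rather than raw accuracy.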

Editor's Notes

  1. What about science and chemistry in particular?
  2. Remember this, some of these questions are easier to answer than others
  3. Open PHACTS was developed to support the key questions of drug discovery. Business questions have been at the heart of Open PHACTS and have driven the development of the platform. Mx/psa: how calculated, who did it? Mash up, with your data too; top layer join together but need them all commercial. Data provided by many publishers. Originally in many formats: relational, SD files and RDF. Worked closely with publishers. Data licensing was a major issue. Over 5 billion triples, 14 datasets and growing. Hosted on beefy hardware; data in memory (aim). Extensive memcaching. Pose complex queries to extract data.
  4. The representative polar plots of the model evaluation metrics for the Solubility dataset.
  5. In general, the DNN models performed well for predictions, except for AUC on the probe-like dataset. For AUC, DNN-3 outperforms BNB on 6 of 8 datasets.
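
For readers unfamiliar with the AUC metric used in notes 4 and 5, here is a minimal sketch of how an ROC AUC can be computed from predicted scores. It uses the pairwise-ranking definition (the chance that a randomly chosen active scores higher than a randomly chosen inactive, with ties counted as one half). The function name `roc_auc` and the toy labels and scores are hypothetical; this is not the evaluation code behind the slides.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """ROC AUC via pairwise comparison of active (1) and inactive (0) scores."""
    actives = [s for y, s in zip(labels, scores) if y == 1]
    inactives = [s for y, s in zip(labels, scores) if y == 0]
    if not actives or not inactives:
        raise ValueError("need at least one active and one inactive molecule")

    wins = 0.0
    for a in actives:
        for i in inactives:
            if a > i:
                wins += 1.0   # active ranked above inactive
            elif a == i:
                wins += 0.5   # tie counts as half a win
    return wins / (len(actives) * len(inactives))


if __name__ == "__main__":
    # Toy example: hypothetical labels and model scores, not data from the slides.
    toy_labels = [1, 1, 0, 0]
    toy_scores = [0.9, 0.4, 0.6, 0.1]
    print(roc_auc(toy_labels, toy_scores))
```

The pairwise loop is quadratic in the number of molecules, which is fine for illustration. For datasets as large as those on slide 27, a sort-based implementation from an established library would be the more practical choice.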