Living in a world of federated knowledge:
Challenges, principles, tools and solutions
Fall ACS 2017, Washington, DC
Rick Zakharov¹, Valery Tkachenko¹
¹ Science Data Software, Rockville, MD, United States
We live in a hyperconnected World
Data repositories
Dimensions and complexity of scientific data
Standards and authorities
Traditional data – relational
Chemical data[base]
Why is it so hard to…
● What’s the structure?
● Are they in our file?
● What’s similar?
● What’s the target?
● Pharmacology data?
● Known pathways?
● Working on now?
● Connections to disease?
● Expressed in right cell type?
● Competitors?
● IP?
Big Data Integration
OpenPHACTS
FAIR Data Principles
Virtual Standard FAIR Data Bus
[Diagram labels: Other Registries (shown three times)]
Open Data Science Platform
[Diagram labels: Data; Data Lake; Social Media; Electronic Notebooks; Databases; Sensor; Med Dev; IoT; Curated Repository; Models; Curation & Integration; Validation; Decision Support; Analysis & Modeling; Mining; Users]
Model-driven experimental studies
Organize your data in a natural way
● Now-natural folder structure
● Organize your data into collections
● Download anything to your local drive, as long as the security context allows
Chemical processing
● Support for chemical formats
● Chemistry validation and standardization
● Automatic processing and visualization
OSDR - documents
• Integrated text-mining
Other formats
Convert between formats
● Integrated format transformation
● 50+ various data formats
OSDR - mapping and conversion
OSDR - import
OSDR - export
Predefined or custom metadata
Tagging
Attributes
Taxonomies
Ontologies
Metadata Harvesting
Industry Standards
Metadata
Collaborative data authoring and curation
● Datacite.org support
● Other formats
● Audit trail
● Notifications
Extensive search options
● Search language
● Elasticsearch technology
● Domain-specific search modules
● Search ranking
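A free-text search combined with a filter can be expressed as an Elasticsearch query body. The sketch below only builds that JSON body with the standard `bool`, `match`, and `terms` clauses. The field names `name` and `tags` and the helper `build_query` are assumptions for illustration; the slides do not describe the actual OSDR index schema or search language.

```python
import json
from typing import List, Optional


def build_query(text: str, tags: Optional[List[str]] = None, size: int = 10) -> dict:
    """Build an Elasticsearch query body (hypothetical field names)."""
    # "must" clauses contribute to the relevance score (search ranking).
    bool_clause = {"must": [{"match": {"name": text}}]}
    if tags:
        # "filter" clauses restrict the results without affecting the score.
        bool_clause["filter"] = [{"terms": {"tags": tags}}]
    return {"size": size, "query": {"bool": bool_clause}}


if __name__ == "__main__":
    print(json.dumps(build_query("aspirin", ["solubility"]), indent=2))
```

The resulting dictionary would be sent as the body of a search request; keeping the tag restriction in `filter` leaves ranking to the text match alone.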
Built-in Machine Learning
● Automated ML pipeline
● Pre-built ML modules
● Comparison between different ML algorithms
● NB, NN, RF, SVM, LR
● DNN
Model Training Pipeline
Datasets used for evaluating multiple computational methods for activity and chemical property prediction
Model | Datasets used and references | Cutoff for active | Number of molecules and ratio
solubility | Huuskonen J. J Chem Inf Comput Sci 2000 | Log solubility = −5 | 1144 active, 155 inactive, ratio 7.38
probe-like | Litterman N. et al. J Chem Inf Model 2014 | described in reference | 253 active, 69 inactive, ratio 3.67
hERG | Wang S. et al. Mol Pharm 2012 | described in reference | 373 active, 433 inactive, ratio 0.86
KCNQ1 | PubChem BioAssay: AID 2642 | using actives assigned in PubChem | 301,737 active, 3878 inactive, ratio 77.81
Bubonic plague (Yersinia pestis) | PubChem single-point screen BioAssay: AID 898 | active when inhibition ≥50% | 223 active, 139,710 inactive, ratio 0.0016
Chagas disease (Trypanosoma cruzi) | PubChem BioAssay: AID 2044 | EC50 <1 μM, >10-fold difference in cytotoxicity as active | 1692 active, 2363 inactive, ratio 0.72
TB (Mycobacterium tuberculosis) | in vitro bioactivity and cytotoxicity data from MLSMR, CB2, kinase, and ARRA datasets | Mtb activity and acceptable Vero cell cytotoxicity, selectivity index = (MIC or IC90)/CC50 ≥10 | 1434 active, 5789 inactive, ratio 0.25
malaria (Plasmodium falciparum) | CDD Public datasets (MMV, St. Jude, Novartis, and TCAMS) | 3D7 EC50 <10 nM | 175 active, 19,604 inactive, ratio 0.0089
Note: the active/inactive ratios for hERG and KCNQ1 are reversed, as we are trying to obtain compounds that are more desirable (active = non-inhibitors).
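The ratio column is the number of actives divided by the number of inactives. The sketch below recomputes it from the counts listed for each dataset, which also shows how unbalanced some of the sets are. The names `DATASETS`, `active_ratio`, and `minority_fraction` are illustrative and not taken from the slides; for hERG and KCNQ1, "active" means non-inhibitor, as the note states.

```python
# (active, inactive) counts copied from the dataset table.
DATASETS = {
    "solubility": (1144, 155),
    "probe-like": (253, 69),
    "hERG": (373, 433),
    "KCNQ1": (301737, 3878),
    "Bubonic plague": (223, 139710),
    "Chagas disease": (1692, 2363),
    "TB": (1434, 5789),
    "malaria": (175, 19604),
}


def active_ratio(active: int, inactive: int) -> float:
    """Actives per inactive, as reported in the ratio column."""
    return active / inactive


def minority_fraction(active: int, inactive: int) -> float:
    """Share of the smaller class in the whole dataset."""
    return min(active, inactive) / (active + inactive)


if __name__ == "__main__":
    for name, (active, inactive) in DATASETS.items():
        print(
            f"{name:15s} ratio={active_ratio(active, inactive):.4g} "
            f"minority={minority_fraction(active, inactive):.2%}"
        )
```

A very small minority fraction, as for the bubonic plague and malaria sets, is one reason to compare models with a ranking metric such as AUC instead of plain accuracy.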
Solubility dataset: selected ROC
Solubility dataset: polar plots of the model evaluation metrics
BNB - Bernoulli Naive Bayes, LLR - Logistic linear regression, ABDT - AdaBoost Decision Trees, RF - Random Forest, SVM - Support Vector Machines, DNN-N (N is number of hidden layers).
AUC for all tested datasets (FCFP6, 1024)
Clark et al. J Chem Inf Model 2015
AUC values | BNB | LLR | ABDT | RF | SVM | DNN-2 | DNN-3 | DNN-4 | DNN-5 | Clark et al.
solubility train | 0.959 | 0.991 | 0.996 | 0.934 | 0.983 | 1.000 | 1.000 | 1.000 | 1.000 | 0.866
solubility test | 0.862 | 0.938 | 0.932 | 0.874 | 0.927 | 0.935 | 0.934 | 0.934 | 0.933 |
probe-like train | 0.989 | 0.932 | 1.000 | 0.984 | 0.995 | 1.000 | 1.000 | 1.000 | 1.000 | 0.757
probe-like test | 0.636 | 0.662 | 0.658 | 0.571 | 0.665 | 0.559 | 0.563 | 0.565 | 0.563 |
hERG train | 0.930 | 0.916 | 0.992 | 0.922 | 0.960 | 1.000 | 1.000 | 1.000 | 1.000 | 0.849
hERG test | 0.842 | 0.853 | 0.844 | 0.834 | 0.864 | 0.840 | 0.841 | 0.841 | 0.840 |
KCNQ train | 0.795 | 0.864 | 0.809 | 0.764 | 0.864 | 1.000 | 1.000 | 1.000 | 1.000 | 0.842
KCNQ test | 0.786 | 0.826 | 0.801 | 0.732 | 0.832 | 0.861 | 0.856 | 0.852 | 0.848 |
Bubonic plague train | 0.956 | 0.946 | 0.985 | 0.895 | 0.992 | 1.000 | 1.000 | 1.000 | 1.000 | 0.810
Bubonic plague test | 0.681 | 0.767 | 0.643 | 0.706 | 0.758 | 0.754 | 0.752 | 0.753 | 0.753 |
Chagas disease train | 0.812 | 0.847 | 0.865 | 0.815 | 0.926 | 1.000 | 1.000 | 1.000 | 1.000 | 0.800
Chagas disease test | 0.731 | 0.763 | 0.768 | 0.732 | 0.789 | 0.790 | 0.791 | 0.790 | 0.789 |
Tuberculosis train | 0.721 | 0.737 | 0.760 | 0.735 | 0.800 | 1.000 | 1.000 | 1.000 | 1.000 | 0.727
Tuberculosis test | 0.671 | 0.681 | 0.676 | 0.679 | 0.695 | 0.687 | 0.684 | 0.688 | 0.685 |
Malaria train | 0.994 | 0.993 | 0.999 | 0.979 | 0.998 | 1.000 | 1.000 | 1.000 | 1.000 | 0.977
Malaria test | 0.984 | 0.982 | 0.966 | 0.953 | 0.975 | 0.975 | 0.975 | 0.974 | 0.974 |
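AUC here is the area under the ROC curve. It equals the probability that a randomly chosen active receives a higher score than a randomly chosen inactive. The slides do not say how the values were computed, so the function below is only a minimal sketch of that standard pairwise definition; the name `roc_auc` is illustrative.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """ROC AUC as the fraction of (active, inactive) pairs ranked correctly.

    Labels are 1 for active and 0 for inactive. Tied scores count as half
    a correct pair. The double loop is quadratic, which is fine for a
    sketch but too slow for the larger screening sets.
    """
    actives = [s for y, s in zip(labels, scores) if y == 1]
    inactives = [s for y, s in zip(labels, scores) if y == 0]
    if not actives or not inactives:
        raise ValueError("need at least one active and one inactive")
    correct = 0.0
    for a in actives:
        for i in inactives:
            if a > i:
                correct += 1.0
            elif a == i:
                correct += 0.5
    return correct / (len(actives) * len(inactives))


if __name__ == "__main__":
    # Toy scores: one of the four active/inactive pairs is ranked wrongly.
    print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
```

A train AUC of 1.000 next to a noticeably lower test AUC, as in the DNN columns, means the model ranks every training pair correctly but generalizes less well to held-out compounds.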
Prediction pipeline
Extensible micro-service based architecture
Micro-service
● Single responsibility
● Simple API
● One-pizza size team
● Independent development
● Independent deployment and scaling
● Different services can be implemented using different technologies
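The "single responsibility, simple API" idea can be sketched in a few lines. The example below is a hypothetical in-process illustration, not the OSDR implementation: each service is one function that takes and returns a plain dictionary, and a small `Bus` class stands in for whatever network transport or message broker a real deployment would use. All names (`Bus`, `validate_record`, `tag_record`, `build_bus`) are assumptions.

```python
from typing import Callable, Dict

Handler = Callable[[dict], dict]


class Bus:
    """Routes a message to the single service registered for its topic."""

    def __init__(self) -> None:
        self._handlers: Dict[str, Handler] = {}

    def register(self, topic: str, handler: Handler) -> None:
        if topic in self._handlers:
            raise ValueError(f"topic already has a service: {topic}")
        self._handlers[topic] = handler

    def send(self, topic: str, payload: dict) -> dict:
        return self._handlers[topic](payload)


def validate_record(payload: dict) -> dict:
    """Validation service: only checks that required fields are present."""
    missing = [key for key in ("name", "format") if not payload.get(key)]
    return {"valid": not missing, "missing": missing}


def tag_record(payload: dict) -> dict:
    """Tagging service: only adds one tag to a record's tag list."""
    tags = sorted(set(payload.get("tags", [])) | {payload["tag"]})
    return {"tags": tags}


def build_bus() -> Bus:
    bus = Bus()
    bus.register("validate", validate_record)
    bus.register("tag", tag_record)
    return bus


if __name__ == "__main__":
    bus = build_bus()
    print(bus.send("validate", {"name": "aspirin"}))
    print(bus.send("tag", {"tags": ["nsaid"], "tag": "solubility"}))
```

Because callers see only a topic name and a dictionary, either service could be rewritten in another language or scaled separately without the other changing.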
Technologies
● Mix of technologies connected through microservices architecture
● Open source toolkits and libraries with permissive licenses
● NoSQL Databases
● Containerization
● Leading practices in CI/CD
● Automated testing, rapid development
Summary
• OSDR is a chemistry data platform
• Supports FAIR data principles
• Can handle specific use cases via modules
• Integrated Machine Learning
• Remove proprietary software barriers
• Uses open source toolkits
• Evolve and improve continuously
Thank you!
On Web:
scidatasoft.com
Slides:
https://www.slideshare.net/valerytkachenko16
Contact us:
info@scidatasoft.com

Human & Veterinary Respiratory Physilogy_DR.E.Muralinath_Associate Professor....Human & Veterinary Respiratory Physilogy_DR.E.Muralinath_Associate Professor....
Human & Veterinary Respiratory Physilogy_DR.E.Muralinath_Associate Professor....
 
Digital Dentistry.Digital Dentistryvv.pptx
Digital Dentistry.Digital Dentistryvv.pptxDigital Dentistry.Digital Dentistryvv.pptx
Digital Dentistry.Digital Dentistryvv.pptx
 
Atp synthase , Atp synthase complex 1 to 4.
Atp synthase , Atp synthase complex 1 to 4.Atp synthase , Atp synthase complex 1 to 4.
Atp synthase , Atp synthase complex 1 to 4.
 
Dr. E. Muralinath_ Blood indices_clinical aspects
Dr. E. Muralinath_ Blood indices_clinical  aspectsDr. E. Muralinath_ Blood indices_clinical  aspects
Dr. E. Muralinath_ Blood indices_clinical aspects
 
Thyroid Physiology_Dr.E. Muralinath_ Associate Professor
Thyroid Physiology_Dr.E. Muralinath_ Associate ProfessorThyroid Physiology_Dr.E. Muralinath_ Associate Professor
Thyroid Physiology_Dr.E. Muralinath_ Associate Professor
 
Gwalior ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Gwalior ESCORT SERVICE❤CALL GIRL
Gwalior ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Gwalior ESCORT SERVICE❤CALL GIRLGwalior ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Gwalior ESCORT SERVICE❤CALL GIRL
Gwalior ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Gwalior ESCORT SERVICE❤CALL GIRL
 
Bhiwandi Bhiwandi ❤CALL GIRL 7870993772 ❤CALL GIRLS ESCORT SERVICE In Bhiwan...
Bhiwandi Bhiwandi ❤CALL GIRL 7870993772 ❤CALL GIRLS  ESCORT SERVICE In Bhiwan...Bhiwandi Bhiwandi ❤CALL GIRL 7870993772 ❤CALL GIRLS  ESCORT SERVICE In Bhiwan...
Bhiwandi Bhiwandi ❤CALL GIRL 7870993772 ❤CALL GIRLS ESCORT SERVICE In Bhiwan...
 
CURRENT SCENARIO OF POULTRY PRODUCTION IN INDIA
CURRENT SCENARIO OF POULTRY PRODUCTION IN INDIACURRENT SCENARIO OF POULTRY PRODUCTION IN INDIA
CURRENT SCENARIO OF POULTRY PRODUCTION IN INDIA
 
Zoology 5th semester notes( Sumit_yadav).pdf
Zoology 5th semester notes( Sumit_yadav).pdfZoology 5th semester notes( Sumit_yadav).pdf
Zoology 5th semester notes( Sumit_yadav).pdf
 
Cyanide resistant respiration pathway.pptx
Cyanide resistant respiration pathway.pptxCyanide resistant respiration pathway.pptx
Cyanide resistant respiration pathway.pptx
 

Living in a world of federated knowledge: challenges, principles, tools and solutions

  • 1. Living in a world of federated knowledge: Challenges, principles, tools and solutions. Fall ACS 2017, Washington, DC. Rick Zakharov, Valery Tkachenko (Science Data Software, Rockville, MD, United States)
  • 2. We live in a hyperconnected World
  • 4. Dimensions and complexity of scientific data
  • 6. Traditional data – relational
  • 8. Why is it so hard to… Competitors? What’s the structure? Are they in our file? What’s similar? What’s the target? Pharmacology data? Known pathways? Working on now? Connections to disease? Expressed in right cell type? IP?
  • 9. Big Data Integration: OpenPHACTS
  • 13. Open Data Science Platform (diagram labels): Data; Data Lake; Social Media; Electronic Notebooks; Databases; Sensor; Med Dev; IoT; Curated Repository; Models; Curation & Integration; Validation; Decision Support; Analysis & Modeling; Mining; Users; Model-driven experimental studies
  • 14. Organize your data in a natural way ● Now-natural folder structure ● Organize your data into collections ● You can download anything to your local drive, as long as the security context allows
  • 15. Chemical processing ● Support for chemical formats ● Chemistry validation and standardization ● Automatic processing and visualization
  • 16. OSDR - documents • Integrated text-mining
  • 18. Convert between formats ● Integrated format transformation ● 50+ various data formats
  • 19. OSDR - mapping and conversion
  • 22. Predefined or custom metadata (diagram labels): Tagging; Attributes; Taxonomies; Ontologies; Metadata Harvesting; Industry Standards; Metadata
  • 23. Collaborative data authoring and curation ● Datacite.org support ● Other formats ● Audit trail ● Notifications
  • 24. Extensive search options ● Search language ● Elasticsearch technology ● Domain-specific search modules ● Search ranking
  • 25. Built-in Machine Learning ● Automated ML pipeline ● Pre-built ML modules ● Comparison between different ML algorithms ● NB, NN, RF, SVM, LR ● DNN
  • 27. Datasets used for evaluating multiple computational methods for activity and chemical properties prediction

      | Model | Datasets used and references | Cutoff for active | Number of molecules and ratio |
      |---|---|---|---|
      | solubility | Huuskonen J. J Chem Inf Comput Sci 2000 | Log solubility = −5 | 1144 active, 155 inactive, ratio 7.38 |
      | probe-like | Litterman N. et al. J Chem Inf Model 2014 | described in reference | 253 active, 69 inactive, ratio 3.67 |
      | hERG | Wang S. et al. Mol Pharm 2012 | described in reference | 373 active, 433 inactive, ratio 0.86 |
      | KCNQ1 | PubChem BioAssay: AID 2642 98 | using actives assigned in PubChem | 301,737 active, 3878 inactive, ratio 77.81 |
      | Bubonic plague (Yersinia pestis) | PubChem single-point screen BioAssay: AID 898 | active when inhibition ≥50% | 223 active, 139,710 inactive, ratio 0.0016 |
      | Chagas disease (Trypanosoma cruzi) | PubChem BioAssay: AID 2044 | active with EC50 <1 μM and >10-fold difference in cytotoxicity | 1692 active, 2363 inactive, ratio 0.72 |
      | TB (Mycobacterium tuberculosis) | in vitro bioactivity and cytotoxicity data from MLSMR, CB2, kinase, and ARRA datasets | Mtb activity and acceptable Vero cell cytotoxicity; selectivity index = (MIC or IC90)/CC50 ≥10 | 1434 active, 5789 inactive, ratio 0.25 |
      | malaria (Plasmodium falciparum) | CDD Public datasets (MMV, St. Jude, Novartis, and TCAMS) | 3D7 EC50 <10 nM | 175 active, 19,604 inactive, ratio 0.0089 |

      Note: the active/inactive ratios for hERG and KCNQ1 are reversed because we are trying to obtain compounds that are more desirable (active = non-inhibitors).
  • 29. Solubility dataset: polar plots of the model evaluation metrics BNB - Bernoulli Naive Bayes, LLR - Logistic linear regression, ABDT - AdaBoost Decision Trees, RF - Random Forest, SVM - Support Vector Machines, DNN-N (N is number of hidden layers).
  • 30. AUC for all tested datasets (FCFP6, 1024). Clark et al. J Chem Inf Model 2015. The Clark et al. column gives one value per dataset.

      | Dataset | Split | BNB | LLR | ABDT | RF | SVM | DNN-2 | DNN-3 | DNN-4 | DNN-5 | Clark et al. |
      |---|---|---|---|---|---|---|---|---|---|---|---|
      | solubility | train | 0.959 | 0.991 | 0.996 | 0.934 | 0.983 | 1.000 | 1.000 | 1.000 | 1.000 | 0.866 |
      | solubility | test | 0.862 | 0.938 | 0.932 | 0.874 | 0.927 | 0.935 | 0.934 | 0.934 | 0.933 | |
      | probe-like | train | 0.989 | 0.932 | 1.000 | 0.984 | 0.995 | 1.000 | 1.000 | 1.000 | 1.000 | 0.757 |
      | probe-like | test | 0.636 | 0.662 | 0.658 | 0.571 | 0.665 | 0.559 | 0.563 | 0.565 | 0.563 | |
      | hERG | train | 0.930 | 0.916 | 0.992 | 0.922 | 0.960 | 1.000 | 1.000 | 1.000 | 1.000 | 0.849 |
      | hERG | test | 0.842 | 0.853 | 0.844 | 0.834 | 0.864 | 0.840 | 0.841 | 0.841 | 0.840 | |
      | KCNQ | train | 0.795 | 0.864 | 0.809 | 0.764 | 0.864 | 1.000 | 1.000 | 1.000 | 1.000 | 0.842 |
      | KCNQ | test | 0.786 | 0.826 | 0.801 | 0.732 | 0.832 | 0.861 | 0.856 | 0.852 | 0.848 | |
      | Bubonic plague | train | 0.956 | 0.946 | 0.985 | 0.895 | 0.992 | 1.000 | 1.000 | 1.000 | 1.000 | 0.810 |
      | Bubonic plague | test | 0.681 | 0.767 | 0.643 | 0.706 | 0.758 | 0.754 | 0.752 | 0.753 | 0.753 | |
      | Chagas disease | train | 0.812 | 0.847 | 0.865 | 0.815 | 0.926 | 1.000 | 1.000 | 1.000 | 1.000 | 0.800 |
      | Chagas disease | test | 0.731 | 0.763 | 0.768 | 0.732 | 0.789 | 0.790 | 0.791 | 0.790 | 0.789 | |
      | Tuberculosis | train | 0.721 | 0.737 | 0.760 | 0.735 | 0.800 | 1.000 | 1.000 | 1.000 | 1.000 | 0.727 |
      | Tuberculosis | test | 0.671 | 0.681 | 0.676 | 0.679 | 0.695 | 0.687 | 0.684 | 0.688 | 0.685 | |
      | Malaria | train | 0.994 | 0.993 | 0.999 | 0.979 | 0.998 | 1.000 | 1.000 | 1.000 | 1.000 | 0.977 |
      | Malaria | test | 0.984 | 0.982 | 0.966 | 0.953 | 0.975 | 0.975 | 0.975 | 0.974 | 0.974 | |
  • 33. Micro-service ● Single responsibility ● Simple API ● One-pizza size team ● Independent development ● Independent deployment and scaling ● Different services can be implemented using different technologies
  • 34. Technologies ● Mix of technologies connected through microservices architecture ● Open source toolkits and libraries with permissive licenses ● NoSQL Databases ● Containerization ● Leading practices in CI/CD ● Automated testing, rapid development
  • 35. Summary • OSDR is a chemistry data platform • Supports FAIR data principles • Can handle specific use cases via modules • Integrated Machine Learning • Remove proprietary software barriers • Uses open source toolkits • Evolve and improve continuously
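The active/inactive ratios listed on slide 27 can be reproduced from the molecule counts. The following is a minimal Python sketch, assuming the ratio is simply active divided by inactive; the names `COUNTS` and `class_ratio` are illustrative and not part of OSDR.

```python
# Counts copied from the slide 27 table: dataset -> (active, inactive).
COUNTS = {
    "solubility": (1144, 155),
    "probe-like": (253, 69),
    "hERG": (373, 433),
    "KCNQ1": (301_737, 3878),
    "Bubonic plague": (223, 139_710),
    "Chagas disease": (1692, 2363),
    "TB": (1434, 5789),
    "malaria": (175, 19_604),
}


def class_ratio(active: int, inactive: int) -> float:
    """Assumed definition: number of actives divided by number of inactives."""
    return active / inactive


ratios = {name: class_ratio(a, i) for name, (a, i) in COUNTS.items()}

# List the datasets from most inactive-heavy to most active-heavy.
for name, ratio in sorted(ratios.items(), key=lambda item: item[1]):
    print(f"{name:15s} {ratio:10.4f}")
```

The ratios span several orders of magnitude, which is one reason a single metric such as accuracy can be hard to compare across these datasets.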

Editor's Notes

  1. What about science and chemistry in particular?
  2. Remember this, some of these questions are easier to answer than others
  3. Open PHACTS was developed to support the key questions of drug discovery. Business questions have been at the heart of Open PHACTS and have driven the development of the platform. Mx/psa, how calculated who did it? Mash up. With your data too, - top layer join together but need them all commercial. Data provided by many publishers. Originally in many formats: relational, SD files and RDF. Worked closely with publishers. Data licensing was a major issue. Over 5 billion triples – 14 datasets & growing. Hosted on beefy hardware; data in memory (aim). Extensive memcaching. Pose complex queries to extract data.
  4. The representative polar plots of the model evaluation metrics for the Solubility dataset.
  5. In general the DNN models performed well for predictions except for the AUC performance of the probe-like dataset. For AUC DNN-3 outperforms BNB on 6 of 8 datasets
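
Note 5 singles out the probe-like dataset. One way to look at this is the difference between train and test AUC, using the slide 30 values. The sketch below assumes that train minus test AUC is a reasonable rough indicator of overfitting; that is an assumption of this example, not a metric used in the slides, and the function names are illustrative.

```python
# AUC values copied from the slide 30 table: dataset -> (train, test).
DNN3 = {
    "solubility": (1.000, 0.934),
    "probe-like": (1.000, 0.563),
    "hERG": (1.000, 0.841),
    "KCNQ": (1.000, 0.856),
    "Bubonic plague": (1.000, 0.752),
    "Chagas disease": (1.000, 0.791),
    "Tuberculosis": (1.000, 0.684),
    "Malaria": (1.000, 0.975),
}
BNB = {
    "solubility": (0.959, 0.862),
    "probe-like": (0.989, 0.636),
    "hERG": (0.930, 0.842),
    "KCNQ": (0.795, 0.786),
    "Bubonic plague": (0.956, 0.681),
    "Chagas disease": (0.812, 0.731),
    "Tuberculosis": (0.721, 0.671),
    "Malaria": (0.994, 0.984),
}


def auc_gaps(scores):
    """Return dataset -> (train AUC minus test AUC)."""
    return {name: train - test for name, (train, test) in scores.items()}


def largest_gap(scores):
    """Return the dataset whose train/test AUC gap is largest."""
    gaps = auc_gaps(scores)
    return max(gaps, key=gaps.get)


for label, scores in (("DNN-3", DNN3), ("BNB", BNB)):
    worst = largest_gap(scores)
    print(f"{label}: largest train/test gap on {worst} "
          f"({auc_gaps(scores)[worst]:.3f})")
```

A gap computed this way says nothing about why a dataset is hard; it only flags where the training score is least representative of the test score.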