Over the years, a multitude of chemical formats and approaches have been created to address various aspects of handling chemical information and building databases of chemical knowledge. As a result, the current state of this landscape suffers from a lack of well-accepted, community-recognized formats, protocols, metadata standards, and validation routines; a lack of standards for handling, storing, and representing data; a lack of open toolkits that conform to the same standards; and a lack of platforms that allow interactive, collaborative work on all of the above problems. While organizations such as the RDA and IUPAC, as well as some government agencies and institutes, are concerned and trying to address the problem, it remains a severe pain point. In this presentation we describe our experience building a federated knowledgebase called the Open Science Data Repository (OSDR). It supports deposition of raw and structured chemical and analytical data in various formats; runs validation and standardization protocols; is built in a highly modular way that allows using both its API and its components in the cloud or deploying them on premises behind firewalls; supports a variety of use cases, including collaborative data curation, rich analytics and visualization, real-time machine learning, format conversion, and preparing depositions into PubChem and ChemSpider from a variety of sources; and fully supports the FAIR principles for research data.
1. Living in a world of federated knowledge:
Challenges, principles, tools and solutions
Fall ACS 2017, Washington, DC
Rick Zakharov1, Valery Tkachenko1
1 Science Data Software, Rockville, MD, United States
8. Why is it so hard to…
(Diagram of questions surrounding a compound:)
● Competitors?
● What's the structure?
● Are they in our file?
● What's similar?
● What's the target?
● Pharmacology data?
● Known pathways?
● Working on now?
● Connections to disease?
● Expressed in right cell type?
● IP?
14. Organize your data in a natural way
● A natural folder structure
● Organize your data into collections
● Download anything to your local drive, as long as the security context allows it
15. Chemical processing
● Support for chemical formats
● Chemistry validation and standardization
● Automatic processing and visualization
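A validation step of this kind can be sketched as a minimal, self-contained check (a hypothetical illustration; a production platform would rely on a full cheminformatics toolkit rather than these cheap SMILES sanity checks):

```python
# Minimal sketch of a chemistry-validation step (hypothetical, not OSDR's
# actual pipeline). It only runs cheap structural sanity checks on a SMILES
# string and reports every problem it finds.
import re

def validate_smiles(smiles: str) -> list[str]:
    """Return a list of problems found; an empty list means 'passed'."""
    problems = []
    if not smiles:
        problems.append("empty structure")
        return problems
    # Branch and atom brackets must be balanced.
    for open_ch, close_ch, name in [("(", ")", "branch"), ("[", "]", "atom bracket")]:
        if smiles.count(open_ch) != smiles.count(close_ch):
            problems.append(f"unbalanced {name} brackets")
    # Only characters that can appear in SMILES are allowed.
    if re.search(r"[^A-Za-z0-9@+\-\[\]()=#$/\\%.:*]", smiles):
        problems.append("illegal character")
    return problems
```

For example, `validate_smiles("c1ccccc1")` passes, while `validate_smiles("CC(C")` reports unbalanced branch brackets. A real standardization stage would go further: normalizing tautomers, stripping salts, and neutralizing charges.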
27. Datasets used for evaluating multiple computational methods for activity and chemical property prediction

Model | Datasets used and references | Cutoff for active | Number of molecules and ratio
solubility | Huuskonen J., J Chem Inf Comput Sci 2000 | log solubility = −5 | 1144 active, 155 inactive, ratio 7.38
probe-like | Litterman N. et al., J Chem Inf Model 2014 | described in reference | 253 active, 69 inactive, ratio 3.67
hERG | Wang S. et al., Mol Pharm 2012 | described in reference | 373 active, 433 inactive, ratio 0.86
KCNQ1 | PubChem BioAssay: AID 2642 | using actives assigned in PubChem | 301,737 active, 3878 inactive, ratio 77.81
bubonic plague (Yersinia pestis) | PubChem single-point screen, BioAssay: AID 898 | active when inhibition ≥50% | 223 active, 139,710 inactive, ratio 0.0016
Chagas disease (Trypanosoma cruzi) | PubChem BioAssay: AID 2044 | EC50 <1 μM with >10-fold difference in cytotoxicity as active | 1692 active, 2363 inactive, ratio 0.72
TB (Mycobacterium tuberculosis) | in vitro bioactivity and cytotoxicity data from the MLSMR, CB2, kinase, and ARRA datasets | Mtb activity and acceptable Vero cell cytotoxicity; selectivity index = (MIC or IC90)/CC50 ≥10 | 1434 active, 5789 inactive, ratio 0.25
malaria (Plasmodium falciparum) | CDD Public datasets (MMV, St. Jude, Novartis, and TCAMS) | 3D7 EC50 <10 nM | 175 active, 19,604 inactive, ratio 0.0089

Note: the active/inactive ratios for hERG and KCNQ1 are reversed, as we are trying to obtain compounds that are more desirable (active = non-inhibitors).
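Each row of the table pairs an activity cutoff with the resulting class balance. As a small sketch (made-up log S values, and assuming values at or above the cutoff count as active), the solubility-style labeling and ratio calculation look like:

```python
# Sketch of turning continuous measurements into binary labels with a cutoff
# (illustrative, made-up log S values; the real counts come from the
# Huuskonen solubility set referenced in the table).
def label_by_cutoff(values, cutoff):
    """Values at or above the cutoff are labeled active."""
    actives = [v for v in values if v >= cutoff]
    inactives = [v for v in values if v < cutoff]
    return len(actives), len(inactives)

log_solubility = [-0.4, -2.1, -4.9, -5.0, -6.3, -7.8]
n_active, n_inactive = label_by_cutoff(log_solubility, cutoff=-5.0)
ratio = n_active / n_inactive  # class balance, as in the table's last column
```

The ratio column matters because several of these datasets are heavily imbalanced (e.g. 0.0016 for plague), which drives the choice of evaluation metrics.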
29. Solubility dataset: polar plots of the model evaluation metrics
BNB - Bernoulli Naive Bayes, LLR - logistic linear regression, ABDT - AdaBoost decision trees, RF - Random Forest, SVM - support vector machines, DNN-N (N is the number of hidden layers).
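AUC is one of the metrics compared in these plots; as a self-contained sketch (illustrative scores only), ROC AUC can be computed directly from its rank interpretation, the probability that a randomly chosen active outscores a randomly chosen inactive:

```python
# ROC AUC from its probabilistic (Mann-Whitney) interpretation: the chance
# that a random active scores above a random inactive, ties counted as 0.5.
def roc_auc(active_scores, inactive_scores):
    wins = 0.0
    for a in active_scores:
        for i in inactive_scores:
            if a > i:
                wins += 1.0
            elif a == i:
                wins += 0.5
    return wins / (len(active_scores) * len(inactive_scores))
```

A perfect ranker gives 1.0 and a random one hovers around 0.5; this O(n·m) form is fine for small sets, while sorting-based implementations scale better. AUC is also insensitive to the class imbalance seen in several of the datasets above, which is why it is a common choice here.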
33. Micro-service
● Single responsibility
● Simple API
● One-pizza-size team
● Independent development
● Independent deployment and scaling
● Different services can be implemented using different technologies
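A single-responsibility service with a simple API can be sketched in a few lines of standard-library Python (the `/validate` endpoint, payload shape, and the toy bracket check are all hypothetical illustrations, not OSDR's actual API):

```python
# Sketch of a single-responsibility microservice (hypothetical endpoint and
# payload shape, not OSDR's actual API). The service does exactly one job:
# answer POST /validate with a quick structure sanity check.
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

class ValidateHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path != "/validate":
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length) or b"{}")
        smiles = payload.get("smiles", "")
        # Stand-in check; a real service would call a cheminformatics toolkit.
        verdict = {"smiles": smiles,
                   "valid": bool(smiles) and smiles.count("(") == smiles.count(")")}
        body = json.dumps(verdict).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # silence per-request logging
        pass

def run(port: int = 0) -> HTTPServer:
    """Start the service in a background thread; port 0 picks a free port."""
    server = HTTPServer(("127.0.0.1", port), ValidateHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

Because the service exposes only one small HTTP contract, it can be developed, deployed, and scaled independently, and reimplemented in another language without touching its callers.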
34. Technologies
● Mix of technologies connected through a microservices architecture
● Open-source toolkits and libraries with permissive licenses
● NoSQL databases
● Containerization
● Leading practices in CI/CD
● Automated testing, rapid development
35. Summary
• OSDR is a chemistry data platform
• Supports FAIR data principles
• Handles specific use cases via modules
• Integrates machine learning
• Removes proprietary software barriers
• Uses open-source toolkits
• Evolves and improves continuously
Remember: some of these questions are easier to answer than others.
Open PHACTS was developed to support the key questions of drug discovery. Business questions have been at the heart of Open PHACTS and have driven the development of the platform.
Mx/PSA: how was it calculated, and who did it?
It is a mash-up, with your data too: the top layer joins the sources together, but you need them all.
Commercial data provided by many publishers, originally in many formats: relational, SD files, and RDF.
Worked closely with publishers; data licensing was a major issue.
Over 5 billion triples – 14 datasets & growing.
Hosted on beefy hardware; the aim is to keep the data in memory, with extensive memcached caching.
Pose complex queries to extract data.
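The kind of query such a triple store answers can be illustrated with a toy in-memory pattern matcher (made-up example triples and predicate names; Open PHACTS itself fronts a real RDF store queried with SPARQL):

```python
# Toy in-memory triple store illustrating the pattern matching at the heart
# of SPARQL-style queries. The triples and predicates below are invented for
# illustration only.
TRIPLES = {
    ("ex:aspirin", "ex:inhibits", "ex:COX1"),
    ("ex:aspirin", "ex:inhibits", "ex:COX2"),
    ("ex:celecoxib", "ex:inhibits", "ex:COX2"),
    ("ex:COX2", "ex:linkedToDisease", "ex:inflammation"),
}

def match(s=None, p=None, o=None):
    """Return triples matching the (subject, predicate, object) pattern;
    None acts as a wildcard, like an unbound SPARQL variable."""
    return sorted(t for t in TRIPLES
                  if (s is None or t[0] == s)
                  and (p is None or t[1] == p)
                  and (o is None or t[2] == o))
```

For example, `match(p="ex:inhibits", o="ex:COX2")` plays the role of the SPARQL pattern `?drug ex:inhibits ex:COX2`, returning every compound known to inhibit that target; real deployments answer such patterns over billions of triples with indexes rather than a linear scan.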
The representative polar plots of the model evaluation metrics for the Solubility dataset: in general, the DNN models performed well, except for the AUC performance on the probe-like dataset; on AUC, DNN-3 outperforms BNB on 6 of 8 datasets.