PMML for QSAR Model Exchange
Rajarshi Guha, Ph.D.
NIH Center for Advancing Translational Sciences
firstname.lastname@example.org / http://rguha.net
Background
• Cheminformatics
  – QSAR, diversity analysis, virtual screening, fragments, polypharmacology, networks
• RNAi screening, high content imaging
• Extensive use of machine learning
• All tied together with software development (GUIs, libraries)
• Contributed pmml.lm to the PMML package
Why is QSAR Useful?
• Lets us predict whether a chemical is likely to be toxic, avoiding animal testing
• Prioritize molecules from a high throughput screen of 300K molecules
• Predict whether a molecule will be (sufficiently) soluble in water
• Identify molecules with anti-malarial properties
• Accurate, predictive models can save significant time and money (and cute bunnies)
Lots and Lots of Models
• Hundreds of such models published in the literature
  – Usually in the form of tables of regression coefficients (if we’re lucky)
  – If the paper describes an SVM model, no chance of reproducing the results
• How can we exchange QSAR models?
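As a rough illustration of how a published table of regression coefficients could be turned into an exchangeable document, the sketch below writes a minimal PMML RegressionModel using only the Python standard library. The function name `coefficients_to_pmml`, the descriptor names, and the coefficient values are all made up for the example and do not come from any real QSAR model; a production document would normally be generated by a PMML exporter rather than by hand.

```python
import xml.etree.ElementTree as ET

PMML_NS = "http://www.dmg.org/PMML-4_4"  # PMML 4.4 namespace


def coefficients_to_pmml(target, intercept, coefficients):
    """Encode intercept + per-descriptor coefficients as a PMML RegressionModel.

    Hypothetical helper for illustration; `coefficients` maps descriptor
    name -> coefficient, in the order they should appear in the document.
    """
    ET.register_namespace("", PMML_NS)  # serialize with a default namespace

    def q(tag):
        return f"{{{PMML_NS}}}{tag}"

    root = ET.Element(q("PMML"), {"version": "4.4"})
    ET.SubElement(root, q("Header"), {"description": "Illustrative QSAR linear model"})

    # Declare the target and every descriptor as a continuous double field.
    fields = [target, *coefficients]
    dictionary = ET.SubElement(
        root, q("DataDictionary"), {"numberOfFields": str(len(fields))}
    )
    for name in fields:
        ET.SubElement(
            dictionary,
            q("DataField"),
            {"name": name, "optype": "continuous", "dataType": "double"},
        )

    model = ET.SubElement(root, q("RegressionModel"), {"functionName": "regression"})
    schema = ET.SubElement(model, q("MiningSchema"))
    ET.SubElement(schema, q("MiningField"), {"name": target, "usageType": "target"})
    for name in coefficients:
        ET.SubElement(schema, q("MiningField"), {"name": name})

    # The regression table is the machine-readable form of the coefficient table.
    table = ET.SubElement(model, q("RegressionTable"), {"intercept": repr(intercept)})
    for name, coef in coefficients.items():
        ET.SubElement(
            table, q("NumericPredictor"), {"name": name, "coefficient": repr(coef)}
        )
    return ET.tostring(root, encoding="unicode")


# Made-up coefficients, purely to show the shape of the output.
print(coefficients_to_pmml("logS", 0.5, {"XLogP": -1.0, "TPSA": 0.01}))
```

The same idea does not extend to models such as SVMs by hand, which is where an automated exporter becomes necessary.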
QSAR Model Exchange
• Build models in ….,
• Save them in PMML
• Distribute
• …
• Profit?
  – Not always
The bottleneck is evaluating descriptors for the new observations to supply to the model
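The bottleneck can be seen from the consumer's side in a small sketch: reading a linear PMML model and applying it is easy, but the descriptor values have to come from somewhere else. The PMML text, descriptor names, coefficients, and the helper names `load_linear_model` and `predict` below are assumptions made for illustration, and the reader only handles a plain RegressionTable with first-order numeric predictors.

```python
import xml.etree.ElementTree as ET

NS = {"p": "http://www.dmg.org/PMML-4_4"}

# A hand-written, illustrative PMML document (made-up coefficients).
PMML_DOC = """<PMML xmlns="http://www.dmg.org/PMML-4_4" version="4.4">
  <Header/>
  <DataDictionary numberOfFields="3">
    <DataField name="logS" optype="continuous" dataType="double"/>
    <DataField name="XLogP" optype="continuous" dataType="double"/>
    <DataField name="TPSA" optype="continuous" dataType="double"/>
  </DataDictionary>
  <RegressionModel functionName="regression">
    <MiningSchema>
      <MiningField name="logS" usageType="target"/>
      <MiningField name="XLogP"/>
      <MiningField name="TPSA"/>
    </MiningSchema>
    <RegressionTable intercept="0.5">
      <NumericPredictor name="XLogP" coefficient="-1.0"/>
      <NumericPredictor name="TPSA" coefficient="0.01"/>
    </RegressionTable>
  </RegressionModel>
</PMML>"""


def load_linear_model(pmml_text):
    """Return (intercept, {descriptor: coefficient}) from a RegressionTable."""
    table = ET.fromstring(pmml_text).find(".//p:RegressionTable", NS)
    intercept = float(table.get("intercept"))
    coefs = {
        p.get("name"): float(p.get("coefficient"))
        for p in table.findall("p:NumericPredictor", NS)
    }
    return intercept, coefs


def predict(model, descriptors):
    """Apply the linear model to already-evaluated descriptor values."""
    intercept, coefs = model
    # The model itself cannot compute descriptors: fail loudly if any are missing.
    missing = sorted(set(coefs) - set(descriptors))
    if missing:
        raise KeyError(f"descriptors not evaluated: {missing}")
    return intercept + sum(c * descriptors[name] for name, c in coefs.items())


model = load_linear_model(PMML_DOC)
# These values stand in for the output of a descriptor engine such as the CDK.
print(predict(model, {"XLogP": 2.0, "TPSA": 50.0}))
```

The `KeyError` branch is the point of the example: the PMML document names the inputs it needs but says nothing about how to compute them from a chemical structure.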
Cheminformatics in R
• rcdk provides cheminformatics support in R
  – Load and parse molecular file formats
  – Evaluate numerical descriptors from chemical structures
[Figure: rcdk within the R programming environment; diagram labels: rcdk, CDK, Jmol, rpubchem, rJava, fingerprint, XML]
R, rcdk, PMML
• rcdk provides the means to take in molecules and output a PMML encoded model
• One could record appropriate functions/classes in the document and use that info to evaluate descriptors for new observations
• Since rcdk is based on the Java CDK library, could also use jpmml, a Java API for PMML documents
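One possible way to record the descriptor functions/classes in the document is PMML's generic `Extension` element, attached to each `DataField`. The sketch below is only an assumption about how that might look: the extension name `descriptorClass`, the extender value `rcdk`, and the helper names are hypothetical conventions, and the CDK class names are given as examples of what a producer might record.

```python
import xml.etree.ElementTree as ET

PMML_NS = "http://www.dmg.org/PMML-4_4"
NS = {"p": PMML_NS}
EXT_NAME = "descriptorClass"  # hypothetical convention, not part of the PMML standard

# Illustrative fragment: only the data dictionary is needed for this sketch.
PMML_DOC = """<PMML xmlns="http://www.dmg.org/PMML-4_4" version="4.4">
  <Header/>
  <DataDictionary numberOfFields="2">
    <DataField name="XLogP" optype="continuous" dataType="double"/>
    <DataField name="TPSA" optype="continuous" dataType="double"/>
  </DataDictionary>
</PMML>"""


def annotate(pmml_text, provenance):
    """Attach an Extension naming the descriptor class to each known DataField."""
    ET.register_namespace("", PMML_NS)
    root = ET.fromstring(pmml_text)
    for field in root.iterfind(".//p:DataField", NS):
        cls = provenance.get(field.get("name"))
        if cls is not None:
            ext = ET.Element(
                f"{{{PMML_NS}}}Extension",
                {"extender": "rcdk", "name": EXT_NAME, "value": cls},
            )
            field.insert(0, ext)  # Extension elements come first in their parent
    return ET.tostring(root, encoding="unicode")


def descriptor_sources(pmml_text):
    """Read back {field name: descriptor class} from the annotated document."""
    root = ET.fromstring(pmml_text)
    sources = {}
    for field in root.iterfind(".//p:DataField", NS):
        for ext in field.findall("p:Extension", NS):
            if ext.get("name") == EXT_NAME:
                sources[field.get("name")] = ext.get("value")
    return sources


annotated = annotate(
    PMML_DOC,
    {
        "XLogP": "org.openscience.cdk.qsar.descriptors.molecular.XLogPDescriptor",
        "TPSA": "org.openscience.cdk.qsar.descriptors.molecular.TPSADescriptor",
    },
)
print(descriptor_sources(annotated))
```

A consumer that understands this convention could look up each recorded class in the CDK and evaluate it on new molecules; a consumer that does not can ignore the extensions and still read the model.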