Each bit in the fingerprint represents one molecular fragment
Molecular similarity searching methods, seminar
Molecular similarity searching methods in drug discovery
By: Haytham Hijazi
Advisor: Univ-Prof. Hon-Prof. Dr. Dieter Roller
A Presentation in advanced graphical engineering systems seminar 2011/2012
In this work, I propose a contribution to the field of cheminformatics. Cheminformatics means solving chemical problems using computational methods.
James Rhodes, Stephen Boyer, Jeffrey Kreulen, Ying Chen, Patricia Ordonez, “Mining patents using molecular similarity search”, IBM, Almaden Services Research, Pacific Symposium on Biocomputing 12:304-315 (2007).
Agenda
• The main question in this research
• The principle of similarity
• Drug discovery as an application
• Research problem
• Molecular representations (1D, 2D…)
• Searching the similarity
• Similarity coefficients calculations
• The probabilistic model (BIM)
• The contribution (MDC)
• Experiments, conclusions and discussion
“The similarity is in the eye of the beholder”
Shape, colour, size, pattern
Question: Which molecules in a database are similar to the query molecule?
Applications:
• Finding better compounds than the initial lead compound (drug discovery)
• Property prediction for unknown compounds
Structurally similar molecules are assumed to have similar biological properties.
Similar biological properties → drug discovery.
Sylvaine Roy and Laurence Lafanechère, “Chemogenomics and Chemical Genetics: A User's Introduction for Biologists, Chemists and Informaticians”, Molecular similarity, Springer Berlin, ISBN 978-3-642-19614-0, 1st Edition.
Historical progression
◦ Complete structure
◦ Sub-structure descriptors
◦ 1D (physicochemical properties), 2D, 3D, and 4D
Connectivity tables and graph theory!
Image source: Karine Audouze, “Representation of molecular structures and structural diversity”, ChemoInformatics in Drug Discovery, 2009.
SMILES
CCCC1=NN(C2=C1NC(=NC2=O)C3=C(C=CC(=C3)S(=O)(=O)N4CCN(CC4)C)OCC)C
CC(=O)OC1=CC=CC=C1C(=O)O
SMILES: Simplified Molecular Input Line Entry System
Source: Karine Audouze, “Representation of molecular structures and structural diversity”, ChemoInformatics in Drug Discovery, 2009.
A fingerprint is a vector encoding the presence (‘1’) or absence (‘0’) of fragment substructures in a molecule.
Fingerprints are either dictionary-based or hash-based.
Descriptor | Fragment
1 | AR
2 | CCCCN
3 | Me
9 | NH2
Source: Karine Audouze, “Representation of molecular structures and structural diversity”, ChemoInformatics in Drug Discovery, 2009.
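The two fingerprint styles named above can be illustrated with a minimal sketch. It makes strong simplifying assumptions: plain substring search on the SMILES text stands in for real substructure matching, the fragment list `FRAGMENTS` is hypothetical, and the hashed variant hashes short SMILES substrings rather than true atom paths. It shows only how bits get set, not how a production toolkit does it.

```python
import zlib

# Hypothetical fragment dictionary, for illustration only.
FRAGMENTS = ["C(=O)O", "CCCCN", "S(=O)(=O)", "N"]


def dictionary_fingerprint(smiles, fragments=FRAGMENTS):
    """Dictionary-based: one bit per listed fragment.

    Assumption: a substring test on the SMILES text approximates
    substructure presence. Real systems match on the molecular graph.
    """
    return [1 if fragment in smiles else 0 for fragment in fragments]


def hashed_fingerprint(smiles, n_bits=32, max_length=4):
    """Hash-based: every short substring is hashed to a bit position.

    Assumption: SMILES substrings of 1..max_length characters stand in
    for the linear atom paths a real hashed fingerprint enumerates.
    Different substrings may collide on the same bit.
    """
    bits = [0] * n_bits
    for length in range(1, max_length + 1):
        for start in range(len(smiles) - length + 1):
            piece = smiles[start:start + length]
            # crc32 is deterministic across runs, unlike built-in hash().
            bits[zlib.crc32(piece.encode("ascii")) % n_bits] = 1
    return bits


aspirin = "CC(=O)OC1=CC=CC=C1C(=O)O"
print(dictionary_fingerprint(aspirin))  # [1, 0, 0, 0]
print(hashed_fingerprint(aspirin))
```

In the dictionary variant each bit has a fixed meaning, while in the hashed variant a bit can be shared by several fragments, which trades interpretability for not needing a predefined dictionary.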
In 3D keys, the position of each bit corresponds to a certain range of distances or angles. Computationally complex.
Source: Karine Audouze, “Representation of molecular structures and structural diversity”, ChemoInformatics in Drug Discovery, 2009.
Structure search
• Exact structure search
• Substructure search
• Similarity searching: maximum common subgraph isomorphism, Tanimoto/Dice/Cosine coefficients
The similarity measure (coefficient) is a quantitative measure of similarity.
• Used to rank the results of the query
• Results are listed in decreasing order of similarity
Types of coefficients:
• Distance coefficients
• Probabilistic coefficients
• Correlation coefficients
• Association coefficients
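The ranking step can be sketched as follows. The function names, the tiny database, and the choice of a simple bit-agreement score are all assumptions made for illustration; any similarity coefficient with the same call shape could be passed in instead.

```python
def shared_bit_fraction(fp_a, fp_b):
    """Illustrative coefficient: fraction of positions where both bit strings agree."""
    matches = sum(1 for x, y in zip(fp_a, fp_b) if x == y)
    return matches / len(fp_a)


def rank_by_similarity(query, database, coefficient):
    """Score every database fingerprint against the query, most similar first."""
    scored = [(name, coefficient(query, fp)) for name, fp in database.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical four-bit fingerprints keyed by molecule name.
database = {"m1": "1100", "m2": "1000", "m3": "0011"}
for name, score in rank_by_similarity("1100", database, shared_bit_fraction):
    print(name, score)
```

With a distance coefficient the sort order would be reversed, since smaller distances mean more similar molecules.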
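Notation: $a$ = number of bits set in A, $b$ = number of bits set in B, $c$ = number of bits set in both, $d$ = number of bits set in neither, so the fingerprint length is $n = a + b - c + d$.
Association coefficients
• Simple matching coefficient: $\frac{c + d}{a + b - c + d}$
• Jaccard measure (Tanimoto): $\frac{c}{a + b - c}$ (= AND/OR)
• Cosine (Ochiai): $\frac{c}{\sqrt{ab}}$
• Dice: $\frac{c}{0.5\,(a + b)} = \frac{2c}{a + b}$
Distance coefficients
• Hamming distance: $a + b - 2c$
• Euclidean distance: $\sqrt{a + b - 2c}$
• Soergel distance: $\frac{a + b - 2c}{a + b - c}$
Other coefficients
• Pattern difference: $\frac{(a - c)(b - c)}{(a + b - c + d)^2}$
• Size difference: $\frac{(a - b)^2}{(a + b - c + d)^2}$
Naomie Salim, “The study of probability model for compound similarity searching”, UTM Research Management Centre Project Vote – 75207, University of Malaysia, 2009.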
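A minimal sketch of these coefficients as functions of the bit counts, assuming the usual binary-fingerprint notation: `a` and `b` are the on-bit counts of the two fingerprints, `c` is the number of on-bits they share, and `d` is the number of positions off in both. The function names are illustrative, and the cosine is written in its standard binary form c/√(ab).

```python
import math


def simple_matching(a, b, c, d):
    return (c + d) / (a + b - c + d)


def tanimoto(a, b, c):
    # Undefined (division by zero) when both fingerprints have no bits set.
    return c / (a + b - c)


def cosine(a, b, c):
    return c / math.sqrt(a * b)


def dice(a, b, c):
    return 2 * c / (a + b)


def hamming(a, b, c):
    return a + b - 2 * c


def euclidean(a, b, c):
    return math.sqrt(a + b - 2 * c)


def soergel(a, b, c):
    # For binary fingerprints this equals 1 - tanimoto(a, b, c).
    return (a + b - 2 * c) / (a + b - c)


# Example counts: a = 13, b = 8, c = 6 on a 32-bit fingerprint, so d = 17.
a, b, c, d = 13, 8, 6, 17
print(tanimoto(a, b, c), dice(a, b, c), soergel(a, b, c))
```

Similarity coefficients grow with the number of shared bits, while the distance coefficients shrink; the Soergel distance is the complement of the Tanimoto similarity.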
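Assume we generate the fragment-based fingerprint bits:
Molecule A: 00010100010101000101010011110100
Molecule B: 00000000100101001001000011100000
Tanimoto coefficient = $\frac{c}{(a + b) - c}$, where $c$ is the number of bits set in A AND B, and $a$ and $b$ are the numbers of bits set in A and in B.
Tanimoto = $\frac{6}{(13 + 8) - 6} = \frac{6}{15} = 0.4$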
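The arithmetic of this example can be checked with a short sketch that counts the bits directly from the two bit strings. The helper names are illustrative, not part of any library.

```python
MOLECULE_A = "00010100010101000101010011110100"
MOLECULE_B = "00000000100101001001000011100000"


def on_bits(fingerprint):
    """Number of '1' bits in a bit string."""
    return fingerprint.count("1")


def common_on_bits(fp_a, fp_b):
    """Number of positions set in both bit strings (bitwise AND)."""
    return bin(int(fp_a, 2) & int(fp_b, 2)).count("1")


def tanimoto_from_bitstrings(fp_a, fp_b):
    a = on_bits(fp_a)
    b = on_bits(fp_b)
    c = common_on_bits(fp_a, fp_b)
    return c / (a + b - c)


print(on_bits(MOLECULE_A), on_bits(MOLECULE_B))      # 13 8
print(common_on_bits(MOLECULE_A, MOLECULE_B))        # 6
print(tanimoto_from_bitstrings(MOLECULE_A, MOLECULE_B))
```

The counts agree with the values used in the example: 13 bits set in A, 8 in B, and 6 in common, giving 6/15.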
Associate the relevance of a structure with explicit features (bits).
• $p_i$ = probability that bit $b_i$ appears in an active structure.
• $q_i$ = probability that bit $b_i$ appears in an inactive structure.
• $\alpha_i$ is a binary selector: $\alpha_i = 1$ if the bit occurs in the structure; otherwise $\alpha_i = 0$ and the complementary probability is used.
• $P(A|S)$ is the probability of an active structure given $S$.
• $P(NA|S)$ is the probability of an inactive structure given $S$.
• $P(A)$ is the prior probability of actives.
• $P(NA)$ is the prior probability of inactives.
Naomie Salim, “The study of probability model for compound similarity searching”, UTM Research Management Centre Project Vote – 75207, University of Malaysia, 2009.
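One way these quantities combine is the textbook binary independence model, in which bits are assumed independent given the class and the odds P(A|S)/P(NA|S) become a product of per-bit ratios. The sketch below follows that textbook form; the exact formulation in the cited report may differ, and the function and parameter names are hypothetical.

```python
import math


def bim_log_odds(bits, p, q, prior_active):
    """Log of P(A|S) / P(NA|S) under the binary independence assumption.

    bits: the selectors alpha_i (0 or 1) for structure S
    p, q: per-bit probabilities for actives and inactives,
          each assumed to lie strictly between 0 and 1
    prior_active: P(A); P(NA) is taken as 1 - P(A)
    """
    log_odds = math.log(prior_active / (1.0 - prior_active))
    for alpha, p_i, q_i in zip(bits, p, q):
        if alpha:
            # Bit present: ratio of the bit's probabilities.
            log_odds += math.log(p_i / q_i)
        else:
            # Bit absent: ratio of the complementary probabilities.
            log_odds += math.log((1.0 - p_i) / (1.0 - q_i))
    return log_odds


def probability_active(bits, p, q, prior_active):
    """Convert the log-odds back into P(A|S)."""
    return 1.0 / (1.0 + math.exp(-bim_log_odds(bits, p, q, prior_active)))


# Made-up probabilities for a two-bit fingerprint.
print(probability_active([1, 0], p=[0.8, 0.3], q=[0.2, 0.6], prior_active=0.5))
```

Structures can then be ranked by their log-odds, so the probabilistic model plays the same role as a similarity coefficient when ordering search results.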
[Diagram labels: Database, Active compounds, Molecular dynamics simulating tool, Physicochemical properties, Classification algorithm, Class 1, Class 2, …, Class n, Voting]
• Better insight into similarity in terms of bioactivity, toxicity, reactivity... (+)
• Search time (+)
• Prediction and voting possibilities (+)
• Cost of simulation tools (-)
• Classification errors (-)
[Chart: Fingerprint generation time. Time (ms) versus maximum path length (4 to 8), with series for 2 bits, 3 bits, and 4 bits.]
Consider if we have more than 1000 bits!
Data source: simulating tool indicated in the report.
[Chart: Hit rate (0 to 0.18) versus selection size (0 to 2500).]
The more we increase the size of features, the more the hit rate of finding actives decreases.
Data source: simulating tool indicated in the report.
• Even fragment-based fingerprint generation is time consuming
• Probabilistic models and machine learning introduced substantial changes
• Mixing more than one type of descriptor seems efficient in terms of time and result quality
• Experimental results are still needed
Thank you for listening
Haytham Hijazi