Benchmarking and Validation of JChem ECFP and FCFP Fingerprints
Roger Sayle, NextMove Software Ltd, Cambridge, UK
roger@nextmovesoftware.co.uk


   Abstract
1. Overview
The cornerstone of pharmaceutical chemistry is Crum Brown’s observation that similar compounds have similar therapeutic benefits. Cheminformatics tries to capture this insight by defining measures of similarity between the computer representations of two molecules, with the hope of capturing a medicinal chemist’s intuitive sense of “likeness”, and thereby correlate with bioactivity.
This poster evaluates the chemical similarity measures offered by ChemAxon on a standard reference benchmark. Any such benchmark must by necessity be flawed; the similarity between two molecules is influenced by the framework by which they are compared [2]. However, a robust similarity measure should typically perform better on such benchmarks, whilst a weaker model of chemical similarity would be expected to perform worse (on average).

2. Briem & Lessel Benchmark
The benchmark employed in this evaluation is the commonly used Briem and Lessel benchmark [1]. This test assesses a method’s ability to identify near neighbours with the same biological activity from decoy molecules and molecules with different biological activities. Five classes of active compounds are used: 40 ACE inhibitors, 49 TXA antagonists, 110 HMG-CoA reductase inhibitors, 133 PAF antagonists and 48 5HT3 antagonists. In addition to these 380 active compounds, the data set contains 573 “random” MDDR compounds, for a total of 953 molecules.
The benchmark proceeds by determining the 10 nearest neighbours for each of the 380 active compounds. The query is not considered a neighbour of itself. The score for each activity class is the fraction of these neighbours that have the same activity as the query. Finally, the overall score is the average of the scores of the five activity classes.

3. Fingerprint Methods
Historically, the similarity method underlying ChemAxon’s JChem search engine relied upon Chemical Fingerprints (“CF”). These are path-based fingerprints similar to Daylight fingerprints, which allow a number of variants depending upon parameters for the number of bits in the fingerprint, the longest bond path to encode and the number of bits set by each path. The “Marvin CF” below uses the generatemd defaults of 1024 bits, paths of up to 7 bonds, and 3 bits per path. The “JChem CF” below uses the JChemManager defaults of 512 bits, paths of up to 6 bonds and 2 bits per path.
Recently, in v5.4, ChemAxon has added support for the ECFP and FCFP fingerprints originally introduced by Scitegic, now Accelrys [8]. These are termed “ECFP_4” and “FCFP_8” below, indicating the ChemAxon implementation with diameter parameters of 4 bonds and 8 bonds respectively.
For reference comparison to other methods, also shown are LINGOs similarity [4], MACCS 166-bit keys [3], Daylight fingerprints and IBM’s patented InChI-based chemical similarity (US20080004810A1) as used in their SIMPLE product [7].

4. Tanimoto Coefficient
Many ways of comparing similarity between binary fingerprints have been discussed in the literature [9]; generally the best performing of these is the Tanimoto coefficient, $T = |X \cap Y| / |X \cup Y|$. This definition has almost magical properties, normalizing the differences between two feature sets by their sizes, intuitively “the fraction in common”. Experimentally this correlates well to the chemical and biological notion of what makes two molecules similar.

5. Evaluation Results
[Bar chart: benchmark score per method, vertical axis from 0% to 90%. The values read from the chart are listed below.]

Method       Score
ECFP_4       79.4%
FCFP_8       76.0%
LINGOs       75.6%
MACCS        66.2%
Marvin CF    65.2%
JChem CF     63.7%
Daylight     64.0%
IBM          42.0%

6. Fingerprint Saturation
A common failing with binary fingerprints is caused by their inability to represent the number of times a feature (such as a path or substructure) occurs. The fingerprints for decane (C10), undecane (C11) and dodecane (C12) are typically identical, as are those for many protein and DNA sequences. A more powerful representation that solves these issues is to replace occurrence bits with counts, turning binary fingerprints into occurrence histograms.
LINGOs similarity achieves better results on the Briem & Lessel benchmark by using counts instead of bits. However, as described in the Continuous Tanimoto section below, care has to be taken to use a suitable similarity measure for comparing histograms.
ChemAxon have announced that the upcoming release of JChem, version 5.5, will support ECFP and FCFP fingerprints with counts.

7. Continuous Tanimoto
Although there is universal agreement on how the Tanimoto coefficient should be interpreted for binary values, its application to continuous values, such as histogram counts, has been implemented differently by different authors [4,5]. Consider the two alternative definitions T0 and T1 given below.

$$T_0(x, y) = \frac{\sum_{i}^{N} x_i y_i}{\sum_{i}^{N} \left( x_i^2 + y_i^2 - x_i y_i \right)}$$

$$T_1(x, y) = \frac{\sum_{i}^{N} \min(x_i, y_i)}{\sum_{i}^{N} \max(x_i, y_i)}$$

Both definitions agree for binary valued vectors, and are guaranteed to return increasing fractional values between zero and one. Notice, however, that for x = {3} and y = {4}, T1 = 3/4 = 0.75 but T0 = 12/13 ≈ 0.923.
In experiments with LINGOs histograms, T1 was found to be superior (producing an improvement of ~0.9%) whereas T0 actually made the results worse (by ~3%).

8. Conclusions
• ChemAxon’s Chemical Fingerprints perform comparably with other path and feature-based fingerprints (including MACCS 166-bit keys, Daylight fingerprints and PubChem/CACTVS fingerprints). All these methods perform equivalently.
• ECFP fingerprints, originally developed by Scitegic/Accelrys and as recently implemented by ChemAxon, perform exceptionally well on the standard Briem and Lessel benchmark.
• The announced ECFP histograms would be anticipated to set new records in 2D chemical similarity.

9. Acknowledgements
To Miklos Vargyas and Alex Allardyce for the invitation to present a poster at the ChemAxon UGM, to Peter Kovacs for JChem ECFP support and rapid bug fixing, and to AstraZeneca and Vertex Pharmaceuticals for their interest in 2D similarity.

10. Bibliography
1. Hans Briem and Uta F. Lessel, “In vitro and in silico Affinity Fingerprints: Finding Similarities beyond Structural Classes”, Perspectives in Drug Discovery and Design, Vol. 20, pp. 231-244, 2000.
2. Robert D. Brown and Yvonne C. Martin, “Use of Structure-Activity Data to Compare Structure-based Clustering Methods and Descriptors for Use in Compound Selection”, JCICS, Vol. 36, No. 3, pp. 572-582, 1996.
3. Joseph L. Durant, Burton A. Leland, Douglas R. Henry and James G. Nourse, “Reoptimization of MDL Keys for Use in Drug Discovery”, JCICS, Vol. 42, pp. 1273-1280, 2002.
4. J. Andrew Grant, James A. Haigh, Barry T. Pickup, Anthony Nicholls and Roger A. Sayle, “Lingos, Finite State Machines and Fast Similarity Searching”, JCIM, Vol. 46, No. 5, pp. 1912-1918, 2006.
5. Thierry Kogej, Ola Engkvist, Niklas Blomberg and Sorel Muresan, “Multifingerprint Based Similarity Searches for Targeted Class Compound Selection”, JCIM, Vol. 46, No. 3, pp. 1201-1213, 2006.
6. Steven W. Muchmore, Derek A. Debe, James T. Metz, Scott P. Brown, Yvonne C. Martin and Philip J. Hajduk, “Application of Belief Theory to Similarity Data Fusion for Use in Analog Searching and Lead Hopping”, JCIM, Vol. 48, No. 5, pp. 941-948, 2008.
7. James Rhodes, Stephen Boyer, Jeffrey Kreulen, Ying Chen and Patricia Ordonez, “Mining Patents using Molecular Similarity Search”, Pacific Symposium on Biocomputing, Vol. 12, pp. 304-315, 2007.
8. David Rogers and Mathew Hahn, “Extended Connectivity Fingerprints”, JCIM, Vol. 50, No. 5, pp. 742-754, 2010.
9. Peter Willett, John M. Barnard and Geoffrey M. Downs, “Chemical Similarity Searching”, JCICS, Vol. 38, No. 6, pp. 983-996, 1998.
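
As a worked illustration of the Tanimoto definitions in sections 4 and 7, the following is a minimal Python sketch, not the ChemAxon or LINGOs implementation. The function names (tanimoto_binary, tanimoto_t0, tanimoto_t1) are hypothetical, fingerprints are assumed to be equal-length lists of non-negative counts, and returning 1.0 when both inputs are empty is an arbitrary convention chosen here. The single-feature alkane histograms at the end are invented purely to illustrate the saturation issue from section 6.

```python
from typing import Sequence, Set


def tanimoto_binary(x: Set[int], y: Set[int]) -> float:
    """Binary Tanimoto |X n Y| / |X u Y| over sets of feature identifiers."""
    union = len(x | y)
    # Assumption: two empty feature sets are treated as identical.
    return len(x & y) / union if union else 1.0


def tanimoto_t0(x: Sequence[float], y: Sequence[float]) -> float:
    """T0: sum(x_i * y_i) / sum(x_i^2 + y_i^2 - x_i * y_i)."""
    num = sum(a * b for a, b in zip(x, y))
    den = sum(a * a + b * b - a * b for a, b in zip(x, y))
    return num / den if den else 1.0


def tanimoto_t1(x: Sequence[float], y: Sequence[float]) -> float:
    """T1: sum(min(x_i, y_i)) / sum(max(x_i, y_i))."""
    num = sum(min(a, b) for a, b in zip(x, y))
    den = sum(max(a, b) for a, b in zip(x, y))
    return num / den if den else 1.0


# The worked example from section 7: x = {3}, y = {4}.
print(tanimoto_t1([3], [4]))            # 3/4
print(round(tanimoto_t0([3], [4]), 3))  # 12/13, rounded

# On binary vectors both definitions reduce to the set-based Tanimoto.
bits_x, bits_y = [1, 1, 0, 1], [1, 0, 1, 1]
print(tanimoto_t0(bits_x, bits_y), tanimoto_t1(bits_x, bits_y))
print(tanimoto_binary({0, 1, 3}, {0, 2, 3}))

# Hypothetical single-feature histograms: a feature occurring 9 times in
# one alkane and 11 times in another. As bits the two are indistinguishable;
# as counts, T1 separates them.
print(tanimoto_t1([1], [1]), tanimoto_t1([9], [11]))
```

For the x = {3}, y = {4} case this reproduces the values quoted in section 7 (0.75 for T1 and approximately 0.923 for T0), while on the binary example all three functions give the same result.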



NextMove Software Limited
Innovation Centre (Unit 23)
Cambridge Science Park
Milton Road, Cambridge
England CB4 0EY
www.nextmovesoftware.co.uk
www.nextmovesoftware.com

More Related Content

Similar to Benchmarking and Validation of JChem ECFP and FCFP Fingerprints

GPCODON ALIGNMENT: A GLOBAL PAIRWISE CODON BASED SEQUENCE ALIGNMENT APPROACH
GPCODON ALIGNMENT: A GLOBAL PAIRWISE CODON BASED SEQUENCE ALIGNMENT APPROACHGPCODON ALIGNMENT: A GLOBAL PAIRWISE CODON BASED SEQUENCE ALIGNMENT APPROACH
GPCODON ALIGNMENT: A GLOBAL PAIRWISE CODON BASED SEQUENCE ALIGNMENT APPROACHijdms
 
A Biological Sequence Compression Based on cross chromosomal similarities usi...
A Biological Sequence Compression Based on cross chromosomal similarities usi...A Biological Sequence Compression Based on cross chromosomal similarities usi...
A Biological Sequence Compression Based on cross chromosomal similarities usi...CSCJournals
 
An Automatic Clustering Technique for Optimal Clusters
An Automatic Clustering Technique for Optimal ClustersAn Automatic Clustering Technique for Optimal Clusters
An Automatic Clustering Technique for Optimal ClustersIJCSEA Journal
 
Architecture of a morphological malware detector
Architecture of a morphological malware detectorArchitecture of a morphological malware detector
Architecture of a morphological malware detectorUltraUploader
 
The Effect of Updating the Local Pheromone on ACS Performance using Fuzzy Log...
The Effect of Updating the Local Pheromone on ACS Performance using Fuzzy Log...The Effect of Updating the Local Pheromone on ACS Performance using Fuzzy Log...
The Effect of Updating the Local Pheromone on ACS Performance using Fuzzy Log...IJECEIAES
 
An Algorithm For Vector Quantizer Design
An Algorithm For Vector Quantizer DesignAn Algorithm For Vector Quantizer Design
An Algorithm For Vector Quantizer DesignAngie Miller
 
Design of arq and hybrid arq protocols for wireless channels using bch codes
Design of arq and hybrid arq protocols for wireless channels using bch codesDesign of arq and hybrid arq protocols for wireless channels using bch codes
Design of arq and hybrid arq protocols for wireless channels using bch codesIAEME Publication
 
Applications of Artificial Neural Networks in Cancer Prediction
Applications of Artificial Neural Networks in Cancer PredictionApplications of Artificial Neural Networks in Cancer Prediction
Applications of Artificial Neural Networks in Cancer PredictionIRJET Journal
 
An Efficient Genetic Algorithm for Solving Knapsack Problem.pdf
An Efficient Genetic Algorithm for Solving Knapsack Problem.pdfAn Efficient Genetic Algorithm for Solving Knapsack Problem.pdf
An Efficient Genetic Algorithm for Solving Knapsack Problem.pdfNancy Ideker
 
A Comparative Analysis of Feature Selection Methods for Clustering DNA Sequences
A Comparative Analysis of Feature Selection Methods for Clustering DNA SequencesA Comparative Analysis of Feature Selection Methods for Clustering DNA Sequences
A Comparative Analysis of Feature Selection Methods for Clustering DNA SequencesCSCJournals
 
Classification With Ant Colony
Classification With Ant ColonyClassification With Ant Colony
Classification With Ant ColonyGissely Souza
 
OPTIMIZATION OF HEURISTIC ALGORITHMS FOR IMPROVING BER OF ADAPTIVE TURBO CODES
OPTIMIZATION OF HEURISTIC ALGORITHMS FOR IMPROVING BER OF ADAPTIVE TURBO CODESOPTIMIZATION OF HEURISTIC ALGORITHMS FOR IMPROVING BER OF ADAPTIVE TURBO CODES
OPTIMIZATION OF HEURISTIC ALGORITHMS FOR IMPROVING BER OF ADAPTIVE TURBO CODESIAEME Publication
 
OPTIMIZATION OF HEURISTIC ALGORITHMS FOR IMPROVING BER OF ADAPTIVE TURBO CODES
OPTIMIZATION OF HEURISTIC ALGORITHMS FOR IMPROVING BER OF ADAPTIVE TURBO CODESOPTIMIZATION OF HEURISTIC ALGORITHMS FOR IMPROVING BER OF ADAPTIVE TURBO CODES
OPTIMIZATION OF HEURISTIC ALGORITHMS FOR IMPROVING BER OF ADAPTIVE TURBO CODESIAEME Publication
 
Ant Colony Optimization for Optimal Low-Pass State Variable Filter Sizing
Ant Colony Optimization for Optimal Low-Pass State Variable Filter Sizing Ant Colony Optimization for Optimal Low-Pass State Variable Filter Sizing
Ant Colony Optimization for Optimal Low-Pass State Variable Filter Sizing IJECEIAES
 

Similar to Benchmarking and Validation of JChem ECFP and FCFP Fingerprints (20)

GPCODON ALIGNMENT: A GLOBAL PAIRWISE CODON BASED SEQUENCE ALIGNMENT APPROACH
GPCODON ALIGNMENT: A GLOBAL PAIRWISE CODON BASED SEQUENCE ALIGNMENT APPROACHGPCODON ALIGNMENT: A GLOBAL PAIRWISE CODON BASED SEQUENCE ALIGNMENT APPROACH
GPCODON ALIGNMENT: A GLOBAL PAIRWISE CODON BASED SEQUENCE ALIGNMENT APPROACH
 
JBUON-21-1-33
JBUON-21-1-33JBUON-21-1-33
JBUON-21-1-33
 
A Biological Sequence Compression Based on cross chromosomal similarities usi...
A Biological Sequence Compression Based on cross chromosomal similarities usi...A Biological Sequence Compression Based on cross chromosomal similarities usi...
A Biological Sequence Compression Based on cross chromosomal similarities usi...
 
An Automatic Clustering Technique for Optimal Clusters
An Automatic Clustering Technique for Optimal ClustersAn Automatic Clustering Technique for Optimal Clusters
An Automatic Clustering Technique for Optimal Clusters
 
Architecture of a morphological malware detector
Architecture of a morphological malware detectorArchitecture of a morphological malware detector
Architecture of a morphological malware detector
 
Cdma
CdmaCdma
Cdma
 
The Effect of Updating the Local Pheromone on ACS Performance using Fuzzy Log...
The Effect of Updating the Local Pheromone on ACS Performance using Fuzzy Log...The Effect of Updating the Local Pheromone on ACS Performance using Fuzzy Log...
The Effect of Updating the Local Pheromone on ACS Performance using Fuzzy Log...
 
Empirical and quantum mechanical methods of 13 c chemical shifts prediction c...
Empirical and quantum mechanical methods of 13 c chemical shifts prediction c...Empirical and quantum mechanical methods of 13 c chemical shifts prediction c...
Empirical and quantum mechanical methods of 13 c chemical shifts prediction c...
 
An Algorithm For Vector Quantizer Design
An Algorithm For Vector Quantizer DesignAn Algorithm For Vector Quantizer Design
An Algorithm For Vector Quantizer Design
 
Design of arq and hybrid arq protocols for wireless channels using bch codes
Design of arq and hybrid arq protocols for wireless channels using bch codesDesign of arq and hybrid arq protocols for wireless channels using bch codes
Design of arq and hybrid arq protocols for wireless channels using bch codes
 
Applications of Artificial Neural Networks in Cancer Prediction
Applications of Artificial Neural Networks in Cancer PredictionApplications of Artificial Neural Networks in Cancer Prediction
Applications of Artificial Neural Networks in Cancer Prediction
 
An Efficient Genetic Algorithm for Solving Knapsack Problem.pdf
An Efficient Genetic Algorithm for Solving Knapsack Problem.pdfAn Efficient Genetic Algorithm for Solving Knapsack Problem.pdf
An Efficient Genetic Algorithm for Solving Knapsack Problem.pdf
 
A Comparative Analysis of Feature Selection Methods for Clustering DNA Sequences
A Comparative Analysis of Feature Selection Methods for Clustering DNA SequencesA Comparative Analysis of Feature Selection Methods for Clustering DNA Sequences
A Comparative Analysis of Feature Selection Methods for Clustering DNA Sequences
 
Classification With Ant Colony
Classification With Ant ColonyClassification With Ant Colony
Classification With Ant Colony
 
Molecular Biology Software Links
Molecular Biology Software LinksMolecular Biology Software Links
Molecular Biology Software Links
 
OPTIMIZATION OF HEURISTIC ALGORITHMS FOR IMPROVING BER OF ADAPTIVE TURBO CODES
OPTIMIZATION OF HEURISTIC ALGORITHMS FOR IMPROVING BER OF ADAPTIVE TURBO CODESOPTIMIZATION OF HEURISTIC ALGORITHMS FOR IMPROVING BER OF ADAPTIVE TURBO CODES
OPTIMIZATION OF HEURISTIC ALGORITHMS FOR IMPROVING BER OF ADAPTIVE TURBO CODES
 
OPTIMIZATION OF HEURISTIC ALGORITHMS FOR IMPROVING BER OF ADAPTIVE TURBO CODES
OPTIMIZATION OF HEURISTIC ALGORITHMS FOR IMPROVING BER OF ADAPTIVE TURBO CODESOPTIMIZATION OF HEURISTIC ALGORITHMS FOR IMPROVING BER OF ADAPTIVE TURBO CODES
OPTIMIZATION OF HEURISTIC ALGORITHMS FOR IMPROVING BER OF ADAPTIVE TURBO CODES
 
gkv343.pdf
gkv343.pdfgkv343.pdf
gkv343.pdf
 
Ant Colony Optimization for Optimal Low-Pass State Variable Filter Sizing
Ant Colony Optimization for Optimal Low-Pass State Variable Filter Sizing Ant Colony Optimization for Optimal Low-Pass State Variable Filter Sizing
Ant Colony Optimization for Optimal Low-Pass State Variable Filter Sizing
 
D111823
D111823D111823
D111823
 

More from NextMove Software

CINF 170: Regioselectivity: An application of expert systems and ontologies t...
CINF 170: Regioselectivity: An application of expert systems and ontologies t...CINF 170: Regioselectivity: An application of expert systems and ontologies t...
CINF 170: Regioselectivity: An application of expert systems and ontologies t...NextMove Software
 
Building a bridge between human-readable and machine-readable representations...
Building a bridge between human-readable and machine-readable representations...Building a bridge between human-readable and machine-readable representations...
Building a bridge between human-readable and machine-readable representations...NextMove Software
 
CINF 35: Structure searching for patent information: The need for speed
CINF 35: Structure searching for patent information: The need for speedCINF 35: Structure searching for patent information: The need for speed
CINF 35: Structure searching for patent information: The need for speedNextMove Software
 
A de facto standard or a free-for-all? A benchmark for reading SMILES
A de facto standard or a free-for-all? A benchmark for reading SMILESA de facto standard or a free-for-all? A benchmark for reading SMILES
A de facto standard or a free-for-all? A benchmark for reading SMILESNextMove Software
 
Recent Advances in Chemical & Biological Search Systems: Evolution vs Revolution
Recent Advances in Chemical & Biological Search Systems: Evolution vs RevolutionRecent Advances in Chemical & Biological Search Systems: Evolution vs Revolution
Recent Advances in Chemical & Biological Search Systems: Evolution vs RevolutionNextMove Software
 
Can we agree on the structure represented by a SMILES string? A benchmark dat...
Can we agree on the structure represented by a SMILES string? A benchmark dat...Can we agree on the structure represented by a SMILES string? A benchmark dat...
Can we agree on the structure represented by a SMILES string? A benchmark dat...NextMove Software
 
Comparing Cahn-Ingold-Prelog Rule Implementations
Comparing Cahn-Ingold-Prelog Rule ImplementationsComparing Cahn-Ingold-Prelog Rule Implementations
Comparing Cahn-Ingold-Prelog Rule ImplementationsNextMove Software
 
Eugene Garfield: the father of chemical text mining and artificial intelligen...
Eugene Garfield: the father of chemical text mining and artificial intelligen...Eugene Garfield: the father of chemical text mining and artificial intelligen...
Eugene Garfield: the father of chemical text mining and artificial intelligen...NextMove Software
 
Chemical similarity using multi-terabyte graph databases: 68 billion nodes an...
Chemical similarity using multi-terabyte graph databases: 68 billion nodes an...Chemical similarity using multi-terabyte graph databases: 68 billion nodes an...
Chemical similarity using multi-terabyte graph databases: 68 billion nodes an...NextMove Software
 
Recent improvements to the RDKit
Recent improvements to the RDKitRecent improvements to the RDKit
Recent improvements to the RDKitNextMove Software
 
Pharmaceutical industry best practices in lessons learned: ELN implementation...
Pharmaceutical industry best practices in lessons learned: ELN implementation...Pharmaceutical industry best practices in lessons learned: ELN implementation...
Pharmaceutical industry best practices in lessons learned: ELN implementation...NextMove Software
 
Digital Chemical Representations
Digital Chemical RepresentationsDigital Chemical Representations
Digital Chemical RepresentationsNextMove Software
 
Challenges and successes in machine interpretation of Markush descriptions
Challenges and successes in machine interpretation of Markush descriptionsChallenges and successes in machine interpretation of Markush descriptions
Challenges and successes in machine interpretation of Markush descriptionsNextMove Software
 
PubChem as a Biologics Database
PubChem as a Biologics DatabasePubChem as a Biologics Database
PubChem as a Biologics DatabaseNextMove Software
 
CINF 17: Comparing Cahn-Ingold-Prelog Rule Implementations: The need for an o...
CINF 17: Comparing Cahn-Ingold-Prelog Rule Implementations: The need for an o...CINF 17: Comparing Cahn-Ingold-Prelog Rule Implementations: The need for an o...
CINF 17: Comparing Cahn-Ingold-Prelog Rule Implementations: The need for an o...NextMove Software
 
CINF 13: Pistachio - Search and Faceting of Large Reaction Databases
CINF 13: Pistachio - Search and Faceting of Large Reaction DatabasesCINF 13: Pistachio - Search and Faceting of Large Reaction Databases
CINF 13: Pistachio - Search and Faceting of Large Reaction DatabasesNextMove Software
 
Building on Sand: Standard InChIs on non-standard molfiles
Building on Sand: Standard InChIs on non-standard molfilesBuilding on Sand: Standard InChIs on non-standard molfiles
Building on Sand: Standard InChIs on non-standard molfilesNextMove Software
 
Chemical Structure Representation of Inorganic Salts and Mixtures of Gases: A...
Chemical Structure Representation of Inorganic Salts and Mixtures of Gases: A...Chemical Structure Representation of Inorganic Salts and Mixtures of Gases: A...
Chemical Structure Representation of Inorganic Salts and Mixtures of Gases: A...NextMove Software
 
Advanced grammars for state-of-the-art named entity recognition (NER)
Advanced grammars for state-of-the-art named entity recognition (NER)Advanced grammars for state-of-the-art named entity recognition (NER)
Advanced grammars for state-of-the-art named entity recognition (NER)NextMove Software
 

More from NextMove Software (20)

DeepSMILES
DeepSMILESDeepSMILES
DeepSMILES
 
CINF 170: Regioselectivity: An application of expert systems and ontologies t...
CINF 170: Regioselectivity: An application of expert systems and ontologies t...CINF 170: Regioselectivity: An application of expert systems and ontologies t...
CINF 170: Regioselectivity: An application of expert systems and ontologies t...
 
Building a bridge between human-readable and machine-readable representations...
Building a bridge between human-readable and machine-readable representations...Building a bridge between human-readable and machine-readable representations...
Building a bridge between human-readable and machine-readable representations...
 
CINF 35: Structure searching for patent information: The need for speed
CINF 35: Structure searching for patent information: The need for speedCINF 35: Structure searching for patent information: The need for speed
CINF 35: Structure searching for patent information: The need for speed
 
A de facto standard or a free-for-all? A benchmark for reading SMILES
A de facto standard or a free-for-all? A benchmark for reading SMILESA de facto standard or a free-for-all? A benchmark for reading SMILES
A de facto standard or a free-for-all? A benchmark for reading SMILES
 
Recent Advances in Chemical & Biological Search Systems: Evolution vs Revolution
Recent Advances in Chemical & Biological Search Systems: Evolution vs RevolutionRecent Advances in Chemical & Biological Search Systems: Evolution vs Revolution
Recent Advances in Chemical & Biological Search Systems: Evolution vs Revolution
 
Can we agree on the structure represented by a SMILES string? A benchmark dat...
Can we agree on the structure represented by a SMILES string? A benchmark dat...Can we agree on the structure represented by a SMILES string? A benchmark dat...
Can we agree on the structure represented by a SMILES string? A benchmark dat...
 
Comparing Cahn-Ingold-Prelog Rule Implementations
Comparing Cahn-Ingold-Prelog Rule ImplementationsComparing Cahn-Ingold-Prelog Rule Implementations
Comparing Cahn-Ingold-Prelog Rule Implementations
 
Eugene Garfield: the father of chemical text mining and artificial intelligen...
Benchmarking and Validation of JChem ECFP and FCFP Fingerprints

Roger Sayle, NextMove Software Ltd, Cambridge, UK
roger@nextmovesoftware.co.uk

1. Overview

The cornerstone of pharmaceutical chemistry is Crum Brown’s observation that similar compounds have similar therapeutic benefits. Cheminformatics tries to capture this insight by defining measures of similarity between the computer representations of two molecules, with the hope of capturing a medicinal chemist’s intuitive sense of “likeness”, and thereby correlating with bioactivity.

This poster evaluates the chemical similarity measures offered by ChemAxon on a standard reference benchmark. Any such benchmark must by necessity be flawed; the similarity between two molecules is influenced by the framework by which they are compared [2]. However, a robust similarity measure should typically perform better on such benchmarks, whilst a weaker model of chemical similarity would be expected to perform worse (on average).

2. Briem & Lessel Benchmark

The benchmark employed in this evaluation is the commonly used Briem and Lessel benchmark [1]. This test assesses a method’s ability to identify near neighbours with the same biological activity from decoy molecules and molecules with different biological activities. Five classes of active compounds are used: 40 ACE inhibitors, 49 TXA antagonists, 110 HMG-CoA reductase inhibitors, 133 PAF antagonists and 48 5HT3 antagonists. In addition to these 380 active compounds, the data set contains 573 “random” MDDR compounds, for a total of 953 molecules.

The benchmark proceeds by determining the 10 nearest neighbours for each of the 380 active compounds. The query is not considered a neighbour of itself. The score for each activity class is the fraction of these neighbours that have the same activity as the query. Finally, the overall score is the average of the scores of the five activity classes.

3. Fingerprint Methods

Historically, the similarity method underlying ChemAxon’s JChem search engine relied upon Chemical Fingerprints (“CF”). These are path-based fingerprints similar to Daylight fingerprints, which allow a number of variants depending upon parameters for the number of bits in the fingerprint, the longest bond path to encode and the number of bits set by each path. The “Marvin FP” below uses the generatemd defaults of 1024 bits, paths of up to 7 bonds, and 3 bits per path. The “JChem FP” below uses the JChemManager defaults of 512 bits, paths up to length 6 and 2 bits per path.

Recently, in v5.4, ChemAxon has added support for ECFP and FCFP fingerprints originally introduced by Scitegic, now Accelrys [8]. These are termed “ECFP_4” and “FCFP_8” below, indicating the ChemAxon implementation with diameter parameters of 4 bonds and 8 bonds respectively.

For reference comparison to other methods, also shown are LINGOs similarity [4], MACCS 166-bit keys [3], Daylight fingerprints and IBM’s patented InChI-based chemical similarity (US20080004810A1) as used in their SIMPLE product [7].

4. Tanimoto Coefficient

Many ways of comparing similarity between binary fingerprints have been discussed in the literature [9]; generally the best performing of these is the Tanimoto coefficient,

$$T = \frac{|X \cap Y|}{|X \cup Y|}$$

This definition has almost magical properties, normalizing the differences between two feature sets by their sizes, intuitively “the fraction in common”. Experimentally this correlates well with the chemical and biological notion of what makes two molecules similar.

5. Evaluation Results

[Figure: bar chart of Briem & Lessel benchmark scores (vertical axis 0% to 90%) for ECFP_4, FCFP_8, LINGOs, MACCS, Marvin CF, JChem CF, Daylight and IBM. Bar value labels as extracted: 79.4%, 76.0%, 75.6%, 66.2%, 65.2%, 63.7%, 64.0%, 42.0%.]

6. Fingerprint Saturation

A common failing of binary fingerprints is their inability to represent the number of times a feature (such as a path or substructure) occurs. The fingerprints for decane (C10), undecane (C11) and dodecane (C12) are typically identical, as are those for many protein and DNA sequences. A more powerful representation that solves these issues is to replace occurrence bits with counts, turning binary fingerprints into occurrence histograms.

LINGOs similarity achieves better results on the Briem & Lessel benchmark by using counts instead of bits. However, as described in the Continuous Tanimoto section below, care has to be taken to use a suitable similarity measure for comparing histograms.

ChemAxon have announced that the upcoming release of JChem, version 5.5, will support ECFP and FCFP fingerprints with counts.

7. Continuous Tanimoto

Although there is universal agreement on how the Tanimoto coefficient should be interpreted for binary values, its application to continuous values, such as histogram counts, has been implemented differently by different authors [4,5]. Consider the two alternative definitions T0 and T1 given below.

$$T_0(x, y) = \frac{\sum_i^N x_i y_i}{\sum_i^N \left( x_i^2 + y_i^2 - x_i y_i \right)} \qquad T_1(x, y) = \frac{\sum_i^N \min(x_i, y_i)}{\sum_i^N \max(x_i, y_i)}$$

Both definitions agree for binary valued vectors, and are guaranteed to return increasing fractional values between zero and one. Notice however that for x = { 3 } and y = { 4 }, T1 = 3/4 = 0.75 but T0 = 12/13 ≈ 0.923.

In experiments with LINGOs histograms, T1 was found to be superior (producing an improvement of ~0.9%) whereas T0 actually made the results worse (by ~3%).

8. Conclusions

• ChemAxon’s Chemical Fingerprints perform comparably with other path and feature-based fingerprints (including MACCS 166-bit keys, Daylight fingerprints and PubChem/CACTVS fingerprints). All these methods perform equivalently.
• ECFP fingerprints, originally developed by Scitegic/Accelrys and as recently implemented by ChemAxon, perform exceptionally well on the standard Briem and Lessel benchmark.
• The announced ECFP histograms would be anticipated to set new records in 2D chemical similarity.

9. Acknowledgements

To Miklos Vargyas and Alex Allardyce for the invitation to present a poster at the ChemAxon UGM, to Peter Kovacs for JChem ECFP support and rapid bug fixing, and to AstraZeneca and Vertex Pharmaceuticals for their interest in 2D similarity.

10. Bibliography

1. Hans Briem and Uta F. Lessel, “In vitro and in silico Affinity Fingerprints: Finding Similarities beyond Structural Classes”, Perspectives in Drug Discovery and Design, Vol. 20, pp. 231-244, 2000.
2. Robert D. Brown and Yvonne C. Martin, “Use of Structure-Activity Data to Compare Structure-based Clustering Methods and Descriptors for Use in Compound Selection”, JCICS, Vol. 36, No. 3, pp. 572-582, 1996.
3. Joseph L. Durant, Burton A. Leland, Douglas R. Henry and James G. Nourse, “Reoptimization of MDL Keys for Use in Drug Discovery”, JCIM, Vol. 42, pp. 1273-1280, 2002.
4. J. Andrew Grant, James A. Haigh, Barry T. Pickup, Anthony Nicholls and Roger A. Sayle, “Lingos, Finite State Machines and Fast Similarity Searching”, JCIM, Vol. 46, No. 5, pp. 1912-1918, 2006.
5. Thierry Kogej, Ola Engkvist, Niklas Blomberg and Sorel Muresan, “Multifingerprint Based Similarity Searches for Targeted Class Compound Selection”, JCIM, Vol. 46, No. 3, pp. 1201-1213, 2006.
6. Steven W. Muchmore, Derek A. Debe, James T. Metz, Scott P. Brown, Yvonne C. Martin and Philip J. Hajduk, “Application of Belief Theory to Similarity Data Fusion for Use in Analog Searching and Lead Hopping”, JCIM, Vol. 48, No. 5, pp. 941-948, 2008.
7. James Rhodes, Stephen Boyer, Jeffrey Kreule, Ying Chen and Patricia Ordonez, “Mining Patents using Molecular Similarity Search”, Pacific Symposium on Biocomputing, Vol. 12, pp. 304-315, 2007.
8. David Rogers and Mathew Hahn, “Extended Connectivity Fingerprints”, JCIM, Vol. 50, No. 5, pp. 742-754, 2010.
9. Peter Willett, John M. Barnard and Geoffrey M. Downs, “Chemical Similarity Searching”, JCICS, Vol. 38, No. 6, pp. 983-996, 1998.

NextMove Software Limited, Innovation Centre (Unit 23), Cambridge Science Park, Milton Road, Cambridge, England CB4 0EY
www.nextmovesoftware.co.uk
www.nextmovesoftware.com
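
Illustrative sketch (not part of the original poster). The similarity measures of sections 4 and 7 and the scoring protocol of section 2 can be sketched in a few lines of Python. Everything named here is an assumption made for illustration: the function names, the dictionary-of-counts representation of a histogram, the convention that two empty fingerprints score 1.0, the index-order tie-breaking among equally similar neighbours, and in particular `chain_path_counts`, a toy enumeration of linear carbon paths that stands in for a real path-based fingerprint and is not ChemAxon's or Daylight's algorithm.

```python
def tanimoto_binary(x, y):
    """Binary Tanimoto |X & Y| / |X | Y| over sets of feature identifiers."""
    x, y = set(x), set(y)
    union = len(x | y)
    return len(x & y) / union if union else 1.0  # assumed convention for empty sets


def tanimoto_t0(x, y):
    """T0 = sum(x_i*y_i) / sum(x_i^2 + y_i^2 - x_i*y_i) over feature counts."""
    keys = set(x) | set(y)
    num = sum(x.get(k, 0) * y.get(k, 0) for k in keys)
    den = sum(x.get(k, 0) ** 2 + y.get(k, 0) ** 2 - x.get(k, 0) * y.get(k, 0)
              for k in keys)
    return num / den if den else 1.0


def tanimoto_t1(x, y):
    """T1 = sum(min(x_i, y_i)) / sum(max(x_i, y_i)) over feature counts."""
    keys = set(x) | set(y)
    num = sum(min(x.get(k, 0), y.get(k, 0)) for k in keys)
    den = sum(max(x.get(k, 0), y.get(k, 0)) for k in keys)
    return num / den if den else 1.0


def chain_path_counts(n_carbons, max_atoms=8):
    """Toy histogram (hypothetical): occurrences of each linear path of
    1..max_atoms carbons in an unbranched alkane with n_carbons atoms."""
    return {"C" * length: n_carbons - length + 1
            for length in range(1, min(n_carbons, max_atoms) + 1)}


def nearest_neighbour_score(labels, similarity, k=10):
    """Score in the style of section 2.

    labels[i] is an activity class name, or None for a decoy.
    similarity(i, j) returns a float; larger means more similar.
    Returns (per-class fraction of same-class neighbours, mean over classes).
    """
    hits, totals = {}, {}
    n = len(labels)
    for q in range(n):
        cls = labels[q]
        if cls is None:
            continue  # decoys are never used as queries
        others = [j for j in range(n) if j != q]  # the query is not its own neighbour
        others.sort(key=lambda j: similarity(q, j), reverse=True)
        top = others[:k]
        hits[cls] = hits.get(cls, 0) + sum(1 for j in top if labels[j] == cls)
        totals[cls] = totals.get(cls, 0) + len(top)
    class_scores = {c: hits[c] / totals[c] for c in hits}
    overall = sum(class_scores.values()) / len(class_scores)
    return class_scores, overall


# Saturation: decane and dodecane share the same set of features,
# but their count histograms differ.
decane, dodecane = chain_path_counts(10), chain_path_counts(12)
print(tanimoto_binary(decane, dodecane), tanimoto_t1(decane, dodecane))
```

With this toy enumeration the binary Tanimoto cannot separate the two alkanes, while the count-based T1 can, which is the saturation effect described in section 6. Calling `tanimoto_t0` and `tanimoto_t1` on the single-feature histograms `{"f": 3}` and `{"f": 4}` reproduces the 12/13 and 3/4 values quoted in section 7.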