Talk at SMASH 2011

Automatic Generation of Negative Control Structures for Automated Structure Verification Systems

The generation of positive and negative controls is a fundamental part of good experimental design. Getting a positive outcome from a test performed on a subject known to give a positive result reassures the scientist that the test is working properly. Just as important, if not more so, is testing subjects known to give negative results. Getting a negative outcome when one is expected validates the test and increases confidence in the results when the test is applied to unknowns.

Automated Structure Verification (ASV) is no different from any other scientific test. Positive as well as negative controls should be tested frequently to optimize performance and to obtain a measure of robustness and confidence in the results.

In this poster I will show how to automatically generate relevant negative control structures for any type of NMR data. Furthermore, I will argue that ASV systems fall into the category of binary classifiers and that their performance can be measured by a host of metrics already in use in the fields of statistical classification and signal detection theory.
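
As a rough illustration of the binary-classifier view, the sketch below computes a few standard classification metrics from ASV scores of positive and negative controls. It is only a minimal example under stated assumptions: the function names, the scores, and the 0.6 acceptance threshold are hypothetical and do not come from the talk or from any ASV product.

```python
from dataclasses import dataclass


@dataclass
class ConfusionCounts:
    tp: int  # positive controls accepted (score >= threshold)
    fn: int  # positive controls rejected
    fp: int  # negative controls accepted
    tn: int  # negative controls rejected


def confusion_counts(scores, is_positive, threshold):
    """Tally outcomes when a structure passes if its score >= threshold."""
    counts = ConfusionCounts(0, 0, 0, 0)
    for score, positive in zip(scores, is_positive):
        accepted = score >= threshold
        if positive and accepted:
            counts.tp += 1
        elif positive:
            counts.fn += 1
        elif accepted:
            counts.fp += 1
        else:
            counts.tn += 1
    return counts


def _ratio(numerator, denominator):
    # Return NaN rather than raising when a class is empty.
    return numerator / denominator if denominator else float("nan")


def sensitivity(c):
    """True positive rate: fraction of positive controls accepted."""
    return _ratio(c.tp, c.tp + c.fn)


def specificity(c):
    """True negative rate: fraction of negative controls rejected."""
    return _ratio(c.tn, c.tn + c.fp)


def precision(c):
    """Fraction of accepted structures that are truly positive."""
    return _ratio(c.tp, c.tp + c.fp)


def accuracy(c):
    """Fraction of all controls classified correctly."""
    return _ratio(c.tp + c.tn, c.tp + c.fn + c.fp + c.tn)


# Hypothetical ASV scores: three positive and three negative controls.
example_scores = [0.92, 0.85, 0.40, 0.75, 0.30, 0.10]
example_labels = [True, True, True, False, False, False]
example = confusion_counts(example_scores, example_labels, threshold=0.6)
print(example, sensitivity(example), specificity(example))
```

Raising the threshold trades sensitivity for specificity, which is one way to express a false-positive-tolerant or false-negative-tolerant strategy.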

Published in: Technology

Transcript

  • 1. Automatic Generation of Negative Control Structures for Automated Structure Verification Systems. Gonzalo Hernández, SMASH 2011, Chamonix, France
  • 2. Outline
      • Goal
      • Similarity Calculation Overview
      • NMR Specific Fingerprint Development
      • Method Validation
      • Applications
          • Database Searching
          • Automated Structure Verification (ASV)
  • 3. Goal
      • To develop a method that, given a target chemical structure, would rank other proposed structures by the expected similarity of their NMR data, without a priori knowledge of that data.
      [Figure label: Increased Similarity]
  • 4. How to Achieve Our Goal
      • Calculate a molecular similarity coefficient predictive of NMR data similarity.
      • Develop an NMR-specific molecular fingerprint.
  • 5. Molecular Similarity vs. NMR Data Similarity
      Molecular Fingerprints
      • A molecular fingerprint is a collection of descriptors that is used to characterize a molecule. For example, the number and type of functional groups, molecular formula, etc.
      • Different metrics can be calculated between fingerprints to find their similarity or dissimilarity.
      • Most common fingerprints are: Public MDL keys, fcp4, fragment-based, etc.
      [Figure: example chemical structures]
      NMR Data Similarity
      • Which two molecules are structurally most similar?
      • Which molecules would present the most similar NMR data?
      • How to answer the previous question without knowing the actual NMR data?
  • 6. NMR-Specific Molecular Similarity Coefficient
      Similarity based on Chemical Environments Around Carbon Atoms
      • Define the most common chemical environments up to three shells emanating from a carbon atom
      • Assemble them as bits of a fingerprint
      • Count how many times each fingerprint bit (environment) is present in each molecule
      • Calculate similarity between two molecules as the Euclidean distance between two fingerprints
      [Figure: example structures with environments [CH1]([CH3])(OC)[CH1](C)C and [cH1]([cH0](C)c)[cH1]c]
      SMARTS
      SMiles ARbitrary Target Specification (SMARTS) is a language for specifying substructural patterns in molecules.
      • [#6]: any carbon atom
      • [CH3]: methyl group
      • [n;!H0]: pyrrole-type nitrogen
      • [#7,#8;!H0]: hydrogen bond donor
  • 7. Fingerprint Development
      1. Generate all combinations of SMARTS code strings B_i ( b_j ( R_k ) )_l, where:
          B_i = { [CH3], [CH2], [CH1], [cH1] }
          b_j = { -, =, #, : }
          R_k = { C, N, O, S, F, Cl, Br, I, c, n, o, s }
          l = i - j + 1, l > 0
      2. Extract all chemical environments up to three shells from a large compound database
          • The database contained about 4.6 million compounds, extracted from PubChem, for a total of 82 million chemical environments
  • 8. Method Validation Test set of 100 commercial compounds Calculate pairwise Molecular Similarity between all pairs (4950 pairs total) Predict 1H, 13C, and construct 1H-13C HSQC data Calculate Spectral Similarity (1D and 2D binning) Compare Molecular Similarity vs Spectral Similarity for all pairs
  • 9. Molecular Similarity vs. Spectral Similarity
      • Similarity measured as distance. Smaller numbers mean greater similarity
      • Molecular fingerprint contains 28,833 chemical environments (bits)
      • Spectral Similarity calculated using 2D binning and the Euclidean distance metric
  • 10. Molecular Similarity vs. Spectral Similarity
      • Similarity measured as distance. Smaller numbers mean greater similarity
      • Molecular fingerprint contains 28,833 chemical environments (bits)
      • Spectral Similarity calculated using 2D binning and the Euclidean distance metric
  • 11. Molecular Similarity vs. Spectral Similarity
      • Similarity measured as distance. Smaller numbers mean greater similarity
      • Molecular fingerprint contains 28,833 chemical environments (bits)
      • Spectral Similarity calculated using 2D binning and the Euclidean distance metric
  • 12. 1H-1D NMR Data
      • Predicted similarity was calculated using a 1H-specific fingerprint containing 100,000 unique three-shell chemical environments (bits)
      • Actual similarity was calculated as a 1D binning of the predicted 1H-1D spectra
      • In both cases the metric used was Euclidean distance between fingerprint bits
  • 13. 13C-1D NMR Data
      • Predicted similarity was calculated using a 13C-specific fingerprint containing 200,000 bits
      • Actual similarity was calculated as a 1D binning of the predicted 13C-1D spectra
      • In both cases the metric used was Euclidean distance between fingerprint bits
  • 14. 1H-13C HSQC 2D NMR Data
      • Predicted similarity was calculated using a H-C correlation-specific fingerprint containing 50,000 bits
      • Actual similarity was calculated as a 1D binning of the predicted 13C-1D spectra
      • In both cases the metric used was Euclidean distance between fingerprint bits
  • 15. Test Set (Database Search) (MW <= 250 Da, 1 CH3, 3 CH2, 1 CH, 4 Ar)
      [Figure: spectra of ten candidate structures, a to j (axes f1 and f2, ppm), and a pairwise similarity matrix of Molecule A vs. Molecule B]
  • 16. Automated Structure Verification
      Are the chemical structure and the NMR data consistent with each other?
      • Procedure:
          • Predict NMR data from the proposed structure
          • Compare to experimental data (1H, 1H-13C HSQC)
          • Calculate a matching score
      • Not seeking full structure elucidation or accurate assignments
      Why do this?
      • Best way to deal with a large number of simple compounds (i.e. libraries, reagents, etc.)
      • Leave interesting problems for manual analysis
  • 17. ASV of Negative Control Structures
      Test Set
      • 10 Positive Control Structures
      • 5 Negative Control structures generated automatically
      • ASV run on all 6 structures against experimental NMR data (1H-1D and HSQC) [1]
      [Figure: four plots of ASV Score (0.00 to 1.00) vs. Molecular Similarity for positive controls PC-1 to PC-10]
      [1] ASV was run by Phil Keyes at Lexicon Pharmaceuticals using the ACDLabs ASV system
  • 18. Negative Controls for PC1
      [Figure: ASV Score (0.00 to 1.00) vs. Molecular Similarity (0.00 to 25.00); series PC-1, PC-2, PC-3]
  • 19. Negative Controls for PC5
      [Figure: ASV Score (0.00 to 1.00) vs. Molecular Similarity (0.00 to 25.00); series PC-4, PC-5, PC-6, with the positive control marked]
  • 20. ASV is a Binary Classifier
      • The yellow band is a myth
      • A binary classifier is a system that selects between two options
      • Binary classification is a well understood, well developed area of statistical analysis with many metrics at our disposal
      • Used in many fields, including decision making, machine learning, and signal detection theory
      • Set your strategy (false positive/negative tolerant) and live with it
  • 21. Summary Developed a molecular similarity method predictive of NMR data similarity for 1H-1D, 13C-1D and 1H-13C HSQC data Similarity calculation can be used for other purposes like CASE studies if linked to a structure generator The confidence level of an autoverification can be calculated by challenging the system with negative control structures of known similarity to the proposed structure
  • 22. Acknowledgments
      • Lexicon Pharmaceuticals: Giovanni Cianchetta, Phil Keyes, Funding
      • Modgraph: Jeff Seymour
      • MestreLab: Carlos Cobas, Chen Peng
      • ACDLabs: Ryan Sasaki, Sergey Golotvin
      • Open Source Community: OpenBabel
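
The similarity calculation described in slides 6 to 9 can be sketched roughly as follows. This is only an illustrative outline, not the author's implementation: the function names are hypothetical, each molecule's fingerprint is assumed to be already available as a mapping from an environment label to its count (in practice the counts would come from SMARTS matching with a cheminformatics toolkit), and the 0.5 ppm bin width over a 0 to 10 ppm window is an arbitrary choice.

```python
import math


def fingerprint_distance(fp_a, fp_b):
    """Euclidean distance between two count fingerprints (dict: label -> count).

    Environments missing from one molecule count as zero.
    """
    labels = set(fp_a) | set(fp_b)
    return math.sqrt(sum((fp_a.get(k, 0) - fp_b.get(k, 0)) ** 2 for k in labels))


def bin_peaks(shifts_ppm, lo=0.0, hi=10.0, width=0.5):
    """1D binning: count peaks falling in each fixed-width chemical shift bin."""
    n_bins = int(round((hi - lo) / width))
    bins = [0] * n_bins
    for shift in shifts_ppm:
        if lo <= shift < hi:
            index = min(int((shift - lo) // width), n_bins - 1)
            bins[index] += 1
    return bins


def spectral_distance(shifts_a, shifts_b, **binning):
    """Euclidean distance between two binned peak lists."""
    return math.dist(bin_peaks(shifts_a, **binning), bin_peaks(shifts_b, **binning))


def n_pairs(n_compounds):
    """Number of unordered pairs in a test set."""
    return math.comb(n_compounds, 2)


# Hypothetical fingerprints; the keys are illustrative SMARTS-like labels.
mol_a = {"[CH3]C": 2, "[cH1](c)c": 4}
mol_b = {"[CH3]C": 1, "[cH1](c)c": 4, "[CH2](C)O": 1}
print(fingerprint_distance(mol_a, mol_b))

# Hypothetical 1H peak positions in ppm.
print(spectral_distance([1.2, 1.3, 7.25], [1.2, 7.25, 7.3]))

# A test set of 100 compounds gives the 4950 pairs quoted in slide 8.
print(n_pairs(100))
```

In both cases a smaller distance means greater similarity, matching the convention stated in slide 9.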
