Automatic Generation of
Negative Control Structures
 for Automated Structure
   Verification Systems

       Gonzalo Hernández
          SMASH 2011
         Chamonix, France
Outline

   Goal
   Similarity Calculation Overview
   NMR Specific Fingerprint Development
   Method Validation
   Applications
   Database Searching
   Automated Structure Verification (ASV)
Goal
• To develop a method that, given a target chemical
  structure, ranks other proposed structures by the
  expected similarity of their NMR data, without
  a priori knowledge of that data.

[Figure label: Increased Similarity]
How to Achieve Our Goal

• Calculate a molecular similarity coefficient predictive
  of NMR data similarity.
• Develop an NMR-specific molecular fingerprint
Molecular Similarity vs. NMR Data Similarity
Molecular Fingerprints
•   A molecular fingerprint is a collection of descriptors used to characterize a
    molecule, for example the number and type of functional groups or the molecular
    formula.
•   Different metrics can be calculated between fingerprints to measure their similarity
    or dissimilarity.
•   The most common fingerprints are public MDL keys, FCFP4, fragment-based, etc.
[Figure: example chemical structures.]

NMR Data Similarity
•   Which two molecules are structurally most similar?
•   Which molecules would present the most similar NMR data?
•   How can the previous question be answered without knowing the actual NMR data?
NMR-Specific Molecular Similarity Coefficient
Similarity based on Chemical Environments Around Carbon Atoms
•   Define the most common chemical environments up to three shells emanating from a
    carbon atom
•   Assemble them as bits of a fingerprint
•   Count how many times each fingerprint bit (environment) is present in each molecule
•   Calculate similarity between two molecules as the Euclidean distance between their
    fingerprints

Example environment (SMARTS): [CH1]([CH3])(OC)[CH1](C)C
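The counting and distance steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in the talk: the environment strings and their counts are made-up placeholders, and `euclidean_distance` is a hypothetical helper name.

```python
import math
from collections import Counter

def euclidean_distance(fp_a, fp_b):
    """Euclidean distance between two count fingerprints (bit -> count)."""
    bits = set(fp_a) | set(fp_b)
    return math.sqrt(sum((fp_a.get(b, 0) - fp_b.get(b, 0)) ** 2 for b in bits))

# Hypothetical counts of how often each environment (bit) occurs per molecule.
mol_a = Counter({"[CH3]C": 2, "[CH2](C)O": 1, "[cH1](c)c": 4})
mol_b = Counter({"[CH3]C": 1, "[CH2](C)O": 1, "[cH1](c)c": 5})
mol_c = Counter({"[CH3]O": 1, "[CH1](C)(C)N": 2})

d_ab = euclidean_distance(mol_a, mol_b)  # shares all bits, counts differ slightly
d_ac = euclidean_distance(mol_a, mol_c)  # shares no bits
```

Because similarity is expressed as a distance, the smaller value (`d_ab` here) marks the pair expected to give the more similar NMR data.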
SMARTS
SMiles ARbitrary Target Specification (SMARTS) is a
language for specifying substructural patterns in
molecules.

[#6]                any carbon atom
[CH3]               methyl group
[n;!H0]             pyrrole-type nitrogen
[#7,#8;!H0]         hydrogen bond donor

Example environment: [cH1]([cH0](C)c)[cH1]c
Fingerprint Development
1. Generate all combinations of SMARTS code strings
                   Bi ( bj ( Rk ) )l
     Where:
              Bi = { [CH3], [CH2], [CH1], [cH1] }
              bj = { -, =, #, : }
              Rk = { C, N, O, S, F, Cl, Br, I, c, n, o, s }
              l = i – j + 1, l > 0


2. Extract all chemical environments up to three shells
   from a large compound database
  – The database contained about 4.6 million compounds
    extracted from PubChem, for a total of 82 million
    chemical environments
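As a rough sketch of what "up to three shells" means, the atoms around a starting carbon can be grouped by bond distance with a breadth-first search. The adjacency-dictionary representation, the `shells_around` helper and the example molecule are assumptions made for illustration; the sketch ignores bond orders and hydrogen counts, which a real environment description would need.

```python
from collections import deque

def shells_around(adjacency, start, max_shell=3):
    """Group atoms by bond distance (1..max_shell) from `start` using BFS."""
    distance = {start: 0}
    queue = deque([start])
    shells = {n: [] for n in range(1, max_shell + 1)}
    while queue:
        atom = queue.popleft()
        if distance[atom] == max_shell:
            continue  # do not expand beyond the outermost shell
        for neighbor in adjacency[atom]:
            if neighbor not in distance:
                distance[neighbor] = distance[atom] + 1
                shells[distance[neighbor]].append(neighbor)
                queue.append(neighbor)
    return shells

# Hypothetical heavy-atom graph of ethyl acetate: CH3-C(=O)-O-CH2-CH3
elements = {0: "C", 1: "C", 2: "O", 3: "O", 4: "C", 5: "C"}
adjacency = {0: [1], 1: [0, 2, 3], 2: [1], 3: [1, 4], 4: [3, 5], 5: [4]}

shells = shells_around(adjacency, start=0)
environment = {n: sorted(elements[a] for a in atoms) for n, atoms in shells.items()}
```

Starting from the acetyl methyl carbon (atom 0), the terminal CH3 of the ethyl group lies four bonds away and so falls outside the three-shell environment.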
Method Validation

• Test set of 100 commercial compounds
• Calculate pairwise Molecular Similarity between all
  pairs (4950 pairs total)
• Predict 1H and 13C data, and construct 1H-13C HSQC data
• Calculate Spectral Similarity (1D and 2D binning)
• Compare Molecular Similarity vs. Spectral Similarity
  for all pairs
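The pair count follows from 100 x 99 / 2 = 4950. The sketch below enumerates the pairs and adds one possible way to compare two lists of distances. The choice of a Pearson correlation is an assumption (the slides do not say how the comparison was scored), `pearson` is a hypothetical helper, and the three-value lists are toy numbers, not data from the study.

```python
import math
from itertools import combinations

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

n_compounds = 100
pairs = list(combinations(range(n_compounds), 2))  # every unordered pair once

# Toy distances for three pairs, only to show the call.
molecular_distance = [1.0, 2.0, 3.0]
spectral_distance = [2.0, 4.0, 6.0]
r = pearson(molecular_distance, spectral_distance)
```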
Molecular Similarity vs. Spectral Similarity
• Similarity is measured as a distance; smaller
  numbers mean greater similarity
• The molecular fingerprint contains 28,833
  chemical environments (bits)
• Spectral similarity was calculated using 2D
  binning and the Euclidean distance metric
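A minimal sketch of 2D binning followed by a Euclidean distance is given below. The grid ranges follow the axis limits of the spectra shown later in the deck (0-10 ppm for 1H, 0-160 ppm for 13C); the bin sizes, the helper names `bin_peaks_2d` and `spectral_distance`, and the peak lists are assumptions chosen for illustration.

```python
import math

def bin_peaks_2d(peaks, h_range=(0.0, 10.0), c_range=(0.0, 160.0),
                 h_bins=20, c_bins=32):
    """Count (1H ppm, 13C ppm) peaks on a regular grid; returns a flat list."""
    grid = [0] * (h_bins * c_bins)
    h_width = (h_range[1] - h_range[0]) / h_bins
    c_width = (c_range[1] - c_range[0]) / c_bins
    for h_ppm, c_ppm in peaks:
        if not (h_range[0] <= h_ppm <= h_range[1]
                and c_range[0] <= c_ppm <= c_range[1]):
            continue  # ignore peaks outside the binned window
        i = min(int((h_ppm - h_range[0]) / h_width), h_bins - 1)
        j = min(int((c_ppm - c_range[0]) / c_width), c_bins - 1)
        grid[i * c_bins + j] += 1
    return grid

def spectral_distance(peaks_a, peaks_b):
    """Euclidean distance between the binned versions of two peak lists."""
    a, b = bin_peaks_2d(peaks_a), bin_peaks_2d(peaks_b)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical HSQC peak lists as (1H ppm, 13C ppm) tuples.
spectrum_a = [(1.25, 14.1), (4.12, 60.4), (7.30, 128.5)]
spectrum_b = [(1.30, 14.6), (4.05, 61.0), (7.80, 131.0)]
spectrum_c = [(2.10, 30.2), (3.60, 52.0)]

d_ab = spectral_distance(spectrum_a, spectrum_b)
d_ac = spectral_distance(spectrum_a, spectrum_c)
```

Coarser bins make the distance more tolerant of small shift differences, at the cost of resolving power; the bin size is therefore a tuning choice, not a fixed part of the method.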
1H-1D NMR Data
• Predicted similarity was calculated using a 1H-specific
  fingerprint containing 100,000 unique three-shell
  chemical environments (bits)
• Actual similarity was calculated as a 1D binning of
  the predicted 1H-1D spectra
• In both cases the metric used was the Euclidean distance
  between fingerprint bits
13C-1D NMR Data
• Predicted similarity was calculated using a 13C-specific
  fingerprint containing 200,000 bits
• Actual similarity was calculated as a 1D binning
  of the predicted 13C-1D spectra
• In both cases the metric used was the Euclidean distance
  between fingerprint bits
1H-13C HSQC 2D NMR Data
• Predicted similarity was calculated using an H-C
  correlation-specific fingerprint containing 50,000 bits
• Actual similarity was calculated as a 2D binning of the
  predicted 1H-13C HSQC spectra
• In both cases the metric used was the Euclidean distance
  between fingerprint bits
Test Set (Database Search)
(MW <= 250 Da, 1 CH3, 3 CH2, 1 CH, 4 Ar)
[Figure: ten test-set structures, labeled a-j, each shown with its 2D spectrum (f2 0-10 ppm, f1 0-160 ppm), plus a "Pairwise similarity" plot of Molecule A vs. Molecule B over the same ten structures.]
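The search constraints in the slide title (MW <= 250 Da, 1 CH3, 3 CH2, 1 CH, 4 Ar) amount to a simple filter over candidate records. The sketch below assumes each candidate is a dictionary with precomputed group counts; the record layout, field names and values are hypothetical and do not correspond to the ten structures in the figure.

```python
# Hypothetical candidate records with precomputed molecular weight and group counts.
candidates = [
    {"name": "cand-1", "mw": 221.3, "ch3": 1, "ch2": 3, "ch": 1, "ar": 4},
    {"name": "cand-2", "mw": 263.4, "ch3": 1, "ch2": 3, "ch": 1, "ar": 4},
    {"name": "cand-3", "mw": 207.3, "ch3": 2, "ch2": 3, "ch": 1, "ar": 4},
    {"name": "cand-4", "mw": 249.3, "ch3": 1, "ch2": 3, "ch": 1, "ar": 4},
]

max_mw = 250.0
required_counts = {"ch3": 1, "ch2": 3, "ch": 1, "ar": 4}

# Keep candidates under the weight limit whose group counts all match.
hits = [
    c for c in candidates
    if c["mw"] <= max_mw
    and all(c[key] == value for key, value in required_counts.items())
]
hit_names = [c["name"] for c in hits]
```

The surviving hits would then be ranked against one another by fingerprint distance, as in the pairwise similarity plot.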
Automated Structure Verification
Are the chemical structure and the NMR data consistent with each
other?
• Procedure:
  – Predict NMR data from the proposed structure
  – Compare to experimental data (1H, 1H-13C HSQC)
  – Calculate a matching score
• Not seeking full structure elucidation or accurate assignments

Why do this?
• Best way to deal with a large number of simple compounds (e.g.
  libraries, reagents)
• Leaves interesting problems for manual analysis
ASV of Negative Control Structures
Test Set
• 10 Positive Control Structures
• 5 Negative Control structures generated automatically
• ASV run on all 6 structures against experimental
  NMR data (1H-1D and HSQC) [1]

[Figure: four plots of ASV Score (0.00-1.00) vs. Molecular Similarity, covering PC-1 to PC-3, PC-4 to PC-6, PC-7 and PC-8, and PC-9 and PC-10.]

[1] ASV was run by Phil Keyes at Lexicon Pharmaceuticals using the ACDLabs ASV system.
Negative Controls for PC1
[Figure: ASV Score (0.00-1.00) vs. Molecular Similarity (0.00-25.00); legend: PC-1, PC-2, PC-3.]
Negative Controls for PC5
[Figure: ASV Score (0.00-1.00) vs. Molecular Similarity (0.00-25.00); legend: PC-4, PC-5, PC-6; annotation: "Positive Control".]
ASV is a Binary Classifier

• The yellow band is a myth
• A binary classifier is a system that selects between
  two options
• Binary classification is a well-understood, well-developed
  area of statistical analysis with many metrics at our
  disposal
• Used in many fields, including decision making,
  machine learning, and signal detection theory
• Set your strategy (false-positive or false-negative tolerant)
  and live with it
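Treating ASV as a binary classifier means choosing a score threshold and accepting the resulting error rates. The sketch below shows how two thresholds trade false negatives against false positives; the scores are invented for illustration and are not the ASV results from the plots, and the helper names are hypothetical.

```python
def confusion_counts(positive_scores, negative_scores, threshold):
    """Count outcomes when a score >= threshold is read as 'structure verified'."""
    tp = sum(s >= threshold for s in positive_scores)  # correct structures passed
    fn = len(positive_scores) - tp                     # correct structures rejected
    fp = sum(s >= threshold for s in negative_scores)  # wrong structures passed
    tn = len(negative_scores) - fp                     # wrong structures rejected
    return tp, fn, fp, tn

def rates(tp, fn, fp, tn):
    """Return (true positive rate, false positive rate)."""
    return tp / (tp + fn), fp / (fp + tn)

# Hypothetical ASV scores for positive and negative control structures.
positives = [0.95, 0.88, 0.72, 0.91]
negatives = [0.10, 0.35, 0.80, 0.55, 0.20]

strict = rates(*confusion_counts(positives, negatives, threshold=0.85))
lenient = rates(*confusion_counts(positives, negatives, threshold=0.50))
```

With these toy numbers the strict threshold passes no wrong structures but rejects one correct one, while the lenient threshold passes every correct structure along with two wrong ones.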
Summary

• Developed a molecular similarity method predictive of
  NMR data similarity for 1H-1D, 13C-1D and 1H-13C HSQC
  data

• The similarity calculation can be used for other purposes, such as
  computer-assisted structure elucidation (CASE) studies, if linked
  to a structure generator

• The confidence level of an autoverification can be
  calculated by challenging the system with negative
  control structures of known similarity to the proposed
  structure
Acknowledgments
Lexicon Pharmaceuticals
  Giovanni Cianchetta
  Phil Keyes
MestreLab
  Carlos Cobas
  Chen Peng
ACDLabs
  Ryan Sasaki
  Sergey Golotvin
Modgraph
  Jeff Seymour
Funding
Open Source Community
  OpenBabel
