SlideShare a Scribd company logo
1 of 40
Download to read offline
Efficient Perception of
            Proteins and Nucleic Acids
            from Atomic Connectivity

                            Roger Sayle, Ph.D.
                           NextMove Software,
                             Cambridge, UK

ACS National Meeting, San Diego, USA 26th March 2012
motivation

• Chemical Structure Normalization/Tautomers.
• IMI Open PHACTS suggest using FDA rules.
• FDA 2007 guidelines on structure registration
  contains a section on peptides.
• Bridging the gap between cheminformatics (small
  molecules) and bioinformatics (DNA and protein
  sequences), peptides pose some interesting technical
  challenges.


 ACS National Meeting, San Diego, USA 26th March 2012
Depiction of peptides




ACS National Meeting, San Diego, USA 26th March 2012
End Result
CC[C@H](C)[C@H]1C(=O)N[C@H]2CSSC[C@@H](C(=O)N[C@@H](CSSC[C@@H](C(=O)NCC(=O)N[C@H
](C(=O)N[C@H](C(=O)N[C@H](C(=O)N[C@H](C(=O)N[C@H](C(=O)N[C@H](C(=O)N[C@H](C(=O)N
[C@H](C(=O)N[C@H](C(=O)N[C@H](C(=O)N[C@@H](CSSC[C@H](NC(=O)[C@@H](NC(=O)[C@@H](N
C(=O)[C@@H](NC(=O)[C@@H](NC(=O)[C@@H](NC(=O)[C@@H](NC(=O)[C@@H](NC(=O)[C@@H](NC2
=O)CO)CC(C)C)CC3=CC=C(C=C3)O)CCC(=O)N)CC(C)C)CCC(=O)O)CC(=O)N)CC4=CC=C(C=C4)O)C(
=O)N[C@@H](CC(=O)N)C(=O)O)C(=O)NCC(=O)N[C@@H](CCC(=O)O)C(=O)N[C@@H](CCCNC(=N)N)C
(=O)NCC(=O)N[C@@H](CC5=CC=CC=C5)C(=O)N[C@@H](CC6=CC=CC=C6)C(=O)N[C@@H](CC7=CC=C(
C=C7)O)C(=O)N[C@@H]([C@@H](C)O)C(=O)N8CCC[C@H]8C(=O)N[C@@H](CCCCN)C(=O)N[C@@H]([
C@@H](C)O)C(=O)O)C(C)C)CC(C)C)CC9=CC=C(C=C9)O)CC(C)C)C)CCC(=O)O)C(C)C)CC(C)C)CC2
=CN=CN2)CO)NC(=O)[C@H](CC(C)C)NC(=O)[C@H](CC2=CN=CN2)NC(=O)[C@H](CCC(=O)N)NC(=O)
[C@H](CC(=O)N)NC(=O)[C@H](C(C)C)NC(=O)[C@H](CC2=CC=CC=C2)N)C(=O)N[C@H](C(=O)N[C@
H](C(=O)N1)CO)[C@@H](C)O)NC(=O)[C@H](CCC(=O)N)NC(=O)[C@H](CCC(=O)O)NC(=O)[C@H](C
(C)C)NC(=O)[C@H]([C@@H](C)CC)NC(=O)CN Human Insulin




 ACS National Meeting, San Diego, USA 26th March 2012
File Format conversion

• Information loss from MDL/SMILES to PDB/MMD
     Bond orders, Formal charges, Isotopes


• Information loss from PDB/MMD to MDL/SMILES
     Residue information, Occupancy, B-factors



ACS National Meeting, San Diego, USA 26th March 2012
The challenge

• Many structural biology and computational
  chemistry applications assume standard atom and
  residue naming, some even ordering.
• In OpenEye’s OEChem toolkit, recognizing PDB
  residues is done in OEPercieveResidues, and in
  OpenBabel it’s done in OBChainsParsers.
• This talk describes aspects of the advanced
  pattern matching algorithms that are used.

ACS National Meeting, San Diego, USA 26th March 2012
The Input
• The input is a minimal connection table.
• Only the graph connectivity is used, allowing
  molecules without bond orders to be handled.
• Only heavy atoms are required, explicit and implicit
  hydrogen counts aren’t used.
• Likewise, any formal charges, hybridization and 3D
  geometry are also ignored.
• Typical sources include SMILES and XYZ files.

 ACS National Meeting, San Diego, USA 26th March 2012
The Output (PDB file format)
ATOM     261   N     GLY     37        20.172    17.730    6.217   1.00    8.48
ATOM     262   CA    GLY     37        21.452    16.969    6.513   1.00    9.20
ATOM     263   C     GLY     37        21.143    15.478    6.427   1.00   10.41
ATOM     264   O     GLY     37        20.138    15.023    5.878   1.00   12.06
ATOM     265   N     ALA     38        22.055    14.701    7.032   1.00    9.24
ATOM     266   CA    ALA     38        22.019    13.242    7.020   1.00    9.24
ATOM     267   C     ALA     38        21.944    12.628    8.396   1.00    9.60
ATOM     268   O     ALA     38        21.869    11.387    8.435   1.00   13.65
ATOM     269   CB    ALA     38        23.246    12.697    6.275   1.00   10.43
ATOM     270   N     THR     39        21.894    13.435    9.436   1.00    8.70
ATOM     271   CA    THR     39        21.936    12.911   10.809   1.00    9.46
ATOM     272   C     THR     39        20.615    13.191   11.521   1.00    8.32
ATOM     273   O     THR     39        20.357    14.317   11.948   1.00    9.89
ATOM     274   CB    THR     39        23.131    13.601   11.593   1.00   10.72
ATOM     275   OG1   THR     39        24.284    13.401   10.709   1.00   11.66
ATOM     276   CG2   THR     39        23.340    12.935   12.962   1.00   11.81
ATOM     277   N     CYS     40        19.827    12.110   11.642   1.00    7.64
ATOM     278   CA    CYS     40        18.504    12.312   12.298   1.00    8.05
ATOM     279   C     CYS     40        18.684    12.451   13.784   1.00    7.63
ATOM     280   O     CYS     40        19.533    11.718   14.362   1.00    9.64
ATOM     281   CB    CYS     40        17.582    11.117   11.996   1.00    7.80
ATOM     282   SG    CYS     40        17.199    10.929   10.237   1.00    7.30

 ACS National Meeting, San Diego, USA 26th March 2012
chains Algorithm Overview



• Identify biopolymer backbones.

• Recognize biopolymer monomer sidechains.




 ACS National Meeting, San Diego, USA 26th March 2012
chains Algorithm Overview
• Determine connected components.
• Identify “Hetero” atoms (water, solvent, metals, ions,
  cofactors and common ligands; ATP, NAD, HEM).
• Identify biopolymer backbones (peptide, DNA, RNA).
• Traverse and number biopolymer backbone residues.
• Recognize biopolymer monomer sidechains (potentially
  substituted functional groups).
• Annotate hydrogen atoms (if present).
• Sort atoms into PDB order.
• Assign consecutive serial numbers (unique names).


 ACS National Meeting, San Diego, USA 26th March 2012
Polymer Matching

• Matching of linear, cyclic and dendrimeric
  polymers and copolymers can be efficiently
  implemented by graph relaxation algorithms.

• Each atom records a set of possible template
  equivalences represented as a bit vector.

• These bit vectors are iteratively refined using the
  bit vectors of neighboring atoms.
 ACS National Meeting, San Diego, USA 26th March 2012
Peptides and Proteins

• Example of a copolymer (substituted backbone).
                                             Q
                                 H                      OH
                                       N
                                       H           O
• Complications from glycine and proline.
• N-terminal acetyl and formyl.
• C-terminal aldehyde and amide.
 ACS National Meeting, San Diego, USA 26th March 2012
Protein Backbone Definition

• Assign initial constraints based upon atomic number
  and heavy atom degree.

         BitN                  Nitrogen                 2 neighbors
         BitCA                 Carbon                   3 neighbors
         BitC                  Carbon                   3 neighbors
         BitO                  Oxygen                   1 neighbor



 ACS National Meeting, San Diego, USA 26th March 2012
Protein Backbone Definition

• Perform iterative relaxation at each atom using its
  neighbor’s bit masks.

         BitN:                 BitCA                    BitC
         BitCA:                BitN                     BitC   AnyC
         BitC:                 BitCA                    BitN   BitO
         BitO:                 BitC



 ACS National Meeting, San Diego, USA 26th March 2012
current Protein Definitions

•     Five types of nitrogen (±proline, ±terminal & amide)
•     Two types of alpha carbon (regular vs. GLY)
•     Four types of carbonyl carbon
•     Two types of oxygen (O and OXT)

• Currently 13 atom types (bits).




    ACS National Meeting, San Diego, USA 26th March 2012
Nucleic Acids

                              Q                                              Q

                                                                                     OH
                          O                                              O
               O                                                O

      H   O    P    O             O   H                 H   O   P    O           O   H
               OH                                               OH

                    DNA                                                  RNA

• Complications include termination & methylation.
• Currently 19 atom types (bits).
 ACS National Meeting, San Diego, USA 26th March 2012
Functional Group Matching

• Given an known attachment atom/bond
  determine whether the affixed molecular
  fragment is a member of a dictionary of known
  functional groups or sidechains.

• The standard “default” strategy is to treat the
  dictionary as a list of SMARTS patterns and loop
  over each one until a hit is identified.
• This is O(n) in the size of the dictionary.
 ACS National Meeting, San Diego, USA 26th March 2012
Recognizing monomer sidechains

• Avoid backtracking by preprocessing monomer sets.
                                                 CG1   CD 1

                                      CA    CB

                                                 CG2




• Enumerate all possible depth first graph traversals

     C1,C3,C2,C1,C1 =>                     ILE: CA,CB,CG1,CD1,CG2
     C1,C3,C1,C2,C1 =>                     ILE: CA,CB,CG2,CG1,CD1

ACS National Meeting, San Diego, USA 26th March 2012
Recognizing monomer sidechains

• The sequence of atomic numbers and the heavy
  degree (or fanout) of each vertex uniquely identifies
  the sidechain.
• Conceptually, the candidate sidechain is traversed
  once and identified by looking up the “traversal” in a
  “dictionary”.
• Encode the traversal dictionary as a prefix-closed
  “tries” to efficiently match arbitrary dictionaries.


 ACS National Meeting, San Diego, USA 26th March 2012
Example Peptide Flow Chart
                                   EVAL NEIGHBOURS



                                                     Y
                                      COUNT=0?           ASSIGN GLY


                                            N

                                                     N
                                      COUNT=1?           ASSIGN UNK


                                            Y

                                                     N
                                       ELEM=C?           ASSIGN UNK


                                            Y
                                   EVAL NEIGHBOURS



                                                     Y
                                      COUNT=0?           ASSIGN ALA


                                            N

                                                     N
                                      COUNT=1?           ASSIGN UNK


                                            Y

                                                     Y
                                       ELEM=S?           ASSIGN CYS


                                            N

                                                     Y
                                       ELEM=O?           ASSIGN SER

                                            N
                                     ASSIGN UNK



ACS National Meeting, San Diego, USA 26th March 2012
Graph matching during traversal

 • Instead of initially traversing the entire sidechain
   to be recognized and then looking it up in the
   dictionary, we can perform the look-up as we
   traverse our molecule.
 • This has the benefit that we can fail fast and run-
   time is bounded by the number of atoms in our
   largest monomer.
 • In fact, the algorithm’s performance is
   independent of the number of monomer
   definitions.
 ACS National Meeting, San Diego, USA 26th March 2012
OEchem Implementation

• In OEChem, a way-ahead-of-time “compiler”
  processes an input set of sidechain definitions and
  outputs them as a bytecode program.
• These bytecodes are efficiently interpreted at run-
  time inside OEPerceiveResidues.
• Separate sidechain “programs” are used for proteins
  and nucleic acids.
• An early implementation in OELib and OpenBabel
  builds these bytecodes just-in-time.

 ACS National Meeting, San Diego, USA 26th March 2012
Resulting Bytecode
static const OEByteCode peptide_bc[301] = {
  { OEOpCode::EVAL,                    0,     1,          0   },   //    0
  { OEOpCode::COUNT,                   0,    61,         60   },   //    1
  { OEOpCode::ELEM,                    6,     3,         -1   },   //    2
  { OEOpCode::EVAL,                    0,     4,          0   },   //    3
  { OEOpCode::COUNT,                   0,     5,          6   },   //    4
  { OEOpCode::ASSIGN, BCNAM('A','L','A'),     2,          0   },   //    5
  { OEOpCode::COUNT,                   1,     7,         77   },   //    6
  { OEOpCode::ELEM,                    6,     8,         41   },   //    7
  { OEOpCode::EVAL,                    0,     9,          0   },   //    8
  { OEOpCode::COUNT,                   0,   174,        173   },   //    9
  { OEOpCode::ELEM,                    7,    11,         17   },   //   10
  { OEOpCode::EVAL,                    0,    12,          0   },   //   11
  { OEOpCode::COUNT,                   0,    13,        229   },   //   12
  { OEOpCode::ELEM,                    8,    14,         -1   },   //   13
  { OEOpCode::EVAL,                    0,    15,          0   },   //   14
  { OEOpCode::COUNT,                   0,    16,         -1   },   //   15
  { OEOpCode::ASSIGN, BCNAM('A','S','N'),     5,          0   },   //   16


 ACS National Meeting, San Diego, USA 26th March 2012
Supported Peptide Sidechains

• The twenty standard amino acids (D and L- forms).
    – ALA, ASN, ASP, ARG, CYS, GLN, GLU, GLY, HIS, ILE, LEU, LYS,
      MET, PHE, PRO, SER, THR, TRP, TYR and VAL.
• Fourteen non-standard amino acids.
    – ABA, CGU, CME, CSD, HYP, LYZ, MEN, MLY, MSE, NLE, NVA,
      ORN, PCA, PTR, SEP and TPO.
• Two N-terminal modifications.
    – Acetyl (ACE) and Formyl (FOR).
• One C-terminal modification residue.
    – Amide (NH2).
 ACS National Meeting, San Diego, USA 26th March 2012
Supported Nucleic Sidechains

• Four standard DNA bases.
   – DA, DC, DG, DT.
• Four standard RNA bases.
   – A, C, G, U.
• Seven non-standard RNA bases.
   – 1MA, 2MG, 5MC, 7MG, M2G, PSU, YG.




ACS National Meeting, San Diego, USA 26th March 2012
OEPerceiveResidues Performance

• As measured on a 2.2GHz AMD Opteron.

    Code            Atoms                 Mons          Timings   Average
    1CRN            327                   46            15s/10K   1.5 ms
    4TNA            1,656                 76            98s/10K   9.8 ms
    1GD1            10,984                1336          185s/1K   185 ms

• Previous work by Siani, Weininger and Blaney
  (JCICS 1994) report taking 8.2s on an R3000 SGI to
  identify insulin (51aa) using SMARTS matching.

 ACS National Meeting, San Diego, USA 26th March 2012
byteCode Size Issues

• Using all possible traversal permutations works
  well for the standard amino and nucleic acid
  sidechains.
• To match the 34 supported amino acid sidechains
  required a “program” of only 521 bytecodes, and
  the 11 supported nucleic acid bases required
  3,488 byte codes.
• Then came wybutosine (PDB residue YG) which
  had pathological branching requiring 91,974 byte
  codes due to combinatorics.
ACS National Meeting, San Diego, USA 26th March 2012
Wybutosine




ACS National Meeting, San Diego, USA 26th March 2012
Bytecode Size Issues

   • The solution was to perform a local sorting step
     when pushing neighboring atoms on to the stack.
   • Sort uses atomic number and heavy degree.
   • No performance impact in OEChem as we were
     already sorting for permutation stability.
        Dictionary                  Before             After
        Peptide                     521                301
        Nucleic                     3,488              326
        Nucleic+YG                  95,391             649


ACS National Meeting, San Diego, USA 26th March 2012
Identifying peptides

• To a first approximation, use OpenBabel to convert
  the SMILES to a PDB file, and if it only contains
  ATOMs (no HETATMs) it’s a peptide.
• A better definition is to use an explicit list of
  acceptable amino (and nucleic) acid residues and
  check for the presence of these.
• Crosslinking is easily retrieved as bonds between
  residues, excepting standard eupeptide links.


 ACS National Meeting, San Diego, USA 26th March 2012
Crambin (FDA & IUPAC STYLES)




ACS National Meeting, San Diego, USA 26th March 2012
Human insulin (MULtiple chains)




ACS National Meeting, San Diego, USA 26th March 2012
Oxytocin (Disulfide & amide)




ACS National Meeting, San Diego, USA 26th March 2012
Gramicidin S (cyclic peptides)




ACS National Meeting, San Diego, USA 26th March 2012
LULiBERIN (terminal options)




ACS National Meeting, San Diego, USA 26th March 2012
NESIritide (stereo options)




ACS National Meeting, San Diego, USA 26th March 2012
Database statistics

• Unfortunately, of 250251 compounds in the NCI Aug
  2000 database, only 758 (0.3%) are peptide or
  peptide-like and only 86 (0.03%) are oligonucleotide
  sequences.
• Of the 32,274,581 compounds in PubChem[February
  2012], only 251,866 (0.78%) and only 1075 (0.00%)
  are oligonucleotide sequences.
• Alas the SMARTS pattern “NCC(=O)NCC(=O)NCC=O”
  for a tripeptide only matches 391,502 (1.2%), an
  upper bound on the prevalence of peptides.
 ACS National Meeting, San Diego, USA 26th March 2012
Future Work

•   Analysis of non-standard AA usage in drugs.
•   Additional non-standard sidechains (SAR).
•   Alloisoleucine (IIL) and allothreonine (ALO).
•   Glycoinformatics (post-translational sugar chains).
•   Matching improvements (compilation/profiling).




ACS National Meeting, San Diego, USA 26th March 2012
conclusions

• Peptides and biologics don’t need their own file
  formats if their sequence can be perceived.
• Generating prettier 2D depictions is simple.
• Peptide informatics (and glycoinfomatics)
  potentially offer great opportunities for research
  and innovation.
• It’s always great to find a new use for an old
  algorithm.


 ACS National Meeting, San Diego, USA 26th March 2012
acknowledgements

•     OpenEye Scientific Software.
•     AstraZeneca R&D.
•     Vertex Pharmaceuticals.
•     European Bioinformatics Institute.
•     RasMol’s user community.

• Thank you for your time.

    ACS National Meeting, San Diego, USA 26th March 2012

More Related Content

Viewers also liked

WUD 2009 - User Experience Design a telefony komórkowe
WUD 2009 - User Experience Design a telefony komórkoweWUD 2009 - User Experience Design a telefony komórkowe
WUD 2009 - User Experience Design a telefony komórkoweWorld Usability Day Tour 2009
 
New constitution - what principles should guide our business?
New constitution - what principles should guide our business?New constitution - what principles should guide our business?
New constitution - what principles should guide our business?Mark Ralphs
 
TOP 10 HONEYMOON DESTINATIONS_ABTA MAG_FEB 2016
TOP 10 HONEYMOON DESTINATIONS_ABTA MAG_FEB 2016TOP 10 HONEYMOON DESTINATIONS_ABTA MAG_FEB 2016
TOP 10 HONEYMOON DESTINATIONS_ABTA MAG_FEB 2016Rowena Marella-Daw
 
EXTRA FASHION - AW1617 Trend Report
EXTRA FASHION - AW1617 Trend ReportEXTRA FASHION - AW1617 Trend Report
EXTRA FASHION - AW1617 Trend ReportJames Jackson
 
Top 5 call center software solutions
Top 5 call center software solutionsTop 5 call center software solutions
Top 5 call center software solutionsPaul Bellys
 
Aria Telecom Profile
Aria Telecom ProfileAria Telecom Profile
Aria Telecom Profilersaini12
 
8 reasons Images Matter, plus learn how to upload custom images on Listly
 8 reasons Images Matter, plus learn how to upload custom images on Listly 8 reasons Images Matter, plus learn how to upload custom images on Listly
8 reasons Images Matter, plus learn how to upload custom images on ListlyNick Kellet
 
Planificador de proyectos actual (1)
Planificador de proyectos actual (1)Planificador de proyectos actual (1)
Planificador de proyectos actual (1)adrizinemcali2014
 
Thyatira
ThyatiraThyatira
Thyatiratccdeaf
 

Viewers also liked (9)

WUD 2009 - User Experience Design a telefony komórkowe
WUD 2009 - User Experience Design a telefony komórkoweWUD 2009 - User Experience Design a telefony komórkowe
WUD 2009 - User Experience Design a telefony komórkowe
 
New constitution - what principles should guide our business?
New constitution - what principles should guide our business?New constitution - what principles should guide our business?
New constitution - what principles should guide our business?
 
TOP 10 HONEYMOON DESTINATIONS_ABTA MAG_FEB 2016
TOP 10 HONEYMOON DESTINATIONS_ABTA MAG_FEB 2016TOP 10 HONEYMOON DESTINATIONS_ABTA MAG_FEB 2016
TOP 10 HONEYMOON DESTINATIONS_ABTA MAG_FEB 2016
 
EXTRA FASHION - AW1617 Trend Report
EXTRA FASHION - AW1617 Trend ReportEXTRA FASHION - AW1617 Trend Report
EXTRA FASHION - AW1617 Trend Report
 
Top 5 call center software solutions
Top 5 call center software solutionsTop 5 call center software solutions
Top 5 call center software solutions
 
Aria Telecom Profile
Aria Telecom ProfileAria Telecom Profile
Aria Telecom Profile
 
8 reasons Images Matter, plus learn how to upload custom images on Listly
 8 reasons Images Matter, plus learn how to upload custom images on Listly 8 reasons Images Matter, plus learn how to upload custom images on Listly
8 reasons Images Matter, plus learn how to upload custom images on Listly
 
Planificador de proyectos actual (1)
Planificador de proyectos actual (1)Planificador de proyectos actual (1)
Planificador de proyectos actual (1)
 
Thyatira
ThyatiraThyatira
Thyatira
 

Similar to Efficient Perception of Proteins and Nucleic Acids from Atomic Connectivity

SmallWorld : Efficient Maximum Common Subgraph Searching of Large Chemical Da...
SmallWorld : Efficient Maximum Common Subgraph Searching of Large Chemical Da...SmallWorld : Efficient Maximum Common Subgraph Searching of Large Chemical Da...
SmallWorld : Efficient Maximum Common Subgraph Searching of Large Chemical Da...NextMove Software
 
Digital Chemical Representations
Digital Chemical RepresentationsDigital Chemical Representations
Digital Chemical RepresentationsNextMove Software
 
Bits protein structure
Bits protein structureBits protein structure
Bits protein structureBITS
 
A pragmatic perspective on lithium ion batteries
A pragmatic perspective on lithium ion batteriesA pragmatic perspective on lithium ion batteries
A pragmatic perspective on lithium ion batteriesBing Hsieh
 
Insights into nanoscale phase stability and charging mechanisms in alkali o2 ...
Insights into nanoscale phase stability and charging mechanisms in alkali o2 ...Insights into nanoscale phase stability and charging mechanisms in alkali o2 ...
Insights into nanoscale phase stability and charging mechanisms in alkali o2 ...University of California, San Diego
 
Extraction, Analysis, Atom Mapping, Classification and Naming of Reactions fr...
Extraction, Analysis, Atom Mapping, Classification and Naming of Reactions fr...Extraction, Analysis, Atom Mapping, Classification and Naming of Reactions fr...
Extraction, Analysis, Atom Mapping, Classification and Naming of Reactions fr...NextMove Software
 
Unlocking chemical information from tables and legacy articles
Unlocking chemical information from tables and legacy articlesUnlocking chemical information from tables and legacy articles
Unlocking chemical information from tables and legacy articlesNextMove Software
 
Evaluating the Quality and Performance of Automatic Atom Mapping Algorithms
Evaluating the Quality and Performance of Automatic Atom Mapping AlgorithmsEvaluating the Quality and Performance of Automatic Atom Mapping Algorithms
Evaluating the Quality and Performance of Automatic Atom Mapping AlgorithmsNextMove Software
 
IB Chemistry on ICT, 3D software, Avogadro, Jmol, Swiss PDB, Pymol for Intern...
IB Chemistry on ICT, 3D software, Avogadro, Jmol, Swiss PDB, Pymol for Intern...IB Chemistry on ICT, 3D software, Avogadro, Jmol, Swiss PDB, Pymol for Intern...
IB Chemistry on ICT, 3D software, Avogadro, Jmol, Swiss PDB, Pymol for Intern...Lawrence kok
 
A Statistical Approach to Optimize Parameters for Electrodeposition of Indium...
A Statistical Approach to Optimize Parameters for Electrodeposition of Indium...A Statistical Approach to Optimize Parameters for Electrodeposition of Indium...
A Statistical Approach to Optimize Parameters for Electrodeposition of Indium...Arkansas State University
 
IB Chemistry on ICT, 3D software, Avogadro, Jmol, Swiss PDB, Pymol for Intern...
IB Chemistry on ICT, 3D software, Avogadro, Jmol, Swiss PDB, Pymol for Intern...IB Chemistry on ICT, 3D software, Avogadro, Jmol, Swiss PDB, Pymol for Intern...
IB Chemistry on ICT, 3D software, Avogadro, Jmol, Swiss PDB, Pymol for Intern...Lawrence kok
 

Similar to Efficient Perception of Proteins and Nucleic Acids from Atomic Connectivity (17)

SmallWorld : Efficient Maximum Common Subgraph Searching of Large Chemical Da...
SmallWorld : Efficient Maximum Common Subgraph Searching of Large Chemical Da...SmallWorld : Efficient Maximum Common Subgraph Searching of Large Chemical Da...
SmallWorld : Efficient Maximum Common Subgraph Searching of Large Chemical Da...
 
Digital Chemical Representations
Digital Chemical RepresentationsDigital Chemical Representations
Digital Chemical Representations
 
Bits protein structure
Bits protein structureBits protein structure
Bits protein structure
 
A pragmatic perspective on lithium ion batteries
A pragmatic perspective on lithium ion batteriesA pragmatic perspective on lithium ion batteries
A pragmatic perspective on lithium ion batteries
 
Advanced Computational Drug Design
Advanced Computational Drug DesignAdvanced Computational Drug Design
Advanced Computational Drug Design
 
Insights into nanoscale phase stability and charging mechanisms in alkali o2 ...
Insights into nanoscale phase stability and charging mechanisms in alkali o2 ...Insights into nanoscale phase stability and charging mechanisms in alkali o2 ...
Insights into nanoscale phase stability and charging mechanisms in alkali o2 ...
 
Extraction, Analysis, Atom Mapping, Classification and Naming of Reactions fr...
Extraction, Analysis, Atom Mapping, Classification and Naming of Reactions fr...Extraction, Analysis, Atom Mapping, Classification and Naming of Reactions fr...
Extraction, Analysis, Atom Mapping, Classification and Naming of Reactions fr...
 
Unlocking chemical information from tables and legacy articles
Unlocking chemical information from tables and legacy articlesUnlocking chemical information from tables and legacy articles
Unlocking chemical information from tables and legacy articles
 
Oct 2011 ualr
Oct 2011 ualrOct 2011 ualr
Oct 2011 ualr
 
Evaluating the Quality and Performance of Automatic Atom Mapping Algorithms
Evaluating the Quality and Performance of Automatic Atom Mapping AlgorithmsEvaluating the Quality and Performance of Automatic Atom Mapping Algorithms
Evaluating the Quality and Performance of Automatic Atom Mapping Algorithms
 
IB Chemistry on ICT, 3D software, Avogadro, Jmol, Swiss PDB, Pymol for Intern...
IB Chemistry on ICT, 3D software, Avogadro, Jmol, Swiss PDB, Pymol for Intern...IB Chemistry on ICT, 3D software, Avogadro, Jmol, Swiss PDB, Pymol for Intern...
IB Chemistry on ICT, 3D software, Avogadro, Jmol, Swiss PDB, Pymol for Intern...
 
VLSI Design Book CMOS_Circuit_Design__Layout__and_Simulation
VLSI Design Book CMOS_Circuit_Design__Layout__and_SimulationVLSI Design Book CMOS_Circuit_Design__Layout__and_Simulation
VLSI Design Book CMOS_Circuit_Design__Layout__and_Simulation
 
Tutorial komputasi chem 126
Tutorial komputasi chem 126Tutorial komputasi chem 126
Tutorial komputasi chem 126
 
A Statistical Approach to Optimize Parameters for Electrodeposition of Indium...
A Statistical Approach to Optimize Parameters for Electrodeposition of Indium...A Statistical Approach to Optimize Parameters for Electrodeposition of Indium...
A Statistical Approach to Optimize Parameters for Electrodeposition of Indium...
 
Computational Chemistry Robots
Computational Chemistry RobotsComputational Chemistry Robots
Computational Chemistry Robots
 
Crystalography
CrystalographyCrystalography
Crystalography
 
IB Chemistry on ICT, 3D software, Avogadro, Jmol, Swiss PDB, Pymol for Intern...
IB Chemistry on ICT, 3D software, Avogadro, Jmol, Swiss PDB, Pymol for Intern...IB Chemistry on ICT, 3D software, Avogadro, Jmol, Swiss PDB, Pymol for Intern...
IB Chemistry on ICT, 3D software, Avogadro, Jmol, Swiss PDB, Pymol for Intern...
 

More from NextMove Software

CINF 170: Regioselectivity: An application of expert systems and ontologies t...
CINF 170: Regioselectivity: An application of expert systems and ontologies t...CINF 170: Regioselectivity: An application of expert systems and ontologies t...
CINF 170: Regioselectivity: An application of expert systems and ontologies t...NextMove Software
 
Building a bridge between human-readable and machine-readable representations...
Building a bridge between human-readable and machine-readable representations...Building a bridge between human-readable and machine-readable representations...
Building a bridge between human-readable and machine-readable representations...NextMove Software
 
CINF 35: Structure searching for patent information: The need for speed
CINF 35: Structure searching for patent information: The need for speedCINF 35: Structure searching for patent information: The need for speed
CINF 35: Structure searching for patent information: The need for speedNextMove Software
 
A de facto standard or a free-for-all? A benchmark for reading SMILES
A de facto standard or a free-for-all? A benchmark for reading SMILESA de facto standard or a free-for-all? A benchmark for reading SMILES
A de facto standard or a free-for-all? A benchmark for reading SMILESNextMove Software
 
Recent Advances in Chemical & Biological Search Systems: Evolution vs Revolution
Recent Advances in Chemical & Biological Search Systems: Evolution vs RevolutionRecent Advances in Chemical & Biological Search Systems: Evolution vs Revolution
Recent Advances in Chemical & Biological Search Systems: Evolution vs RevolutionNextMove Software
 
Can we agree on the structure represented by a SMILES string? A benchmark dat...
Can we agree on the structure represented by a SMILES string? A benchmark dat...Can we agree on the structure represented by a SMILES string? A benchmark dat...
Can we agree on the structure represented by a SMILES string? A benchmark dat...NextMove Software
 
Comparing Cahn-Ingold-Prelog Rule Implementations
Comparing Cahn-Ingold-Prelog Rule ImplementationsComparing Cahn-Ingold-Prelog Rule Implementations
Comparing Cahn-Ingold-Prelog Rule ImplementationsNextMove Software
 
Eugene Garfield: the father of chemical text mining and artificial intelligen...
Eugene Garfield: the father of chemical text mining and artificial intelligen...Eugene Garfield: the father of chemical text mining and artificial intelligen...
Eugene Garfield: the father of chemical text mining and artificial intelligen...NextMove Software
 
Chemical similarity using multi-terabyte graph databases: 68 billion nodes an...
Chemical similarity using multi-terabyte graph databases: 68 billion nodes an...Chemical similarity using multi-terabyte graph databases: 68 billion nodes an...
Chemical similarity using multi-terabyte graph databases: 68 billion nodes an...NextMove Software
 
Recent improvements to the RDKit
Recent improvements to the RDKitRecent improvements to the RDKit
Recent improvements to the RDKitNextMove Software
 
Pharmaceutical industry best practices in lessons learned: ELN implementation...
Pharmaceutical industry best practices in lessons learned: ELN implementation...Pharmaceutical industry best practices in lessons learned: ELN implementation...
Pharmaceutical industry best practices in lessons learned: ELN implementation...NextMove Software
 
Challenges and successes in machine interpretation of Markush descriptions
Challenges and successes in machine interpretation of Markush descriptionsChallenges and successes in machine interpretation of Markush descriptions
Challenges and successes in machine interpretation of Markush descriptionsNextMove Software
 
PubChem as a Biologics Database
PubChem as a Biologics DatabasePubChem as a Biologics Database
PubChem as a Biologics DatabaseNextMove Software
 
CINF 17: Comparing Cahn-Ingold-Prelog Rule Implementations: The need for an o...
CINF 17: Comparing Cahn-Ingold-Prelog Rule Implementations: The need for an o...CINF 17: Comparing Cahn-Ingold-Prelog Rule Implementations: The need for an o...
CINF 17: Comparing Cahn-Ingold-Prelog Rule Implementations: The need for an o...NextMove Software
 
CINF 13: Pistachio - Search and Faceting of Large Reaction Databases
CINF 13: Pistachio - Search and Faceting of Large Reaction DatabasesCINF 13: Pistachio - Search and Faceting of Large Reaction Databases
CINF 13: Pistachio - Search and Faceting of Large Reaction DatabasesNextMove Software
 
Building on Sand: Standard InChIs on non-standard molfiles
Building on Sand: Standard InChIs on non-standard molfilesBuilding on Sand: Standard InChIs on non-standard molfiles
Building on Sand: Standard InChIs on non-standard molfilesNextMove Software
 
Chemical Structure Representation of Inorganic Salts and Mixtures of Gases: A...
Chemical Structure Representation of Inorganic Salts and Mixtures of Gases: A...Chemical Structure Representation of Inorganic Salts and Mixtures of Gases: A...
Chemical Structure Representation of Inorganic Salts and Mixtures of Gases: A...NextMove Software
 
Advanced grammars for state-of-the-art named entity recognition (NER)
Advanced grammars for state-of-the-art named entity recognition (NER)Advanced grammars for state-of-the-art named entity recognition (NER)
Advanced grammars for state-of-the-art named entity recognition (NER)NextMove Software
 
Challenges in Chemical Information Exchange
Challenges in Chemical Information ExchangeChallenges in Chemical Information Exchange
Challenges in Chemical Information ExchangeNextMove Software
 

More from NextMove Software (20)

DeepSMILES
DeepSMILESDeepSMILES
DeepSMILES
 
CINF 170: Regioselectivity: An application of expert systems and ontologies t...
CINF 170: Regioselectivity: An application of expert systems and ontologies t...CINF 170: Regioselectivity: An application of expert systems and ontologies t...
CINF 170: Regioselectivity: An application of expert systems and ontologies t...
 
Building a bridge between human-readable and machine-readable representations...
Building a bridge between human-readable and machine-readable representations...Building a bridge between human-readable and machine-readable representations...
Building a bridge between human-readable and machine-readable representations...
 
CINF 35: Structure searching for patent information: The need for speed
CINF 35: Structure searching for patent information: The need for speedCINF 35: Structure searching for patent information: The need for speed
CINF 35: Structure searching for patent information: The need for speed
 
A de facto standard or a free-for-all? A benchmark for reading SMILES
A de facto standard or a free-for-all? A benchmark for reading SMILESA de facto standard or a free-for-all? A benchmark for reading SMILES
A de facto standard or a free-for-all? A benchmark for reading SMILES
 
Recent Advances in Chemical & Biological Search Systems: Evolution vs Revolution
Recent Advances in Chemical & Biological Search Systems: Evolution vs RevolutionRecent Advances in Chemical & Biological Search Systems: Evolution vs Revolution
Recent Advances in Chemical & Biological Search Systems: Evolution vs Revolution
 
Can we agree on the structure represented by a SMILES string? A benchmark dat...
Can we agree on the structure represented by a SMILES string? A benchmark dat...Can we agree on the structure represented by a SMILES string? A benchmark dat...
Can we agree on the structure represented by a SMILES string? A benchmark dat...
 
Comparing Cahn-Ingold-Prelog Rule Implementations
Comparing Cahn-Ingold-Prelog Rule ImplementationsComparing Cahn-Ingold-Prelog Rule Implementations
Comparing Cahn-Ingold-Prelog Rule Implementations
 
Eugene Garfield: the father of chemical text mining and artificial intelligen...
Eugene Garfield: the father of chemical text mining and artificial intelligen...Eugene Garfield: the father of chemical text mining and artificial intelligen...
Eugene Garfield: the father of chemical text mining and artificial intelligen...
 
Chemical similarity using multi-terabyte graph databases: 68 billion nodes an...
Chemical similarity using multi-terabyte graph databases: 68 billion nodes an...Chemical similarity using multi-terabyte graph databases: 68 billion nodes an...
Chemical similarity using multi-terabyte graph databases: 68 billion nodes an...
 
Recent improvements to the RDKit
Recent improvements to the RDKitRecent improvements to the RDKit
Recent improvements to the RDKit
 
Pharmaceutical industry best practices in lessons learned: ELN implementation...
Pharmaceutical industry best practices in lessons learned: ELN implementation...Pharmaceutical industry best practices in lessons learned: ELN implementation...
Pharmaceutical industry best practices in lessons learned: ELN implementation...
 
Challenges and successes in machine interpretation of Markush descriptions
Challenges and successes in machine interpretation of Markush descriptionsChallenges and successes in machine interpretation of Markush descriptions
Challenges and successes in machine interpretation of Markush descriptions
 
PubChem as a Biologics Database
PubChem as a Biologics DatabasePubChem as a Biologics Database
PubChem as a Biologics Database
 
CINF 17: Comparing Cahn-Ingold-Prelog Rule Implementations: The need for an o...
CINF 17: Comparing Cahn-Ingold-Prelog Rule Implementations: The need for an o...CINF 17: Comparing Cahn-Ingold-Prelog Rule Implementations: The need for an o...
CINF 17: Comparing Cahn-Ingold-Prelog Rule Implementations: The need for an o...
 
CINF 13: Pistachio - Search and Faceting of Large Reaction Databases
CINF 13: Pistachio - Search and Faceting of Large Reaction DatabasesCINF 13: Pistachio - Search and Faceting of Large Reaction Databases
CINF 13: Pistachio - Search and Faceting of Large Reaction Databases
 
Building on Sand: Standard InChIs on non-standard molfiles
Building on Sand: Standard InChIs on non-standard molfilesBuilding on Sand: Standard InChIs on non-standard molfiles
Building on Sand: Standard InChIs on non-standard molfiles
 
Chemical Structure Representation of Inorganic Salts and Mixtures of Gases: A...
Chemical Structure Representation of Inorganic Salts and Mixtures of Gases: A...Chemical Structure Representation of Inorganic Salts and Mixtures of Gases: A...
Chemical Structure Representation of Inorganic Salts and Mixtures of Gases: A...
 
Advanced grammars for state-of-the-art named entity recognition (NER)
Advanced grammars for state-of-the-art named entity recognition (NER)Advanced grammars for state-of-the-art named entity recognition (NER)
Advanced grammars for state-of-the-art named entity recognition (NER)
 
Challenges in Chemical Information Exchange
Challenges in Chemical Information ExchangeChallenges in Chemical Information Exchange
Challenges in Chemical Information Exchange
 

Recently uploaded

MS4 level being good citizen -imperative- (1) (1).pdf
MS4 level   being good citizen -imperative- (1) (1).pdfMS4 level   being good citizen -imperative- (1) (1).pdf
MS4 level being good citizen -imperative- (1) (1).pdfMr Bounab Samir
 
Mythology Quiz-4th April 2024, Quiz Club NITW
Mythology Quiz-4th April 2024, Quiz Club NITWMythology Quiz-4th April 2024, Quiz Club NITW
Mythology Quiz-4th April 2024, Quiz Club NITWQuiz Club NITW
 
How to Make a Duplicate of Your Odoo 17 Database
How to Make a Duplicate of Your Odoo 17 DatabaseHow to Make a Duplicate of Your Odoo 17 Database
How to Make a Duplicate of Your Odoo 17 DatabaseCeline George
 
Multi Domain Alias In the Odoo 17 ERP Module
Multi Domain Alias In the Odoo 17 ERP ModuleMulti Domain Alias In the Odoo 17 ERP Module
Multi Domain Alias In the Odoo 17 ERP ModuleCeline George
 
Daily Lesson Plan in Mathematics Quarter 4
Daily Lesson Plan in Mathematics Quarter 4Daily Lesson Plan in Mathematics Quarter 4
Daily Lesson Plan in Mathematics Quarter 4JOYLYNSAMANIEGO
 
Concurrency Control in Database Management system
Concurrency Control in Database Management systemConcurrency Control in Database Management system
Concurrency Control in Database Management systemChristalin Nelson
 
Blowin' in the Wind of Caste_ Bob Dylan's Song as a Catalyst for Social Justi...
Blowin' in the Wind of Caste_ Bob Dylan's Song as a Catalyst for Social Justi...Blowin' in the Wind of Caste_ Bob Dylan's Song as a Catalyst for Social Justi...
Blowin' in the Wind of Caste_ Bob Dylan's Song as a Catalyst for Social Justi...DhatriParmar
 
ICS2208 Lecture6 Notes for SL spaces.pdf
ICS2208 Lecture6 Notes for SL spaces.pdfICS2208 Lecture6 Notes for SL spaces.pdf
ICS2208 Lecture6 Notes for SL spaces.pdfVanessa Camilleri
 
Transaction Management in Database Management System
Transaction Management in Database Management SystemTransaction Management in Database Management System
Transaction Management in Database Management SystemChristalin Nelson
 
4.16.24 21st Century Movements for Black Lives.pptx
4.16.24 21st Century Movements for Black Lives.pptx4.16.24 21st Century Movements for Black Lives.pptx
4.16.24 21st Century Movements for Black Lives.pptxmary850239
 
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptx
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptxINTRODUCTION TO CATHOLIC CHRISTOLOGY.pptx
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptxHumphrey A Beña
 
Oppenheimer Film Discussion for Philosophy and Film
Oppenheimer Film Discussion for Philosophy and FilmOppenheimer Film Discussion for Philosophy and Film
Oppenheimer Film Discussion for Philosophy and FilmStan Meyer
 
ClimART Action | eTwinning Project
ClimART Action    |    eTwinning ProjectClimART Action    |    eTwinning Project
ClimART Action | eTwinning Projectjordimapav
 
4.11.24 Poverty and Inequality in America.pptx
4.11.24 Poverty and Inequality in America.pptx4.11.24 Poverty and Inequality in America.pptx
4.11.24 Poverty and Inequality in America.pptxmary850239
 
week 1 cookery 8 fourth - quarter .pptx
week 1 cookery 8  fourth  -  quarter .pptxweek 1 cookery 8  fourth  -  quarter .pptx
week 1 cookery 8 fourth - quarter .pptxJonalynLegaspi2
 
Student Profile Sample - We help schools to connect the data they have, with ...
Student Profile Sample - We help schools to connect the data they have, with ...Student Profile Sample - We help schools to connect the data they have, with ...
Student Profile Sample - We help schools to connect the data they have, with ...Seán Kennedy
 
Unraveling Hypertext_ Analyzing Postmodern Elements in Literature.pptx
Unraveling Hypertext_ Analyzing  Postmodern Elements in  Literature.pptxUnraveling Hypertext_ Analyzing  Postmodern Elements in  Literature.pptx
Unraveling Hypertext_ Analyzing Postmodern Elements in Literature.pptxDhatriParmar
 
4.11.24 Mass Incarceration and the New Jim Crow.pptx
4.11.24 Mass Incarceration and the New Jim Crow.pptx4.11.24 Mass Incarceration and the New Jim Crow.pptx
4.11.24 Mass Incarceration and the New Jim Crow.pptxmary850239
 
Active Learning Strategies (in short ALS).pdf
Active Learning Strategies (in short ALS).pdfActive Learning Strategies (in short ALS).pdf
Active Learning Strategies (in short ALS).pdfPatidar M
 

Recently uploaded (20)

MS4 level being good citizen -imperative- (1) (1).pdf
MS4 level   being good citizen -imperative- (1) (1).pdfMS4 level   being good citizen -imperative- (1) (1).pdf
MS4 level being good citizen -imperative- (1) (1).pdf
 
Mythology Quiz-4th April 2024, Quiz Club NITW
Mythology Quiz-4th April 2024, Quiz Club NITWMythology Quiz-4th April 2024, Quiz Club NITW
Mythology Quiz-4th April 2024, Quiz Club NITW
 
How to Make a Duplicate of Your Odoo 17 Database
How to Make a Duplicate of Your Odoo 17 DatabaseHow to Make a Duplicate of Your Odoo 17 Database
How to Make a Duplicate of Your Odoo 17 Database
 
Multi Domain Alias In the Odoo 17 ERP Module
Multi Domain Alias In the Odoo 17 ERP ModuleMulti Domain Alias In the Odoo 17 ERP Module
Multi Domain Alias In the Odoo 17 ERP Module
 
Daily Lesson Plan in Mathematics Quarter 4
Daily Lesson Plan in Mathematics Quarter 4Daily Lesson Plan in Mathematics Quarter 4
Daily Lesson Plan in Mathematics Quarter 4
 
Concurrency Control in Database Management system
Concurrency Control in Database Management systemConcurrency Control in Database Management system
Concurrency Control in Database Management system
 
Blowin' in the Wind of Caste_ Bob Dylan's Song as a Catalyst for Social Justi...
Blowin' in the Wind of Caste_ Bob Dylan's Song as a Catalyst for Social Justi...Blowin' in the Wind of Caste_ Bob Dylan's Song as a Catalyst for Social Justi...
Blowin' in the Wind of Caste_ Bob Dylan's Song as a Catalyst for Social Justi...
 
ICS2208 Lecture6 Notes for SL spaces.pdf
ICS2208 Lecture6 Notes for SL spaces.pdfICS2208 Lecture6 Notes for SL spaces.pdf
ICS2208 Lecture6 Notes for SL spaces.pdf
 
Transaction Management in Database Management System
Transaction Management in Database Management SystemTransaction Management in Database Management System
Transaction Management in Database Management System
 
4.16.24 21st Century Movements for Black Lives.pptx
4.16.24 21st Century Movements for Black Lives.pptx4.16.24 21st Century Movements for Black Lives.pptx
4.16.24 21st Century Movements for Black Lives.pptx
 
Faculty Profile prashantha K EEE dept Sri Sairam college of Engineering
Faculty Profile prashantha K EEE dept Sri Sairam college of EngineeringFaculty Profile prashantha K EEE dept Sri Sairam college of Engineering
Faculty Profile prashantha K EEE dept Sri Sairam college of Engineering
 
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptx
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptxINTRODUCTION TO CATHOLIC CHRISTOLOGY.pptx
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptx
 
Oppenheimer Film Discussion for Philosophy and Film
Oppenheimer Film Discussion for Philosophy and FilmOppenheimer Film Discussion for Philosophy and Film
Oppenheimer Film Discussion for Philosophy and Film
 
ClimART Action | eTwinning Project
ClimART Action    |    eTwinning ProjectClimART Action    |    eTwinning Project
ClimART Action | eTwinning Project
 
4.11.24 Poverty and Inequality in America.pptx
4.11.24 Poverty and Inequality in America.pptx4.11.24 Poverty and Inequality in America.pptx
4.11.24 Poverty and Inequality in America.pptx
 
week 1 cookery 8 fourth - quarter .pptx
week 1 cookery 8  fourth  -  quarter .pptxweek 1 cookery 8  fourth  -  quarter .pptx
week 1 cookery 8 fourth - quarter .pptx
 
Student Profile Sample - We help schools to connect the data they have, with ...
Student Profile Sample - We help schools to connect the data they have, with ...Student Profile Sample - We help schools to connect the data they have, with ...
Student Profile Sample - We help schools to connect the data they have, with ...
 
Unraveling Hypertext_ Analyzing Postmodern Elements in Literature.pptx
Unraveling Hypertext_ Analyzing  Postmodern Elements in  Literature.pptxUnraveling Hypertext_ Analyzing  Postmodern Elements in  Literature.pptx
Unraveling Hypertext_ Analyzing Postmodern Elements in Literature.pptx
 
4.11.24 Mass Incarceration and the New Jim Crow.pptx
4.11.24 Mass Incarceration and the New Jim Crow.pptx4.11.24 Mass Incarceration and the New Jim Crow.pptx
4.11.24 Mass Incarceration and the New Jim Crow.pptx
 
Active Learning Strategies (in short ALS).pdf
Active Learning Strategies (in short ALS).pdfActive Learning Strategies (in short ALS).pdf
Active Learning Strategies (in short ALS).pdf
 

Efficient Perception of Proteins and Nucleic Acids from Atomic Connectivity

  • 1. Efficient Perception of Proteins and Nucleic Acids from Atomic Connectivity Roger Sayle, Ph.D. NextMove Software, Cambridge, UK ACS National Meeting, San Diego, USA 26th March 2012
  • 2. motivation • Chemical Structure Normalization/Tautomers. • IMI Open PHACTS suggest using FDA rules. • FDA 2007 guidelines on structure registration contains a section on peptides. • Bridging the gap between cheminformatics (small molecules) and bioinformatics (DNA and protein sequences), peptides pose some interesting technical challenges. ACS National Meeting, San Diego, USA 26th March 2012
  • 3. Depiction of peptides ACS National Meeting, San Diego, USA 26th March 2012
  • 4. End Result CC[C@H](C)[C@H]1C(=O)N[C@H]2CSSC[C@@H](C(=O)N[C@@H](CSSC[C@@H](C(=O)NCC(=O)N[C@H ](C(=O)N[C@H](C(=O)N[C@H](C(=O)N[C@H](C(=O)N[C@H](C(=O)N[C@H](C(=O)N[C@H](C(=O)N [C@H](C(=O)N[C@H](C(=O)N[C@H](C(=O)N[C@@H](CSSC[C@H](NC(=O)[C@@H](NC(=O)[C@@H](N C(=O)[C@@H](NC(=O)[C@@H](NC(=O)[C@@H](NC(=O)[C@@H](NC(=O)[C@@H](NC(=O)[C@@H](NC2 =O)CO)CC(C)C)CC3=CC=C(C=C3)O)CCC(=O)N)CC(C)C)CCC(=O)O)CC(=O)N)CC4=CC=C(C=C4)O)C( =O)N[C@@H](CC(=O)N)C(=O)O)C(=O)NCC(=O)N[C@@H](CCC(=O)O)C(=O)N[C@@H](CCCNC(=N)N)C (=O)NCC(=O)N[C@@H](CC5=CC=CC=C5)C(=O)N[C@@H](CC6=CC=CC=C6)C(=O)N[C@@H](CC7=CC=C( C=C7)O)C(=O)N[C@@H]([C@@H](C)O)C(=O)N8CCC[C@H]8C(=O)N[C@@H](CCCCN)C(=O)N[C@@H]([ C@@H](C)O)C(=O)O)C(C)C)CC(C)C)CC9=CC=C(C=C9)O)CC(C)C)C)CCC(=O)O)C(C)C)CC(C)C)CC2 =CN=CN2)CO)NC(=O)[C@H](CC(C)C)NC(=O)[C@H](CC2=CN=CN2)NC(=O)[C@H](CCC(=O)N)NC(=O) [C@H](CC(=O)N)NC(=O)[C@H](C(C)C)NC(=O)[C@H](CC2=CC=CC=C2)N)C(=O)N[C@H](C(=O)N[C@ H](C(=O)N1)CO)[C@@H](C)O)NC(=O)[C@H](CCC(=O)N)NC(=O)[C@H](CCC(=O)O)NC(=O)[C@H](C (C)C)NC(=O)[C@H]([C@@H](C)CC)NC(=O)CN Human Insulin ACS National Meeting, San Diego, USA 26th March 2012
  • 5. File Format conversion • Information loss from MDL/SMILES to PDB/MMD Bond orders, Formal charges, Isotopes • Information loss from PDB/MMD to MDL/SMILES Residue information, Occupancy, B-factors ACS National Meeting, San Diego, USA 26th March 2012
  • 6. The challenge • Many structural biology and computational chemistry applications assume standard atom and residue naming, some even ordering. • In OpenEye’s OEChem toolkit, recognizing PDB residues is done in OEPercieveResidues, and in OpenBabel it’s done in OBChainsParsers. • This talk describes aspects of the advanced pattern matching algorithms that are used. ACS National Meeting, San Diego, USA 26th March 2012
  • 7. The Input • The input is a minimal connection table. • Only the graph connectivity is used, allowing molecules without bond orders to be handled. • Only heavy atoms are required, explicit and implicit hydrogen counts aren’t used. • Likewise, any formal charges, hybridization and 3D geometry are also ignored. • Typical sources include SMILES and XYZ files. ACS National Meeting, San Diego, USA 26th March 2012
  • 8. The Output (PDB file format) ATOM 261 N GLY 37 20.172 17.730 6.217 1.00 8.48 ATOM 262 CA GLY 37 21.452 16.969 6.513 1.00 9.20 ATOM 263 C GLY 37 21.143 15.478 6.427 1.00 10.41 ATOM 264 O GLY 37 20.138 15.023 5.878 1.00 12.06 ATOM 265 N ALA 38 22.055 14.701 7.032 1.00 9.24 ATOM 266 CA ALA 38 22.019 13.242 7.020 1.00 9.24 ATOM 267 C ALA 38 21.944 12.628 8.396 1.00 9.60 ATOM 268 O ALA 38 21.869 11.387 8.435 1.00 13.65 ATOM 269 CB ALA 38 23.246 12.697 6.275 1.00 10.43 ATOM 270 N THR 39 21.894 13.435 9.436 1.00 8.70 ATOM 271 CA THR 39 21.936 12.911 10.809 1.00 9.46 ATOM 272 C THR 39 20.615 13.191 11.521 1.00 8.32 ATOM 273 O THR 39 20.357 14.317 11.948 1.00 9.89 ATOM 274 CB THR 39 23.131 13.601 11.593 1.00 10.72 ATOM 275 OG1 THR 39 24.284 13.401 10.709 1.00 11.66 ATOM 276 CG2 THR 39 23.340 12.935 12.962 1.00 11.81 ATOM 277 N CYS 40 19.827 12.110 11.642 1.00 7.64 ATOM 278 CA CYS 40 18.504 12.312 12.298 1.00 8.05 ATOM 279 C CYS 40 18.684 12.451 13.784 1.00 7.63 ATOM 280 O CYS 40 19.533 11.718 14.362 1.00 9.64 ATOM 281 CB CYS 40 17.582 11.117 11.996 1.00 7.80 ATOM 282 SG CYS 40 17.199 10.929 10.237 1.00 7.30 ACS National Meeting, San Diego, USA 26th March 2012
  • 9. chains Algorithm Overview • Identify biopolymer backbones. • Recognize biopolymer monomer sidechains. ACS National Meeting, San Diego, USA 26th March 2012
  • 10. chains Algorithm Overview • Determine connected components. • Identify “Hetero” atoms (water, solvent, metals, ions, cofactors and common ligands; ATP, NAD, HEM). • Identify biopolymer backbones (peptide, DNA, RNA). • Traverse and number biopolymer backbone residues. • Recognize biopolymer monomer sidechains (potentially substituted functional groups). • Annotate hydrogen atoms (if present). • Sort atoms into PDB order. • Assign consecutive serial numbers (unique names). ACS National Meeting, San Diego, USA 26th March 2012
  • 11. Polymer Matching • Matching of linear, cyclic and dendrimeric polymers and copolymers can be efficiently implemented by graph relaxation algorithms. • Each atom records a set of possible template equivalences represented as a bit vector. • These bit vectors are iteratively refined using the bit vectors of neighboring atoms. ACS National Meeting, San Diego, USA 26th March 2012
  • 12. Peptides and Proteins • Example of a copolymer (substituted backbone). Q H OH N H O • Complications from glycine and proline. • N-terminal acetyl and formyl. • C-terminal aldehyde and amide. ACS National Meeting, San Diego, USA 26th March 2012
  • 13. Protein Backbone Definition • Assign initial constraints based upon atomic number and heavy atom degree. BitN Nitrogen 2 neighbors BitCA Carbon 3 neighbors BitC Carbon 3 neighbors BitO Oxygen 1 neighbor ACS National Meeting, San Diego, USA 26th March 2012
  • 14. Protein Backbone Definition • Perform iterative relaxation at each atom using its neighbor’s bit masks. BitN: BitCA BitC BitCA: BitN BitC AnyC BitC: BitCA BitN BitO BitO: BitC ACS National Meeting, San Diego, USA 26th March 2012
  • 15. current Protein Definitions • Five types of nitrogen (±proline, ±terminal & amide) • Two types of alpha carbon (regular vs. GLY) • Four types of carbonyl carbon • Two types of oxygen (O and OXT) • Currently 13 atom types (bits). ACS National Meeting, San Diego, USA 26th March 2012
  • 16. Nucleic Acids Q Q OH O O O O H O P O O H H O P O O H OH OH DNA RNA • Complications include termination & methylation. • Currently 19 atom types (bits). ACS National Meeting, San Diego, USA 26th March 2012
  • 17. Functional Group Matching • Given an known attachment atom/bond determine whether the affixed molecular fragment is a member of a dictionary of known functional groups or sidechains. • The standard “default” strategy is to treat the dictionary as a list of SMARTS patterns and loop over each one until a hit is identified. • This is O(n) in the size of the dictionary. ACS National Meeting, San Diego, USA 26th March 2012
  • 18. Recognizing monomer sidechains • Avoid backtracking by preprocessing monomer sets. CG1 CD 1 CA CB CG2 • Enumerate all possible depth first graph traversals C1,C3,C2,C1,C1 => ILE: CA,CB,CG1,CD1,CG2 C1,C3,C1,C2,C1 => ILE: CA,CB,CG2,CG1,CD1 ACS National Meeting, San Diego, USA 26th March 2012
  • 19. Recognizing monomer sidechains • The sequence of atomic numbers and the heavy degree (or fanout) of each vertex uniquely identifies the sidechain. • Conceptually, the candidate sidechain is traversed once and identified by looking up the “traversal” in a “dictionary”. • Encode the traversal dictionary as a prefix-closed “tries” to efficiently match arbitrary dictionaries. ACS National Meeting, San Diego, USA 26th March 2012
  • 20. Example Peptide Flow Chart EVAL NEIGHBOURS Y COUNT=0? ASSIGN GLY N N COUNT=1? ASSIGN UNK Y N ELEM=C? ASSIGN UNK Y EVAL NEIGHBOURS Y COUNT=0? ASSIGN ALA N N COUNT=1? ASSIGN UNK Y Y ELEM=S? ASSIGN CYS N Y ELEM=O? ASSIGN SER N ASSIGN UNK ACS National Meeting, San Diego, USA 26th March 2012
  • 21. Graph matching during traversal • Instead of initially traversing the entire sidechain to be recognized and then looking it up in the dictionary, we can perform the look-up as we traverse our molecule. • This has the benefit that we can fail fast and run- time is bounded by the number of atoms in our largest monomer. • In fact, the algorithm’s performance is independent of the number of monomer definitions. ACS National Meeting, San Diego, USA 26th March 2012
  • 22. OEchem Implementation • In OEChem, a way-ahead-of-time “compiler” processes an input set of sidechain definitions and outputs them as a bytecode program. • These bytecodes are efficiently interpreted at run- time inside OEPerceiveResidues. • Separate sidechain “programs” are used for proteins and nucleic acids. • An early implementation in OELib and OpenBabel builds these bytecodes just-in-time. ACS National Meeting, San Diego, USA 26th March 2012
  • 23. Resulting Bytecode static const OEByteCode peptide_bc[301] = { { OEOpCode::EVAL, 0, 1, 0 }, // 0 { OEOpCode::COUNT, 0, 61, 60 }, // 1 { OEOpCode::ELEM, 6, 3, -1 }, // 2 { OEOpCode::EVAL, 0, 4, 0 }, // 3 { OEOpCode::COUNT, 0, 5, 6 }, // 4 { OEOpCode::ASSIGN, BCNAM('A','L','A'), 2, 0 }, // 5 { OEOpCode::COUNT, 1, 7, 77 }, // 6 { OEOpCode::ELEM, 6, 8, 41 }, // 7 { OEOpCode::EVAL, 0, 9, 0 }, // 8 { OEOpCode::COUNT, 0, 174, 173 }, // 9 { OEOpCode::ELEM, 7, 11, 17 }, // 10 { OEOpCode::EVAL, 0, 12, 0 }, // 11 { OEOpCode::COUNT, 0, 13, 229 }, // 12 { OEOpCode::ELEM, 8, 14, -1 }, // 13 { OEOpCode::EVAL, 0, 15, 0 }, // 14 { OEOpCode::COUNT, 0, 16, -1 }, // 15 { OEOpCode::ASSIGN, BCNAM('A','S','N'), 5, 0 }, // 16 ACS National Meeting, San Diego, USA 26th March 2012
  • 24. Supported Peptide Sidechains • The twenty standard amino acids (D and L- forms). – ALA, ASN, ASP, ARG, CYS, GLN, GLU, GLY, HIS, ILE, LEU, LYS, MET, PHE, PRO, SER, THR, TRP, TYR and VAL. • Fourteen non-standard amino acids. – ABA, CGU, CME, CSD, HYP, LYZ, MEN, MLY, MSE, NLE, NVA, ORN, PCA, PTR, SEP and TPO. • Two N-terminal modifications. – Acetyl (ACE) and Formyl (FOR). • One C-terminal modification residue. – Amide (NH2). ACS National Meeting, San Diego, USA 26th March 2012
  • 25. Supported Nucleic Sidechains • Four standard DNA bases. – DA, DC, DG, DT. • Four standard RNA bases. – A, C, G, U. • Seven non-standard RNA bases. – 1MA, 2MG, 5MC, 7MG, M2G, PSU, YG. ACS National Meeting, San Diego, USA 26th March 2012
  • 26. OEPerceiveResidues Performance • As measured on a 2.2GHz AMD Opteron. Code Atoms Mons Timings Average 1CRN 327 46 15s/10K 1.5 ms 4TNA 1,656 76 98s/10K 9.8 ms 1GD1 10,984 1336 185s/1K 185 ms • Previous work by Siani, Weininger and Blaney (JCICS 1994) report taking 8.2s on an R3000 SGI to identify insulin (51aa) using SMARTS matching. ACS National Meeting, San Diego, USA 26th March 2012
  • 27. byteCode Size Issues • Using all possible traversal permutations works well for the standard amino and nucleic acid sidechains. • To match the 34 supported amino acid sidechains required a “program” of only 521 bytecodes, and the 11 supported nucleic acid bases required 3,488 byte codes. • Then came wybutosine (PDB residue YG) which had pathological branching requiring 91,974 byte codes due to combinatorics. ACS National Meeting, San Diego, USA 26th March 2012
  • 28. Wybutosine ACS National Meeting, San Diego, USA 26th March 2012
  • 29. Bytecode Size Issues • The solution was to perform a local sorting step when pushing neighboring atoms on to the stack. • Sort uses atomic number and heavy degree. • No performance impact in OEChem as we were already sorting for permutation stability. Dictionary Before After Peptide 521 301 Nucleic 3,488 326 Nucleic+YG 95,391 649 ACS National Meeting, San Diego, USA 26th March 2012
  • 30. Identifying peptides • To a first approximation, use OpenBabel to convert the SMILES to a PDB file, and if it only contains ATOMs (no HETATMs) it’s a peptide. • A better definition is to use an explicit list of acceptable amino (and nucleic) acid residues and check for the presence of these. • Crosslinking is easily retrieved as bonds between residues, excepting standard eupeptide links. ACS National Meeting, San Diego, USA 26th March 2012
  • 31. Crambin (FDA & IUPAC STYLES) ACS National Meeting, San Diego, USA 26th March 2012
  • 32. Human insulin (MULtiple chains) ACS National Meeting, San Diego, USA 26th March 2012
  • 33. Oxytocin (Disulfide & amide) ACS National Meeting, San Diego, USA 26th March 2012
  • 34. Gramicidin S (cyclic peptides) ACS National Meeting, San Diego, USA 26th March 2012
  • 35. LULiBERIN (terminal options) ACS National Meeting, San Diego, USA 26th March 2012
  • 36. NESIritide (stereo options) ACS National Meeting, San Diego, USA 26th March 2012
  • 37. Database statistics • Unfortunately, of 250251 compounds in the NCI Aug 2000 database, only 758 (0.3%) are peptide or peptide-like and only 86 (0.03%) are oligonucleotide sequences. • Of the 32,274,581 compounds in PubChem[February 2012], only 251,866 (0.78%) and only 1075 (0.00%) are oligonucleotide sequences. • Alas the SMARTS pattern “NCC(=O)NCC(=O)NCC=O” for a tripeptide only matches 391,502 (1.2%), an upper bound on the prevalence of peptides. ACS National Meeting, San Diego, USA 26th March 2012
  • 38. Future Work • Analysis of non-standard AA usage in drugs. • Additional non-standard sidechains (SAR). • Alloisoleucine (IIL) and allothreonine (ALO). • Glycoinformatics (post-translational sugar chains). • Matching improvements (compilation/profiling). ACS National Meeting, San Diego, USA 26th March 2012
  • 39. conclusions • Peptides and biologics don’t need their own file formats if their sequence can be perceived. • Generating prettier 2D depictions is simple. • Peptide informatics (and glycoinfomatics) potentially offer great opportunities for research and innovation. • It’s always great to find a new use for an old algorithm. ACS National Meeting, San Diego, USA 26th March 2012
  • 40. acknowledgements • OpenEye Scientific Software. • AstraZeneca R&D. • Vertex Pharmaceuticals. • European Bioinformatics Institute. • RasMol’s user community. • Thank you for your time. ACS National Meeting, San Diego, USA 26th March 2012