
Extracting Synthetic Knowledge from Reaction Databases - ARChem at the 246th ACS

Underpinning the computer-aided synthesis design system ARChem are algorithms that extract synthetic knowledge from large reaction databases. These slides discuss the generation of reaction rules that facilitate retrosynthetic analysis, as well as the extraction of information about expected yields, regioselectivity, functional group compatibility, and stereochemistry.

Published in: Technology, Education

Transcript of "Extracting Synthetic Knowledge from Reaction Databases - ARChem at the 246th ACS"

  1. Extracting Synthetic Knowledge from Reaction Databases
     Orr Ravitz, SimBioSys Inc.
     246th ACS National Meeting
  2. ARChem – main concepts
     A computer-aided synthesis design system.
     The approach:
     • Comprehensive rule- and precedent-based retrosynthetic analysis back to available starting materials.
     • Automated rule generation with manual rule curation.
     • Generate many alternatives.
     • Provide supporting literature examples.
     • Allow user guidance and control.
  3. Solution Display
  4. Exploring Alternative Paths
  5. Supporting Examples
  6. Chemical Interference
     Functional groups that may interfere with transformations are highlighted.
  7. Functional Group Tolerance
     A breakdown of the example set based on the presence of functional groups beyond the reaction center provides evidence for compatibility. Examples can be exported to the database's web interface for further analysis.
  8. Stereochemistry
     Currently:
     • Exact matches
     • Starting materials
     Coming soon:
     • Rule-based
  9. Essential Information
     Automated extraction of knowledge:
     • Reaction rules
     • Yield values
     • Chemical interference (functional group tolerance)
     • Regioselectivity
     • Stereochemistry
     Data → (perceive) → Information → (generalize) → Knowledge
  10. System Design
      Diagram labels: Reactions, Reaction Rules, Starting Materials, Expert Knowledge-bases, Target
  11. Rule Extraction (Reactions → Reaction Rules)
      Source reactions: esterification examples (··· → ···), other examples (··· → ···)
      Extracted rules: esterification rule, other rule (··· → ···)
  12. Reaction Perception (Reactions → Reaction Rules)
      • Source reaction: reaction file with atom mapping.
      • Extracted core: atoms attached to bonds changed, made or broken in the reaction.
      • Extended core: includes all structural motifs that are essential for the reaction to occur.
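The definition of the extracted core on slide 12 can be illustrated with a minimal sketch. It assumes each side of an atom-mapped reaction is given as a dictionary from atom-map-number pairs to bond orders; that data layout, the function names and the esterification numbering are all hypothetical and are not ARChem's actual representation.

```python
def bond(a, b):
    """Order-independent key for a bond between two atom-map numbers."""
    return frozenset((a, b))


def extracted_core(reactant_bonds, product_bonds):
    """Return (changed_bonds, core_atoms) for an atom-mapped reaction.

    A bond counts as changed if it is made, broken, or has a different
    order on the two sides. The core is every atom attached to such a bond.
    """
    changed = {
        b
        for b in set(reactant_bonds) | set(product_bonds)
        if reactant_bonds.get(b) != product_bonds.get(b)
    }
    core = set()
    for b in changed:
        core |= b
    return changed, core


# Hypothetical esterification, hydrogens implicit:
# R(6)-C(1)(=O(2))-O(3)H + HO(4)-C(5)  ->  R(6)-C(1)(=O(2))-O(4)-C(5) + H2O(3)
reactants = {bond(6, 1): 1, bond(1, 2): 2, bond(1, 3): 1, bond(4, 5): 1}
products = {bond(6, 1): 1, bond(1, 2): 2, bond(1, 4): 1, bond(4, 5): 1}

changed_bonds, core_atoms = extracted_core(reactants, products)
```

Here the C(1)-O(3) bond is broken and the C(1)-O(4) bond is made, so the core consists of atoms 1, 3 and 4; the carbonyl oxygen and the R group are untouched and stay outside it.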
  13. Extending the Core: Passengers vs Drivers
      The goal of chemical perception is to discriminate between structural features that are essential for the reaction (drivers) and those that are passengers.
      Shell-based approach: 1st shell, 2nd shell.
      Graph-based methods are inappropriate.
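For readers unfamiliar with the shell terminology on slide 13, the sketch below shows what purely topological 1st and 2nd shells around a core look like. The adjacency-list layout and the function name are assumptions made for illustration. Because each shell is defined only by bond distance, it treats every neighbour alike, which may be why the slide calls graph-based methods inappropriate for separating drivers from passengers.

```python
def shells(adjacency, core, n_shells):
    """Return a list of atom sets: shell 1, shell 2, ... around the core.

    adjacency maps each atom to its bonded neighbours. Shell k holds the
    atoms exactly k bonds away from the nearest core atom.
    """
    seen = set(core)
    frontier = set(core)
    result = []
    for _ in range(n_shells):
        next_shell = set()
        for atom in frontier:
            for neighbour in adjacency.get(atom, ()):
                if neighbour not in seen:
                    next_shell.add(neighbour)
        seen |= next_shell
        result.append(next_shell)
        frontier = next_shell
    return result


# Hypothetical ester fragment: 6-1(=2)-4-5-7, with core atoms 1 and 4.
adjacency = {1: [2, 4, 6], 2: [1], 4: [1, 5], 5: [4, 7], 6: [1], 7: [5]}
first_shell, second_shell = shells(adjacency, core={1, 4}, n_shells=2)
```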
  14. Mechanism-Dependent Core Extension
      Nucleophilic aromatic substitution:
      • Addition/elimination mechanism: requires a π-acceptor group in the ortho or para position.
      • Via organometallic intermediate.
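The ortho/para requirement for the addition/elimination mechanism can be expressed as a simple positional check. This is a minimal sketch that assumes a six-membered ring whose positions are numbered 0 to 5 around the ring; the function names are hypothetical, and real perception would of course work on the molecular graph rather than on bare indices.

```python
def ring_distance(i, j, size=6):
    """Shortest number of ring bonds between positions i and j."""
    d = abs(i - j) % size
    return min(d, size - d)


def has_activating_acceptor(leaving_group_pos, acceptor_positions):
    """True if any pi-acceptor sits ortho (1 bond) or para (3 bonds) to the leaving group."""
    return any(
        ring_distance(leaving_group_pos, p) in (1, 3) for p in acceptor_positions
    )


# Leaving group at position 0 of a benzene ring:
para_nitro = has_activating_acceptor(0, [3])   # para acceptor
ortho_nitro = has_activating_acceptor(0, [5])  # ortho acceptor
meta_nitro = has_activating_acceptor(0, [2])   # meta acceptor
```

Under this check the para and ortho cases qualify for the addition/elimination rule, while the meta case does not, so the acceptor would have to be part of the extended core only in the first two.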
  15. Rule Extraction (Reactions → Reaction Rules)
      Diagram labels: similar extended cores, common extracted core, generalized rule, completed reaction rule.
      Nucleofuge (NF): a leaving group which carries away the bonding electron pair.
      Generalized group (NF) is replaced by the most common group.
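The last step on slide 15, replacing the generalized group with the most common one, amounts to a frequency count over the examples. The sketch below assumes a rule is written as a plain string with a `[NF]` placeholder; that notation, the function name and the example data are hypothetical.

```python
from collections import Counter


def complete_rule(generalized_rule, observed_groups, placeholder="[NF]"):
    """Substitute the most frequently observed group for the placeholder."""
    most_common_group, _count = Counter(observed_groups).most_common(1)[0]
    return generalized_rule.replace(placeholder, most_common_group)


# Hypothetical leaving groups seen across the examples of one rule.
observed = ["Cl", "Br", "Cl", "F", "Cl"]
completed = complete_rule("Ar-[NF] + Nu-H >> Ar-Nu", observed)
```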
  16. Interfering Functionality
      Following rule abstraction, compatible functionality is detected by examining the examples (compatible vs. interfering):
      • Moieties outside the extended core are listed as compatible.
      • Other functional groups are inferred to be "possibly interfering".
      • Possibly interfering functionality is penalized in scoring and highlighted to the user.
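The classification on slide 16 can be sketched with set operations. The sketch assumes each example has already been reduced to the set of functional-group names found outside its extended core, and that a fixed vocabulary of known groups exists; both inputs and the function name are hypothetical.

```python
def classify_groups(groups_outside_core_per_example, known_groups):
    """Split known functional groups into compatible and possibly interfering.

    A group seen outside the extended core in at least one example survived
    the reaction there, so it is listed as compatible. Every other known
    group has no supporting evidence and is flagged as possibly interfering.
    """
    compatible = set()
    for groups in groups_outside_core_per_example:
        compatible |= set(groups)
    possibly_interfering = set(known_groups) - compatible
    return compatible, possibly_interfering


examples = [{"ether", "nitrile"}, {"ether"}, {"aryl halide"}]
vocabulary = {"ether", "nitrile", "aryl halide", "aldehyde", "thiol"}
compatible, possibly_interfering = classify_groups(examples, vocabulary)
```

Note that "possibly interfering" here only means absence of evidence, which matches the cautious wording on the slide.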
  17. Regioselectivity – Main Steps
      • Recognize the rule's reaction type: electrophilic substitution, nucleophilic addition etc. Only reactions prone to regioselectivity are subject to regio calculations.
      • Identify competing sites.
      • Identify substituents and other structural motifs that may influence the directionality.
      • Collect statistics from the example set regarding selectivity in the reaction core as well as elsewhere in the molecule (chemoselectivity).
      • Assign regioselectivity to the rule if predefined statistical requirements are met.
  18. Collecting Statistics
      Electrophilic aromatic substitutions
      For each example in the database:
      • Evaluate ring activation, including for heteroaromatic rings and fused rings.
      • Evaluate the location, type and neighborhood of ring substituents.
      • Identify symmetry.
      • Compute environment signatures that include all aromatic features plus relevant substituents.
      For each rule:
      • Cluster reacting vs. non-reacting signature-equivalent sites for reactions with yield > 20%.
      • Define regioselectivity if the examples ratio is 10:1.
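The per-rule statistics on slide 18 can be sketched as a pairwise count over examples. Several details below are assumptions, not taken from the slides: each example is a dictionary with a `yield` value, the signature of the site that reacted and the signatures of sites that did not; the 10:1 ratio is read as "at least 10:1"; and when there are no counterexamples, at least ten supporting examples are required.

```python
from collections import Counter


def regio_preferences(examples, min_yield=20.0, min_ratio=10.0):
    """Return {(preferred, other): (n_for, n_against)} for signature pairs.

    Only examples with yield strictly above min_yield are counted. A
    preference is defined when the supporting examples outnumber the
    opposing ones by at least min_ratio to 1.
    """
    wins = Counter()
    for ex in examples:
        if ex["yield"] is None or ex["yield"] <= min_yield:
            continue  # low or missing yield: not used as evidence
        for other in ex["unreacted_sites"]:
            if other != ex["reacted_site"]:
                wins[(ex["reacted_site"], other)] += 1

    preferences = {}
    for (a, b), n_for in list(wins.items()):
        n_against = wins.get((b, a), 0)
        if n_for >= min_ratio * max(n_against, 1):
            preferences[(a, b)] = (n_for, n_against)
    return preferences


# Hypothetical data: 11 good-yield examples favour "para" over "ortho",
# one good-yield example goes the other way, one low-yield example is ignored.
data = [{"yield": 80.0, "reacted_site": "para", "unreacted_sites": ["ortho"]}] * 11
data.append({"yield": 50.0, "reacted_site": "ortho", "unreacted_sites": ["para"]})
data.append({"yield": 5.5, "reacted_site": "ortho", "unreacted_sites": ["para"]})
prefs = regio_preferences(data)
```

The yield filter matters: a low-yield record that slips through as if it were high-yield would count as positive evidence for the wrong site.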
  19. Regio Example
      Figure labels: X=Cl, 84%; X=Cl, 5.5%; Rejected; Misinterpreted yield value provided positive evidence.
  20. Stereochemistry – the challenge
      • Efficient machine perception and representation of a broad range of synthetically important stereogenic types, including tetrahedral C, S, N and P, as well as alkenes, allenes and atropisomers.
      • Representation of stereochemical reaction rules and stereochemical strategies.
      • Development of a versatile stereochemical substructure algorithm to support retron matching.
      • Efficient discovery of symmetry in stereochemically defined molecules and rules, to avoid duplicate routes.
      • Stereoselectivity is captured inaccurately and inconsistently across common databases.
  21. The Data

      | Database content | Portion of data | Notes |
      |---|---|---|
      | Number of unmapped examples | 14% | Reaction type unknown |
      | Number of examples belonging to reactions with 5000 or more examples | 4% | Ubiquitous protection / deprotection reactions |
      | Number of examples belonging to reactions with 20 or fewer examples | 16% | Bad atom maps (database errors); multistep reaction sequences |
      | General usable examples | 65% | |

      [Chart: Examples with quoted selectivity values. x-axis: selectivity metric (yield, cs, de, ee); y-axis: % of database]
      [Chart: Examples with selectivity above a threshold. x-axis: threshold selectivity values (> 0% to > 98%); y-axis: % of available; series: yield, cs, de, ee]
  22. Stereo-Rules Generation – A Different Approach
      • Manually code rules for a diverse set of useful enantioselective and generally selective reaction types.
      • Mine supporting examples from existing large reaction databases to discover reaction scope and limitations for each rule.
      • Find effective strategies to aid planning of a stereocontrolled synthesis.
      Reactions: Diels-Alder, Sharpless, reduction of C=C, reduction of C=O
  23. Designing a Rule-Set
      70 reaction types with ee>95% and more than 50 examples

      | Reaction type | Bond alterations | Examples with ee ≥95% | Notes |
      |---|---|---|---|
      | Addition of C nucleophiles to C=C | CH + C=C → CCCH | 1603 | Mostly conjugate additions |
      | Reduction of C=O | C=O → HCOH | 1553 | Any type of carbonyl |
      | Addition of C nucleophiles to C=O | CH + C=O → CCOH | 1265 | Includes mostly Aldols + alkynylations |
      | Reduction of C=C | C=C → HCCH | 1120 | Wide variety of environments |
      | Addition of C nucleophiles to C=N | CH + C=N → CCNH | 639 | Any type of C=N |
      | Epoxidation of C=C | C=C → C1CO1 | 415 | Sharpless, Jacobsen, Shi etc. |
      | Addition via R3B to C=C | C-B + C=C → CCCH | 329 | Mostly conjugate addition to enones |
      | Addition via R2Zn to C=O | C-Zn + C=C → CCCH | 306 | |
      | Dihydroxylation of C=C | C=C → HOCCOH | 266 | |
      | Reduction of C=N | C=N → HCNH | 256 | Any type of C=N |
      | Diels-Alder | C=C + C=CC=C → C1CCC=CC1 | 222 | Carbocyclic Diels-Alder |
      | Cyclopropanation of C=C | C=N + C=C → C1CC1 | 222 | Via diazo precursor (carbene) |
      | Mukaiyama Aldol | SiOC=C + C=O → O=CCCOH | 210 | |
      | C substitution of Br | CH + CBr → CC | 199 | |
      | [2+3] azomethine cycloaddition | C=NCH + C=C → N1CCCC1 | 198 | |
      | Addition via R2Zn to C=C | CZn + C=C → CCCH | 162 | Mostly conjugate addition to enones |
      | Addition via R3B to C=O | CB + C=O → CCOH | 141 | |
      | Oxidation of sulphides | S → S=O | 137 | Chiral sulphoxides |
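  24. Enabling Technology: Perception of stereochemistry in structural diagrams
      Stereocenter manipulation and stereo descriptors (conceptual model → stereo descriptor)

      Rotations (E + 8C3 + 3C2):

      |   | Op | 1 | 2 | 3 | 4 |
      |---|---|---|---|---|---|
      | A | E | 1 | 2 | 3 | 4 |
      | B | C3^2 | 1 | 3 | 4 | 2 |
      | C | C3^1 | 1 | 4 | 2 | 3 |
      | D | C2 | 2 | 1 | 4 | 3 |
      | E | C3^1 | 2 | 3 | 1 | 4 |
      | F | C3^2 | 2 | 4 | 3 | 1 |
      | G | C3^2 | 3 | 1 | 2 | 4 |
      | H | C3^1 | 3 | 2 | 4 | 1 |
      | J | C2 | 3 | 4 | 1 | 2 |
      | K | C3^1 | 4 | 1 | 3 | 2 |
      | L | C3^2 | 4 | 2 | 1 | 3 |
      | M | C2 | 4 | 3 | 2 | 1 |

      Reflection:

      | Op | 1 | 2 | 3 | 4 |
      |---|---|---|---|---|
      | s | 2 | 1 | 3 | 4 |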
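The twelve rotations listed on slide 24 (E + 8C3 + 3C2) are exactly the even permutations of the four ligand positions of a tetrahedral center, while the reflection s is an odd permutation. The sketch below uses that fact to decide whether two ligand orderings describe the same configuration. It illustrates the underlying group theory only; the function names are hypothetical and nothing here is claimed to be ARChem's stereo descriptor code.

```python
from itertools import permutations


def parity(perm):
    """Return 0 for an even permutation and 1 for an odd one (inversion count)."""
    inversions = sum(
        1
        for i in range(len(perm))
        for j in range(i + 1, len(perm))
        if perm[i] > perm[j]
    )
    return inversions % 2


# The 12 even permutations of (1, 2, 3, 4), in lexicographic order.
rotations = sorted(p for p in permutations((1, 2, 3, 4)) if parity(p) == 0)


def same_configuration(order_a, order_b):
    """True if two orderings of the same four ligands are related by a rotation."""
    perm = tuple(order_a.index(ligand) + 1 for ligand in order_b)
    return parity(perm) == 0


reference = ["F", "Cl", "Br", "H"]
rotated = same_configuration(reference, ["Cl", "F", "H", "Br"])    # 2 1 4 3, a C2
mirrored = same_configuration(reference, ["Cl", "F", "Br", "H"])   # 2 1 3 4, the reflection
```

Sorted lexicographically, `rotations` reproduces the rows A to M of the slide's table in the same order.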
  25. Enabling Technology: Chemical constraints layer of representation
      Constraints for substructure search/match:
      CONNECTIONS=1,2,3
      FUSION=BIARYL
      RINGS=5+6,6+7
      BRIDGEHEAD=YES
      DIFFRING=1
      EPS=0,1
      SAMERING=1
      HETS=0,1,2
      DIFF=1
      NONAROMHETS=0,1,2
      SAME=1
      HALOGENS=0,1,2
      ARYL=YES
      FGS=ALCOHOL
      SPCENTRE=1,2,3
      FGNOT=CARBONYL
      CHARGE=YES
      PROP=EWG
      HS=0,1,2
      PROPNOT=Lg
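The constraint strings on slide 25 follow a visible `KEY=value` pattern. The sketch below parses that surface syntax into a dictionary. It assumes whitespace separates constraints and reads commas as separating alternative values; both points are guesses from the slide, and the meaning of individual keywords is not interpreted. The function name is hypothetical.

```python
def parse_constraints(text):
    """Parse 'KEY=v1,v2 KEY2=v3' into {'KEY': ['v1', 'v2'], 'KEY2': ['v3']}."""
    constraints = {}
    for token in text.split():
        key, _, value = token.partition("=")
        constraints[key] = value.split(",")
    return constraints


parsed = parse_constraints("CONNECTIONS=1,2,3 FUSION=BIARYL RINGS=5+6,6+7 FGNOT=CARBONYL")
```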
  26. Reduction of Ketones to Secondary Alcohols
      Base ARChem rule:
      • Level 0: Bond change constraints only
      • Level 1: + Environment constraints
      • Level 1: + Stereochemical constraints

      | Hits | ee | de | (screen) | Notes |
      |---|---|---|---|---|
      | 10,004 | | | (10,004) | Not unique to ketone → secondary alcohol conversion |
      | 8,442 | | | (10,004) | Unique to ketone → secondary alcohol conversion; 140 tolerated functional groups |
      | 6,525 | 3,457 | 4,711 | (6,765) | Enantioselective and diastereoselective examples |
  27. Dihydroxylation of Alkenes
      2253 examples (2416 screened)
      • Level 1: Bond changes with environment constraints
      • Level 2: + Stereochemical constraints
      • Level 3: + Substitution patterns

      | Hits | ee | de | (screen) |
      |---|---|---|---|
      | 1,428 | 1,008 | 1,151 | (1,634) |
      | 428 | 117 | 352 | (444) |

      | Hits | ee | de |
      |---|---|---|
      | 681 | 578 | 552 |
      | 526 | 289 | 418 |
      | 206 | 131 | 168 |
      | 12 | 10 | 11 |
      | 236 | 89 | 191 |
      | 123 | 51 | 103 |
      | 51 | 27 | 41 |
      | 8 | 4 | 7 |
  28. Conclusions
      • Useful chemical knowledge can be extracted algorithmically from reaction databases.
      • Automation is crucial given the size and growth of databases.
      • Different layers of knowledge are tightly entangled: regioselectivity, chemoselectivity and stereoselectivity overlap considerably.
      • The extracted knowledge can be applied effectively in computer-aided synthesis design, and can empower chemists by offering new ideas and a broader perspective on the literature.
      But... the quality of extracted knowledge depends heavily on the accuracy and scope of the source data!
  29. The Rule-Set
      Diagram labels: cut-off threshold, useful reactions, noise, distractions, low utility reactions.
      • Bad atom maps (avoid)
      • Rare multistep reaction sequences (low utility)
      • Multiple concurrent reactions on substrate (very low utility)
      • Exotic heterocycle formation (promote)
      • Ubiquitous protection / deprotection; FGIs such as alcohol/ester, amine/amide etc. (demote)
  30. Conclusions
      • A significant portion of the data is being lost due to mapping errors and other problems.
      • Yield and selectivity information is captured inconsistently.
      What can be done:
      • Metadata perception can be improved (in progress).
      • Mapping algorithms should reflect contemporary mechanistic understanding of reactions.
      • Systematic mapping errors can be manually fixed (planned).
      • Extracted rules can be manually curated (continuous).
  31. Acknowledgements
      • SimBioSys: James Law (regioselectivity), Victoria Lubitch, Yasamin Salmasi, Aniko Simon, Zsolt Zsoldos
      • Reaction data: Elsevier (Reaxys), Wiley (CIRX), RSC (MOS), Accelrys (RefLib)
      • University of Leeds: Tony Cook (stereochemistry), Peter Johnson, Steve Marsden
      • Other collaborators: ChemAxon
      • And ARChem users!
      THANK YOU!