AI in Chemistry: Deep Learning Models Love Really Big Data
Christoph Steinbeck
Machine Learning and Artificial Intelligence
• Artificial Intelligence (AI)
• Neural Networks
• Machine-Learning
• Deep Learning
1993
Deep Learning in
Chemistry
• Prediction of
• Chemical properties
• Reactions
• Chemical structure
• Knowledge extraction
Counterpropagation Neural Network
with a few hundred neurons.
SGI Origin 200 workstation with one
180-MHz IP27 processor
and running IRIX 6.3. (Good old days)
What has changed since
the good old days?
• Advancements in algorithms and hardware (GPU
training)
• Growth of neural networks from hundreds to hundreds of
thousands of neurons
• Availability of big data in a few areas led to the
iconic breakthroughs
SHK: “We have seen that AI methods require more data
than deterministic methods, and deep learning methods
need even more.”
SHK: “… it should be noted that datasets which are
considered gold standard tend to be very large. For
example, the Image Net dataset, a gold standard in
image classification, contains 14,197,122 images as of
now.”
Highlights of Deep Learning in Chemistry
Jablonka, K. M.; et al. Nat. Mach. Intell. 2024, 6 (2), 161–169.
Our results raise a very important question: how can a natural
language model with no prior training in chemistry outperform
dedicated machine learning models, as we were able to show in
the case of high-entropy alloys in Fig. 2 and for various
molecule, material and chemical reaction properties in Extended
Data Table 2? To our knowledge, this fundamental question has
no rigorous answer.
As we show in this Article, a machine learning system built
using GPT-3 works impressively well for a wide range of
questions in chemistry—even for those for which we cannot use
conventional line representations such as SMILES. Compared
with conventional machine learning, it has many advantages.
GPT-3 can be used for many different applications.
Highlights of Deep Learning in Chemistry
He, J.; et al. J. Cheminformatics 2024, 16 (1), 95.
• Recurrent Neural Networks (RNNs)
• Variational Autoencoders (VAEs)
• Transformers
• Generative Adversarial Networks (GANs)
• Graph Neural Networks (GNNs)
• Diffusion-based Models
• Molecular generative model
• Scoring function
• Reinforcement Learning (RL) as a search algorithm
Highlights of Deep Learning in Chemistry
Kirkpatrick et al., Science 374, 1385–1389 (2021)
Highlights of Deep Learning in Chemistry
Abramson, et al., Nature 2024, 630 (8016), 493–500.
The introduction of AlphaFold 2 has spurred a revolution in
modelling the structure of proteins and their interactions,
enabling a huge range of applications in protein modelling and
design. Here we describe our AlphaFold 3 model with a
substantially updated diffusion-based architecture that is
capable of predicting the joint structure of complexes including
proteins, nucleic acids, small molecules, ions and modified
residues.
Highlights of Deep Learning in Chemistry
ACS Cent. Sci. 2019, 5, 9, 1572–1583
Organic synthesis is one of the key stumbling blocks in medicinal chemistry. A necessary
yet unsolved step in planning synthesis is solving the forward problem: Given reactants
and reagents, predict the products. Similar to other work, we treat reaction prediction as a
machine translation problem between simplified molecular-input line-entry system
(SMILES) strings (a text-based representation) of reactants, reagents, and the products.
We show that a multihead attention Molecular Transformer model outperforms all
algorithms in the literature, achieving a top-1 accuracy above 90% on a common
benchmark data set.
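Treating reaction prediction as translation between SMILES strings means each string first has to be split into tokens. The sketch below shows one way this could be done with a regular expression of the kind often used for SMILES sequence models. The pattern and the function name tokenize_smiles are assumptions for illustration; the slides do not say which tokenizer was used.

```python
import re

# Assumed tokenization pattern (not taken from the slides): bracket atoms,
# two-letter halogens, single-letter atoms, ring-closure digits, bonds, branches.
SMILES_TOKEN_PATTERN = re.compile(
    r"(\[[^\]]+\]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\.|=|#|-|\+|\\|/|:|~|@|\?|>|\*|\$|%[0-9]{2}|[0-9])"
)


def tokenize_smiles(smiles: str) -> list[str]:
    """Split a SMILES string into tokens and check that nothing was dropped."""
    tokens = SMILES_TOKEN_PATTERN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"Could not tokenize SMILES completely: {smiles!r}")
    return tokens


if __name__ == "__main__":
    # Caffeine, the example molecule used elsewhere in this talk
    print(tokenize_smiles("CN1C=NC2=C1C(=O)N(C(=O)N2C)C"))
```

The round-trip check guards against characters the pattern does not cover, so unsupported input fails loudly instead of being silently truncated.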
15 March 2016: Lee Sedol, a top-ranked Go player, loses the last of five games to AlphaGo. Lee Jin-man / AP
In the match against Lee,
DeepMind’s AlphaGo used 1,202
CPUs and 176 GPUs.
AlphaGo Zero: Mastering the Game of Go
without Human Knowledge
• DeepMind's AlphaGo Zero
implements a Monte Carlo tree
search with a convolutional
neural network providing
position evaluation and policy
guidance.
• With only the rules of Go
known, AlphaGo Zero improved
to superhuman playing strength
after a day of training (5 million
games).
• It uses just a single machine in
the Google Cloud with 4 TPUs
Silver, D., Schrittwieser, J., Simonyan, K. et al. Mastering the game of Go without human knowledge. Nature 550, 354–359 (2017). https://doi.org/10.1038/nature24270
Fabricating large numbers of chemical structure
depictions to solve the OCSR problem
The
Project
Information in printed literature is not readily
available in databases
Image source: Chen et al. 2020, J. Nat. Prod.
Organism Name Chemical Name
Chemical Class
Biol. Activity
Chemical Structures
Optical Chemical Structure Recognition (OCSR) Tools
Mol file
Black pixels on white paper
Rule based methods
1. Scanning
2. Vectorization
3. Searching for dashed lines
and dashed wedges
4. Character recognition
5. Graph compilation
6. Post processing
7. Display and editing
DECIMER: Deep LEarning for Chemical IMagE
Recognition
Image source: Wijeratne et al. 2001, J. Nat. Prod.
Segmentation
(Identification &
Extraction)
Prediction
CN1C=NC2=C1C(=O)N(C(=O)N2C)C
Re-Depicted Structure
SMILES (simplified molecular-input line-entry system)
OCSR Engine
Kohulan Rajan
DECIMER – Image to SMILES
CN1C=NC2=C1C(=O)N(C(=O)N2C)C
Show and tell: Image Caption Generator
DECIMER – Image to SMILES
Reference and Image source: Xu et al. 2015, arXiv[cs.LG]
DECIMER – Image to SMILES (Performance)
Image source: Rajan et al. 2020, J Cheminform
[Figure: Tanimoto similarity calculations vs training data size, plotted by dataset index]
Average Tanimoto similarity on valid SMILES: 0.13, 0.22, 0.38, 0.48, 0.53, 0.62, 0.68
Percentage of molecules with Tanimoto 1.0: 0.1%, 0.2%, 6.7%, 13.2%, 18.9%, 26.0%, 27.0%
Similarity measure: Tanimoto coefficient between a query Q and a target T
Tanimoto coefficient = |Q ∩ T| / |Q ∪ T|
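The Tanimoto coefficient above is the size of the intersection of query and target divided by the size of their union. A minimal sketch follows, assuming each fingerprint is represented as a Python set of "on" bit indices. The function names and the handling of two empty fingerprints are illustrative assumptions, not DECIMER's implementation.

```python
def tanimoto(query_bits: set[int], target_bits: set[int]) -> float:
    """Tanimoto coefficient |Q ∩ T| / |Q ∪ T| for two sets of fingerprint bits."""
    union = query_bits | target_bits
    if not union:
        # Assumed convention: two empty fingerprints count as identical.
        return 1.0
    return len(query_bits & target_bits) / len(union)


def summarize(similarities: list[float]) -> tuple[float, float]:
    """Return the two quantities plotted on the performance slides:
    the average Tanimoto similarity and the percentage of scores equal to 1.0."""
    average = sum(similarities) / len(similarities)
    percent_identical = 100.0 * sum(1 for s in similarities if s == 1.0) / len(similarities)
    return average, percent_identical


if __name__ == "__main__":
    query = {1, 2, 3, 4}
    target = {3, 4, 5, 6}
    print(tanimoto(query, target))  # 2 shared bits out of 6 distinct bits
    print(summarize([1.0, 0.5, 1.0, 0.0]))
```

A Tanimoto score of 1.0 means the fingerprints match, which is not always the same as the predicted and original structures being identical; that is presumably why later slides report both figures.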
DECIMER – Image to SMILES (Performance)
The infamous Figure 5
Rajan, K., Zielesny, A. & Steinbeck, C.
DECIMER: towards deep learning for chemical image recognition.
J Cheminform 12, 65 (2020). https://doi.org/10.1186/s13321-020-00469-w
DECIMER – Image to SMILES (Training time)
Image source: Rajan et al. 2020, J Cheminform
25 epochs for the 15 million dataset: ~26 days
25 epochs for the 45 million dataset: ~78 days
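The two training times scale linearly with dataset size: three times the data took three times as long. The sketch below makes that arithmetic explicit. Linear scaling on unchanged hardware is an assumption drawn from these two data points only, and the function name is hypothetical.

```python
# Reference point taken from the slide: 25 epochs on 15 million images took ~26 days.
REFERENCE_IMAGES_MILLIONS = 15
REFERENCE_EPOCHS = 25
REFERENCE_DAYS = 26


def estimate_training_days(images_millions: float, epochs: int = 25) -> float:
    """Estimate training time assuming cost grows linearly with images x epochs."""
    days_per_million_per_epoch = REFERENCE_DAYS / (
        REFERENCE_IMAGES_MILLIONS * REFERENCE_EPOCHS
    )
    return days_per_million_per_epoch * images_millions * epochs


if __name__ == "__main__":
    # Reproduces the second figure on the slide (45 million images, 25 epochs).
    print(round(estimate_training_days(45), 1))
```

Under this assumption, training time quickly becomes impractical as the dataset grows, which motivates the hardware comparison that follows.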
CPU
Central Processing Unit
GPU
Graphics Processing Unit
TPU
Tensor Processing Unit
https://cloud.google.com/blog/products/ai-machine-learning/what-makes-tpus-fine-tuned-for-deep-learning
TPUs: Tensor Processing Units
• A Tensor Processing Unit (TPU) is an AI accelerator
application-specific integrated circuit (ASIC) developed by Google.
• It was developed specifically to train larger
deep learning models faster.
• Development started in 2013; available to the
public from 2018.
• Only available through Google Cloud
Platform.
• https://cloud.google.com/tpu
• https://en.wikipedia.org/wiki/Tensor_Processing_Unit
• https://cloud.google.com/blog/products/ai-machine-learning/what-makes-tpus-fine-tuned-for-deep-learning
• https://www.tensorflow.org/guide/tpu
GPU VS TPU training speed
Reference: Rajan et al. 2021, J Cheminform
• Why not GPUs?
Training time compared to an
Nvidia V100-Tesla GPU
• Single V3-8 TPU – 4x faster
• Single V4-8 TPU – 7x faster
• Single V5-8 TPU – 16x faster
=> Six months down to 11 days
[Figure: Time per epoch in hours (GPU vs TPU V3-8, TPU V4-8 and TPU V5-8)]
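The "six months down to 11 days" figure follows from dividing the GPU training time by the 16x speedup. A small worked sketch, assuming "six months" means roughly 180 days (that rounding is an assumption, not stated on the slide):

```python
# Assumption: "six months" on a single GPU is taken as about 180 days.
GPU_BASELINE_DAYS = 180

# Speedup factors relative to the GPU, as listed on the slide.
SPEEDUPS = {"TPU V3-8": 4, "TPU V4-8": 7, "TPU V5-8": 16}


def accelerated_days(baseline_days: float, speedup: float) -> float:
    """Training time after applying a constant speedup factor."""
    return baseline_days / speedup


if __name__ == "__main__":
    for device, factor in SPEEDUPS.items():
        print(f"{device}: {accelerated_days(GPU_BASELINE_DAYS, factor):.1f} days")
```

With these numbers the 16x case comes out at 11.25 days, consistent with the 11 days quoted.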
DECIMER V2
• 400 Million images plus corresponding SMILES fed into DNN.
• Images of chemical structure depictions were generated using
RanDepict.
• No assumptions about the underlying problem (no concept of bonds or
atoms, etc) included.
“Caffeine” depicted using CDK
CN1C=NC2=C1C(=O)N(C(=O)N2C)C
EfficientNet-V2 CNN + Transformer DNN
SMILES
This repository contains RanDepict, an easy-to-use utility to generate a big variety
of chemical structure depictions (random depiction styles and image
augmentations) based on
RDKit, CDK and Indigo.
RanDepict
Brinkhaus, H.O., Rajan, K., Zielesny, A. et al. RanDepict: Random chemical structure depiction generator.
J Cheminform 14, 31 (2022). https://doi.org/10.1186/s13321-022-00609-4
Distortion features controlled by fingerprints
OCSR Benchmark Datasets
USPTO: 5,719 images from the US Patent Office
UOB: 5,740 images from the University of Birmingham
CLEF: 992 images from the Conference and Labs of the Evaluation Forum test set
JPO: 450 images from the Japanese Patent Office
RanDepict250k: 250,000 chemical structure depictions generated with RanDepict
RanDepict250k_augmented: 250,000 depictions with additional augmentations, generated with RanDepict
DECIMER hand-drawn: 5,088 chemical structure images from the DECIMER hand-drawn dataset
Indigo: 50,000 images generated by Staker et al. using Indigo; all images have a resolution of 224 x 224 pixels
USPTO_big: 50,000 images from the USPTO, from Staker et al.; all images have a resolution of 224 x 224 pixels
Img2Mol test set: 25,000 depictions used by Clévert et al.; all images have a resolution of 224 x 224 pixels
Open-Source OCSR tools
Rule based:
• OSRA
• MolVec
• Imago
Deep-learning based:
• Img2Mol
• SwinOCSR
• MolScribe
OCSR tools performance on Clean Datasets
Tool        Average Tanimoto similarity   Proportion of identical predictions
OSRA        0.52                          35%
MolVec      0.58                          37%
Imago       0.45                          28%
Img2Mol     0.77                          30%
SwinOCSR    0.70                          14%
MolScribe   0.86                          57%
DECIMER     0.93                          65%
OCSR tool performance on Augmented Datasets
Clean Data Augmented Data
• xy-shearing factor randomly drawn from [−0.1, 0.1]
• rotation (randomly drawn from [−5°, 5°])
Reference: Clevert et al. 2021, Chemical Science
Bayer AG, Berlin
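The augmentation described above can be sketched as sampling a shear and a small rotation and combining them into one affine matrix. Only the ranges [−0.1, 0.1] and [−5°, 5°] come from the slide. Drawing separate shear factors for x and y, applying shear before rotation, and all function names are assumptions made for illustration.

```python
import math
import random


def sample_augmentation(rng: random.Random) -> tuple[float, float, float]:
    """Draw shear factors and a rotation angle from the ranges given on the slide."""
    shear_x = rng.uniform(-0.1, 0.1)
    shear_y = rng.uniform(-0.1, 0.1)  # assumption: x and y are drawn independently
    angle_deg = rng.uniform(-5.0, 5.0)
    return shear_x, shear_y, angle_deg


def affine_matrix(shear_x: float, shear_y: float, angle_deg: float) -> list[list[float]]:
    """2x2 matrix for 'shear first, then rotate' (the order is an assumption)."""
    theta = math.radians(angle_deg)
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    # Product of rotation [[cos, -sin], [sin, cos]] and shear [[1, shear_x], [shear_y, 1]]
    return [
        [cos_t - sin_t * shear_y, cos_t * shear_x - sin_t],
        [sin_t + cos_t * shear_y, sin_t * shear_x + cos_t],
    ]


def transform_point(matrix: list[list[float]], x: float, y: float) -> tuple[float, float]:
    """Apply the 2x2 matrix to a single pixel coordinate."""
    return (
        matrix[0][0] * x + matrix[0][1] * y,
        matrix[1][0] * x + matrix[1][1] * y,
    )


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the sketch is reproducible
    params = sample_augmentation(rng)
    print(params, transform_point(affine_matrix(*params), 100.0, 50.0))
```

In practice an image library would apply such a matrix to the whole image; the point here is only how small the perturbations are.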
OCSR tool performance on Augmented Datasets
Tool        Average Tanimoto similarity   Proportion of identical predictions
OSRA        0.47                          19%
MolVec      0.49                          23%
Imago       0.42                          22%
Img2Mol     0.71                          25%
SwinOCSR    0.64                          11%
MolScribe   0.85                          70%
DECIMER     0.96                          76%
decimer.ai
Mobile version
DECIMER – Hand drawn Images Dataset
Brinkhaus, H.O., Zielesny, A., Steinbeck, C. et al. DECIMER—hand-drawn molecule images dataset. J Cheminform 14, 36
(2022). https://doi.org/10.1186/s13321-022-00620-9
DECIMER Hand-Drawn Structures Dataset
6000 diverse molecules selected from PubChem
using RDKit’s implementation of the MaxMin
algorithm based on Morgan fingerprints.
Dataset of 5088 images and corresponding
SMILES
The dataset at ZENODO:
https://doi.org/10.5281/zenodo.6456306
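The MaxMin idea used for the diverse selection can be sketched in plain Python: repeatedly pick the molecule whose smallest distance to everything already picked is largest. This is an illustrative reimplementation over sets of fingerprint bits with Tanimoto distance, not RDKit's picker. The fixed starting index and the function names are assumptions.

```python
def tanimoto_distance(a: set[int], b: set[int]) -> float:
    """1 - Tanimoto similarity for two sets of fingerprint bits."""
    union = a | b
    if not union:
        return 0.0  # assumed convention: two empty fingerprints are identical
    return 1.0 - len(a & b) / len(union)


def maxmin_pick(fingerprints: list[set[int]], n_pick: int, first: int = 0) -> list[int]:
    """Return indices of up to n_pick mutually diverse fingerprints (MaxMin heuristic)."""
    if n_pick <= 0 or not fingerprints:
        return []
    picked = [first]
    # For every candidate, track its distance to the closest picked item.
    min_dist = [tanimoto_distance(fp, fingerprints[first]) for fp in fingerprints]
    while len(picked) < min(n_pick, len(fingerprints)):
        candidates = [i for i in range(len(fingerprints)) if i not in picked]
        next_idx = max(candidates, key=lambda i: min_dist[i])
        picked.append(next_idx)
        for i, fp in enumerate(fingerprints):
            min_dist[i] = min(min_dist[i], tanimoto_distance(fp, fingerprints[next_idx]))
    return picked


if __name__ == "__main__":
    # Toy "fingerprints"; real Morgan fingerprints would have many more bits.
    toy = [{1, 2, 3}, {1, 2, 4}, {7, 8, 9}, {1, 2, 3, 4}]
    print(maxmin_pick(toy, 3))
```

On the toy data the picker first jumps to the fingerprint sharing no bits with the seed, then to the next most dissimilar one, which is the behaviour wanted when selecting a diverse subset from a large collection.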
DECIMER – Image Transformer (2022)
OCSR tools performance on hand drawn images
Tool            Type         Average Tanimoto
OSRA 2.1        Rule based   0.01
Imago 2.0       Rule based   0.26
MolVec 0.9.8    Rule based   0.14
Img2Mol         DL-based     0.28
DECIMER V2.0    DL-based     0.71
RanDepict – Synthetic Hand Drawn Images
Training and Testing - DECIMER Image Transformer
Augmented Image
Non augmented image Augmented Image with
Markush structure
Non augmented image
with Markush structure
Synthetic Hand Drawn
structure
OCSR tools performance on hand drawn images
Tool                             Identical predictions   Tanimoto 1.0 count   Average Tanimoto
OSRA                             1%                      0.73%                0.17
MolVec                           1%                      1.61%                0.23
Imago                            3%                      3.50%                0.22
Img2Mol                          5%                      9.34%                0.52
MolScribe                        7.65%                   11.18%               0.59
SwinOCSR                         5%                      8.49%                0.64
DECIMER (No Hand Drawn)          27%                     32.90%               0.69
DECIMER (Synthetic Hand Drawn)   67.39%                  74.63%               0.93
DECIMER
Workflow
Rajan, K., et al. Nature
Communications 14, 5045
(2023).
Smiles TO iUpac Translator
Smiles TO iUpac Translator Version 2.0
STOUT: Smiles TO iUpac Translator
Data sources: ZINC15, PubChem
IUPAC names generated with OpenEye Lexichem
Total training data size: 998 million
Maximum length of SMILES: 600
Maximum length of IUPAC names: 700
Training infrastructure: TPU V4-256 chips on Google Cloud
STOUT: Smiles TO iUpac Translator
https://decimer.ai
Dear Both,
I know the AI world is now crazy. I asked Gemini what is DECIMER?
Results:
DECIMER (Deep Learning for Chemical Image Recognition) is an open-source platform designed to automatically identify, segment, and recognize
chemical structures within images from scientific publications. It utilizes deep learning models to extract chemical information from images and
convert it into machine-readable formats, such as SMILES.
DECIMER is a valuable tool in cheminformatics research, as it helps automate the process of extracting chemical data from literature, making it
more accessible for further analysis and research.
There are a few key components of the DECIMER project:
•DECIMER 1.0: The initial deep learning model for converting chemical structure images into SMILES.
•DECIMER-Segmentation: A tool for automatically extracting chemical structure depictions from scientific literature.
•DECIMER 2.0: An updated version of DECIMER that uses EfficientNet V2 and Transformer architectures, along with Google's TPUs, for faster
training on large datasets.
•DECIMER.ai: An open platform that combines all these tools, offering a user-friendly interface for chemical structure recognition tasks.
You can learn more about DECIMER through the following resources:
•DECIMER.ai Web Application: https://decimer.ai/
•DECIMER.ai Publication: https://www.nature.com/articles/s41467-023-40782-0
•DECIMER 1.0 Publication: https://jcheminf.biomedcentral.com/articles/10.1186/s13321-020-00469-w
I hope this information is helpful! Let me know if you have any other questions.
Afterwards, I asked it to write a song based on this
information. I took the lyrics and gave them to suno.ai;
using the new v3.5 model, I could create a 4-minute song
very easily.
https://youtu.be/LKipkW15Bts
Acknowledgments
