Workshop LLM Life Sciences ChemAI 231116.pptx

From Words to Wonders: Language Models for Life Sciences
Room L1.02
Robots Unleashed: The Rise of AI-Driven Chemical Discovery
Room L1.01
16 November 2023

f.grisoni@tue.nl
Learning the biochemical language with AI
A drug discovery tale
ChemAI workshop | Nov 17, 2023
F. Grisoni | Assistant Professor
Institute for Complex Molecular Systems (ICMS)
Department of Biomedical Engineering, Eindhoven University of Technology (TU/e)

3 | F. Grisoni |
The language of life
DNA, proteins, chemical signals
“The general goal of linguistics […] addresses the same problems facing molecular biologists.”¹
¹Bralley P (1996). An introduction to molecular linguistics. BioScience, 46, 146.

Deciphering the language of life
Image from ancestry.com
“I running am” vs. “I am running”
“I am running for president” vs. “I am running a marathon”
• Syntax: set of rules that dictate how sentences or expressions should be structured.
• Semantics: meaning conveyed by the elements and structures of a language.

[Figure: DNA, RNA and protein, with codons labelled. Image from ancestry.com]
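As a rough illustration of the syntax analogy, the sketch below reads a nucleic acid "sentence" one codon at a time. The example sequence and the function names are assumptions made for illustration and do not come from the talk; the codon table is deliberately partial and covers only the codons used here.

```python
# Illustrative sketch: sequence and function names are assumptions, not from the slides.
# Partial standard genetic code, limited to the codons in the example below.
CODON_TABLE = {
    "AUG": "M",  # methionine (start)
    "GCU": "A",  # alanine
    "AAA": "K",  # lysine
    "UGG": "W",  # tryptophan
    "UAA": "*",  # stop
}


def transcribe(dna):
    """Coding-strand DNA to mRNA: replace thymine (T) with uracil (U)."""
    return dna.upper().replace("T", "U")


def translate(mrna):
    """Read the mRNA three letters (one codon) at a time until a stop codon."""
    protein = []
    usable_length = len(mrna) - len(mrna) % 3  # ignore an incomplete trailing codon
    for i in range(0, usable_length, 3):
        residue = CODON_TABLE[mrna[i:i + 3]]  # KeyError for codons outside the partial table
        if residue == "*":
            break
        protein.append(residue)
    return "".join(protein)


dna = "ATGGCTAAATGGTAA"
mrna = transcribe(dna)
protein = translate(mrna)
print(mrna, protein)
```

Shifting the reading frame by one letter changes every codon, which is the biological counterpart of scrambling word order in a sentence.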

How can we learn the biomolecular language with AI?
What can we do with it?

Natural language processing

The vastness of the chemical universe
Chemical universe¹,²: 10⁶⁰
Known small-molecule drugs: 10⁴
Cells in a human body: 10¹³–10¹⁴
Stars in the Milky Way: 10⁸–10⁹
¹Ertl (2002) Journal of Chemical Information and Computer Sciences 43, 374.
²Walters et al. (1998). Drug Discovery Today 3, 160.

Chemical language models (CLMs)
[Figure: a molecular structure and its SMILES string, CC(C)CC1=CC=C(C=C1)C(C)C(=O)O, framed by the tokens G and E. Token sequence: G C C c C … c 1 E]
Recurrent neural network with long short-term memory¹
• “Syntax”
• “Semantics”
¹Hochreiter S, Schmidhuber J (1997). Neural computation 9, 1735.
Segler MH, Kogej T, Tyrchan C, Waller MP (2018). ACS Central Science 4, 120.
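To show what a chemical language model actually sees, here is a minimal sketch that splits the SMILES string from the slide into tokens and frames it with G and E. It assumes G and E act as start and end tokens, as the slide's framing suggests; the regular expression and function names are illustrative choices, not the authors' pipeline.

```python
import re

# Assumption: a simple regex tokenizer; real CLM pipelines may split SMILES differently.
# Bracket atoms, two-letter halogens and two-digit ring closures are kept as single tokens.
TOKEN_PATTERN = re.compile(r"\[[^\]]+\]|Br|Cl|%\d{2}|.")


def tokenize(smiles):
    """Split a SMILES string into tokens, framed by start (G) and end (E) tokens."""
    return ["G"] + TOKEN_PATTERN.findall(smiles) + ["E"]


def next_token_pairs(tokens):
    """(current token, next token) pairs: the targets of next-token prediction."""
    return list(zip(tokens[:-1], tokens[1:]))


smiles = "CC(C)CC1=CC=C(C=C1)C(C)C(=O)O"
tokens = tokenize(smiles)
vocabulary = sorted(set(tokens))
pairs = next_token_pairs(tokens)
print(tokens)
print(vocabulary)
```

A recurrent network trained on such pairs learns which token may follow which: ring-closure digits and parentheses must be matched ("syntax"), while the choice of atoms and bonds determines the molecule's properties ("semantics").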


Transfer learning
Pretraining: 300k bioactive molecules → generic model
Fine-tuning: 25 RXR and PPAR modulators → focused model
Valid ≥ 90%, novel ≥ 90%
Merk D, Friedrich L, Grisoni F, Schneider G (2018). Mol. Inf. 37, 1700153.
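The slide reports validity and novelty without defining them, so the sketch below uses one common convention: validity as the fraction of generated strings that parse, and novelty as the fraction of unique valid strings absent from the training set. The default `crude_syntax_check` is a hypothetical stand-in that only checks parentheses and single-digit ring closures; a real workflow would pass a chemistry-toolkit parser as `is_valid` and compare canonical rather than raw SMILES.

```python
def crude_syntax_check(smiles):
    """Crude stand-in for a real parser: balanced parentheses and
    paired single-digit ring closures. Not a chemical validity check."""
    depth = 0
    open_rings = set()
    for ch in smiles:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
        elif ch.isdigit():
            open_rings ^= {ch}  # first occurrence opens a ring, second closes it
    return bool(smiles) and depth == 0 and not open_rings


def generation_metrics(generated, training_set, is_valid=crude_syntax_check):
    """Fractions of valid, unique and novel strings among generated samples."""
    valid = [s for s in generated if is_valid(s)]
    unique = set(valid)
    novel = unique - set(training_set)
    return {
        "valid": len(valid) / len(generated),
        "unique": len(unique) / len(valid) if valid else 0.0,
        "novel": len(novel) / len(unique) if unique else 0.0,
    }


# Toy data, invented for illustration only.
generated = ["CCO", "CCO", "C1CC1", "C1CC", "CC(C"]
training = ["CCO"]
metrics = generation_metrics(generated, training)
print(metrics)
```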

Dual modulators of nuclear receptors
ID  RXRα       RXRβ     RXRγ       PPARα    PPARγ     PPARδ
1   0.13±0.01  1.1±0.3  0.06±0.02  -        2.3±0.2   -
2   13.0±0.1   9±2      8.0±0.7    -        2.8±0.3   -
3   -          -        -          4.0±1.0  10.1±0.3  -
4   -          -        -          -        9±3       14±2
5   -          -        -          -        -         -
EC50 (µM), n=4; hybrid reporter gene assay, HEK293T cells.
[Figure: structures of compounds 1–5]
Merk D, Friedrich L, Grisoni F, Schneider G (2018). Mol. Inf. 37, 1700153.

Applications of chemical language modelling
• Bidirectional molecule generation¹: E....O)CCCGC=C(C....E
• Automated design-make-test²
• Natural product-inspired design³
¹Grisoni F, Moret M, Lingwood R, Schneider G (2020). J. Chem. Inf. Mod. 60, 1175.
²Grisoni F, Huisman BH, Button AL, et al. (2021). Science Advances 7, eabg3338.
³Moret M, Helmstädter M, Grisoni F et al. (2021). Angewandte Chemie 60, 19477.
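The bidirectional example on this slide has E at both ends and G in the interior, which suggests a sequence grown outward from a central start token. The sketch below illustrates that reading only: `propose_token` is a hypothetical stand-in for a trained model, the scripted tokens are invented, and strict left/right alternation is an assumption rather than the published sampling scheme.

```python
from itertools import cycle


def grow_bidirectionally(propose_token, max_steps=100):
    """Grow a token sequence outward from a central 'G', alternating sides,
    until 'E' closes both ends or max_steps is reached."""
    sequence = ["G"]
    for _, side in zip(range(max_steps), cycle(["right", "left"])):
        left_done = sequence[0] == "E"
        right_done = sequence[-1] == "E"
        if left_done and right_done:
            break
        if side == "right" and not right_done:
            sequence.append(propose_token(sequence, "right"))
        elif side == "left" and not left_done:
            sequence.insert(0, propose_token(sequence, "left"))
    return "".join(sequence)


# Scripted stand-in for a trained model (hypothetical): fixed tokens per side.
scripts = {"right": iter(["C", "C", "O", "E"]), "left": iter(["C", "E"])}


def scripted_model(sequence, side):
    return next(scripts[side])


grown = grow_bidirectionally(scripted_model)
molecule = grown.replace("G", "").replace("E", "")
print(grown, molecule)
```

Once one side has emitted E, the loop keeps extending only the other side, so the two ends may terminate at different steps.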

‘One-shot’ de novo design of Nurr1 agonists
Ballarotto M et al. (2023). J. Med. Chem. 66, 12.
300k bioactive molecules
Generic model Potent agonist
Weak agonists
EC50 = 0.07±0.02 µM EC50 = 2.1±0.6 µM
2 novel Nurr1 agonists
D. Merk (@LMU)

From language processing to chemistry and back
ELECTRA pretraining
Corruption: CN1CC=CC1=O → CC1CC=CC1=F
[Figure: structures of compounds 18 and 22]
Repression of PI3K-AKT signalling in tumour cells
M. Moret (@ETH)
Moret M, Pachon I, Cotos L et al. (2023). Nat. Comms 14, 114.
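ELECTRA-style pretraining trains a discriminator to decide, token by token, whether an input token was replaced. The sketch below derives those per-token targets for the corruption example on this slide (CN1CC=CC1=O corrupted to CC1CC=CC1=F). Character-level tokens and the function name are assumptions made for illustration; how the cited work tokenizes and samples replacements is not shown on the slide.

```python
def replaced_token_labels(original, corrupted):
    """Per-token targets for replaced-token detection: 1 if replaced, else 0."""
    if len(original) != len(corrupted):
        raise ValueError("original and corrupted sequences must have equal length")
    return [int(a != b) for a, b in zip(original, corrupted)]


# Assumption: one character per token, which holds for this particular example.
original = list("CN1CC=CC1=O")
corrupted = list("CC1CC=CC1=F")

labels = replaced_token_labels(original, corrupted)
replaced_positions = [i for i, flag in enumerate(labels) if flag]
print(labels)
print(replaced_positions)
```

Because every token receives a label, the discriminator gets a training signal from the whole sequence rather than only from the corrupted positions.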

S4 for de novo drug design
Structured State-Space Sequence (S4) models¹
R. Özçelik, S. de Ruiter
IPM Colloquium 2023
Özçelik R, de Ruiter S, Grisoni F (2023). ChemRxiv.
¹Gu A, Goel K and Re C (2022). ICLR.
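S4 layers build on a linear state-space model, which can be evaluated either step by step as a recurrence or in one pass as a convolution. The sketch below shows that equivalence for a toy scalar state; the parameter values are arbitrary assumptions, and the structured state matrix, discretisation and learned parameters of actual S4 are omitted.

```python
def ssm_recurrent(a, b, c, inputs):
    """Recurrent view: x_k = a * x_{k-1} + b * u_k, y_k = c * x_k."""
    x = 0.0
    outputs = []
    for u in inputs:
        x = a * x + b * u
        outputs.append(c * x)
    return outputs


def ssm_kernel(a, b, c, length):
    """Convolution kernel K_j = c * a**j * b for j = 0 .. length - 1."""
    return [c * (a ** j) * b for j in range(length)]


def ssm_convolutional(a, b, c, inputs):
    """Convolutional view: y_k = sum_j K_j * u_{k-j}."""
    kernel = ssm_kernel(a, b, c, len(inputs))
    return [
        sum(kernel[j] * inputs[k - j] for j in range(k + 1))
        for k in range(len(inputs))
    ]


# Arbitrary toy parameters and input, chosen for illustration.
a, b, c = 0.5, 1.0, 2.0
signal = [1.0, 0.0, 0.0, 1.0]

y_recurrent = ssm_recurrent(a, b, c, signal)
y_convolutional = ssm_convolutional(a, b, c, signal)
print(y_recurrent, y_convolutional)
```

The recurrent form suits token-by-token generation, while the convolutional form allows whole sequences to be processed in parallel during training.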

Other biomolecular languages
Small molecules Peptides and proteins
Syntax
Semantics
“I like pears and apples, I do not like oranges”
“I pears, I oranges and like do apples like not”
Alphabetic syntax Symbolic syntax
CC(C)CC1=CC=C(C=C1)C(C)C(=O)O LTKAKLKILNCLHDG

Generating anticancer peptides (ACPs)¹
ChemMedChem 13, 2018 - Front Cover
300,000 bioactive molecules
25 in-house ACPs
Pre-training, generic model, focused model
100k virtual peptides
1000 sequences (12)
¹Grisoni F, Neuhaus CS, Gabernet G et al. (2018). ChemMedChem 13, 1300.

In vitro activity on cancer cells (MCF7) and human erythrocytes¹
ID  Sequence                EC50 [µM]  HC50 [µM]
1   KLWKKIEKLIKKLLTSIR      47±3       236±13
2   YIWARAERVWLWWGKFLSL     56±3       -
3   ELAKKLTKLKRQLHRIW       -          -
4   DLFKQLQRLFLGILYCLYKIW   47±4       132±16
5   KLIDQWKKVLYHVE          -          -
6   AIKKFGPLAKIVAKV         95±4       -
7   RWNGRIIKGFYNLVKIWKDLKG  42±4       89±6
8   KVWKIKKNIRRLLHGIKRGWKG  34±4       -
9   GFWARIGKVFAAVKNL        101±4      -
10  AFLYRLTRQIRPWWRWLYKW    45.5±0.8   34±5
11  RIWGKHSRYIKIVKRLIQ      50±10      -
12  QIWHKIRKLWQIIKDGF       16.1±0.3   23±5
¹Grisoni F, Neuhaus CS, Gabernet G et al. (2018). ChemMedChem 13, 1300.

Y. Nana Teukam
Enzyme design

Acknowledgements
f.grisoni@tue.nl
Rıza Özçelik
Sarah de Ruiter
Yves Nana Teukam (w/ IBM)
Derek van Tilborg
Emanuele Criscuolo
Helena Brinkmann
Luke Rossen
Cristina Izquierdo (w/ Albertazzi)
Laura van Weesep
Inge Groffen
Gisbert Schneider
Michael Moret
Lukas Friedrich
Berend H. Huisman
Daniel Merk
Moritz Helmstädter
Marco Ballarotto
Matteo Manica
Teodoro Laino
@fra_grisoni
@molecularML

NVIDIA BioNeMo
Foundry to Build Generative AI for Drug Discovery
Dr. David Ruau, Head of Strategic Alliances Drug Discovery, EMEA
ChemAI, Nov 16

NVIDIA AI Foundations
Cloud Services to Create and Run Custom Generative AI Models
NeMo BioNeMo Picasso
NVIDIA AI Foundations
NVIDIA DGX Cloud
NVIDIA AI Enterprise

Each Enterprise Needs Its Own AI
As-a-Service | Public Cloud | Private Cloud | Edge
• Operationalize and Inference at Scale
• Train New Model with Your Data
• Optimize a Model You’ve Already Trained
• Customize a Foundation Model with Your Data

NVIDIA Clara
$1.5T Industry | $500B R&D Spend | 10+ Years to Bring a Drug to Market
FLARE: Federated Learning
MONAI: Imaging AI
PARABRICKS: Genomics
BIONEMO: Biology Gen AI & LLMs
NEMO: Generative AI & LLMs
TARGET | LEAD | OPTIMIZE | PRE-CLINICAL | CLINICAL | COMMERCIAL
NVIDIA DGX Cloud
Chips, Systems, Networking, Data Center Scale
Pre-Trained Models
Accelerated Training
Optimized Inference
Cloud Services & APIs
NVIDIA AI
Frameworks, Infrastructure, SDKs, Toolkits, Libraries

Generative AI is Turning Biology From Science to Engineering
Explosion of Biomolecular Gen AI Research | Joint NVIDIA Biomolecular Gen AI Research
CONTROLLED GENERATION | GENERATE FUNCTIONAL PROTEINS | GENERATE MOLECULES | PREDICT GENE EXPRESSION | PREDICT COMPLEX STRUCTURES | PREDICT VIRUS EVOLUTION
[Figure: AI biology arXiv papers per year, 2012–2022 (y-axis 0–1000), annotated with ESM, AlphaFold (CASP13), AlphaFold2 (CASP14), ESM2, EquiFold, DiffDock, OpenFold, ProteinMPNN, ProtGPT2, DNABERT, …]
Source: arXiv.org Q-bio: AI, ML, DL, NN

Generative AI Accelerates Early Drug Discovery
3 Years Faster | 100s of Millions Cheaper
Early Discovery: TARGET | LEAD | OPTIMIZATION
Traditional Early Discovery: ~$500M, 4.5 Yrs
Generative AI Early Discovery: $2M, 1.5 Yrs
3x Faster | 200x Cheaper
[Figure: AI biology arXiv papers per year, 2012–2022 (y-axis 0–1000), annotated with ESM, AlphaFold (CASP13), AlphaFold2 (CASP14), ESM2, EquiFold, DiffDock, GenSLMs, ProteinMPNN, DNABERT, …]
Source: arXiv.org Q-bio: AI, ML, DL, NN

BioNeMo is a Cloud Managed Service
Customize and Run Generative AI for Computer Aided Drug Discovery
Your Data | Your Model | Inference | Your App
Train | Fine-Tune | Optimize
BioNeMo Pre-Trained Models: AlphaFold2, OpenFold, ESMFold, MoFlow, MegaMolBART, DiffDock, ESM1nv, ESM2, ProtGPT2
NVIDIA DGX Cloud

9 SOTA Models are Optimized for Drug Discovery Applications
Quick and effortless path to scale, speed, and experimentation
• Protein Generation: ProtGPT2 (Sequence Generation → Amino Acid Sequence)
• Protein Structure Prediction: ESMFold, OpenFold, AlphaFold2 (Amino Acid Sequence → Protein Structure)
• Protein Learned Sequence & Structure: ESM1nv, ESM2 (Amino Acid Sequence → Learned Embeddings)
• Molecular Docking: DiffDock (Structures → Docked Structures)
• Molecular Generation: MoFlow, MegaMolBART (Molecule Generation → Molecule SMILES)
• Molecular Learned Representation: MegaMolBART (Molecule SMILES → Learned Embeddings)
NVIDIA DGX Cloud

BioNeMo Inference Service in EA2
A suite of AI computer aided drug discovery models | Optimized for scale, speed and cost
NVIDIA BioNeMo API Endpoints | NVIDIA BioNeMo Web Interface | NVIDIA DGX Cloud
• Easy API Integration: Streamline application development and eliminate management of infrastructure with easy-to-use API endpoints.
• Interactive Web Experimentation: Instantly bring your own data to experience the precision and speed of Gen AI for drug discovery applications.
• SOTA Models: Suite of state-of-the-art generative models across the drug discovery process from initial design to lead optimization.
• Optimized Model Deployment: Designed for scale and optimized for the quickest inference time, reducing deployment costs.
Models: AlphaFold, OpenFold, ESMFold, DiffDock, ProtGPT2, MegaMolBART, MoFlow, ESM2, ESM1nv
Structure Prediction | Pose Prediction | Biomolecular Generation | Property Prediction
Target Discovery | Lead Discovery | Virtual Screening | Lead Optimization

Models Accessible on an Easy-to-Use Graphic User Interface
Interactive Inference | Visualization | Experimentation
ESMFold: 3D Protein Structure Prediction

MegaMolBART: Molecular Generation

DiffDock: Molecular Docking

BioNeMo Training Service in Beta
Fast and easy Gen AI training for drug discovery | Unleash drug discovery data potential
Data Loaders (SMILES, Proteins) | Pre-Training | Fine-Tuning | Advanced Monitoring
Foundation Model | Customized Model | Task Specific Model | AI DRUG DISCOVERY APPLICATIONS
BioNeMo Pre-Trained Models: OpenFold, MMB, ESM1, ProT5
NVIDIA DGX Cloud | NVIDIA DGX On-Prem | NVIDIA BASE COMMAND PLATFORM
• Flexible Training Workflows: Workflows to support large-scale pre-training from scratch, pre-trained model fine-tuning and task-tuning on your own data.
• Enterprise Support: NVIDIA AI Enterprise and experts by your side to keep projects on track.
• Simple Data Loading: Automatic download and preprocessing of UniRef (proteins) and ZINC (molecules); supports SMILES and protein sequence data loading.
• Optimized Scaling Recipes: Accelerated training throughput with model and data parallel training across 1,000s of nodes.

Customers Accelerating Drug Discovery
BioNeMo helping to customize and run generative AI for drug discovery
InstaDeep Nucleotide Transformer
• 500M -> 2.5B Parameter Model
• SOTA 15 of 18 Benchmarks
• 175B Nucleotide Multi Species Sequences
• Supercomputing Scale - 16 DGX Cloud
Evozyne ProT-VAE
• BioNeMo Training Service
• Protein Transformer Variational Autoencoder
• Functional Protein Design
• Experimentally Validated
Amgen BioNeMo DGX Cloud
• 5 Proprietary Antibody Language Models
• 3x Speedup – 3 Months to 4 Weeks
• Up to 100x Post Training Analysis
• Optimized OpenFold Service, 20x per Prediction

Generative AI Speeds Biologics Drug Discovery
Challenge
• Traditional biologics discovery is a costly process, and sparse data make predictions even more challenging.
• Amgen wanted to accelerate biologics discovery by using AI models to propose and evaluate designs for candidate drugs.
• Required powerful multi-node infrastructure to accelerate training of large protein models with extensive data.
Solution
• Trained large language models (LLMs) on Amgen’s proprietary data to help predict properties of proteins and develop biologics with enhanced properties.
• Leveraged NVIDIA DGX Cloud and BioNeMo for training and fine-tuning of protein LLMs and NVIDIA RAPIDS for faster post-training analysis.
• BioNeMo on DGX Cloud, a turnkey solution, enabled Amgen to get up and running quickly, moving from initial login to training large models in just a few days.
20 sec/structure: faster protein structure prediction
100x: faster post-training analysis
<1 month: from onboarding to first pretrained protein LLM
“Ease of multi-node training and the ability to use larger batch sizes within DGX Cloud enabled us to achieve our three-month objectives in just four weeks.”
- Chris James Langmead, Director of Digital Biologics Discovery, Amgen
NVIDIA DGX Cloud: AI-training-as-a-service solution
NVIDIA Base Command Platform: for workflow management
NVIDIA AI Enterprise: RAPIDS for data post-processing
NVIDIA BioNeMo: for training and inferencing

Next Steps
• Try MegaMolBART on NVIDIA LaunchPad
• Sign up for Early Access for BioNeMo
• Register for a no-cost, 2-week POC on NVIDIA DGX SuperCloud
Contact your account representative
