Jérémy Grignard, Data & Research Scientist, Servier
The data we work with come from a variety of scientific fields, such as omics, structural, cellular, chemical and phenotypic sciences, and correspond to heterogeneous pharmaco-biological concepts. We are developing the Pegasus knowledge graph which, beyond capitalizing on currently available data, aims to explore the complex environment of therapeutic targets, identify relevant screening modalities and design new experiments.
SERVIER Pegasus - Knowledge graph for the early phases of new drug research
1. Neo4j - GraphSummit Paris - 08/06/2023
Pegasus – Knowledge Graph
For The Early Drug Discovery
Jeremy Grignard, PhD
Research & Data Scientist
Institut de Recherches Servier
2. Discovering A New Drug: A Long, Expensive And Risky Process
Research (Discovery, Preclinical) → Clinical
Stages: Exploratory → Target Identification → Screening → Phase I → Phase II → Phase III
10^6 perturbators → 10^1 perturbators → Candidate
10-15 years - 2 billion Euros - High failure rate
Formulation of a causal hypothesis between a target and a disease
Strategic Objective
Obtain a marketing authorization (MA) for a chemical or biological entity / combination of entities every 3 years
How to improve the early drug discovery phases in order to increase the success
rate of drug candidates in clinical phases?
3. Our Mission - Data Sciences & Data Management
Efficiently guide early research projects using computational methods supported
by experimental capabilities
We rely on 3 interconnected activities:
• High throughput design of efficient perturbators
• Explainable project-oriented selection of relevant profiled perturbators
• Analysis of large, heterogeneous datasets and models to ensure target tractability and support rational decision making
4 main interconnected areas of expertise: Profiling, Systems Biology, Sequence Designs, Knowledge Graph
4. Useful Data Sources For Therapeutic Projects And Our Activities
Large data, heterogeneous pharmaco-biological concepts
• Genes, transcripts, proteins
• Ontologies (genomics, phenotypic)
• Static Maps
• Diseases
• In Vivo / In Vitro models
• Perturbators
- Chemicals (small compounds, PROTACs, probes)
- Target (shRNA, CRISPR, siRNA, overexpression)
- Antisense oligonucleotides (ASOs)
- Antibodies
• Fingerprints
How to capitalize on and link heterogeneous data to bring value to therapeutic projects and support decision making?
5. Pegasus – Knowledge Graph For The Early Drug Discovery
Rationale & Context
• Integration of heterogeneous data
- Labeled property graph with Neo4j
• 46,371,784 entities – 66 labels
• 331,570,883 relations – 14 types
• Flexible data model
- Model evolved over time
- Can be easily changed to accommodate new requests
• Efficient data preparation (hours), import (minutes), storage and query
• By and for Servier
Data Model
Pegasus is built to answer questions and give valuable insights extremely quickly
Answering a question is like traversing paths in a graph
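The idea that answering a question amounts to traversing paths can be sketched in a few lines of Python. This is an illustrative sketch only, not Pegasus code: the toy graph, the node names and the helper `find_paths` are hypothetical. Only the labels and relation types (Gene, ActivityChemical, ON_TARGETS, HAS_ACTIVITIES, HAS_SIMILARITIES) are borrowed from the Cypher queries shown in this deck.

```python
from collections import namedtuple

# A relation in a toy labeled property graph: (source, type, target).
Edge = namedtuple("Edge", ["source", "rel_type", "target"])

# Hypothetical nodes: name -> label.
NODES = {
    "gene_X": "Gene",
    "activity_1": "ActivityChemical",
    "chem_A": "ChemicalChembl",
    "chem_B": "ChemicalServier",
}

# Hypothetical relations between those nodes.
EDGES = [
    Edge("activity_1", "ON_TARGETS", "gene_X"),
    Edge("chem_A", "HAS_ACTIVITIES", "activity_1"),
    Edge("chem_A", "HAS_SIMILARITIES", "chem_B"),
]


def neighbours(node, rel_type):
    """Yield nodes linked to `node` by `rel_type`, ignoring direction."""
    for edge in EDGES:
        if edge.rel_type != rel_type:
            continue
        if edge.source == node:
            yield edge.target
        elif edge.target == node:
            yield edge.source


def find_paths(start, pattern):
    """Return every path from `start` that follows `pattern`.

    `pattern` is a list of (relation type, expected label) steps,
    similar in spirit to an undirected Cypher MATCH pattern.
    """
    paths = [[start]]
    for rel_type, label in pattern:
        extended = []
        for path in paths:
            for nxt in neighbours(path[-1], rel_type):
                if NODES[nxt] == label and nxt not in path:
                    extended.append(path + [nxt])
        paths = extended
    return paths


# The "question": which Servier chemicals resemble chemicals active on gene_X?
question = [
    ("ON_TARGETS", "ActivityChemical"),
    ("HAS_ACTIVITIES", "ChemicalChembl"),
    ("HAS_SIMILARITIES", "ChemicalServier"),
]
paths = find_paths("gene_X", question)
print(paths)
```

Each step of the pattern narrows the set of partial paths, which is why a question phrased as a chain of relations maps naturally onto a graph query.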
6. Pegasus – Knowledge Graph For The Early Drug Discovery
Data → Information → Knowledge
Raw Data → (Preparation) → Primary Data → (Aggregation) → Aggregated Data → PEGASUS → Added Value
Process and exposure: Target ID Card, DFSL, ASOs design, Phenotypic screening deconvolution, Action Models, Prediction Models, uORF
Applications and outcomes: reports, compound (Cpds) list, list of targets, list of ASOs, process, informed compounds library
7. ASOs (Antisense Oligonucleotides) Design
Characterization Of ASOs Off-target Effects
Given the ASOs designed for a target (100,000), quickly prioritize the ones to screen (784)
MATCH path=(:Asos)-[:HAS_ACTIVITIES]-(:ActivityAsos)-[:ON_TARGETS]-(:Transcript)-
[:HAS_REFERENCES]-(:Transcript)-[:TRANSCRIPTS]-(:Gene)-[:HAS_REFERENCES]-(:Gene)-
[:HAS_ACTIVITY]-(:ActivityGeneEssential)-[:ON_TARGETS]-(:Disease) RETURN path;
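The prioritization described above can be read as a filter: discard ASOs whose predicted off-target transcripts belong to genes flagged as essential, then rank the rest. The Python sketch below is a hedged illustration and not the Pegasus implementation. The data, the identifiers and the function `prioritize_asos` are hypothetical, and ranking by number of off-targets is an assumption.

```python
# Hypothetical ASO -> transcripts it is predicted to hit (on- and off-target).
ASO_TRANSCRIPTS = {
    "aso_1": {"tx_target", "tx_a"},
    "aso_2": {"tx_target"},
    "aso_3": {"tx_target", "tx_b"},
}
# Hypothetical transcript -> gene mapping.
TRANSCRIPT_GENE = {"tx_target": "gene_T", "tx_a": "gene_A", "tx_b": "gene_B"}
# Hypothetical set of genes annotated as essential.
ESSENTIAL_GENES = {"gene_A"}


def prioritize_asos(aso_transcripts, transcript_gene, essential, target_gene):
    """Keep ASOs with no essential off-target gene, fewest off-targets first."""
    kept = []
    for aso, transcripts in aso_transcripts.items():
        genes = {transcript_gene[t] for t in transcripts}
        off_targets = genes - {target_gene}
        if off_targets & essential:
            continue  # an essential gene would be hit: discard this ASO
        kept.append((len(off_targets), aso))
    return [aso for _, aso in sorted(kept)]


ranked = prioritize_asos(ASO_TRANSCRIPTS, TRANSCRIPT_GENE, ESSENTIAL_GENES, "gene_T")
print(ranked)
```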
Industrial Applications
ASOs designed, active and non-toxic:
• X and Y (preclinical)
• A, B and C (LO)
• O and P (Research)
Note for X:
• Treating a baby with epileptic encephalopathy
• 6,143 designed ASOs
• 2,073 → 372 essential genes
• 1,344 → 400 developmental genes
• 784 screened ASOs (mouse, neurons)
• 13 active ASOs
• 8/13 no off-target
• 3/13 best activities
8. Automated Focused Library Design
Entities: GENE, TRANSCRIPT, PROTEIN
Relations: HAS REFERENCE, TRANSCRIPT IN, TRANSLATE IN, HAS HOMOLOGY
Protein Embeddings
• Encode functional and structural properties of proteins using language models (LMs)
Auto-encoder
10. Automated Focused Library Design
Entities: GENE, TRANSCRIPT, PROTEIN, CONTEXT, PERTURBATOR
Relation: HAS PREDICTED ACTIVITY
Historical activity database
Multitask learning
Federated deep learning:
- Unbiased
- Sequence based
- Unlimited number of molecules
- Very limited number of targets
11. Automated Focused Library Design
Entities: GENE, TRANSCRIPT, PROTEIN, CONTEXT, PERTURBATOR
Relation: CHEMICALLY SIMILAR
Fingerprints + metric (Tanimoto) + threshold
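The recipe "fingerprints + metric (Tanimoto) + threshold" can be sketched as follows. This is a minimal illustration assuming fingerprints are represented as sets of on-bit indices. The compound names, the fingerprints, the 0.5 threshold and the helper `similar_pairs` are hypothetical, and a real pipeline would typically rely on a cheminformatics toolkit to compute the fingerprints.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient between two fingerprints given as sets of on-bits."""
    if not fp_a and not fp_b:
        return 0.0  # convention chosen here for two empty fingerprints
    shared = len(fp_a & fp_b)
    return shared / (len(fp_a) + len(fp_b) - shared)


def similar_pairs(fingerprints, threshold):
    """Return (name_a, name_b, score) for pairs scoring above `threshold`."""
    names = sorted(fingerprints)
    pairs = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            score = tanimoto(fingerprints[a], fingerprints[b])
            if score > threshold:
                pairs.append((a, b, score))
    return pairs


# Hypothetical fingerprints (indices of bits set to 1).
FINGERPRINTS = {
    "cpd_1": {1, 2, 3, 4},
    "cpd_2": {1, 2, 3, 5},
    "cpd_3": {7, 8, 9},
}
edges = similar_pairs(FINGERPRINTS, threshold=0.5)
print(edges)
```

Each returned triple corresponds to one candidate CHEMICALLY SIMILAR relation, with the score stored as a property of that relation.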
12. Automated Focused Library Design
Entities: GENE, TRANSCRIPT, PROTEIN, CONTEXT, PERTURBATOR
Relations: CHEMICALLY SIMILAR, PHENOTYPICALLY SIMILAR
Phenotypic: Cell Painting + threshold
• Biology
• Polypharmacology
• Perturbation agnostic: bridge the gap between Chemistry & Biology
• Medium throughput
Auto-encoder
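A phenotypic similarity with a threshold can be sketched in the same spirit. The slide only states that Cell Painting profiles are compared against a threshold, so the choice of cosine similarity here is an assumption, as are the profile values, the names and the helper `phenotypically_similar`. The example mixes a compound and a CRISPR perturbation to illustrate the perturbation-agnostic aspect.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equally sized numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def phenotypically_similar(profiles, query, threshold):
    """Names of perturbators whose profile is close to that of `query`."""
    reference = profiles[query]
    return sorted(
        name
        for name, profile in profiles.items()
        if name != query and cosine_similarity(reference, profile) >= threshold
    )


# Hypothetical morphological profiles (e.g. latent features from an auto-encoder).
PROFILES = {
    "compound_1": [1.0, 0.0, 2.0],
    "crispr_geneA": [2.0, 0.0, 4.0],  # same direction as compound_1
    "compound_2": [0.0, 3.0, 0.0],    # orthogonal profile
}
hits = phenotypically_similar(PROFILES, "compound_1", threshold=0.9)
print(hits)
```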
13. Automated Focused Library Design
Entities: GENE, TRANSCRIPT, PROTEIN, CONTEXT, PERTURBATOR
Relations: CHEMICALLY SIMILAR, PHENOTYPICALLY SIMILAR, TRANSCRIPTOMICALLY SIMILAR
Transcriptomic: L1000 + cQuery
• Biology
• Polypharmacology
• Perturbation agnostic: bridge the gap between Chemistry & Biology
• Medium throughput
14. Automated Focused Library Design: Filtering out compounds / Off Targets
Entities: GENE, TRANSCRIPT, PROTEIN, CONTEXT, PERTURBATOR
Figure labels: CHEMICALLY DEFINED AS PAINS, ESSENTIAL GENE, OFF TARGET (×3), LOW SPECIFICITY
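The filtering step on this slide (PAINS, off-targets on essential genes, low specificity) can be illustrated with a small Python sketch. Everything below is hypothetical: the `Perturbator` class, the `max_off_targets` cut-off used as a stand-in for "low specificity", and the example data are assumptions made for illustration, not the rules actually applied in Pegasus.

```python
from dataclasses import dataclass, field


@dataclass
class Perturbator:
    name: str
    is_pains: bool = False                     # flagged as a PAINS compound
    targets: set = field(default_factory=set)  # genes with known activity


def filter_library(candidates, target, essential_genes, max_off_targets):
    """Drop PAINS, essential-gene off-targets and low-specificity compounds."""
    kept = []
    for p in candidates:
        off_targets = p.targets - {target}
        if p.is_pains:
            continue
        if off_targets & essential_genes:
            continue
        if len(off_targets) > max_off_targets:
            continue  # too many off-targets: treated as low specificity
        kept.append(p.name)
    return kept


# Hypothetical candidate compounds for the target "gene_T".
CANDIDATES = [
    Perturbator("cpd_1", targets={"gene_T"}),
    Perturbator("cpd_2", is_pains=True, targets={"gene_T"}),
    Perturbator("cpd_3", targets={"gene_T", "gene_E"}),
    Perturbator("cpd_4", targets={"gene_T", "g1", "g2", "g3"}),
]
library = filter_library(CANDIDATES, "gene_T", {"gene_E"}, max_off_targets=2)
print(library)
```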
15. Automated Focused Library Design: Filtering out compounds
Entities: GENE, TRANSCRIPT, PROTEIN, CONTEXT, PERTURBATOR
Figure labels: COMMERCIALLY UNAVAILABLE, COMMERCIALLY AVAILABLE, INTERNALLY AVAILABLE
18. Automated Focused Library Design: Moving away from the target of interest (cellular assay)
Entities: GENE, TRANSCRIPT, PROTEIN, CONTEXT, DISCARDED, PATHWAY, PERTURBATOR
Figure labels: PPI, Pathway, PART OF
19. Automated Focused Library Design: Moving away from the target of interest (cellular assay)
Entities: GENE, TRANSCRIPT, PROTEIN, CONTEXT, DISCARDED, PATHWAY, DISEASE, PERTURBATOR
Figure labels: INVOLVED IN, Disease
29. Automated Focused Library Design: Modality Expansion
Entities: GENE, TRANSCRIPT, PROTEIN, CONTEXT, HITS, DISCARDED, PATHWAY, DISEASE, CRISPR, shRNA, cDNA, ASOs
Figure labels: PREDICTED ACTIVE
30. Automated Focused Library Design
Simple Overview With Results
Given any target, identify compounds having activity, their analogues and specificity
MATCH path=(g:Gene)-[:HAS_REFERENCES]-(:Gene)-[:ON_TARGETS]-(:ActivityChemical)-
[:HAS_ACTIVITIES]-(:ChemicalChembl)-[r:HAS_SIMILARITIES]-(:ChemicalServier)
WHERE g.geneId = 'target X' AND r.tanimoto_similarity > 0.6 RETURN path;
Success Stories
Target X:
• Identification of 984 Servier compounds chemically similar to compounds having an activity
• MST validation: 15% hit rate
Target Y:
• From the project leader: 4 potential reference compounds
• Identification of 47 chemically similar compounds
• Using annotations in Pegasus → completely non-specific profile of the reference compounds
32. Therapeutic Target Environment Exploration (ID card)
Pegasus Application
Given any target, present a target ID card
Success Stories
For many targets: A, B, C, D, E, F, G, H, I, J, K, L, M, N, O, P, Q, R, S, T, U, V, W, X, Y, Z
- Gene / transcript / protein identifiers, cross-references, isoforms
- Biological processes, cellular localization
- Protein half-lives
- Gene essentiality
- In Vivo models
- Gene / Disease associations
- SNPs
- Pathways
- Perturbators
33. Conclusion
Knowledge graphs are well suited for integrating
• Large amounts of heterogeneous and sparse data (as typically seen in pharmaceutical research)
This data structure is guiding our therapeutic projects on various aspects
• And allows us to seamlessly integrate exploratory cutting-edge AI approaches
Systematic generation of data (in silico & in vitro)
• Is a requirement to properly feed those databases and generate knowledge from them
Communication
• Neo4j Health Care & Life Sciences Workshop; Symposium Servier – AI to New Drug Development; Servier Corporate Strategy & Executive Director; France Culture « La Méthode Scientifique »…
34. Perspective
"Objects are characterized by the way they interact. If an object has no interactions, influences nothing, acts on nothing, emits no light, does not attract, does not repel, cannot be touched, feels nothing, etc., it is as if it did not exist.
To speak of objects that never interact is to speak of things that, even if they existed, would not concern us.
We do not even understand what saying that such things 'exist' could mean.
The world we know, which touches us, which interests us, what we call 'reality', is the vast network of interacting entities that manifest themselves to one another by interacting, and of which we are part.
It is this network (Pegasus) that we are interested in."
Helgoland by Carlo Rovelli
37. JUMP-CP: Joint Undertaking In Morphological Profiling, Cell Painting
Phenotypic Fingerprint
High-Content Screening - Cell Painting Fingerprint Analysis
Compounds, CRISPR → Fingerprints (morphological descriptors)
Phenotypic fingerprints induced by compounds screened in HCS show various phenotypic responses and reveal dose-response effects with different mechanisms of action
Figure labels: negative control and no-effect compounds; dose response
38. Target Expansion
Protein Fingerprint
Protein Embeddings
• Encode functional and structural properties of proteins using language models (LMs)
Fingerprint Analysis
X, Y, Z, A, B, C
Protein fingerprints (embeddings) seem to carry sequence and domain function information
Ongoing work: interpretability + sub-domain similarity
Proteins → Fingerprints (embedded sequences)
39. Cancer Dependency Map (DepMap)
Cellular Model and Chemical Fingerprints
Achilles & CCLE
• Identification of cell lines that express a target of interest (experimental validation for CISH)
Fingerprint Analysis (PRISM 1D)
• Phenotypic similarity to identify compounds sharing the same phenotypic effects
Figure labels: Targets / MOAs / Chemical Similarity In Clusters; Phenotypic similarity
Drugs → Fingerprints (activity in cell lines)
Similarity of cellular model fingerprints induced by compounds in DepMap PRISM reveals a correlation (in some clusters) between chemical structures and phenotypic responses
40. CMAP – L1000
Chemical / CRISPR / sh Fingerprints
Dimensional Reduction Cluster - MOA / Targets Analysis
Compounds identified within the cluster, obtained by dimensionality reduction of L1000 fingerprints, have the same mechanism of action (MOA) and are active on the same target family (HDAC*)
Compounds, CRISPR, sh, … → Fingerprints (L1000 signatures)
41. CMAP – L1000
Chemical / CRISPR / sh Fingerprints
Fingerprint Similarity (Phenotypic – DepMap) / Fingerprint Similarity (Chemical)
Compounds identified within the cluster, obtained by dimensionality reduction of L1000 fingerprints, have the same phenotypic effects (in DepMap PRISM) and are not chemically similar, while sharing the same mechanism of action and activity on the same target family (HDAC*)
42. Our PA DSDM activities Linked To Pegasus (so far)
Support Therapeutic Projects
Therapeutic target environment exploration (ID card)
• Get target annotations: isoforms, biological processes, localization, half-lives, essentiality, models, diseases, SNPs, pathways, perturbator activity
Optimized design of Antisense Oligonucleotides (ASOs)
• Characterize ASO off-target effects to prioritize the ASOs to screen
Design new experiments (e.g. coupling uORFs / ASOs)
• Identify target transcripts having uORFs that can be targeted by ASOs to increase protein expression
Identification of focused screening libraries
• Identify relevant perturbators, validate biological hypotheses, and expand from the Servier compound library to phenotypic, omics and cellular spaces
These activities are based on heterogeneous pharmaco-biological data already present in Pegasus or under exploration for future integration
43. Chemical / Activity On Targets
Chemical Fingerprint
Chemicals / Activity extraction
• Extraction, annotation (e.g. PAINS, frequent hitter, reactive), and standardization
Given a target of interest, quick identification of compounds having activity, analogues and compound specificity
Chemicals → Fingerprints (e.g. FCFP6)
44. Knowledge representation formalisms for linking heterogeneous data
Resource Description Framework (RDF) graph
- Atomic description of data: triple (subject, predicate, object)
- Sharing of consistent information and data
• Proven data model
• Difficulty in identifying a common semantics for the atomic information to be described
• RDF model not very flexible
Labeled Property Graph (LPG)
- Data in the form of entities (nodes) and relations
- Graph analysis, deep path search, bulk data import
- Flexibility of the data model: nodes and relations
Shortcomings of the RDF formalism for our research activities → Choice of LPG
Lack of concepts → Design and implementation of Pegasus (LPG)
45. Multiple databases identify functionally identical resources, such as genes
Functionally identical concepts are modeled as several entities linked by cross-references
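One way to exploit such cross-references is to group all identifiers that point, directly or transitively, to the same concept. The Python sketch below uses a union-find structure for this. It is an illustration under stated assumptions: the identifiers, the database prefixes (`db_a`, `db_b`, `db_c`) and the function `group_by_cross_reference` are hypothetical, and Pegasus itself keeps the entities distinct and linked rather than merged.

```python
def group_by_cross_reference(identifiers, cross_references):
    """Group identifiers that are linked, directly or not, by cross-references."""
    parent = {identifier: identifier for identifier in identifiers}

    def find(x):
        # Follow parents up to the representative, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in cross_references:
        parent[find(a)] = find(b)

    groups = {}
    for identifier in identifiers:
        groups.setdefault(find(identifier), set()).add(identifier)
    return sorted(sorted(group) for group in groups.values())


# Hypothetical gene identifiers from three hypothetical databases.
IDS = ["db_a:1", "db_b:X", "db_c:alpha", "db_a:2"]
XREFS = [("db_a:1", "db_b:X"), ("db_b:X", "db_c:alpha")]
gene_groups = group_by_cross_reference(IDS, XREFS)
print(gene_groups)
```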
46. The concepts of gene and protein are often conflated, and the concept of transcript is generally absent
Genes, transcripts, proteins and functional units are modeled as distinct entities linked by relations
47. Some concepts of chemical and biological perturbators are absent
Perturbators are modeled as distinct entities
48. No concept of phenotypic signatures or phenotypic similarities in existing representations
Phenotypic similarities are modeled as relations between phenotypic signatures (entities)
49. Relations linking graph nodes are direct in existing representations
An intermediate entity links several entities together and annotates them with context
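The intermediate-entity pattern can be sketched with plain Python data classes. Instead of a direct perturbator-to-target edge, an `Activity` node sits between them and carries the context. The class names, the context keys (`cell_line`, `assay`) and the helper `activities_on` are hypothetical choices for illustration, not the Pegasus schema.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Node:
    label: str
    name: str


@dataclass
class Activity:
    """Intermediate entity linking several nodes and holding the context."""
    perturbator: Node
    target: Node
    context: dict = field(default_factory=dict)  # e.g. assay, cell line


def activities_on(target, activities, **context):
    """Activities on `target` whose context matches all given key/values."""
    return [
        a for a in activities
        if a.target == target
        and all(a.context.get(k) == v for k, v in context.items())
    ]


# Hypothetical data: one compound measured on one gene in two contexts.
compound = Node("Chemical", "cpd_1")
gene = Node("Gene", "gene_T")
ACTIVITIES = [
    Activity(compound, gene, {"cell_line": "line_A", "assay": "binding"}),
    Activity(compound, gene, {"cell_line": "line_B", "assay": "binding"}),
]
in_line_a = activities_on(gene, ACTIVITIES, cell_line="line_A")
print(len(in_line_a), in_line_a[0].context)
```

With a direct edge, the two measurements would either collapse into one relation or require parallel edges. The intermediate node keeps each measurement addressable and lets further entities attach to it.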
50. Generation platform that automatically processes heterogeneous, multi-source pharmaco-biological data and generates Pegasus