Data-driven drug discovery for rare diseases - Tales from the trenches (CINF 20, ACS National Meeting 2019-03-31)

CINF20 - 31 March 2019
Dr Frederik van den Broek, Elsevier Professional Services
Data-driven drug discovery for rare diseases
Tales from the trenches

This is what we are all after in drug discovery…
Image: Elsevier

If only drug discovery and development were that simple…
Disease
Drug compound

Disease
Protein Target
Drug compound

Disease
Protein Target
Drug compound
• Cell processes
• Regulators
• Pathways
• …
• Bioactivity
• Toxicity
• Specificity
• …

Disease
Protein Target
Drug compound
• Cell processes
• Regulators
• Pathways
• …
• Bioactivity
• Toxicity
• Specificity
• …
• Availability
• Synthesis
• PK/PD
• …
• Genotype
• Phenotype
• Individual

This makes it all a lengthy and costly process
Image: https://www.phrma.org/graphic/the-biopharmaceutical-research-and-development-process

With rare diseases it is even harder
Small(er) patient populations, leading to:
• Less (integrated) medical and scientific knowledge
• Small population for clinical trials
• Lack of awareness among doctors, researchers, policymakers
• Smaller potential market size for a drug
Image: http://www.campingtourist.com/camping-activities/climbing/difficult-mountains-climb/

Drug repurposing: a new hope for rare diseases
• Less costly and of interest for pharma
• Quicker to Phase II/III tests, so hopefully quicker to market
• Need reliable information from various sources to find suitable repurposing candidates
Image: https://www.starwars.com/news/poll-what-is-the-best-scene-in-star-wars-a-new-hope

Accelerate with new knowledge and data
Disease
Protein Target
Drug compound
• Cell processes
• Regulators
• Pathways
• …
• Bioactivity
• Toxicity
• Specificity
• …
• Availability
• Synthesis
• PK/PD
• …
• Genotype
• Phenotype
• Individual

Various initiatives we were recently involved in
• Project with Findacure to find drug repurposing candidates for Congenital Hyperinsulinism
• Pistoia Hackathon: Elsevier-Findacure challenge on Friedreich’s ataxia
• Sub-network enrichment analysis for neuromuscular disorder pathways
• Disease pathway analysis for Huntington's disease
• Pistoia Datathon for drug repurposing for rare diseases

Congenital hyperinsulinism (CHI)
• A rare genetic disease
• Permanently excessive level of insulin in the blood
• Develops within the first few days of life
• Can lead to brain injury or even death
• In the most severe cases the only viable treatment is the removal of the pancreas, consigning the patient to a lifetime of diabetes
Image: https://res.cloudinary.com/indiegogo-media-prod-cld/image/upload/c_limit,w_620/v1440424745/uzvnqzhvbpsrtthzxqpu.jpg

Creating a comprehensive view of CHI
• CHI Literature Library
• Disease, Target, Pathway, and Compound Analysis
• Research Landscape Analysis
Information Assets Applied
• Content: Elsevier’s vast set of literature and patent data
• Data normalization: taxonomies and dictionaries to normalize author names, institutions, drugs, targets, and other important terms
• Information extraction: finding semantic relationships, targets, pathways, drugs, and bioactivities
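
A minimal sketch of the dictionary-based normalization idea above, not Elsevier's actual implementation. The function names and the tiny synonym dictionary are hypothetical, and the two disease entries simply reuse abbreviations that appear in this deck.

```python
# Hypothetical synonym dictionary: preferred term -> known variants.
# The entries reuse abbreviations from this deck; a real taxonomy would be far larger.
DISEASE_TERMS = {
    "congenital hyperinsulinism": ["CHI"],
    "Friedreich's ataxia": ["FRDA"],
}


def build_lookup(dictionary):
    """Map every case-folded variant (and the preferred term itself) to the preferred term."""
    lookup = {}
    for preferred, synonyms in dictionary.items():
        for term in (preferred, *synonyms):
            lookup[term.casefold().strip()] = preferred
    return lookup


def normalize(term, lookup):
    """Return the preferred term, or the input unchanged when it is not in the dictionary."""
    return lookup.get(term.casefold().strip(), term)


lookup = build_lookup(DISEASE_TERMS)
for raw in ["chi", " FRDA ", "some unlisted term"]:
    print(repr(raw), "->", repr(normalize(raw, lookup)))
```

Unknown terms are passed through unchanged here so that nothing is silently dropped; a production pipeline would more likely flag them for curation.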

Building and refining the CHI disease model
• Picked relevant pathways (from a collection of 1800 models)
• Explored functions of proteins using 6.2M pre-text-mined relations and embedded Gene Ontology
• Summarized what is known about the CHI mechanism in an overview model

From pathways to CHI treatments:
Automated analysis combines bioassay data with pathway data
Step 1: Find all targets that could be used to affect the disease state
Step 2: Query for each target to find compounds that have high affinity for them (>6 log units)
Step 3: Collate data by compound to summarize the targets/activities related to disease that the compound hits
• Compute geometric mean of activities for ranking
• Rank by number of targets and geometric mean of activities against targets
Result annotations:
• All compounds that were observed to bind to targets in pathway
• Targets and activities for each compound
• Mean of activities among these targets
• Drug-likeness metrics for sorting/classification
• Sorted by number of active targets. Too many targets may suggest lack of specificity.
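
The three steps can be sketched roughly as follows. This is an illustration under stated assumptions, not the pipeline used in the project: the record layout `(compound, target, pX)`, the function name, and the sample identifiers are all hypothetical, and sorting in descending order on both keys is an assumption (the slide only says compounds are ranked by number of targets and geometric mean of activities).

```python
import math
from collections import defaultdict

AFFINITY_THRESHOLD = 6.0  # from the slide: "high affinity (>6 log units)"


def rank_compounds(pathway_targets, bioactivities, threshold=AFFINITY_THRESHOLD):
    """Collate high-affinity activities per compound and rank the compounds.

    pathway_targets: set of target names relevant to the disease (Step 1).
    bioactivities:   iterable of (compound, target, pX) tuples (hypothetical layout).
    Returns a list of (compound, n_targets, geometric_mean_pX), best first.
    """
    hits = defaultdict(dict)
    for compound, target, px in bioactivities:
        # Step 2: keep only activities on pathway targets above the threshold.
        if target in pathway_targets and px > threshold:
            # Keep the strongest reported activity per compound/target pair.
            hits[compound][target] = max(hits[compound].get(target, px), px)

    rows = []
    for compound, per_target in hits.items():
        values = list(per_target.values())
        # Step 3: geometric mean of the activities against the pathway targets.
        geo_mean = math.exp(sum(math.log(v) for v in values) / len(values))
        rows.append((compound, len(values), geo_mean))

    # Assumed ordering: more targets first, then higher geometric mean.
    rows.sort(key=lambda r: (-r[1], -r[2], r[0]))
    return rows


# Made-up example records, for illustration only.
targets = {"target_1", "target_2", "target_3"}
records = [
    ("cmpd_A", "target_1", 7.0),
    ("cmpd_A", "target_2", 8.0),
    ("cmpd_A", "off_pathway", 9.5),  # ignored: not a pathway target
    ("cmpd_B", "target_3", 9.0),
    ("cmpd_C", "target_1", 5.5),     # ignored: below the threshold
]
for row in rank_compounds(targets, records):
    print(row)
```

As the slide notes, a long list of active targets can also signal a lack of specificity, so the target count is better read alongside the individual activities than used as a score on its own.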

Pistoia Hackathon Challenge (2017)
Elsevier would like you to demonstrate the ability of deep learning to help
Findacure, a UK-based charity, accelerate treatment and clinical research for
Friedreich’s ataxia (FRDA). You’ll have access to a heterogeneous set of
data related to the disease: biological pathway analysis, associated chemical
compounds and bioactivities, potential candidates for drug re-purposing, full-
text scientific literature and clinical trial data.
Basically, giving others a go with the data sets we worked with on CHI….

Promising results, but still hard work
“We spent most of our time the first day just trying to get our heads around
the data, so we could start to find some solutions. Even opening the files was
tricky.” The students used various tools to try to extract data from the
provided XML files, but it was slow going. Daniel [one of the participants]
commented that, “we wound up having to do a lot of things manually, so we
could at least read the files in plain text.”

Sharing disease pathways
• Shared curated pathways (with supporting literature
references) with rare disease organisations to help their
discussions with researchers and fill in potential “blanks”
• Comparing gene expression algorithms for the identification
of expression regulators
• Well-defined datasets, with supporting
literature references which resonate
with researchers

Datathon (2019):
Applying AI in Drug Repurposing for Rare Diseases

Using the Entellect Platform and Data Curation
“Machine learning won’t work if your data is rigidly siloed.”
“One major challenge is collecting enough reliable information to properly train AI systems. AI is as good as the data.”
“Organizations need to make sure that the data being accessed is treated and defined consistently across the sources. Otherwise, virtualization won't work.”
“All the major AI advances have been fueled by advances in data sets. The algorithms are easy….”
“Collecting, classifying and labeling datasets used to train the algorithms is the grunt work that’s difficult”
Quoted on the slide:
• Nick Patience, Founder, 451 Research
• Aspuru-Guzik, Professor of Chemistry & Machine Learning, Harvard University
• JJ Guy, CTO, Jask (AI co.)
Data challenges:
• ‘Siloed’
• Lack of standards
• Requires labeling and context
• Poor quality

Using the Entellect Platform and Data Curation
• Access, curation of authoritative life science data
• Integration of disparate data, structured and unstructured
• Normalized and standardized data with industry standard taxonomies
• Build custom and off-the-shelf analytics tools
Resulting data: ‘Un-siloed’, Harmonized, Enriched and linked, Quality

Entellect Platform and Data Curation
Entities in the data model: Adverse Event, Person, Org, Bioactivity, Pathway, Disease, Bioprocess, Trial, Species, Target, Drug, Assay
Example question: Which drugs affect this target?
Substance:
- provenanceName
- substance
- name
- compoundType
- substanceTypeName
- inchiCode
- molecularFormula
- charge
- numberOfAtoms
- numberOfComponents
- numberOfElements
- numberOfFragments
- numberOfStructure
- molWeightPublishedValue
- molWeightPublishedUnit
- molWeightStandardValue
- mpvalue
Bioactivity:
- provenanceName
- effect
- inducedBy
- target
- targetsCount
- bioactivityParameterName
- displayValue
- publishedValue
- publishedUnit
- pX
Target:
- provenanceName
- target
- uniprotId
- sequence
- targetType
- label
- speciesId
- speciesName
- geneSymbol
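
To make the example question "Which drugs affect this target?" concrete, here is a small sketch over in-memory records. It is not the Entellect API: the classes only borrow a few field names from the lists above, and the assumption that `inducedBy` holds a substance name and `target` holds a target name is a guess for illustration. The function name and sample records are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Substance:
    name: str
    inchiCode: str
    molecularFormula: str


@dataclass(frozen=True)
class Target:
    target: str
    uniprotId: str
    geneSymbol: str
    speciesName: str


@dataclass(frozen=True)
class Bioactivity:
    inducedBy: str  # assumed to reference Substance.name
    target: str     # assumed to reference Target.target
    bioactivityParameterName: str
    pX: float


def drugs_affecting(target_name, bioactivities, min_px=0.0):
    """Return substance names with activity on the target, strongest pX first."""
    best = {}
    for activity in bioactivities:
        if activity.target == target_name and activity.pX >= min_px:
            previous = best.get(activity.inducedBy, activity.pX)
            best[activity.inducedBy] = max(previous, activity.pX)
    return sorted(best, key=lambda name: (-best[name], name))


# Made-up records, for illustration only.
activities = [
    Bioactivity("drug_1", "target_X", "IC50", 7.2),
    Bioactivity("drug_2", "target_X", "Ki", 8.1),
    Bioactivity("drug_3", "target_Y", "IC50", 9.0),
]
print(drugs_affecting("target_X", activities))
```

The same traversal can be run in the other direction (from a substance to its targets), which is the "and back" part of going from disease to target to compound.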

Various teams using various approaches
• Semantic data: Target Identification
• Semantic data: Small Molecule Binding
• Machine Learning
− Ensemble Learning
− Mol2Vec, Prot2Vec
− Network diffusion
• Expert collaboration
− Virtual docking
− Adverse Event profiling
“I could work on the important stuff straight away, using all the data”

Promising results so far (March 2019)

Aiming to make data-driven drug discovery for rare diseases a little easier…
Disease
Protein Target
Drug compound
• Cell processes
• Regulators
• Pathways
• …
• Bioactivity
• Toxicity
• Specificity
• …
• Availability
• Synthesis
• PK/PD
• …
• Genotype
• Phenotype
• Individual

Conclusions
• Data, data, data…
• Data has to be FAIR and of good and trusted provenance, as researchers and clinicians will want to see the “chain of evidence” (beware of black box models)
• Data sets also have to be FAIR with respect to each other: the integrated approaches that repurposing needs require data sets linked across silos and domains, to go from disease to target to compound (and back)
Image: Sangya Pundir, CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=53414062

Acknowledgements
• Maria Shkrob
• Jabe Wilson
• Anton Yuryev
• Matthew Clark
• Christy Wilson
• Finlay Maclean
• Elsevier’s Entellect team
• Pistoia Hackathon and Datathon teams

Questions?
By Malis - https://commons.wikimedia.org/w/index.php?curid=2633354

Appendix – Datathon approaches
