Building a Knowledge Graph with Spark and NLP: How We Recommend Novel Drugs to our Scientists


Building a knowledge graph with Spark and NLP: How we recommend novel hypotheses to our scientists
Eliseo Papa, MBBS PhD, AstraZeneca
#UnifiedDataAnalytics #SparkAISummit

Drug discovery is hard
• Cost of a new drug: ~2.6 billion
• Probability of selecting the right target is 9-12% at best
• False discovery rate is estimated at 96%
• Over ⅔ of clinical trials fail for lack of efficacy

Despite increased R&D spending, the number of new medicines has stayed constant

AstraZeneca introduced the “5R” framework

5R has had a significant impact in improving our efficiency
https://www.nature.com/articles/nrd.2017.244

Difficulties remain
• Target decisions take years to be validated
• Too much data for scientists to consider when generating hypotheses

We are investing in new sources of data and faster validation

We need tools to make sense of data & make better and faster decisions
1) Partnerships
2) Internal Knowledge Graph build
3) Developing a RecSys for target identification

Finding a drug target can be formulated as a hybrid recommendation problem
• Scientists need to parse large amounts of information and make a ranking prediction
• Different formats, data models, locations
• Estimates of the probability of success need to be constantly updated

Multiple objective optimization

Traditional recsys approaches
• Collaborative filtering – “what is everyone else choosing as a drug target”
• Content-based filtering – “what are the characteristics of the target”
• Knowledge-based filtering – “what do we know about the target's role in human disease”
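
A hybrid recommender can blend the three approaches above into one ranking. The following is a minimal sketch only: the target names, per-approach scores and weights are hypothetical, and a weighted sum is just one simple way to combine signals, not necessarily the method used at AZ.

```python
from dataclasses import dataclass


@dataclass
class TargetScores:
    """Hypothetical per-target scores in [0, 1], one per filtering approach."""
    collaborative: float  # how often other teams choose this target
    content: float        # how well the target's characteristics fit
    knowledge: float      # strength of known links between target and disease


def hybrid_score(scores, weights=(1 / 3, 1 / 3, 1 / 3)):
    """Weighted sum of the three signals (an assumed, simple blending rule)."""
    w_cf, w_cb, w_kb = weights
    return (w_cf * scores.collaborative
            + w_cb * scores.content
            + w_kb * scores.knowledge)


def rank_targets(candidates, weights=(1 / 3, 1 / 3, 1 / 3)):
    """Return target names ordered from highest to lowest hybrid score."""
    return sorted(candidates,
                  key=lambda name: hybrid_score(candidates[name], weights),
                  reverse=True)


# Illustrative data only; these are not real targets or real scores.
candidates = {
    "GENE_A": TargetScores(collaborative=0.9, content=0.2, knowledge=0.1),
    "GENE_B": TargetScores(collaborative=0.3, content=0.6, knowledge=0.9),
    "GENE_C": TargetScores(collaborative=0.1, content=0.1, knowledge=0.2),
}
ranking = rank_targets(candidates, weights=(0.2, 0.3, 0.5))
print(ranking)
```

Weighting the knowledge-based signal more heavily pushes a well-evidenced but less popular target (GENE_B here) above a popular one (GENE_A).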

We assemble a large-scale knowledge graph from public and AZ internal data
[Diagram: (1) data sources (public, unstructured, AZ knowledge, omics, chemistry, literature) → deduplication, entity linking, normalization, NLP → KG, with regular data releases and multiple access options; (2) KG → feature extraction (embeddings, gCNN, ...) → machine learning model training → recommendations → insights validated in collaboration with scientists → pipeline decision]

[Diagram, part 1: data sources (public, unstructured, AZ knowledge, omics, chemistry, literature) → normalization, NLP → KG]

Inputs: files, DB & API, queries, schemas
Outputs: reports, graph(s), dashboard & search, files (nodes & edges)
Delivering a scalable, modular, cloud-based graph creation pipeline, with automated publishing,
analysis and reporting. The platform democratizes BIKG by facilitating easy knowledge addition,
graph build, interrogation and evaluation.
David Geleta

KG pipeline on Databricks
• Series of prototype notebooks ported to Databricks and chained to form the BIKG creation pipeline
• Fast, reproducible KG production: one order of magnitude speed improvement
• Input (source files) & output (source parsers, node & edge deduplication) files stored on DBFS
KG quality control visualization
• Databricks Dashboards
• Provides overview & in-depth views

Pipeline – series of notebooks

Pipeline stages
• (0) Source acquisition: sources are updated
• (1) Parsing: each specified source is parsed into a set of nodes and edges; inputs differ: multiline JSON, JSON, RDF, APIs, etc.
• (2) Matching & deduplication
  • Nodes: matched on labels and IDs
  • Edges: using deduplicated nodes, source and destination nodes are identified
• (3) Evaluation: the resulting KG is analyzed for completeness, correctness, etc.
• (4) Projections: the KG is transformed into several forms: nodes & edges CSVs, GraphX graph frames, RDF ontologies, etc.
• Dashboard: visualize QC metrics
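
Stage (2) above can be sketched in a few lines. This is a simplified illustration in plain Python rather than Spark so that it runs anywhere: the record layout, identifiers and source names are hypothetical, and nodes are matched on a normalized label only, whereas the slide describes matching on both labels and IDs.

```python
from collections import defaultdict

# Hypothetical output of the parsing stage for two sources.
parsed_nodes = [
    {"id": "ENSG0001", "label": "EGFR", "source": "source_a"},
    {"id": "HGNC:3236", "label": "egfr", "source": "source_b"},
    {"id": "MONDO:1", "label": "kidney disease", "source": "source_a"},
]
parsed_edges = [
    {"src": "HGNC:3236", "dst": "MONDO:1", "type": "associated_with", "source": "source_b"},
    {"src": "ENSG0001", "dst": "MONDO:1", "type": "associated_with", "source": "source_a"},
]


def deduplicate_nodes(nodes):
    """Group nodes by normalized label; return canonical nodes and an id map."""
    by_label = defaultdict(list)
    for node in nodes:
        by_label[node["label"].strip().lower()].append(node)

    canonical, id_map = [], {}
    for label, group in by_label.items():
        ids = sorted(node["id"] for node in group)
        canonical_id = ids[0]  # arbitrary but deterministic default identifier
        canonical.append({"id": canonical_id, "label": label, "ids": ids})
        for node_id in ids:
            id_map[node_id] = canonical_id
    return canonical, id_map


def resolve_edges(edges, id_map):
    """Rewrite edge endpoints to canonical ids and merge duplicates,
    keeping the set of sources that asserted each edge."""
    merged = defaultdict(set)
    for edge in edges:
        key = (id_map[edge["src"]], edge["type"], id_map[edge["dst"]])
        merged[key].add(edge["source"])
    return [
        {"src": src, "type": rel, "dst": dst, "provenance": sorted(sources)}
        for (src, rel, dst), sources in merged.items()
    ]


nodes, id_map = deduplicate_nodes(parsed_nodes)
edges = resolve_edges(parsed_edges, id_map)
print(len(nodes), edges)
```

The two gene records collapse into one node, and the two source-specific edges collapse into a single edge that still records both sources.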

Node dictionary
Nodes with all known labels, classification,
default identifier, and any other contextual
information, excluding provenance.

Mappings table
• Contains all mappings with types
• Easy to filter by type, provenance
• Facilitates different strengths of folding (strict 1:1 equivalence, narrow/broad, etc.)
• Directionality implied by source, target id order & mapping relation type
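
The idea of folding at different strengths can be illustrated with a small union-find over the mappings table. This is a sketch under assumptions: the row layout, the relation names ("exact", "broad") and the identifiers are hypothetical, and merging along broad mappings deliberately discards their direction.

```python
# Hypothetical mappings table rows: (source_id, relation, target_id, provenance).
mappings = [
    ("MONDO:1", "exact", "DOID:10", "source_a"),
    ("DOID:10", "exact", "MESH:D1", "source_b"),
    ("MONDO:1", "broad", "MONDO:0", "source_a"),
]


def fold(mappings, allowed_relations):
    """Merge ids connected by any mapping whose relation is allowed.

    Returns a sorted list of sorted id groups (one group per folded node).
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for src, relation, dst, _provenance in mappings:
        root_src, root_dst = find(src), find(dst)  # also registers both ids
        if relation in allowed_relations:
            parent[root_src] = root_dst

    groups = {}
    for node in parent:
        groups.setdefault(find(node), set()).add(node)
    return sorted(sorted(group) for group in groups.values())


strict = fold(mappings, {"exact"})
loose = fold(mappings, {"exact", "broad"})
print(strict)
print(loose)
```

Strict folding keeps the broader concept as a separate node, while the looser setting merges all four identifiers into one.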

Edge assertions
• Contains all edge assertions
• Easy to filter by type, provenance
• Directionality implied by source, target id order & relation type
• Edge types
  • structural: such edges provide ontological classification and can be used for clustering, folding, etc. (e.g. rdfs:subClassOf, skos:broader)
  • mapping
  • "real" edge
Keep evidence & context for each assertion

[Diagram, part 1: data sources (public, unstructured, AZ knowledge, omics, chemistry, literature) → normalization, NLP → BIKG]

Literature
A large amount of the knowledge relating to drug discovery is unstructured and continuously updated

Use natural language processing to extract precise information at scale
• Named entity recognition
• Entity linking
• Relationship extraction
https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1005017
MEDLINE: > 60m abstracts, weekly updates, 300 GB, billion entities & relationships

[Diagram: NER (Termite), then relationship extraction using syntax tree rules, yielding > 500 million relationships]
Tim Scrivener

NLP: Termite on Spark
Termite is a commercial Named Entity Recognition tool that AZ uses to derive value out of unstructured data.
Deploying this technology onto a Spark cluster allowed us to simplify our processing architecture and achieve massive scalability.
Richard Jackson

Syntax parsing increases precision of entity recognition

Relationships from the literature reduce the sparsity of the biological KG
https://blog.opentargets.org/link/

Language models lead to improvements in recall and precision
Scaling is still an issue:
1. Distribute on Spark/DB using Horovod
2. “Distill” the model down in size

Learned sentence representations can be used for downstream tasks
Input sentence → BERT-encoded representation, which supports:
• Distance and clustering
• Linking to the correct biological entity
• Ranking the probability of a target
• Classifying the type of relation
• Estimating how probable it is
[Diagram, part 2: BIKG → feature extraction (embeddings, gCNN, ...) → machine learning model training → recommendations → insights validated in collaboration with scientists → pipeline decision]

Graph embedding pipeline
• Ingest knowledge graph data from a blob store
• Random-split and convert data
• Train model with PyTorch-BigGraph
• Evaluate model
• Generate embeddings
• Nearest-neighbour search with Faiss (Facebook)
• Track artifacts
• Write model and search results to a blob store
Anna Gogleva
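
The "random-split and convert" step of the pipeline above could look roughly like this. It is a sketch under assumptions: the 80/10/10 fractions, the fixed seed, the synthetic edges and the tab-separated output are illustrative choices, and the training step itself is left out because the slides do not show its configuration.

```python
import random


def random_split(edges, fractions=(0.8, 0.1, 0.1), seed=0):
    """Shuffle edges reproducibly and split them into train/valid/test lists."""
    if abs(sum(fractions) - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    shuffled = list(edges)
    random.Random(seed).shuffle(shuffled)  # local RNG keeps the split repeatable
    n_train = int(len(shuffled) * fractions[0])
    n_valid = int(len(shuffled) * fractions[1])
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_valid],
            shuffled[n_train + n_valid:])


def to_tsv_lines(edges):
    """Convert (source, relation, target) triples into tab-separated lines."""
    return ["\t".join(edge) for edge in edges]


# Synthetic edge list for illustration only.
all_edges = [(f"GENE_{i}", "associated_with", f"DISEASE_{i % 5}") for i in range(100)]
train_edges, valid_edges, test_edges = random_split(all_edges)
print(len(train_edges), len(valid_edges), len(test_edges))
print(to_tsv_lines(train_edges[:2]))
```

Fixing the seed means the same held-out edges are used on every run, which keeps evaluation numbers comparable between model versions.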

Approximate nearest neighbor search
Input node: kidney disease
• Generate embeddings
• Input a query node
• Retrieve N nearest neighbor nodes of a required type
• Use case 1: input a disease node, return N nearest gene nodes
• Use case 2: input a gene list and re-order it based on distance to a disease node
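
Both use cases above can be shown with a tiny example. Note the substitution: this sketch uses an exact brute-force cosine-similarity search in plain Python as a stand-in for an approximate index such as Faiss, and the two-dimensional embeddings and gene names are made up for illustration.

```python
import math

# Hypothetical embeddings: node name -> (node type, vector).
embeddings = {
    "kidney disease": ("disease", [1.0, 0.0]),
    "liver disease": ("disease", [0.8, 0.6]),
    "GENE_A": ("gene", [0.9, 0.1]),
    "GENE_B": ("gene", [0.0, 1.0]),
    "GENE_C": ("gene", [0.6, 0.8]),
}


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest(query, node_type, n):
    """Use case 1: the n nodes of the required type closest to the query node."""
    _, query_vec = embeddings[query]
    scored = [
        (cosine(query_vec, vec), name)
        for name, (kind, vec) in embeddings.items()
        if kind == node_type and name != query
    ]
    scored.sort(reverse=True)
    return [name for _, name in scored[:n]]


def rerank(genes, disease):
    """Use case 2: re-order a gene list by similarity to a disease node."""
    _, disease_vec = embeddings[disease]
    return sorted(genes,
                  key=lambda gene: cosine(embeddings[gene][1], disease_vec),
                  reverse=True)


top_genes = nearest("kidney disease", node_type="gene", n=2)
reordered = rerank(["GENE_B", "GENE_C", "GENE_A"], "kidney disease")
print(top_genes, reordered)
```

Brute force scans every node per query, which is fine for a toy example; an approximate index trades a little accuracy for speed once the graph has millions of nodes.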

Lessons learned
• Spark lets us scale to millions of data points across disparate sources
• Being able to add new data quickly helps the feedback loop and improves trust
• Backend engineers and biologists don’t talk the same language, but working together can be magical
• We shouldn’t strike a balance between intuitive and comprehensive but instead build products for different audiences

Acknowledgements
• Richard Jackson: NLP pipeline
• David Geleta: Graph build pipeline
• Anna Gogleva: Embeddings & RecSys pipeline
• Tim Scrivener: NLP pipeline
• Georgios Gerogiokas: Network analysis & embeddings pipeline
• Daniel Goude: Data Ops
• Marina Pettersson: Project manager
• Erik Jansson: NLP pipeline
• Nick Brown: Science IT
• Matthew Woodwark
• Jonathan Dry: Oncology
• Claus Bendtsen: Discovery Sci
• Ian Barrett: Discovery Sci
• Ian Dix
Want to work with data that matters? Go to job-search.astrazeneca.com and search for “knowledge graph”
@elipapa elipapa.github.io https://www.linkedin.com/in/eliseopapa/
Thank you!

