Building a knowledge graph with Spark and NLP: How we recommend novel hypotheses to our scientists
Eliseo Papa, MBBS PhD, AstraZeneca
#UnifiedDataAnalytics #SparkAISummit
Drug discovery is hard
• Cost of a new drug: ~2.6 billion
• Probability of selecting the right target is 9-12% at best
• False discovery rate estimated at 96%
• Over ⅔ of clinical trials fail for lack of efficacy
Despite the increase in R&D spending, the number of new medicines stayed constant
AstraZeneca introduced the “5R”
framework
5R has had a significant impact in
improving our efficiency
https://www.nature.com/articles/nrd.2017.244
Difficulties remain
Target decisions take years to be validated
Too much data for scientists to consider when generating hypotheses
We are investing in new sources of data
and faster validation
We need tools to make sense of data &
make better and faster decisions
1) Partnerships
2) Internal Knowledge Graph build
3) Developing a RecSys for target identification
Finding a drug target can be formulated
as a hybrid recommendation problem
• Scientists need to parse large amounts of information and make a ranking prediction
• Different formats, data models, locations
• Estimates of the probability of success need to be constantly updated
Multiple objective optimization
Traditional recsys approaches
Collaborative filtering – “what is everyone else choosing as a drug target”
Content-based filtering – “what are the characteristics of the target”
Knowledge-based filtering – “what do we know about the target’s role in human disease”
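One common way to build a hybrid recommender is to blend the three filtering views into a single ranking score. The sketch below is illustrative only: it assumes each view already produces a score in [0, 1] per candidate target, and the gene names, scores and weights are made-up placeholders rather than anything from the talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TargetScores:
    """Hypothetical per-target scores in [0, 1], one per filtering view."""
    collaborative: float  # "what is everyone else choosing"
    content: float        # "what are the characteristics of the target"
    knowledge: float      # "what do we know about its role in disease"


def hybrid_score(scores, weights):
    """Weighted sum of the three views (weights are an assumption)."""
    w_cf, w_cb, w_kb = weights
    return (w_cf * scores.collaborative
            + w_cb * scores.content
            + w_kb * scores.knowledge)


def rank_targets(candidates, weights):
    """Return candidate names sorted from highest to lowest hybrid score."""
    return sorted(candidates,
                  key=lambda name: hybrid_score(candidates[name], weights),
                  reverse=True)


# Placeholder candidates; not real data.
candidates = {
    "GENE_A": TargetScores(collaborative=0.9, content=0.2, knowledge=0.1),
    "GENE_B": TargetScores(collaborative=0.3, content=0.6, knowledge=0.9),
    "GENE_C": TargetScores(collaborative=0.1, content=0.1, knowledge=0.2),
}
ranking = rank_targets(candidates, weights=(0.2, 0.3, 0.5))
print(ranking)
```

Changing the weights shifts the ranking toward one view or another, which is one simple way to express several objectives in a single score.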
We assemble a large scale knowledge
graph from public and AZ internal data
[Diagram: data sources (public, unstructured, AZ knowledge, omics, chemistry, literature) → (1) deduplication, entity linking, normalization, NLP; regular data release with multiple access options → KG → (2) feature extraction (embeddings, gCNN, ...) → machine learning model training → recommendations → insights validated in collaboration with scientists → pipeline decision]
Inputs: files, DB & API queries, schemas
Outputs: reports, graph(s), dashboard & search, files (nodes & edges)
Delivering a scalable, modular, cloud-based graph creation pipeline, with automated publishing,
analysis and reporting. The platform democratizes BIKG by facilitating easy knowledge addition,
graph build, interrogation and evaluation.
David Geleta
KG pipeline on Databricks
• Series of prototype notebooks ported to Databricks
and chained to form the BIKG creation pipeline
• Fast, reproducible KG production: one order of
magnitude speed improvement
• Input (source files) & output (source parsers, node & edge deduplication) files stored on DBFS
KG quality control visualization
• Databricks Dashboards
• Provides overview & in-depth views
Pipeline – series of notebooks
Pipeline stages
Dashboard
Visualize QC metrics
• (0) Source acquisition: sources are updated
• (1) Parsing: each specified source is parsed into a set of nodes and edges; input formats differ (multiline JSON, JSON, RDF, APIs, etc.)
• (2) Matching & deduplication
  • Nodes: matched on labels and IDs
  • Edges: using deduplicated nodes, source and destination nodes are identified
• (3) Evaluation: the resulting KG is analyzed for completeness, correctness, etc.
• (4) Projections: the KG is transformed into several forms: nodes & edges CSVs, GraphX graph frames, RDF ontologies, etc.
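Stage (2) can be illustrated with a small sketch. It is written in plain Python rather than Spark to stay self-contained, and it is not the BIKG implementation: the record layout, the identifiers such as `A:g1`, and the label normalisation are assumptions made for illustration. A node that bridges two existing groups is attached to the first match only; a real pipeline would need transitive merging.

```python
def normalise(label):
    """Lower-case and collapse whitespace so near-identical labels match."""
    return " ".join(label.lower().split())


def deduplicate_nodes(parsed_nodes):
    """Merge nodes that share an ID or a normalised label.

    Returns (canonical nodes, lookup from ID/label key to canonical ID).
    """
    canonical = {}
    key_to_canonical = {}
    for node in parsed_nodes:
        keys = [("id", i) for i in node["ids"]]
        keys += [("label", normalise(l)) for l in node["labels"]]
        # Reuse an existing canonical node if any key is already known.
        target = next((key_to_canonical[k] for k in keys
                       if k in key_to_canonical), None)
        if target is None:
            target = node["ids"][0]
            canonical[target] = {"ids": set(), "labels": set()}
        canonical[target]["ids"].update(node["ids"])
        canonical[target]["labels"].update(node["labels"])
        for k in keys:
            key_to_canonical.setdefault(k, target)
    return canonical, key_to_canonical


def resolve_edges(parsed_edges, key_to_canonical):
    """Rewrite edge endpoints to canonical IDs; drop unresolvable edges."""
    resolved = set()
    for src, relation, dst in parsed_edges:
        s = key_to_canonical.get(("id", src))
        d = key_to_canonical.get(("id", dst))
        if s is not None and d is not None:
            resolved.add((s, relation, d))
    return resolved


# Two hypothetical sources describing the same gene and disease.
nodes = [
    {"ids": ["A:g1"], "labels": ["BRCA1"]},
    {"ids": ["A:d1"], "labels": ["breast cancer"]},
    {"ids": ["B:g9"], "labels": ["brca1"]},
    {"ids": ["B:d4"], "labels": ["Breast  Cancer"]},
]
edges = [("A:g1", "associated_with", "A:d1"),
         ("B:g9", "associated_with", "B:d4")]

canonical, key_to_canonical = deduplicate_nodes(nodes)
resolved = resolve_edges(edges, key_to_canonical)
print(len(canonical), resolved)
```

In this toy input the four parsed nodes collapse to two, and the two source edges collapse to one edge between the canonical nodes.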
Node dictionary
Nodes with all known labels, classification,
default identifier, and any other contextual
information, excluding provenance.
Mappings table
• Contains all mappings with types
• Easy to filter by type, provenance
• Facilitates different strengths of folding (strict 1:1 equivalence, narrow/broad, etc.)
• Directionality is implied by the order of source and target IDs and by the mapping relation type
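A sketch of how one mappings table can support different folding strengths. The column names and the use of SKOS mapping relations (`skos:exactMatch`, `skos:broadMatch`, `skos:narrowMatch`) are assumptions for illustration; the slides do not say which relation vocabulary the table uses. Folding is done here with a small union-find over the mappings whose relation type is allowed.

```python
from collections import namedtuple

# Hypothetical row layout for the mappings table.
Mapping = namedtuple("Mapping", "source_id target_id relation provenance")

STRICT = {"skos:exactMatch"}
LOOSE = STRICT | {"skos:narrowMatch", "skos:broadMatch"}


def fold(mappings, allowed_relations):
    """Group IDs connected by allowed mapping types (union-find).

    Returns a dict from every ID seen to its group representative.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for m in mappings:
        a, b = find(m.source_id), find(m.target_id)  # registers both IDs
        if m.relation in allowed_relations and a != b:
            parent[max(a, b)] = min(a, b)
    return {node: find(node) for node in parent}


# Placeholder mappings; IDs and provenance names are made up.
mappings = [
    Mapping("X:1", "Y:1", "skos:exactMatch", "source_a"),
    Mapping("Y:1", "Z:1", "skos:broadMatch", "source_b"),
    Mapping("Z:1", "W:1", "skos:exactMatch", "source_b"),
]

strict = fold(mappings, STRICT)
loose = fold(mappings, LOOSE)
from_source_b = [m for m in mappings if m.provenance == "source_b"]
print(len(set(strict.values())), len(set(loose.values())))
```

Strict folding keeps two groups here, while the looser setting merges all four IDs into one, which is the trade-off the "strength of folding" bullet describes.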
Edge assertions
• Contains all edge assertions
• Easy to filter by type, provenance
• Directionality is implied by the order of source and target IDs and by the relation type
• Edge types
  • structural: such edges provide ontological classification and can be used for clustering, folding, etc. (e.g. rdfs:subClassOf, skos:broader)
  • mapping
  • "real" edge
Keep evidence & context for each
assertion
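One way to keep evidence and context is to collapse duplicate triples while accumulating every supporting record. The sketch below assumes a simple dictionary layout for assertions; the field names, IDs and context strings are hypothetical.

```python
from collections import defaultdict


def merge_assertions(assertions):
    """Collapse duplicate (source, relation, target) triples.

    Every supporting record is kept, so the merged edge still says
    where each assertion came from and in what context.
    """
    merged = defaultdict(list)
    for a in assertions:
        triple = (a["source"], a["relation"], a["target"])
        merged[triple].append({"provenance": a["provenance"],
                               "context": a["context"]})
    return dict(merged)


# Placeholder assertions; not real data.
assertions = [
    {"source": "gene_a", "relation": "associated_with", "target": "disease_x",
     "provenance": "literature", "context": "sentence 12 of abstract 1"},
    {"source": "gene_a", "relation": "associated_with", "target": "disease_x",
     "provenance": "omics", "context": "differential expression study"},
    {"source": "disease_x", "relation": "associated_with", "target": "gene_a",
     "provenance": "literature", "context": "sentence 3 of abstract 2"},
]

merged = merge_assertions(assertions)
evidence = merged[("gene_a", "associated_with", "disease_x")]
print(len(merged), len(evidence))
```

Because the triple is ordered, the reversed assertion stays a separate edge, in line with directionality being implied by source and target order.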
Focus on NLP: literature
• Large amount of knowledge relating to drug discovery
• Knowledge is unstructured and continuously updated
Use natural language processing to
extract precise information at scale
Named entity recognition
Entity linking
Relationship extraction
https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1005017
MEDLINE: > 60m abstracts, weekly updates, 300 GB, billion entities & relationships
NER: Termite
Relationship extraction using syntax tree rules
> 500 million relationships
Tim Scrivener
NLP: Termite on Spark
Termite is a commercial Named Entity Recognition tool that AZ uses to derive value out of unstructured data.
Deploying this technology onto a Spark cluster allowed us to simplify our processing architecture and achieve massive scalability.
Richard Jackson
Syntax parsing increases precision of
entity recognition
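A toy illustration of relationship extraction with a syntax tree rule. The dependency parse is written by hand (no parser or Termite call is involved), the entity names are placeholders, and the single rule, a verb with an entity subject and an entity direct object, is an assumption chosen for illustration rather than one of the rules used in the talk.

```python
from itertools import combinations

# Hand-built parse of: "GeneX was measured while DrugZ inhibits ProteinY".
# "head" is the index of the governing token; "ent" is the entity type or None.
tokens = [
    {"text": "GeneX",    "dep": "nsubjpass", "head": 2, "ent": "GENE"},
    {"text": "was",      "dep": "auxpass",   "head": 2, "ent": None},
    {"text": "measured", "dep": "ROOT",      "head": 2, "ent": None},
    {"text": "while",    "dep": "mark",      "head": 5, "ent": None},
    {"text": "DrugZ",    "dep": "nsubj",     "head": 5, "ent": "DRUG"},
    {"text": "inhibits", "dep": "advcl",     "head": 2, "ent": None},
    {"text": "ProteinY", "dep": "dobj",      "head": 5, "ent": "PROTEIN"},
]


def extract_relations(tokens):
    """Emit (subject, verb, object) when both arguments are entities
    attached to the same governing token as nsubj and dobj."""
    relations = []
    for i, tok in enumerate(tokens):
        subjects = [t for t in tokens
                    if t["head"] == i and t["dep"] == "nsubj" and t["ent"]]
        objects = [t for t in tokens
                   if t["head"] == i and t["dep"] == "dobj" and t["ent"]]
        for s in subjects:
            for o in objects:
                relations.append((s["text"], tok["text"], o["text"]))
    return relations


relations = extract_relations(tokens)

# Naive baseline: every pair of entities co-occurring in the sentence.
entities = [t["text"] for t in tokens if t["ent"]]
cooccurring = list(combinations(entities, 2))
print(relations, len(cooccurring))
```

Sentence-level co-occurrence would link GeneX to both other entities, whereas the syntactic rule keeps only the pair that the verb actually connects.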
Relationships from the literature reduce the sparsity of the biological KG
https://blog.opentargets.org/link/
Language models lead to improvements
in recall and precision
Scaling is still an issue:
1. Distribute on Spark/DB using Horovod
2. “Distill” the model down in size
Learned sentence representations can be used for downstream tasks
Input sentence → BERT encoded representation
• Distance and clustering
• Link to the correct biological entity
• Rank the probability of a target
• Classify the type of relation
• Estimate how probable it is
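The distance and entity-linking uses can be sketched with cosine similarity. The three-dimensional vectors below are made-up stand-ins for BERT sentence representations, and `link_entity` is a hypothetical helper, not part of any library.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def link_entity(mention_vec, entity_vecs):
    """Return the entity whose vector is most similar to the mention."""
    return max(entity_vecs,
               key=lambda name: cosine_similarity(mention_vec,
                                                  entity_vecs[name]))


# Placeholder vectors; real encoder outputs have hundreds of dimensions.
entity_vecs = {
    "entity_a": [1.0, 0.0, 0.0],
    "entity_b": [0.0, 1.0, 0.0],
}
mention = [0.9, 0.1, 0.0]
linked = link_entity(mention, entity_vecs)
print(linked)
```

The same similarity function can feed a clustering step or a ranking, since all three tasks only need a distance between encoded sentences.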
Graph embedding pipeline
• Ingest knowledge graph data from a blob store
• Random-split and convert data
• Train model with PyTorch-BigGraph
• Evaluate model
• Generate embeddings
• Nearest-neighbour search with Faiss (Facebook)
• Track artifacts
• Write model and search results to a blob store
Anna Gogleva
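The random-split step of the embedding pipeline can be sketched as follows. The 80/10/10 proportions, the fixed seed and the edge tuples are assumptions for illustration; training with PyTorch-BigGraph itself is not shown.

```python
import random


def random_split(edges, train=0.8, valid=0.1, seed=0):
    """Shuffle edges reproducibly and cut them into train/valid/test."""
    shuffled = list(edges)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train)
    n_valid = int(len(shuffled) * valid)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_valid],
            shuffled[n_train + n_valid:])


# Placeholder edge list of (source, relation, target) triples.
edges = [(f"node_{i}", "related_to", f"node_{i + 1}") for i in range(100)]
train_edges, valid_edges, test_edges = random_split(edges)
print(len(train_edges), len(valid_edges), len(test_edges))
```

Fixing the seed keeps the split reproducible between runs, which makes evaluation results comparable when the model or its parameters change.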
Approximate nearest neighbor search
Input node: kidney disease
• Generate embeddings
• Input a query node
• Retrieve the N nearest neighbor nodes of a required type
• Use case 1: input a disease node, return the N nearest gene nodes
• Use case 2: input a gene list and re-order it based on distance to a disease node
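Both use cases can be sketched with an exact, brute-force search. This is a stand-in for the approximate search that Faiss provides, not a use of Faiss itself; the two-dimensional embeddings, node names other than the kidney disease query, and the type labels are made up for illustration.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def nearest_of_type(query_id, embeddings, node_types, wanted_type, n):
    """Use case 1: the n nodes of a given type closest to the query."""
    query = embeddings[query_id]
    candidates = [node for node in embeddings
                  if node != query_id and node_types[node] == wanted_type]
    return sorted(candidates,
                  key=lambda node: euclidean(query, embeddings[node]))[:n]


def reorder_by_distance(gene_list, disease_id, embeddings):
    """Use case 2: re-order genes by distance to a disease node."""
    disease = embeddings[disease_id]
    return sorted(gene_list,
                  key=lambda gene: euclidean(disease, embeddings[gene]))


# Placeholder embeddings; real ones come from the trained model.
embeddings = {
    "kidney disease": [0.0, 0.0],
    "other disease": [0.5, 0.0],
    "gene_a": [1.0, 0.0],
    "gene_b": [0.0, 3.0],
    "gene_c": [5.0, 5.0],
}
node_types = {
    "kidney disease": "disease",
    "other disease": "disease",
    "gene_a": "gene",
    "gene_b": "gene",
    "gene_c": "gene",
}

top_genes = nearest_of_type("kidney disease", embeddings, node_types, "gene", 2)
reordered = reorder_by_distance(["gene_c", "gene_b", "gene_a"],
                                "kidney disease", embeddings)
print(top_genes, reordered)
```

Brute force compares the query against every candidate, which is fine for a toy graph but is the cost that an approximate index is meant to avoid at scale.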
Lessons learned
• Spark lets us scale to millions of data points across disparate sources
• Being able to add new data quickly helps the feedback loop and improves trust
• Backend engineers and biologists don’t talk the same language, but working together can be magical
• We shouldn’t strike a balance between intuitive and comprehensive but instead build products for different audiences
Acknowledgements
Richard Jackson: NLP pipeline
David Geleta: Graph build pipeline
Anna Gogleva: Embeddings & RecSys pipeline
Tim Scrivener: NLP pipeline
Georgios Gerogiokas: Network analysis & embeddings pipeline
Daniel Goude: Data Ops
Marina Pettersson: Project manager
Erik Jansson: NLP pipeline
Nick Brown
Science IT
Matthew Woodwark
Jonathan Dry
Oncology
Claus Bendtsen
Discovery Sci
Ian Barrett
Discovery Sci
Ian Dix
Want to work with data that matters? Go to job-search.astrazeneca.com and search for “knowledge graph”
@elipapa elipapa.github.io https://www.linkedin.com/in/eliseopapa/
Thank you!

More Related Content

What's hot

JSON-LD and SHACL for Knowledge Graphs
JSON-LD and SHACL for Knowledge GraphsJSON-LD and SHACL for Knowledge Graphs
JSON-LD and SHACL for Knowledge Graphs
Franz Inc. - AllegroGraph
 
Encryption and Masking for Sensitive Apache Spark Analytics Addressing CCPA a...
Encryption and Masking for Sensitive Apache Spark Analytics Addressing CCPA a...Encryption and Masking for Sensitive Apache Spark Analytics Addressing CCPA a...
Encryption and Masking for Sensitive Apache Spark Analytics Addressing CCPA a...
Databricks
 
The Art of the Possible with Graph - Sudhir Hasbe - GraphSummit London 14 Nov...
The Art of the Possible with Graph - Sudhir Hasbe - GraphSummit London 14 Nov...The Art of the Possible with Graph - Sudhir Hasbe - GraphSummit London 14 Nov...
The Art of the Possible with Graph - Sudhir Hasbe - GraphSummit London 14 Nov...
Neo4j
 
Introduction to Knowledge Graphs
Introduction to Knowledge GraphsIntroduction to Knowledge Graphs
Introduction to Knowledge Graphs
mukuljoshi
 
End-to-End Drug Discovery Project Management (GlaxoSmithKline)
End-to-End Drug Discovery Project Management (GlaxoSmithKline)End-to-End Drug Discovery Project Management (GlaxoSmithKline)
End-to-End Drug Discovery Project Management (GlaxoSmithKline)
Neo4j
 
Pyspark Tutorial | Introduction to Apache Spark with Python | PySpark Trainin...
Pyspark Tutorial | Introduction to Apache Spark with Python | PySpark Trainin...Pyspark Tutorial | Introduction to Apache Spark with Python | PySpark Trainin...
Pyspark Tutorial | Introduction to Apache Spark with Python | PySpark Trainin...
Edureka!
 
Developing a Knowledge Graph of your Competency, Skills, and Knowledge at NASA
Developing a Knowledge Graph of your Competency, Skills, and Knowledge at NASADeveloping a Knowledge Graph of your Competency, Skills, and Knowledge at NASA
Developing a Knowledge Graph of your Competency, Skills, and Knowledge at NASA
Neo4j
 
GPT and Graph Data Science to power your Knowledge Graph
GPT and Graph Data Science to power your Knowledge GraphGPT and Graph Data Science to power your Knowledge Graph
GPT and Graph Data Science to power your Knowledge Graph
Neo4j
 
The path to success with Graph Database and Graph Data Science
The path to success with Graph Database and Graph Data ScienceThe path to success with Graph Database and Graph Data Science
The path to success with Graph Database and Graph Data Science
Neo4j
 
Knowledge Graphs - The Power of Graph-Based Search
Knowledge Graphs - The Power of Graph-Based SearchKnowledge Graphs - The Power of Graph-Based Search
Knowledge Graphs - The Power of Graph-Based Search
Neo4j
 
Slides: Knowledge Graphs vs. Property Graphs
Slides: Knowledge Graphs vs. Property GraphsSlides: Knowledge Graphs vs. Property Graphs
Slides: Knowledge Graphs vs. Property Graphs
DATAVERSITY
 
Data Modeling with Neo4j
Data Modeling with Neo4jData Modeling with Neo4j
Data Modeling with Neo4j
Neo4j
 
Physical Plans in Spark SQL
Physical Plans in Spark SQLPhysical Plans in Spark SQL
Physical Plans in Spark SQL
Databricks
 
Catalyst optimizer
Catalyst optimizerCatalyst optimizer
Catalyst optimizer
Ayub Mohammad
 
Drug and Vaccine Discovery: Knowledge Graph + Apache Spark
Drug and Vaccine Discovery: Knowledge Graph + Apache SparkDrug and Vaccine Discovery: Knowledge Graph + Apache Spark
Drug and Vaccine Discovery: Knowledge Graph + Apache Spark
Databricks
 
Knowledge Graphs & Graph Data Science, More Context, Better Predictions - Neo...
Knowledge Graphs & Graph Data Science, More Context, Better Predictions - Neo...Knowledge Graphs & Graph Data Science, More Context, Better Predictions - Neo...
Knowledge Graphs & Graph Data Science, More Context, Better Predictions - Neo...
Neo4j
 
The Knowledge Graph Explosion
The Knowledge Graph ExplosionThe Knowledge Graph Explosion
The Knowledge Graph Explosion
Neo4j
 
The Data Platform for Today’s Intelligent Applications
The Data Platform for Today’s Intelligent ApplicationsThe Data Platform for Today’s Intelligent Applications
The Data Platform for Today’s Intelligent Applications
Neo4j
 
GraphFrames: DataFrame-based graphs for Apache® Spark™
GraphFrames: DataFrame-based graphs for Apache® Spark™GraphFrames: DataFrame-based graphs for Apache® Spark™
GraphFrames: DataFrame-based graphs for Apache® Spark™
Databricks
 
Fraud Detection with Graphs at the Danish Business Authority
Fraud Detection with Graphs at the Danish Business AuthorityFraud Detection with Graphs at the Danish Business Authority
Fraud Detection with Graphs at the Danish Business Authority
Neo4j
 

What's hot (20)

JSON-LD and SHACL for Knowledge Graphs
JSON-LD and SHACL for Knowledge GraphsJSON-LD and SHACL for Knowledge Graphs
JSON-LD and SHACL for Knowledge Graphs
 
Encryption and Masking for Sensitive Apache Spark Analytics Addressing CCPA a...
Encryption and Masking for Sensitive Apache Spark Analytics Addressing CCPA a...Encryption and Masking for Sensitive Apache Spark Analytics Addressing CCPA a...
Encryption and Masking for Sensitive Apache Spark Analytics Addressing CCPA a...
 
The Art of the Possible with Graph - Sudhir Hasbe - GraphSummit London 14 Nov...
The Art of the Possible with Graph - Sudhir Hasbe - GraphSummit London 14 Nov...The Art of the Possible with Graph - Sudhir Hasbe - GraphSummit London 14 Nov...
The Art of the Possible with Graph - Sudhir Hasbe - GraphSummit London 14 Nov...
 
Introduction to Knowledge Graphs
Introduction to Knowledge GraphsIntroduction to Knowledge Graphs
Introduction to Knowledge Graphs
 
End-to-End Drug Discovery Project Management (GlaxoSmithKline)
End-to-End Drug Discovery Project Management (GlaxoSmithKline)End-to-End Drug Discovery Project Management (GlaxoSmithKline)
End-to-End Drug Discovery Project Management (GlaxoSmithKline)
 
Pyspark Tutorial | Introduction to Apache Spark with Python | PySpark Trainin...
Pyspark Tutorial | Introduction to Apache Spark with Python | PySpark Trainin...Pyspark Tutorial | Introduction to Apache Spark with Python | PySpark Trainin...
Pyspark Tutorial | Introduction to Apache Spark with Python | PySpark Trainin...
 
Developing a Knowledge Graph of your Competency, Skills, and Knowledge at NASA
Developing a Knowledge Graph of your Competency, Skills, and Knowledge at NASADeveloping a Knowledge Graph of your Competency, Skills, and Knowledge at NASA
Developing a Knowledge Graph of your Competency, Skills, and Knowledge at NASA
 
GPT and Graph Data Science to power your Knowledge Graph
GPT and Graph Data Science to power your Knowledge GraphGPT and Graph Data Science to power your Knowledge Graph
GPT and Graph Data Science to power your Knowledge Graph
 
The path to success with Graph Database and Graph Data Science
The path to success with Graph Database and Graph Data ScienceThe path to success with Graph Database and Graph Data Science
The path to success with Graph Database and Graph Data Science
 
Knowledge Graphs - The Power of Graph-Based Search
Knowledge Graphs - The Power of Graph-Based SearchKnowledge Graphs - The Power of Graph-Based Search
Knowledge Graphs - The Power of Graph-Based Search
 
Slides: Knowledge Graphs vs. Property Graphs
Slides: Knowledge Graphs vs. Property GraphsSlides: Knowledge Graphs vs. Property Graphs
Slides: Knowledge Graphs vs. Property Graphs
 
Data Modeling with Neo4j
Data Modeling with Neo4jData Modeling with Neo4j
Data Modeling with Neo4j
 
Physical Plans in Spark SQL
Physical Plans in Spark SQLPhysical Plans in Spark SQL
Physical Plans in Spark SQL
 
Catalyst optimizer
Catalyst optimizerCatalyst optimizer
Catalyst optimizer
 
Drug and Vaccine Discovery: Knowledge Graph + Apache Spark
Drug and Vaccine Discovery: Knowledge Graph + Apache SparkDrug and Vaccine Discovery: Knowledge Graph + Apache Spark
Drug and Vaccine Discovery: Knowledge Graph + Apache Spark
 
Knowledge Graphs & Graph Data Science, More Context, Better Predictions - Neo...
Knowledge Graphs & Graph Data Science, More Context, Better Predictions - Neo...Knowledge Graphs & Graph Data Science, More Context, Better Predictions - Neo...
Knowledge Graphs & Graph Data Science, More Context, Better Predictions - Neo...
 
The Knowledge Graph Explosion
The Knowledge Graph ExplosionThe Knowledge Graph Explosion
The Knowledge Graph Explosion
 
The Data Platform for Today’s Intelligent Applications
The Data Platform for Today’s Intelligent ApplicationsThe Data Platform for Today’s Intelligent Applications
The Data Platform for Today’s Intelligent Applications
 
GraphFrames: DataFrame-based graphs for Apache® Spark™
GraphFrames: DataFrame-based graphs for Apache® Spark™GraphFrames: DataFrame-based graphs for Apache® Spark™
GraphFrames: DataFrame-based graphs for Apache® Spark™
 
Fraud Detection with Graphs at the Danish Business Authority
Fraud Detection with Graphs at the Danish Business AuthorityFraud Detection with Graphs at the Danish Business Authority
Fraud Detection with Graphs at the Danish Business Authority
 

Similar to Building a Knowledge Graph with Spark and NLP: How We Recommend Novel Drugs to our Scientists

FedCentric_Presentation
FedCentric_PresentationFedCentric_Presentation
FedCentric_Presentation
Yatpang Cheung
 
Data Harmonization for a Molecularly Driven Health System
Data Harmonization for a Molecularly Driven Health SystemData Harmonization for a Molecularly Driven Health System
Data Harmonization for a Molecularly Driven Health System
Warren Kibbe
 
2015 GU-ICBI Poster (third printing)
2015 GU-ICBI Poster (third printing)2015 GU-ICBI Poster (third printing)
2015 GU-ICBI Poster (third printing)
Michael Atkins
 
Neo4j for Healthcare & Life Sciences
Neo4j for Healthcare & Life SciencesNeo4j for Healthcare & Life Sciences
Neo4j for Healthcare & Life Sciences
Neo4j
 
Data Harmonization for a Molecularly Driven Health System
Data Harmonization for a Molecularly Driven Health SystemData Harmonization for a Molecularly Driven Health System
Data Harmonization for a Molecularly Driven Health System
Warren Kibbe
 
Overview of Next Gen Sequencing Data Analysis
Overview of Next Gen Sequencing Data AnalysisOverview of Next Gen Sequencing Data Analysis
Overview of Next Gen Sequencing Data Analysis
Bioinformatics and Computational Biosciences Branch
 
dkNET Webinar: Discover the Latest from dkNET - Biomed Resource Watch 06/02/2023
dkNET Webinar: Discover the Latest from dkNET - Biomed Resource Watch 06/02/2023dkNET Webinar: Discover the Latest from dkNET - Biomed Resource Watch 06/02/2023
dkNET Webinar: Discover the Latest from dkNET - Biomed Resource Watch 06/02/2023
dkNET
 
Escaping Flatland: Interactive High-Dimensional Data Analysis in Drug Discove...
Escaping Flatland: Interactive High-Dimensional Data Analysis in Drug Discove...Escaping Flatland: Interactive High-Dimensional Data Analysis in Drug Discove...
Escaping Flatland: Interactive High-Dimensional Data Analysis in Drug Discove...
Spark Summit
 
EiTESAL eHealth Conference 14&15 May 2017
EiTESAL eHealth Conference 14&15 May 2017 EiTESAL eHealth Conference 14&15 May 2017
EiTESAL eHealth Conference 14&15 May 2017
EITESANGO
 
Big Data and AI in Fighting Against COVID-19
Big Data and AI in Fighting Against COVID-19Big Data and AI in Fighting Against COVID-19
Big Data and AI in Fighting Against COVID-19
Bill Liu
 
Big Data and AI for Covid-19
Big Data and AI for Covid-19Big Data and AI for Covid-19
Big Data and AI for Covid-19
Andrew Zhang
 
Maze's Compass Platform - A data fabric for drug discovery and development
Maze's Compass Platform - A data fabric for drug discovery and developmentMaze's Compass Platform - A data fabric for drug discovery and development
Maze's Compass Platform - A data fabric for drug discovery and development
Nolan Nichols
 
Data-knowledge transition zones within the biomedical research ecosystem
Data-knowledge transition zones within the biomedical research ecosystemData-knowledge transition zones within the biomedical research ecosystem
Data-knowledge transition zones within the biomedical research ecosystem
Maryann Martone
 
Sybrandt Thesis Proposal Presentation
Sybrandt Thesis Proposal PresentationSybrandt Thesis Proposal Presentation
Sybrandt Thesis Proposal Presentation
Justin Sybrandt, Ph.D.
 
NRNB Annual Report 2018
NRNB Annual Report 2018NRNB Annual Report 2018
NRNB Annual Report 2018
Alexander Pico
 
GlobusWorld 2020 Keynote
GlobusWorld 2020 KeynoteGlobusWorld 2020 Keynote
GlobusWorld 2020 Keynote
Globus
 
Spark Summit Europe: Share and analyse genomic data at scale
Spark Summit Europe: Share and analyse genomic data at scaleSpark Summit Europe: Share and analyse genomic data at scale
Spark Summit Europe: Share and analyse genomic data at scale
Andy Petrella
 
BioIT 2017 - Ontoforce and Amgen Gene Knowledge Discovery
BioIT 2017 - Ontoforce and Amgen Gene Knowledge DiscoveryBioIT 2017 - Ontoforce and Amgen Gene Knowledge Discovery
BioIT 2017 - Ontoforce and Amgen Gene Knowledge Discovery
Wolfgang G. Hoeck
 
Tag.bio: Self Service Data Mesh Platform
Tag.bio: Self Service Data Mesh PlatformTag.bio: Self Service Data Mesh Platform
Tag.bio: Self Service Data Mesh Platform
Sanjay Padhi, Ph.D
 
Sharing massive data analysis: from provenance to linked experiment reports
Sharing massive data analysis: from provenance to linked experiment reportsSharing massive data analysis: from provenance to linked experiment reports
Sharing massive data analysis: from provenance to linked experiment reports
Gaignard Alban
 

Similar to Building a Knowledge Graph with Spark and NLP: How We Recommend Novel Drugs to our Scientists (20)

FedCentric_Presentation
FedCentric_PresentationFedCentric_Presentation
FedCentric_Presentation
 
Data Harmonization for a Molecularly Driven Health System
Data Harmonization for a Molecularly Driven Health SystemData Harmonization for a Molecularly Driven Health System
Data Harmonization for a Molecularly Driven Health System
 
2015 GU-ICBI Poster (third printing)
2015 GU-ICBI Poster (third printing)2015 GU-ICBI Poster (third printing)
2015 GU-ICBI Poster (third printing)
 
Neo4j for Healthcare & Life Sciences
Neo4j for Healthcare & Life SciencesNeo4j for Healthcare & Life Sciences
Neo4j for Healthcare & Life Sciences
 
Data Harmonization for a Molecularly Driven Health System
Data Harmonization for a Molecularly Driven Health SystemData Harmonization for a Molecularly Driven Health System
Data Harmonization for a Molecularly Driven Health System
 
Overview of Next Gen Sequencing Data Analysis
Overview of Next Gen Sequencing Data AnalysisOverview of Next Gen Sequencing Data Analysis
Overview of Next Gen Sequencing Data Analysis
 
dkNET Webinar: Discover the Latest from dkNET - Biomed Resource Watch 06/02/2023
dkNET Webinar: Discover the Latest from dkNET - Biomed Resource Watch 06/02/2023dkNET Webinar: Discover the Latest from dkNET - Biomed Resource Watch 06/02/2023
dkNET Webinar: Discover the Latest from dkNET - Biomed Resource Watch 06/02/2023
 
Escaping Flatland: Interactive High-Dimensional Data Analysis in Drug Discove...
Escaping Flatland: Interactive High-Dimensional Data Analysis in Drug Discove...Escaping Flatland: Interactive High-Dimensional Data Analysis in Drug Discove...
Escaping Flatland: Interactive High-Dimensional Data Analysis in Drug Discove...
 
EiTESAL eHealth Conference 14&15 May 2017
EiTESAL eHealth Conference 14&15 May 2017 EiTESAL eHealth Conference 14&15 May 2017
EiTESAL eHealth Conference 14&15 May 2017
 
Big Data and AI in Fighting Against COVID-19
Big Data and AI in Fighting Against COVID-19Big Data and AI in Fighting Against COVID-19
Big Data and AI in Fighting Against COVID-19
 
Big Data and AI for Covid-19
Big Data and AI for Covid-19Big Data and AI for Covid-19
Big Data and AI for Covid-19
 
Maze's Compass Platform - A data fabric for drug discovery and development
Maze's Compass Platform - A data fabric for drug discovery and developmentMaze's Compass Platform - A data fabric for drug discovery and development
Maze's Compass Platform - A data fabric for drug discovery and development
 
Data-knowledge transition zones within the biomedical research ecosystem
Data-knowledge transition zones within the biomedical research ecosystemData-knowledge transition zones within the biomedical research ecosystem
Data-knowledge transition zones within the biomedical research ecosystem
 
Sybrandt Thesis Proposal Presentation
Sybrandt Thesis Proposal PresentationSybrandt Thesis Proposal Presentation
Sybrandt Thesis Proposal Presentation
 
NRNB Annual Report 2018
NRNB Annual Report 2018NRNB Annual Report 2018
NRNB Annual Report 2018
 
GlobusWorld 2020 Keynote
GlobusWorld 2020 KeynoteGlobusWorld 2020 Keynote
GlobusWorld 2020 Keynote
 
Spark Summit Europe: Share and analyse genomic data at scale
Spark Summit Europe: Share and analyse genomic data at scaleSpark Summit Europe: Share and analyse genomic data at scale
Spark Summit Europe: Share and analyse genomic data at scale
 
BioIT 2017 - Ontoforce and Amgen Gene Knowledge Discovery
BioIT 2017 - Ontoforce and Amgen Gene Knowledge DiscoveryBioIT 2017 - Ontoforce and Amgen Gene Knowledge Discovery
BioIT 2017 - Ontoforce and Amgen Gene Knowledge Discovery
 
Tag.bio: Self Service Data Mesh Platform
Tag.bio: Self Service Data Mesh PlatformTag.bio: Self Service Data Mesh Platform
Tag.bio: Self Service Data Mesh Platform
 
Sharing massive data analysis: from provenance to linked experiment reports
Sharing massive data analysis: from provenance to linked experiment reportsSharing massive data analysis: from provenance to linked experiment reports
Sharing massive data analysis: from provenance to linked experiment reports
 

More from Databricks

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
Databricks
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
Databricks
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
Databricks
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
Databricks
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
Databricks
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
Databricks
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
Databricks
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
Databricks
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
Databricks
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
Databricks
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
Databricks
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Databricks
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
Databricks
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Databricks
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
Databricks
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Databricks
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
Databricks
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
Databricks
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
Databricks
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
Databricks
 

More from Databricks (20)

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
 
Building a Knowledge Graph with Spark and NLP: How We Recommend Novel Drugs to our Scientists

  • 1. WIFI SSID:Spark+AISummit | Password: UnifiedDataAnalytics
  • 2. Building a knowledge graph with Spark and NLP: how we recommend novel hypotheses to our scientists. Eliseo Papa, MBBS PhD, AstraZeneca. #UnifiedDataAnalytics #SparkAISummit
  • 3. Drug discovery is hard
      • Cost of a new drug: ~ 2.6 billion
      • Probability of selecting the right target is 9-12% at best
      • False discovery rate estimated at 96%
      • Over ⅔ of clinical trials fail for lack of efficacy
  • 4. Despite increased R&D spending, the number of new medicines remained constant
  • 5. AstraZeneca introduced the “5R” framework
  • 6. 5R has had a significant impact on our efficiency (https://www.nature.com/articles/nrd.2017.244)
  • 7. Difficulties remain
      • Target decisions take years to be validated
      • Too much data for scientists to consider when generating hypotheses
  • 8. We are investing in new sources of data and faster validation
  • 9. We need tools to make sense of data and make better and faster decisions
      • 1) Partnerships
      • 2) Internal knowledge graph build
      • 3) Developing a RecSys for target identification
  • 10. Finding a drug target can be formulated as a hybrid recommendation problem
      • Scientists need to parse large amounts of information and make a ranking prediction
      • Different formats, data models, locations
      • Estimates of probability of success need to be constantly updated
  • 12. Traditional recsys approaches
      • Collaborative filtering: “what is everyone else choosing as a drug target”
      • Content-based filtering: “what are the characteristics of the target”
      • Knowledge-based filtering: “what do we know about the target's role in human disease”
  • 13. We assemble a large-scale knowledge graph from public and AZ internal data
      • Step 1, building the KG: data sources (public, unstructured, AZ knowledge, omics, chemistry, literature) go through deduplication, entity linking, normalization and NLP, with regular data releases and multiple access options
      • Step 2, using the KG: feature extraction (embeddings, gCNN, ..), machine learning model training, recommendations, insights validated in collaboration with scientists, pipeline decision
  • 14. Step 1, building the KG: data sources (public, unstructured, AZ knowledge, omics, chemistry, literature) go through deduplication, entity linking, normalization and NLP, with regular data releases and multiple access options
  • 15. Delivering a scalable, modular, cloud-based graph creation pipeline, with automated publishing, analysis and reporting. The platform democratizes BIKG by facilitating easy knowledge addition, graph build, interrogation and evaluation. (David Geleta)
      • Inputs: files, DB & API queries, schemas
      • Outputs: reports, graph(s), dashboard & search, files (nodes & edges)
  • 16. KG pipeline on Databricks
      • Series of prototype notebooks ported to Databricks and chained to form the BIKG creation pipeline
      • Fast, reproducible KG production: one order of magnitude speed improvement
      • Input (source files) and output (source parsers, node & edge deduplication) files stored on DBFS
      • KG quality control visualization with Databricks Dashboards: provides overview and in-depth views
  • 17. Pipeline: a series of notebooks
  • 18. Pipeline stages
      • (0) Source acquisition: sources are updated
      • (1) Parsing: each specified source is parsed into a set of nodes and edges; inputs differ: multiline JSON, JSON, RDF, APIs, etc.
      • (2) Matching & deduplication
          • Nodes: matched on labels and IDs
          • Edges: using deduplicated nodes, source and destination nodes are identified
      • (3) Evaluation: the resulting KG is analyzed for completeness, correctness, etc.
      • (4) Projections: the KG is transformed into several forms: nodes & edges CSVs, GraphX graph frames, RDF ontologies, etc.
      • Dashboard: visualize QC metrics
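
The matching and deduplication stage above can be illustrated with a small, plain-Python sketch. This is not the BIKG implementation (which runs as Databricks notebooks): the record fields `id`, `label` and `xrefs`, the identifiers, and the rule that the smallest id becomes canonical are all assumptions made for illustration.

```python
def deduplicate_nodes(nodes):
    """Merge nodes that share an identifier or a case-insensitive label.

    `nodes` is a list of dicts with hypothetical keys "id", "label", "xrefs".
    Returns a dict mapping every original id to its canonical id.
    """
    parent = {n["id"]: n["id"] for n in nodes}

    def find(x):
        # Follow parent links to the root, compressing the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            # Assumption: the lexicographically smallest id is canonical.
            parent[max(ra, rb)] = min(ra, rb)

    seen = {}  # match key -> id of the first node that carried it
    for n in nodes:
        keys = [("label", n["label"].strip().lower())]
        keys += [("id", x) for x in [n["id"], *n.get("xrefs", [])]]
        for key in keys:
            if key in seen:
                union(seen[key], n["id"])
            else:
                seen[key] = n["id"]
    return {node_id: find(node_id) for node_id in parent}


def deduplicate_edges(edges, canonical):
    """Rewrite edge endpoints to canonical ids and drop exact duplicates."""
    unique = {(canonical[s], rel, canonical[d]) for s, rel, d in edges}
    return sorted(unique)


# Hypothetical records from two sources describing the same gene.
nodes = [
    {"id": "A:1", "label": "Gene X", "xrefs": ["B:7"]},
    {"id": "B:7", "label": "gene x", "xrefs": []},
    {"id": "A:2", "label": "Disease Y", "xrefs": []},
]
edges = [
    ("A:1", "associated_with", "A:2"),
    ("B:7", "associated_with", "A:2"),
]
canonical = deduplicate_nodes(nodes)
merged_edges = deduplicate_edges(edges, canonical)
```

Nodes are resolved first because edges can only be compared once both endpoints point at canonical ids, which matches the order given on the slide.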
  • 19. Node dictionary: nodes with all known labels, classification, default identifier, and any other contextual information, excluding provenance.
  • 20. Mappings table
      • Contains all mappings with types
      • Easy to filter by type, provenance
      • Facilitates different strengths of folding (strict 1:1 equivalence, narrow/broad, etc.)
      • Directionality implied by source, target id order & mapping relation type
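
A minimal sketch of how a typed mappings table supports different folding strengths. The `Mapping` fields, the relation labels (`exact`, `narrow`, `broad`), the provenance names and the first-match-wins rule are assumptions for illustration, not the BIKG schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mapping:
    source_id: str   # direction is implied by source -> target order
    target_id: str
    relation: str    # hypothetical labels: "exact", "narrow", "broad"
    provenance: str


STRICT = {"exact"}                    # strict 1:1 equivalence only
LOOSE = {"exact", "narrow", "broad"}  # also fold narrower/broader terms


def fold(mappings, allowed_relations, provenance=None):
    """Return {source_id: target_id} for mappings that pass both filters."""
    folded = {}
    for m in mappings:
        if m.relation not in allowed_relations:
            continue
        if provenance is not None and m.provenance != provenance:
            continue
        # Assumption: the first qualifying mapping for a source id wins.
        folded.setdefault(m.source_id, m.target_id)
    return folded


mappings = [
    Mapping("A:1", "B:7", "exact", "source_a"),
    Mapping("A:3", "B:9", "broad", "source_b"),
]
strict_fold = fold(mappings, STRICT)
loose_fold = fold(mappings, LOOSE)
```

Keeping the relation type on every row means the same table can serve both a conservative graph and a more aggressively folded one without re-parsing the sources.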
  • 21. Edge assertions
      • Contains all edge assertions
      • Easy to filter by type, provenance
      • Directionality implied by source, target id order & relation type
      • Edge types
          • structural: such edges provide ontological classification and can be used for clustering, folding, etc. (e.g. rdfs:subClassOf, skos:broader)
          • mapping
          • "real" edge
  • 22. Keep evidence & context for each assertion
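
One way to keep evidence and context attached to each assertion is to store one row per assertion and group rows by their directed triple, as sketched below. The field names, identifiers and evidence strings are hypothetical; only the three edge types come from the slides.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class EdgeAssertion:
    source_id: str
    relation: str
    target_id: str
    edge_type: str   # "structural", "mapping" or "real"
    provenance: str  # which source made the assertion
    evidence: str    # e.g. a sentence or record id (hypothetical)


def group_evidence(assertions, edge_type=None):
    """Group assertions by directed (source, relation, target) triple.

    Every (provenance, evidence) pair is kept, so merging duplicate
    edges never discards where an assertion came from.
    """
    grouped = defaultdict(list)
    for a in assertions:
        if edge_type is not None and a.edge_type != edge_type:
            continue
        key = (a.source_id, a.relation, a.target_id)
        grouped[key].append((a.provenance, a.evidence))
    return dict(grouped)


assertions = [
    EdgeAssertion("A:1", "associated_with", "A:2", "real", "source_a", "record 12"),
    EdgeAssertion("A:1", "associated_with", "A:2", "real", "source_b", "sentence 3"),
    EdgeAssertion("A:1", "rdfs:subClassOf", "A:9", "structural", "source_a", "ontology"),
]
real_edges = group_evidence(assertions, edge_type="real")
```

Because the key is an ordered triple, (A, r, B) and (B, r, A) stay distinct, which preserves the directionality described above.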
  • 23. Step 1, building the BIKG: data sources (public, unstructured, AZ knowledge, omics, chemistry, literature) go through deduplication, entity linking, normalization and NLP, with regular data releases and multiple access options
  • 25. Literature: a large amount of knowledge relating to drug discovery is unstructured and continuously updated
  • 26. Use natural language processing to extract precise information at scale (https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1005017)
      • Named entity recognition
      • Entity linking
      • Relationship extraction
      • MEDLINE: > 60m abstracts, weekly updates, 300GB, billion entities & relationships
  • 27. NER with Termite; relationship extraction using syntax tree rules; > 500 million relationships (Tim Scrivener)
  • 28. NLP with Termite: Termite is a commercial named entity recognition tool that AZ uses to derive value from unstructured data. Deploying this technology onto a Spark cluster allowed us to simplify our processing architecture and achieve massive scalability. (Richard Jackson)
  • 29. Syntax parsing increases precision of entity recognition
  • 30. Relationships from the literature reduce the sparsity of the biological KG (https://blog.opentargets.org/link/)
  • 31. Language models lead to improvements in recall and precision. Scaling is still an issue:
      • 1. Distribute on Spark/DB using Horovod
      • 2. “Distill” the model down in size
  • 32. Learned sentence representations can be used for downstream tasks (input sentence, then BERT-encoded representation)
      • Distance and clustering
      • Link to the correct biological entity
      • Rank the probability of a target
      • Classify the type of relation
      • Estimate how probable it is
  • 33. Step 2, using the BIKG: feature extraction (embeddings, gCNN, ..), machine learning model training, recommendations, insights validated in collaboration with scientists, pipeline decision
  • 34. Graph embedding pipeline (Anna Gogleva)
      • Ingest knowledge graph data from a blob store
      • Random-split and convert data
      • Train model with PyTorch-BigGraph
      • Evaluate model
      • Generate embeddings
      • Nearest-neighbour search with Faiss (fb)
      • Track artifacts
      • Write model and search results to a blob store
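
The random-split step of the pipeline above could look like the following sketch. The 80/10/10 fractions, the fixed seed and the edge tuples are assumptions; the slides do not state how the split is configured.

```python
import random


def random_split(edges, fractions=(0.8, 0.1), seed=0):
    """Shuffle edges reproducibly and split into train/valid/test lists.

    `fractions` gives the train and validation shares (assumed values);
    whatever remains becomes the test set.
    """
    shuffled = list(edges)
    random.Random(seed).shuffle(shuffled)  # local RNG, global state untouched
    n_train = int(len(shuffled) * fractions[0])
    n_valid = int(len(shuffled) * fractions[1])
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]
    return train, valid, test


# Ten hypothetical gene-disease edges.
edges = [(f"gene:{i}", "associated_with", "disease:D") for i in range(10)]
train, valid, test = random_split(edges)
```

Fixing the seed keeps the split identical across runs, so evaluation metrics from different training runs stay comparable.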
  • 35. Approximate nearest neighbor search (example input node: kidney disease)
      • Generate embeddings
      • Input a query node
      • Retrieve N nearest neighbor nodes of a required type
      • Use case 1: input a disease node, return N nearest gene nodes
      • Use case 2: input a gene list and re-order it based on distance to a disease node
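
The two use cases above can be sketched with an exact, brute-force search in plain Python. Note the substitution: the talk uses approximate search with Faiss, whereas this toy version compares the query against every candidate. The cosine distance, the two-dimensional vectors and the node names are assumptions for illustration.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm


def nearest_of_type(query, embeddings, node_types, wanted_type, n):
    """Use case 1: return the n nodes of `wanted_type` closest to `query`."""
    candidates = [
        node for node, node_type in node_types.items()
        if node_type == wanted_type and node != query
    ]
    candidates.sort(key=lambda node: cosine_distance(embeddings[query], embeddings[node]))
    return candidates[:n]


def rerank(gene_list, disease, embeddings):
    """Use case 2: re-order a gene list by distance to a disease node."""
    return sorted(gene_list, key=lambda g: cosine_distance(embeddings[g], embeddings[disease]))


# Toy embeddings; real ones would come from the trained graph model.
embeddings = {
    "disease:D": [1.0, 0.0],
    "gene:A": [0.9, 0.1],
    "gene:B": [0.0, 1.0],
    "gene:C": [0.5, 0.5],
}
node_types = {"disease:D": "disease", "gene:A": "gene", "gene:B": "gene", "gene:C": "gene"}

top_genes = nearest_of_type("disease:D", embeddings, node_types, "gene", n=2)
ranked = rerank(["gene:B", "gene:C", "gene:A"], "disease:D", embeddings)
```

Brute force costs one distance computation per candidate, which is why an approximate index becomes attractive once the graph has millions of nodes.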
  • 36. Lessons learned
      • Spark lets us scale to millions of data points across disparate sources
      • Being able to add new data quickly helps the feedback loop and improves trust
      • Backend engineers and biologists don’t speak the same language, but working together can be magical
      • We shouldn’t strike a balance between intuitive and comprehensive but instead build products for different audiences
  • 37. Acknowledgements
      • Richard Jackson (NLP pipeline); David Geleta (Graph build pipeline); Anna Gogleva (Embeddings & RecSys pipeline); Tim Scrivener (NLP pipeline); Georgios Gerogiokas (Network analysis & embeddings pipeline); Daniel Goude (Data Ops); Marina Pettersson (Project manager); Erik Jansson (NLP pipeline); Nick Brown (Science IT); Matthew Woodwark; Jonathan Dry (Oncology); Claus Bendtsen (Discovery Sci); Ian Barrett (Discovery Sci); Ian Dix
      • Want to work with data that matters? Go to job-search.astrazeneca.com and search for “knowledge graph”
      • @elipapa, elipapa.github.io, https://www.linkedin.com/in/eliseopapa/
      • Thank you!