Managing R&D data on parallel compute infrastructure
Prepared for the 2021 Data + AI Summit
April 6, 2021
Boston +1 617 557 5800
Topics
1. Introduction
2. NGS data persistence strategies
3. NGS mapping and alignment strategies
ZS works closely with its clients to drive customer value and create impact across the organization
— 9,500+ ZSers committed to helping 190+ active clients and their customers thrive
— 1,200+ clients have experienced ZS differentiation across 30 industries in over 90 countries
— 80+ therapy areas of experience
— 100% of the top 50 pharma companies are our clients
— 90%+ of our work is in pharma and medtech

ZS is a Premier Databricks partner with a strong track record of enabling clients to take full advantage of their data by deploying ZS’s proven assets and the Databricks Unified Data Analytics Platform as a one-stop shop for all users involved in an end-to-end data engineering, data analytics, and data science pipeline.
The ZS R&D Excellence team partners with clinical, medical, and scientific clients to discover and develop innovative medicines that improve patients’ lives

R&D areas of excellence
Our experts work side by side with clients, leveraging analytics and technology to create solutions that work in the real world, from R&D to commercialization.
Biomedical research
— Scientific solutions
— Bioinformatics and in-silico solutions
— Scientific and research strategy
— Research and early development technology platforms
— Integrated evidence strategy
— Real world data (RWD) strategy
— Observational research
— Rapid insight solutions

Real world evidence (RWE)
— RWE benchmarking
— Evidence communication
— Actionable RWE
— RWD science

Medical affairs
— Global evidence planning
— Medical org design
— Scientific communication strategy
— Medical science liaison design and support

Global health economics and outcomes research
— Economic modeling
— Value communication strategy
— Patient reported outcomes
— Literature review

Clinical development
— Trial optimization
— Quality risk monitoring
— Biometrics and clinical data tech solutions
— Site and patient engagement
— Digital and virtual strategy
About ZS R&D
— 750+ professionals focused on R&D programs
— $60+ million invested in R&D data, analytics, and technology assets, including the Clinical Design Center
— Working with over 50 clients on R&D programs
Problem statement

Current research and development landscapes in biopharma have been plagued by years of not following the FAIR (findable, accessible, interoperable, reusable) principles of data management. This has limited the ability to fully democratize the use of these data and has stifled progress in drug development and artificial intelligence-based medicine.
Moving raw data to the cloud for analysis
Pipeline stages: LIMS/ELN, base calls, ETL, read preparation, analysis

Mature
— LIMS/ELN: automation, with instrument-to-cloud transfer (metadata capture, resilient key) and systems integration (sample descriptions)
— Base calls: conversion to FASTQ, which can be performed on the instrument (custom processes); FASTQ or BCL files move to the cloud automatically
— ETL: converting FASTQ to dataset objects by defining common models for individual FASTQ reads and persisting them as data products
— Read preparation: preprocessing with dataset parallelism in executors, covering quality control and trimming of adapters and synthetic ligations (see the sketch after this list)
— Analysis: preprocessed data products, mapped to reference data products, with creation of pipeline-specific data products

Immature
— LIMS/ELN: manual handling, with no integration to LIMS and metadata placed in file names
— Base calls: files demultiplexed and converted on the instrument, then shared by email or personal share drives
— ETL: no ETL; all data remains as compressed files
— Read preparation: preprocessing run on a single node, with limited parallelism, significantly long run times, and secondary raw data artifacts
— Analysis: runs against raw reference/genomic features, mapped to the entire reference genome, generating a significant number of useless artifacts
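As a concrete illustration of the mature read-preparation column, here is a minimal PySpark sketch of adapter trimming running as dataset parallelism in executors. The adapter constant, toy reads, and column names are illustrative placeholders, not the production method.

```python
# Minimal sketch (assumption: naive exact-match 3' adapter trimming inside Spark
# executors; production read prep would use dedicated trimmers or fuzzy matching).
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf, col
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("read-prep").getOrCreate()

ADAPTER = "AGATCGGAAGAGC"  # common Illumina adapter prefix, used here only for illustration

@udf(returnType=StringType())
def trim_adapter(sequence):
    """Drop everything from the first occurrence of the adapter onward."""
    idx = sequence.find(ADAPTER)
    return sequence if idx < 0 else sequence[:idx]

# Toy reads standing in for a persisted read dataset; the UDF runs in parallel across executors.
reads = spark.createDataFrame(
    [("r1", "ACGTACGTAGATCGGAAGAGCTTTT"), ("r2", "GGGTTTAAACCC")],
    ["read_id", "sequence"],
)
trimmed = reads.withColumn("trimmed_sequence", trim_adapter(col("sequence")))
trimmed.show(truncate=False)
```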
Spark strategies for raw data ingestion

Spark and Databricks in raw data ingestion
— Scalable clusters that deliver order-of-magnitude reductions in processing time
— The object-oriented nature of Spark Datasets accommodates specific versions of FASTQ headers
— Spark Structured Streaming and stepwise analytics
— Oxford Nanopore and FAST5 sequencing pipelines

Ingesting raw FASTQ to a dataset
— Datasets are parallelizable and fit multiple platforms
— Streaming analytics
— From raw FASTQ to a structured dataset persisted to the data lake (see the sketch below)
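A minimal sketch of this ingestion step, assuming PySpark on Databricks with Delta Lake; the storage path, sample ID, and table name are hypothetical placeholders, and a production pipeline would add schema management, error handling, and streaming triggers.

```python
# Minimal sketch: parse a FASTQ file into a structured DataFrame and persist it
# as a bronze Delta table (path, sample ID, and table name are placeholders).
from pyspark.sql import SparkSession, Row
from pyspark.sql.functions import lit

spark = SparkSession.builder.appName("fastq-ingest").getOrCreate()

def fastq_to_df(path, sample_id):
    """Group every four FASTQ lines (header, sequence, '+', quality) into one record."""
    lines = spark.sparkContext.textFile(path)   # gzip is decompressed transparently
    indexed = lines.zipWithIndex()              # (line, global line number)
    records = (indexed
               .map(lambda lw: (lw[1] // 4, (lw[1] % 4, lw[0])))
               .groupByKey()
               .mapValues(lambda parts: [text for _, text in sorted(parts)])
               .map(lambda kv: Row(read_id=kv[1][0].lstrip("@"),
                                   sequence=kv[1][1],
                                   quality=kv[1][3],
                                   read_length=len(kv[1][1]))))
    return spark.createDataFrame(records).withColumn("sample_id", lit(sample_id))

reads = fastq_to_df("s3://example-bucket/run-123/sample_R1.fastq.gz", "S123")
reads.write.format("delta").mode("append").saveAsTable("ngs_bronze.raw_reads")
```

Because the reads land as a structured Delta table, downstream preprocessing and mapping steps can treat them as ordinary, parallelizable datasets rather than opaque compressed files.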
Databricks as a platform for analysis

Notebooks and user interaction
— Data scientists, process developers, and statisticians can interact with data products
  – Controlling data quality through definitions of external tables
  – Ad hoc analytics through the creation of silver- and bronze-level data products
  – Benchmarking, performance enhancements, and experimental execution

Jobs API
— Highly controlled processes can be created as traditional Spark application artifacts (.jar, .whl, .egg, etc.); see the sketch below
  – Data ingestion from raw sources
  – Demultiplexing
  – Controlled analytics pipelines
— Notebooks can also serve as the sources and definitions of Spark execution jobs
  – Concordance testing
  – Visualizations and human-facing components
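A minimal sketch of launching such a controlled process through the Databricks Jobs API (runs/submit endpoint); the workspace URL, token, cluster sizing, artifact path, main class, and parameters are hypothetical placeholders.

```python
# Minimal sketch: submit a packaged Spark application as a one-off Databricks job run.
# Field names follow the Databricks Jobs API runs/submit endpoint; all values below
# are placeholders, not a real workspace configuration.
import requests

DATABRICKS_HOST = "https://<workspace>.cloud.databricks.com"  # placeholder
TOKEN = "<personal-access-token>"                             # placeholder

payload = {
    "run_name": "amplicon-identification",
    "new_cluster": {
        "spark_version": "7.3.x-scala2.12",
        "node_type_id": "i3.xlarge",
        "num_workers": 8,
    },
    "libraries": [{"jar": "dbfs:/artifacts/ngs-pipeline-assembly.jar"}],  # hypothetical artifact
    "spark_jar_task": {
        "main_class_name": "com.example.ngs.AmpliconPipeline",           # hypothetical class
        "parameters": ["--sample-id", "S123", "--run-date", "2021-04-06"],
    },
}

resp = requests.post(
    f"{DATABRICKS_HOST}/api/2.0/jobs/runs/submit",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=payload,
)
resp.raise_for_status()
print("Submitted run:", resp.json()["run_id"])
```

The same endpoint also accepts notebook_task payloads, which is how notebooks can serve as job definitions as described above.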
Replacing traditional methods for creating scalable, high-throughput, time-sensitive pipelines

[Positioning chart: approaches plotted by mapping throughput (low to high) and aggregation and analysis capability (low to high)]
— BWA, Bowtie, and other single-node alignment methods (the traditional, low-throughput starting point)
— Step-wise execution from datasets
— Creation of data products whose entities can be matched directly to sequence products
— Implementation of matching and identification methods as Spark UDFs (the high-throughput target; see the sketch below)
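A minimal sketch of the high-throughput end of this chart: matching and identification implemented as a Spark UDF over a broadcast reference. The exact-substring rule, feature IDs, and table names are simplifying assumptions; real identification logic would be more sophisticated.

```python
# Minimal sketch (assumption: a small broadcast reference of diagnostic subsequences
# and exact substring matching as the identification rule).
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf, col
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("udf-identification").getOrCreate()

# Hypothetical reference data product: amplicon_id -> diagnostic subsequence
reference = {
    "AMP_0001": "ACGTTGCATTAGG",
    "AMP_0002": "TTGACCGTAAGCT",
}
ref_bc = spark.sparkContext.broadcast(reference)

@udf(returnType=StringType())
def identify_amplicon(sequence):
    """Return the first reference feature whose diagnostic subsequence occurs in the read."""
    for amplicon_id, probe in ref_bc.value.items():
        if probe in sequence:
            return amplicon_id
    return None  # unmatched reads can be filtered or flagged downstream

reads = spark.table("ngs_bronze.raw_reads")  # table name assumed from the earlier ingestion sketch
mapped = reads.withColumn("amplicon_id", identify_amplicon(col("sequence")))
mapped.write.format("delta").mode("overwrite").saveAsTable("ngs_silver.mapped_reads")
```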
Use case for a clinical pipeline: liquid biopsy NGS analysis

Amplicon-based pipeline
— Anticipated onboarding of tens of millions of patients
— Analysis time per patient needs to be in the realm of 1-2 minutes

LIMS / ELN / data transfer strategy
— LIMS integrated directly into the file transfer to cloud storage (ADLS Gen2, S3)
— Metadata and sample information tracked by a resilient key
— Databricks Delta Lake implementation for data products

Amplicon-based analysis
— Approximately 500,000 known sequence features were analyzed
— Data products generated directly from reference sources
— Products carry reference, provenance, and versioning metadata (see the sketch below)
— Mapping strategy implements a UDF to determine sequence matches

Post-mapping and scientific analysis
— Cut the mapping and amplicon identification process from four hours to less than one minute per patient
— Demultiplex flow-cell files and spin up one Databricks cluster per patient
— Stepwise job definitions allow executors to be joined at certain points in the analysis
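A minimal sketch of a reference-derived data product carrying reference, provenance, and versioning metadata in Delta Lake; the feature IDs, sequences, source labels, and column names are assumptions for illustration.

```python
# Minimal sketch: persist a reference data product with provenance and versioning
# columns as a Delta table (all IDs, sequences, and labels are placeholders).
from pyspark.sql import SparkSession
from pyspark.sql.functions import current_timestamp, lit

spark = SparkSession.builder.appName("reference-data-product").getOrCreate()

features = spark.createDataFrame(
    [("FEAT_000001", "ACGTTGCATTAGG"), ("FEAT_000002", "TTGACCGTAAGCT")],
    ["feature_id", "sequence"],
)

(features
 .withColumn("reference_source", lit("GRCh38"))        # provenance: originating reference
 .withColumn("reference_version", lit("release-104"))  # versioning: which release was used
 .withColumn("ingested_at", current_timestamp())
 .write.format("delta").mode("overwrite")
 .saveAsTable("ngs_reference.sequence_features"))

# Delta Lake adds table-level versioning on top of the row-level metadata,
# so analyses can be pinned to a specific version of the reference product.
spark.sql("DESCRIBE HISTORY ngs_reference.sequence_features").show(truncate=False)
previous = spark.sql("SELECT * FROM ngs_reference.sequence_features VERSION AS OF 0")
```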
Use case for a clinical pipeline: technical and scientific value

Value delivered across the pipeline stages: raw data ingest, trim adapters, identify and map, post hoc analysis, train ML for AI

Technical value
— R&D data lake: co-located data; archival cloud storage; HA/DR
— QC and read preparation: reuse of open-source methods; Databricks and other vendors creating utilities in Spark
— Parallelization: orders of magnitude of scalability; data no longer tracked in databases
— Enhanced ability to gain value from data: data lakes; structured data catalogs
— Consolidated lake: democratization of data; FAIR principles

Scientific value
— Biologic relevance of data: use of science-friendly languages (Python, R); interface with bins
— Amplification strategies: quality assurance-based methods for amplification; unique parent barcoding
— Novel methods: lower barrier to writing novel methods that support novel science
— Scalable application of powerful open-source technologies: Bioconductor; SparkR
— Growing structured and consolidated lake: variational autoencoders; machine learning (ML) models (see the sketch below)
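A minimal sketch of how the consolidated lake can feed ML training: per-sample amplicon counts are pivoted into a feature table and fit with Spark MLlib. The mapped-reads table comes from the earlier sketches, while the label table, column names, and choice of a simple logistic regression are hypothetical placeholders rather than the models referenced above.

```python
# Minimal sketch: turn mapped reads into per-sample features and train a simple classifier.
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("train-ml").getOrCreate()

# Per-sample amplicon counts from the mapped reads (table assumed from the earlier UDF sketch)
mapped = spark.table("ngs_silver.mapped_reads")
counts = (mapped
          .groupBy("sample_id")
          .pivot("amplicon_id")
          .count()
          .na.fill(0))

# Hypothetical label table: (sample_id, label) with a numeric 0/1 label column
labels = spark.table("clinical.sample_labels")
training = counts.join(labels, "sample_id")

feature_cols = [c for c in counts.columns if c != "sample_id"]
assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")
model = LogisticRegression(labelCol="label").fit(assembler.transform(training))
print("Training AUC:", model.summary.areaUnderROC)
```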
Contact info
Andrew S. Brown, Ph.D.
Strategy & Architecture Manager
https://www.linkedin.com/in/andrew-brown-73917014b/

Thank you
