Portable and reproducible bioinformatic analysis. Neoantigen discovery.

Portable and reproducible
bioinformatic analysis
Vladimir Kovacevic

Agenda
1. Genome sequencing and bioinformatics
2. Constructing portable and reproducible bioinformatics analysis in Common Workflow Language (CWL)
3. Executing bioinformatic analysis on the cloud (Cancer Genomics Cloud platform)
4. Jupyter Notebook bioinformatic analysis on the cloud
5. Bioinformatic analysis example: Discovery of neoantigen cancer markers in the era of NGS data

1. Genome sequencing and bioinformatics

DNA - the code of life
● DeoxyriboNucleic Acid
● Same in every cell (almost)
● DNA replicates during cell division
● Bases (nucleotides) pair with complementary bases
○ A - T (adenine and thymine)
○ C - G (cytosine and guanine)
● 3 billion base-pairs x 2
CTGGATTATATATAAATACGAAGGGACTAT... etc
● Introns and exons (exons make up ~2% of the genome)
● ~99.6% identical between any two individuals
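
The base-pairing rule (A-T, C-G) means one strand fully determines the other. A minimal sketch of that idea, where the function name `reverse_complement` is our own choice and not something from the slides:

```python
# Complementary base pairs from the slide: A-T and C-G.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(seq: str) -> str:
    """Return the opposite strand, read in its own 5'->3' direction."""
    return "".join(COMPLEMENT[base] for base in reversed(seq.upper()))


if __name__ == "__main__":
    # First ten bases of the example sequence on the slide.
    print(reverse_complement("CTGGATTATA"))  # prints TATAATCCAG
```

Applying the function twice returns the original sequence, which is a quick sanity check for the mapping.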

Genome sequencing
● Digitization of the genome
● Human Genome Project (1990-2003), $3 billion
● Birth of bioinformatics
● Sanger sequencing (first-generation sequencing)
○ Slow (the project took 13 years)
○ Costly ($3 billion for one human genome)
● Currently NGS (next-generation sequencing)
○ Illumina
○ Around $200 and 1 day needed to sequence a genome
● Third-generation sequencing is also in use
○ Longer reads (up to 50k bases)
○ Oxford Nanopore, PacBio
○ Higher error rate
○ Smaller instruments
○ Sequencing in space (Mark and Scott Kelly)

Why perform DNA sequencing?
Stanford University
● Rare genetic diseases
● Origins of humans
● Precision medicine - cancer treatment (immunotherapy)
● Microbes that live inside us (microbiome)
● Study the ways that genomes work

Illumina sequencing
● Read - the sequence of a DNA fragment as read by the sequencer
● Typical whole genome sequencing experiment:
○ 200-500 million reads
○ 50-150 bases (letters) long

Sequencing (sum up)
1. Shearing (fragmentation of the genome)
2. Attaching adapters
3. PCR amplification (optional)
4. Attaching template to surface/flow cell
5. PCR/bridge amplification (cluster creation)
6. Adding fluorescent bases and taking a picture after each cycle (repeated many times)
7. Stack up images and read the sequence

NCBI
Price drop:
● $3 billion (2003)
● ~$200 (2019)
Size of sequenced data

Bioinformatics to the rescue!
Bioinformatics, n. The science of information and information flow in biological systems, esp. of the use of computational methods in genetics and genomics. (Oxford English Dictionary)
Bioinformatics - the use of statistical and computational methods to solve biological problems.

Secondary genomics analysis
● The genomes of all species are sequences of nucleotides (A, T, C, G), i.e. strings
● DNA sequencing returns only fragments of the genome
● Our mission: RECONSTRUCT IT!

Sequencing data - FASTQ file
4 lines for each read
● Read ID
● Read sequence
● "+" separator line
● ASCII-encoded base qualities
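
The four-line record layout can be read with a few lines of Python. This is only a sketch: the function name `parse_fastq` and the example record are made up, and the quality decoding assumes the common Phred+33 encoding, which the slide does not specify.

```python
from typing import Iterator, List, Tuple


def parse_fastq(lines: List[str]) -> Iterator[Tuple[str, str, List[int]]]:
    """Yield (read_id, sequence, qualities) for each 4-line FASTQ record."""
    for i in range(0, len(lines) - 3, 4):
        header, seq, plus, qual = (line.rstrip("\n") for line in lines[i:i + 4])
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"Malformed FASTQ record at line {i + 1}")
        # Assumption: Phred+33 encoding, so quality = ASCII code - 33.
        yield header[1:], seq, [ord(c) - 33 for c in qual]


# A made-up record for illustration.
example = ["@read1", "CTGGATTA", "+", "IIIIFFF#"]
records = list(parse_fastq(example))
print(records)
```

For a real file, the same function can be fed `open(path).readlines()`, although streaming the file would be preferable for data in the 100-500 GB range.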

Genome reconstruction
Result of sequencing experiment
● FASTQ file
● 100-500 GB
● Each read (a 4-line record) contains a sequence 50-250 bp long

Genome reconstruction
How do we reconstruct the genome from reads?
1. Alignment
○ Using a reference genome to map the position of the reads
2. Assembly
○ Reconstructing the genome by finding the links between the reads
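
To make the alignment idea concrete, here is a deliberately naive sketch that slides each read along the reference and reports positions with few mismatches. Real aligners such as BWA-MEM use indexed data structures instead of this brute-force scan. The function name `align_read`, the mismatch threshold, and the example reads are assumptions for illustration.

```python
def align_read(reference: str, read: str, max_mismatches: int = 1) -> list:
    """Return (position, mismatches) pairs where the read fits the reference.

    Positions are 0-based. Every window of the reference is compared with
    the read, so this is only practical for tiny examples.
    """
    hits = []
    for pos in range(len(reference) - len(read) + 1):
        window = reference[pos:pos + len(read)]
        mismatches = sum(1 for a, b in zip(window, read) if a != b)
        if mismatches <= max_mismatches:
            hits.append((pos, mismatches))
    return hits


# Reference taken from the example sequence shown earlier in the deck.
reference = "CTGGATTATATATAAATACGAAGGGACTAT"
for read in ["GGATTAT", "AATACGA", "GGGACTTT"]:
    print(read, align_read(reference, read))
```

The last read differs from the reference at one base, which is exactly the kind of signal that variant calling looks for later.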

What is the pileup?
Pileup is the set of bases aligned to a single position on the genome.

Ideal Variant Calling
● We have “T” in all the reads covering position 1
How can we represent what we have observed?
CONTIG POS REF ALT GT
X 1 T - 0/0
X 2 G - 0/0
X 3 A - 0/0
X 4 C T 1/1
X 5 C A 1/1
X 6 A - 0/0
X 7 G - 0/0
X 8 T C 0/1
X 9 G A,T 1/2
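
The ALT and GT columns can be derived from a pileup with a simple rule. The sketch below is a toy, not how production callers work: the function name `call_genotype` and the 20% allele-fraction cutoff are assumptions, and it ignores base qualities and sequencing errors.

```python
from collections import Counter


def call_genotype(ref: str, pileup: str, min_fraction: float = 0.2):
    """Return (ALT, GT) for one position from the bases piled up on it."""
    counts = Counter(pileup.upper())
    total = sum(counts.values())
    # Keep alleles supported by at least min_fraction of the reads (assumed cutoff).
    alleles = [b for b, n in counts.most_common() if n / total >= min_fraction]
    alts = sorted(b for b in alleles if b != ref)
    if not alts:
        return "-", "0/0"  # only the reference base was observed
    if ref in alleles:
        return alts[0], "0/1"  # heterozygous: reference plus one alternate
    if len(alts) == 1:
        return alts[0], "1/1"  # homozygous alternate
    return ",".join(alts[:2]), "1/2"  # two different alternate alleles


print(call_genotype("T", "TTTTTT"))
print(call_genotype("C", "TTTTTT"))
print(call_genotype("T", "TTTCCC"))
print(call_genotype("G", "AAATTT"))
```

The four calls reproduce the four genotype patterns that appear in the table: 0/0, 1/1, 0/1 and 1/2.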

2. Constructing portable and reproducible
bioinformatics analysis in Common Workflow
Language (CWL)

Common Workflow Language
● CWL is a way to describe how command-line tools are executed
● Every tool has a defined set of inputs and outputs
● Every tool is executed in its own environment (Docker)
● Execution on the cloud or local environment
● Enables portable and reproducible execution
Example (figure): an echo tool with a single input, message, used by a CWL executor
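
A tool description for `echo` with one `message` input can be written as a small document. CWL files are usually YAML, but JSON is also accepted, so the sketch below builds the description as a Python dictionary. The helper `command_line` is our own simplification of what an executor does and is not part of any CWL library.

```python
import json

# The "echo" tool: one string input called "message", passed as the first argument.
echo_tool = {
    "cwlVersion": "v1.0",
    "class": "CommandLineTool",
    "baseCommand": "echo",
    "inputs": {
        "message": {"type": "string", "inputBinding": {"position": 1}},
    },
    "outputs": [],
}

# Job object that supplies a value for each input.
echo_job = {"message": "Hello world!"}


def command_line(tool: dict, job: dict) -> list:
    """Roughly mimic how an executor assembles the command (simplified)."""
    bound = sorted(
        (spec["inputBinding"]["position"], str(job[name]))
        for name, spec in tool["inputs"].items()
        if "inputBinding" in spec
    )
    return [tool["baseCommand"]] + [value for _, value in bound]


print(json.dumps(echo_tool, indent=2))
print(command_line(echo_tool, echo_job))
```

Saving the two dictionaries as JSON files gives a tool and a job that a CWL executor could take as its two arguments.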

What is a CWL workflow?
● Directed acyclic graph of tools connected to perform some analysis
● Workflow’s nodes are:
○ Inputs (file or parameter)
○ Tools
○ Outputs
○ Workflows (nested)
Example (figure): FASTQ and FASTA → BWA-MEM → SAM → SAM2BAM
bwa mem ref.fa read1.fq read2.fq > aln.sam
sam2bam aln.sam > aln.bam
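
Because a workflow is an acyclic graph, an executor can derive a valid run order from the dependencies alone. A minimal sketch using the standard library's `graphlib` (Python 3.9+); the node names simply mirror the two commands in the example and are not a real CWL document.

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Each node maps to the set of nodes it depends on.
workflow = {
    "bwa_mem": {"ref.fa", "read1.fq", "read2.fq"},
    "aln.sam": {"bwa_mem"},
    "sam2bam": {"aln.sam"},
    "aln.bam": {"sam2bam"},
}

# static_order() returns the nodes so that every dependency comes first.
order = list(TopologicalSorter(workflow).static_order())
print(order)
```

If the graph contained a cycle, `TopologicalSorter` would raise `graphlib.CycleError`, which is one reason workflows are required to be acyclic.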

How to build a workflow?
https://github.com/rabix/composer

What is Docker?
● Docker is a lightweight virtual environment
● Allows you to package a tool (e.g. a Python script or a C program) with all of its dependencies into a standardized unit for software development
● Docker containers run on any computer, on any infrastructure
● Layered container structure
● Can directly access resources of the host operating system

Dockerfile
FROM ubuntu:16.04
MAINTAINER vladimir.kovacevic@sbgenomics.com
# Install the download and build dependencies in a single layer
RUN apt-get update && apt-get install -y \
    wget \
    make \
    gcc \
    zlib1g-dev
WORKDIR /opt
# Download and unpack the BWA source
RUN wget https://github.com/bwa/releases/bwa-0.7.15.tar.bz2
RUN tar xfj bwa-0.7.15.tar.bz2
WORKDIR /opt/bwa-0.7.15
RUN make
# Keep a copy of the recipe inside the image
COPY Dockerfile /opt/Dockerfile

# Build image from Dockerfile and push to docker repo
docker build -t images.sbgenomics.com/vladimirk/bwa:0.7.15 .
docker push images.sbgenomics.com/vladimirk/bwa:0.7.15

Common Workflow Language
● Define the inputs and outputs of a command-line tool, its runtime and requirements
● Define how to connect command-line tools into a workflow
● Ensure reproducibility and portability
● Think of CWL as a detailed recipe!

3. Executing bioinformatic analysis on the cloud
(Cancer Genomics Cloud platform)

Cancer Genomics Cloud platform
● Two petabytes of multi-dimensional genomics data available to ~3800 authorized researchers to analyse on the cloud
● The Cancer Genome Atlas (TCGA), a landmark cancer genomics program, molecularly characterized over 20,000 primary cancer and matched normal samples
● Free registration for academia with $300 credit!

...and run it!
PhiX is an icosahedral, non-tailed bacteriophage with a single-stranded DNA genome. Its tiny genome of 5386 nucleotides was the first DNA genome to be sequenced, by Fred Sanger. Because its genome is small and well defined, PhiX is commonly used as a control for Illumina sequencing runs.

So, what just happened?
● A request for the default (c4.2xlarge) instance is sent to AWS
● The instance is initialized
● cwl.job.json is created from the task inputs and parameters
● It is sent, together with cwl.app.json, to the initialized AWS instance
● Input files are downloaded to the AWS instance
● Docker image(s) of the tool(s) are downloaded
● The tool runs inside a Docker container
● Marked outputs are collected and uploaded to the cloud storage attached to our platform’s project
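
The job file step can be pictured as merging file inputs and plain parameters into one JSON object. The slides do not show the actual contents of cwl.job.json, so the sketch below only follows the general CWL convention of describing files as objects with `class` and `path`. The function `build_job`, the input names, the reads file name and the `threads` parameter are all hypothetical.

```python
import json


def build_job(files: dict, params: dict) -> dict:
    """Combine file inputs and plain parameters into one CWL-style job object."""
    job = {name: {"class": "File", "path": path} for name, path in files.items()}
    job.update(params)
    return job


job = build_job(
    files={"reference": "PhiX_genome.fasta", "reads": "phix_reads.fastq"},  # hypothetical inputs
    params={"threads": 8},  # hypothetical parameter
)
print(json.dumps(job, indent=2))
```

Keeping the job separate from the app description is what lets the same app run unchanged on different inputs.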

What about some real data?
WES

...with real analysis!
Workflow (figure): FASTQ and reference genome → BWA-MEM → BAM → BaseRecalibrator (BQSR table) → ApplyBQSR → HaplotypeCaller → Merge VCFs → Variant filtering → Variant Annotation → VCF
Scatter: SBG Prepare Intervals splits the BED into 1.bed, 2.bed, … 22.bed, x.bed, y.bed, mt.bed, giving per-interval BAMs and VCFs that are merged into a single VCF

Whole exome sequencing execution

HLA Typing
● The HLA gene family provides instructions for making a group of related proteins known as the human leukocyte antigen (HLA) complex.
● The HLA complex helps the immune system distinguish the body's proteins from proteins made by foreign invaders such as viruses and bacteria.
● HLA typing has been widely used for reducing the risk of organ rejection
● Specific HLA variants are associated with both autoimmune (e.g. type 1 diabetes, rheumatoid arthritis) and infectious (e.g. HIV, Hepatitis C) diseases

Local executor
Runnable from the command line
Suitable for local testing and development
./rabix [OPTIONS] <app> <inputs>
rabix.io
https://github.com/rabix/bunny

4. Jupyter Notebook bioinformatic analysis on the
cloud

Interactive analysis
Run Python/R Jupyter Notebooks on the cloud
Further process outputs from bioinformatics tasks
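pattern = 'ACCT'

# Read the PhiX genome, skipping FASTA header lines (they start with '>')
# and stripping the newline at the end of each sequence line.
with open('/sbgenomics/project-files/PhiX_genome.fasta', 'r') as myfile:
    data = ''.join(line.strip() for line in myfile if not line.startswith('>'))

# Count occurrences of the pattern, including overlapping ones.
cnt = 0
for i in range(len(data) - len(pattern) + 1):
    if data[i:i + len(pattern)] == pattern:
        cnt += 1
print(cnt)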

Microbiome Differential Abundance Analysis
Detect microbes that are differentially abundant between disease and control samples (~500 each)

5. Bioinformatic analysis example: Discovery of
neoantigen cancer markers in the era of NGS data

What is cancer?
A mutation (error) during DNA replication can land in:
1. An intron (no change)
2. An important gene (cell dies, organism lives)
3. A gene whose mutation stops the regulation of cell division (cell lives, organism...)
What causes cancer (increases probability of mutation)?
1. EM radiation
2. Chemical agents
3. Free radicals
4. Genetic factors
5. Infections (viruses)
A dividing lung cancer cell.
Credit: National Institutes of Health

Cancer cells
Our body develops thousands of cancer cells every day.
OMG! OMG! OMG!

Neoantigens
● Neoantigens - proteins presented only by cancer cells
● When a neoantigen is known -> immune T-cells can be “programmed” to destroy cancer cells
● These unique cancer markers could be a key to developing a new generation of personalized, targeted cancer immunotherapies
Yugang Guo et al., Neoantigen Vaccine Delivery for Personalized Anticancer Immunotherapy

How can we discover neoantigens?
1. Reconstruct the DNA of tumor tissue, the DNA of normal tissue, and the RNA of tumor tissue

2. Compare DNA from Tumor and Normal tissue
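
At its simplest, the tumor versus normal comparison is a set difference: keep the variants called in the tumor that are absent from the matched normal sample. The sketch below shows only that core idea, since dedicated somatic callers also model sequencing error and tumor purity. The function name and the variant tuples are made up for illustration.

```python
def somatic_variants(tumor: set, normal: set) -> set:
    """Variants seen in the tumor but absent from the matched normal sample."""
    return tumor - normal


# Variants as (contig, position, ref, alt); the values are made up for illustration.
normal_calls = {("X", 8, "T", "C")}
tumor_calls = {("X", 8, "T", "C"), ("X", 4, "C", "T"), ("X", 5, "C", "A")}

print(sorted(somatic_variants(tumor_calls, normal_calls)))
```

The variant shared by both samples is treated as germline and dropped, leaving two tumor-only candidates.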

3. Protein extraction
Somatic variants (mutations present in tumor): C → T, A → T (ignored), G → TC, A → GG
DNA: start codon, exon 1, intron, exon 2, intron, exon 3, stop codon
RNA: start codon, exon 1, exon 2, exon 3, stop codon
Protein: MCYEVILQNFHGVAKKRTGYHYKVGRGRALLSVES
Extracted peptide: ILQNFHGVAKKRTGYHYKVGR
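
Protein extraction comes down to applying the somatic variant to the coding sequence and translating codons into amino acids. The sketch below uses the standard genetic code. The short coding sequence and the chosen substitution are invented for illustration and are not taken from the figure.

```python
from itertools import product

BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code: codons enumerated in TCAG order map onto AMINO_ACIDS.
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate(cds: str) -> str:
    """Translate a coding sequence into protein, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(cds) - 2, 3):
        amino_acid = CODON_TABLE[cds[i:i + 3].upper()]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


def apply_snv(cds: str, pos: int, alt: str) -> str:
    """Replace the base at 0-based position pos with alt."""
    return cds[:pos] + alt + cds[pos + 1:]


normal_cds = "ATGTGCTATGAAGTGTAA"           # made-up sequence encoding MCYEV
tumor_cds = apply_snv(normal_cds, 4, "A")   # hypothetical somatic G -> A change

print(translate(normal_cds))  # prints MCYEV
print(translate(tumor_cds))   # prints MYYEV
```

Only peptides that differ between the two translations are neoantigen candidates, which is why synonymous and intronic variants can be ignored at this step.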

4. Discover HLA type from genome (translates to MHC molecule)
5. Perform scoring of protein-HLA sets
Mutation | HLA type | Peptide | NetMHC score | Pickpocket score | NetCTLPan score | RNA expression
1_111957245_C_A | HLA-A*02:01 | MMLSSSPV | 0.881 | 0.633 | 1.09815 | 11.5
8_144392368_T_C | HLA-A*02:01 | WLLEKLEQL | 0.828 | 1.097 | 1.06133 | 12.5
17_28537638_C_T | HLA-A*02:01 | VLDEFPHV | 0.836 | 0.374 | 1.06015 | 23

Neoantigen workflow validation
Tumor Neoantigen Selection Alliance (TESLA) Challenge
Flow-cytometry-validated protein-HLA sets (>10 patients)
The SBG Neoantigen workflow detected and ranked highly the majority of confirmed neoantigens

Neoantigen cancer vaccine
Yugang Guo et al., Neoantigen Vaccine Delivery for Personalized Anticancer Immunotherapy
Status | Indication | Reference
Phase I | Melanoma (stage III and IV) | Ugur Sahin et al., Personalized RNA mutanome vaccines mobilize poly-specific therapeutic immunity against cancer
Phase I | Melanoma (stage IIIB/C and IVM1a/b) | Patrick A. Ott et al., An immunogenic personal neoantigen vaccine for patients with melanoma
Preclinical study | MC-38 colon cancer | Mahesh Yadav et al., Predicting immunogenic tumour mutations by combining mass spectrometry and exome sequencing
Preclinical study | B16F10 melanoma | Mutant MHC class II epitopes drive therapeutic immune responses to cancer
Preclinical study | A2.DR1 sarcoma | A vaccine targeting mutant IDH1 induces antitumour immunity
Preclinical study | B16F10 melanoma | Exploiting the Mutanome for Tumor Vaccination
Phase I | Melanoma (stage III) | Beatriz M. Carreno et al., A dendritic cell vaccine increases the breadth and diversity of melanoma neoantigen-specific T cells

Neoantigen cancer vaccine
More than 20 gene therapy drugs have obtained FDA approval:
● Novartis - 83% (52/63) of patients achieved complete or partial remission
● Advaxis - targets hotspot mutations that commonly occur in specific cancer types. More than 10 drug candidates have been designed for different tumor types in the ADXS-HOT program
Cons of immunotherapy
● Autoimmune disease
● Very expensive

Thank you!
We are hiring - Bioinformatics Analyst
Questions?
@vladimir_bio
