Portable and reproducible
bioinformatic analysis
Vladimir Kovacevic
Agenda
1. Genome sequencing and bioinformatics
2. Constructing portable and reproducible bioinformatics analysis in Common Workflow Language (CWL)
3. Executing bioinformatic analysis on the cloud (Cancer Genomics Cloud platform)
4. Jupyter Notebook bioinformatic analysis on the cloud
5. Bioinformatic analysis example: Discovery of neoantigen cancer markers in the era of NGS data
1. Genome sequencing and bioinformatics
DNA - the code of life
● DeoxyriboNucleic Acid
● Same in every cell (almost)
● DNA replicates during cell division
● Bases (nucleotides) pair with their complementary bases
○ A - T (adenine and thymine)
○ C - G (cytosine and guanine)
● 3 billion base-pairs x 2
CTGGATTATATATAAATACGAAGGGACTAT... etc
● Introns and exons (the exome is about 2% of the genome)
● ~99.6% identical between any two individuals
Genome sequencing
● Digitization of the genome
● Human Genome Project (1990-2003), $3 billion
● Birth of bioinformatics
● Sanger sequencing (first-generation sequencing)
○ Slow (the project took 13 years)
○ Costly ($3 billion for one human genome)
● Currently NGS (next-generation sequencing)
○ Illumina
○ Around $200 and one day needed to sequence a genome
● Third-generation sequencing is also in use
○ Longer reads (up to 50k bases)
○ Oxford Nanopore, PacBio
○ Higher error rate
○ Smaller instruments
○ Sequencing in space (Mark and Scott Kelly)
Why perform DNA sequencing?
Stanford University
● Rare genetic diseases
● Origins of humans
● Precision medicine - cancer treatment (immunotherapy)
● Microbes that live inside us (microbiome)
● Studying how genomes work
Illumina sequencing
● Read - a DNA fragment as read out by the sequencer
● Typical whole genome sequencing experiment:
○ 200-500 million reads
○ 50-150 bases (letters) long
Sequencing (Illumina)
Sequencing error
Sequencing (sum up)
1. Shearing (fragmentation of the genome)
2. Attaching adapters
3. PCR amplification (optional)
4. Attaching the template to the surface (flow cell)
5. PCR/bridge amplification (cluster creation)
6. Adding fluorescent bases and taking a picture after each cycle (repeated many times)
7. Stacking up the images and reading the sequence
NCBI
Price drop:
● $3 billion (2003)
● ~$200 (2019)
Size of sequenced data
Bioinformatics to the rescue!
Bioinformatics, n. The science of information and information flow in biological systems, esp. of the use of computational methods in genetics and genomics. (Oxford English Dictionary)
Bioinformatics - the use of statistical and computational methods to solve biological problems.
Secondary genomics analysis
● Genomes of all species are sequences of nucleotides (A, T, C, G), i.e. strings
● DNA sequencing returns only fragments of the genome
● Our mission: RECONSTRUCT IT!
Sequencing data - FASTQ file
4 lines for each read
● Read id
● Read sequence
● + sign
● ASCII encoded quality
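The four-line record layout above can be read with a few lines of Python. This is a minimal sketch, assuming the common Phred+33 quality encoding (the slide only says "ASCII encoded"); the read id, sequence and quality string are made up for illustration.

```python
from io import StringIO

def read_fastq(handle):
    """Yield (read_id, sequence, scores) from a 4-line-per-read FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            break
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # the '+' separator line
        quality = handle.readline().rstrip("\n")
        # Assumption: Phred+33 encoding, i.e. score = ord(character) - 33
        scores = [ord(ch) - 33 for ch in quality]
        yield header[1:], sequence, scores

# Hypothetical record, for illustration only
example = StringIO("@read_1\nACGT\n+\nII5!\n")
records = list(read_fastq(example))
print(records)
```

Each quality character maps to one base of the read, so the sequence and the score list always have the same length.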
Genome reconstruction
Result of sequencing experiment
● FASTQ file
● 100-500 GB
● Each read contains a sequence 50-250 bp long
Genome reconstruction
How do we reconstruct the genome from reads?
1. Alignment
○ Using a reference genome to map the positions of the reads
2. Assembly
○ Reconstructing the genome by finding the links between the reads
Assembly
Alignment
What is the pileup?
The pileup is the set of bases aligned to a single position on the genome.
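A pileup can be sketched as a mapping from reference position to the read bases that cover it. The example below assumes gap-free alignments given as (1-based start position, read sequence); the reads themselves are hypothetical.

```python
from collections import defaultdict

def pileup(aligned_reads):
    """Map each 1-based reference position to the list of read bases covering it.

    Assumption: every read aligns without gaps, so base k of a read that
    starts at position p covers reference position p + k.
    """
    columns = defaultdict(list)
    for start, sequence in aligned_reads:
        for offset, base in enumerate(sequence):
            columns[start + offset].append(base)
    return dict(columns)

# Hypothetical alignments: (1-based start position, read sequence)
reads = [(1, "TGAT"), (2, "GATA"), (3, "ACCA")]
columns = pileup(reads)
print(columns[3], columns[4])
```

Here position 3 is covered by three reads that agree, while the bases piled up on position 4 disagree, which is the situation a variant caller has to interpret.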
Variant Calling
Ideal Variant Calling
● We have “T” in all the reads covering that position (position 1)
How can we represent what we have observed?
CONTIG  POS  REF  ALT  GT
X       1    T    -    0/0
X       2    G    -    0/0
X       3    A    -    0/0
X       4    C    T    1/1
X       5    C    A    1/1
X       6    A    -    0/0
X       7    G    -    0/0
X       8    T    C    0/1
X       9    G    A,T  1/2
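The REF/ALT/GT notation used in the ideal variant calling slides can be reproduced with a deliberately naive caller. This is only a sketch: it assumes a diploid sample and treats an allele as present when at least 20% of the reads support it (an arbitrary threshold chosen for illustration); real variant callers use statistical models instead.

```python
from collections import Counter

def call_genotype(ref_base, pileup_bases, min_fraction=0.2):
    """Return (ALT, GT) for one position from the bases piled up on it.

    Assumptions: diploid sample, and an allele counts as present when at
    least `min_fraction` of the reads support it (a made-up threshold).
    """
    counts = Counter(pileup_bases)
    total = sum(counts.values())
    present = [b for b, n in counts.most_common() if n / total >= min_fraction][:2]
    alts = [b for b in present if b != ref_base]
    # Allele index: 0 is REF, 1..n follow the order of the ALT column
    indices = sorted(0 if b == ref_base else alts.index(b) + 1 for b in present)
    if len(indices) == 1:
        indices *= 2  # homozygous: the same allele on both chromosomes
    return (",".join(alts) or "-", "/".join(map(str, indices)))

print(call_genotype("T", "TTTTTT"))  # all reads match the reference
print(call_genotype("C", "TTTTTT"))  # all reads carry the same alternative
print(call_genotype("T", "TTTCCC"))  # reference and one alternative
print(call_genotype("G", "AAATTT"))  # two different alternatives
```

The four calls correspond to the four genotype patterns in the table: 0/0, 1/1, 0/1 and 1/2.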
2. Constructing portable and reproducible bioinformatics analysis in Common Workflow Language (CWL)
Common Workflow Language
● CWL is a way to describe how command-line tools are executed
● Every tool has a defined set of inputs and outputs
● Every tool is executed in its own environment (Docker)
● Execution on the cloud or in a local environment
● Enables portable and reproducible execution
Figure: an echo tool with a single input, message, as used by a CWL executor
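An echo tool with a message input can be written down as a CWL CommandLineTool description. CWL documents are usually written in YAML, but JSON is accepted as well, so this sketch builds the description as a Python dictionary and prints it as JSON. It mirrors the introductory echo example of the CWL user guide rather than any file shown in the slides.

```python
import json

# A CommandLineTool description for `echo`, following the introductory
# example of the CWL user guide (CWL v1.0).
echo_tool = {
    "cwlVersion": "v1.0",
    "class": "CommandLineTool",
    "baseCommand": "echo",
    "inputs": {
        "message": {
            "type": "string",
            # position 1: the message becomes the first command-line argument
            "inputBinding": {"position": 1},
        }
    },
    "outputs": [],
}

echo_cwl = json.dumps(echo_tool, indent=2)
print(echo_cwl)
```

Saved to a file (for example a hypothetical echo.cwl), this description can be handed to a CWL executor together with an inputs object such as {"message": "Hello"}.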
CWL @ Cloud
What is a CWL workflow?
● An acyclic graph of tools connected to perform an analysis
● A workflow’s nodes are:
○ Inputs (files or parameters)
○ Tools
○ Outputs
○ Workflows (nested)
Why do we need a workflow?
Figure: FASTQ reads and a FASTA reference go into BWA-MEM, which produces SAM; SAM2BAM converts it to BAM
BWA-MEM: bwa mem ref.fa read1.fq read2.fq > aln.sam
SAM2BAM: sam2bam aln.sam > aln.bam
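Because a workflow is an acyclic graph, an executor can derive a valid execution order from the dependencies alone. A minimal sketch using Python's standard graphlib module (Python 3.9+), with the two-step BWA-MEM and SAM2BAM example encoded as a dependency map; the node names are taken from the commands above, the encoding itself is an illustration and not how CWL stores workflows.

```python
from graphlib import TopologicalSorter

# Each node maps to the set of nodes it depends on (its predecessors).
workflow = {
    "BWA-MEM": {"ref.fa", "read1.fq", "read2.fq"},
    "aln.sam": {"BWA-MEM"},
    "SAM2BAM": {"aln.sam"},
    "aln.bam": {"SAM2BAM"},
}

# static_order() returns the nodes so that every node follows its dependencies.
order = list(TopologicalSorter(workflow).static_order())
print(order)
```

The order of the three input files may vary, but BWA-MEM always precedes SAM2BAM and aln.bam always comes last.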
How to build a workflow?
https://github.com/rabix/composer
What is Docker?
● Docker is a lightweight virtual environment
● Allows you to package a tool (e.g. a Python script or a C program) with all of its dependencies into a standardized unit for software development
● Docker containers run on any computer, on any infrastructure
● Layered container structure
● Can directly access the resources of the host operating system
Dockerfile
FROM ubuntu:16.04
MAINTAINER vladimir.kovacevic@sbgenomics.com
# Install the tools needed to download and compile BWA
RUN apt-get update && apt-get install -y wget \
    make \
    gcc \
    zlib1g-dev
WORKDIR /opt
# Download URL as given on the slide
RUN wget https://github.com/bwa/releases/bwa-0.7.15.tar.bz2
RUN tar xfj bwa-0.7.15.tar.bz2
WORKDIR /opt/bwa-0.7.15
RUN make
# Keep a copy of the Dockerfile inside the image
COPY Dockerfile /opt/Dockerfile
# Build image from Dockerfile and push to docker repo
docker build -t images.sbgenomics.com/vladimirk/bwa:0.7.15 .
docker push images.sbgenomics.com/vladimirk/bwa:0.7.15
Common Workflow Language
● Define the inputs and outputs of a command-line tool, its runtime and requirements
● Define how to connect command-line tools into a workflow
● Ensure reproducibility and portability
● Think of CWL as a detailed recipe!
3. Executing bioinformatic analysis on the cloud (Cancer Genomics Cloud platform)
Cancer Genomics Cloud platform
● Two petabytes of multi-dimensional genomics data available to ~3800 authorized researchers to analyse on the cloud
● The Cancer Genome Atlas (TCGA), a landmark cancer genomics program, molecularly characterized over 20,000 primary cancer and matched normal samples
● Free registration for academia with $300 credit!
Let’s build a tool!
...and run it!
PhiX is an icosahedral, non-tailed bacteriophage with a single-stranded DNA genome. Its tiny genome has 5386 nucleotides and was the first DNA genome to be sequenced, by Fred Sanger. Because its genome sequence is small and well defined, PhiX is commonly used as a control for Illumina sequencing runs.
So, what just happened?
● A request for the default (c4.2xlarge) instance is sent to AWS
● The instance is initialized
● cwl.job.json is created from the task inputs and parameters
● Together with cwl.app.json, it is sent to the initialized AWS instance
● Input files are downloaded to the AWS instance
● The Docker image(s) of the tool(s) are downloaded
● The tool is run inside a Docker container
● Marked outputs are collected and uploaded to the cloud storage attached to our platform’s project
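The step in which cwl.job.json is created from the task inputs and parameters can be pictured as follows. This is a sketch under assumptions: it uses the standard CWL input-object convention, where a file is an object with class File and a path; the input names, file names and the threads parameter are hypothetical, and the actual file produced by the platform is not shown in the slides.

```python
import json

def build_job(file_inputs, parameters):
    """Combine file inputs and plain parameters into a CWL-style input object."""
    job = {name: {"class": "File", "path": path} for name, path in file_inputs.items()}
    job.update(parameters)
    return job

# Hypothetical task inputs for an alignment run on the PhiX genome
job = build_job(
    file_inputs={"reference": "PhiX_genome.fasta", "reads": "phix_reads.fastq"},
    parameters={"threads": 8},
)
job_json = json.dumps(job, indent=2, sort_keys=True)
print(job_json)
```

Keeping the inputs in a separate job file is what lets the same tool description be reused unchanged across tasks.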
What about some real data?
WES
...with real analysis!
Figure: whole exome sequencing (WES) workflow. Nodes shown: FASTQ, reference genome, BWA-MEM, BAM, SBG Prepare Intervals, BED (1.bed, 2.bed, …, 22.bed, x.bed, y.bed, mt.bed), Scatter, BaseRecalibrator, BQSR table, ApplyBQSR, BAMs, HaplotypeCaller, VCFs, Merge VCFs, VCF, Variant filtering, VCF, Variant Annotation, VCF
Whole exome sequencing execution
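The scatter step of the WES workflow runs the same tools once per interval file and then merges the per-interval results. The sketch below illustrates that scatter and gather pattern only: call_variants and merge_vcfs are hypothetical stand-ins that manipulate file names, not the real BaseRecalibrator, ApplyBQSR, HaplotypeCaller or Merge VCFs tools.

```python
from concurrent.futures import ThreadPoolExecutor

# Interval files as listed on the slide: 1.bed ... 22.bed, x.bed, y.bed, mt.bed
intervals = [f"{c}.bed" for c in list(range(1, 23)) + ["x", "y", "mt"]]

def call_variants(bed_file):
    """Hypothetical stand-in for the per-interval steps; returns a VCF name."""
    return f"{bed_file.removesuffix('.bed')}.vcf"

def merge_vcfs(vcf_files):
    """Hypothetical stand-in for the Merge VCFs step."""
    return {"merged_from": list(vcf_files)}

# Scatter: one independent job per interval file; gather: merge the results.
with ThreadPoolExecutor(max_workers=4) as pool:
    per_interval_vcfs = list(pool.map(call_variants, intervals))
merged = merge_vcfs(per_interval_vcfs)
print(len(per_interval_vcfs), per_interval_vcfs[:3])
```

Because the intervals are independent, the jobs can run in parallel, and the map preserves the interval order for the merge.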
HLA Typing
● The HLA gene family provides instructions for making a group of related proteins known as the human leukocyte antigen (HLA) complex.
● The HLA complex helps the immune system distinguish the body's proteins from proteins made by foreign invaders such as viruses and bacteria.
● HLA typing has been widely used for reducing the risk of organ rejection
● Specific HLA variants are associated with both autoimmune (e.g. type 1 diabetes, rheumatoid arthritis) and infectious (e.g. HIV, hepatitis C) diseases
Public App Gallery
Local executor
Runnable from the command line
Suitable for local testing and development
./rabix [OPTIONS] <app> <inputs>
rabix.io
https://github.com/rabix/bunny
4. Jupyter Notebook bioinformatic analysis on the cloud
Interactive analysis
Run Python/R Jupyter Notebooks on the cloud
Further process the outputs of bioinformatics tasks
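pattern = 'ACCT'
# Read the FASTA file, skipping header lines (they start with '>')
with open('/sbgenomics/project-files/PhiX_genome.fasta', 'r') as myfile:
    lines = [line.strip() for line in myfile if not line.startswith('>')]
data = ''.join(lines)
# Count occurrences of the pattern (overlapping ones included)
cnt = 0
for i in range(len(data) - len(pattern) + 1):
    if data[i:i + len(pattern)] == pattern:
        cnt += 1
        # Running count and 0-based position of the match
        print(cnt, i)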
Microbiome Differential Abundance Analysis
Detect microbes that are differentially abundant between disease and control samples (~500 each)
5. Bioinformatic analysis example: Discovery of neoantigen cancer markers in the era of NGS data
What is cancer?
A mutation (error) during DNA replication can fall in:
1. An intron (no change)
2. An important gene (the cell dies, the organism lives)
3. A gene whose mutation stops the regulation of cell division (the cell lives, the organism...)
What causes cancer (increases probability of mutation)?
1. EM radiation
2. Chemical agents
3. Free radicals
4. Genetic factors
5. Infections (viruses)
A dividing lung cancer cell.
Credit: National Institutes of Health
Cancer cells
Our body develops thousands of cancer cells every day.
OMG! OMG! OMG!
Neoantigens
● Neoantigens - proteins presented only by cancer cells
● When a neoantigen is known, immune T-cells can be “programmed” to destroy cancer cells
● These unique cancer markers could be a key to developing a new generation of personalized, targeted cancer immunotherapies
Yugang Guo et al., Neoantigen Vaccine Delivery for Personalized Anticancer Immunotherapy
How can we discover neoantigens?
1. Reconstruct the DNA of the tumor, the DNA of normal tissue and the RNA of the tumor tissue
2. Compare the DNA from tumor and normal tissue
Figure: somatic variants (mutations present in tumor): C→T, A→T (ignored), G→TC and A→GG, shown on the DNA (start codon, exon 1, intron, exon 2, intron, exon 3, stop codon), on the RNA (exon 1, exon 2, exon 3) and on the protein MCYEVILQNFHGVAKKRTGYHYKVGRGRALLSVES, together with the shorter sequence ILQNFHGVAKKRTGYHYKVGR
3. Protein extraction
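Protein extraction can be pictured as collecting every short peptide that contains the altered amino acid. A minimal sketch, assuming peptide lengths of 8 to 11 residues (typical for MHC class I binding) and a hypothetical mutated position; only the protein sequence is taken from the slide. The 21-residue sequence on the slide equals protein[5:26], which is what a window of 10 residues on either side of index 15 would give, but that the mutation sits there is an assumption.

```python
def peptides_covering(protein, mutated_index, lengths=(8, 9, 10, 11)):
    """Return every peptide of the given lengths that contains the mutated residue."""
    peptides = []
    for k in lengths:
        # Earliest and latest start positions whose window still covers the residue
        first = max(0, mutated_index - k + 1)
        last = min(mutated_index, len(protein) - k)
        for start in range(first, last + 1):
            peptides.append(protein[start:start + k])
    return peptides

protein = "MCYEVILQNFHGVAKKRTGYHYKVGRGRALLSVES"  # sequence shown on the slide
mutated_index = 15  # hypothetical 0-based position of the altered amino acid
candidates = peptides_covering(protein, mutated_index, lengths=(8,))
print(candidates)
```

Each candidate peptide would then be scored against the patient's HLA type in the following steps.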
4. Discover the HLA type from the genome (translates to the MHC molecule)
5. Score the protein-HLA sets
Mutation         HLA type     Peptide    NetMHC score  PickPocket score  NetCTLpan score  RNA expression
1_111957245_C_A  HLA-A*02:01  MMLSSSPV   0.881         0.633             1.09815          11.5
8_144392368_T_C  HLA-A*02:01  WLLEKLEQL  0.828         1.097             1.06133          12.5
17_28537638_C_T  HLA-A*02:01  VLDEFPHV   0.836         0.374             1.06015          23
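Once every peptide-HLA pair has its scores, the candidates can be filtered and ranked. The slides do not say how the scores are combined, so the sketch below is an illustration only: it assumes that higher scores are better, that the three predictor scores can simply be averaged, and that 10.0 is a reasonable expression cut-off. The rows are copied from the table above.

```python
# (mutation, HLA type, peptide, NetMHC, PickPocket, NetCTLpan, RNA expression)
rows = [
    ("1_111957245_C_A", "HLA-A*02:01", "MMLSSSPV", 0.881, 0.633, 1.09815, 11.5),
    ("8_144392368_T_C", "HLA-A*02:01", "WLLEKLEQL", 0.828, 1.097, 1.06133, 12.5),
    ("17_28537638_C_T", "HLA-A*02:01", "VLDEFPHV", 0.836, 0.374, 1.06015, 23.0),
]

def rank(rows, min_expression=10.0):
    """Keep expressed candidates and sort by mean predictor score, best first.

    Assumptions: higher scores are better, the three scores are comparable
    enough to average, and the expression cut-off is arbitrary.
    """
    expressed = [r for r in rows if r[6] >= min_expression]
    return sorted(expressed, key=lambda r: sum(r[3:6]) / 3, reverse=True)

ranked = rank(rows)
for mutation, hla, peptide, *scores, expression in ranked:
    print(peptide, round(sum(scores) / 3, 3), expression)
```

Raising min_expression drops candidates whose mutated gene is barely transcribed in the tumor, regardless of how well the peptide is predicted to bind.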
Neoantigen workflow
Neoantigen workflow validation
Tumor Neoantigen Selection Alliance (TESLA) Challenge
Flow-cytometry-validated protein-HLA sets (>10 patients)
The SBG Neoantigen workflow detected and ranked highly the majority of confirmed neoantigens
Neoantigen cancer vaccine
Yugang Guo et al., Neoantigen Vaccine Delivery for Personalized Anticancer Immunotherapy
Status | Indication | Reference
Phase I | Melanoma (stage III and IV) | Ugur Sahin et al., Personalized RNA mutanome vaccines mobilize poly-specific therapeutic immunity against cancer
Phase I | Melanoma (stage IIIB/C and IVM1a/b) | Patrick A. Ott et al., An immunogenic personal neoantigen vaccine for patients with melanoma
Preclinical study | MC-38 colon cancer | Mahesh Yadav et al., Predicting immunogenic tumour mutations by combining mass spectrometry and exome sequencing
Preclinical study | B16F10 melanoma | Mutant MHC class II epitopes drive therapeutic immune responses to cancer
Preclinical study | A2.DR1 sarcoma | A vaccine targeting mutant IDH1 induces antitumour immunity
Preclinical study | B16F10 melanoma | Exploiting the Mutanome for Tumor Vaccination
Phase I | Melanoma (stage III) | Beatriz M. Carreno et al., A dendritic cell vaccine increases the breadth and diversity of melanoma neoantigen-specific T cells
More than 20 gene therapy drugs have obtained FDA approval:
● Novartis - 83% (52/63) of patients in complete or partial remission
● Advaxis - targets hotspot mutations that commonly occur in specific cancer types. More than 10 drug candidates have been designed for different tumor types in the ADXS-HOT program
Cons of immunotherapy
● Autoimmune disease
● Very expensive
Thank you!
We are hiring - Bioinformatics Analyst
Questions?
@vladimir_bio

Portable and reproducible bioinformatic analysis. Neoantigen discovery.

  • 1. Portable and reproducible bioinformatic analysis Vladimir Kovacevic
  • 2. 2 Agenda 1. Genome sequencing and bioinformatics 2. Constructing portable and reproducible bioinformatics analysis in Common Workflow Language (CWL) 3. Executing bioinformatic analysis on the cloud (Cancer Genomics Cloud platform) 4. Jupyter Notebook bioinformatic analysis on the cloud 5. Bioinformatic analysis example: Discovery of neoantigen cancer markers in the era of NGS data
  • 4. 4 DNA - the code of life ● DeoxyriboNucleic Acid ● Same in every cell (almost) ● DNA replicates during cell division ● Base (nucleotide) complementary bases ○ A - T (adenine and thymine) ○ C - G (cytosine and guanine) ● 3 billion base-pairs x 2 CTGGATTATATATAAATACGAAGGGACTAT... etc ● Intron and exom (2%) ● ~99.6% same between 2 individuals
  • 5. Genome sequencing ● Digitalization of genome ● Human Genome Project (1990-2003), 3B $ ● Birth of bioinformatics ● Sanger sequencing (First generation sequencing) ○ Long (took 13 years) ○ Costly (3B$ for one human genome) ● Currently NGS (next generation sequencing) ○ Illumina ○ Around 200$ and 1 day needed to sequence the genome ● Also third generation sequencing in use ○ Longer read-length (up to 50k base) ○ Oxford nanopore, PacBio ○ Higher error rate ○ Smaller in size ○ Sequencing in space (Mark and Scott Kelly) 5
  • 6. 6 6 Why perform DNA sequencing? Stanford University ● Rare genetic diseases ● Origins of humans ● Precision medicine - Cancer treatment (immunotherapy) ● Microbes that live inside us (microbiome) ● Study ways that genomes work
  • 7. Illumina sequencing ● Read - DNA fragment after reading it in sequencer ● Typical whole genome sequencing experiment: ○ 200-500 million reads ○ 50-150 bases (letters) long 7
  • 13. Sequencing (sum up) 1. Shearing (fragmentation of the genome) 2. Attaching adapters 3. PCR amplification (optional) 4. Attaching template to surface/flowcel 5. PCR/bridge amplification (cluster creation) 6. Adding fluorescent bases and taking a picture after each cycle (repeat this many times) 7. Stack up images and read the sequence 13
  • 14. 14
  • 15. 15 NCBI Price drop: ● 3 billion (2003) ● ~$200 (2019) Size of sequenced data
  • 16. 16 Bioinformatics to the rescue! Bioinformatics, n. The science of information and information flow in biological systems, esp. of the use of computational methods in genetics and genomics. (Oxford English Dictionary) Bioinformatics - using statistical and computing methods that aim to solve biological problems.
  • 17. Secondary genomics analysis ● Genomes of the all species are arrays of nucleotides (A, T, C, G) - strings ● The process of DNA sequencing returns only fragments of it ● Our mission: RECONSTRUCT IT! 17
  • 18. Sequencing data - FASTQ file 4 lines for each read ● Read id ● Read sequence ● + sign ● ASCII encoded quality
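
A minimal sketch of reading the four-line FASTQ layout listed on this slide. The two reads are made up for illustration, and the quality decoding assumes the common Phred+33 ASCII offset, which the slide does not state.

```python
import io

def parse_fastq(handle):
    """Yield (read_id, sequence, quality_scores) from a FASTQ text stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            break  # end of input
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # the '+' separator line
        quality = handle.readline().rstrip("\n")
        # Assumption: Phred+33 encoding (score = ASCII code - 33)
        scores = [ord(ch) - 33 for ch in quality]
        yield header[1:], sequence, scores  # drop the leading '@'

# Hypothetical two-read file
example = "@read1\nACGT\n+\nIIII\n@read2\nTTGA\n+\n!!#I\n"
records = list(parse_fastq(io.StringIO(example)))
for read_id, seq, scores in records:
    print(read_id, seq, scores)
```

Reading one record at a time keeps memory use flat, which matters for files in the size range mentioned on the next slide.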
  • 19. Genome reconstruction Result of sequencing experiment ● FASTQ file ● 100-500 GB ● Each read (line) contains a genome sequence 50-250 bp long
  • 20. Genome reconstruction How do we reconstruct the genome from reads? 1. Alignment ○ Using a reference genome to map the position of the reads 2. Assembly ○ Reconstructing the genome by finding the links between the reads
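
To make the alignment option concrete, here is a toy sketch that places each read at the reference position with the fewest mismatches. It is a brute-force illustration only: the reference, the reads, and the `max_mismatches` threshold are invented, and real aligners such as BWA-MEM use indexed search rather than scanning every position.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

def align_read(reference, read, max_mismatches=1):
    """Return (position, mismatches) of the best gap-free placement, or None."""
    best = None
    for pos in range(len(reference) - len(read) + 1):
        mismatches = hamming(reference[pos:pos + len(read)], read)
        if best is None or mismatches < best[1]:
            best = (pos, mismatches)
    if best is not None and best[1] <= max_mismatches:
        return best
    return None  # read could not be placed confidently

reference = "TGACCAGTCA"  # made-up reference
reads = ["GACC", "CAGT", "AGTGA", "TTTT"]
alignments = {read: align_read(reference, read) for read in reads}
print(alignments)
```

Positions here are 0-based offsets into the reference string; a read that differs from the reference at one base still aligns, which is what later allows variants to be detected.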
  • 23. What is the pileup? Pileup is the set of bases aligned to a single position on the genome.
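
The definition can be sketched in a few lines of Python. The aligned reads below are invented, positions are 1-based, and the sketch assumes gap-free alignments (no insertions or deletions).

```python
from collections import defaultdict

def build_pileup(aligned_reads):
    """Map each reference position to the list of read bases covering it.

    aligned_reads: iterable of (start_position, read_sequence) pairs,
    with 1-based start positions and gap-free alignments.
    """
    pileup = defaultdict(list)
    for start, sequence in aligned_reads:
        for offset, base in enumerate(sequence):
            pileup[start + offset].append(base)
    return dict(pileup)

# Made-up reads aligned to positions 1..5 of a contig
aligned = [(1, "TGAT"), (2, "GATC"), (3, "ACT")]
pileup = build_pileup(aligned)
for position in sorted(pileup):
    print(position, "".join(pileup[position]))
```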
  • 26. Ideal Variant Calling ● We have “T” in all the reads covering position 1
  • 27. Ideal Variant Calling: How can we represent what we have observed?
        CONTIG  POS  REF  ALT  GT
        X       1    T    -    0/0
  • 28. Ideal Variant Calling
        CONTIG  POS  REF  ALT  GT
        X       1    T    -    0/0
        X       2    G    -    0/0
  • 29. Ideal Variant Calling
        CONTIG  POS  REF  ALT  GT
        X       1    T    -    0/0
        X       2    G    -    0/0
        X       3    A    -    0/0
  • 30. Ideal Variant Calling
        CONTIG  POS  REF  ALT  GT
        X       2    G    -    0/0
        X       3    A    -    0/0
        X       4    C    T    1/1
  • 31. Ideal Variant Calling
        CONTIG  POS  REF  ALT  GT
        X       3    A    -    0/0
        X       4    C    T    1/1
        X       5    C    A    1/1
  • 32. Ideal Variant Calling
        CONTIG  POS  REF  ALT  GT
        X       4    C    T    1/1
        X       5    C    A    1/1
        X       6    A    -    0/0
  • 33. Ideal Variant Calling
        CONTIG  POS  REF  ALT  GT
        X       5    C    A    1/1
        X       6    A    -    0/0
        X       7    G    -    0/0
  • 34. Ideal Variant Calling
        CONTIG  POS  REF  ALT  GT
        X       6    A    -    0/0
        X       7    G    -    0/0
        X       8    T    C    0/1
  • 35. Ideal Variant Calling
        CONTIG  POS  REF  ALT  GT
        X       7    G    -    0/0
        X       8    T    C    0/1
        X       9    G    A,T  1/2
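
The REF / ALT / GT columns on these slides can be derived from a pileup with a naive frequency rule, sketched below. This is an illustration, not the method of any real caller: it assumes a diploid sample and a hypothetical `min_fraction` threshold, whereas tools such as HaplotypeCaller use statistical models.

```python
from collections import Counter

def call_genotype(ref_base, pileup_bases, min_fraction=0.2):
    """Return (ALT, GT) for one position from the bases piled up on it."""
    counts = Counter(pileup_bases)
    total = sum(counts.values())
    if total == 0:
        return ("-", "./.")  # no coverage, genotype unknown
    # Keep at most two alleles, each seen in at least min_fraction of reads
    alleles = [b for b, n in counts.most_common(2) if n / total >= min_fraction]
    alts = sorted(b for b in alleles if b != ref_base)
    # Index 0 is REF; ALT alleles are numbered from 1 in the order listed
    index = {ref_base: 0, **{b: i + 1 for i, b in enumerate(alts)}}
    if len(alleles) == 1:
        alleles = alleles * 2  # homozygous: the same allele on both copies
    genotype = "/".join(str(i) for i in sorted(index[b] for b in alleles))
    return (",".join(alts) or "-", genotype)

print(call_genotype("T", "TTTTTT"))  # all reads match the reference
print(call_genotype("C", "TTTTTT"))  # all reads carry one alternative base
print(call_genotype("T", "TTTCCC"))  # half reference, half alternative
print(call_genotype("G", "AAATTT"))  # two different alternative bases
```

The four calls correspond to the four genotype patterns that appear in the tables: 0/0, 1/1, 0/1 and 1/2.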
  • 36. 2. Constructing portable and reproducible bioinformatics analysis in Common Workflow Language (CWL)
  • 37. Common Workflow Language ● CWL is a way to describe the execution of command line tools ● Every tool has a defined set of inputs and outputs ● Every tool is executed in its own environment (Docker) ● Execution on the cloud or in a local environment ● Enables portable and reproducible execution [Figure labels: message, echo, Used by CWL executor]
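
The slide's figure labels mention `message` and `echo`, which suggests a tool description that wraps the `echo` command. As a hedged sketch, the snippet below builds such a CWL v1.0 `CommandLineTool` description as a Python dictionary and prints it as JSON (CWL documents may be written in JSON as well as YAML). The input value is invented, and this is not necessarily the exact document shown in the talk.

```python
import json

# A CommandLineTool with one string input bound to the first positional
# argument of `echo`. Field names follow the CWL v1.0 standard.
echo_tool = {
    "cwlVersion": "v1.0",
    "class": "CommandLineTool",
    "baseCommand": "echo",
    "inputs": {
        "message": {
            "type": "string",
            "inputBinding": {"position": 1},
        }
    },
    "outputs": [],
}

# The matching job object: concrete values for the declared inputs
echo_job = {"message": "Hello world!"}  # hypothetical input value

tool_document = json.dumps(echo_tool, indent=2)
print(tool_document)
print(json.dumps(echo_job))
```

Keeping the tool description separate from the job inputs is what lets the same description be re-run with different inputs, locally or on the cloud.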
  • 39. What is a CWL workflow? ● Directed acyclic graph of tools connected to perform some analysis ● Workflow’s nodes are: ○ Inputs (file or parameter) ○ Tools ○ Outputs ○ Workflow (nested) [Figure: FASTQ and FASTA inputs → BWA-MEM (bwa mem ref.fa read1.fq read2.fq > aln.sam) → SAM → SAM2BAM (sam2bam aln.sam > aln.bam)]
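
Because a workflow is an acyclic graph, an executor can derive a valid run order from the dependencies alone. The sketch below does this for the BWA-MEM and SAM2BAM example using Python's standard `graphlib` module; the node names are invented labels, and this is not how any particular CWL executor is implemented.

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Each node maps to the set of nodes it depends on
workflow = {
    "bwa_mem": {"reads_fastq", "reference_fasta"},
    "aln_sam": {"bwa_mem"},
    "sam2bam": {"aln_sam"},
    "aln_bam": {"sam2bam"},
}

# static_order() yields every node after all of its dependencies
order = list(TopologicalSorter(workflow).static_order())
print(order)
```

If the graph contained a cycle, `static_order()` would raise `graphlib.CycleError`, which is why the acyclic property matters.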
  • 40. Why do we need a workflow?
  • 41. How to build a workflow? https://github.com/rabix/composer
  • 42. What is Docker? ● Docker is a lightweight virtual environment ● Allows you to package a tool (e.g. a Python script or a C program) with all of its dependencies into a standardized unit for software development ● Docker containers run on any computer, on any infrastructure ● Layered container structure ● Can directly access resources of the host operating system
  • 43. Dockerfile
        FROM ubuntu:16.04
        MAINTAINER vladimir.kovacevic@sbgenomics.com
        RUN apt-get update && apt-get install -y wget make gcc zlib1g-dev
        WORKDIR /opt
        RUN wget https://github.com/bwa/releases/bwa-0.7.15.tar.bz2
        RUN tar xfj bwa-0.7.15.tar.bz2
        WORKDIR /opt/bwa-0.7.15
        RUN make
        COPY Dockerfile /opt/Dockerfile

        # Build image from Dockerfile and push to docker repo
        docker build -t images.sbgenomics.com/vladimirk/bwa:0.7.15 .
        docker push images.sbgenomics.com/vladimirk/bwa:0.7.15
  • 44. Common Workflow Language ● Define inputs and outputs of a command line tool, runtime and requirements ● Define how to connect command line tools, creating a workflow ● Ensure reproducibility and portability ● Think of CWL as a detailed recipe!
  • 45. 3. Executing bioinformatic analysis on the cloud (Cancer Genomics Cloud platform)
  • 46. Cancer Genomics Cloud platform ● Two petabytes of multidimensional genomics data available to ~3800 authorized researchers to analyse on the cloud ● The Cancer Genome Atlas (TCGA), a landmark cancer genomics program, molecularly characterized over 20,000 primary cancer and matched normal samples ● Free registration for academia with $300 credit!
  • 48. ...and run it! PhiX is an icosahedral, nontailed bacteriophage with single-stranded DNA. It has a tiny genome of 5386 nucleotides and was the first DNA genome to be sequenced (by Fred Sanger). Due to its small, well-defined genome sequence, PhiX has been commonly used as a control for Illumina sequencing runs.
  • 49. So, what just happened? ● Request for default (c4.2xlarge) instance sent to AWS ● Instance initialized ● cwl.job.json created from task inputs and parameters ● Sent together with cwl.app.json to the initialized AWS instance ● Input files downloaded to the AWS instance ● Docker image(s) of the tool(s) downloaded ● Tool run inside a Docker container ● Marked outputs collected and uploaded to the cloud storage attached to our platform’s project
  • 50. What about some real data? WES (whole exome sequencing)
  • 52. ...with real analysis! [Workflow diagram. Tools: BWA-MEM, SBG Prepare Intervals, Base Recalibrator, ApplyBQSR, Haplotype Caller, Merge VCFs, Variant filtering, Variant Annotation. Data: FASTQ, Reference genome, BAM(s), BED (1.bed, 2.bed … 22.bed, x.bed, y.bed, mt.bed; Scatter), BQSR table, VCF(s)]
  • 54. HLA Typing ● The HLA gene family provides instructions for making a group of related proteins known as the human leukocyte antigen (HLA) complex. ● The HLA complex helps the immune system distinguish the body's proteins from proteins made by foreign invaders such as viruses and bacteria. ● HLA typing has been widely used for reducing the risk of organ rejection ● Specific HLA variants are associated with both autoimmune (e.g. type 1 diabetes, rheumatoid arthritis) and infectious (e.g. HIV, Hepatitis C) diseases
  • 57. Local executor ● Runnable from the command line ● Suitable for local testing and development ./rabix [OPTIONS] <app> <inputs> rabix.io https://github.com/rabix/bunny
  • 58. 4. Jupyter Notebook bioinformatic analysis on the cloud
  • 59. Interactive analysis ● Run Python/R Jupyter Notebook on the cloud ● Further process outputs from bioinformatics tasks
        pattern = 'ACCT'
        # Read the FASTA file, skipping header lines that start with '>'
        with open('/sbgenomics/project-files/PhiX_genome.fasta', 'r') as myfile:
            data = ''.join(line.strip() for line in myfile if not line.startswith('>'))
        # Count occurrences of the pattern (overlapping matches included)
        cnt = 0
        for i in range(len(data) - len(pattern) + 1):
            if data[i:i+len(pattern)] == pattern:
                cnt += 1
        print(cnt)
  • 60. Microbiome Differential Abundance Analysis Detect microbes that are differentially abundant between disease and control samples (~500 each)
  • 61. 5. Bioinformatic analysis example: Discovery of neoantigen cancer markers in the era of NGS data
  • 62. What is cancer? A mutation (error) during DNA replication can fall in: 1. An intron (no change) 2. An important gene (cell dies, organism lives) 3. A gene regulating cell division, so that regulation stops (cell lives, organism...) What causes cancer (increases probability of mutation)? 1. EM radiation 2. Chemical agents 3. Free radicals 4. Genetic factors 5. Infections (viruses) A dividing lung cancer cell. Credit: National Institutes of Health
  • 63. Cancer cells Our body develops thousands of cancer cells every day. OMG! OMG! OMG!
  • 64. Neoantigens ● Neoantigens - proteins presented only by cancer cells ● When a neoantigen is known -> immune T-cells can be “programmed” to destroy cancer cells ● These unique cancer markers could be a key to developing a new generation of personalized, targeted cancer immunotherapies (Yugang Guo et al., Neoantigen Vaccine Delivery for Personalized Anticancer Immunotherapy)
  • 65. How can we discover neoantigens? 1. Reconstruct the DNA of tumor tissue, the DNA of normal tissue, and the RNA of tumor tissue
  • 66. How can we discover neoantigens? 2. Compare DNA from Tumor and Normal tissue
  • 67. How can we discover neoantigens? 3. Protein extraction [Figure: DNA (start codon, exon 1, intron, exon 2, intron, exon 3, stop codon) → RNA (exon 1, exon 2, exon 3) → Protein. Somatic variants (mutations present in tumor): C→T, A→T (ignored), G→TC, A→GG. Protein: MCYEVILQNFHGVAKKRTGYHYKVGRGRALLSVES; peptide shown: ILQNFHGVAKKRTGYHYKVGR]
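
One way to picture this step is to cut windows out of the protein around a mutated residue. The sketch below is an assumption-laden illustration: the slide does not say how its peptide was derived, so `mutated_index` and the 10-residue flank are guesses chosen because they reproduce the peptide on the slide. The second helper lists the short peptides that contain the mutated residue, as candidates for later scoring.

```python
def peptide_window(protein, mutated_index, flank=10):
    """Return the residues within `flank` positions of a mutated residue.

    mutated_index is 0-based; the window is clipped at the protein ends.
    """
    start = max(0, mutated_index - flank)
    end = min(len(protein), mutated_index + flank + 1)
    return protein[start:end]

def kmers_covering(protein, mutated_index, k):
    """All length-k peptides that include the mutated residue."""
    first = max(0, mutated_index - k + 1)
    last = min(mutated_index, len(protein) - k)
    return [protein[i:i + k] for i in range(first, last + 1)]

protein = "MCYEVILQNFHGVAKKRTGYHYKVGRGRALLSVES"  # sequence from the slide
mutated_index = 15  # assumption: chosen so the window matches the slide
window = peptide_window(protein, mutated_index)
nine_mers = kmers_covering(protein, mutated_index, 9)
print(window)
print(nine_mers)
```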
  • 68. How can we discover neoantigens? 4. Discover HLA type from genome (translates to MHC molecule) 5. Perform scoring of protein-HLA sets
        Mutation         HLA type     Peptide    NetMHC score  PickPocket score  NetCTLpan score  RNA expression
        1_111957245_C_A  HLA-A*02:01  MMLSSSPV   0.881         0.633             1.09815          11.5
        8_144392368_T_C  HLA-A*02:01  WLLEKLEQL  0.828         1.097             1.06133          12.5
        17_28537638_C_T  HLA-A*02:01  VLDEFPHV   0.836         0.374             1.06015          23
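
The scoring table can be handled as structured records, as in the sketch below. The rows are copied from the slide; everything else is an assumption made for illustration: that the mutation identifier reads as chromosome_position_ref_alt, that a higher NetMHC score is better, and the ranking rule itself, which is hypothetical and not the workflow's actual scoring.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    mutation: str
    hla_type: str
    peptide: str
    netmhc: float
    pickpocket: float
    netctlpan: float
    rna_expression: float

def parse_mutation(mutation):
    """Split an identifier such as '1_111957245_C_A' into its parts.

    Assumption: the format is chromosome_position_ref_alt.
    """
    chrom, pos, ref, alt = mutation.split("_")
    return chrom, int(pos), ref, alt

# Rows copied from the slide
candidates = [
    Candidate("1_111957245_C_A", "HLA-A*02:01", "MMLSSSPV", 0.881, 0.633, 1.09815, 11.5),
    Candidate("8_144392368_T_C", "HLA-A*02:01", "WLLEKLEQL", 0.828, 1.097, 1.06133, 12.5),
    Candidate("17_28537638_C_T", "HLA-A*02:01", "VLDEFPHV", 0.836, 0.374, 1.06015, 23.0),
]

def rank(cands, min_expression=0.0):
    """Hypothetical rule: drop unexpressed candidates, sort by NetMHC score."""
    expressed = [c for c in cands if c.rna_expression > min_expression]
    return sorted(expressed, key=lambda c: c.netmhc, reverse=True)

ranked = rank(candidates)
for c in ranked:
    print(parse_mutation(c.mutation), c.peptide, c.netmhc)
```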
  • 70. Neoantigen workflow validation ● Tumor Neoantigen Selection Alliance (TESLA) Challenge ● Flow-cytometry-validated protein-HLA sets (>10 patients) ● SBG Neoantigen workflow detected and ranked highly the majority of confirmed neoantigens
  • 71. Neoantigen cancer vaccine (Yugang Guo et al., Neoantigen Vaccine Delivery for Personalized Anticancer Immunotherapy)
        Status | Indication | Reference
        Phase I | Melanoma (stage III and IV) | Ugur Sahin et al., Personalized RNA mutanome vaccines mobilize poly-specific therapeutic immunity against cancer
        Phase I | Melanoma (stage IIIB/C and IVM1a/b) | Patrick A. Ott et al., An immunogenic personal neoantigen vaccine for patients with melanoma
        Preclinical study | MC-38 colon cancer | Mahesh Yadav et al., Predicting immunogenic tumour mutations by combining mass spectrometry and exome sequencing
        Preclinical study | B16F10 melanoma | Mutant MHC class II epitopes drive therapeutic immune responses to cancer
        Preclinical study | A2.DR1 sarcoma | A vaccine targeting mutant IDH1 induces antitumour immunity
        Preclinical study | B16F10 melanoma | Exploiting the Mutanome for Tumor Vaccination
        Phase I | Melanoma (stage III) | Beatriz M. Carreno et al., A dendritic cell vaccine increases the breadth and diversity of melanoma neoantigen-specific T cells
  • 72. Neoantigen cancer vaccine More than 20 gene therapy drugs obtained FDA approval: ● Novartis - 83% (52/63) of patients in complete or partial remission ● Advaxis - targets hotspot mutations that commonly occur in specific cancer types. More than 10 drug candidates have been designed for different tumor types in the ADXS-HOT program Cons of immunotherapy ● Autoimmune disease ● Very expensive
  • 73. Thank you! We are hiring - Bioinformatics Analyst Questions? @vladimir_bio