Genome simulation and applications

Genome (FASTQ and VCF) Simulation & Applications
Hariprasad Radhakrishnan
AstraZeneca, Technology Labs, UK

Genome Simulation
AstraZeneca
We are a global, science-led
biopharmaceutical business
pushing the boundaries of science
to deliver life-changing medicines.
61,500 employees worldwide
$23bn 2016 Revenue*
100+ Countries

Our Experiment
• Quick introduction to DNA and DNA sequencing
• Synthetically generated genome data (FASTQ & VCF)
• How to scale and run distributed compute using Kubernetes and
Docker
Hari works as an Associate Architect - Data &
Analytics in the UK Tech Incubation Lab of
AstraZeneca.
Introduction

DNA & How it is Sequenced

AGCCCCTCAGGAGTCCGGCCACATGGAAACTCCTCATTCCGGAGGTCAGTCAGATTTACCCTGGCTCACCTTGGCGTCGCG
TCCGGCGGCAAACTAAGAACACGTCGTCTAAATGACTTCTTAAAGTAGAATAGCGTGTTCTCTCCTTCCAGCCTCCGAAAA
ACTCGGACCAAAGATCAGGCTTGTCCGTTCTTCGCTAGTGATGAGACTGCGCCTCTGTTCGTACAACCAATTTAGGTGAGT
TCAAACTTCAGGGTCCAGAGGCTGATAATCTACTTACCCAAACATAG
Deoxyribonucleic acid (DNA) is the chemical
inside the nucleus of all cells that carries the
genetic instructions for making living
organisms. A DNA molecule consists of two
strands that wrap around each other to resemble
a twisted ladder.
Human DNA


In a perfect world (just your 3 billion letters):
about 700 megabytes
In the real world, right off the genome sequencer:
100 to 200 gigabytes
Human Genome Sequencing
For perhaps 100,000+ samples:
10 to 20 petabytes
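The storage figures above can be checked with simple arithmetic. The sketch below assumes 2 bits per base (enough to distinguish A, C, G and T) for the "perfect world" case; that encoding is an assumption used for illustration, not something stated on the slide. The per-sample range of 100 to 200 gigabytes and the 100,000 samples are the slide's own figures.

```python
# Back-of-the-envelope storage arithmetic for the figures quoted above.
# Assumption (not from the slides): 2 bits per base for the ideal case.

BASES_PER_GENOME = 3_000_000_000  # "3 billion letters"
BITS_PER_BASE = 2                 # four possible letters: A, C, G, T

MB = 10**6
GB = 10**9
PB = 10**15


def packed_genome_bytes(bases=BASES_PER_GENOME, bits_per_base=BITS_PER_BASE):
    """Bytes needed to store one genome with a fixed number of bits per base."""
    return bases * bits_per_base // 8


def cohort_bytes(samples, bytes_per_sample):
    """Total raw storage for a cohort of sequenced samples."""
    return samples * bytes_per_sample


if __name__ == "__main__":
    ideal = packed_genome_bytes()
    low = cohort_bytes(100_000, 100 * GB)
    high = cohort_bytes(100_000, 200 * GB)
    print(f"Ideal genome: {ideal / MB:.0f} MB")
    print(f"100,000 samples: {low / PB:.0f} to {high / PB:.0f} PB")
```

Under that assumption one genome packs into 750 MB (roughly 715 MiB), in line with the figure of about 700 megabytes, and the cohort total comes to 10 to 20 petabytes.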

Simulation and the need for it

• Simulate high-throughput genome sequencing.
• Use the generated genomes to test existing pipelines and
infrastructure.
• Run in distributed mode, the simulation can generate enough data to
create near-production scenarios; that data can be used to test
ingestion, processing through the pipeline and the subsequent analytic
tools.
• The data is synthetic, so there are no issues with privacy, patient
de-identification or transfer across regions.
Why Simulate Genomic Data

Human Genome – Alignment – Variant Calling

Our Little Experiment

• VarSim was picked as the tool for genome simulation because it
provides ways to introduce variations randomly into the simulation,
so the output FASTQ and VCF files of each run are unique.
• Other tools were considered, but they were either not maintained or
did not provide sufficient flexibility.
Tool Selection

References
John C. Mu, Marghoob Mohiyuddin, Jian Li, Narges Bani Asadi, Mark
B. Gerstein, Alexej Abyzov, Wing H. Wong, and Hugo Y.K. Lam.
VarSim: A high-fidelity simulation and validation framework for
high-throughput genome sequencing with cancer applications.
Bioinformatics, first published online December 17, 2014.
doi:10.1093/bioinformatics/btu828
Summary:
VarSim is a framework for assessing alignment and
variant calling accuracy in high-throughput genome
sequencing through simulation or real data. In contrast
to simulating a random mutation spectrum, it
synthesizes diploid genomes with germline and somatic
mutations based on a realistic model. This model
leverages information such as previously reported
mutations to make the synthetic genomes biologically
relevant.
VarSim

• VarSim is a Python/Java based tool that simulates one genome per run.
• We looked into ways to parallelize the runs to generate more
genomes.
• We built a Docker image for the tool.
• The experiment was run on Google Cloud Container Engine.
• Parameters such as coverage, unique ID and seed value were externalized in a
Lambda-style function (Google Cloud Functions, called over HTTP) that the
Docker containers query for their arguments before execution.
• Output FASTQ files and VCFs are then stored in Cloud Storage.
• A run can generate either FASTQ and VCF, or just VCF.
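The parameter hand-off described above, where each container asks a function for its arguments before it starts, could look roughly like the following. This is a minimal sketch, not the code used in the experiment: the JSON field names (`id`, `seed`, `coverage`, `vcf_only`), the `RunParams` type and the endpoint URL passed to `fetch_params` are all hypothetical.

```python
import json
from dataclasses import dataclass
from urllib.request import urlopen


@dataclass(frozen=True)
class RunParams:
    """Arguments for one simulation run (hypothetical field names)."""
    run_id: str
    seed: int
    coverage: int
    vcf_only: bool


def parse_params(payload: str) -> RunParams:
    """Turn the JSON body returned by the parameter function into RunParams."""
    data = json.loads(payload)
    coverage = int(data.get("coverage", 0))
    vcf_only = bool(data.get("vcf_only", False))
    # Coverage only matters when reads (FASTQ) are going to be simulated.
    if not vcf_only and coverage <= 0:
        raise ValueError("coverage must be positive when FASTQ output is requested")
    return RunParams(
        run_id=str(data["id"]),
        seed=int(data["seed"]),
        coverage=coverage,
        vcf_only=vcf_only,
    )


def fetch_params(url: str, timeout: float = 30.0) -> RunParams:
    """Ask the (hypothetical) HTTP endpoint for this container's parameters."""
    with urlopen(url, timeout=timeout) as response:
        return parse_params(response.read().decode("utf-8"))


if __name__ == "__main__":
    sample = '{"id": "genome-0001", "seed": 42, "coverage": 50, "vcf_only": false}'
    print(parse_params(sample))
```

Handing out a distinct ID and seed per request is what would let many identical containers each produce a different genome.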
Technicalities

[Diagram: the Docker image (4.8 GB) contains the start script, Google Cloud
libraries, Java libraries, SAM Tools, VarSim, Python libraries, the ART
simulator, and the reference genome, insert sequences and annotations. It
talks to Cloud Functions (beta) over HTTP and produces FASTQ and VCF output
files.]
Docker Image

Using Docker Registry on Google Cloud.
Given the size of the Docker image, it made sense to take advantage of
the high network speeds between servers in the cloud for quick
deployment to the Container Engine.
Container Registry
Docker image: 4.8 GB

Using Container Registry on Google Cloud.
Container Clusters
The cluster can reach a maximum size of 1000.

Configure the version of the Docker image to be deployed to the cluster.
Kubernetes takes care of distributing the Docker image to all the instances
in the cluster.
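One way to express this kind of fan-out is a Kubernetes Job that runs the same image many times in parallel. The sketch below builds such a manifest as a Python dictionary and prints it as JSON. It is an illustration under assumptions: the slides do not say which Kubernetes resource type was used, and the image name, Job name and counts are hypothetical.

```python
import json


def simulation_job(image, genomes, parallelism, name="genome-simulation"):
    """Build a batch/v1 Job manifest that runs `genomes` pods of one image.

    `completions` is the number of successful runs wanted in total and
    `parallelism` is how many pods may run at the same time.
    """
    if not 1 <= parallelism <= genomes:
        raise ValueError("parallelism must be between 1 and the number of genomes")
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            "completions": genomes,
            "parallelism": parallelism,
            "template": {
                "spec": {
                    # A finished simulation should not be restarted in place.
                    "restartPolicy": "Never",
                    "containers": [{"name": "simulator", "image": image}],
                }
            },
        },
    }


if __name__ == "__main__":
    # Hypothetical image path; substitute the registry path of the real image.
    job = simulation_job("gcr.io/example-project/varsim:1.0", genomes=1000, parallelism=100)
    print(json.dumps(job, indent=2))
```

Because every pod runs the same image with no per-pod arguments in the manifest, this layout relies on each container obtaining its own ID and seed at start-up.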
Kubernetes - Container Clusters

Architecture

• We have generated around 1000
unique VCF files so far.
• We have around 10 genomes with
both FASTQ and VCF files.
• More FASTQ and VCF files if we
can fund it.
Cost
• It cost $1000 to generate 1000 unique VCF files, or $1 per VCF.
• The cost of generating FASTQ files varies with the coverage required. For
50X coverage the cost works out at around $5 for the FASTQ and VCF. The
cost can be brought down by generating FASTQ files in multiple lanes.
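The unit costs above make budget estimates straightforward. The sketch below uses the two quoted figures, $1 per VCF-only genome and about $5 per genome with FASTQ and VCF at 50X coverage. The function name is hypothetical, and treating the two batches as separate genomes is an assumption.

```python
# Unit costs quoted above, in US dollars.
COST_PER_VCF_USD = 1.0
COST_PER_FASTQ_AND_VCF_50X_USD = 5.0


def estimate_cost_usd(vcf_only_genomes, fastq_and_vcf_genomes_50x):
    """Rough budget for a batch of simulated genomes at the quoted unit costs."""
    return (
        vcf_only_genomes * COST_PER_VCF_USD
        + fastq_and_vcf_genomes_50x * COST_PER_FASTQ_AND_VCF_50X_USD
    )


if __name__ == "__main__":
    # The outcome reported above: about 1000 VCF-only genomes and about 10 full genomes.
    print(estimate_cost_usd(1000, 10))
```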
Outcome

John C. Mu, Marghoob Mohiyuddin, Jian Li, Narges Bani Asadi, Mark B. Gerstein, Alexej Abyzov, Wing H.
Wong, and Hugo Y.K. Lam.
VarSim: A high-fidelity simulation and validation framework for high-throughput genome sequencing
with cancer applications.
Bioinformatics, first published online December 17, 2014. doi:10.1093/bioinformatics/btu828
Thanks to Daniel Bergqvist, Nico Gaviola and Craig Box from Google, and to Mathew Woodwark, Nick Brown,
Rob Hernandez, Sandra Giuliani and Frank Lombardi from AstraZeneca, for supporting this work.
References
Thanks
