Genome (FASTQ and VCF) Simulation & Applications
Hariprasad Radhakrishnan
AstraZeneca, Technology Labs, UK
Genome Simulation
AstraZeneca
We are a global, science-led
biopharmaceutical business
pushing the boundaries of science
to deliver life-changing medicines.
61,500
employees worldwide
$23bn
2016 Revenue*
100+
Countries
Our Experiment:
• Quick introduction to DNA and DNA sequencing
• Synthetically generated genome data (FASTQ & VCF)
• How to scale and run distributed compute using Kubernetes and Docker
Hari works as an Associate Architect - Data &
Analytics in the UK Tech Incubation Lab of
AstraZeneca.
Introduction
DNA & How it is Sequenced
AGCCCCTCAGGAGTCCGGCCACATGGAAACTCCTCATTCCGGAGGTCAGTCAGATTTACCCTGGCTCACCTTGGCGTCGCG
TCCGGCGGCAAACTAAGAACACGTCGTCTAAATGACTTCTTAAAGTAGAATAGCGTGTTCTCTCCTTCCAGCCTCCGAAAA
ACTCGGACCAAAGATCAGGCTTGTCCGTTCTTCGCTAGTGATGAGACTGCGCCTCTGTTCGTACAACCAATTTAGGTGAGT
TCAAACTTCAGGGTCCAGAGGCTGATAATCTACTTACCCAAACATAG
Deoxyribonucleic acid (DNA) is the chemical
inside the nucleus of all cells that carries the
genetic instructions for making living
organisms. A DNA molecule consists of two
strands that wrap around each other to resemble
a twisted ladder.
Human DNA
In a perfect world (just your 3 billion letters):
~700 megabytes
In the real world, right off the genome sequencer:
100~200 gigabytes
Human Genome Sequencing
For maybe 100,000+ samples
10~20 petabytes
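The figures above can be checked with a little arithmetic. The sketch below assumes each of the four letters (A, C, G, T) is packed into 2 bits, which is one common way to arrive at a number near the quoted ~700 megabytes; the slides do not say which encoding they had in mind. The per-sample sizes are the 100 to 200 gigabytes quoted above.

```python
# Back-of-the-envelope storage estimate for the figures quoted on the slides.
BASES_PER_GENOME = 3_000_000_000   # "3 billion letters" (from the slides)
BITS_PER_BASE = 2                  # assumption: A, C, G, T packed into 2 bits each


def packed_genome_megabytes(bases=BASES_PER_GENOME, bits_per_base=BITS_PER_BASE):
    """Size of one genome, in decimal megabytes, if every base is bit-packed."""
    return bases * bits_per_base / 8 / 1_000_000


def cohort_petabytes(samples, gigabytes_per_sample):
    """Raw sequencer output for a whole cohort, in decimal petabytes."""
    return samples * gigabytes_per_sample / 1_000_000


if __name__ == "__main__":
    print(packed_genome_megabytes())        # 750.0
    print(cohort_petabytes(100_000, 100))   # 10.0
    print(cohort_petabytes(100_000, 200))   # 20.0
```

750 decimal megabytes is roughly 715 binary megabytes, which is in line with the "~700 megabytes" on the slide, and 100,000 samples at 100 to 200 gigabytes each gives the 10 to 20 petabytes quoted.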
Simulation and the need for it
• Be able to simulate high-throughput genome sequencing.
• The generated genomes can be used to test existing pipelines and infrastructure.
• If run in a distributed mode, the simulation can generate enough data to create near-production scenarios; the data could be used to test ingestion, processing through the pipeline, and the subsequent analytic tools.
• The data is synthetic, so there are no issues with privacy, patient de-identification, or transfer across regions.
Why Simulate Genomic Data
Human Genome – Alignment – Variant Calling
Our Little Experiment
• VarSim was picked as the tool for genome simulation. It provided ways to introduce variations in a random fashion into the simulated genome, so that the output FASTQ and VCF files would be unique from each other.
• Other tools were considered, but they were either not maintained or did not provide sufficient flexibility.
Tool Selection
References
John C. Mu, Marghoob Mohiyuddin, Jian Li, Narges Bani Asadi, Mark B. Gerstein, Alexej Abyzov, Wing H. Wong, and Hugo Y.K. Lam.
VarSim: A high-fidelity simulation and validation framework for high-throughput genome sequencing with cancer applications.
Bioinformatics, first published online December 17, 2014. doi:10.1093/bioinformatics/btu828
Summary:
VarSim is a framework for assessing alignment and
variant calling accuracy in high-throughput genome
sequencing through simulation or real data. In contrast
to simulating a random mutation spectrum, it
synthesizes diploid genomes with germline and somatic
mutations based on a realistic model. This model
leverages information such as previously reported
mutations to make the synthetic genomes biologically
relevant.
VarSim
• VarSim is a Python/Java-based tool that simulates one genome per run.
• We looked into ways in which we could parallelize the runs to generate more genomes.
• Built a Docker container image for the tool.
• The experiment was run on Google Cloud Container Engine.
• Parameters like coverage, unique ID and seed value were externalized in a Lambda function that the Docker containers could talk to and receive arguments from before execution.
• Output FASTQ and VCF files would then be stored in Cloud Storage.
• Ability to choose to generate FASTQ and VCF, or just VCF.
Technicalities
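One way to picture the externalized parameters is a small function that hands each container its own ID and seed, plus a start-script helper that checks the response. This is only a sketch under stated assumptions: the field names, the ID format and the seeding scheme are all hypothetical, and no network call or VarSim command line is shown because the slides do not describe either.

```python
import json
import random


def make_job_parameters(job_index, coverage, base_seed=2017, fastq=True):
    """Return the arguments one container would receive before it starts.

    All field names are hypothetical; the slides only say that coverage,
    a unique ID and a seed value were supplied from outside the image.
    """
    rng = random.Random(base_seed + job_index)  # deterministic for a given job
    return {
        "id": f"sim-{job_index:05d}",       # unique ID, e.g. to name the outputs
        "seed": rng.randrange(2**31),       # seed handed to the simulator
        "coverage": coverage,               # sequencing depth, e.g. 50 for 50X
        "outputs": ["fastq", "vcf"] if fastq else ["vcf"],
    }


def parse_job_parameters(payload):
    """What a start script might do with the JSON body of the HTTP response."""
    params = json.loads(payload)
    missing = {"id", "seed", "coverage", "outputs"} - set(params)
    if missing:
        raise ValueError(f"missing parameters: {sorted(missing)}")
    return params


if __name__ == "__main__":
    body = json.dumps(make_job_parameters(job_index=7, coverage=50))
    print(parse_job_parameters(body)["id"])   # sim-00007
```

Deriving the seed from the job index keeps every run reproducible while still giving each genome different random variations, which is the property the unique output files depend on.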
[Diagram: Docker image, 4.8 GB]
Contents: Start Script, Google Cloud Libraries, Java Libraries, SAM Tools, VarSim, Python Libraries, ART Simulator, Ref Genome / Insert Sequences / Annotations
Connected to: Cloud Functions (beta) over HTTP
Output files: FASTQ, VCF
Docker Image
Using Docker Registry on Google Cloud.
Given the size of the Docker image, it made sense to take advantage of the high network speeds between servers on the cloud for quick deployment to the Container Engine.
Container Registry
Docker image: 4.8 GB
Using Container Registry on Google Cloud.
Container Clusters
The cluster can reach a maximum size of 1000.
Configure the version of the Docker image to be deployed to the cluster.
Kubernetes takes care of distributing the Docker image to all the instances in the cluster.
Kubernetes - Container Clusters
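The slides do not say which Kubernetes object was used to run the containers. One plausible option for "run this image N times to completion" is a batch/v1 Job, sketched below as a Python dictionary that can be dumped to JSON. The job name, container name and image path are placeholders, not values from the original experiment.

```python
import json


def simulation_job_manifest(image, genomes, parallelism):
    """Build a Kubernetes batch/v1 Job that runs `genomes` pods to completion,
    with at most `parallelism` of them running at the same time.

    The job name, container name and image are placeholders for illustration.
    """
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": "genome-simulation"},
        "spec": {
            "completions": genomes,       # total number of successful runs wanted
            "parallelism": parallelism,   # pods allowed to run concurrently
            "template": {
                "spec": {
                    "containers": [{"name": "varsim", "image": image}],
                    "restartPolicy": "Never",
                }
            },
        },
    }


if __name__ == "__main__":
    manifest = simulation_job_manifest(
        "gcr.io/example-project/varsim:1.0",  # hypothetical image path
        genomes=1000,
        parallelism=100,
    )
    print(json.dumps(manifest, indent=2))
```

Because each run simulates exactly one genome, the number of completions maps directly to the number of genomes, and the parallelism can be tuned to the size of the cluster.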
Architecture
• We have around 1000 unique VCF files generated so far.
• We have around 10 genomes with both FASTQ and VCF files.
• More FASTQ and VCF files if we can fund it.
Cost
• It cost $1000 to generate 1000 unique VCF files, or $1 per VCF.
• Costs to generate FASTQ files vary with the coverage required. For 50X coverage the costs work out at around $5 for the FASTQ and VCF. The costs can be brought down by generating FASTQ files in multiple lanes.
Outcome
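The quoted unit costs make budgeting a matter of simple multiplication. The sketch below uses only the two prices stated on the slide ($1 per VCF-only genome, around $5 per genome for FASTQ and VCF at 50X); it deliberately does not extrapolate to other coverage levels, since the slides give no figures for them.

```python
VCF_ONLY_COST = 1.0        # dollars per genome, from "$1 per VCF"
FASTQ_AND_VCF_50X = 5.0    # dollars per genome at 50X coverage (approximate)


def estimate_cost(vcf_only_genomes, fastq_genomes_50x):
    """Approximate spend in dollars for a mix of VCF-only and 50X FASTQ+VCF runs."""
    return (vcf_only_genomes * VCF_ONLY_COST
            + fastq_genomes_50x * FASTQ_AND_VCF_50X)


if __name__ == "__main__":
    # Roughly the output reported so far: ~1000 VCFs and ~10 FASTQ+VCF genomes.
    print(estimate_cost(1000, 10))   # 1050.0
```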
John C. Mu, Marghoob Mohiyuddin, Jian Li, Narges Bani Asadi, Mark B. Gerstein, Alexej Abyzov, Wing H. Wong, and Hugo Y.K. Lam.
VarSim: A high-fidelity simulation and validation framework for high-throughput genome sequencing with cancer applications.
Bioinformatics, first published online December 17, 2014. doi:10.1093/bioinformatics/btu828
Thanks to the folks from Google (Daniel Bergqvist, Nico Gaviola and Craig Box) and to Mathew Woodwark, Nick Brown, Rob Hernandez, Sandra Giuliani and Frank Lombardi from AstraZeneca for supporting this work.
References
Thanks
Thank You

  • 4. DNA & How it is Sequenced Genome Simulation
  • 5. AGCCCCTCAGGAGTCCGGCCACATGGAAACTCCTCATTCCGGAGGTCAGTCAGATTTACCCTGGCTCACCTTGGCGTCGCG TCCGGCGGCAAACTAAGAACACGTCGTCTAAATGACTTCTTAAAGTAGAATAGCGTGTTCTCTCCTTCCAGCCTCCGAAAA ACTCGGACCAAAGATCAGGCTTGTCCGTTCTTCGCTAGTGATGAGACTGCGCCTCTGTTCGTACAACCAATTTAGGTGAGT TCAAACTTCAGGGTCCAGAGGCTGATAATCTACTTACCCAAACATAG Deoxyribonucleic acid (DNA) is the chemical inside the nucleus of all cells that carries the genetic instructions for making living organisms. A DNA molecule consists of two strands that wrap around each other to resemble a twisted ladder. Genome Simulation Human DNA
  • 7. In a perfect world (just your 3 billion letters): ~700 megabytes In the real world, right off the genome sequencer: 100~200 gigabytes Genome Simulation Human Genome Sequencing For maybe 100,000 + samples 10~20 petabytes
  • 8. Simulation and the need for it Genome Simulation
  • 9. • Be able to simulate high-throughput genome sequencing. • The generated genomes can be used to test existing pipelines and infrastructure. • If run in distributed mode, the simulation can generate enough data to create near-production scenarios; the data could be used to test ingestion, processing through the pipeline, and subsequent analytic tools. • The data is synthetic, so there are no issues with privacy, patient de-identification, or transfer across regions. Genome Simulation Why Simulate Genomic Data
  • 10. Genome Simulation Human Genome – Alignment – Variant Calling
  • 12. • VarSim was picked as the tool for genome simulation because it provided ways to introduce variations randomly, so that the output FASTQ and VCF files would be unique from each other. • Other tools were considered, but they were not maintained or did not provide sufficient flexibility. Genome Simulation Tool Selection
  • 13. References John C. Mu, Marghoob Mohiyuddin, Jian Li, Narges Bani Asadi, Mark B. Gerstein, Alexej Abyzov, Wing H. Wong, and Hugo Y.K. Lam. VarSim: A high-fidelity simulation and validation framework for high-throughput genome sequencing with cancer applications. Bioinformatics, first published online December 17, 2014. doi:10.1093/bioinformatics/btu828 Summary: VarSim is a framework for assessing alignment and variant calling accuracy in high-throughput genome sequencing through simulation or real data. In contrast to simulating a random mutation spectrum, it synthesizes diploid genomes with germline and somatic mutations based on a realistic model. This model leverages information such as previously reported mutations to make the synthetic genomes biologically relevant. Genome Simulation VarSim
  • 14. • VarSim is a Python/Java based tool that simulates one genome per run. • We looked into ways in which we could parallelize the process to generate more genomes. • Built a Docker container/image for the tool. • Experiment run on Google Cloud Container Engine. • Parameters like Coverage, Unique ID & Seed value were externalized in a Lambda function that the Docker images could talk to and receive arguments from before execution. • Output FASTQ files and VCFs would then be stored in Cloud Storage. • Ability to choose to generate FASTQ & VCF or just VCF. Genome Simulation Technicalities
  • 15. Start Script Google Cloud Libraries Java Libraries SAM Tools VarSim Python Libraries Ref Genome / Insert Sequences / Annotations ART Simulator 4.8 GB CLOUD FUNCTIONS BETA HTTP DOCKER IMAGE OUTPUT FILES FASTQ VCF Genome Simulation Docker Image
  • 16. Using Docker Registry on Google Cloud. Given the size of the Docker image, it made sense to take advantage of the high network speeds between servers on the cloud for quick deployment to the Container Engine. Genome Simulation Container Registry DOCKER IMAGE 4.8 GB
  • 17. Using Container Registry on Google Cloud. Genome Simulation Container Clusters Can Reach a MAX cluster size of 1000
  • 18. Configure the version of Docker image to be deployed to the cluster. Kubernetes takes care of distributing the Docker image to all the instances in the cluster. Genome Simulation Kubernetes - Container Clusters
  • 20. • We have around 1000 unique VCF files generated so far. • We have around 10 genome FASTQs and VCFs. • More FASTQ and VCF if we can fund it. Cost • It cost $1000 to generate 1000 unique VCF files: $1 per VCF. • Costs to generate FASTQ files vary based on the coverage required. For 50X coverage, the costs work out to around $5 for the FASTQ and VCF. The costs can be brought down by generating FASTQ files in multiple lanes. Genome Simulation Outcome
  • 21. John C. Mu, Marghoob Mohiyuddin, Jian Li, Narges Bani Asadi, Mark B. Gerstein, Alexej Abyzov, Wing H. Wong, and Hugo Y.K. Lam. VarSim: A high-fidelity simulation and validation framework for high-throughput genome sequencing with cancer applications. Bioinformatics, first published online December 17, 2014. doi:10.1093/bioinformatics/btu828 To folks from Google - Daniel Bergqvist, Nico Gaviola & Craig Box. Mathew Woodwark, Nick Brown, Rob Hernandez, Sandra Giuliani, Frank Lombardi from AstraZeneca for supporting this work. Genome Simulation References Thanks
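
The storage and cost figures on slides 7 and 20 can be reproduced with simple arithmetic. The sketch below is only a back-of-the-envelope check, not something taken from the talk: it assumes decimal units (1 GB = 10^9 bytes) and a plain 2-bits-per-base packing for the "perfect world" genome, and the function names are made up for illustration.

```python
# Back-of-the-envelope check of the slide figures.
# Assumptions (not from the slides): decimal units and 2 bits per base.

BASES_IN_HUMAN_GENOME = 3_000_000_000  # "3 billion letters" from slide 7
BITS_PER_BASE = 2                      # A, C, G, T fit in 2 bits (assumed packing)


def packed_genome_mb(bases=BASES_IN_HUMAN_GENOME, bits_per_base=BITS_PER_BASE):
    """Size in megabytes of a genome stored at a fixed number of bits per base."""
    return bases * bits_per_base / 8 / 10**6


def cohort_pb(samples, gb_per_sample):
    """Total raw sequencer output in petabytes for a cohort of samples."""
    return samples * gb_per_sample * 10**9 / 10**15


def cost_per_file(total_usd, files):
    """Average cost in dollars per generated file."""
    return total_usd / files


if __name__ == "__main__":
    print(packed_genome_mb())       # same order as the "~700 megabytes" on the slide
    print(cohort_pb(100_000, 100))  # lower bound of the "10~20 petabytes" range
    print(cohort_pb(100_000, 200))  # upper bound of the range
    print(cost_per_file(1000, 1000))
```

Under these assumptions the packed genome comes to 750 MB (about 715 MiB), which is in line with the "~700 megabytes" figure, and 100,000 samples at 100 to 200 GB each give the 10 to 20 petabyte range.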

Editor's Notes

  1. The human body has about 100 trillion cells with more than 200 different cell types. Each cell harbors the same genetic information in its nucleus in the form of DNA-containing chromosomes. Deoxyribonucleic acid (DNA) is the chemical inside the nucleus of all cells that carries the genetic instructions for making living organisms. A DNA molecule consists of two strands that wrap around each other to resemble a twisted ladder. The sides are made of sugar and phosphate molecules. The "rungs" are made of nitrogen-containing chemicals called bases. Each unit of a strand (a nucleotide) is composed of one sugar molecule, one phosphate molecule, and a base. Four different bases are present in DNA: adenine (A), thymine (T), cytosine (C), and guanine (G). The particular order of the bases arranged along the sugar-phosphate backbone is called the DNA sequence; the sequence specifies the exact genetic instructions required to create a particular organism with its own unique traits. The two strands of the DNA molecule are held together at their bases by weak bonds. The four bases pair in a set manner: adenine (A) pairs with thymine (T), while cytosine (C) pairs with guanine (G). These pairs of bases are known as base pairs (bp). These base pairs are the basis of Y-chromosome testing.
  3. @SEQ_ID
     GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
     +
     !''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
  4. VarSim is a Python/Java based tool that simulates one genome per run. We looked into ways in which we could parallelize the process and run it simultaneously on multiple machines to generate more genomes. We packaged the tool into a Docker container/image so it can be easily shipped and deployed in a cluster. As we had some experience in using Google Cloud, we decided to execute it in a cluster using a managed service called Container Engine (running Kubernetes). This allows us to spin up multiple machines (up to 500), deploy our Docker image (VarSim), and execute. Parameters like Coverage, Unique ID & Seed value were externalized in a Lambda function that the Docker images could talk to and receive arguments from before execution. Output FASTQ files and VCFs would then be stored in Cloud Storage. Ability to choose to generate FASTQ & VCF or just VCF.
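
Note 3 above shows a single FASTQ record: an identifier line starting with `@`, the base sequence, a `+` separator, and one quality character per base. The following is a minimal sketch of how such a record could be read. It assumes the common Phred+33 quality encoding, which the slides do not state, and the names `FastqRecord` and `parse_fastq_record` are hypothetical, not part of VarSim or any library.

```python
from typing import List, NamedTuple


class FastqRecord(NamedTuple):
    seq_id: str
    sequence: str
    qualities: List[int]


def parse_fastq_record(text: str, phred_offset: int = 33) -> FastqRecord:
    """Parse one four-line FASTQ record.

    phred_offset=33 is an assumption (Phred+33 encoding); other encodings
    would need a different offset.
    """
    lines = [line.strip() for line in text.strip().splitlines()]
    if len(lines) != 4:
        raise ValueError("a FASTQ record must have exactly four lines")
    header, sequence, separator, quality = lines
    if not header.startswith("@"):
        raise ValueError("the first line must start with '@'")
    if not separator.startswith("+"):
        raise ValueError("the third line must start with '+'")
    if len(sequence) != len(quality):
        raise ValueError("sequence and quality lines must have equal length")
    # Each quality character encodes a Phred score as an ASCII code minus the offset.
    scores = [ord(char) - phred_offset for char in quality]
    return FastqRecord(header[1:], sequence, scores)


if __name__ == "__main__":
    record = parse_fastq_record("@r1\nACGTA\n+\n!+5?I")
    print(record.seq_id, record.sequence, record.qualities)
```

A real pipeline would more likely rely on an established parser; the point here is only to show how the four lines of the record relate to each other.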