A Look at DNA/RNA Simulation
General Outline
• Brief overview of available simulators
• Pattnaik, et al. (2014). SInC: an accurate and fast error-
model based simulator for SNPs, Indels and CNVs coupled
with a read generator for short-read sequence data. BMC
Bioinformatics, 15:40.
• Griebel, et al. (2012). Modelling and simulating generic
RNA-Seq experiments with the flux simulator. Nucl. Acids
Res. 40 (20): 10073-10083.
• Mu, et al. (2015). VarSim: a high-fidelity simulation and
validation framework for high-throughput genome
sequencing with cancer applications. Bioinformatics, 31
(9): 1469-1471.
• Conclusions/Suggestions
Brief Overview
• Read simulators:
– Wgsim(2009): basic sequencing simulation; dummy quality scores
– MetaSim(2008): uses pre-defined sequence context error models; multiple genome input
– ART(2012): uses pre-trained quality score distribution profile
– piRS(2012): creates quality score and cycle matrix from real data to generate empirical error
profile
• Variation/Read simulators:
– GemSIM(2012): generates empirical error models from real data, multiple genome input,
random generation of SNPs and Indels
– MAQ(2008): error model based on a quality score profile from a first-order Markov chain,
random SNP and Indel generation
– DWGSIM(2009): based on wgsim of samtools. SNPs and Indels
– BEERS(2009): RNAseq simulator, random sampling from a set of gene models, copy
distributions generated from a gene quantification file
– SInC(2014): pre-defined quality profile error generation, tool for generating custom profiles,
random SNP, indel, and CNVs
• Multi-step simulators:
– Flux Sim(2012): RNAseq experiment simulator, simulates transcription and sequencing from
realistic statistical models
– VarSim(2015): genome and read simulation and validation framework
SInC
• Three-part variation simulator and a read
generator
• Variation modules model SNPs, Indels, and CNVs
(copy number variations)
• Read generator module models short-read
sequencing using a real-data derived quality
distribution profile.
• Multi-threaded for fast read generation.
• Performed a small evaluation versus 4 other
variation simulators.
SInC
• SNPs, indels, and CNVs are randomly
distributed across the reference genome by
separate modules using command-line
parameters
• Reads are generated using a pre-defined error
profile distribution
• However, a separate tool is available to
generate custom error profiles from real data
sets
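The random placement of variants described above can be sketched as follows. This is a minimal illustration, not SInC's implementation: the function name `insert_random_snps`, the `snp_rate` parameter, and the uniform choice of position and alternate base are all assumptions made for the example.

```python
import random


def insert_random_snps(reference, snp_rate, seed=0):
    """Place SNPs at uniformly random positions of a reference sequence.

    Hypothetical helper: returns the mutated sequence and a list of
    (position, ref_base, alt_base) records, i.e. the 'true' variant set.
    """
    rng = random.Random(seed)
    bases = "ACGT"
    seq = list(reference)
    # Only unambiguous bases are eligible (skips N and other IUPAC codes).
    candidates = [i for i, b in enumerate(seq) if b in bases]
    n_snps = min(int(len(seq) * snp_rate), len(candidates))
    snps = []
    for pos in sorted(rng.sample(candidates, n_snps)):
        ref_base = seq[pos]
        # Assumption: each of the three alternate bases is equally likely.
        alt_base = rng.choice([b for b in bases if b != ref_base])
        seq[pos] = alt_base
        snps.append((pos, ref_base, alt_base))
    return "".join(seq), snps
```

Keeping the returned variant list alongside the mutated sequence is what later allows called variants to be checked against the inserted ones.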
SInC Workflow
SInC Evaluation using GATK and Pindel
SInC Evaluation
FluxSim
• Generic RNA-seq experiment simulator
• Multiple modules simulating different stages
of RNA Illumina library construction and
sequencing, as well as a transcriptome
simulator.
• Simulator Modules/Stages: transcription,
fragmentation, reverse transcription, size
selection, adapter ligation/PCR amplification,
sequencing
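Two of the stages listed above, fragmentation and size selection, can be sketched in a few lines. This is only a toy version under stated assumptions: breakpoints are drawn uniformly along the transcript, which is simpler than the models a real library-preparation simulator would use, and the names `fragment_uniform` and `size_select` are hypothetical.

```python
import random


def fragment_uniform(length, n_breaks, rng):
    """Break a molecule of the given length at n_breaks uniform random points.

    Returns half-open (start, end) intervals that tile the whole molecule.
    """
    breaks = sorted(rng.sample(range(1, length), n_breaks))
    edges = [0] + breaks + [length]
    return list(zip(edges, edges[1:]))


def size_select(fragments, min_len, max_len):
    """Keep only fragments whose length falls inside [min_len, max_len]."""
    return [(s, e) for s, e in fragments if min_len <= e - s <= max_len]


if __name__ == "__main__":
    rng = random.Random(42)
    frags = fragment_uniform(2000, 9, rng)
    print(size_select(frags, 150, 400))
```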
Outline of the Flux Simulator pipeline.
Thasso Griebel et al. Nucl. Acids Res. 2012;40:10073-10083
© The Author(s) 2012. Published by Oxford University Press.
FluxSim Transcription
• FluxSim models gene expression by sampling
from a power law distribution (i.e. a modified
Zipf’s law with exponential mRNA decay).
–
– This relationship models the networked nature of
cellular gene expression, with many lowly
expressed genes (low ranked), several moderately
expressed genes, and a few very highly expressed
genes (high ranked).
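A rank-based expression law of this kind can be sketched as below. The functional form y = y0 * x^k * exp(-x/a - (x/b)^2) is an assumption chosen to match the description (a power law with exponential decay); the default parameter values are arbitrary illustrations, and rank 1 here denotes the most abundant gene.

```python
import math
import random


def expression_level(rank, y0=1.0e5, k=-0.6, a=9500.0, b=9500.0):
    """Expression of the gene at a given abundance rank (1 = most abundant).

    Assumed form: power law y0 * rank**k, damped by exp(-rank/a - (rank/b)**2).
    """
    return y0 * rank ** k * math.exp(-rank / a - (rank / b) ** 2)


def assign_expression(gene_ids, seed=0, **params):
    """Give each gene a random rank, then read its expression off the law."""
    rng = random.Random(seed)
    ranks = list(range(1, len(gene_ids) + 1))
    rng.shuffle(ranks)
    return {g: expression_level(r, **params) for g, r in zip(gene_ids, ranks)}
```

On a log-log plot the power-law term gives a straight line with slope k, while the exponential term bends the curve downward for the least abundant genes.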
FluxSim: log-log plot of three real cellular transcriptome datasets
FluxSim Sequencing
• A quality profile based model for Illumina
sequencing
– Quality values are randomly drawn from a pre-
defined empirical distribution dependent on cycle
position
– Nucleotides are mutated according to the quality
score error probability
– Nucleotide mutation choice/preference is
determined based on the quality score using a
first order Markov process
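The quality-driven error step can be sketched as follows. It uses the standard Phred relation p = 10^(-Q/10); everything else is an assumption for illustration: the `quality_profile` structure (one list of (quality, weight) pairs per cycle) is hypothetical, and the substituted base is chosen uniformly rather than by the first-order Markov process named above.

```python
import random


def phred_to_error_prob(q):
    """Standard Phred relation: error probability for quality score q."""
    return 10 ** (-q / 10)


def sequence_read(fragment, quality_profile, seed=0):
    """Simulate one read from a fragment, one base per sequencing cycle.

    quality_profile[i] is a list of (quality, weight) pairs for cycle i
    (hypothetical layout). The read is truncated to the profile length.
    """
    rng = random.Random(seed)
    read, quals = [], []
    for cycle, base in enumerate(fragment[:len(quality_profile)]):
        values, weights = zip(*quality_profile[cycle])
        q = rng.choices(values, weights=weights, k=1)[0]
        # Mutate the base with the probability implied by its quality score.
        if rng.random() < phred_to_error_prob(q):
            # Simplification: uniform choice among the other three bases.
            base = rng.choice([b for b in "ACGT" if b != base])
        read.append(base)
        quals.append(q)
    return "".join(read), quals
```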
VarSim
• Multi-step simulator and validation framework
– 1) simulates perturbed diploid genomes from a
reference by inserting variants (SNVs, deletions,
insertions, MNPs, complex variants, tandem
duplications and inversions) based on the
distribution profiles of existing variant databases
– 2) uses a third-party read simulator to generate
sequenced reads (currently configured to use
DWGSIM or ART) from the perturbed genomes
– 3) reads are mapped back to the original reference
genome using a modified vcf2diploid (Rozowsky et al.,
2011) map file (MFF file)
VarSim Validation
– read alignments (from mapping software, e.g.
BWA-mem) are validated using read header
metadata
– Variants (from variant caller software, e.g.
FreeBayes) are validated against ‘true’ variants
that were inserted into the perturbed genome
– Accuracy of variant calling is reported as
sensitivity (TPR) and precision (PPV, i.e. 1 − FDR),
broken down by variant type and size, as a JSON file
with SVG plots
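The reported metrics can be illustrated with a small comparison of called variants against the inserted ('true') set. This is a sketch, not VarSim's comparison logic: it assumes variants are represented as hashable tuples such as (chrom, pos, ref, alt) and counts only exact matches, and the function name `compare_variants` is hypothetical.

```python
def compare_variants(true_variants, called_variants):
    """Compute sensitivity, precision and F1 by exact-match set comparison."""
    truth, calls = set(true_variants), set(called_variants)
    tp = len(truth & calls)   # inserted variants that were called
    fp = len(calls - truth)   # calls with no matching inserted variant
    fn = len(truth - calls)   # inserted variants that were missed
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    total = sensitivity + precision
    # F1 is the harmonic mean of sensitivity and precision.
    f1 = 2 * sensitivity * precision / total if total else 0.0
    return {"TP": tp, "FP": fp, "FN": fn,
            "sensitivity": sensitivity, "precision": precision, "F1": f1}
```

Running the same comparison separately per variant type or size bin would give the kind of breakdown described above.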
VarSim simulation and validation workflow.
John C. Mu et al. Bioinformatics 2015;31:1469-1471
© The Author 2014. Published by Oxford University Press.
Validation results for some popular secondary analysis tools.
John C. Mu et al. Bioinformatics 2015;31:1469-1471
© The Author 2014. Published by Oxford University Press.
Conclusions/Suggestions
• There are no comprehensive evaluations (that I could
find) of DNA/RNA simulators other than the
incomplete SInC comparison.
• However, SInC and VarSim appear to be good
candidates for genome variation and gDNA simulation,
while FluxSim appears to be the only fully realized RNA
simulator.
• A pipeline with SInC or VarSim genome perturbation
combined with FluxSim transcription and library
prep/sequencing might allow validation of RNAseq
tools with biologically complex simulated data.
Comparison of simulated reads with experimental evidence in different sequencing protocols.
Thasso Griebel et al. Nucl. Acids Res. 2012;40:10073-10083
© The Author(s) 2012. Published by Oxford University Press.
FluxSim Evaluation


Editor's Notes

  1. Variant rediscovery statistics. Percentages of simulated A) SNVs and B) indels rediscovered using GATK and Pindel. The rediscovery of indels by size was also performed and is given in Additional file 3. The rediscovery percentages of C) heterozygous and D) homozygous SNVs are compared. Reference: human chromosome 22 from UCSC hg19. Reads were aligned using Novoalign; variants were called with GATK and Pindel.
  2. Figure 3 Time profiles of the different simulators used. Time elapsed to perform one complete simulation with default options using single core across different simulators. A) For chromosome 22 at 15X B) For human whole genome (hg19) at 5X.
  3. Outline of the Flux Simulator pipeline. Provided the genomic sequence of an organism and a representative gene annotation as input, the initial step is a transcriptome simulation (A) to assign each transcript a randomised expression level according to general laws of gene expression. Subsequently, fragmentation (B) and RT (C) are carried out, either by first hydrolysing RNA and then transcribing the fragments into cDNA molecules (B and C, right) or by nebulisation respectively enzymatic digestion after reversely transcribing the entire RNA molecules (B and C, left). The simulated molecules of the primary library then get amplified by in silico PCR (D)—optionally after selecting a certain size range—and the final library then is subjected to simulated sequencing (E), including potential platform and sequencing chemistry specific error models. Finally, read sequences along with their genomic mappings are obtained.
  4. Y0 = expression level of the most abundant gene; K = exponent of the power law, which governs the slope of the log-log plot; a and b = exponential mRNA decay rates.
  5. Supplementary Figure 3: Expression profiles observed in RNA-Seq experiments. The curves show the log-log behaviour of transcript expression in RNA-Seq experiments conducted on cellular transcriptomes of the species M. musculus (blue), A. thaliana (green) and S. cerevisiae (red). Expression values for every gene in a corresponding reference annotation (i.e., the murine RefSeq, the TAIR9 annotation of cress, and the SGD yeast annotation) have been estimated by the number of reads mapping to it, and expression levels have been ranked from high to low (x-axis). Although target cells and RNA-Seq experiment protocols differ substantially, all datasets show highly similar characteristics in their transcript abundance distribution: the nature of Zipf’s Law underlying gene expression can be noted by the largely linear behaviour in logarithmic scale. However, especially for lowly abundant forms, an exponential decay is notable.
  6. VarSim simulation and validation workflow. The germline workflow can be run with or without the somatic workflow
  7. Validation results for some popular secondary analysis tools. F1 = harmonic mean of sensitivity and precision. BWA-Mem used GATK re-alignment.
  8. Comparison of simulated reads with experimental evidence in different sequencing protocols. For each experiment, transcripts from a reference annotation of the corresponding species have been classified into short (<1000 nt, left panels), intermediate (1000–2000 nt, centre panels), and long forms (>2000 nt, right panels). Red and orange bars show reads from the experiment that align in sense and antisense, respectively, to the directionality of transcription; the corresponding in silico results are shown as dark and light blue bars, respectively. (A) Read tag distributions from an RNA hydrolysis protocol in M. musculus sequenced on the Illumina GA2 platform. (B) A different hydrolysis experiment carried out with the recent HiSeq2000 technology (Illumina), producing longer reads that exclusively map in sense orientation (so-called ‘dir RNA-Seq’). (C) A complementary Illumina experiment employing poly-dT primed RT and subsequent DNAse digestion of the (poly-A+) transcriptome of S. cerevisiae. (D) Results from an experiment in A. thaliana where poly-dT primed RT products are fragmented by nebulisation.