DNA/RNA read simulators

General Outline
• Brief overview of available simulators
• Pattnaik, et al. (2014). SInC: an accurate and fast error-
model based simulator for SNPs, Indels and CNVs coupled
with a read generator for short-read sequence data. BMC
Bioinformatics, 15:40.
• Griebel, et al. (2012). Modelling and simulating generic
RNA-Seq experiments with the flux simulator. Nucl. Acids
Res. 40 (20): 10073-10083.
• Mu, et al. (2015). VarSim: a high-fidelity simulation and
validation framework for high-throughput genome
sequencing with cancer applications. Bioinformatics, 31
(9): 1469-1471.
• Conclusions/Suggestions

Brief Overview
• Read simulators:
– Wgsim (2009): basic sequencing simulation; dummy quality scores
– MetaSim (2008): uses pre-defined sequence-context error models; multiple genome input
– ART (2012): uses pre-trained quality score distribution profiles
– pIRS (2012): creates a quality score and cycle matrix from real data to generate an empirical
error profile
• Variation/Read simulators:
– GemSIM (2012): generates empirical error models from real data; multiple genome input;
random generation of SNPs and indels
– MAQ (2008): error model based on a quality score profile from a first-order Markov chain;
random SNP and indel generation
– DWGSIM (2009): based on wgsim from SAMtools; simulates SNPs and indels
– BEERS (2009): RNA-seq simulator; random sampling from a set of gene models; copy
distributions generated from a gene quantification file
– SInC (2014): pre-defined quality profile error generation; tool for generating custom profiles;
random SNPs, indels, and CNVs
• Multi-step simulators:
– Flux Sim (2012): RNA-seq experiment simulator; simulates transcription and sequencing from
realistic statistical models
– VarSim (2015): genome and read simulation and validation framework

SInC
• Three-part variation simulator and a read
generator
• Variation modules model SNPs, Indels, and CNVs
(copy number variations)
• Read generator module models short-read
sequencing using a real-data derived quality
distribution profile.
• Multi-threaded for fast read generation.
• The authors performed a small evaluation
against four other variation simulators.

SInC
• SNPs, indels, and CNVs are randomly
distributed across the reference genome by
separate modules using command-line
parameters
• Reads are generated using a pre-defined error
profile distribution
• However, a separate tool is available to
generate custom error profiles from real data
sets
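Random SNP placement can be illustrated with a minimal Python sketch. This is not SInC's code: the function name insert_random_snps, the uniform choice of positions, and the uniform choice of alternate base are all assumptions made for illustration.

```python
import random

def insert_random_snps(reference, n_snps, seed=0):
    """Return (mutated_sequence, snps) with n_snps substitutions
    at distinct, uniformly chosen positions (hypothetical helper)."""
    rng = random.Random(seed)
    # Only unambiguous bases are eligible, so N runs are left untouched.
    candidates = [i for i, base in enumerate(reference) if base in "ACGT"]
    positions = sorted(rng.sample(candidates, n_snps))
    seq = list(reference)
    snps = []
    for pos in positions:
        ref_base = seq[pos]
        # Assumption: each of the three alternate bases is equally likely.
        alt_base = rng.choice([b for b in "ACGT" if b != ref_base])
        seq[pos] = alt_base
        snps.append((pos, ref_base, alt_base))
    return "".join(seq), snps

reference = "ACGTACGTACGTACGTACGT"
mutated, snps = insert_random_snps(reference, n_snps=3, seed=42)
for pos, ref_base, alt_base in snps:
    print(pos, ref_base, alt_base)
```

The returned list of (position, reference base, alternate base) tuples plays the role of the "truth set" that a variant caller would later be scored against.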

SInC Evaluation using GATK and Pindel

FluxSim
• Generic RNA-seq experiment simulator
• Multiple modules simulating different stages
of RNA Illumina library construction and
sequencing, as well as a transcriptome
simulator.
• Simulator Modules/Stages: transcription,
fragmentation, reverse transcription, size
selection, adapter ligation/PCR amplification,
sequencing

Outline of the Flux Simulator pipeline.
Thasso Griebel et al. Nucl. Acids Res. 2012;40:10073-10083
© The Author(s) 2012. Published by Oxford University Press.

FluxSim Transcription
• FluxSim models gene expression by sampling
from a power law distribution (i.e. a modified
Zipf’s law with exponential mRNA decay).
– This relationship models the networked nature of
cellular gene expression, with many lowly
expressed genes (low ranked), several moderately
expressed genes, and a few very highly expressed
genes (high ranked).
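A rank-based expression model of this kind can be sketched in Python. The functional form below (a power law in the rank multiplied by an exponential decay term) and the parameter values k and x1 are assumptions chosen for illustration; they are not taken from the FluxSim paper or its defaults.

```python
import math

def relative_expression(rank, k=-0.6, x1=5000.0):
    """Unnormalised expression weight for a gene of the given rank.

    Assumed form: rank**k (Zipf-like power law) damped by an
    exponential term that cuts off the tail of lowly expressed genes.
    """
    return rank ** k * math.exp(-rank / x1 - (rank / x1) ** 2)

def expression_fractions(n_genes, k=-0.6, x1=5000.0):
    """Fraction of all transcript molecules assigned to each rank."""
    weights = [relative_expression(r, k, x1) for r in range(1, n_genes + 1)]
    total = sum(weights)
    return [w / total for w in weights]

fractions = expression_fractions(10000)
# Rank 1 is the most highly expressed gene in this sketch.
print(fractions[0], fractions[99], fractions[9999])
```

On a log-log plot of fraction against rank, the power law part appears as a straight line and the exponential term bends the curve downwards at high ranks.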

FluxSim: log-log plot of three real cellular transcriptome datasets

FluxSim Sequencing
• A quality profile based model for Illumina
sequencing
– Quality values are randomly drawn from a pre-
defined empirical distribution dependent on cycle
position
– Nucleotides are mutated according to the quality
score error probability
– Nucleotide mutation choice/preference is
determined based on the quality score using a
first order Markov process
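The quality-driven error step can be sketched as follows. The Phred relationship P(error) = 10^(-Q/10) is standard; everything else is an assumption: the toy per-cycle profile, the function names, and the uniform choice of substituted base, which stands in for the quality-dependent Markov process described above.

```python
import random

def phred_to_error_prob(q):
    """Standard Phred definition: Q = -10 * log10(P_error)."""
    return 10 ** (-q / 10)

def simulate_read(template, quality_profile, rng):
    """Draw a quality per cycle, then mutate the base with the
    matching error probability (hypothetical helper).

    quality_profile[i] is a list of (quality, weight) pairs for cycle i.
    """
    bases, quals = [], []
    for cycle, base in enumerate(template):
        values, weights = zip(*quality_profile[cycle])
        q = rng.choices(values, weights=weights, k=1)[0]
        if rng.random() < phred_to_error_prob(q):
            # Simplification: uniform substitution instead of a
            # quality-dependent Markov choice.
            base = rng.choice([b for b in "ACGT" if b != base])
        bases.append(base)
        quals.append(q)
    return "".join(bases), quals

# Toy profile: qualities drop in the later cycles of the read.
profile = [[(40, 0.8), (20, 0.2)]] * 5 + [[(30, 0.5), (10, 0.5)]] * 5
template = "ACGTACGTAC"
rng = random.Random(1)
read, quals = simulate_read(template, profile, rng)
print(read, quals)
```

A real profile would be estimated from sequencing data, with one empirical quality distribution per cycle position.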

VarSim
• Multi-step simulator and validation framework
– 1) simulates perturbed diploid genomes from a
reference by inserting variants (VarSim simulates
SNVs, deletions, insertions, MNPs, complex variants,
tandem duplications and inversions) using the
distribution profiles of existing databases
– 2) uses a third-party read simulator to generate
sequenced reads (currently configured to use
DWGSIM or ART) from the perturbed genomes
– 3) reads are mapped back to the original reference
genome using a modified vcf2diploid (Rozowsky et al.,
2011) map file (MFF file)
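The idea behind a coordinate map can be sketched in Python. The block layout used here, (perturbed_start, reference_start, length), is a made-up illustration and not the actual MFF format; to_reference is a hypothetical helper.

```python
from bisect import bisect_right

def to_reference(blocks, pos):
    """Translate a 0-based position on the perturbed genome to the
    reference, or return None if it lies in inserted sequence or
    beyond the last block.

    blocks: list of (perturbed_start, reference_start, length),
    sorted by perturbed_start; reference_start is None for insertions.
    """
    starts = [b[0] for b in blocks]
    i = bisect_right(starts, pos) - 1
    if i < 0:
        return None
    p_start, r_start, length = blocks[i]
    if pos >= p_start + length or r_start is None:
        return None
    return r_start + (pos - p_start)

# Toy 100 bp reference with a 5 bp insertion after reference position 9
# and a deletion of reference positions 50-59.
blocks = [
    (0, 0, 10),      # unchanged prefix
    (10, None, 5),   # inserted bases, absent from the reference
    (15, 10, 40),    # reference 10-49, shifted right by 5
    (55, 60, 40),    # reference 60-99, after the deletion
]
for pos in (5, 12, 15, 54, 55, 94):
    print(pos, to_reference(blocks, pos))
```

With such a map, the true origin of every simulated read can be expressed in reference coordinates and compared with where an aligner placed it.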

VarSim Validation
– read alignments (from mapping software, e.g.
BWA-mem) are validated using read header
metadata
– Variants (from variant caller software, e.g.
FreeBayes) are validated against ‘true’ variants
that were inserted into the perturbed genome
– Accuracy of variant calling is reported based on
sensitivity (TPR) and precision (PPV, i.e. 1 − FDR),
broken down by variant type and size, as a JSON file
with SVG plots
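The two reported metrics can be computed from a truth set and a call set as in this minimal sketch. Exact tuple matching is a simplifying assumption; VarSim's own matching rules for variants are not reproduced here, and evaluate_calls is a hypothetical name.

```python
def evaluate_calls(true_variants, called_variants):
    """Compare called variants with the inserted ('true') variants.

    Variants are (chrom, pos, ref, alt) tuples; a call counts as a
    true positive only on an exact match (simplifying assumption).
    """
    truth, calls = set(true_variants), set(called_variants)
    tp = len(truth & calls)   # inserted and called
    fp = len(calls - truth)   # called but never inserted
    fn = len(truth - calls)   # inserted but missed
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    fdr = fp / (tp + fp) if tp + fp else 0.0
    return {"tp": tp, "fp": fp, "fn": fn,
            "sensitivity": sensitivity, "precision": precision, "fdr": fdr}

truth = [("chr1", 100, "A", "G"), ("chr1", 250, "C", "T"),
         ("chr2", 40, "G", "A"), ("chr2", 900, "T", "C")]
calls = [("chr1", 100, "A", "G"), ("chr2", 40, "G", "A"),
         ("chr2", 555, "A", "T")]
result = evaluate_calls(truth, calls)
print(result)
```

Here two of four inserted variants are recovered (sensitivity 0.5) and two of three calls are correct (precision 2/3, FDR 1/3).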

VarSim simulation and validation workflow.
John C. Mu et al. Bioinformatics 2015;31:1469-1471
© The Author 2014. Published by Oxford University Press.

Validation results for some popular secondary analysis tools.
John C. Mu et al. Bioinformatics 2015;31:1469-1471
© The Author 2014. Published by Oxford University Press.

Conclusions/Suggestions
• There are no comprehensive evaluations (that I could
find) of DNA/RNA simulators other than the
incomplete SInC comparison.
• However, SInC and VarSim appear to be good
candidates for genome variation and gDNA simulation,
while FluxSim appears to be the only fully realized RNA
simulator.
• A pipeline with SInC or VarSim genome perturbation
combined with FluxSim transcription and library
prep/sequencing might allow validation of RNAseq
tools with biologically complex simulated data.

FluxSim Evaluation

Comparison of simulated reads with experimental evidence in different sequencing protocols.
Thasso Griebel et al. Nucl. Acids Res. 2012;40:10073-10083
© The Author(s) 2012. Published by Oxford University Press.
