2. General Outline
• Brief overview of available simulators
• Pattnaik, et al. (2014). SInC: an accurate and fast error-
model based simulator for SNPs, Indels and CNVs coupled
with a read generator for short-read sequence data. BMC
Bioinformatics, 15:40.
• Griebel, et al. (2012). Modelling and simulating generic
RNA-Seq experiments with the flux simulator. Nucl. Acids
Res. 40 (20): 10073-10083.
• Mu, et al. (2015). VarSim: a high-fidelity simulation and
validation framework for high-throughput genome
sequencing with cancer applications. Bioinformatics, 31
(9): 1469-1471.
• Conclusions/Suggestions
3. Brief Overview
• Read simulators:
– Wgsim(2009): basic sequencing simulation; dummy quality scores
– MetaSim(2008): uses pre-defined sequence context error models; multiple genome input
– ART(2012): uses pre-trained quality score distribution profile
– piRS(2012): creates quality score and cycle matrix from real data to generate empirical error
profile
• Variation/Read simulators:
– GemSIM(2012): generates empirical error models from real data, multiple genome input,
random generation of SNPs and Indels
– MAQ(2008): error model based on a quality score profile from an order-one Markov chain,
random SNP and Indel generation
– DWGSIM(2009): based on wgsim from samtools; simulates SNPs and Indels
– BEERS(2009): RNAseq simulator, random sampling from a set of gene models, copy
distributions generated from a gene quantification file
– SInC(2014): pre-defined quality profile error generation, tool for generating custom profiles,
random SNP, indel, and CNVs
• Multi-step simulators:
– Flux Sim(2012): RNAseq experiment simulator, simulates transcription and sequencing from
realistic statistical models
– VarSim(2015): genome and read simulation and validation framework
4. SInC
• Three-part variation simulator and a read
generator
• Variation modules model SNPs, Indels, and CNVs
(copy number variations)
• Read generator module models short-read
sequencing using a real-data derived quality
distribution profile.
• Multi-threaded for fast read generation.
• Performed a small evaluation versus 4 other
variation simulators.
5. SInC
• SNPs, indels, and CNVs are randomly
distributed across the reference genome by
separate modules using command-line
parameters
• Reads are generated using a pre-defined error
profile distribution
• However, a separate tool is available to
generate custom error profiles from real data
sets
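The random placement of variants described above can be sketched as follows. This is a minimal illustration for SNPs only, not SInC's actual implementation: the function name `simulate_snps`, the `snp_rate` parameter and the uniform choice of alternate base are all assumptions made for the example.

```python
import random

def simulate_snps(reference, snp_rate, seed=0):
    """Place SNPs uniformly at random on a reference sequence.

    Hypothetical helper for illustration; returns the mutated sequence
    and a list of (position, ref_base, alt_base) tuples.
    """
    rng = random.Random(seed)
    bases = "ACGT"
    seq = list(reference)
    # Only unambiguous bases are eligible (this skips N positions).
    candidates = [i for i, b in enumerate(seq) if b in bases]
    n_snps = min(round(len(seq) * snp_rate), len(candidates))
    variants = []
    for pos in sorted(rng.sample(candidates, n_snps)):
        ref = seq[pos]
        # Assumption: each alternate base is equally likely.
        alt = rng.choice([b for b in bases if b != ref])
        seq[pos] = alt
        variants.append((pos, ref, alt))
    return "".join(seq), variants

reference = "ACGTNACGTA" * 40          # toy 400 bp reference
mutated, variants = simulate_snps(reference, snp_rate=0.05, seed=1)
```

The returned variant list plays the role of the 'truth' set that a later variant-calling evaluation would compare against.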
9. FluxSim
• Generic RNA-seq experiment simulator
• Multiple modules simulating different stages
of RNA Illumina library construction and
sequencing, as well as a transcriptome
simulator.
• Simulator Modules/Stages: transcription,
fragmentation, reverse transcription, size
selection, adapter ligation/PCR amplification,
sequencing
11. FluxSim Transcription
• FluxSim models gene expression by sampling
from a power law distribution (i.e. modified
Zipf’s law with exponential mRNA decay).
– This relationship models the networked nature of
cellular gene expression, with many lowly
expressed genes (low ranked), several moderately
expressed genes, and a few very highly expressed
genes (high ranked).
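A rank-based expression profile of this kind can be sketched as below. The functional form y(x) = Y0 · x^K · exp(−x/a − (x/b)²) is an assumption chosen to match the symbols Y0, K, a and b used in this deck; it is not a formula confirmed by the text, and the default parameter values are purely illustrative.

```python
import math

def expression_level(rank, y0, k, a, b):
    """Expression of the gene at a given abundance rank (1 = most abundant).

    Assumed form: power law in rank (Zipf-like) times an exponential
    decay term governed by a and b.
    """
    return y0 * rank ** k * math.exp(-rank / a - (rank / b) ** 2)

def expression_profile(n_genes, y0=1e5, k=-0.6, a=9500.0, b=9500.0):
    # Illustrative defaults only; not taken from the Flux Simulator.
    return [expression_level(r, y0, k, a, b) for r in range(1, n_genes + 1)]

profile = expression_profile(10000)
```

On a log-log plot of `profile` against rank, the power-law term gives a roughly straight line with slope K, and the exponential term bends the curve downward for the lowly expressed genes.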
13. FluxSim Sequencing
• A quality profile based model for Illumina
sequencing
– Quality values are randomly drawn from a pre-
defined empirical distribution dependent on cycle
position
– Nucleotides are mutated according to the quality
score error probability
– Nucleotide mutation choice/preference is
determined based on the quality score using a
first order Markov process
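The first two steps above can be sketched as follows, using the standard Phred relation p = 10^(−Q/10) between a quality score and its error probability. This is a simplified illustration, not the Flux Simulator's code: the substitute base is chosen uniformly here instead of by the quality-dependent Markov process, and `quality_by_cycle` is a hypothetical stand-in for the empirical per-cycle distribution.

```python
import random

def phred_to_error_prob(q):
    """Standard Phred relation: Q = -10 * log10(p)."""
    return 10 ** (-q / 10)

def sequence_read(fragment, quality_by_cycle, rng):
    """Simulate one read: draw a quality per cycle, then mutate the base
    with the error probability implied by that quality."""
    bases = "ACGT"
    read, quals = [], []
    for cycle, base in enumerate(fragment):
        # Draw a quality value from the distribution for this cycle.
        q = rng.choice(quality_by_cycle[cycle])
        if rng.random() < phred_to_error_prob(q):
            # Simplification: uniform choice among the other bases.
            base = rng.choice([b for b in bases if b != base])
        read.append(base)
        quals.append(q)
    return "".join(read), quals

rng = random.Random(42)
# Hypothetical observed qualities per cycle; lower toward the read end.
quality_by_cycle = [[40, 40, 38, 35]] * 5 + [[30, 25, 20, 10]] * 5
read, quals = sequence_read("ACGTACGTAC", quality_by_cycle, rng)
```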
14. VarSim
• Multi-step simulator and validation framework
– 1) simulates perturbed diploid genomes from a
reference by inserting variants (VarSim simulates
SNVs, deletions, insertions, MNPs, complex variants,
tandem duplications and inversions) based on existing
databases and their distribution profiles
– 2) uses a third-party read simulator to generate
sequenced reads (currently configured to use
DWGSIM or ART) from the perturbed genomes
– 3) reads are mapped back to the original reference
genome using a modified vcf2diploid (Rozowsky et al.,
2011) map file (MFF file)
15. VarSim Validation
– Read alignments (from mapping software, e.g.
BWA-MEM) are validated using read header
metadata
– Variants (from variant caller software, e.g.
FreeBayes) are validated against ‘true’ variants
that were inserted into the perturbed genome
– Accuracy of variant calling is reported based on
sensitivity (TPR) and precision (PPV = 1 - FDR), broken
down by variant type and size, as a JSON file with
SVG plots
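The accuracy metrics can be sketched as below. This is a minimal illustration that assumes variants are compared by exact match on (chromosome, position, ref, alt); the matching rules VarSim itself applies are not described here, and the function name is hypothetical.

```python
def variant_calling_accuracy(true_variants, called_variants):
    """Compare called variants against the simulated truth set.

    Assumption: a call is a true positive only if it matches a true
    variant exactly on (chrom, pos, ref, alt).
    """
    truth, calls = set(true_variants), set(called_variants)
    tp = len(truth & calls)   # inserted variants that were called
    fn = len(truth - calls)   # inserted variants that were missed
    fp = len(calls - truth)   # calls with no matching inserted variant
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    # F1 is the harmonic mean of sensitivity and precision.
    total = sensitivity + precision
    f1 = 2 * sensitivity * precision / total if total else 0.0
    return {"TP": tp, "FN": fn, "FP": fp,
            "sensitivity": sensitivity, "precision": precision, "F1": f1}

truth = [("chr22", 100, "A", "G"), ("chr22", 250, "C", "T"),
         ("chr22", 400, "G", "GA"), ("chr22", 900, "TC", "T")]
calls = [("chr22", 100, "A", "G"), ("chr22", 250, "C", "T"),
         ("chr22", 400, "G", "GA"), ("chr22", 700, "G", "C")]
result = variant_calling_accuracy(truth, calls)
```

In this toy example three of the four inserted variants are recovered and one call is spurious, so sensitivity, precision and F1 all come out at 0.75. Breaking the metrics down by variant type or size would amount to calling the function on each subset separately.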
18. Conclusions/Suggestions
• There are no comprehensive evaluations (that I could
find) of DNA/RNA simulators other than the
incomplete SInC comparison.
• However, SInC and VarSim appear to be good
candidates for genome variation and gDNA simulation,
while FluxSim appears to be the only fully realized RNA
simulator.
• A pipeline with SInC or VarSim genome perturbation
combined with FluxSim transcription and library
prep/sequencing might allow validation of RNAseq
tools with biologically complex simulated data.
Variant rediscovery statistics. Percentages of simulated variants rediscovered using GATK and PINDEL are shown for A) SNVs and B) indels, respectively. The rediscovery of indels by size was also evaluated and is given in Additional file 3. The rediscovery percentages of C) heterozygous and D) homozygous SNVs are compared.
Human chromosome 22 from UCSC hg19
Aligned using Novoalign; variants called with GATK and PINDEL
Figure 3. Time profiles of the different simulators used. Time elapsed to perform one complete simulation with default options using a single core across different simulators. A) For chromosome 22 at 15X B) For human whole genome (hg19) at 5X.
Outline of the Flux Simulator pipeline. Provided the genomic sequence of an organism and a representative gene annotation as input, the initial step is a transcriptome simulation (A) to assign each transcript a randomised expression level according to general laws of gene expression. Subsequently, fragmentation (B) and RT (C) are carried out, either by first hydrolysing RNA and then transcribing the fragments into cDNA molecules (B and C, right) or by nebulisation respectively enzymatic digestion after reversely transcribing the entire RNA molecules (B and C, left). The simulated molecules of the primary library then get amplified by in silico PCR (D)—optionally after selecting a certain size range—and the final library then is subjected to simulated sequencing (E), including potential platform and sequencing chemistry specific error models. Finally, read sequences along with their genomic mappings are obtained.
Y0 = expression level of most abundant gene
K = exponent of the power law; governs the slope of the log-log plot
a and b = exponential mRNA decay rate
Supplementary Figure 3: Expression profiles observed in RNA-Seq experiments. The curves show the log-log behaviour of transcript expression in RNA-Seq experiments conducted on cellular transcriptomes of the species M. musculus (blue), A. thaliana (green) and S. cerevisiae (red). Expression values for every gene in a corresponding reference annotation (i.e., the murine RefSeq, the TAIR9 annotation of cress, and the SGD yeast annotation) have been estimated by the number of reads mapping to it, and expression levels have been ranked from high to low (x-axis). Although target cells and RNA-Seq experiment protocols differ substantially, all datasets show highly similar characteristics in their transcript abundance distribution: the nature of Zipf’s Law underlying gene expression can be noted by the largely linear behaviour in logarithmic scale. However, especially for lowly abundant forms, an exponential decay is notable.
VarSim simulation and validation workflow. The germline workflow can be run with or without the somatic workflow
Validation results for some popular secondary analysis tools
F1 = harmonic mean of sensitivity and precision
BWA-MEM was used with GATK realignment
Comparison of simulated reads with experimental evidence in different sequencing protocols. For each experiment, transcripts from a reference annotation of the corresponding species have been classified into short (<1000 nt, left panels), intermediate (1000–2000 nt, centre panels), and long forms (>2000 nt, right panels). Red and orange bars show reads from the experiment that align in sense and antisense, respectively, to the directionality of transcription; the corresponding in silico results are shown as dark and light blue bars, respectively. (A) Read tag distributions from an RNA hydrolysis protocol in M. musculus sequenced on the Illumina GA2 platform. (B) A different hydrolysis experiment carried out with the recent HiSeq2000 technology (Illumina), producing longer reads that exclusively map in sense orientation (so called ‘dir RNA-Seq’). (C) A complementary Illumina experiment employing poly-dT primed RT and subsequent DNase digestion of the (poly-A+) transcriptome of S. cerevisiae. (D) Results from an experiment in A. thaliana where poly-dT primed RT products are fragmented by nebulisation.