Creating a SNP calling pipeline

Potato SNPs

Dan Bolser and David Martin

Next Gen Bug, Dundee
01/18/2010


Aims of the work
1) Learn about handling RNASeq

Create a SNP calling pipeline

2) Select SNPs for genetic mapping

Using Illumina's GoldenGate SNP chip (OPA)


Creating a SNP calling pipeline


Align (using BWA)
1) Index the potato genome assembly
bwa index [-a bwtsw|div|is] [-c] <in.fasta>
2) Perform the alignment
bwa aln [options] <in.fasta> <in.fq>
3) Output results in SAM format (single end)
bwa samse <in.fasta> <in.sai> <in.fq>
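
The three BWA steps above can be chained from a script. The following is a minimal sketch, not the pipeline actually used: the function names and file names are hypothetical, and choosing "-a bwtsw" for the index assumes a large genome. Each step is described as a command plus the file that should capture its standard output, since 'bwa aln' and 'bwa samse' write their results to stdout.

```python
import subprocess


def bwa_single_end_steps(ref_fasta, reads_fq, sai_path, sam_path):
    """Return the BWA steps as (argv, stdout_path) pairs.

    stdout_path is None when the step writes its own output files.
    """
    return [
        # 1) Index the reference (bwtsw is assumed here for a large genome).
        (["bwa", "index", "-a", "bwtsw", ref_fasta], None),
        # 2) Align the reads; the .sai result goes to stdout.
        (["bwa", "aln", ref_fasta, reads_fq], sai_path),
        # 3) Convert the single-end alignment to SAM; also written to stdout.
        (["bwa", "samse", ref_fasta, sai_path, reads_fq], sam_path),
    ]


def run_steps(steps):
    """Run each step in order, redirecting stdout where requested."""
    for argv, stdout_path in steps:
        if stdout_path is None:
            subprocess.run(argv, check=True)
        else:
            with open(stdout_path, "wb") as out:
                subprocess.run(argv, check=True, stdout=out)
```

Keeping the command construction separate from execution makes it possible to print or log the exact commands before running them.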

Align (using Bowtie)
1) Index the potato genome assembly
bowtie-build [options] <in.fasta> <ebwt>
2) Perform the alignment and output results
bowtie [options] <ebwt> <in.fq>

Convert (using SAMtools)
1) Convert SAM to BAM for sorting
samtools view -S -b <in.sam>
2) Sort BAM for SNP calling
samtools sort <in.bam> <out.bam.s>


Alignments are both compressed for long term
storage and sorted for variant discovery.
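
To illustrate what "sorted for variant discovery" means, here is a small sketch of coordinate sorting on SAM text lines. It is not how samtools implements sorting; the function name and the explicit reference order argument are assumptions made for the example. Records are ordered by reference sequence, then by leftmost position, with unmapped reads last.

```python
def coordinate_sort(sam_lines, ref_order):
    """Sort SAM records by (reference rank, position); header lines stay on top."""
    rank = {name: i for i, name in enumerate(ref_order)}
    header = [line for line in sam_lines if line.startswith("@")]
    records = [line for line in sam_lines if not line.startswith("@")]

    def sort_key(line):
        fields = line.split("\t")
        rname, pos = fields[2], int(fields[3])
        # References not in ref_order (e.g. '*' for unmapped reads) sort last.
        return (rank.get(rname, len(rank)), pos)

    return header + sorted(records, key=sort_key)
```

Once records are in this order, all reads covering a given position are adjacent, which is what a pileup-based SNP caller relies on.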


Coverage profiles / Depth vectors


SAMtools...

Dump a coverage profile
samtools mpileup -f <in.fasta> <my.bam.s>
P1 244526 A 10 ...,.,,,.. BBQaàaaa[
P1 244527 A 10 ...,.,,,.. BBZ_`â_a[
P1 244528 C 10 .$.$.,.,,,.. >>RaZàaaa
P1 244529 C 8 .,.,,,.. NaXaaaa`
P1 244530 T 8 .,.,,,.. Xa_aaa`
P1 244531 C 8 .,.,,,.. Rbabbaa
P1 244532 T 9 .,.,,,..^~. EE^^^^^Â
P1 244533 T 9 .,.,,,... BBB
P1 244534 T 9 .$,$.,,,... @@^^^^^Ê
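
The fifth pileup column encodes the read bases: '.' and ',' are matches to the reference on the forward and reverse strand, '^' plus one character marks the start of a read, '$' marks the end of a read, and '+N'/'-N' introduce indels. A minimal sketch of a base counter follows; the function name is hypothetical and the parser ignores the quality column.

```python
import re


def count_pileup_bases(ref_base, read_bases):
    """Count the bases observed at one pileup position, keyed by upper-case base."""
    # Drop read-start markers ('^' followed by a mapping-quality character).
    s = re.sub(r"\^.", "", read_bases)
    # Drop read-end markers.
    s = s.replace("$", "")

    counts = {}
    i = 0
    while i < len(s):
        c = s[i]
        if c in "+-":
            # Skip an indel: sign, length digits, then that many bases.
            j = i + 1
            while j < len(s) and s[j].isdigit():
                j += 1
            i = j + int(s[i + 1:j])
            continue
        if c in ".,":
            base = ref_base.upper()
        elif c == "*":
            i += 1  # placeholder for a base deleted in this read
            continue
        else:
            base = c.upper()
        counts[base] = counts.get(base, 0) + 1
        i += 1
    return counts
```

For example, the read-base string `.$.$.,.,,,..` at a position with reference base C gives ten reads supporting C.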


SAMtools Bio::DB::Sam (BioPerl)
Dump a coverage profile 2


SAMtools Bio::DB::Sam (BioPerl)
P41630
Matches : 9
0233333333333345555555555
666778888888899999999999
999999999999999999999999
999976666666666665444444
44443332211111111000
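
A depth vector like the one above can be computed from alignment intervals. The sketch below is in Python rather than the Bio::DB::Sam Perl API used on the slide. The function names are hypothetical, intervals are assumed to be 0-based and half-open, and rendering each depth as a single digit capped at 9 is an assumption inferred from the output shown.

```python
def depth_vector(length, alignments):
    """Per-base depth over a sequence of the given length.

    alignments is a list of (start, end) intervals, 0-based, end exclusive.
    """
    diff = [0] * (length + 1)
    for start, end in alignments:
        start, end = max(start, 0), min(end, length)
        if start < end:
            diff[start] += 1   # coverage rises where a read begins
            diff[end] -= 1     # and falls just past where it ends
    depth, running = [], 0
    for delta in diff[:length]:
        running += delta
        depth.append(running)
    return depth


def render_depth(depth, cap=9):
    """Render depths as one digit per base, saturating at cap."""
    return "".join(str(min(d, cap)) for d in depth)
```

Using a difference array keeps the cost proportional to the number of reads plus the sequence length, rather than reads times read length.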


mpileup

samtools mpileup collects summary
information in the input BAMs, computes the
likelihood of data given each possible
genotype and stores the likelihoods in the
BCF format.

bcftools view applies the prior and does the
actual calling.

Finally, we filter.
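
The split between "likelihood of the data given each genotype" and "apply the prior and call" can be shown with a toy diploid model. This is a deliberately simplified sketch, not the model SAMtools implements: it assumes a single fixed per-base error rate, independent reads and only two alleles, and the function names and prior values are hypothetical.

```python
import math


def genotype_log_likelihoods(n_ref, n_alt, error=0.01):
    """log P(reads | genotype) for 0/0, 0/1 and 1/1 under a fixed error rate."""
    lls = {}
    for genotype, alt_copies in (("0/0", 0), ("0/1", 1), ("1/1", 2)):
        frac_alt = alt_copies / 2
        # Chance that a single read shows the alternate base.
        p_alt = frac_alt * (1 - error) + (1 - frac_alt) * error
        lls[genotype] = n_alt * math.log(p_alt) + n_ref * math.log(1 - p_alt)
    return lls


def call_genotype(log_likelihoods, priors):
    """Add log priors to the log likelihoods and return the best genotype."""
    posterior = {g: ll + math.log(priors[g]) for g, ll in log_likelihoods.items()}
    return max(posterior, key=posterior.get)
```

With one reference and one alternate read, a flat prior favours the heterozygote, while a prior that makes variants rare keeps the call at homozygous reference. That is the reason the likelihoods are stored separately from the call.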

SNP call
1) Index the potato genome assembly (again!)
samtools faidx in.fasta
2) Run 'mpileup' to generate BCF (binary VCF) format
samtools mpileup -ug -f in.fasta my1.bam.s my2.bam.s > my.raw.bcf

Actually, all we did (I think) is perform a
format conversion (BAM to VCF).

VCF format
A standard format for sequence variation:
SNPs, indels and structural variants.
Compressed and indexed.
Developed for the 1000 Genomes Project.
VCFtools is to VCF what SAMtools is to SAM.
Specification and tools available from
http://vcftools.sourceforge.net
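
A VCF data line is tab-separated, with eight fixed columns: CHROM, POS, ID, REF, ALT, QUAL, FILTER and INFO. The sketch below parses one such line into a dictionary. The function name and the example record in the usage note are hypothetical, and per-sample genotype columns are ignored.

```python
def parse_vcf_line(line):
    """Parse the eight fixed columns of one VCF data line."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, vid, ref, alt, qual, filt, info = fields[:8]

    info_dict = {}
    if info != ".":
        for item in info.split(";"):
            key, sep, value = item.partition("=")
            # INFO entries without '=' are flags, stored as True.
            info_dict[key] = value if sep else True

    return {
        "chrom": chrom,
        "pos": int(pos),
        "id": vid,
        "ref": ref,
        "alt": alt.split(","),   # several alternate alleles are comma-separated
        "qual": None if qual == "." else float(qual),
        "filter": filt,
        "info": info_dict,
    }
```

For a line such as "P1, 244530, ., T, C, 45.0, ., DP=8;INDEL" (tab-separated), the result has pos 244530, alt ["C"], and info {"DP": "8", "INDEL": True}.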

SNP call and filter
1) Call SNPs
bcftools view -bvcg my.raw.bcf > my.var.bcf
2) Filter SNPs
bcftools view my.var.bcf | vcfutils.pl varFilter > my.var.bcf.filt
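
The kind of filtering applied in step 2 can be sketched as a depth and quality filter over VCF text lines. This is an illustration in the spirit of the filter step, not a reimplementation of vcfutils.pl. The function name and the threshold values are hypothetical, and only the QUAL column and the DP entry of the INFO column are inspected.

```python
def filter_vcf_lines(lines, min_depth=2, max_depth=100, min_qual=20.0):
    """Keep header lines plus records whose QUAL and DP pass the thresholds."""
    kept = []
    for line in lines:
        if line.startswith("#"):
            kept.append(line)  # always keep the header
            continue
        fields = line.rstrip("\n").split("\t")
        qual = fields[5]
        info = dict(
            item.split("=", 1) for item in fields[7].split(";") if "=" in item
        )
        if qual == "." or "DP" not in info:
            continue  # cannot be judged, so drop it
        depth = int(info["DP"])
        # Very high depth is filtered too: it often signals repeats or paralogs.
        if float(qual) >= min_qual and min_depth <= depth <= max_depth:
            kept.append(line)
    return kept
```

The upper depth bound matters for RNASeq in particular, where highly expressed genes can pile up far more reads than the average position.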


Aims of the work
1) Learn about handling RNASeq

Create a SNP calling pipeline

2) Select SNPs for genetic mapping



Select SNPs for genetic mapping


SNP chip (OPA) construction

A set of DM SNP positions was provided by
the SolCAP project (RNASeq derived).

A subset was selected for developing OPAs
(Illumina’s SNP chip technology).

OPAs were run, and results have now been
compared to RNASeq.
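
Comparing OPA results with RNASeq calls comes down to counting agreement at shared markers. The sketch below assumes each call set is a dictionary from marker name to an unphased two-letter genotype such as "AG". The function name, the data layout and the marker names used in testing are assumptions for illustration, not the comparison actually performed.

```python
def genotype_concordance(chip_calls, rnaseq_calls):
    """Return (number of agreeing markers, number of shared markers)."""
    shared = sorted(set(chip_calls) & set(rnaseq_calls))
    agreeing = [
        marker
        for marker in shared
        # Genotypes are unphased, so "AG" and "GA" count as the same call.
        if sorted(chip_calls[marker]) == sorted(rnaseq_calls[marker])
    ]
    return len(agreeing), len(shared)
```

Markers called by only one of the two methods are left out of both counts, so the ratio reflects agreement rather than coverage.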


Comparison (using an early SAMtools)

Comparison (using new SAMtools)

Looking into the RNASeq data…


Diagram: the potato genome assembly and two RNASeq read libraries.


A lot more questions to answer…

Track down more ‘strange’ SNPs based on
the expected AFS of the two samples.

Go beyond biallelic SNPs

Check the OPA base...
− Was the right base probed by the chip?
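
One way to approach the "right base" question is to check whether the alleles reported by the chip match the RNASeq alleles directly or only after complementing, which would suggest the assay was designed on the opposite strand. This is a sketch of that single check under the assumption that alleles are given as strings of A, C, G and T. The function name and category labels are hypothetical.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def compare_alleles(chip_alleles, rnaseq_alleles):
    """Classify how the chip allele set relates to the RNASeq allele set."""
    chip = set(chip_alleles)
    seq = set(rnaseq_alleles)
    flipped = {COMPLEMENT[base] for base in chip}

    same = chip == seq
    flip = flipped == seq
    if same and flip:
        return "ambiguous"        # A/T and C/G SNPs look identical on both strands
    if same:
        return "same_strand"
    if flip:
        return "opposite_strand"
    return "mismatch"
```

A/T and C/G SNPs cannot be resolved this way and need flanking sequence to determine the strand.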


Thank you for your patience!


OPAs in 5 steps...
The DNA sample is activated for binding to
paramagnetic particles.

OPAs in 5 steps...
Three oligos are designed for each SNP locus:
two Allele-Specific Oligos (ASOs), one for each
allele of the SNP site, and one Locus-Specific
Oligo (LSO).

OPAs in 5 steps...
Several wash steps remove excess and
mis-hybridized oligos. Extension of the
appropriate ASO and ligation to the LSO joins
information about the genotype to the address
sequence on the LSO.

OPAs in 5 steps...
The single-stranded, dye-labeled DNAs are
hybridized to their complement bead type
through their unique address sequences.

OPAs in 5 steps...
Key to the assay:
Scalable, multiplexing sample preparation
(one tube reaction).
Highly parallel array-based read-out.
High-quality data: average call rates
above 99%.
