The document describes creating a SNP calling pipeline for potato data from RNA-Seq experiments. Key steps included aligning reads to the potato genome using BWA or Bowtie, converting SAM to BAM and sorting, generating coverage profiles with SAMtools, and calling SNPs from the BAM files using SAMtools and bcftools. SNPs identified from the RNA-Seq data were then selected for inclusion on an Illumina GoldenGate SNP chip to genotype samples for genetic mapping. Comparison of the SNP chip results to the original RNA-Seq data was performed to evaluate accuracy. Remaining questions around discrepancies in the data were noted for further investigation.
Aims of the work
1) Learn about handling RNASeq
Create a SNP calling pipeline
2) Select SNPs for genetic mapping
Using Illumina's GoldenGate SNP chip (OPA)
Convert (using SAMtools)
1) Convert SAM to BAM for sorting
samtools view -S -b <in.sam> > <in.bam>
2) Sort BAM for SNP calling
samtools sort <in.bam> <out.bam.s>
Alignments are both compressed for long-term
storage and sorted for variant discovery.
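As a rough illustration of what "sorted for variant discovery" means, the sketch below orders a few toy SAM records by reference name and then by position. The read names, reference names and positions are invented for the example, and real SAMtools orders references as they appear in the file header rather than alphabetically, so treat this as a simplification.

```python
# Toy SAM records (tab-separated); all values are invented for illustration.
sam_lines = [
    "read3\t0\tchr02\t150\t60\t4M\t*\t0\t0\tACGT\tIIII",
    "read1\t0\tchr01\t900\t60\t4M\t*\t0\t0\tTTGA\tIIII",
    "read2\t16\tchr01\t120\t60\t4M\t*\t0\t0\tGGCA\tIIII",
]


def coordinate_key(sam_line):
    """Return (reference name, leftmost position) for one SAM record."""
    fields = sam_line.split("\t")
    # RNAME is the 3rd SAM column, POS (1-based) is the 4th.
    return fields[2], int(fields[3])


def coordinate_sort(lines):
    """Sort records by reference name, then by position (simplified ordering)."""
    return sorted(lines, key=coordinate_key)


sorted_names = [line.split("\t")[0] for line in coordinate_sort(sam_lines)]
print(sorted_names)
```

With reads in coordinate order, a variant caller can walk along the reference once and see every read covering a given position together.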
mpileup
samtools mpileup collects summary
information in the input BAMs, computes the
likelihood of data given each possible
genotype and stores the likelihoods in the
BCF format.
bcftools view applies the prior and does the
actual calling.
Finally, we filter.
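The two-stage idea (likelihoods first, prior and call second) can be sketched for a single site as follows. This is a deliberately simplified diploid model with one fixed per-base error rate and a flat prior on non-reference genotypes; it is not the model SAMtools or bcftools actually implement, and the function names, error rate and prior are assumptions made for the example.

```python
import math
from itertools import combinations_with_replacement

BASES = "ACGT"


def base_prob(observed, allele, error_rate):
    """P(observed base | true allele) under a uniform error model (assumption)."""
    return 1.0 - error_rate if observed == allele else error_rate / 3.0


def genotype_log_likelihoods(observed_bases, error_rate=0.01):
    """Step 1: log P(reads | genotype) for the 10 unordered diploid genotypes."""
    result = {}
    for a1, a2 in combinations_with_replacement(BASES, 2):
        log_lik = 0.0
        for base in observed_bases:
            # Each read is assumed to come from either allele with probability 0.5.
            p = 0.5 * base_prob(base, a1, error_rate) + 0.5 * base_prob(base, a2, error_rate)
            log_lik += math.log(p)
        result[a1 + a2] = log_lik
    return result


def call_genotype(observed_bases, ref_base, variant_prior=0.001, error_rate=0.01):
    """Step 2: add a log prior to each likelihood and pick the best genotype."""
    likelihoods = genotype_log_likelihoods(observed_bases, error_rate)
    hom_ref = ref_base * 2
    scores = {}
    for genotype, log_lik in likelihoods.items():
        # Hypothetical prior: every non-reference genotype is equally rare.
        prior = 1.0 - 9 * variant_prior if genotype == hom_ref else variant_prior
        scores[genotype] = log_lik + math.log(prior)
    return max(scores, key=scores.get)


print(call_genotype("AAAAAAAAAA", "A"))
print(call_genotype("AAAAAGGGGG", "A"))
```

Filtering would then discard calls with low depth or low quality; the thresholds depend on the data and are not shown here.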
SNP call
1) Index the potato genome assembly (again!)
samtools faidx in.fasta
2) Run 'mpileup' to generate BCF (binary VCF) output
samtools mpileup -ug -f in.fasta my1.bam.s my2.bam.s > my.raw.bcf
Actually, all we did (I think) is perform a
format conversion (BAM to BCF, the binary form of VCF).
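For scripting the two commands on this slide, one option is to assemble them as argument lists and print them before running anything. The sketch below is a dry run only: it uses the flags and file names exactly as shown on the slide, the helper names are hypothetical, and newer SAMtools releases may expect different syntax.

```python
import shlex


def mpileup_command(reference_fasta, bam_paths, out_bcf):
    """Return the argv list and stdout target for the mpileup step on the slide."""
    argv = ["samtools", "mpileup", "-ug", "-f", reference_fasta] + list(bam_paths)
    return argv, out_bcf


def as_shell_line(argv, stdout_path):
    """Render an argv list plus an output redirect as one shell line."""
    quoted = " ".join(shlex.quote(arg) for arg in argv)
    return quoted + " > " + shlex.quote(stdout_path)


index_line = " ".join(["samtools", "faidx", "in.fasta"])
argv, out_path = mpileup_command("in.fasta", ["my1.bam.s", "my2.bam.s"], "my.raw.bcf")
mpileup_line = as_shell_line(argv, out_path)

# Dry run: print the commands instead of executing them.
print(index_line)
print(mpileup_line)
```

To actually execute a step, the argv list could be passed to subprocess.run with stdout opened on the output file.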
VCF format
A standard format for sequence variation:
SNPs, indels and structural variants.
Compressed and indexed.
Developed for the 1000 Genomes Project.
VCFtools is to VCF what SAMtools is to SAM.
Specification and tools available from
http://vcftools.sourceforge.net
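To make the format concrete, here is a minimal sketch that splits one VCF data line into its eight fixed columns (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO). The example record and its INFO keys are invented, and the parser ignores header lines and per-sample genotype columns.

```python
def parse_vcf_line(line):
    """Parse the eight fixed columns of one VCF data line into a dict."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, variant_id, ref, alt, qual, filt, info = fields[:8]

    info_dict = {}
    for entry in info.split(";"):
        if entry in ("", "."):
            continue  # "." means the INFO column is empty
        key, sep, value = entry.partition("=")
        info_dict[key] = value if sep else True  # flags have no "=value"

    return {
        "chrom": chrom,
        "pos": int(pos),  # 1-based position
        "id": variant_id,
        "ref": ref,
        "alt": alt.split(","),  # several alternate alleles are comma-separated
        "qual": None if qual == "." else float(qual),
        "filter": filt,
        "info": info_dict,
    }


def is_snp(record):
    """True when REF and every ALT allele are single bases."""
    return len(record["ref"]) == 1 and all(len(a) == 1 for a in record["alt"])


# Invented example record, not taken from the potato data.
example = "chr01\t1234\t.\tA\tG\t50\tPASS\tDP=20;AF1=0.5"
record = parse_vcf_line(example)
print(record["chrom"], record["pos"], record["ref"], record["alt"], is_snp(record))
```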
Select SNPs for genetic mapping
Using Illumina's GoldenGate SNP chip (OPA)
SNP chip (OPA) construction
A set of DM SNP positions was provided by
the SolCAP project (RNASeq derived).
A subset was selected for developing OPAs
(Illumina’s SNP chip technology).
OPAs were run, and results have now been
compared to RNASeq.
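One simple way to frame the comparison is a per-SNP concordance rate between the RNASeq genotype calls and the OPA genotype calls. The sketch below assumes both sets of calls are available as dictionaries keyed by a SNP identifier; the identifiers and genotypes are made up, and failed chip assays are represented as None.

```python
def concordance(rnaseq_calls, chip_calls):
    """Return (fraction of matching genotypes, list of discordant SNP ids).

    Only SNPs called by both platforms are compared; allele order within a
    genotype is ignored, so "AG" and "GA" count as the same call.
    """
    shared = [
        snp for snp in rnaseq_calls
        if chip_calls.get(snp) is not None
    ]
    discordant = [
        snp for snp in shared
        if sorted(rnaseq_calls[snp]) != sorted(chip_calls[snp])
    ]
    if not shared:
        return float("nan"), discordant
    return 1.0 - len(discordant) / len(shared), discordant


# Hypothetical genotype calls for four SNPs.
rnaseq = {"snp1": "AA", "snp2": "AG", "snp3": "GG", "snp4": "CT"}
chip = {"snp1": "AA", "snp2": "GA", "snp3": "AG", "snp4": None}

rate, mismatches = concordance(rnaseq, chip)
print(round(rate, 3), mismatches)
```

The discordant list is the natural starting point for chasing down individual SNPs that disagree between the two platforms.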
A lot more questions to answer…
Track down more ‘strange’ SNPs based on
the expected AFS of the two samples.
Go beyond biallelic SNPs
Check the OPA base...
− Was the right base probed by the chip?
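One possible check for the "right base" question is whether the alleles reported by the chip match the RNASeq alleles directly or only after complementing, which would suggest the assay was designed on the opposite strand. The sketch below is an assumption about how such a check might look, not the procedure used in this work; the category labels are hypothetical.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def classify_alleles(rnaseq_alleles, chip_alleles):
    """Compare two allele sets and report how they relate.

    Returns one of: "same_strand", "opposite_strand", "ambiguous", "mismatch".
    A/T and C/G SNPs are their own complement, so strand cannot be inferred.
    """
    rna = set(rnaseq_alleles)
    chip = set(chip_alleles)
    flipped = {COMPLEMENT[base] for base in chip}

    if rna == chip and rna == flipped:
        return "ambiguous"
    if rna == chip:
        return "same_strand"
    if rna == flipped:
        return "opposite_strand"
    return "mismatch"


print(classify_alleles("AG", "AG"))
print(classify_alleles("AG", "TC"))
print(classify_alleles("AT", "AT"))
print(classify_alleles("AG", "AC"))
```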
OPAs in 5 steps...
The DNA sample is
activated for binding
to paramagnetic
particles.
46.
OPAs in 5 steps...
Three oligos are
designed for each
SNP locus. Two are
Allele-Specific Oligos
(ASOs), one for each
allele of the SNP site;
the third is a
Locus-Specific Oligo (LSO).
47.
OPAs in 5 steps...
Several wash steps
remove excess and
mis-hybridized oligos.
Extension of the
appropriate ASO and
ligation to the LSO joins
information about the
genotype to the
address sequence on
the LSO.
48.
OPAs in 5 steps...
The single-stranded,
dye-labeled DNAs
are hybridized to
their complement
bead type through
their unique address
sequences.
49.
OPAs in 5 steps...
Key to the assay:
Scalable, multiplexing
sample preparation
(one tube reaction).
Highly parallel array-
based read-out.
High-quality data:
Average call rates
above 99%.
Editor's Notes
#47 All three oligo sequences contain regions of genomic complementarity and universal PCR primer sites; the LSO also contains a unique address sequence that targets a particular bead type. Up to 1,536 SNPs may be interrogated simultaneously in this manner. During the primer hybridization process, the assay oligos hybridize to the genomic DNA sample bound to paramagnetic particles. Because hybridization occurs prior to any amplification steps, no amplification bias can be introduced into the assay.
#48 Extension of the appropriate ASO and ligation of the extended product to the LSO joins information about the genotype present at the SNP site to the address sequence on the LSO. This allele-specific primer extension (ASPE) step preferentially extends the correctly matched ASO (matched at its 3' end) up to the 5' end of the LSO primer.
#49 There is a one-to-one mapping between an address sequence on the array and the locus being scored. As a result of this labeling scheme, the PCR product consists of double-stranded DNA: one strand, containing the complement to the Illumicode, is labeled with either Cy3 or Cy5 in an allele-specific manner, and the complementary strand is labeled with biotin. The biotinylated strand is removed and the single, fluorescently labeled strand is hybridized to the BeadArray.