Imgc2011 bioinformatics tutorial

IMGS 2011 Bioinformatics Workshop
Deanna Church, NCBI
Carol Bult, The Jackson Laboratory

Intro
Sequencing Technology: life in the fast lane
Alignments: things to consider
File formats: everything you always wanted to know but were afraid to ask
Tools: pick the right one for the job at hand

Figure: Cost and Throughput (Gigabases; Cost per Kb). Source: Lucinda Fulton, The Genome Center at Washington University

Sequencing Technologies http://www.geospiza.com/finchtalk/uploaded_images/plates-and-slides-718301.png

Sequence “Space”
Roche 454 – Flow space
  Measures the pyrophosphate released when a nucleotide is added to a growing DNA chain
  Flow space describes the sequence in terms of these base incorporations
  http://www.youtube.com/watch?v=bFNjxKHP8Jc
AB SOLiD – Color space
  Sequencing by DNA ligation, using synthetic DNA molecules that contain two nested known bases and a fluorescent dye
  Each base is sequenced twice
  http://www.youtube.com/watch?v=nlvyF8bFDwM&feature=related
Illumina/Solexa – Base space
  Single-base extensions of fluorescently labeled nucleotides with protected 3' OH groups
  Sequencing via cycles of base addition and detection, followed by deprotection of the 3' OH
  http://www.youtube.com/watch?v=77r5p8IBwJk&feature=related
GenomeTV – Next Generation Sequencing (lecture)
  http://www.youtube.com/watch?v=g0vGrNjpyA8&feature=related
http://finchtalk.geospiza.com/2008/03/color-space-flow-space-sequence-space_23.html

Global and local alignments
Optimal global alignment (Needleman-Wunsch): sequences align essentially from end to end
Optimal local alignment (Smith-Waterman): sequences align only in small, isolated regions
References
Needleman and Wunsch (1970). J. Mol. Biol. 48, 443-453.
Smith and Waterman (1981). J. Mol. Biol. 147, 195-197.
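The difference between the two approaches can be sketched with a small dynamic-programming example. This is only an illustration: the scoring scheme (match +1, mismatch -1, gap -1) and the function names are assumptions, not taken from the slides, and real aligners use substitution matrices and affine gap penalties.

```python
def global_score(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch: best score when a and b are aligned end to end."""
    # Row 0: aligning a prefix of b against nothing costs one gap per base.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def local_score(a, b, match=1, mismatch=-1, gap=-1):
    """Smith-Waterman: best score over all pairs of substrings of a and b."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # The floor at 0 lets an alignment restart anywhere.
            cell = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best


if __name__ == "__main__":
    # Two made-up sequences that share only the core "ACGT".
    print(global_score("TTACGTAA", "GGACGTCC"))
    print(local_score("TTACGTAA", "GGACGTCC"))
```

The only structural difference is the floor at zero in the local version, which is what allows it to report an isolated matching region instead of forcing the flanking mismatches into the score.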

Hashing methods
Word size = 3 (configurable)
Query sequence: MVRRLPERTSTPACE
Words: MVR VRR RRL RLP LPE PER ERT RTS TST STP TPA PAC ACE
References
Wilbur & Lipman (1983), PNAS 80, 726-30
Lipman & Pearson (1985), Science 227, 1435-1441
Pearson & Lipman (1988), PNAS 85, 2444-2448
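The word decomposition on this slide can be reproduced in a few lines. This is a minimal sketch; the function names `words` and `build_index` are illustrative and do not come from any particular aligner.

```python
def words(sequence, word_size=3):
    """Return every overlapping word of length word_size, in order."""
    return [sequence[i:i + word_size]
            for i in range(len(sequence) - word_size + 1)]


def build_index(sequence, word_size=3):
    """Map each word to the list of 0-based positions where it starts."""
    index = {}
    for position, word in enumerate(words(sequence, word_size)):
        index.setdefault(word, []).append(position)
    return index


if __name__ == "__main__":
    query = "MVRRLPERTSTPACE"
    print(" ".join(words(query)))
    print(build_index(query)["PER"])
```

Looking up words in such an index is what lets hashing methods find candidate matches without comparing every position of the query against every position of the target.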

http://www.slideshare.net/thomaskeane/eccb-2010-nextgen-sequencing-tutorial

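Sensitivity vs. Specificity
Sensitivity = fraction of actual positives that are identified as positive
Specificity = fraction of actual negatives that are identified as negative
                   Predicted positive     Predicted negative
Actual positive    true positive (TP)     false negative (FN)
Actual negative    false positive (FP)    true negative (TN)
Sensitivity = TP/(TP+FN)
Specificity = TN/(TN+FP)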
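As a worked example of the two formulas, the sketch below uses made-up counts (they are not from the slides) for an aligner evaluated on reads whose true placement is known.

```python
def sensitivity(tp, fn):
    """Fraction of actual positives that were called positive: TP/(TP+FN)."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """Fraction of actual negatives that were called negative: TN/(TN+FP)."""
    return tn / (tn + fp)


if __name__ == "__main__":
    # Hypothetical counts, chosen only to illustrate the arithmetic.
    tp, fn, tn, fp = 90, 10, 950, 50
    print(sensitivity(tp, fn))
    print(specificity(tn, fp))
```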

Richa Agarwala MHC Alternate locus Alignment to chr6

Tools
Alignments: BLAST (not for NGS), BWA, Bowtie, Maq, …
Transcriptomics: TopHat, Cufflinks, …
Variant calling: ssahaSNP, Mosaic, …
Counting (ChIP-Seq, etc.): FindPeaks, PeakSeq

Genome Workbench http://www.ncbi.nlm.nih.gov/projects/gbench/

“Standard” File formats
Sequence containers: FASTA, FASTQ, BAM/SAM
Alignments: BAM/SAM, MAF
Annotation: BED, GFF/GTF/GFF3, WIG
Variation: VCF, GVF

FASTQ: Data Format
Text based
Encodes sequence calls and quality scores with ASCII characters
Stores minimal information about the sequence read
4 lines per sequence
  Line 1: begins with “@”, followed by the sequence identifier and an optional description
  Line 2: the sequence
  Line 3: begins with “+”, followed by the sequence identifier and description (both are optional)
  Line 4: encoding of quality scores for the sequence in line 2
References/Documentation
http://maq.sourceforge.net/fastq.shtml
Cock et al. (2009). Nuc Acids Res 38:1767-1771.
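The four-line layout described above maps directly onto a small reader. This is a minimal sketch that assumes each record occupies exactly four lines (no wrapped sequences); the function name `read_fastq` and the example record are made up for illustration.

```python
import io


def read_fastq(handle):
    """Yield (identifier, sequence, quality) tuples from a 4-line FASTQ handle."""
    while True:
        header = handle.readline()
        if not header:
            return  # end of file
        header = header.rstrip("\r\n")
        if not header:
            continue  # tolerate blank lines between records
        sequence = handle.readline().rstrip("\r\n")
        plus = handle.readline().rstrip("\r\n")
        quality = handle.readline().rstrip("\r\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("malformed FASTQ record: " + header)
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality lengths differ: " + header)
        yield header[1:], sequence, quality


if __name__ == "__main__":
    # Hypothetical record, not taken from the slides.
    example = "@read1 example\nGATTACA\n+\nIIIIIII\n"
    for record in read_fastq(io.StringIO(example)):
        print(record)
```

For real projects an established parser is usually a better choice, because FASTQ files in the wild can wrap sequence and quality across several lines.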

FASTQ Example
For analysis, it may be necessary to convert to the Sanger form of FASTQ. For example, Illumina stores quality scores ranging from 0-62, whereas Sanger quality scores range from 0-93. Solexa quality scores must also be converted to PHRED quality scores, because they are computed on a different scale.
FASTQ example from: Cock et al. (2009). Nuc Acids Res 38:1767-1771.
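The conversions mentioned here can be sketched as follows, using the score mapping and ASCII offsets described in Cock et al.: Sanger FASTQ stores PHRED scores with an ASCII offset of 33, while Solexa and early Illumina FASTQ use an offset of 64. The function names are illustrative, and the caller is assumed to know which variant the input file uses.

```python
import math


def solexa_to_phred(q_solexa):
    """Convert one Solexa score to the nearest integer PHRED score."""
    return round(10 * math.log10(10 ** (q_solexa / 10) + 1))


def illumina13_to_sanger(quality_string):
    """Re-encode PHRED scores from ASCII offset 64 to ASCII offset 33."""
    return "".join(chr(ord(c) - 64 + 33) for c in quality_string)


def solexa_to_sanger(quality_string):
    """Convert Solexa scores (offset 64) to Sanger PHRED scores (offset 33)."""
    return "".join(chr(solexa_to_phred(ord(c) - 64) + 33)
                   for c in quality_string)


if __name__ == "__main__":
    # 'h' encodes a score of 40 at offset 64; ';' encodes Solexa -5.
    print(illumina13_to_sanger("hhh"))
    print(solexa_to_sanger(";h"))
```

The two scales agree closely at high quality and differ mostly for low scores, which is why the Solexa conversion matters most for poor-quality bases.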

SAM (Sequence Alignment/Map)
It may not be necessary to align reads from scratch; you can instead use existing alignments in SAM format
SAM is the output of aligners that map reads to a reference genome
Tab delimited, with a header section and an alignment section
Header lines begin with @ (the header section is optional)
Alignment section has 11 mandatory fields
BAM is the binary format of SAM
http://samtools.sourceforge.net/

Mandatory Alignment Fields http://samtools.sourceforge.net/SAM1.pdf
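A minimal sketch of reading the 11 mandatory fields from one alignment line is shown below. The field names follow the current SAM specification (older versions of the specification call RNEXT, PNEXT and TLEN by the names MRNM, MPOS and ISIZE); the `SamRecord` type, the helper functions and the example line are made up for illustration. Header lines beginning with @ are assumed to have been skipped by the caller.

```python
from collections import namedtuple

SamRecord = namedtuple(
    "SamRecord",
    "qname flag rname pos mapq cigar rnext pnext tlen seq qual tags",
)


def parse_sam_line(line):
    """Split one tab-delimited alignment line into its mandatory fields."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) < 11:
        raise ValueError("expected at least 11 tab-delimited fields")
    return SamRecord(
        qname=fields[0],       # read name
        flag=int(fields[1]),   # bitwise flag
        rname=fields[2],       # reference sequence name
        pos=int(fields[3]),    # 1-based leftmost mapping position
        mapq=int(fields[4]),   # mapping quality
        cigar=fields[5],       # CIGAR string
        rnext=fields[6],       # reference name of the mate
        pnext=int(fields[7]),  # position of the mate
        tlen=int(fields[8]),   # observed template length
        seq=fields[9],         # read sequence
        qual=fields[10],       # base qualities (PHRED + 33)
        tags=fields[11:],      # optional TAG:TYPE:VALUE fields
    )


def is_unmapped(record):
    """Flag bit 0x4 marks a read that has no alignment."""
    return bool(record.flag & 0x4)


def is_reverse(record):
    """Flag bit 0x10 marks a read aligned to the reverse strand."""
    return bool(record.flag & 0x10)


if __name__ == "__main__":
    # Hypothetical alignment line, not taken from the slides.
    line = "\t".join(["read1", "16", "chrX", "35000001", "60", "7M",
                      "*", "0", "0", "GATTACA", "IIIIIII", "NM:i:0"])
    record = parse_sam_line(line)
    print(record.rname, record.pos, is_reverse(record))
```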

Alignment Examples Alignments in SAM format http://samtools.sourceforge.net/SAM1.pdf

Valid BED files
4-column example:
chr1 86114265 86116346 nsv433165
chr2 1841774 1846089 nsv433166
chr16 2950446 2955264 nsv433167
chr17 14350387 14351933 nsv433168
chr17 32831694 32832761 nsv433169
chr17 32831694 32832761 nsv433170
chr18 61880550 61881930 nsv433171
6-column example:
chr1 16759829 16778548 chr1:21667704 270866 -
chr1 16763194 16784844 chr1:146691804 407277 +
chr1 16763194 16784844 chr1:144004664 408925 -
chr1 16763194 16779513 chr1:142857141 291416 -
chr1 16763194 16779513 chr1:143522082 293473 -
chr1 16763194 16778548 chr1:146844175 284555 -
chr1 16763194 16778548 chr1:147006260 284948 -
chr1 16763411 16784844 chr1:144747517 405362 +
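The examples above can be read with a short parser. One detail worth making explicit is that BED coordinates are 0-based and half-open (the start is counted from 0 and the end is not included), so the feature length is simply end minus start. This is a minimal sketch; the function names are illustrative.

```python
def parse_bed_line(line):
    """Parse a BED line with 3 to 6 whitespace-delimited columns."""
    fields = line.split()
    if len(fields) < 3:
        raise ValueError("BED requires at least chrom, start and end")
    record = {
        "chrom": fields[0],
        "start": int(fields[1]),  # 0-based, inclusive
        "end": int(fields[2]),    # 0-based, exclusive
    }
    if len(fields) > 3:
        record["name"] = fields[3]
    if len(fields) > 4:
        record["score"] = int(fields[4])
    if len(fields) > 5:
        record["strand"] = fields[5]
    return record


def feature_length(record):
    """Length in bases; no +1 is needed for half-open intervals."""
    return record["end"] - record["start"]


def to_one_based(record):
    """Format as a 1-based, inclusive region string (genome-browser style)."""
    return "{}:{}-{}".format(record["chrom"], record["start"] + 1, record["end"])


if __name__ == "__main__":
    record = parse_bed_line("chr1 86114265 86116346 nsv433165")
    print(feature_length(record), to_one_based(record))
```

Mixing up 0-based BED coordinates with the 1-based coordinates used by SAM and GFF is a common source of off-by-one errors when combining files.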

Mouse chrX: 35,000,000-36,000,000

Mouse chrX: 35,000,000-36,000,000 X MGSCv3 Build 36

GRCh37 hg19 Zv7 danRer5 MGSCv37 mm8 NCBIM37

Assemblies with the same name aren’t always the same chr21:8,913,216-9,246,964

Assemblies with the same name aren’t always the same Zv7 chr21:8,913,216-9,246,964 X Mouse Build 36 chrX

Tutorial Web Site http://www.ncbi.nlm.nih.gov/staff/church/GenomeAnalysis/index.shtml This site will be accessible after the meeting. Check back for updates and new tutorials.

RNA Seq Workflow
Convert data to FASTQ
Upload files to Galaxy
Quality Control
  Throw out low quality sequence reads, etc.
Map reads to a reference genome
  Many algorithms available
  Trade off between speed and sensitivity
Data summarization
  Associating alignments with genome annotations
  Counts
Data Visualization
Statistical Analysis
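The quality control step can be sketched as a simple filter on mean base quality. The thresholds, function names and example reads below are assumptions chosen for illustration, not recommendations from the workshop, and the qualities are assumed to be in Sanger encoding (PHRED + 33).

```python
def mean_phred(quality_string, offset=33):
    """Average PHRED score of a read; 0.0 for an empty quality string."""
    if not quality_string:
        return 0.0
    return sum(ord(c) - offset for c in quality_string) / len(quality_string)


def filter_reads(records, min_mean_quality=20, min_length=1):
    """Yield only reads that are long enough and of sufficient mean quality."""
    for name, sequence, quality in records:
        if len(sequence) >= min_length and mean_phred(quality) >= min_mean_quality:
            yield name, sequence, quality


if __name__ == "__main__":
    # Hypothetical reads: 'I' encodes PHRED 40, '#' encodes PHRED 2.
    reads = [("good", "ACGT", "IIII"), ("bad", "ACGT", "####")]
    for read in filter_reads(reads):
        print(read)
```

Dedicated QC tools also trim low-quality ends and remove adapters, so a filter like this covers only the simplest case.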

Typical RNA-Seq Project Work Flow
Tissue Sample → Total RNA → mRNA → cDNA → Sequencing → FASTQ file → QC → TopHat → Cufflinks → Gene/Transcript/Exon Expression → Visualization → Statistical Analysis
JAX Computational Sciences Service

TopHat http://tophat.cbcb.umd.edu/
TopHat is better suited to aligning RNA-Seq data than general-purpose aligners such as Maq or BWA because it takes splicing into account during the alignment process.
Figure from: Trapnell et al. (2010). Nature Biotechnology 28:511-515.
Trapnell et al. (2009). Bioinformatics 25:1105-1111.

TopHat is built on the Bowtie alignment algorithm. Trapnell C et al. Bioinformatics 2009;25:1105-1111

Cufflinks http://cufflinks.cbcb.umd.edu/
Assembles transcripts,
Estimates their abundances, and
Tests for differential expression and regulation in RNA-Seq samples
Trapnell et al. (2010). Nature Biotechnology 28:511-515.

Galaxy (see Tutorial 1)
http://main.g2.bx.psu.edu/
Build and share data and analysis workflows
No programming experience required
Strong and growing development and user community

Short Read Archive http://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi? Short Read Archive Handbook http://www.ncbi.nlm.nih.gov/books/NBK47528/

Aspera Connect http://www.asperasoft.com/en/products/client_software_2/aspera_connect_8 High performance file transfer for getting data from the Short Read Archive

SRA Toolkit http://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=software

Galaxy on the Cloud
Create an Amazon Web Services (AWS) account
Sign up for Amazon Elastic Compute Cloud (EC2) and Amazon Simple Storage Service (S3)
Use the AWS Management Console to start a master EC2 instance
Use the Galaxy Cloud web interface to manage the cluster
Step-by-step instructions are here: https://bitbucket.org/galaxy/galaxy-central/wiki/cloud
A screencast demonstrating the sign-up process is here: https://bitbucket.org/galaxy/galaxy-central/wiki/cloud
Afgan E., Baker D., Coraor N., Chapman B., Nekrutenko A., Taylor J. (2010) BMC Bioinformatics. 11:2010.

Why Go to the Cloud?
File storage and compute needs are much greater for next-gen sequence data
The Amazon cloud provides a scalable, cost-effective solution
Afgan E., Baker D., Coraor N., Chapman B., Nekrutenko A., Taylor J. (2010) BMC Bioinformatics. 11:2010.
