2. Intro. Sequencing Technology: life in the fast lane. Alignments: things to consider. File formats: everything you always wanted to know but were afraid to ask. Tools: pick the right one for the job at hand.
5. Sequence “Space”. Roche 454 – Flow space: Measures the pyrophosphate released when a nucleotide is added to a growing DNA chain. Flow space describes sequence in terms of these base incorporations. http://www.youtube.com/watch?v=bFNjxKHP8Jc AB SOLiD – Color space: Sequencing by DNA ligation via synthetic DNA molecules that contain two nested known bases with a fluorescent dye. Each base is sequenced twice. http://www.youtube.com/watch?v=nlvyF8bFDwM&feature=related Illumina/Solexa – Base space: Single-base extensions of fluorescent-labeled nucleotides with protected 3' OH groups. Sequencing via cycles of base addition/detection followed by deprotection of the 3' OH. http://www.youtube.com/watch?v=77r5p8IBwJk&feature=related GenomeTV – Next Generation Sequencing (lecture) http://www.youtube.com/watch?v=g0vGrNjpyA8&feature=related http://finchtalk.geospiza.com/2008/03/color-space-flow-space-sequence-space_23.html
6. Global and local alignments. Optimal global alignment (Needleman-Wunsch): sequences align essentially from end to end. Optimal local alignment (Smith-Waterman): sequences align only in small, isolated regions. References: Needleman and Wunsch (1970). J. Mol. Biol. 48, 443-453. Smith and Waterman (1981). J. Mol. Biol. 147, 195-197.
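The difference between the two alignment modes can be sketched with one dynamic-programming routine. This is a minimal illustration, not the scoring scheme of any particular aligner: the function name `alignment_score`, the scores (match = 1, mismatch = -1) and the linear gap penalty (-1) are assumptions chosen for readability, and only the optimal score is returned, not the alignment itself.

```python
def alignment_score(a, b, match=1, mismatch=-1, gap=-1, local=False):
    """Return the optimal alignment score of strings a and b.

    local=False: Needleman-Wunsch style, the alignment must span both sequences.
    local=True:  Smith-Waterman style, the best-scoring sub-region wins.
    Scoring values are illustrative assumptions.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]

    if not local:
        # Global mode: leading gaps are penalized in the first row and column.
        for i in range(1, rows):
            score[i][0] = i * gap
        for j in range(1, cols):
            score[0][j] = j * gap

    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = score[i - 1][j] + gap
            left = score[i][j - 1] + gap
            cell = max(diag, up, left)
            if local:
                # Local mode: a cell never drops below zero, so an alignment
                # can start fresh anywhere in the matrix.
                cell = max(cell, 0)
                best = max(best, cell)
            score[i][j] = cell

    # Global score sits in the bottom-right cell; local score is the matrix maximum.
    return best if local else score[-1][-1]


global_score = alignment_score("AAAGATTACATTT", "CCCGATTACAGGG")
local_score = alignment_score("AAAGATTACATTT", "CCCGATTACAGGG", local=True)
```

With these two made-up sequences, the shared core GATTACA dominates the local score, while the global score is pulled down by the unrelated flanks.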
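15. Sensitivity vs. Specificity. Sensitivity = fraction of actual positives that are correctly identified as positive. Specificity = fraction of actual negatives that are correctly identified as negative. Confusion matrix (actual vs. predicted): actual positive, predicted positive = TP; actual positive, predicted negative = FN; actual negative, predicted positive = FP; actual negative, predicted negative = TN. Sensitivity = TP/(TP+FN). Specificity = TN/(TN+FP).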
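As a worked example of the two formulas, the sketch below computes both from the four confusion-matrix counts. The counts are invented for illustration and do not come from any real benchmark.

```python
def sensitivity(tp, fn):
    """Fraction of actual positives that were called positive: TP / (TP + FN)."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """Fraction of actual negatives that were called negative: TN / (TN + FP)."""
    return tn / (tn + fp)


# Hypothetical counts, for illustration only.
tp, fn, tn, fp = 90, 10, 950, 50

sens = sensitivity(tp, fn)  # 90 / 100
spec = specificity(tn, fp)  # 950 / 1000
```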
19. “Standard” File formats. Sequence containers: FASTA, FASTQ, BAM/SAM. Alignments: BAM/SAM, MAF. Annotation: BED, GFF/GTF/GFF3, WIG. Variation: VCF, GVF.
20. FASTQ: Data Format. FASTQ is text based. It encodes sequence calls and quality scores with ASCII characters and stores minimal information about the sequence read. 4 lines per sequence. Line 1: begins with “@”, followed by the sequence identifier and an optional description. Line 2: the sequence. Line 3: begins with “+”, optionally followed by the sequence identifier and description. Line 4: encoding of the quality scores for the sequence in line 2. References/Documentation: http://maq.sourceforge.net/fastq.shtml Cock et al. (2009). Nuc Acids Res 38:1767-1771.
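The four-line layout described above can be read with a few lines of Python. This is a minimal sketch that assumes strict four-line records (no wrapped sequence lines and no blank lines between records); the function name `read_fastq` and the example record are made up for illustration.

```python
import io


def read_fastq(handle):
    """Yield (identifier, sequence, quality) tuples from a 4-line-per-record FASTQ handle."""
    while True:
        header = handle.readline().rstrip("\r\n")
        if not header:
            return  # end of file
        seq = handle.readline().rstrip("\r\n")
        plus = handle.readline().rstrip("\r\n")
        qual = handle.readline().rstrip("\r\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("not a 4-line FASTQ record: " + header)
        if len(seq) != len(qual):
            raise ValueError("sequence and quality lengths differ: " + header)
        # Drop the leading "@"; keep the identifier and optional description together.
        yield header[1:], seq, qual


# Invented single-record example.
example = "@read1 example\nGATTACA\n+\nIIIIIII\n"
records = list(read_fastq(io.StringIO(example)))
```

Checking that lines 2 and 4 have equal length is a cheap way to catch truncated files before they reach an aligner.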
21. FASTQ Example For analysis, it may be necessary to convert to the Sanger form of FASTQ…For example, Illumina stores quality scores ranging from 0-62; Sanger quality scores range from 0-93. Solexa quality scores have to be converted to PHRED quality scores. FASTQ example from: Cock et al. (2009). Nuc Acids Res 38:1767-1771.
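The conversions mentioned above can be sketched as follows. The ASCII offsets (33 for Sanger, 64 for Solexa and Illumina 1.3+) and the Solexa-to-PHRED formula are taken from Cock et al., not from the slide itself, and the function names are hypothetical. Treat this as an illustration rather than a replacement for a tested converter.

```python
import math


def illumina13_to_sanger(qual):
    """Re-encode an Illumina 1.3+ quality string (PHRED, offset 64) as Sanger (PHRED, offset 33)."""
    return "".join(chr(ord(c) - 64 + 33) for c in qual)


def solexa_to_phred(q_solexa):
    """Convert one Solexa score to a PHRED score: 10 * log10(10^(Q/10) + 1)."""
    return 10 * math.log10(10 ** (q_solexa / 10) + 1)


def solexa_to_sanger(qual):
    """Re-encode a Solexa quality string (offset 64) as Sanger, rounding to integer PHRED scores."""
    return "".join(chr(round(solexa_to_phred(ord(c) - 64)) + 33) for c in qual)


converted = illumina13_to_sanger("hhhh")  # "h" is ASCII 104, i.e. Q40 at offset 64
```

The two scales agree closely at high quality and differ mainly for low-quality calls, which is why the Solexa case needs the logarithmic conversion and not just an offset shift.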
22. SAM (Sequence Alignment/Map). It may not be necessary to align reads from scratch…you can instead use existing alignments in SAM format. SAM is the output of aligners that map reads to a reference genome. Tab delimited, with a header section and an alignment section. Header lines begin with @ and are optional. Each line in the alignment section has 11 mandatory fields. BAM is the binary format of SAM. http://samtools.sourceforge.net/
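A single alignment line can be split into its 11 mandatory fields as sketched below. The field names follow the current SAM specification (older versions used different names for some columns), and the helper `parse_sam_line` and the example line are hypothetical.

```python
from collections import namedtuple

# The 11 mandatory SAM columns, in order; optional tags follow them.
SAM_FIELDS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
              "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL"]
SamRecord = namedtuple("SamRecord", SAM_FIELDS + ["tags"])


def parse_sam_line(line):
    """Return a SamRecord for an alignment line, or None for an @ header line."""
    if line.startswith("@"):
        return None
    cols = line.rstrip("\r\n").split("\t")
    if len(cols) < 11:
        raise ValueError("expected at least 11 tab-separated fields")
    # FLAG, POS, MAPQ, PNEXT and TLEN are integers; the rest stay as strings.
    for idx in (1, 3, 4, 7, 8):
        cols[idx] = int(cols[idx])
    return SamRecord(*cols[:11], tags=cols[11:])


# Invented alignment line for illustration.
line = "read1\t0\tchr1\t100\t60\t7M\t*\t0\t0\tGATTACA\tIIIIIII\tNM:i:0"
rec = parse_sam_line(line)
```

For real data, samtools or a binding to it is the safer choice; this only shows how the columns are laid out.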
33. Tutorial Web Site http://www.ncbi.nlm.nih.gov/staff/church/GenomeAnalysis/index.shtml This site will be accessible after the meeting. Check back for updates and new tutorials.
35. RNA Seq Workflow. Convert data to FASTQ. Upload files to Galaxy. Quality control: throw out low quality sequence reads, etc. Map reads to a reference genome: many algorithms available; trade-off between speed and sensitivity. Data summarization: associating alignments with genome annotations; counts. Data visualization. Statistical analysis.
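The data summarization step, associating alignments with annotations and counting, can be illustrated with a naive overlap count. This is a sketch under stated assumptions: coordinates are 0-based and half-open (as in BED), the function name, gene names and intervals are invented, and the nested loop is only practical for small inputs. Production tools use interval indexes and handle strand, multi-mapping reads and spliced alignments.

```python
def count_reads_per_feature(alignments, features):
    """Count alignments overlapping each feature.

    alignments: iterable of (chrom, start, end) tuples.
    features:   dict mapping feature name to (chrom, start, end).
    Coordinates are assumed 0-based, half-open.
    """
    counts = {name: 0 for name in features}
    for chrom, start, end in alignments:
        for name, (f_chrom, f_start, f_end) in features.items():
            # Two half-open intervals overlap when each starts before the other ends.
            if chrom == f_chrom and start < f_end and end > f_start:
                counts[name] += 1
    return counts


# Hypothetical annotation and alignments.
features = {"geneA": ("chr1", 100, 200), "geneB": ("chr1", 500, 600)}
alignments = [("chr1", 150, 186), ("chr1", 190, 226),
              ("chr1", 590, 626), ("chr2", 150, 186)]
counts = count_reads_per_feature(alignments, features)
```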
36. Typical RNA-Seq Project Work Flow: Tissue Sample → Total RNA → mRNA → cDNA → Sequencing → FASTQ file → QC → TopHat → Cufflinks → Gene/Transcript/Exon Expression → Visualization → Statistical Analysis. JAX Computational Sciences Service
37. TopHat http://tophat.cbcb.umd.edu/ TopHat is better suited to aligning RNA-Seq data than general-purpose short-read aligners (Maq, BWA) because it takes splicing into account during the alignment process. Figure from: Trapnell et al. (2010). Nature Biotechnology 28:511-515. Trapnell et al. (2009). Bioinformatics 25:1105-1111.
38. TopHat is built on the Bowtie alignment algorithm. Trapnell C et al. Bioinformatics 2009;25:1105-1111
41. Tests for differential expression and regulation in RNA-Seq samples Trapnell et al. (2010). Nature Biotechnology 28:511-515.
42. Galaxy See Tutorial 1 http://main.g2.bx.psu.edu/ Build and share data and analysis workflows No programming experience required Strong and growing development and user community
43. Short Read Archive http://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi? Short Read Archive Handbook http://www.ncbi.nlm.nih.gov/books/NBK47528/
46. Galaxy on the Cloud. Create an Amazon Web Services (AWS) account. Sign up for Amazon Elastic Compute Cloud (EC2) and Amazon Simple Storage Service (S3). Use the AWS Management Console to start a master EC2 instance. Use the Galaxy Cloud web interface to manage the cluster. Step by step instructions are here: https://bitbucket.org/galaxy/galaxy-central/wiki/cloud Screencast to demonstrate the sign up process is here: https://bitbucket.org/galaxy/galaxy-central/wiki/cloud Afgan E., Baker D., Coraor N., Chapman B., Nekrutenko A., Taylor J. (2010) BMC Bioinformatics. 11:2010.
47. Why Go to the Cloud? File storage and compute needs are much greater for next-gen sequence data. The Amazon cloud provides a scalable, cost-effective solution. Afgan E., Baker D., Coraor N., Chapman B., Nekrutenko A., Taylor J. (2010) BMC Bioinformatics. 11:2010.
48. Some Tips. You’ll need a credit card to activate the service. You’ll need to be near a phone so that you can verify your identity during the sign-up process. There is a time lag between signing up for AWS and getting access.