IMGS 2011 Bioinformatics Workshop
Deanna Church, NCBI
Carol Bult, The Jackson Laboratory
Intro
  • Sequencing Technology: life in the fast lane
  • Alignments: things to consider
  • File formats: everything you always wanted to know but were afraid to ask
  • Tools: Pick the right one for the job at hand
[Figure: Cost and Throughput; axes: Gigabases, Cost per Kb. Credit: Lucinda Fulton, The Genome Center at Washington University]
Sequencing Technologies
http://www.geospiza.com/finchtalk/uploaded_images/plates-and-slides-718301.png
Sequence “Space”
Roche 454 – Flow space
  • Measures the pyrophosphate released when a nucleotide is added to a growing DNA chain
  • Flow space describes sequence in terms of these base incorporations
  • http://www.youtube.com/watch?v=bFNjxKHP8Jc
AB SOLiD – Color space
  • Sequencing by DNA ligation, using synthetic DNA molecules that contain two nested known bases and a fluorescent dye
  • Each base sequenced twice
  • http://www.youtube.com/watch?v=nlvyF8bFDwM&feature=related
Illumina/Solexa – Base space
  • Single base extensions of fluorescent-labeled nucleotides with protected 3' OH groups
  • Sequencing via cycles of base addition/detection followed by deprotection of the 3' OH
  • http://www.youtube.com/watch?v=77r5p8IBwJk&feature=related
GenomeTV – Next Generation Sequencing (lecture)
  • http://www.youtube.com/watch?v=g0vGrNjpyA8&feature=related
http://finchtalk.geospiza.com/2008/03/color-space-flow-space-sequence-space_23.html
Global and local alignments
Optimal global alignment (Needleman-Wunsch)
  • Sequences align essentially from end to end
Optimal local alignment (Smith-Waterman)
  • Sequences align only in small, isolated regions
References
  • Needleman and Wunsch (1970). J. Mol. Biol. 48, 443-453.
  • Smith and Waterman (1981). J. Mol. Biol. 147, 195-197.
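To make the global/local distinction concrete, here is a minimal sketch of both dynamic-programming recurrences. It computes scores only (no traceback), and the scoring values (match +1, mismatch -1, linear gap -1) are assumptions chosen for illustration, not values from the slides.

```python
def global_score(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch: best score when a and b are aligned end to end."""
    # Row 0: aligning an empty prefix of a against b[:j] costs j gaps.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # column 0: a[:i] against an empty prefix of b
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def local_score(a, b, match=1, mismatch=-1, gap=-1):
    """Smith-Waterman: best score over all pairs of substrings of a and b."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Flooring at 0 lets an alignment restart anywhere.
            cell = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best


if __name__ == "__main__":
    print(global_score("ACGT", "AGT"))
    print(local_score("TTTACGTTT", "GGACGGG"))
```

The only differences are the zero floor and taking the best cell anywhere in the matrix, which is why a local alignment can pick out a small shared region from otherwise unrelated sequences.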
Hashing methods
  • Word size = 3 (configurable)
  • Query sequence: MVRRLPERTSTPACE
  • Words: MVR VRR RRL RLP LPE PER ERT RTS TST STP TPA PAC ACE
References
  • Wilbur & Lipman (1983), PNAS 80, 726-30
  • Lipman & Pearson (1985), Science 227, 1435-1441
  • Pearson & Lipman (1988), PNAS 85, 2444-2448
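The word decomposition on this slide can be sketched in a few lines. This is an illustration of the idea only, not the code of any of the cited programs; the function names and the subject sequence in the usage example are made up.

```python
def index_words(seq, word_size=3):
    """Map every overlapping word of length word_size to its start positions."""
    index = {}
    for i in range(len(seq) - word_size + 1):
        index.setdefault(seq[i:i + word_size], []).append(i)
    return index


def seed_hits(query_index, subject, word_size=3):
    """Return (query_pos, subject_pos) pairs where a subject word is in the index."""
    hits = []
    for j in range(len(subject) - word_size + 1):
        for i in query_index.get(subject[j:j + word_size], []):
            hits.append((i, j))
    return hits


if __name__ == "__main__":
    idx = index_words("MVRRLPERTSTPACE")  # query sequence from the slide
    print(" ".join(idx))                  # the overlapping 3-letter words
    print(seed_hits(idx, "AAPERTSAA"))    # hypothetical subject sequence
```

Exact word matches found this way act as seeds; only the regions around them need the more expensive dynamic-programming alignment.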
http://www.slideshare.net/thomaskeane/eccb-2010-nextgen-sequencing-tutorial
Sensitivity vs. Specificity
  • Sensitivity = fraction of actual positives that are correctly identified as positive
  • Specificity = fraction of actual negatives that are correctly identified as negative
                     Predicted positives   Predicted negatives
  Actual positives   TP                    FN
  Actual negatives   FP                    TN
Sensitivity = TP/(TP+FN)
Specificity = TN/(TN+FP)
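A worked example of the two formulas, as a minimal sketch. The counts are invented for illustration and do not come from any aligner benchmark.

```python
def sensitivity(tp, fn):
    """TP / (TP + FN): share of actual positives that were found."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """TN / (TN + FP): share of actual negatives that were rejected."""
    return tn / (tn + fp)


if __name__ == "__main__":
    # Hypothetical counts: 90 true hits found, 10 missed,
    # 80 non-hits correctly rejected, 20 reported in error.
    print(sensitivity(tp=90, fn=10))
    print(specificity(tn=80, fp=20))
```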
MHC Alternate locus: Alignment to chr6 (figure; credit: Richa Agarwala)
Tools
  • Alignments: BLAST (not for NGS), BWA, Bowtie, Maq, …
  • Transcriptomics: TopHat, Cufflinks, …
  • Variant calling: ssahaSNP, Mosaic, …
  • Counting (ChIP-Seq, etc.): FindPeaks, PeakSeq
Genome Workbench
http://www.ncbi.nlm.nih.gov/projects/gbench/
“Standard” File formats
  • Sequence containers: FASTA, FASTQ, BAM/SAM
  • Alignments: BAM/SAM, MAF
  • Annotation: BED, GFF/GTF/GFF3, WIG
  • Variation: VCF, GVF
FASTQ: Data Format
  • Text based
  • Encodes sequence calls and quality scores with ASCII characters
  • Stores minimal information about the sequence read
  • 4 lines per sequence
    • Line 1: begins with @, followed by the sequence identifier and an optional description
    • Line 2: the sequence
    • Line 3: begins with + and may be followed by the sequence identifier and description (both are optional)
    • Line 4: encoding of quality scores for the sequence in line 2
References/Documentation
  • http://maq.sourceforge.net/fastq.shtml
  • Cock et al. (2009). Nuc Acids Res 38:1767-1771.
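The four-line layout above can be read with a short generator. This is a minimal sketch that assumes unwrapped records (exactly one line each for sequence and quality); the read in the usage example is made up.

```python
import io


def read_fastq(handle):
    """Yield (identifier_line, sequence, quality) from a 4-line-per-record FASTQ."""
    while True:
        header = handle.readline()
        if not header:          # end of file
            return
        if not header.strip():  # tolerate blank lines between records
            continue
        seq = handle.readline().rstrip("\n")
        plus = handle.readline()
        qual = handle.readline().rstrip("\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("record does not follow the @ / + layout")
        if len(seq) != len(qual):
            raise ValueError("sequence and quality lengths differ")
        yield header[1:].rstrip("\n"), seq, qual


if __name__ == "__main__":
    example = io.StringIO("@read1 made-up example\nGATTACA\n+\nIIIIIII\n")
    for name, seq, qual in read_fastq(example):
        print(name, seq, qual)
```

Checking that line 2 and line 4 have equal length is a cheap way to catch truncated files before alignment.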
FASTQ Example
  • For analysis, it may be necessary to convert to the Sanger form of FASTQ.
  • For example, Illumina stores quality scores ranging from 0-62; Sanger quality scores range from 0-93.
  • Solexa quality scores have to be converted to PHRED quality scores.
FASTQ example from: Cock et al. (2009). Nuc Acids Res 38:1767-1771.
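The conversions mentioned above can be sketched as follows. The ASCII offsets (33 for Sanger, 64 for Illumina 1.3+ and Solexa) and the Solexa-to-PHRED formula follow the description in Cock et al.; the function names are illustrative, and real data should be checked against the documentation for the instrument and pipeline version that produced it.

```python
import math

SANGER_OFFSET = 33    # Sanger FASTQ: PHRED score + 33
ILLUMINA_OFFSET = 64  # Illumina 1.3+ and Solexa FASTQ: score + 64


def illumina13_to_sanger(qual):
    """Re-encode Illumina 1.3+ (PHRED+64) quality characters as Sanger (PHRED+33)."""
    return "".join(chr(ord(c) - ILLUMINA_OFFSET + SANGER_OFFSET) for c in qual)


def solexa_to_phred(q_solexa):
    """Convert one Solexa score to a PHRED score (real-valued)."""
    return 10 * math.log10(10 ** (q_solexa / 10) + 1)


def solexa_to_sanger(qual):
    """Re-encode Solexa (+64) quality characters as Sanger, rounding to integers."""
    return "".join(
        chr(round(solexa_to_phred(ord(c) - ILLUMINA_OFFSET)) + SANGER_OFFSET)
        for c in qual
    )


if __name__ == "__main__":
    print(illumina13_to_sanger("hhhh"))
    print(round(solexa_to_phred(0), 2))
```

The two scales agree closely at high quality and diverge at the low end, so shifting the offset alone is not enough for Solexa-scaled data.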
SAM (Sequence Alignment/Map)
  • It may not be necessary to align reads from scratch; you can instead use existing alignments in SAM format
  • SAM is the output of aligners that map reads to a reference genome
  • Tab delimited, with a header section and an alignment section
  • Header lines begin with @ (the header section is optional)
  • Alignment section has 11 mandatory fields
  • BAM is the binary format of SAM
http://samtools.sourceforge.net/
Mandatory Alignment Fields
http://samtools.sourceforge.net/SAM1.pdf
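As a rough illustration of the 11 mandatory fields, the sketch below splits one alignment line into named values. The field names follow recent versions of the SAM specification (older versions used different names for fields 7 to 9), the example line is made up, and for real work a library built on samtools is a better choice than hand parsing.

```python
from typing import NamedTuple, Tuple


class SamRecord(NamedTuple):
    qname: str   # read name
    flag: int    # bitwise flag
    rname: str   # reference sequence name
    pos: int     # 1-based leftmost mapping position
    mapq: int    # mapping quality
    cigar: str   # alignment description
    rnext: str   # reference name of the mate / next read
    pnext: int   # position of the mate / next read
    tlen: int    # observed template length
    seq: str     # read sequence
    qual: str    # base qualities (PHRED+33)
    optional: Tuple[str, ...]  # any TAG:TYPE:VALUE fields after the first 11


def parse_sam_line(line):
    """Parse one alignment line; header lines (starting with @) are not handled."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("expected at least 11 tab-separated fields")
    f = fields
    return SamRecord(f[0], int(f[1]), f[2], int(f[3]), int(f[4]), f[5],
                     f[6], int(f[7]), int(f[8]), f[9], f[10], tuple(f[11:]))


if __name__ == "__main__":
    rec = parse_sam_line("read1\t16\tchr1\t100\t60\t7M\t*\t0\t0\tGATTACA\tIIIIIII\tNM:i:0")
    print(rec.rname, rec.pos, rec.cigar)
    print("reverse strand:", bool(rec.flag & 0x10))  # 0x10 = read reverse-complemented
    print("unmapped:", bool(rec.flag & 0x4))         # 0x4 = read unmapped
```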
Alignment Examples
Alignments in SAM format
http://samtools.sourceforge.net/SAM1.pdf
Valid BED files
chr1	86114265	86116346	nsv433165
chr2	1841774	1846089	nsv433166
chr16	2950446	2955264	nsv433167
chr17	14350387	14351933	nsv433168
chr17	32831694	32832761	nsv433169
chr17	32831694	32832761	nsv433170
chr18	61880550	61881930	nsv433171

chr1	16759829	16778548	chr1:21667704	270866	-
chr1	16763194	16784844	chr1:146691804	407277	+
chr1	16763194	16784844	chr1:144004664	408925	-
chr1	16763194	16779513	chr1:142857141	291416	-
chr1	16763194	16779513	chr1:143522082	293473	-
chr1	16763194	16778548	chr1:146844175	284555	-
chr1	16763194	16778548	chr1:147006260	284948	-
chr1	16763411	16784844	chr1:144747517	405362	+
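A point that often causes off-by-one errors: BED start coordinates are 0-based and the end is exclusive, while genome browsers display 1-based inclusive positions. The sketch below parses the first three required columns of a BED line (taken from the slide) and shows both views; the helper names are illustrative.

```python
def parse_bed_line(line):
    """Return (chrom, start, end, extra_columns) from one tab-delimited BED line."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 3:
        raise ValueError("BED needs at least chrom, start, end")
    chrom, start, end = fields[0], int(fields[1]), int(fields[2])
    if start > end:
        raise ValueError("start must not exceed end")
    return chrom, start, end, fields[3:]


def feature_length(start, end):
    """Length in bases; no +1 because BED intervals are half-open."""
    return end - start


def to_browser_position(chrom, start, end):
    """Convert 0-based half-open BED coordinates to 1-based inclusive display form."""
    return f"{chrom}:{start + 1}-{end}"


if __name__ == "__main__":
    chrom, start, end, extra = parse_bed_line("chr1\t86114265\t86116346\tnsv433165")
    print(extra[0], feature_length(start, end), to_browser_position(chrom, start, end))
```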
Mouse chrX: 35,000,000-36,000,000
Mouse chrX: 35,000,000-36,000,000 (figure labels: X, MGSCv3, Build 36)
NC_000086.6
GRCh37, hg19, Zv7, danRer5, MGSCv37, mm8, NCBIM37
Assemblies with the same name aren’t always the same
chr21:8,913,216-9,246,964
Assemblies with the same name aren’t always the same
Zv7 chr21:8,913,216-9,246,964 X Mouse Build 36 chrX
GRCh37, hg19, GCA_000001405.1
Tutorial Web Site
http://www.ncbi.nlm.nih.gov/staff/church/GenomeAnalysis/index.shtml
This site will be accessible after the meeting. Check back for updates and new tutorials.
RNA Seq Workflow
  • Convert data to FASTQ
  • Upload files to Galaxy
  • Quality Control: throw out low quality sequence reads, etc.
  • Map reads to a reference genome
    • Many algorithms available
    • Trade off between speed and sensitivity
  • Data summarization
    • Associating alignments with genome annotations
    • Counts
  • Data Visualization
  • Statistical Analysis
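The data summarization step (associating alignments with annotations and counting) can be pictured with the small sketch below. It is an illustration of the idea, not what any particular tool does: coordinates are assumed to be 0-based half-open, any overlap counts, the gene names and positions are invented, and the nested loop would be far too slow for real data, where an interval index is used instead.

```python
def count_reads(features, alignments):
    """Count alignments overlapping each feature.

    features:   iterable of (chrom, start, end, name)
    alignments: iterable of (chrom, start, end)
    """
    features = list(features)
    counts = {name: 0 for _, _, _, name in features}
    for chrom, start, end in alignments:
        for f_chrom, f_start, f_end, name in features:
            # Half-open intervals overlap when each starts before the other ends.
            if chrom == f_chrom and start < f_end and f_start < end:
                counts[name] += 1
    return counts


if __name__ == "__main__":
    genes = [("chr1", 100, 200, "geneA"), ("chr1", 300, 400, "geneB")]  # hypothetical
    reads = [("chr1", 150, 186), ("chr1", 199, 235), ("chr1", 390, 426)]
    print(count_reads(genes, reads))
```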
Typical RNA-Seq Project Work Flow
Tissue Sample → Total RNA → mRNA → cDNA → Sequencing → FASTQ file → QC → TopHat → Cufflinks → Gene/Transcript/Exon Expression → Visualization → Statistical Analysis
JAX Computational Sciences Service
TopHat
http://tophat.cbcb.umd.edu/
TopHat is better suited to aligning RNA-Seq data than general-purpose aligners such as Maq or BWA because it takes splicing into account during the alignment process.
Figure from: Trapnell et al. (2010). Nature Biotechnology 28:511-515.
Trapnell et al. (2009). Bioinformatics 25:1105-1111.
TopHat is built on the Bowtie alignment algorithm.
Trapnell C et al. Bioinformatics 2009;25:1105-1111
Cufflinks
http://cufflinks.cbcb.umd.edu/
  • Assembles transcripts,
  • Estimates their abundances, and
  • Tests for differential expression and regulation in RNA-Seq samples
Trapnell et al. (2010). Nature Biotechnology 28:511-515.
Galaxy
See Tutorial 1
http://main.g2.bx.psu.edu/
  • Build and share data and analysis workflows
  • No programming experience required
  • Strong and growing development and user community
Short Read Archive
http://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?
Short Read Archive Handbook
http://www.ncbi.nlm.nih.gov/books/NBK47528/
Aspera Connect
http://www.asperasoft.com/en/products/client_software_2/aspera_connect_8
High performance file transfer for getting data from the Short Read Archive
SRA Toolkit
http://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=software
Galaxy on the Cloud
  • Create an Amazon Web Services (AWS) account
  • Sign up for Amazon Elastic Compute Cloud (EC2) and Amazon Simple Storage Service (S3)
  • Use the AWS Management Console to start a master EC2 instance
  • Use the Galaxy Cloud web interface to manage the cluster
Step by step instructions are here: https://bitbucket.org/galaxy/galaxy-central/wiki/cloud
Screencast to demonstrate the sign up process is here: https://bitbucket.org/galaxy/galaxy-central/wiki/cloud
Afgan E., Baker D., Coraor N., Chapman B., Nekrutenko A., Taylor J. (2010) BMC Bioinformatics 11(Suppl 12):S4.
Why Go to the Cloud?
  • File storage and compute needs are much greater for next gen sequence data
  • Amazon cloud provides a scalable, cost-effective solution
Afgan E., Baker D., Coraor N., Chapman B., Nekrutenko A., Taylor J. (2010) BMC Bioinformatics 11(Suppl 12):S4.


Editor's Notes

  • #28 Show alignment of a feature from first slide to show how far down the chromosome it has moved…
  • #30 Keeping track of people is way easier than keeping track of assemblies.
  • #33 Can talk about Genomic Collections here