Imgc2011 bioinformatics tutorial


Published in: Education, Technology
  • Show alignment of a feature from first slide to show how far down the chromosome it has moved…
  • Keeping track of people is way easier than keeping track of assemblies.
  • Can talk about Genomic Collections here

    1. IMGS 2011 Bioinformatics Workshop
       Deanna Church, NCBI
       Carol Bult, The Jackson Laboratory
    2. Intro
       • Sequencing Technology: life in the fast lane
       • Alignments: things to consider
       • File formats: everything you always wanted to know but were afraid to ask
       • Tools: pick the right one for the job at hand
    3. Cost and Throughput
       [Chart; axis labels: Gigabases, Cost per Kb]
       Lucinda Fulton, The Genome Center at Washington University
    4. Sequencing Technologies
    5. Sequence “Space”
       • Roche 454: Flow space
         Measures the pyrophosphate released when a nucleotide is added to a growing DNA chain
         Flow space describes sequence in terms of these base incorporations
       • AB SOLiD: Color space
         Sequencing by DNA ligation via synthetic DNA molecules that contain two nested known bases with a fluorescent dye
         Each base is sequenced twice
       • Illumina/Solexa: Base space
         Single-base extensions of fluorescent-labeled nucleotides with protected 3′ OH groups
         Sequencing via cycles of base addition/detection followed by deprotection of the 3′ OH
       GenomeTV: Next Generation Sequencing (lecture)
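As a rough illustration of what color space means, the sketch below uses the commonly described SOLiD two-base encoding, in which each color digit (0 to 3) depends on a pair of adjacent bases. The function names and the example sequence are illustrative assumptions, not taken from the slides, and real color-space reads carry further details (such as the primer base and quality values) that are ignored here.

```python
# Two-base ("color space") encoding sketch. Assumption: bases are mapped to
# A=0, C=1, G=2, T=3 and the color of a base pair is the XOR of the two codes.
BASE_TO_BITS = {"A": 0, "C": 1, "G": 2, "T": 3}
BITS_TO_BASE = "ACGT"


def encode_colors(bases):
    """Return the first base followed by one color digit per adjacent base pair.

    Assumes a non-empty string containing only A, C, G and T.
    """
    colors = [
        str(BASE_TO_BITS[x] ^ BASE_TO_BITS[y])
        for x, y in zip(bases, bases[1:])
    ]
    return bases[0] + "".join(colors)


def decode_colors(read):
    """Invert encode_colors: walk the colors starting from the known first base."""
    bases = [read[0]]
    for color in read[1:]:
        previous = BASE_TO_BITS[bases[-1]]
        bases.append(BITS_TO_BASE[previous ^ int(color)])
    return "".join(bases)


if __name__ == "__main__":
    encoded = encode_colors("ATGGC")
    print(encoded, decode_colors(encoded))
```

Because every interior base contributes to two neighboring colors, a single miscalled color changes every decoded base after it, which is one reason color-space reads are usually aligned in color space rather than decoded first.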
    6. Global and local alignments
       • Optimal global alignment (Needleman-Wunsch): sequences align essentially from end to end
       • Optimal local alignment (Smith-Waterman): sequences align only in small, isolated regions
       References
       • Needleman and Wunsch (1970). J. Mol. Biol. 48, 443-453.
       • Smith and Waterman (1981). J. Mol. Biol. 147, 195-197.
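The difference between the two approaches can be sketched with one dynamic-programming routine that computes only the optimal score. The scoring values (match +1, mismatch -1, linear gap -2) and the function name are assumptions chosen for illustration; the original algorithms are more general, and production tools use substitution matrices and affine gap penalties.

```python
def align_score(a, b, match=1, mismatch=-1, gap=-2, local=False):
    """Return the optimal alignment score of strings a and b.

    local=False: global alignment in the style of Needleman-Wunsch.
    local=True:  local alignment in the style of Smith-Waterman.
    Scoring parameters are illustrative assumptions.
    """
    # First row: global alignment pays for leading gaps, local does not.
    prev = [0 if local else j * gap for j in range(len(b) + 1)]
    best = 0
    for i in range(1, len(a) + 1):
        cur = [0 if local else i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score = max(diag, prev[j] + gap, cur[j - 1] + gap)
            if local:
                # A local alignment may restart anywhere, so scores never go below 0.
                score = max(score, 0)
                best = max(best, score)
            cur.append(score)
        prev = cur
    # Global: score of aligning both full sequences. Local: best cell anywhere.
    return best if local else prev[-1]


if __name__ == "__main__":
    x, y = "TTTGATTACATTT", "CCCGATTACACCC"
    print("global:", align_score(x, y))
    print("local: ", align_score(x, y, local=True))
```

In this example the two sequences share only a central region, so the local score reflects that region alone while the global score is pulled down by the mismatched flanks.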
    7. [no text on slide]
    8. Hashing methods
       Word size = 3 (configurable)
       Query sequence: MVRRLPERTSTPACE
       Words: MVR, VRR, RRL, RLP, LPE, PER, ERT, RTS, TST, STP, TPA, PAC, ACE
       References
       • Wilbur & Lipman (1983), PNAS 80, 726-730
       • Lipman & Pearson (1985), Science 227, 1435-1441
       • Pearson & Lipman (1988), PNAS 85, 2444-2448
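The word list on this slide can be reproduced with a few lines of code. The sketch below splits a query into overlapping words and builds a lookup table from word to positions, which is the basic idea behind hashing methods; the function names are illustrative, and real tools add scoring, neighborhood words, and extension steps that are not shown.

```python
def words(seq, k=3):
    """Return all overlapping words of length k, in order of position."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def word_index(seq, k=3):
    """Map each word to the list of 0-based positions where it starts."""
    index = {}
    for pos, word in enumerate(words(seq, k)):
        index.setdefault(word, []).append(pos)
    return index


def shared_word_hits(query, subject, k=3):
    """Return (query_pos, subject_pos) pairs where the same word occurs in both."""
    index = word_index(subject, k)
    return [
        (qpos, spos)
        for qpos, word in enumerate(words(query, k))
        for spos in index.get(word, [])
    ]


if __name__ == "__main__":
    print(words("MVRRLPERTSTPACE"))
```

Hits that fall on the same diagonal (constant difference between query and subject position) are the seeds a hashing-based search would then try to extend.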
    9. [no text on slide]
    10. [no text on slide]
    11. [no text on slide]
    12. [no text on slide]
    13. [no text on slide]
    14. [no text on slide]
    15. Sensitivity vs. Specificity
        • Sensitivity = fraction of actual positives that are identified as positive (true positives, TP)
        • Specificity = fraction of actual negatives that are identified as negative (true negatives, TN)

                             Predicted positives   Predicted negatives
        Actual positives     TP                    FN
        Actual negatives     FP                    TN

        Sensitivity = TP/(TP+FN)
        Specificity = TN/(TN+FP)
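The two formulas on this slide translate directly into code. The counts in the example are made-up numbers used only to show the arithmetic.

```python
def sensitivity(tp, fn):
    """Fraction of actual positives that were predicted positive: TP/(TP+FN)."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """Fraction of actual negatives that were predicted negative: TN/(TN+FP)."""
    return tn / (tn + fp)


if __name__ == "__main__":
    # Hypothetical confusion-matrix counts, for illustration only.
    tp, fn, tn, fp = 90, 10, 950, 50
    print("sensitivity:", sensitivity(tp, fn))
    print("specificity:", specificity(tn, fp))
```

An aligner tuned to report more candidate hits will generally raise sensitivity at the cost of specificity, which is the trade-off the slide title refers to.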
    16. MHC Alternate locus
        Alignment to chr6
        Richa Agarwala
    17. Tools
        • Alignments: BLAST (not for NGS), BWA, Bowtie, Maq, …
        • Transcriptomics: TopHat, Cufflinks, …
        • Variant calling: ssahaSNP, Mosaic, …
        • Counting (ChIP-Seq, etc.): FindPeaks, PeakSeq
    18. Genome Workbench
    19. “Standard” File formats
        • Sequence containers: FASTA, FASTQ, BAM/SAM
        • Alignments: BAM/SAM, MAF
        • Annotation: BED, GFF/GTF/GFF3, WIG
        • Variation: VCF, GVF
    20. FASTQ: Data Format
        • Text based
        • Encodes sequence calls and quality scores with ASCII characters
        • Stores minimal information about the sequence read
        • 4 lines per sequence
          Line 1: begins with “@”, followed by the sequence identifier and an optional description
          Line 2: the sequence
          Line 3: begins with “+”, optionally followed by the sequence identifier and description
          Line 4: encoding of quality scores for the sequence in line 2
        References/Documentation
        • Cock et al. (2009). Nuc Acids Res 38:1767-1771.
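The four-line record layout described above can be read with a short generator. This is a minimal sketch that assumes each sequence and quality string sits on a single line (no line wrapping); the function name and the example record are illustrative, and established libraries handle more edge cases.

```python
import io


def read_fastq(handle):
    """Yield (identifier, sequence, quality) tuples from a FASTQ text stream.

    Assumes exactly 4 lines per record with unwrapped sequence and quality lines.
    """
    while True:
        header = handle.readline()
        if not header:
            return  # end of file
        if not header.strip():
            continue  # tolerate blank lines between records
        seq = handle.readline().rstrip("\n")
        plus = handle.readline()
        qual = handle.readline().rstrip("\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("malformed FASTQ record near: " + header.strip())
        if len(seq) != len(qual):
            raise ValueError("sequence and quality lengths differ: " + header.strip())
        yield header[1:].rstrip("\n"), seq, qual


if __name__ == "__main__":
    # Made-up record for illustration.
    example = io.StringIO("@read1 example\nGATTACA\n+\nIIIIIII\n")
    for name, seq, qual in read_fastq(example):
        print(name, seq, qual)
```

Checking that the sequence and quality lines have equal length is a cheap way to catch truncated files before they reach an aligner.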
    21. FASTQ Example
        For analysis, it may be necessary to convert to the Sanger form of FASTQ. For example:
        • Illumina quality scores range from 0 to 62
        • Sanger quality scores range from 0 to 93
        • Solexa quality scores have to be converted to PHRED quality scores
        FASTQ example from: Cock et al. (2009). Nuc Acids Res 38:1767-1771.
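The conversions mentioned on this slide can be sketched as follows. The ASCII offsets (33 for Sanger, 64 for Illumina 1.3+ and Solexa) and the Solexa-to-PHRED formula follow the description in Cock et al.; the function names are illustrative assumptions, and which variant a given file uses has to be established from the sequencing pipeline version rather than guessed.

```python
import math


def sanger_scores(qual):
    """Decode a Sanger-style quality string (ASCII offset 33) to PHRED scores."""
    return [ord(c) - 33 for c in qual]


def illumina13_to_sanger(qual):
    """Re-encode an Illumina 1.3+ quality string (offset 64) as Sanger (offset 33).

    Both variants hold PHRED scores, so only the ASCII offset changes.
    """
    return "".join(chr(ord(c) - 64 + 33) for c in qual)


def solexa_to_phred(q_solexa):
    """Convert a Solexa quality score to the PHRED scale (not an offset change)."""
    return 10 * math.log10(10 ** (q_solexa / 10) + 1)


if __name__ == "__main__":
    print(illumina13_to_sanger("hhhh"), sanger_scores("IIII"))
    print(round(solexa_to_phred(0), 2), round(solexa_to_phred(40), 2))
```

The Solexa and PHRED scales differ most at low quality and are nearly identical for high scores, which is why the conversion matters mainly for poor-quality bases.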
    22. SAM (Sequence Alignment/Map)
        • It may not be necessary to align reads from scratch; you can instead use existing alignments in SAM format
        • SAM is the output of aligners that map reads to a reference genome
        • Tab-delimited, with a header section and an alignment section
        • Header lines begin with “@”; the header section is optional
        • Each line in the alignment section has 11 mandatory fields
        • BAM is the binary format of SAM
    23. Mandatory Alignment Fields
    24. Alignment Examples
        Alignments in SAM format
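Since the slide content for the mandatory fields did not survive, here is a minimal sketch of splitting one SAM alignment line into its 11 mandatory fields. The field names follow the current SAM specification (older documents call RNEXT, PNEXT and TLEN by the names MRNM, MPOS and ISIZE); the example line and function names are made up for illustration.

```python
from collections import namedtuple

# The 11 mandatory SAM fields, plus a list holding any optional tags.
SamRecord = namedtuple(
    "SamRecord",
    "qname flag rname pos mapq cigar rnext pnext tlen seq qual tags",
)


def parse_sam_line(line):
    """Parse one tab-delimited alignment line (not a header line starting with @)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("expected at least 11 tab-separated fields")
    qname, flag, rname, pos, mapq, cigar, rnext, pnext, tlen, seq, qual = fields[:11]
    return SamRecord(
        qname, int(flag), rname, int(pos), int(mapq), cigar,
        rnext, int(pnext), int(tlen), seq, qual, fields[11:],
    )


def is_unmapped(record):
    """Bit 0x4 of FLAG marks a read that has no alignment."""
    return bool(record.flag & 0x4)


if __name__ == "__main__":
    # Hypothetical alignment line, for illustration only.
    line = "read1\t0\tchrX\t35000001\t60\t7M\t*\t0\t0\tGATTACA\tIIIIIII\tNM:i:0"
    record = parse_sam_line(line)
    print(record.rname, record.pos, record.cigar, is_unmapped(record))
```

For real data, a maintained library or samtools is the safer route, particularly for BAM, which cannot be read as plain text.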
    25. Valid BED files
        chr1 86114265 86116346 nsv433165
        chr2 1841774 1846089 nsv433166
        chr16 2950446 2955264 nsv433167
        chr17 14350387 14351933 nsv433168
        chr17 32831694 32832761 nsv433169
        chr17 32831694 32832761 nsv433170
        chr18 61880550 61881930 nsv433171

        chr1 16759829 16778548 chr1:21667704 270866 -
        chr1 16763194 16784844 chr1:146691804 407277 +
        chr1 16763194 16784844 chr1:144004664 408925 -
        chr1 16763194 16779513 chr1:142857141 291416 -
        chr1 16763194 16779513 chr1:143522082 293473 -
        chr1 16763194 16778548 chr1:146844175 284555 -
        chr1 16763194 16778548 chr1:147006260 284948 -
        chr1 16763411 16784844 chr1:144747517 405362 +
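Both BED examples above share the same first three columns (chromosome, start, end), with optional name, score and strand columns after them. The sketch below parses such lines; the function names are illustrative, and it relies on the BED convention that start is 0-based and end is exclusive, so feature length is simply end minus start.

```python
def parse_bed_line(line):
    """Parse one BED line into a dict; columns 4 to 6 are optional."""
    fields = line.split()
    if len(fields) < 3:
        raise ValueError("BED lines need at least chrom, start and end")
    record = {
        "chrom": fields[0],
        "start": int(fields[1]),  # 0-based, inclusive
        "end": int(fields[2]),    # exclusive
    }
    # Attach whichever optional columns are present, in BED column order.
    for key, value in zip(("name", "score", "strand"), fields[3:6]):
        record[key] = value
    return record


def feature_length(record):
    """Length in bases; no +1 because BED intervals are half-open."""
    return record["end"] - record["start"]


if __name__ == "__main__":
    four_col = parse_bed_line("chr1 86114265 86116346 nsv433165")
    six_col = parse_bed_line("chr1 16759829 16778548 chr1:21667704 270866 -")
    print(four_col["name"], feature_length(four_col))
    print(six_col["strand"], feature_length(six_col))
```

Mixing up BED's 0-based coordinates with the 1-based coordinates used by formats such as GFF and SAM is a common source of off-by-one errors when converting between them.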
    26. Mouse chrX: 35,000,000-36,000,000
    27. Mouse chrX: 35,000,000-36,000,000
        X
        MGSCv3
        Build 36
    28. NC_000086.6
    29. GRCh37, hg19, Zv7, danRer5, MGSCv37, mm8, NCBIM37
    30. Assemblies with the same name aren’t always the same
        chr21:8,913,216-9,246,964
    31. Assemblies with the same name aren’t always the same
        Zv7 chr21:8,913,216-9,246,964 X Mouse Build 36 chrX
    32. GRCh37
        hg19
        GCA_000001405.1
    33. Tutorial Web Site
        This site will be accessible after the meeting. Check back for updates and new tutorials.
    34. [no text on slide]
    35. RNA Seq Workflow
        • Convert data to FASTQ
        • Upload files to Galaxy
        • Quality Control
          Throw out low quality sequence reads, etc.
        • Map reads to a reference genome
          Many algorithms available
          Trade-off between speed and sensitivity
        • Data summarization
          Associating alignments with genome annotations
          Counts
        • Data Visualization
        • Statistical Analysis
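As one possible reading of the quality-control step in this workflow, the sketch below discards reads whose mean base quality falls under a threshold. The threshold of 20, the Sanger offset of 33, and the function names are assumptions for illustration; the workshop itself performs QC with Galaxy tools, and real pipelines often trim read ends rather than dropping whole reads.

```python
def mean_quality(qual, offset=33):
    """Mean PHRED score of a quality string (default: Sanger offset 33)."""
    return sum(ord(c) - offset for c in qual) / len(qual)


def keep_read(qual, min_mean=20, offset=33):
    """Return True if the read passes the (assumed) mean-quality threshold."""
    return len(qual) > 0 and mean_quality(qual, offset) >= min_mean


def filter_reads(records, min_mean=20):
    """Keep only (name, sequence, quality) tuples that pass keep_read."""
    return [rec for rec in records if keep_read(rec[2], min_mean)]


if __name__ == "__main__":
    # Made-up reads: 'I' encodes PHRED 40 and '#' encodes PHRED 2 at offset 33.
    reads = [("good", "GATTACA", "IIIIIII"), ("poor", "GATTACA", "#######")]
    print([name for name, _, _ in filter_reads(reads)])
```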
    36. Typical RNA-Seq Project Work Flow
        • Tissue Sample
        • Total RNA
        • mRNA
        • cDNA
        • Sequencing
        • FASTQ file
        • QC
        • TopHat
        • Cufflinks
        • Gene/Transcript/Exon Expression
        • Visualization
        • Statistical Analysis
        JAX Computational Sciences Service
    37. TopHat
        TopHat is better suited to aligning RNA-Seq data than general-purpose aligners such as Maq or BWA because it takes splicing into account during the alignment process.
        Figure from: Trapnell et al. (2010). Nature Biotechnology 28:511-515.
        Trapnell et al. (2009). Bioinformatics 25:1105-1111.
    38. TopHat is built on the Bowtie alignment algorithm.
        Trapnell et al. (2009). Bioinformatics 25:1105-1111.
    39. Cufflinks
        • Assembles transcripts,
        • Estimates their abundances, and
        • Tests for differential expression and regulation in RNA-Seq samples
        Trapnell et al. (2010). Nature Biotechnology 28:511-515.
    42. Galaxy
        See Tutorial 1
        • Build and share data and analysis workflows
        • No programming experience required
        • Strong and growing development and user community
    43. Short Read Archive
        Short Read Archive Handbook
    44. Aspera Connect
        High performance file transfer for getting data from the Short Read Archive
    45. SRA Toolkit
    46. Galaxy on the Cloud
        • Create an Amazon Web Services (AWS) account
        • Sign up for Amazon Elastic Compute Cloud (EC2) and Amazon Simple Storage Service (S3)
        • Use the AWS Management Console to start a master EC2 instance
        • Use the Galaxy Cloud web interface to manage the cluster
        Step-by-step instructions are here:
        Screencast to demonstrate the sign-up process is here:
        Afgan E., Baker D., Coraor N., Chapman B., Nekrutenko A., Taylor J. (2010). BMC Bioinformatics 11 (Suppl 12): S4.
    47. Why Go to the Cloud?
        • File storage and compute needs are much greater for next-gen sequence data
        • The Amazon cloud provides a scalable, cost-effective solution
        Afgan E., Baker D., Coraor N., Chapman B., Nekrutenko A., Taylor J. (2010). BMC Bioinformatics 11 (Suppl 12): S4.
    48. Some Tips
        • You’ll need a credit card to activate the service
        • You’ll need to be near a phone so that you can verify your identity during the sign-up process
        • There is a time lag between signing up for AWS and getting access
    49. Let’s Get Started!
        [Screenshot labels: Tools, Dialog/Parameter Selection, History]