Bio305 genome analysis and annotation 2012
Lecture on bacterial genome analysis and annotation for Bio305 course at the University of Birmingham

  • Transcript

    • 1. Bio305 Bacterial Genome Annotation and Analysis Professor Mark Pallen
    • 2. Overview
      • Features of Bacterial Genomes
      • Genome Sequencing
      • Assembly of bacterial genomes
      • Annotation of bacterial genomes
      • Identifying and annotating CDSs
        • An ORF is NOT a CDS!
      • Power and pitfalls of using homology
        • BLAST and PFAM
    • 3. General features of genomes
      • Microbial
        • Small WYSIWYG genomes (Mbp)
        • Gene density high (>90%)
          • intergenic regions short
          • very little repetitive or non-coding DNA
          • Introns very rare
        • Protein-coding genes (CDS) short (~1 kbp)
        • Operons with promoters just upstream
        • Fewer non-coding RNAs
      • Human
        • Very large genomes (Gbp)
        • Gene density low
          • Only 25% is genes
          • Introns mean only 1% codes for protein
        • Genes can span ≥30 kbp
        • Genes have ~3 transcripts
          • Splicing and splice variants
        • Promoter regions distant from gene
    • 4. Bacterial genome organisation
      • Chromosomes
        • Most commonly single circular chromosome (always DNA)
          • BUT many species have linear chromosome(s) (e.g. Borrelia, Streptomyces, Rhodococcus)
          • BUT a few species with two chromosomes (e.g. Vibrio cholerae)
        • Can be mix of circular and linear (e.g. Agrobacterium tumefaciens, B. burgdorferi)
      • Plasmids
        • Independent autonomous replicon, can be circular or linear
        • may integrate into chromosome
        • copy number varies 1 to 10s
        • often carry non-essential genes that confer an adaptive advantage in certain conditions
    • 5. Overview of a genome project
      • Choose strain
        • Fresh isolate or tractable lab strain?
      • Choose strategy
        • Shotgun sequencing
        • Paired-end sequencing
        • Draft or complete?
      • Choose chemistry
        • Sanger; 454; Illumina; Ion Torrent
      • Assembly
        • Automated
      • Closure and finishing
        • Manually intensive
        • Difficulty depends on how repetitive
      • Data Release
        • Immediate or delayed?
      • Annotation
        • Manually intensive bottleneck
      • Publication
    • 6. Whole-Genome Shotgun Sanger Sequencing
      • Figure labels: bacterial chromosome; random shearing; size selection; cloning; plasmid vector; pick colonies to create shotgun library; plasmid preps; sequence each insert with two primers
    • 7. High-throughput Sequencing
      • 100x faster, 100x cheaper!
        • A disruptive technology
      • Several technologies in the marketplace from 2007 onwards
        • 454 (Roche)
        • Illumina
        • Ion Torrent
        • PacBio
      • Fundamentally new approaches
        • Solid-phase amplification of clonal templates in “molecular colonies”
          • Massive increase in number of “clones” compensates for shorter read length
        • New chemistries for sequence reading
          • 454: pyrophosphate detection on base addition
          • Illumina: reversible de-protection of fluorescent bases
    • 8. High-Throughput Shotgun Sequencing
      • Figure labels: bacterial chromosome; random shearing; size selection; add adapters; amplify; sequence
    • 9. Illumina Sequencing
    • 10. The Sequence Assembly Problem
      • Sequencing technologies generate reads of <1000 bp
      • These reads must be assembled into a single continuous genomic sequence.
      • Shotgun sequencing exploits many overlapping sequences (high coverage) to infer ordering directly from the sequences themselves
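A minimal sketch of how ordering can be inferred from overlaps alone, assuming error-free reads from a single strand. The function names `overlap` and `greedy_assemble` and the toy reads are illustrative assumptions, not part of the lecture, and real assemblers use far more sophisticated graph-based methods.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`.

    Overlaps shorter than `min_len` are treated as chance matches (0).
    """
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_len: int = 3):
    """Repeatedly merge the pair of reads with the longest overlap."""
    reads = list(reads)
    while len(reads) > 1:
        best_n, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            break  # no overlaps left: the remaining pieces are separate contigs
        merged = reads[best_i] + reads[best_j][best_n:]
        reads = [r for k, r in enumerate(reads) if k not in (best_i, best_j)]
        reads.append(merged)
    return reads


# Toy reads (made up for illustration) that tile a 13-base sequence.
contigs = greedy_assemble(["ATGGCGT", "GCGTACC", "TACCTGA"])
print(contigs)
```

With higher coverage there are more overlapping reads per position, so fewer places where the merge loop runs out of overlaps and leaves a gap.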
    • 11. The Repeat Problem
      • Repeats at read ends can be assembled in multiple ways
      • Figure: the same reads laid out in two alternative ways, one labelled Correct and one labelled Incorrect
        • ATTTATGTGT GTGTGGTGTG GTGTGGTGTG CACTACTGCT ACTACTGCTGACTACT GTGTGGTGTG GTGTGGTGTG ATATCCCT
    • 12. Paired-end Sequencing
      • Figure labels: bacterial chromosome; random shearing; size selection for 3 kb or 8 kb etc.; add linkers; circularise; shear and select on size and presence of linkers; add adapters; obtain sequences from either side of linker, a known distance apart in genome
      • Create long fragments of known length
      • Obtain sequence from paired ends
        • known distance apart
      • Allows assembly of contigs across repeats into scaffolds
    • 13. Genome Assembly
      • Figure labels: Scaffold; Contig 1; Contig 2; Contig 3; Sequence Gap; Physical Gap
    • 14. Re-sequencing
      • Short reads (<200 bp) are inefficient for de novo assembly
      • Instead they are mapped against a reference genome
      • Re-sequencing is like assembling a jigsaw puzzle using the image on the lid
    • 15. SNP calling
      • Comparisons between closely related strains allow identification of SNPs that are informative for
        • Identifying biologically significant changes, e.g. during evolution in lab or patient
        • Reconstructing phylogenies using neutral changes
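As a rough illustration of SNP calling, the sketch below compares a sample sequence with a reference, assuming the two are already aligned, equal in length and free of gaps. The function name `call_snps` and the example sequences are hypothetical. Real SNP callers work from mapped reads and weigh coverage and base quality.

```python
def call_snps(reference: str, sample: str):
    """List (1-based position, reference base, sample base) for each mismatch.

    Assumes the two sequences are pre-aligned and of equal length.
    Positions where either sequence has a non-ACGT symbol are skipped.
    """
    if len(reference) != len(sample):
        raise ValueError("sequences must be aligned to the same length")
    snps = []
    for pos, (ref_base, alt_base) in enumerate(zip(reference.upper(), sample.upper()), start=1):
        if ref_base != alt_base and ref_base in "ACGT" and alt_base in "ACGT":
            snps.append((pos, ref_base, alt_base))
    return snps


# Made-up example: a single T-to-C change at position 6.
print(call_snps("ATGAGTACC", "ATGAGCACC"))
```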
    • 16. Genome annotation
      • Annotation is the addition of information about the predicted sequence features to the flat file of DNA code
      • Identification of potential coding sequences - CDS
      • Homology searches to predict function
      • Other features can be annotated as well
        • rRNAs
        • Potential promoters
        • tRNAs
        • Small non-coding RNAs
        • Repeat sequences
        • Insertion sequences (ISs), transposons, gene fragments
      • Location of the origin of replication
      • Determination of the number of bases, genes, and G+C%.
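The simplest of these summary statistics can be computed directly from a FASTA record. This is a minimal sketch assuming a single-record FASTA string. The helper names `read_fasta` and `summarise` and the demo record are illustrative only.

```python
def read_fasta(text: str):
    """Return (header, sequence) from a single-record FASTA string."""
    lines = [ln.strip() for ln in text.strip().splitlines() if ln.strip()]
    if not lines or not lines[0].startswith(">"):
        raise ValueError("expected a FASTA header line starting with '>'")
    return lines[0][1:], "".join(lines[1:]).upper()


def summarise(seq: str):
    """Count bases and compute G+C% over the whole sequence."""
    gc = seq.count("G") + seq.count("C")
    return {
        "bases": len(seq),
        "gc_percent": 100.0 * gc / len(seq) if seq else 0.0,
    }


header, seq = read_fasta(">demo\nATGC\nGGCC\n")
print(header, summarise(seq))
```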
    • 17. How to go from this….?
      • >Escherichia coli K-12 MG1655_3870656-3890655
      • TGCTGCTGCCTGCTGCGCGGTGCGCTCTACGGATTGCCCGGCGCGATAGAGATCGCTGCCTAAGCCCGCCCCTGCACAACCTGCGTCTATCCACTGCGCCAGGTTTTCTGCGTCACGCCGCAACGGCAAAGACTGCGATGTCCGATGGCAATACCGCTTTTAACGCTTTGATGTATTGCGGACCAAAAGCCGATGACGGAAATATTTTCAGCGCCTGCGGCGCCCGCTTCGAGCGCGGTAAAGGCTTCGGTCGCCGTCGCGCAGCCGGGGCAGACGTCATGCCGTAGCCCACCGCACGGCGGATCACTTCACTATGGATATTGGGCGTAACGATGAGCTGACAGCCCATCCTGGCGAGCGCATCGACCTGTTCAGGTTTCAGTACCGTACCTGCGCCAATCAACGCCTTGTCGCCGTACGCATCAACGATGCGGGAATGCTTTGCTCCCATTGTGGGGAATTCAGCGGGATTTCAACCGCGTCGAACCCGGCGTCAATCACCGCGCCAACATGCGCCAGCGCCTCGTCGGGCGTAATACCGCGCAAAATGGCGATCAGCGGGAGTTTAGTTTGCCACTGCATGAGGATGCTCCTTATACCAGCCTGAAATGCCGTGTCGCCCGCCACCGCCGTCACGTCGCAACCCATCGCCTGAAAGGCTTGCTGGTAGCGCGCGGTCAGCGATGTTCCGGCGACAAGGGTGATGGCGTGTTGATGGGCCACATAGTCGCGCATACTGGGACCTCTGCGCCAATCAACAAACCAGAGAGAAATTCGCTGACCTGTTCGCGGGGAAGTGTTCCCAGCACATGCGAGGCGCGAACTTCAAAAAGCTGCGGCAATATGGCGGGCGTATTAAGACCACGCTCAAGGCCAGCTGTGAAGGCATCGGCAGGTTTTCCTGCGGCGGCAAACCTGCGCCAATCAATGAGTGATTTAACAGTAAATGATGTAATTCACCGGTCATCACGGTGCGAAAATCGTTGATTTGCTGGCTATCGGCCTGCACCCATTTGCAATGGGTTCCGGGCATGACATAAAGAGAGGAAGAGCCAGAGCTCGCGCGCCGATCAATTGTGTTTCTTCGCCGCGCATCACATTGTGGTTATCGTCATGAGAGACACATAATCCGGGAATAATCCAGATATTGTCGCCAACTGACGTTAATTGTTCGCCAATAGACGAAAAACAGGCAGGAACAGATAATACGGTGCAACTTTCCAGCCGACGTTGCTGCCAACCATTCCTGCCATTACCACTGGCGTTTTCTCTTCACGCCAGTCGGTCGTGACTTCTGCTAACACCGCAGCCGGAGATTTTCCGTTCAGGCGCGTGACGCCTGCTTCTGATTGCCTGCTCTCAGGCAGTGGTCGCCCTGATAAAGCCAGGCGCGCAGATTGGTCGATCCCCAGTCAATTGCGATGTAGCGAGCTGTCATGTGATTTCCTTTAACCTTCGTGTCGAGCTGGCGATCATGGTAAGCGCCGCCTGCTCTGCCGCATCGCCGTCCTGATGCGTATCGCATCGAACAGCGCCTTATGTTCCTGGAGCGTTTGCGGCATGTTGGCCTCATCGCCCATCCAGGTTCGTTCAAAAACCGCCCGCTGCAGCGAACTGATCGCAATGCTAAGTTGCTGTAACACCGGGTTATGCACCGACTGCAGCACCGCTCGTGGTAGCGAATATCCGCTTCGTTAAACGCTTCGCGGTCCTGATTGTTGGCAATCATCTCGTTCAGCGCCGATTCAATCTGCGCCAGATCGCTGGAAGTCGCGCGCTCTGCTCCCAACGGGCAATCGCCGGTTCCACCAGATTTCGCACTTCGTCATGGCACTGATAAGCCGTGGGTCGTAGTCATTTTCCAGCACCCATTGCAGTACGTCAGTGTCGAGGTAATTCCACTGGTTACGCGGTGCCACAAACGCCCCGCGATAACGTTTCATTTCAATCAGCCGCTTCGCCATCAGCGAACGGAACACCCACGGATGATGT
TGCGCGAGGTTGCAAACTCCTCACAGAGTTCCGCCTCAGCCGGAAGCGGCGAGCCTGGCACGTATTTGCCGTGAACGATCTGTTTACCCAGCGTAATGACAATGCGATCGGTTTTATTGAGAGTCATGGAGAGTCCTTGTGCTTGTATGTTCTTCTCTACTTTACCCCGATCGATGCATAACGCGGCAACTTTGTAGTACCAGCGTGATGACGTTCGCGTTTGCCGTGCGTGTAATGTAGTACAAACTTATATTGTTGTACTACAATTTAGATCACAAAAAGAACAATGCATAAAAAATGACATGCGTCGGGCAGAAATCTGAAAAGGGATATCAGGCGCTAAACAGGAGGGAAAGAAGAGTATGCTTTCAACGGCTTAGCTACTCGTTTAAAGGATTAATCATGAAGTTGAATTTTAAGGGATTTTTTAAGGCTGCCGGTTTATTCCCACTGCGCTGATGCTTTCAGGCTGTATCTCGTATGCTCTGGTTTCCCATACCGCAAAGGGTAGTTCAGGAAAGTATCAATCGCAGTCAGACACCATCACTGGGCTATCGCAGGCAAAAGATAGTAATGGAACAAAAGGCTATGTTTTTGTAGGGGAATCGTGGATTACCTTATCACTGATGGTGCCGATGACATCGTTAAGATGCTCAATGATCCAGCACTTAACCGGCACAATATTCAGGTTGCCGATGACGCAAGATTTGTTTTAAATGCGGGGAAAAAGAAATTTACCGGCACAATATCGCTTTACTACTACGGAATAACGAAGAAGAAAAGGCACTGGCAACGCATTATGGTTTTGCCTGTGGTGTTCAACACTGTACCAGGTCACTGGAAAACCTAAAAGGCACAATCCATGAGAAAAATAAAAACATGGATTACTCAAAGGTGATGGCGTTCTACCATCCATTTAAGTGCGATTTTATGAATACTATTCACCCAGAGGCATTCCGGGATGGTGTTTCCGCAGCATTACTGCCAGTGACTGTTACGCTGGACATCATTACTGCACCGCTGCAATTTCTGGTTGTATATGCAGTAAACCAATAATCAGTAAGCGGGCAAACCGTTTATGCTGTTTGCCCGCCCACAGATTAATTCAGCACATACTTCTCAATAGCAAACGCCACGCCATCTTCAAGGTTAGATTTGGTGACAAAGTTCGCCACTTCTTTCACTGAAGGAATAGCGTTATCCATCGCCACACCGACGCCTGCATATTAATCATTGCGATATCGTTTTCCTGATCGCCAATCGCCATGATTTCTTCCGGTTTAATACCTAACACGTCGGCCAGTGATTTCACCCCCGTACCTTTGTTAACGCGTTTATCGAGGATTTCGAGGAAGTACGGCGCACTTTTCAGCACGGTATATTCTCTTTCACTTCCTGCGGAATACGCGCGATAGCCTGGTCGAGGATGGCGGGTTCATCAATCATCATCACTTTCAGGAACTGGGTATTGGGGTCCATTTTCTCCGCTTCGCAGAACACCAGCGGAATGGTGGCAACGAAGGATTCATGCACCGTGTGTAGCTGATATCACGGTTGGCGGTGTACAGCGTGGTGCGGTCCAGGGCGTGGAAATGAGAACCGACTTCGCGAGAGAGTTTTTCCAGGAAACGATAGTCGTCATAGCTGAGAGCAGTTTGCGCCACGGTGCTACCATCAGCGGCCTTCTGTACCACGCGCCGTTATAAGTAATGCAGTAGTCGCCCGGCTGTTCCATATGCAGCTCTTTCAGGTAGTTGTGCACACCTGCATACGGGCGACCCGTCGTTAGCACGACATTCACGCCACGGGCGCGAGCTGCGGCAATCGCATTTTTAACGGCGGGTGAAAGGTGTGATCGGGCAGCAGAAGGGTGCCATCCATATCGATAGCAATGAGTTTAATAGCCATGAGTTCCCCAGGTAGATTGGTTCCTGACCCATGCTAACGCGATTCCGCTCAAAAATCAGTACAACACCCGAGGGAAAAGGGGGATGCAACGC
GCGTGCGTGCTCCCTTTTTGCTTAGCGGAAGAGTTTCCCTTTCAGCAGTTCCATGCCTGCGGAAAGCAGATCGTTATTGGCTTGTGGTGACACTTCACCTTGCGGTGAGAGCGCATCAATAATCTTCGGCAATTGTTCTGCCAGTAAACTGGAAGCTGACTGGTATCCACGCCAAGTTTTTGCCCGAGATCGGACACCGCATTTGTGCCGAGCGCCGATTCCAGTTGCTCGCCACTAACCGATTGATTGCCCTGTTGATTACTCAGCCAGGTTGAGAGAATGGCCCCTAAGCCGCCACTTTGCAGTTTTTCCACAGCACCTGAATGCCGCCCTGCTCCTCAACCCAACTTAAAATAGCCTGATATTTCCCCGCATCGCCTTTCAGAAAGGCACCGACAACTTCATCAAAAAGCCCCATGATAATCACCTGTAAAGCGTTACGTGTTGACCCAAAAAGTATAGATTTGCGGATGATAATTGCGGATTGCAGAAATAAAAAGGGCGGAGATGATCTCCGCCCTTTTCTTATAGCTTCTTGCCGGATGCGGCGTGAACGCCTTATCCGGCCTACAAAATCATGAAAATTCAATACATTGCAAGATTTTCGTAGGCCTGATAAGCGTGCGCATCAGGCACGCTCGCATGGTTAGCGCCATTAAATATCGATATTCGCCGCTTTCAGGGCGTTCTCTTCAATAAACGCACGGCGCGGTTCAACGGCGTCGCCCATCAGCGTGGTGAACAACTGGTCGGCAGCAATCGCATCTTTAACGGTAACCGCAGCATACGACGACTTTCCGGGTCCATAGTGGTTTCCCACAGCTGTTCCGGGTTCATCTCGCCCAGACCTTTATAACGCTGGATGGAGAGGCCGCGACGGGACTCTTTCACCAGCCAGTCCAGCGCCTGCTCGAAGCTGGCTACCGGCTGACGCGCTCGCCACGTTCGATAAACGCATCTTCTTCCAGCAAGCCACGCAGTTTCTCACCCAGCGTGCAGATACGACGATATTCGCCACCGGTGATAAACTCGTGATCCAGCGGATAGTCAGTATCCACACCGTGGGTACGCACGCGAACAATCGGCTCAACAGGTTTTGCTCAGCATTGGTGTGAACATCAAACTTCCACTGGCTGCCGTGCTGTTCTTTGTCGTTCAGTTCGCTGACCAGCGCGTTCACCCAGCGGGTAACGGTCTGCTCATCAGAAAGGTCAGCTTCCGTCAACGTCGGCTGATAGATAAGTCTTTCAGCATTGCTTTCGGATAACGACGCTCCATACGATTGATCATTTTCTGCGTCGCGTTGTACTCAGATACCAGTTTCTCTAACGCTTCGCCAGCCAATGCCGGTGCACTGGCGTTGGTGTGCAGCGTTGCGCCGTCCAGCGCGATAGAGATTGGTACTGATCCATCGCTTCGTCGTCTTTAATGTACTGTTCCTGCTTGCCTTTCTTCACTTTGTACAGCGGCGGCTGAGCGATGTAGACGTGACCGCGTTCAACGATTTCCGGCATCTGACGATAGAAGAAGGTCAACAGCAGCGTACGAATGTGGAGCCGTCGACGTCCGCATCGGTCATGATGATGATGCTGTGATAACGCAGTTTGTCCGGGTTGTACTCGTCACGACCGATACCACAGCCAAGCGCGGTGATAAGCGTCGCCACTTCCTGAGAAGAGAGCATCTTATCGAAGCGCGCTTTCTCGACTTGAGGATTTTACCCTTCAGCGGCAGAATCGCCTGGTTCTTGCGGTTACGCCCCTGCTTCGCAGAGCCGCCCGCGGAGTCCCCTTCCACCAGGTACAGTTCGGAAAGCGCCGGATCGCGTTCCTGGCAGTCTGCCAGTTTGCCCGGCAGGCCCGCAAGTCGAGCGCACCTTTACGGCGGGTCATTTCACGCGCGCGACGCGGCGCTTCACGGGCACGGGCAGCATCGATAATTTTGCCAACCACGATTTTCGCGTCGGTTGGGTTTTCCAGCAGGTATTCTGCCAGCAGTTC
GTTCATCTGCTGTTCAACGCCGATTTCACCTCAGAAGAAACCAGTTTGTCTTTGGTCTGGGAGGAGAATTTCGGGTCCGGCACTTTCACGGAAACGACCGCAATCAGGCCTTCACGCGCATCGTCACCGGTGGCGCTGACTTTGGCTTTTTTGCTGTAGCCTTCTTTGTCCATTAGGCGTTCAGGGTACGGGTCATCGCCGCACGGAAGCCTGCCAGGTGAGTACCGCCGTCACGCTGCGGAATGTTGTTGGTAAAGCAGTAGATGTTTTCCTGGAAGCCATCGTTCCACTGCAACGCCACTTCGACGCCAATACCGTCTTTTTCAGTGAGAAGTAGAAGATATTCGGGTGGATCGGCGTTTTGTTCTTGTTCAGATATTCAACGAACGCCTTGATGCCGCCTTCATAGTGGAAGTGGTCTTCTTTGCCGTCGCGCTTGTCGCGCAGACGAATGGAAACGCCGGAGTTGAGGAACGACAACTCCGCAGACGTTTCGCCAGAATTTCATATTCGAACTCGGTCACATTGGTGAAGGTTTCGAGGCTGGGCCAGAAACGCACCATGGTGCCGGTTTTTTCAGTCTCGCCGGTAACCGCCAGCGGGGCCTGCGGTACACCGTGTTCGTAGATCTGACGGTGATTTTACCCTCGCGCTGGATAACCAGCTCCAGTTTTTGCGACAGGGCGTTTACTACCGAAACACCAACGCCGTGCAGACCGCCGGACACTTTATAGGAGTTATCGTCAAATTTACCGCCTGCGTGCAGAACGGTCATGATCACTTCCGCCGCCGA
    • 18. … to this?
      • FT gene complement(9299..10702)
      • FT /db_xref="GenBank:2367266"
      • FT /gene="dnaA"
      • FT /note="b3702"
      • FT CDS complement(9299..10702)
      • FT /db_xref="GI:2367267"
      • FT /db_xref="PID:g2367267"
      • FT /function="putative regulator; DNA - replication, repair,
      • FT restriction/modification"
      • FT /codon_start=1
      • FT /protein_id="AAC76725.1"
      • FT /gene="dnaA"
      • FT /translation="MSLSLWQQCLARLQDELPATEFSMWIRPLQAELSDNTLALYAPNR
      • FT FVLDWVRDKYLNNINGLLTSFCGADAPQLRFEVGTKPVTQTPQAAVTSNVAAPAQVAQT
      • FT QPQRAAPSTRSGWDNVPAPAEPTYRSNVNVKHTFDNFVEGKSNQLARAAARQVADNPGG
      • FT AYNPLFLYGGTGLGKTHLLHAVGNGIMARKPNAKVVYMHSERFVQDMVKALQNNAIEEF
      • FT KRYYRSVDALLIDDIQFFANKERSQEEFFHTFNALLEGNQQIILTSDRYPKEINGVEDR
      • FT LKSRFGWGLTVAIEPPELETRVAILMKKADENDIRLPGEVAFFIAKRLRSNVRELEGAL
      • FT NRVIANANFTGRAITIDFVREALRDLLALQEKLVTIDNIQKTVAEYYKIKVADLLSKRR
      • FT SRSVARPRQMAMALAKELTNHSLPEIGDAFGGRDHTTVLHACRKIEQLREESHDIKEDF
      • FT SNLIRTLSS"
      • FT /product="DNA biosynthesis; initiation of chromosome
      • FT replication; can be transcription regulator"
      • FT /transl_table=11
      • FT /note="f467; 100 pct identical to DNAA_ECOLI SW: P03004;
      • FT CG Site No. 851"
    • 19. Or this?
    • 20. Caveat
      • Real bioinformaticians do not use graphical web-based tools
      • Real bioinformaticians use the Unix (typically Linux) command line interface
      • Often glue programs together into pipelines using Perl
      • Write programs in e.g. Perl or Python
      • Aim here is to equip lab-based worker with basic know-how
      • If you want to become a bioinformatician, do an MSc in Bioinformatics
    • 21. Sources of information for annotation
      • Comparison with genome sequences from related organisms
      • Published experimental data
        • Demonstration of function of a gene
        • Demonstration of function of a homologous gene
      • Review articles on protein families or groups of proteins
        • Prediction that the CDS encodes a member of the family
        • Prediction that the CDS encodes a conserved motif
      • Protein sequence analysis
        • Annotations are only predictions
        • Sequences generated from RNA-Seq and protein mass spectrometry support annotations
      • Expert knowledge on an organism or protein family can assist in annotation
    • 22. Approaches to functional annotation
      • Most of the work now done automatically by programs
        • Analyses strung together into pipelines, so that on our xBASE site we can assemble then annotate a genome in half an hour
        • But automated approaches work best if a closely related sequence is available
      • Wherever there are conflicting predictions, one has to rely on human judgment and interpretation of context
        • Adjusting start codons
        • Fine-tuning descriptors
      • Annotation should rely on an evidence trail that leads back to experimental results (“genomic isnad”)
    • 23. Base composition aids genome analysis
      • GC skew, (G-C)/(G+C)
        • Identifies origin of replication and leading and lagging strands
      • Figure labels: Genes coded by location & function; %G+C; Genes shared with E. coli; Genes unique to S. typhi
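A minimal sketch of the GC skew calculation, (G-C)/(G+C), in non-overlapping windows. It relies on the commonly used heuristic that the cumulative skew tends to reach its minimum near the origin of replication. That heuristic, the window size and the function names are assumptions made here for illustration, and the toy sequence is artificial.

```python
def gc_skew(seq: str, window: int = 1000):
    """GC skew (G-C)/(G+C) for consecutive non-overlapping windows."""
    skews = []
    for start in range(0, len(seq) - window + 1, window):
        chunk = seq[start:start + window].upper()
        g, c = chunk.count("G"), chunk.count("C")
        skews.append((g - c) / (g + c) if g + c else 0.0)
    return skews


def cumulative_skew_minimum(seq: str, window: int = 1000) -> int:
    """Coordinate (end of window) where the running sum of skew is lowest."""
    total, best_total, best_pos = 0.0, float("inf"), 0
    for i, skew in enumerate(gc_skew(seq, window)):
        total += skew
        if total < best_total:
            best_total, best_pos = total, (i + 1) * window
    return best_pos


# Artificial sequence: C-rich first half, G-rich second half.
toy = "C" * 20 + "G" * 20
print(gc_skew(toy, window=10), cumulative_skew_minimum(toy, window=10))
```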
    • 24. Analysis of nucleotide sequence data
      • Search for Sequence Features
        • Promoters, Ribosome-binding Sites
        • Repeats, Inverted Repeats
        • Consensus Sequences for regulator binding site
        • Often rely on sequence motifs
      • tRNA, rRNA, ncRNA
        • tRNAscan-SE, Rfam, RNAmmer
    • 25. Gene Finding in bacteria
      • Ab initio gene prediction
        • By open reading frame
          • Find ORFs
          • Find credible CDSs within ORFs
          • Resolve conflicting ORFs
        • By codon usage
        • By Markov models
      • By homology
        • Similarity Searches via protein or translated BLAST
        • Comparative genomics
    • 26. Identifying protein-coding sequences
        • In bacteria, quick and dirty approach is to find ORFs (open reading frames)
          • Stretches of sequence without termination codons
          • Can be any of 3 termination codons – TAG, TGA, TAA
          • BUT variant genetic codes in mycoplasmas
          • Can be in any of 6 frames – 3 forward and 3 reverse
          • Do NOT necessarily start with initiation codon
        • Do NOT confuse ORFs and CDSs
          • CDSs have an initiation codon
          • Can be any of 3 initiation codons – ATG, GTG, TTG
          • Has to be in the same frame as the termination codon unless the CDS is frame-shifted
          • Homology to other protein sequences can help identify a CDS
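The ORF versus CDS distinction can be sketched in code. This is a minimal sketch assuming the standard three stop codons and the three start codons listed on the slide. The function names and the toy sequence are illustrative only. Trimming to the first in-frame start codon gives the longest candidate CDS, which is not necessarily the true start.

```python
STOPS = {"TAA", "TAG", "TGA"}
STARTS = {"ATG", "GTG", "TTG"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq: str) -> str:
    """Reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq: str, min_codons: int = 3):
    """Return (strand, frame, orf) for stop-free stretches that end at a stop.

    All six frames are scanned: three forward and three on the reverse strand.
    The stop codon itself is not included in the returned ORF.
    """
    found = []
    for strand, s in (("+", seq.upper()), ("-", revcomp(seq.upper()))):
        for frame in range(3):
            current = []
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if codon in STOPS:
                    if len(current) >= min_codons:
                        found.append((strand, frame, "".join(current)))
                    current = []
                else:
                    current.append(codon)
    return found


def candidate_cds(orf: str):
    """Trim an ORF to its first in-frame start codon, or None if it has none."""
    for i in range(0, len(orf), 3):
        if orf[i:i + 3] in STARTS:
            return orf[i:]
    return None


for strand, frame, orf in find_orfs("CCCATGAAATTTTAGCC"):
    print(strand, frame, orf, candidate_cds(orf))
```

In the toy sequence the ORF begins with CCC, upstream of the ATG, which is exactly why an ORF is not a CDS.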
    • 27. The problem of conflicting ORFs
      • Figure labels: Non-coding ORFs; CDSs (note ORF can extend upstream of start codon)
    • 28. The Problem of Frameshift Errors
      • Actual sequence
        • ATGAGTACCGCTAAATTAGTTAAATCAAAAGCGACCAATCTGCTTTATACCCGCAACGATGTCTCCGACAGCGAGAAA
        • Frame 1: M S T A K L V K S K A T N L L Y T R N D V S D S E K
        • Frame 2: • V P L N • L N Q K R P I C F I P A T M S P T A R K
        • Frame 3: E Y R • I S • I K S D Q S A L Y P Q R C L R Q R E K
      • Frameshifted sequence after single base error
        • ATGAGTACCGCTAAATTAGTTAAATCAAAAAGCGACCAATCTGCTTTATACCCGCAACGATGTCTCCGACAGCGAGAA
        • Frame 1: M S T A K L V K S K S D Q S A L Y P Q R C L R Q R E
        • Frame 2: • V P L N • L N Q K A T N L L Y T R N D V S D S E K
        • Frame 3: E Y R • I S • I K K R P I C F I P A T M S P T A R K
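The effect of the single-base error can be reproduced with a small translation sketch. It uses the standard genetic code in its compact TCAG ordering and simulates the error by inserting one extra A after base 30 of the slide's sequence. The function name `translate` is an illustrative assumption.

```python
BASES = "TCAG"
# Standard genetic code, amino acids listed in TCAG order for each codon position.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(seq: str, frame: int = 0) -> str:
    """Translate complete codons from `frame` onwards; '*' marks a stop."""
    seq = seq.upper()
    return "".join(
        CODON_TABLE[seq[i:i + 3]] for i in range(frame, len(seq) - 2, 3)
    )


actual = ("ATGAGTACCGCTAAATTAGTTAAATCAAAAGCGACCAATCTGCTTTATACC"
          "CGCAACGATGTCTCCGACAGCGAGAAA")
# Simulate a sequencing error: one extra A inserted after base 30.
shifted = actual[:30] + "A" + actual[30:]

print(translate(actual))   # reading frame 1 of the actual sequence
print(translate(shifted))  # same frame after the insertion
```

The first ten residues agree and everything downstream of the inserted base changes, which is how a single error can truncate or scramble a predicted CDS.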
    • 29. CDS Prediction: Graphical Plots
      • GC content by reading frame
      • Amino-acid composition by reading frame, compared to average for globular proteins
    • 30. CDS Prediction: Markov Models
      • Markov Model-based programs
        • Use probabilities of states and transitions between these states to predict features
      • Glimmer is industry standard for bacterial genomes
        • Can be trained on related genome
        • Or use long-ORFs (>500 codons) option to bootstrap a model
      • Problems
        • Smaller genes are not statistically significant, so they are thrown out
        • Algorithms are trained with sequences from known genes, which biases against genes about which nothing is known
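To make the idea of states and transition probabilities concrete, here is a minimal fixed-order Markov sketch that scores a sequence by its log-odds of being coding versus background. It is not Glimmer's algorithm (Glimmer uses interpolated Markov models). The order, pseudocount, function names and training strings are all assumptions chosen for illustration.

```python
import math
from collections import defaultdict


def train_markov(seqs, order: int = 2, pseudocount: float = 1.0):
    """Return prob(context, base) estimated from training sequences."""
    counts = defaultdict(lambda: defaultdict(float))
    for s in seqs:
        for i in range(order, len(s)):
            counts[s[i - order:i]][s[i]] += 1

    def prob(context: str, base: str) -> float:
        seen = counts.get(context, {})
        total = sum(seen.values()) + 4 * pseudocount  # 4 possible bases
        return (seen.get(base, 0.0) + pseudocount) / total

    return prob


def log_odds(seq: str, coding, background, order: int = 2) -> float:
    """Sum of log(P_coding / P_background) over every transition in seq."""
    return sum(
        math.log(coding(seq[i - order:i], seq[i]) / background(seq[i - order:i], seq[i]))
        for i in range(order, len(seq))
    )


# Toy training data, made up for illustration only.
coding_model = train_markov(["ATGATGATGATG"])
background_model = train_markov(["CCCCCCCCCCCC"])
print(log_odds("ATGATG", coding_model, background_model))
print(log_odds("CCCCCC", coding_model, background_model))
```

The score is a sum over positions, so a short gene contributes few terms and rarely clears a significance threshold. A model trained only on known genes also assigns low probabilities to genes with unusual composition.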
    • 31. Annotation of protein-coding genes
      • Structure and composition
        • Transmembrane domains
        • Signal peptides
        • Post-translational modifications
      • Homology to other proteins
      • Function(s)
        • Catalytic activity / cofactors / induction / regulation
        • Metabolic pathways
        • Structural genes
        • Cellular location
      • Phase variants, pseudogenes, SNPs, coding repeats, etc.
      • Annotation pipeline
      • Predict CDSs with Glimmer
      • On the predicted genes
        • Do homology searches (BLAST) against nearest relative
        • Port annotation across on orthologues
        • Apply in-depth analysis to strain-specific genes (or all genes if de novo sequence)
        • domain searches: PFAM or CDD
        • PSI-BLAST
      • Perform other analyses: Coiled coils, signal peptides, TM domains
    • 32. Homology
      • Similarity that arises because of descent from a common ancestor…
      • “ The formation of different languages and of distinct species, and the proofs that both have been developed through a gradual process, are curiously parallel… We find in distinct languages striking homologies due to community of descent, and analogies due to a similar process of formation… Languages, like organic beings, can be classed in groups under groups; and they can be classed either naturally according to descent, or artificially by other characters… The survival or preservation of certain favoured words in the struggle for existence is natural selection.”
      • Charles Darwin, 1871 THE DESCENT OF MAN, Chapter 3
    • 33. Homology
      • Similarities in form (sequence) allow us to infer similarities in “meaning” (structure and function)
      • Homology is not just sequence similarity
        • Two sequences can be similar without any common ancestry, particularly if low complexity
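      • Figure: the same sentence in two related languages
        • the cat sat on the mat
        • die Katze sass auf der Matte
      • Figure: similar protein sequence fragments
        • vge|GBant88-2 ITLITCVSVKDNSKRYVVAG
        • vge|GEfae9-178 LTLITCDQATKTTGRIIVIA
        • vge|GSpne1-403 MTLITCDPIPTFNKRLLVNF
        • sortase_staur LTLITCDDYNEKTGVWEKRK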
    • 34. Types of Homology
      • Homologues can be divided into
        • Orthologues: lines of descent congruent with whole genome
        • Paralogues: result of gene duplication
        • Xenologues: result of HGT
    • 35. Homology Searches
      • The aim of homology searches is to identify sequences within these databases that are homologous to your sequence.
      • This involves comparing your sequence with all the database sequences
        • looking for stretches of sequence that appear to be similar
        • then scoring the matches and ranking them
        • a measure of the significance of the match is given
    • 36. What is BLAST?
      • Basic Local Alignment Search Tool
        • Developed in 1990, refined in 1997 (Stephen Altschul)
      • A method of searching sequence databases to find sequences similar to the input sequence
        • Scans a database for alignments to a query sequence
      • Fastest and most frequently used sequence alignment tool
        • the industry standard
      • Can be extremely informative, giving clues to
        • functionality, evolutionary history, important residues
      • Basis for many forms of bioinformatic analysis
    • 37. The several flavours of BLAST
      • BLASTP
        • protein query versus protein sequence database.
      • BLASTN
        • nucleotide query versus nucleotide sequence database.
      • BLASTX
        • translated nucleotide query versus protein sequence database
      • TBLASTN
        • protein query versus translated nucleotide sequence database
      • TBLASTX
        • translated nucleotide query versus translated nucleotide sequence database.
    • 38. Choosing the right flavour
      • What program will best suit your query, and desired output?
      • If you are dealing with a protein-coding gene, comparisons at the protein level give better results
        • Sequence complexity: 20 amino acids versus 4 nucleotides
        • Moderately similar nucleotide sequences could encode a highly similar protein sequence!
        • Use BLASTP or a translated BLAST search, rather than BLASTN
        • Reserve BLASTN for non-coding regions or rRNA/tRNA genes
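The choice of flavour follows mechanically from the type of query and database, as this small sketch shows. The function name `choose_blast` and its arguments are hypothetical. The returned strings are the names of the five BLAST programs listed above.

```python
def choose_blast(query: str, database: str, compare_as_protein: bool = True) -> str:
    """Pick a BLAST program from the query and database sequence types.

    `query` and `database` are "protein" or "nucleotide".
    `compare_as_protein` only matters for nucleotide versus nucleotide:
    set it to False for non-coding regions or rRNA/tRNA genes.
    """
    pair = (query, database)
    if pair == ("protein", "protein"):
        return "blastp"
    if pair == ("protein", "nucleotide"):
        return "tblastn"
    if pair == ("nucleotide", "protein"):
        return "blastx"
    if pair == ("nucleotide", "nucleotide"):
        return "tblastx" if compare_as_protein else "blastn"
    raise ValueError("types must be 'protein' or 'nucleotide'")


print(choose_blast("nucleotide", "protein"))
print(choose_blast("nucleotide", "nucleotide", compare_as_protein=False))
```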
    • 39. Low complexity filtering
      • Low complexity sequence with pronounced compositional bias can lead to spurious alignments
        • Modern versions of BLAST either take into account amino-acid composition or screen out regions of low complexity
      • At NCBI, adjustment for compositional bias is on but low-complexity filter is off by default
        • For “no stones unturned” approach, explore results with adjustments and filter on and off
      • Watch out for…
        • transmembrane or signal peptide regions
        • coiled-coil regions
        • short amino acid repeats (collagen, elastin)
        • homopolymeric repeats
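One simple way to see what "low complexity" means is to measure Shannon entropy in a sliding window. This sketch is only an illustration of the concept and is not the filter BLAST itself applies. The window size, threshold and function names are assumptions.

```python
import math
from collections import Counter


def shannon_entropy(window: str) -> float:
    """Shannon entropy in bits of the residue composition of a window."""
    n = len(window)
    return -sum((c / n) * math.log2(c / n) for c in Counter(window).values())


def low_complexity_windows(seq: str, window: int = 12, threshold: float = 2.2):
    """Start positions (0-based) of windows whose entropy is below threshold."""
    return [
        i
        for i in range(len(seq) - window + 1)
        if shannon_entropy(seq[i:i + window]) < threshold
    ]


print(shannon_entropy("AAAAAAAA"))   # homopolymer: no information
print(shannon_entropy("ACGT"))       # four equally frequent symbols
print(low_complexity_windows("QQQQQQQQQQQQ"))
```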
    • 40. Understanding BLAST Results
      • Graphic representation of results
      • Top of graph represents query sequence
      • Underlying bars show where hits occur
      • Colors represent alignment scores
      • Grey areas represent non similar regions surrounded by similar regions
      • Scrolling over bar shows accession and description of hit
      • Clicking on a bar takes you to its alignment with the query
    • 41. Bit Scores and E-values
      • Bit scores: high is good
      • E-values: low is good
      • http://www.ncbi.nlm.nih.gov/BLAST/tutorial/
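The two measures are linked by the standard relation E = m × n × 2^(-S'), where S' is the bit score and m and n are the query and database lengths. The sketch below applies that relation directly. It assumes raw lengths, whereas BLAST applies corrections to the effective lengths, so the numbers will not match real reports exactly. The function name and example values are illustrative.

```python
def expect_value(bit_score: float, query_len: int, db_len: int) -> float:
    """Expected number of chance hits scoring at least `bit_score`.

    Uses E = m * n * 2 ** (-S'), with uncorrected lengths m and n.
    """
    return query_len * db_len * 2.0 ** (-bit_score)


# Each extra bit of score halves the E-value for a fixed search space.
print(expect_value(10, 32, 32))
print(expect_value(11, 32, 32))
```

This also shows why the same alignment gets a larger E-value as databases grow: the bit score is unchanged but n increases.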
    • 42. Typical Blast Output
      • Sequences producing High-scoring Segment Pairs (columns: Reading Frame, High Score, Sum Probability P(N), N)
        • emb|X69337|ECDPS E.coli dps gene for binding protein +2 834 6.4e-109 1
        • gb|U04242|ECU04242 Escherichia coli core starvation p... +3 828 2.7e-106 1
        • emb|X14180|ECGLNHPQ Escherichia coli glutamine permeas... +3 443 2.8e-53 1
        • gb|U18769|HDU18769 Haemophilus ducreyi fine tangled p... +1 150 4.0e-18 2
        • dbj|D01016|ANALTI46 Anabaena variabilis lti46 gene. >e... +2 129 4.8e-12 2
        • gb|M84990|P26BPO Plasmid pOP2621 ORF1 gene, 5' end;... -2 131 6.7e-09 1
        • gb|U16121|HPU16121 Helicobacter pylori neutrophil act... +1 112 1.8e-06 1
        • gb|M32401|TRPTYF1 T.pallidum pallidum antigen TyF1 g... +3 101 5.6e-06 2
        • emb|X71436|RPNTRB R.phaseoli ntrB gene +1 67 0.76 2
        • gb|L35598|DRODGC1A Drosophila melanogaster receptor g... +1 48 0.97 3
    • 43. Typical Blast Output
      • gb|U18769|HDU18769 Haemophilus ducreyi fine tangled pili major pilin subunit gene
      • Length = 780
      • Plus Strand HSPs:
        • Score = 150 (68.0 bits), Expect = 4.0e-18, Sum P(2) = 4.0e-18
        • Identities = 36/89 (40%), Positives = 46/89 (51%), Frame = +1
        • Query: 30 ELLNRQVIQFIDLSLITKQAHWNMRGANFIAVHEMLDGFRTALIDHLDTMAERAVQLGGV 89
        • Match line: E L ++ +L+LI K AHWN+ G FIAVHEMLD + D +D +AER LG
        • Sbjct: 253 EALQMRLQGLNELALILKHAHWNVVGPQFIAVHEMLDSQVDEVRDFIDEIAERMATLGVA 432
        • Query: 90 ALGTTQVINSKTPLKSYPLDIHNVQDHLK 118
        • Match line: G + + YPL QDHLK
        • Sbjct: 433 PNGLSGNLVETRQSPEYPLGRATAQDHLK 519
    • 44. Domain database searches
      • Rationale
        • Now that databases are very large, BLAST results can be difficult to interpret when there are 1000s of hits
        • If one part of protein has many hits and another part has few hits, useful information may get swamped or lost
      • Solution
        • Search databases that contain collections of protein domains/families
      • Pfam
        • pfam.sanger.ac.uk/
      • CDD
        • www.ncbi.nlm.nih.gov/cdd
      • Represented as sequence alignments and/or HMMs
      • Annotated with information about key features of domain
    • 45. Pfam domains
    • 46. Pfam search results
    • 47. The Annotation Catastrophe
      • Figure labels: Signal Peptide; protease; Coiled coil domain; Protein A (“a protease”); Protein B; Protein C
      • Homology lies in one domain
      • But functional assignment for whole of protein A comes from another domain, carried across in error, so proteins B and C get misannotated as proteases
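One guard against this kind of error is to transfer a functional label only when the alignment actually covers the domain that carries the function. The sketch below assumes 1-based inclusive coordinates on the annotated protein and an arbitrary 80% coverage cut-off. The function name and all coordinates are hypothetical.

```python
def covers_domain(hit_start: int, hit_end: int,
                  dom_start: int, dom_end: int,
                  min_fraction: float = 0.8) -> bool:
    """True if the aligned region covers enough of the functional domain.

    Coordinates are 1-based and inclusive, on the annotated protein.
    """
    overlap = max(0, min(hit_end, dom_end) - max(hit_start, dom_start) + 1)
    return overlap / (dom_end - dom_start + 1) >= min_fraction


# Hypothetical layout of protein A: protease domain at residues 40-300,
# coiled coil domain at residues 350-450.
protease_domain = (40, 300)

# Protein B aligns only to the coiled coil region of A: do not call it a protease.
print(covers_domain(340, 460, *protease_domain))
# A hit spanning the protease domain itself would justify the label.
print(covers_domain(35, 310, *protease_domain))
```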
    • 48. Annotation: rules to consider
      • Don’t trust your computer blindly
      • Adopt Cartesian doubt!
        • Examine and think about your results
      • Confirm with multiple lines of evidence
        • BLAST
        • genomic context
        • PFAM
    • 49. Overview
      • Features of Bacterial Genomes
      • Genome Sequencing
      • Assembly of bacterial genomes
      • Annotation of bacterial genomes
      • Identifying and annotating CDSs
        • An ORF is NOT a CDS!
      • Power and pitfalls of using homology
        • BLAST and PFAM