Genome annotation 2013

Genome annotation, NGS sequence data, decoding sequence information. The genome contains all the biological information required to build and maintain any given living organism.

Published in: Education, Technology
  • Try to describe Genome annotation as a process
    Emphasize the ongoing nature of annotation.
    There is no real end point to the annotation process (only artificially defined ones)
    Best to think of this as a ‘best guess’ annotation
  • Softmasking

    1. Genome Annotation. Karan Veer Singh, Scientist, NBAGR, Karnal, India
    2. The Genome
        • The genome contains all the biological information required to build and maintain any given living organism
        • The genome contains the organism's molecular history
        • Decoding the biological information encoded in these molecules will have an enormous impact on our understanding of biology
    3. Genomics
        1. Structural genomics: genetic and physical mapping of genomes.
        2. Functional genomics: analysis of gene function (and non-genes).
        3. Comparative genomics: comparison of genomes across species.
           • Includes structural and functional genomics.
           • Evolutionary genomics.
    4. Human Genome Project
        • The Human Genome Project promised to revolutionise medicine and explain every base of our DNA.
        • Large medical genetics focus:
          • Identify variation in the genome that is disease causing
          • Determine how individual genes play a role in health and disease
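    5. Human Genome Project & Functional Genome
        • It cost about 3 billion dollars and was declared complete in 2003, 13 years after its 1990 start and 2 years ahead of the original 15-year plan (a working draft was announced in 2000, after 10 years).
        • Approx. 200 Mb still in progress: heterochromatin and repetitive regions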
    6. Genomics & Genome annotation
        • The first genome annotation software system was designed in 1995 by Dr. Owen White at The Institute for Genomic Research, which sequenced and analyzed the first genome of a free-living organism to be decoded, the bacterium Haemophilus influenzae
        • The reads are assembled into contigs, and the contigs are then assembled against a reference genome (reference assembly) or de novo to obtain the complete genome
        • Variations such as mutations, SNPs and InDels can be identified
        • The genome is then annotated by structural and functional annotation
        • The whole genome is mapped as an image in an easily understandable manner
    7. Sequence to Annotation
    8. Input 1 to Genome Viewer: Variant Annotation
    9. Input 2 to Genome Viewer: Structural Annotation
        • Structural annotation: AUGUSTUS (version 2.5.5)
    10. Input 3 to Genome Viewer: Functional Annotation
    11. Genome Annotation
        • The process of identifying the locations of genes and coding regions in a genome and determining what those genes do
        • Finding the structural elements and attaching their related functions to each genome location
    12. Genome Annotation
        • Gene structure prediction: identifying elements (introns/exons, CDS, start, stop) in the genome
        • Gene function prediction: attaching biological information to these elements, e.g. which protein an exon codes for
    13. Structural annotation: identification of genomic elements
        • Open reading frames and their localisation
        • Gene structure
        • Coding regions
        • Location of regulatory motifs
    14. Functional annotation: attaching biological information to genomic elements
        • Biochemical function
        • Biological function
        • Involved regulation
    15. Genome annotation: workflow
        Genome sequence → Repeats → Masked or unmasked genome sequence → Structural annotation (gene finding): ncRNAs (tRNA, rRNA), introns, protein-coding genes → Functional annotation → View in genome viewer
    16. Genome Repeats & features
        • Polymorphic between individuals/populations
        • Terms: microsatellite, minisatellite, tandem repeat, short tandem repeat, SSR
        • Percentage of repetitive sequences in different organisms:

        Genome               Genome size (Mb)    % Repeat
        Aedes aegypti        1,300               ~70
        Anopheles gambiae    260                 ~30
        Culex pipiens        540                 ~50
    17. Finding repeats as a preliminary to gene prediction
        • Repeat discovery
        • Homology-based approaches
        • Use RepeatMasker to search the genome and mask the sequence
    18. Masked sequence
        • A repeat-masked sequence is an artificial construction in which regions thought to be repetitive are marked with X’s
        • Widely used to reduce the overhead of subsequent computational analyses and to reduce the impact of transposable elements (TEs) in the final annotation set
        • Positions/locations are not affected by masking

        >my sequence
        atgagcttcgatagcgatcagctagcgatcaggct
        actattggcttctctagactcgtctatctctatta
        gctatcatctcgatagcgatcagctagcgatcagg
        ctactattggcttcgatagcgatcagctagcgatc
        aggctactattggcttcgatagcgatcagctagcg
        atcaggctactattggctgatcttaggtcttctga
        tcttct

        >my sequence (repeatmasked)
        atgagcttcgatagcgatcagctagcgatcaggct
        actattxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
        xxxxxxatctcgatagcgatcagctagcgatcagg
        ctactattxxxxxxxxxxxxxxxxxxxtagcgatc
        aggctactattggcttcgatagcgatcagctagcg
        atcaggctxxxxxxxxxxxxxxxxxxxtcttctga
        tcttct
    19. Types of Masking: Hard or Soft?
        • Sometimes we want to mark up repetitive sequence without excluding it from downstream analyses. This is achieved with soft masking: repeats are written in lower case and the rest of the sequence stays in upper case. Hard masking replaces the repeats with X’s.

        >my sequence
        ATGAGCTTCGATAGCGCATCAGCTAGCGATCAGGC
        TACTATTGGCTTCTCTAGACTCGTCTATCTCTATT
        AGTATCATCTCGATAGCGATCAGCTAGCGATCAGG
        CTACTATTGGCTTCGATAGCGATCAGCTAGCGATC
        AGGCTACTATTGGCTTCGATAGCGATCAGCTAGCG
        ATCAGGCTACTATTGGCTGATCTTAGGTCTTCTGA
        TCTTCT

        >my sequence (softmasked)
        ATGAGCTTCGATAGCGCATCAGCTAGCGATCAGGC
        TACTATTggcttctctagactcgtctatctctatt
        agtatcATCTCGATAGCGATCAGCTAGCGATCAGG
        CTACTATTggcttcgatagcgatcagcTAGCGATC
        AGGCTACTATTggcttcgatagcgatcagcTAGCG
        ATCAGGCTACTATTGGCTGATCTTAGGTCTTCTGA
        TCTTCT

        >my sequence (hardmasked)
        atgagcttcgatagcgatcagctagcgatcaggct
        actattxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
        xxxxxxatctcgatagcgatcagctagcgatcagg
        ctactattxxxxxxxxxxxxxxxxxxxtagcgatc
        aggctactattggcttcgatagcgatcagctagcg
        atcaggctxxxxxxxxxxxxxxxxxxxtcttctga
        tcttct
    20. Genome annotation: workflow
        Genome sequence → Map repeats → Masked or unmasked genome sequence → Gene finding (structural annotation): ncRNAs, introns, protein-coding genes → Functional annotation → View in genome viewer
    21. Structural annotation: identification of genomic elements
        • Open reading frames and their localization
        • Coding regions
        • Location of regulatory motifs
        • Start/stop
        • Splice sites
        • Non-coding regions/RNAs
        • Introns
    22. Methods
        • Similarity: similarity between sequences, which does not necessarily imply any evolutionary linkage
        • Ab initio prediction: prediction of gene structure from first principles using only the genome sequence
    23. Gene finding: ab initio, similarity
    24. Ab initio prediction
        • Signals read from the genome: coding potential, ATG and stop codons, splice sites
        • Examples: Genefinder, Augustus, Glimmer, SNAP, fgenesh
    25. Gene finding: similarity
        • Use known coding sequence to define coding regions
          • EST sequences
          • Peptide sequences
        • Problem: handling fuzzy alignment regions around splice sites
        • Examples: EST2Genome, exonerate, genewise, Augustus, Prodigal
        Gene finding: comparative
        • Use two or more genomic sequences to predict genes based on conservation of exon sequences
        • Examples: Twinscan and SLAM
    26. Genome annotation: workflow
        Genome sequence → Map repeats → Masked or unmasked genome sequence → Gene finding (structural annotation): ncRNAs, introns, protein-coding genes → Functional annotation → View in genome viewer
    27. Gene finding: non-coding RNA genes
        • Non-coding RNA genes can be predicted using knowledge of their structure or by similarity with known examples
        • tRNAscan: uses an HMM and a covariance model for prediction of tRNA genes
        • Rfam: a suite of HMMs trained against a large number of different RNA genes
    28. Gene-finding omissions
        • Alternative isoforms
          • Currently there is no good method for predicting alternative isoforms
          • Only created where supporting transcript evidence is present
        • Pseudogenes
          • Each genome project has a fuzzy definition of pseudogenes
          • Badly curated/described across the board
        • Promoters
          • Rarely a priority for a genome project
          • Some algorithms exist but are usually not integrated into an annotation set
    29. Practical: structural annotation
        • Eukaryotes: AUGUSTUS (gene model)
          ~/Programs/augustus.2.5.5/bin/augustus --strand=both --genemodel=partial --singlestrand=true --alternatives-from-evidence=true --alternatives-from-sampling=tr --progress=true --gff3=on --uniqueGeneId=true --species=magnaporthe_grisea our_genome.fasta > structural_annotation.gff
        • Prokaryotes: PRODIGAL (codon usage table)
          ~/Programs/prodigal.v2_60.linux -a protein_file.fa -g 11 -d nucleotide_exon_seq.fa -f gff -i contigs.fa -o genes_quality.txt -s genes_score.txt -t genome_training_file.txt
    30. Structural annotation: output
        • Structural annotation conducted using AUGUSTUS (version 2.5.5), with Magnaporthe_grisea as the genome model
    31. Functional annotation
    32. Genome annotation: workflow
        Genome sequence → Map repeats → Masked or unmasked genome sequence → Gene finding (structural annotation): ncRNAs, introns, protein-coding genes → Functional annotation → View in genome viewer
    33. Functional annotation
        Genome → (transcription) → Primary transcript → (RNA processing) → Processed mRNA (m7G cap, ATG, STOP, AAAn tail) → (translation) → Polypeptide → (protein folding) → Folded protein → (find function) → Enzyme activity / functional activity
    34. Functional annotation: attaching biological information to genomic elements
        • Biochemical function
        • Biological function
        • Involved regulation and interactions
        • Expression
        • Use the known structural annotation to obtain the predicted protein sequence
    35. Functional annotation: homology based
        • Predicted exons/CDS/ORFs are searched against a non-redundant protein database (NCBI, SwissProt) to find similar sequences
        • Visually assess the top 5-10 hits to identify whether these have been assigned a function
        • Functions are assigned
    36. Functional annotation: other features
        • Other features which can be determined:
          • Signal peptides
          • Transmembrane domains
          • Low-complexity regions
          • Various binding sites, glycosylation sites etc.
          • Protein domains
          • Secretome
        • See http://expasy.org/tools/ for a good list of possible prediction algorithms
    37. Functional annotation: other features (ontologies)
        • Use of ontologies to annotate gene products
        • Gene Ontology (GO):
          • Cellular component
          • Molecular function
          • Biological process
    38. Practical: functional annotation (homology-based method)
        • Set up a BLAST database (nucleotide or protein)
        • BLAST genome.fasta against it for annotations (nucleotide or protein)
        • Filter the BLAST hits by E-value, keeping those at or below the cut-off (0.01)
        • Assign functions
    39. Functional annotation: output (August 2008, Bioinformatics tools for Comparative Genomics of Vectors)
    40. Conclusion
        • Annotation accuracy depends on the supporting data available at the time of annotation; regular updates are necessary
        • Gene predictions will change over time as new data become available (e.g. at NCBI) that match more closely than earlier data did
        • Functional assignments will change over time as new data become available (e.g. characterization of hypothetical proteins)
    41. Genome annotation: workflow
        Genome sequence → Map repeats → Masked or unmasked genome sequence → Gene finding (structural annotation): ncRNAs, introns, protein-coding genes → Functional annotation → View in genome viewer
    42. Genome Viewer
        • Files that can be visualised: annotation files, indel files, consensus sequence, comparative genomics
    43. Genome View
    44. (no text)
    45. (no text)
    46. (no text)
    47. Short Read track
    48. Thank You
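
The soft and hard masking conventions from slides 18 and 19 can be sketched in a few lines of Python. This is a minimal illustration and not part of the original deck: the helper names `soft_to_hard` and `masked_intervals` are hypothetical, and the input is the soft-masked example sequence from slide 19 joined into one string. It converts soft masking (lower case) to hard masking (x) and lists the masked regions, showing that masking only changes characters, so positions are unaffected.

```python
from typing import List, Tuple


def soft_to_hard(seq: str, mask_char: str = "x") -> str:
    """Replace every lower-case (soft-masked) base with mask_char."""
    return "".join(mask_char if base.islower() else base for base in seq)


def masked_intervals(seq: str) -> List[Tuple[int, int]]:
    """Return 0-based, half-open (start, end) runs of lower-case bases."""
    intervals = []
    start = None
    for i, base in enumerate(seq):
        if base.islower():
            if start is None:
                start = i  # a masked run begins here
        elif start is not None:
            intervals.append((start, i))  # the run ended just before i
            start = None
    if start is not None:
        intervals.append((start, len(seq)))  # run reaches the sequence end
    return intervals


# Soft-masked example from slide 19 (assumed to be one contiguous record).
soft = (
    "ATGAGCTTCGATAGCGCATCAGCTAGCGATCAGGC"
    "TACTATTggcttctctagactcgtctatctctatt"
    "agtatcATCTCGATAGCGATCAGCTAGCGATCAGG"
    "CTACTATTggcttcgatagcgatcagcTAGCGATC"
    "AGGCTACTATTggcttcgatagcgatcagcTAGCG"
    "ATCAGGCTACTATTGGCTGATCTTAGGTCTTCTGA"
    "TCTTCT"
)

hard = soft_to_hard(soft)

print("masked regions (0-based, half-open):", masked_intervals(soft))
print("length unchanged by masking:", len(hard) == len(soft))
print(hard)
```

Soft masking keeps the underlying bases, so the hard-masked form can always be derived from it, but not the other way round. That is one reason to store the soft-masked sequence when a tool accepts it.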