
Ensembl annotation


  1. EBI is an Outstation of the European Molecular Biology Laboratory. Ensembl annotation. Bronwen Aken, 21 September 2014
  2. How Ensembl started • Ewan Birney • Michele Clamp • Tim Hubbard
  3. Ensembl’s goals: annotate (vertebrate) genomes; integrate with other biological data; make publicly available • Stable, automatic annotation • High quality • Regular release cycles • Open source. “Provide a bioinformatics framework to organise biology around the sequences of large genomes”
  4. Challenges 1. Find functional elements in a genome • Data have lots of noise 2. Software / hardware • Storing and manipulating data 3. Intuitive and comprehensive access to data • Visualization
  5. GRCh38 annotation in Ensembl
  6. What is Genebuilding? • Automatic, evidence-based annotation of genes • Not ab initio • Based on sequence alignment • “Best-in-genome” • Aim for high specificity • Prefer to miss a few features rather than heavily over-predict. The automated gene annotation pipeline is designed around decisions made during manual annotation
  7. Advantages of re-annotating • Add new genes to new / fixed genomic regions • Updated supporting evidence: remove models built on data that have been deleted from archives • Move alignments to regions with better mapping
  8. Gene annotation pipeline – the basics. Identify interesting regions • Rough alignment of sequences to genome. Exhaustive alignment to produce transcript models. Filter models • Prioritize data sources. Produce ‘best guess’ gene set
  9. [Pipeline diagram; box labels:] Repeatmasking; Same-species proteins; Other-species proteins; cDNAs/ESTs; UTR addition; Final gene set; Filtering; Protein-coding genebuild; Filtering; TranscriptConsensus; LayerAnnotation. Also: small ncRNAs, lincRNAs, pseudogenes
  10. [Pipeline diagram; box labels:] Repeatmasking; Same-species proteins; Other-species proteins; cDNAs/ESTs; UTR addition; Final gene set; Filtering; Protein-coding genebuild; Filtering; RNA-Seq models. Also: small ncRNAs, lincRNAs, pseudogenes. MERGE WITH HAVANA
  11. Release cycle (26 September 2014). Figure adapted from the ENCODE project, www.nature.com/nature/focus/encode/, with labels: Regulation, Gene, Allele, Conserved sequence. Genes • Coding & noncoding • Protein & mRNA alignments • GTF & BAM files. Compara • Conserved DNA sequence • Multiple genome alignments • Homologues • Protein families. Regulatory regions • DNA methylation • TFBS • Open chromatin. Variation • SNPs, indels, structural variation • Phenotypes • QTLs
  12. Integrate with other species. Chimpanzee / Human. Gene SLC12A1
  13. ‘Patch’ annotation in Ensembl
  14. Genome assembly representation • Coord_system table • Lists the allowed coordinate systems • chromosome, scaffold, contig • With ‘versions’ • GRCh37, GRCh38 • Contigs are shared between assemblies so have no version • ‘Toplevel’ coordinate system • Chromosomes + unplaced scaffolds + unlocalized scaffolds + alternate sequences • Most popular means to access the whole genome • API options for including/excluding alternate sequences and PAR
  15. Genome assembly representation. [Diagram labels:] GRCh38; Scaffolds; Contigs; Chromosome. DNA only loaded for contigs
  16. Genome assembly representation. [Diagram labels:] GRCh38; Scaffolds; Contigs; Chromosome. DNA only loaded for contigs
  17. Genome assembly representation. [Diagram labels:] GRCh38; Scaffolds; Contigs; Chromosome
  18. Genome assembly representation. [Diagram labels:] GRCh38; Scaffolds; Contigs; Chromosome; GRCh37
  19. Genome assembly representation. [Diagram labels:] GRCh38; Scaffolds; Contigs; Chromosome; GRCh37
  20. Seq_region names • Regions of the genome are given a slice name; it’s like an address • e.g. chromosome:GRCh37:6:133090509:133119701:1 • Users like to say, ‘chromosome 6’ • INSDC coordinates are versioned, but less human-readable • chromosome:GRCh37:CM000668.1:133090509:133119701:1 • Fields, in order: coord_system, assembly, seq_region name, start, end, strand
  21. Alternate sequences • Assembly_exception table defines ‘bubbles’ • Initially set up to handle Y chromosome PAR • Adapted to work for MHC haplotypes • Now also used for GRC patches • Assumes ‘equivalent’ region will be present in primary assembly
  22. Gene annotation on a ‘patched’ genome. [Figure: Ensembl browser views of the patched chromosome (HG183_PATCH) and the primary chromosome (Chr. 17), spanning roughly 62.2 to 62.5 Mb, with tracks for genes (GENCODE), contigs, assembly exceptions and H.sap-H.sap lastz alignments.] Annotations on the figure: TEX2 gene lies across the patch boundary; PECAM1 is annotated only on patch HG183; gap in primary assembly
  23. Gene annotation on a ‘patched’ genome
  24. Gene annotation on patches. [Diagram labels:] Patch; Primary
  25. Gene annotation on patches. [Diagram labels:] Patch; Primary. 1. Manual annotation
  26. Gene annotation on patches. [Diagram labels:] Patch; Primary. 1. Manual annotation 2. Project models to patch
  27. Gene annotation on patches. [Diagram labels:] Patch; Primary. 1. Manual annotation 2. Project models to patch 3. Gap-fill with mini genebuild
  28. Ongoing challenges • How strict should we be when aligning proteins and cDNAs to the genome? (Specificity vs. sensitivity) 1. Genome assembly • Sequencing error (inversion, artificial duplication) • Assembly incomplete • Alignments must allow for truncated matches 2. Population variation • Linear genome is made from ‘one’ individual, whereas protein databases contain data from many unknown individuals • Paralogues, gene families, pseudogenes 3. Public databases, e.g. UniProt • Include suspect data and are incomplete for many species • When there’s a match, or no match, is it biologically real? • Aligning proteins from other species must allow for mismatches
  29. Ensembl Acknowledgements. Funding: European Commission Framework Programme 7
  30. Questions?
  31. Reporting data to users. Visualisation and data querying: • When browsing the primary assembly, how do we make it obvious to users when alternate sequences are available? • How do we show when the alternate genomic sequences are identical or differ from one another? • How do we show whether the alternate genome sequences result in identical or different transcribed / translated products? • How do we make a qualitative call about which allele is “better” to use? e.g. ABO • Data download options • Concept of a ‘canonical’ transcript per gene (per tissue). Data analysis: • Linking between alternate alleles (and paralogues?) • How do we show when data have been mapped from an old to new assembly, compared to freshly aligned to a new assembly? When is it right to map instead of align? • In a non-linear genome model, how will SNPs (rsIDs) work? • In a non-linear genome model, what coordinate system should be used?
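
Note on the seq_region names slide: the slice name is a colon-separated address with six fields (coord_system, assembly, seq_region name, start, end, strand). The Python sketch below splits one into labelled parts. It is not the Ensembl API (which is Perl); `SliceName` and `parse_slice_name` are hypothetical names introduced here for illustration. It also assumes that an unversioned coordinate system such as contig simply leaves the assembly field empty.

```python
from typing import NamedTuple


class SliceName(NamedTuple):
    """The six fields of a slice name, in the order given on the slide."""
    coord_system: str
    assembly: str          # may be empty for unversioned systems (assumption)
    seq_region_name: str
    start: int
    end: int
    strand: int


def parse_slice_name(name: str) -> SliceName:
    """Split a colon-separated slice name into its six labelled fields."""
    parts = name.split(":")
    if len(parts) != 6:
        raise ValueError(
            f"expected 6 colon-separated fields, got {len(parts)}: {name!r}"
        )
    coord_system, assembly, seq_region_name, start, end, strand = parts
    return SliceName(
        coord_system, assembly, seq_region_name,
        int(start), int(end), int(strand),
    )


if __name__ == "__main__":
    # Both examples from the slide: the friendly name and the INSDC accession.
    for text in (
        "chromosome:GRCh37:6:133090509:133119701:1",
        "chromosome:GRCh37:CM000668.1:133090509:133119701:1",
    ):
        s = parse_slice_name(text)
        # Coordinates are treated as inclusive, so length is end - start + 1.
        print(s.seq_region_name, s.assembly, s.end - s.start + 1)
```

Both example names describe the same interval; only the seq_region name differs ("6" versus the versioned accession "CM000668.1"), which is the readability trade-off the slide points out.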

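Note on the pipeline slides ("Filter models • Prioritize data sources"): one way to read this step is as layered filtering, where models from a higher-priority evidence source are accepted first and a lower-priority model is kept only if it does not overlap anything already accepted from a higher layer. The Python sketch below illustrates that idea only. The `Model` class, the `layer_filter` function, the source labels and the priority order are all assumptions made for this example, and it ignores strand and multiple sequence regions; it is not the Ensembl LayerAnnotation code.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class Model:
    """A candidate transcript model on a single sequence region (hypothetical)."""
    name: str
    source: str   # evidence type, e.g. "same_species_protein"
    start: int
    end: int


def overlaps(a: Model, b: Model) -> bool:
    """True if the two inclusive intervals share at least one base."""
    return a.start <= b.end and b.start <= a.end


def layer_filter(models: Sequence[Model], priority: Sequence[str]) -> List[Model]:
    """Keep models layer by layer, highest-priority source first.

    A model is kept only if it overlaps nothing accepted from a
    higher-priority layer. Models within one layer do not block each other.
    """
    kept: List[Model] = []
    for source in priority:
        higher_layers = list(kept)  # snapshot of what earlier layers accepted
        for model in models:
            if model.source != source:
                continue
            if not any(overlaps(model, other) for other in higher_layers):
                kept.append(model)
    return kept


if __name__ == "__main__":
    # Assumed priority order, loosely following the order of boxes on the slide.
    priority = ["same_species_protein", "other_species_protein", "cdna"]
    models = [
        Model("A", "same_species_protein", 100, 500),
        Model("B", "other_species_protein", 400, 900),    # overlaps A
        Model("C", "other_species_protein", 2000, 2600),
        Model("D", "cdna", 2500, 3000),                   # overlaps C
        Model("E", "cdna", 5000, 5400),
    ]
    print([m.name for m in layer_filter(models, priority)])
```

Preferring to drop an overlapping lower-priority model rather than keep both matches the stated aim of high specificity: the sketch would sooner miss a feature than over-predict.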