Curation Introduction - Apollo Workshop


Apollo is a web-based application that supports and enables collaborative genome curation in real time, allowing teams of curators to improve on existing automated gene models through an intuitive interface. Apollo allows researchers to break down large amounts of data into manageable portions to mobilize groups of researchers with shared interests.

An introduction to gene annotation & curation for the IAGC and BIPAA research communities.

Published in: Science

  1. Apollo: Collaborative Genome Annotation Editing. Monica Munoz-Torres, PhD | @monimunozto. Berkeley Bioinformatics Open-Source Projects, Environmental Genomics and Systems Biology Division, Lawrence Berkeley National Laboratory, University of California. A workshop for the IAGC research community, 22 March 2017.
  2. Today... We will discuss effective ways to extract valuable information about a genome through curation efforts.
  3. After this workshop, you will: • Better understand curation in the context of genome annotation: assembled genome → automated annotation → manual annotation. • Become familiar with Apollo’s environment and functionality. • Learn to identify homologs of known genes of interest in your newly sequenced genome. • Learn how to corroborate and modify automatically annotated gene models using all available evidence in Apollo.
  4. Life Cycle of a Genome Sequencing Project: experimental design and sampling → sequencing → assembly → automated annotation → manual annotation → merged gene set → comparative analyses → synthesis & dissemination.
  5. Knowledge
  6. “NIH seeks fundamental knowledge about the nature and behavior of living systems and the application of that knowledge to enhance health, lengthen life, and reduce illness and disability.” U.S. National Institutes of Health
  7. “[The NSF aims to] promote and disseminate research, creating knowledge that is valuable to society, the economy, and politics.” Swiss National Science Foundation
  8. Extracting knowledge from data. Data: Red: FF0000
  9. Extracting knowledge from data. Data → Information: South-facing traffic light at St. Charles Ave. has turned red.
  10. Extracting knowledge from data. Data → Information → Knowledge: The light I am driving towards has just turned red.
  11. Extracting knowledge from data. Data → Information → Knowledge: I must stop!
  12. Biocuration is a knowledge-extraction process. [Figure is not yet public. Do not reproduce.] biocuration.org
  13. Genome Curation
  14. Unlocking genomes. Image credits: Marbach et al. 2011. Nature Methods | Shutterstock.com | Alexander Wild
  15. Good genes are required! 1. Generate gene models: a few rounds of gene prediction. 2. Annotate gene models: function, expression patterns, metabolic network memberships. 3. Manually review them: structure & function.
  16. Curation improves quality. Apollo: best representation of biology & removal of elements reflecting errors in automated analyses. Gene Ontology: functional assignments through comparative analysis using literature, databases, and experimental data.
  17. Curation is valuable: • To make accurate orthology assessments. • To accurately annotate expanded / contracted gene families. • To identify novel genes, species-specific isoforms. • To efficiently take advantage of transcriptomic analyses.
  18. Predicting & annotating gene structures
  19. Gene Prediction & Gene Annotation. Identification and annotation of genomic elements: • Primarily focuses on protein-coding genes. • Also identifies RNAs (tRNA, rRNA, long and small non-coding RNAs [ncRNA]), regulatory motifs, repetitive elements, etc. • Happens in 2 steps: the computation phase and the annotation phase.
  20. Computation Phase (1). 1) Experimental data are aligned to the genome: expressed sequence tags, RNA-sequencing reads, proteins. Yandell & Ence, 2012. Nature Reviews Genetics. doi:10.1038/nrg3174
  21. Computation Phase (2). 2) Gene predictions are generated: 2a) Ab initio: based on nucleotide sequence and composition, e.g. Augustus, GENSCAN, geneid, fgenesh, etc. 2b) Using experimental evidence: identifying domains and motifs, e.g. SGP2, JAMg, fgenesh++, etc. Yandell & Ence, 2012. Nature Reviews Genetics. doi:10.1038/nrg3174
  22. Gene Prediction - methods for discovery. 2a) Ab initio: - Based on DNA composition. - Deals strictly with genomic sequences. - Makes use of statistical approaches (e.g. HMM) to search for coding regions and typical gene signals. • E.g. Augustus, GENSCAN, geneid, fgenesh, etc.
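Slide 22 mentions that ab initio predictors use statistical approaches such as hidden Markov models (HMMs) to find coding regions. The following is a minimal sketch of that idea, not the model used by Augustus, GENSCAN, geneid, or fgenesh: a two-state HMM ("noncoding" vs. "coding") decoded with the Viterbi algorithm. All probabilities are invented for illustration (coding regions are simply assumed to be GC-rich), and the function name `viterbi` is our own.

```python
import math

STATES = ("noncoding", "coding")

# Hypothetical parameters, chosen only to make the example readable.
START = {"noncoding": 0.5, "coding": 0.5}
TRANS = {
    "noncoding": {"noncoding": 0.9, "coding": 0.1},
    "coding": {"noncoding": 0.1, "coding": 0.9},
}
EMIT = {
    "noncoding": {"A": 0.35, "C": 0.15, "G": 0.15, "T": 0.35},
    "coding": {"A": 0.15, "C": 0.35, "G": 0.35, "T": 0.15},
}


def viterbi(seq):
    """Return the most likely state label for every base of seq."""
    # Log-probability of the best path ending in each state at position 0.
    scores = {s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}
    backpointers = []
    for base in seq[1:]:
        new_scores, pointers = {}, {}
        for s in STATES:
            # Best previous state from which to enter state s.
            best_prev = max(STATES, key=lambda p: scores[p] + math.log(TRANS[p][s]))
            new_scores[s] = (scores[best_prev]
                             + math.log(TRANS[best_prev][s])
                             + math.log(EMIT[s][base]))
            pointers[s] = best_prev
        scores = new_scores
        backpointers.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: scores[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


labels = viterbi("ATATATAT" + "GCGCGCGCGCGC" + "ATATATAT")
print("".join("C" if s == "coding" else "n" for s in labels))
```

Real gene finders use many more states (exons in each phase, introns, splice signals, start and stop codons) and parameters trained on known genes; the sketch only shows how a sequence is segmented once such a model exists.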
  23. Gene Prediction - methods for discovery. 2b) Evidence-based: - Finds genes using either similarity searches against public databases or other experimental data sets, e.g. RNAseq. • E.g. SGP2, fgenesh++, JAMg, etc.
  24. Computation Phase: result. • The single most likely coding sequence, no UTRs, no isoforms.
  25. Annotation Phase. • Data from experimental evidence and prediction tools are synthesized into a reliable set of structural gene annotations. [Figure labels: 5’ UTR, 3’ UTR]
  26. Consensus Gene Sets. Gene models may be organized into sets using: • Combiners for automatic integration of predicted sets, e.g. GLEAN, EvidenceModeler, etc. • Tools packaged into pipelines, e.g. MAKER, PASA, Gnomon, Ensembl, etc.
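To make the idea of a "combiner" on slide 26 concrete, here is a deliberately simple sketch: a per-base majority vote over exon predictions from several tools. This is an assumption-laden toy, not how GLEAN or EvidenceModeler work (those weigh evidence types and model whole gene structures). Coordinates are assumed to be 1-based and inclusive, as in GFF3, and the function name `consensus_exons` and the example intervals are hypothetical.

```python
from collections import Counter


def consensus_exons(prediction_sets, min_votes=2):
    """Keep bases called exonic by at least min_votes prediction sets.

    prediction_sets: list of exon lists, each exon a (start, end) tuple
    with 1-based inclusive coordinates on the same strand and scaffold.
    Returns merged (start, end) intervals of the supported bases.
    """
    votes = Counter()
    for exons in prediction_sets:
        covered = set()
        for start, end in exons:
            covered.update(range(start, end + 1))
        # Each prediction set votes at most once per base.
        votes.update(covered)

    kept = sorted(pos for pos, n in votes.items() if n >= min_votes)

    # Merge consecutive positions back into intervals.
    intervals = []
    for pos in kept:
        if intervals and pos == intervals[-1][1] + 1:
            intervals[-1][1] = pos
        else:
            intervals.append([pos, pos])
    return [tuple(iv) for iv in intervals]


tool_a = [(100, 200), (300, 400)]
tool_b = [(120, 200), (300, 380)]
tool_c = [(100, 180), (500, 600)]
print(consensus_exons([tool_a, tool_b, tool_c]))
```

The exon predicted only by `tool_c` is dropped, which illustrates the trade-off listed under "Challenges": a consensus reduces false positives but can also discard real, tool-specific predictions that manual curation may later recover.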
  27. Challenges. Ab initio: (+) can capture species-specific or highly divergent genes; (-) false positive predictions (incomplete predictions, readthrough predictions); (-) not enough on its own to establish orthology. Reference-guided: (+) uses reliable gene orthologs from better-annotated species; (-) can miss species-specific genes and other sequences; (-) not enough on its own to establish orthology.
  28. Some suggestions. • Hybrid reference-guided & ab initio gene prediction. • Generate transcriptomic data to confirm predictions, extend & improve models, and identify new expressed loci (the more tissues, the better!). • Review synteny to verify orthologous assignments (largely manual for now).
  29. Annotating gene functions
  30. Functional Annotation. Attaching metadata to structural annotations for the purpose of assigning a particular function. • Assignments do not necessarily have to be supported by your own experimental data. • Sequence similarity approaches must be informed and validated by evolutionary theory, not just a score value.
  31. Gene Ontology (GeneOntology.org). Terms (classes) arranged in a graph: molecular functions, biological processes, cellular locations, and the relationships connecting them all, in a species-independent manner. 1. Molecular Function: an elemental activity or task or job, e.g. protein kinase activity, insulin receptor activity. 2. Biological Process: a commonly recognized series of events, e.g. cell division. 3. Cellular Component: where a gene product is located, e.g. mitochondria, mitochondrial matrix, mitochondrial inner membrane. Image credits: Insulin Receptor, Petrus et al., 2009, ChemMedChem; End of Telophase, Lothar Schermelleh; Mitochondrion, Paiseka / Science Photo Library.
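Slide 31 describes the Gene Ontology as terms arranged in a graph with relationships between them. A minimal sketch of what that means in practice follows, using the cellular component examples from the slide. The edges shown are a simplified assumption for illustration and are not copied from the actual ontology, which has more intermediate terms and uses stable identifiers rather than labels; `PARENTS` and `ancestors` are hypothetical names.

```python
# Each term maps to a list of (relationship, parent term) pairs.
# Simplified, illustrative edges only; not an excerpt of the real ontology.
PARENTS = {
    "mitochondrial matrix": [("part_of", "mitochondrion")],
    "mitochondrial inner membrane": [("part_of", "mitochondrion")],
    "mitochondrion": [("is_a", "cellular component")],
    "cellular component": [],
}


def ancestors(term, graph=PARENTS):
    """Return every term reachable from `term` by following parent edges."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for _relation, parent in graph.get(current, []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


print(sorted(ancestors("mitochondrial matrix")))
```

Because the terms form a graph, annotating a gene product to a specific term such as "mitochondrial matrix" also implies the broader terms above it, which is why curators are encouraged to annotate to the most specific term the evidence supports.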
  32. Collaboratively curating gene structures
  33. Apollo Genome Annotation Editor (GenomeArchitect.org): collaborative, instantaneous, web-based, built on top of JBrowse. Interface callouts: • Color by CDS frame, toggle strands, set color scheme and highlights. • Upload evidence files (GFF3, BAM, BigWig), add combination and sequence search tracks. • Query the genome using BLAT. • Navigate and zoom. • Search for a gene model or a scaffold. • User-created annotations. • Annotator panel. • Evidence tracks. • Stage and cell-type specific transcription data. • Admin. • Protein coding, pseudogenes, ncRNAs, regulatory elements, variants, etc.
  34. Apollo Functionality Overview. GenomeArchitect.org
  35. Apollo: Export. GenomeArchitect.org
  36. Apollo: Collaboration in real time. GenomeArchitect.org
  37. Apollo Architecture
  38. General process of curation. 1. Select or find a region of interest (e.g. scaffold). 2. Select appropriate evidence tracks to review the genome element to annotate (e.g. gene model). 3. Determine whether a feature in an existing evidence track will provide a reasonable gene model to start working. 4. If necessary, adjust the gene model. 5. Check your edited gene model for integrity and accuracy by comparing it with available homologs. 6. Comment and finish.
  39. A brief refresher
  40. Defining a gene (Biorefresher). Consider that: • A gene is a genomic sequence (DNA or RNA) directly encoding functional product molecules, either RNA or protein. • If several functional products share overlapping regions, we take the union of all overlapping genomic sequences coding for them. • This union must be coherent (i.e., processed separately for final protein and RNA products) but does not require that all products necessarily share a common subsequence.
  41. The gene: a moving target. “The gene is a union of genomic sequences encoding a coherent set of potentially overlapping functional products.” Gerstein et al., 2007. Genome Res
  42. mRNA. Although mRNAs exist only briefly, understanding them is absolutely crucial. Image credit: "Gene structure" by Daycd, Wikimedia Commons
  43. Reading frames. In eukaryotes, only one reading frame per section of DNA is biologically relevant at a time: it has the potential to be transcribed into RNA and translated into protein. OPEN READING FRAME (ORF): ORF = Start signal + coding sequence (divisible by 3) + Stop signal
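The ORF definition on slide 43 (Start signal + coding sequence divisible by 3 + Stop signal) can be checked mechanically. The sketch below assumes a spliced, sense-strand DNA sequence, the standard genetic code, and ATG as the only start codon; `is_orf` is a hypothetical helper name, and real annotation tools handle many more cases (alternative starts, partial models, other genetic codes).

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code, DNA alphabet


def is_orf(seq):
    """True if seq is Start + in-frame codons + Stop, with no internal Stop."""
    seq = seq.upper()
    # Need at least a start and a stop codon, and a whole number of codons.
    if len(seq) < 6 or len(seq) % 3 != 0:
        return False
    codons = [seq[i:i + 3] for i in range(0, len(seq), 3)]
    return (
        codons[0] == "ATG"
        and codons[-1] in STOP_CODONS
        and not any(codon in STOP_CODONS for codon in codons[1:-1])
    )


print(is_orf("ATGGCCTAA"))     # start, one codon, stop
print(is_orf("ATGGCTAA"))      # length not divisible by 3
print(is_orf("ATGTAAGCCTAA"))  # in-frame stop before the end
```

A check of this kind is the sort of integrity test referred to in step 5 of the general curation process: an edited gene model whose CDS fails it deserves a second look.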
  44. Splice sites. Splicing “signals” (from the point of view of an intron): • 5’ end splice “signal” (site): usually GT (less common: GC). • 3’ end splice site: usually AG. …]5’ - GT / AG - 3’[… Bringing exons together in alternative combinations produces more than one protein from the same genic region: isoforms.
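The splice signals on slide 44 are easy to inspect once an intron's sequence is in hand. The following sketch classifies an intron by its first and last two bases, assuming the sequence is given 5’ to 3’ on the transcribed strand. The function name `splice_signals` and the label "unusual" are our own; an "unusual" result is a prompt to re-examine the evidence, not proof of an error.

```python
def splice_signals(intron):
    """Classify an intron by its 5' (donor) and 3' (acceptor) dinucleotides."""
    intron = intron.upper()
    if len(intron) < 4:
        raise ValueError("intron too short to carry both splice signals")
    donor, acceptor = intron[:2], intron[-2:]
    if acceptor == "AG" and donor == "GT":
        return "GT/AG"  # the usual pair
    if acceptor == "AG" and donor == "GC":
        return "GC/AG"  # the less common donor
    return "unusual"


print(splice_signals("GTAAGTTTTCAG"))
print(splice_signals("GCAAGTTTTCAG"))
print(splice_signals("ATAAGTTTTCAC"))
```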
  45. Exons and Introns. Introns can interrupt the reading frame of a gene by inserting a sequence: • between two consecutive codons, • between the first and second nucleotide of a codon, • or between the second and third nucleotide of a codon.
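The three intron positions on slide 45 correspond to what is commonly called intron phase: 0 (between two codons), 1 (after the first nucleotide of a codon), and 2 (after the second). A minimal sketch for deriving the phases from the coding lengths of a model's exons is shown below; it assumes the lengths count only CDS bases (no UTR), listed in transcript order, and `intron_phases` is a hypothetical helper name.

```python
def intron_phases(cds_exon_lengths):
    """Return the phase (0, 1 or 2) of each intron in a gene model.

    cds_exon_lengths: number of coding bases in each exon, 5' to 3'.
    The intron after an exon has phase = (coding bases so far) mod 3.
    """
    phases = []
    coding_bases = 0
    # There is one intron after every exon except the last.
    for length in cds_exon_lengths[:-1]:
        coding_bases += length
        phases.append(coding_bases % 3)
    return phases


# Three introns: between codons, after base 1 of a codon, between codons.
print(intron_phases([9, 4, 5, 6]))
```

When an exon boundary is moved during curation, the phase of the downstream intron changes unless the edit adds or removes a multiple of three coding bases, which is why boundary edits so often break the reading frame.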
  46. Obstacles to transcription and translation. • Premature Stop codons in the message are possible. They can arise from incomplete splicing, DNA mutations, transcription errors, and leaky scanning of the ribosome, which can cause changes in the reading frame (frame shifts). A process called nonsense-mediated decay detects and degrades messages that carry them. • Insertions and deletions (indels) can cause frame shifts when the indel length is not divisible by three. As a result, the peptide can be abnormally long or abnormally short, depending on where the first in-frame Stop signal is located.
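The effect of indels described on slide 46 can be illustrated with a short, made-up coding sequence. The sketch below assumes the standard genetic code and a sequence that begins at the start codon; `causes_frameshift`, `peptide_length`, and the example sequence are hypothetical and chosen only so that each case is visible.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code, DNA alphabet


def causes_frameshift(indel_length):
    """An indel shifts the frame unless its length is a multiple of three."""
    return indel_length % 3 != 0


def peptide_length(cds):
    """Codons read before the first in-frame Stop; None if no Stop is found."""
    for codon_index in range(len(cds) // 3):
        codon = cds[3 * codon_index:3 * codon_index + 3]
        if codon in STOP_CODONS:
            return codon_index
    return None


reference = "ATGGCTGAACCCGGGTAA"                # ATG GCT GAA CCC GGG TAA
insertion = reference[:3] + "C" + reference[3:]  # 1-base insertion
deletion_1 = reference[:3] + reference[4:]       # 1-base deletion
deletion_3 = reference[:3] + reference[6:]       # 3-base deletion

for name, seq in [("reference", reference), ("insertion", insertion),
                  ("deletion_1", deletion_1), ("deletion_3", deletion_3)]:
    print(name, peptide_length(seq))
```

In this example the one-base insertion exposes an early in-frame Stop (a shorter peptide), the one-base deletion removes the Stop from the frame entirely (the peptide would run on), and the three-base deletion only drops one codon while leaving the frame intact.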
  47. Thank you for your attention. Berkeley Bioinformatics Open-Source Projects (berkeleybop.org), Environmental Genomics & Systems Biology, Lawrence Berkeley National Laboratory: Suzanna Lewis & Chris Mungall; Seth Carbon (Noctua / AmiGO); Nathan Dunn* (Apollo); Monica Munoz-Torres (Apollo / GO); Jeremy Nguyen Xuan (Monarch Initiative). Funding: • Work for GOC is supported by NIH grant 5U41HG002273-14 from NHGRI. • Apollo is supported by NIH grants 5R01GM080203 from NIGMS and 5R01HG004483 from NHGRI. • BBOP is also supported by the Director, Office of Science, Office of Basic Energy Sciences, of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231. Collaborators: • Ian Holmes, Eric Yao, UC Berkeley (JBrowse). • Chris Elsik, Deepak Unni, U of Missouri (Apollo). • Paul Thomas, USC (Noctua). • Monica Poelchau, USDA/NAL (Apollo). • Gene Ontology Consortium. • i5k Community.
