Curating genes and genomes

Apollo: a collaborative tool for genome curation
Monica Munoz-Torres, PhD | @monimunozto

To modify an exon boundary and match data in the evidence tracks: select bot...
1.  Zoom in to clearly resolve each exon as a distinct rectangle.
2.  Two exons ...
Non-canonical splice sites flags.
Double click: selection of feature and sub-features ...
Non-canonical splices are indicated by an orange circle with a white exclam...
Web Apollo calculates the longest possible open reading frame (ORF) that includes can...
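The idea of picking the longest ORF can be illustrated with a minimal sketch. This is not Apollo's implementation, which the slides do not show; the function name `longest_orf` and the simplifications (forward strand only, ATG start, standard stop codons, already-spliced sequence) are assumptions made for this example.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def longest_orf(mrna):
    """Return (start, end) of the longest ATG..stop ORF on the forward strand.

    Coordinates are 0-based and half-open; the stop codon is included.
    Returns None when no complete ORF is found.
    """
    mrna = mrna.upper()
    best = None
    for frame in range(3):
        start = None  # position of the first ATG seen since the last stop
        for i in range(frame, len(mrna) - 2, 3):
            codon = mrna[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if best is None or (i + 3 - start) > (best[1] - best[0]):
                    best = (start, i + 3)
                start = None
    return best


if __name__ == "__main__":
    seq = "CCATGAAATGACCATGGGGCCCTTTTAGAA"
    span = longest_orf(seq)
    print(span, seq[span[0]:span[1]])
```

A curator would still compare the reported ORF against the evidence tracks, since the longest ORF is not always the biologically correct one.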
complex cases
Evidence may support joining two or more different gene models.
Warning: protein a...
One or more splits may be recommended when:
- different segments of the predi...
DNA Track
‘User-created Annotations’ Track
COMPLEX CASES
correcting frameshifts and single-base erro...
69	
COMPLEX CASES
correcting selenocysteine containing proteins
Becoming Acquainted with Web Apollo. COMPLEX CASES
70	
COMPLEX CASES
correcting selenocysteine containing proteins
Becoming Acquainted with Web Apollo. COMPLEX CASES
1.  Apollo allows annotators to make single base modifications or frameshifts that are re...
72 | 72	
USER NAVIGATION
Becoming Acquainted with Web Apollo.
•  Annotation right-click menu
Annotations, annotation edits, and History: stored in a centralized database.
USER NA...
Follow the checklist until you are happy with the annotation!
And remember to…
–  c...
75 | 75	
USER NAVIGATION
Becoming Acquainted with Web Apollo.
•  Annotation right-click menu
The Annotation Information Editor
USER NAVIGATION
Becoming Acquainted with Web Apollo.
DBXRefs are ...
The Annotation Information Editor
•  Add PubMed IDs
•  Include GO terms as appropriat...
Checklist
•  Check ‘Start’ and ‘Stop’ sites.
•  Check splice sites: most splice sites display ...
Example	
  
Example
Example 81
A public Apollo Demo using the Honey Bee genome is available at ...
What do we know about this genome?
•  Currently publicly available data at NCBI:
•  >37,000 nuc...
PubMed Search: 

what’s new?
Example 83
PubMed Search: what’s new?
Example 84
“Ten populations (3 cultures, 7 from California water bodies...
How many sequences are there, publicly available,
for our gene of interest?
Example 85
•  Para, (voltage-gated sod...
Retrieving sequences for 

sequence similarity searches.
Example 86
>vgsc-Segment3-DomainII
RVFKLAKSWPTLNLLISIMGKT...
BLAT search



input
Example 87
>vgsc-Segment3-DomainII
RVFKLAKSWPTLNLLISIMGKTVGALGNLTFVLCIIIFIFAVMGMQLFGKNYTEKVTK...
BLAT search



results
Example 88
•  High-scoring segment pairs (hsp) are listed in tabulated f...
Creating a new gene model: drag and drop
Example 89
•  Apollo automatically calculates ORF.
In this case, ORF includes the...
Available Tracks
Example 90
Get Sequence
Example 91
http://blast.ncbi.nlm.nih.gov/Blast.cgi
Also, flanking sequences (other gene models) vs. NCBI nr
Example 92
In this case, two gene models upstr...
Review alignments
Example 93
HaztTmpM006234	
  
HaztTmpM006233	
  
HaztTmpM006232	
  
Hypothesis for vgsc gene model
Example 94
Editing: merge the three models
Example 95
Merge by dropping an exon or gene model onto anoth...
Result of merging the three models.
Example 96
Editing: correct boundaries
Example 97
Modify exon / intron boundary:
-  Drag the end of ...
Editing: set translation start
Example 98
Editing: delete exon
Example 99
Delete first exon from HaztTmpM006233
Editing: add an exon - supported by RNAseq
Example 100
•  RNAseq reads show evidence in support of tr...
Editing: and adjust other exon boundary using evidence
Example 101
Editing: adjust other boundaries supported by evidence
Example 102
Finished model
Example 103
Corroborate integrity and accuracy of the model:
- Start and ...
Information Editor
•  DBXRefs: e.g. NP_001128389.1, N. vitripennis, RefSeq
•  PubMed identifier: P...
Demo	
  
APOLLO

demonstration
DEMO 106
Demo video is available at https://youtu.be/VgPtAP_fvxY
OUTLINE

Web Apollo Collaborative Curation and Interactive Analysis of Genomes
OUTLINE
• ...
Exercises	
  
Exercises
Live Demonstration using the Apis mellifera genome.
1. Evidence in support ...
Instructions
APOLLO ON THE WEB

instructions
Username: user.number@example.com
Password: user...
Thank you. 113
•  Berkeley Bioinformatics Open-source Projects (BBOP), Berkeley Lab: Apollo and ...
Genome Curation using Apollo

Comparative genome analysis requires high quality annotations of all genomic elements. Today’s sequencing projects face numerous challenges including lower coverage, more frequent assembly errors, and the lack of closely related species with well-annotated genomes. Precise elucidation of the many different biological features encoded in any genome requires careful examination and review. We need genome annotation editing tools to modify and refine the location and structure of the genome elements that predictive algorithms cannot yet resolve automatically. During the manual annotation process, curators identify elements that best represent the underlying biology and eliminate elements that reflect systemic errors of automated analyses.

Apollo is a web-based application that supports and enables collaborative genome curation in real time, analogous to Google Docs, allowing teams of curators to improve on existing automated gene models through an intuitive interface. Researchers from nearly one hundred institutions worldwide are currently using Apollo for distributed curation efforts in over sixty genome projects across the tree of life: from plants to arthropods, to fungi, to species of fish and other vertebrates including human, cattle (bovine), and dog.

Published in: Science
Genome Curation using Apollo

  1. Curating genes and genomes
     Apollo: a collaborative tool for genome curation
     Monica Munoz-Torres, PhD | @monimunozto
     Berkeley Bioinformatics Open-Source Projects (BBOP)
     Lawrence Berkeley National Laboratory | University of California Berkeley | U.S. Department of Energy
     BioInfoGenomicsWkshopv2 | Reed College, Portland, Oregon | 10 October, 2015
  2. OUTLINE
     Web Apollo Collaborative Curation and Interactive Analysis of Genomes
     •  Today we will discover how to extract the most valuable information about a genome through curation efforts.
  3. APOLLO DEVELOPMENT | Apollo developers
     http://GenomeArchitect.org/
     Nathan Dunn | Eric Yao | JBrowse, UC Berkeley | Christine Elsik’s Lab, University of Missouri | Suzi Lewis, Principal Investigator, BBOP | Moni Munoz-Torres | Stephen Ficklin | GenSAS, Washington State University | Colin Diesh | Deepak Unni
  4. BY THE END OF THIS TALK you will (what to expect)
     •  Better understand genome curation in the context of annotation: assembled genome → automated annotation → manual annotation
     •  Become familiar with the environment and functionality of the Apollo genome annotation editing tool.
     •  Learn to identify homologs of known genes of interest in a newly sequenced genome.
     •  Learn about corroborating and modifying automatically annotated gene models using available evidence in Apollo.
  5. Anatomy of a genome sequencing project
  6. Genome Sequencing Project | Anatomy of a genome sequencing project
     Experimental design, sampling. | Comparative analyses | Consensus Gene Set | Manual Annotation | Automated Annotation | Sequencing | Assembly | Synthesis & dissemination.
  7. CURATING GENOMES | steps involved
     1  Generation of Gene Models: calling ORFs, one or more rounds of gene prediction, etc.
     2  Annotation of gene models: describing function, expression patterns, metabolic network memberships.
     3  Manual annotation
  8. GENOME ANNOTATION | objectives and uses (Curating Genomes)
     The gene set of an organism informs a variety of studies:
     •  Gene number, GC%, TE composition, repetitive regions.
     •  Functional assignments.
     •  Molecular evolution, sequence conservation.
     •  Gene families.
     •  Metabolic pathways.
     •  What makes an organism what it is? What makes a bee a “bee”?
     Marbach et al. 2011. Nature Methods | Shutterstock.com | Alexander Wild
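One of the summary statistics listed above, GC%, is simple enough to sketch in a few lines. The function name `gc_percent` and the choice to ignore ambiguous bases such as N are assumptions for this illustration, not something specified in the slides.

```python
def gc_percent(seq):
    """Return the percentage of G and C among unambiguous bases (A, C, G, T).

    Ambiguous symbols such as N are ignored. Returns 0.0 for a sequence
    that contains no unambiguous bases.
    """
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    total = gc + seq.count("A") + seq.count("T")
    return 100.0 * gc / total if total else 0.0


if __name__ == "__main__":
    print(gc_percent("ATGCGCNNAT"))
```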
  9. First, a bio-refresher
  10. WHAT WE NEED TO KNOW | for manual annotation (FOOD FOR THOUGHT)
     To remember… Biological concepts to better understand manual annotation
     •  GLOSSARY: from contig to splice site
     •  CENTRAL DOGMA: in molecular biology
     •  WHAT IS A GENE?: defining your goal
     •  TRANSCRIPTION: mRNA in detail
     •  TRANSLATION: and other definitions
     •  GENOME CURATION: steps involved
  11. CURATING GENOMES | WHAT WE KNOW | in very general terms
  12. CURATING GENOMES | WHAT WE KNOW | in very general terms
     [Figure with 5’ and 3’ strand ends labeled; source: http://www.wisegeek.com/]
  13. CURATING GENOMES | CENTRAL “DOGMA” | of molecular biology
     •  DNA can be copied to DNA (DNA replication),
     •  DNA information can be copied into mRNA (transcription), and
     •  Proteins can be synthesized using the information in mRNA as a template (translation).
     http://www.wisegeek.com/
  14. BIO-REFRESHER | What is a gene?
     •  The definition of a gene paints a very complex picture of molecular activity and it is a continuously evolving concept.
        -  From the Sequence Ontology (SO): “A gene is a locatable region of genomic sequence, corresponding to a unit of inheritance, which is associated with regulatory regions, transcribed regions and/or other functional sequence regions”.
     “Evolving Concept” at http://goo.gl/LpsajQ
  15. BIO-REFRESHER | What is a gene?
     •  In your lifetime, the Encyclopedia of DNA Elements (ENCODE) project updated this concept yet again. Long transcripts & dispersed regulation!
     “A gene is a DNA segment that contributes to phenotype/function. In the absence of demonstrated function, a gene may be characterized by sequence, transcription or homology.”
     https://www.encodeproject.org/
  16. BIO-REFRESHER | What is a gene? | let’s think computationally!
     •  Think of the genome as an operating system for a living being.
        -  Considering that the nucleotides of the genome are put together into a code that is executed through the process of transcription and translation…
        -  … think of genes as subroutines that are repetitively called in the process of transcription.
     Gerstein et al., 2007. Genome Res.
  17. BIO-REFRESHER | What is a gene? | considerations
     •  Also consider:
        -  A gene is a genomic sequence (DNA or RNA) directly encoding functional product molecules, either RNA or protein.
        -  If several functional products share overlapping regions, we take the union of all overlapping genomic sequences coding for them.
        -  This union must be coherent (i.e., processed separately for final protein and RNA products) but does not require that all products necessarily share a common subsequence.
     Gerstein et al., 2007. Genome Res.
  18. BIO-REFRESHER | What is a gene? | The Gene: a moving target.
     “The gene is a union of genomic sequences encoding a coherent set of potentially overlapping functional products.”
     Gerstein et al., 2007. Genome Res.
  19. BIO-REFRESHER | TRANSLATION | reading frame
     •  Reading frame is a manner of dividing the sequence of nucleotides in mRNA (or DNA) into a set of consecutive, non-overlapping triplets (codons).
     •  Three frames can be read in the 5’ → 3’ direction. Given that DNA has two anti-parallel strands, an additional three frames can be read on the anti-sense strand. Six total possible reading frames exist.
     •  In eukaryotes, only one reading frame per section of DNA is biologically relevant at a time: it has the potential to be transcribed into RNA and translated into protein. This is called the OPEN READING FRAME (ORF).
        -  ORF = Start signal + coding sequence (divisible by 3) + Stop signal
     •  The sections of the mature mRNA transcribed with the coding sequence but not translated are called UnTranslated Regions (UTR); one at each end.
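The six reading frames and the ORF definition above can be made concrete with a small sketch. The helper names (`reverse_complement`, `codons`, `six_frames`, `is_simple_orf`) and the frame labels `+1`..`-3` are assumptions chosen for this illustration; the check uses ATG as the Start signal and the three standard Stop codons.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the anti-sense strand, read 5' to 3'."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def codons(seq, offset):
    """Split seq into consecutive, non-overlapping triplets from offset."""
    seq = seq.upper()
    return [seq[i:i + 3] for i in range(offset, len(seq) - 2, 3)]


def six_frames(seq):
    """Return the codons of all six reading frames, keyed '+1'..'+3', '-1'..'-3'."""
    frames = {}
    for sign, strand in (("+", seq.upper()), ("-", reverse_complement(seq))):
        for offset in range(3):
            frames[f"{sign}{offset + 1}"] = codons(strand, offset)
    return frames


def is_simple_orf(cds):
    """Check: Start signal + coding sequence (divisible by 3) + Stop signal."""
    triplets = codons(cds, 0)
    return (
        len(cds) % 3 == 0
        and len(triplets) >= 2
        and triplets[0] == "ATG"
        and triplets[-1] in STOP_CODONS
        and not any(c in STOP_CODONS for c in triplets[1:-1])
    )


if __name__ == "__main__":
    for name, frame in six_frames("ATGAAATAG").items():
        print(name, frame)
```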
  20. BIO-REFRESHER | TRANSLATION | reading frame
     [Figure: "Reading Frame" by Hornung Ákos - Wikimedia Commons]
  21. BIO-REFRESHER | TRANSLATION | reading frame
     [Figure: "ORF" by Thatsonginc - Wikimedia Commons]
  22. BIO-REFRESHER | TRANSLATION | reading frame: splice sites
     •  The spliceosome catalyzes the removal of introns and the ligation of flanking exons.
        -  introns: spaces inside the gene, not part of the coding sequence
        -  exons: expression units (of the coding sequence)
     •  Splicing “signals” (from the point of view of an intron):
        -  There is a 5’ end splice “signal” (site): usually GT (less common: GC)
        -  And a 3’ end splice site: usually AG
        -  …]5’-GT/AG-3’[…
     •  It is possible to produce more than one protein (polypeptide) sequence from the same genic region by alternatively bringing exons together: alternative splicing. For example, the gene Dscam (Drosophila) can potentially produce about 38,000 alternatively spliced mRNAs (isoforms).
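The splice-signal rule above (GT or, less commonly, GC at the 5’ end of the intron and AG at the 3’ end) can be checked with a short sketch. The function name `classify_intron`, the 0-based half-open coordinates, and the forward-strand assumption are choices made for this example only.

```python
def classify_intron(genome, start, end):
    """Classify an intron by its terminal dinucleotides.

    genome: forward-strand sequence; start/end: 0-based, half-open
    coordinates of the intron. Introns on the reverse strand would need
    to be reverse-complemented first (not handled here).
    """
    intron = genome[start:end].upper()
    if len(intron) < 4:
        raise ValueError("intron too short to carry both splice signals")
    donor, acceptor = intron[:2], intron[-2:]
    if donor == "GT" and acceptor == "AG":
        return "canonical (GT-AG)"
    if donor == "GC" and acceptor == "AG":
        return "less common (GC-AG)"
    return "non-canonical"


if __name__ == "__main__":
    genome = "ATGGTAAGTTTTCAGGCC"
    print(classify_intron(genome, 3, 15))
```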
  23. BIO-REFRESHER | TRANSLATION | now in your mind
     [Figure: "Gene structure" by Daycd - Wikimedia Commons]
     •  Although of brief existence, understanding mRNAs is crucial, as they will become the center of your work.
  24. BIO-REFRESHER | TRANSLATION | reading frame: phase
     •  Introns can interrupt the reading frame of a gene by inserting a sequence:
        -  between two consecutive codons (phase 0),
        -  between the first and second nucleotide of a codon (phase 1),
        -  or between the second and third nucleotide of a codon (phase 2).
     "Exon and Intron classes". Licensed under Fair use via Wikipedia
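The three cases above are commonly called intron phases 0, 1, and 2. As a minimal sketch, the phase of each intron can be derived from the lengths of the coding portions of the exons that precede it; the function name `intron_phases` and the input format are assumptions for this illustration.

```python
def intron_phases(cds_exon_lengths):
    """Return the phase of the intron that follows each exon except the last.

    cds_exon_lengths: lengths of the coding part of each exon, in
    transcript order. Phase 0: intron falls between two codons;
    phase 1: after the first nucleotide of a codon; phase 2: after the
    second nucleotide.
    """
    phases = []
    coding_so_far = 0
    for length in cds_exon_lengths[:-1]:
        coding_so_far += length
        phases.append(coding_so_far % 3)
    return phases


if __name__ == "__main__":
    print(intron_phases([4, 5, 3]))
```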
  25. CURATING GENOMES | TRANSLATION | in detail
     [Figure: "Protein synthesis" by Kelvinsong - Wikimedia Commons]
  26. BIO-REFRESHER | HICCUPS | in transcription and translation
     •  Premature Stop codons can appear in the message as a result of incomplete splicing, DNA mutations, transcription errors, or leaky scanning of the ribosome that changes the reading frame (frameshifts). A process called nonsense-mediated decay detects such messages and degrades them.
     •  Insertions and deletions (indels) can cause frameshifts when the indel length is not divisible by three (3). As a result, the peptide can be abnormally long or abnormally short, depending on where the first in-frame Stop signal is located.
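The indel rule above can be sketched as a pair of small helpers. The names `causes_frameshift` and `first_in_frame_stop` are hypothetical; the example deletes a single base from a toy coding sequence and shows that the original in-frame Stop signal is no longer read.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def causes_frameshift(indel_length):
    """An indel shifts the frame when its length is not divisible by 3."""
    return abs(indel_length) % 3 != 0


def first_in_frame_stop(cds):
    """Return the codon index of the first in-frame stop, or None."""
    cds = cds.upper()
    for i in range(0, len(cds) - 2, 3):
        if cds[i:i + 3] in STOP_CODONS:
            return i // 3
    return None


if __name__ == "__main__":
    cds = "ATGGCTGACCGTTAA"           # ATG GCT GAC CGT TAA
    mutated = cds[:3] + cds[4:]       # delete one base after the start codon
    print(first_in_frame_stop(cds), first_in_frame_stop(mutated))
```

In the mutated sequence the stop is lost from the reading frame, which corresponds to the "abnormally long" case; a shift that exposes an earlier stop would give the "abnormally short" case.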
  27. Prediction & Annotation
  28. Gene Prediction | GENE PREDICTION
     •  The identification of structural features of the genome:
        -  Primarily focused on protein-coding genes.
        -  Predicts also transfer RNAs (tRNA), ribosomal RNAs (rRNA), regulatory motifs, long and small non-coding RNAs (ncRNA), repetitive elements (masked), etc.
        -  Two methods for identification.
        -  Some are self-trained and some must be trained.
  29. Gene Prediction | GENE PREDICTION | methods for discovery
     1)  Ab initio:
        -  based on DNA composition,
        -  deals strictly with genomic sequences
        -  makes use of statistical approaches to search for coding regions and typical gene signals.
        •  E.g. Augustus, GENSCAN, geneid, fgenesh, etc.
     Nat Rev Genet. 2015 Jun;16(6):321-32. doi: 10.1038/nrg3920
  30. Gene Prediction | GENE PREDICTION | methods for discovery (ctd)
     2)  Homology-based:
        -  evidence-based,
        -  finds genes using either similarity searches in the main databases or experimental data including RNAseq, expressed sequence tags (ESTs), full-length complementary DNAs (cDNAs), etc.
        •  E.g: fgenesh++, Just Annotate My genome (JAMg), SGP2
     Nucleic Acids 2003 vol. 31 no. 13 3738-3741
  31. Gene Annotation | GENE ANNOTATION
     Integration of data from computational & experimental evidence with data from prediction tools, to generate a reliable set of structural annotations.
     Involves:
     1)  ab initio predictions
     2)  assessment of biological evidence to drive the gene prediction process
     3)  synthesis of these results to produce a set of consensus gene models
  32. Gene Annotation | GENE ANNOTATION
     Gene models may be organized into “sets” using:
     •  automatic integration of predicted sets (combiners); e.g: GLEAN, EvidenceModeler
     or
     •  tools packaged into pipelines; e.g: MAKER, PASA, Gnomon, Ensembl, etc.
     In some cases algorithms and metrics used to generate consensus sets may actually reduce the accuracy of the gene’s representation.
  33. ANNOTATION | an imperfect art
     No one is perfect, least of all automated annotation.
     New technology brings new challenges:
     •  Assembly errors can cause fragmented annotations
     •  Limited coverage makes precise identification a difficult task
     Image: www.BroadInstitute.org
  34. MANUAL ANNOTATION | improving predictions
     Precise elucidation of biological features encoded in the genome requires careful examination and review.
     •  Automated Predictions
     •  Experimental Evidence: cDNAs, HMM domain searches, RNAseq, genes from other species.
     •  Manual Annotation to the rescue.
     Schiex et al. Nucleic Acids 2003 (31) 13: 3738-3741
  35. BIOCURATION | structural and functional adjustments (MANUAL ANNOTATION)
     1  Identifies elements that best represent the underlying biology and eliminates elements that reflect systemic errors of automated analyses.
     2  Assigns function through comparative analysis of similar genome elements from closely related species using literature, databases, and experimental data.
     http://GeneOntology.org
  36. GENOME ANNOTATION | an inherently collaborative task (APOLLO)
     Researchers often turn to colleagues for second opinions and insight from those with expertise in particular areas (e.g., domains, families).
     So many sequences, but not enough hands!
  37. APOLLO | collaborative genome annotation editing tool
     •  Web based, integrated with JBrowse.
     •  Supports real time collaboration!
     •  Automatic generation of ready-made computable data.
     •  Supports annotation of genes, pseudogenes, tRNAs, snRNAs, snoRNAs, ncRNAs, miRNAs, TEs, and repeats.
     •  Intuitive annotation, gestures, and pull-down menus to create and edit transcripts and exons structures, insert comments (CV, freeform text), associate GO terms, etc.
     http://GenomeArchitect.org
  38. APOLLO ARCHITECTURE | simple, flexible
     Web-based client + annotation-editing engine + server-side data service
     [Architecture diagram, Apollo v2.0; labels as extracted:]
     •  Annotators: Google Web Toolkit (GWT) / Bootstrap
     •  JBrowse: DOJO / jQuery
     •  REST / JSON, Websockets
     •  Annotation Engine (Server)
     •  Shiro, LDAP, OAuth
     •  Annotations, Security, Preferences, Organisms, Tracks
     •  Single Data Store: PostgreSQL, MySQL, MongoDB, ElasticSearch
     •  JBrowse Data, Organism 1; JBrowse Data, Organism 2: BAM, BED, VCF, GFF3, BigWig
     •  Load genomic evidence per selected organism
  39. LESSONS LEARNED (APOLLO)
     We continuously train and support hundreds of geographically dispersed scientists from diverse research communities in conducting manual annotation efforts to recover coding sequences in agreement with all available biological evidence using Apollo.
     What we have learned:
     •  Collaborative work distills invaluable knowledge
     •  We must enforce strict rules and formats
     •  We must evolve with the data
     •  NGS poses additional challenges
  40. TRAINING CURATORS | a little training goes a long way! (APOLLO)
     Provided with adequate tools, wet lab scientists make exceptional curators who can easily learn to maximize the generation of accurate, biologically supported gene models.
  41. Apollo
  42. APOLLO | annotation editing environment (BECOMING ACQUAINTED WITH APOLLO)
     •  Color by CDS frame, toggle strands, set color scheme and highlights.
     •  Upload evidence files (GFF3, BAM, BigWig), add combination and sequence search tracks.
     •  Query the genome using BLAT.
     •  Navigation and zoom.
     •  Search for a gene model or a scaffold.
     •  Get coordinates and “rubber band” selection for zooming.
     •  Login
     •  User-created annotations.
     •  Annotator panel.
     •  Evidence Tracks
     •  Stage and cell-type specific transcription data.
     http://genomearchitect.org/web_apollo_user_guide
  43. Let’s play!
  44. Instructions | APOLLO ON THE WEB | instructions
     Username: user.number@example.com
     Password: usernumber

     Email | Password | Server | Begin at
     user.one@example.com | userone | 1 | 1
     user.two@example.com | usertwo | 2 | 1
     user.three@example.com | userthree | 3 | 1
     user.four@example.com | userfour | 4 | 1
     user.five@example.com | userfive | 5 | 1
     user.six@example.com | usersix | 1 | 7
     user.seven@example.com | userseven | 2 | 7
     user.eight@example.com | usereight | 3 | 7
     user.nine@example.com | usernine | 4 | 7
     user.ten@example.com | userten | 5 | 7
     user.eleven@example.com | usereleven | 1 | 1
     user.twelve@example.com | usertwelve | 2 | 1
     user.thirteen@example.com | userthirteen | 3 | 1
     user.fourteen@example.com | userfourteen | 4 | 1
     user.fifteen@example.com | userfifteen | 5 | 1
     user.sixteen@example.com | usersixteen | 1 | 7
     user.seventeen@example.com | userseventeen | 2 | 7
     user.eighteen@example.com | usereighteen | 3 | 7
     user.nineteen@example.com | usernineteen | 4 | 7
     user.twenty@example.com | usertwenty | 5 | 7
     user.twentyone@example.com | usertwentyone | 1 | 1
     user.twentytwo@example.com | usertwentytwo | 2 | 1
     user.twentythree@example.com | usertwentythree | 3 | 1
     user.twentyfour@example.com | usertwentyfour | 4 | 1
     user.twentyfive@example.com | usertwentyfive | 5 | 1
     user.twentysix@example.com | usertwentysix | 1 | 7
     user.twentyseven@example.com | usertwentyseven | 2 | 7
     user.twentyeight@example.com | usertwentyeight | 3 | 7
     user.twentynine@example.com | usertwentynine | 4 | 7

     Server | URL
     1 | http://52.26.7.239:8080/apollo/annotator/index
     2 | http://52.89.205.105:8080/apollo/annotator/index
     3 | http://52.89.230.210:8080/apollo/annotator/index
     4 | http://52.89.149.42:8080/apollo/annotator/index
     5 | http://52.89.233.118:8080/apollo/annotator/index
  45. Curating with Apollo
  46. Becoming Acquainted with Web Apollo | GENERAL PROCESS OF CURATION | main steps to remember
     1.  Select or find a region of interest, e.g. scaffold.
     2.  Select appropriate evidence tracks to review the gene model.
     3.  Determine whether a feature in an existing evidence track will provide a reasonable gene model to start working.
     4.  If necessary, adjust the gene model.
     5.  Check your edited gene model for integrity and accuracy by comparing it with available homologs.
     6.  Comment and finish.
  47. USER NAVIGATION | removable side dock (HIGHLIGHTED IMPROVEMENTS)
     Annotations | Organism | Users | Groups | Admin | Tracks | Reference Sequence
  48. EDITS & EXPORTS | annotation details, exon boundaries, data export (HIGHLIGHTED IMPROVEMENTS)
     [Screenshot callouts 1 and 2: Annotations]
  49. EDITS & EXPORTS | annotation details, exon boundaries, data export (HIGHLIGHTED IMPROVEMENTS)
     [Screenshot callout 3: Reference Sequences; FASTA, GFF3]
  50. Becoming Acquainted with Web Apollo | USER NAVIGATION
     •  Creating a new annotation
     •  Choose appropriate evidence from list of “Tracks” on annotator panel.
     •  Select & drag elements from evidence track into the ‘User-created Annotations’ area.
     •  Hovering over annotation in progress brings up an information pop-up.
  51. 51. 51 | 51 USER NAVIGATION Becoming Acquainted with Web Apollo. •  Annota'on  right-­‐click  menu  
  52. 52. 52 | 52 USER NAVIGATION Becoming Acquainted with Web Apollo. •  ‘Zoom  to  base  level’  op'on  reveals  the  DNA  Track.  
  53. 53. 53 | 53 USER NAVIGATION Becoming Acquainted with Web Apollo. •  Color  exons  by  CDS  from  the  ‘View’  menu.  
  54. 54. 54 | Zoom  in/out  with  keyboard:   shij  +  arrow  keys  up/down   54 USER NAVIGATION Becoming Acquainted with Web Apollo. •  Toggle  reference  DNA  sequence  and  transla=on  frames  in  forward   strand.  Toggle  models  in  either  direc'on.  
 55. Annotation
 56. Simple cases
 57. SIMPLE CASES: ANNOTATING SIMPLE CASES
 In a “simple case”:
 – the predicted gene model is correct or nearly correct, and
 – this model is supported by evidence that completely or mostly agrees with the prediction.
 – evidence that extends beyond the predicted model is assumed to be non-coding sequence.
 The following are simple modifications.
 58. SIMPLE CASES: ADDING EXONS
 • A confirmation box will warn you if the receiving transcript is not on the same strand as the feature where the new exon originated.
 • Check ‘Start’ and ‘Stop’ signals after each edit.
 59. SIMPLE CASES: ADDING UTRs
 If transcript alignment data are available and extend beyond your original annotation, you may extend or add UTRs.
 1. Right-click at the exon edge and select ‘Zoom to base level’.
 2. Place the cursor over the edge of the exon until it becomes a black arrow, then click and drag the edge of the exon to the new coordinate position that includes the UTR.
 To add a new spliced UTR to an existing annotation, follow the procedure for adding an exon.
 60. SIMPLE CASES: MATCHING EXON BOUNDARY TO EVIDENCE
 To modify an exon boundary so that it matches data in the evidence tracks: select both the offending exon and the feature with the expected boundary, then right-click on the annotation and select ‘Set 3’ end’ or ‘Set 5’ end’ as appropriate.
 In some cases all the data may disagree with the annotation; in other cases some data support the annotation and some support one or more alternative transcripts. Try to annotate as many alternative transcripts as are well supported by the data.
 61. SIMPLE CASES: CHECKING EXON INTEGRITY
 1. Zoom in to clearly resolve each exon as a distinct rectangle.
 2. Two exons from different tracks sharing the same start and/or end coordinates will display a red bar to indicate matching edges.
 3. Selecting the whole annotation or one exon at a time, use this ‘edge-matching’ function and scroll along the length of the annotation, verifying exon boundaries against available data. Use the square bracket keys [ ] to scroll from exon to exon.
 4. Check whether cDNA / RNAseq reads lack one or more of the annotated exons or include additional exons.
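 The edge-matching idea on this slide can be pictured with a minimal sketch. It assumes exons are given as (start, end) coordinate pairs on the same scaffold; the function name and data layout are hypothetical illustrations, not part of Apollo.

```python
from typing import List, Tuple

Exon = Tuple[int, int]  # (start, end) coordinates on one scaffold (assumed layout)


def matching_edges(annotation: List[Exon], evidence: List[Exon]) -> List[Tuple[Exon, bool, bool]]:
    """For each annotated exon, report whether its start and its end
    coincide with the start or end of any exon in the evidence track."""
    evidence_starts = {start for start, _ in evidence}
    evidence_ends = {end for _, end in evidence}
    return [
        ((start, end), start in evidence_starts, end in evidence_ends)
        for start, end in annotation
    ]


if __name__ == "__main__":
    annotated = [(100, 200), (300, 400)]
    rnaseq = [(100, 200), (310, 400)]
    for exon, start_ok, end_ok in matching_edges(annotated, rnaseq):
        print(exon, "start matches:", start_ok, "end matches:", end_ok)
```

 An exon whose start or end is reported as not matching is a candidate for the boundary review described in step 3.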
 62. SIMPLE CASES: ORFs AND SPLICE SITES
 [Figure labels: Evidence Tracks Area; ‘User-created Annotations’ Track; edge-matching; non-canonical splice site flags; double click selects a feature and its sub-features]
 Apollo’s editing logic (brain):
 § selects the longest ORF as the CDS
 § flags non-canonical splice sites
 63. SIMPLE CASES: SPLICE SITES
 Non-canonical splices are indicated by an orange circle with a white exclamation point inside, placed over the edge of the offending exon.
 Canonical splice sites:
 forward strand: 5’-…exon]GT / AG[exon…-3’
 reverse strand, not reverse-complemented: 3’-…exon]GA / TG[exon…-5’
 Zoom in to review non-canonical splice site warnings. Although these may not always have to be corrected (e.g. a GC donor), they should be flagged with the appropriate comment.
 [Figure labels: exon/intron splice site error warning; curated model]
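 As an illustration of the canonical GT/AG rule for a forward-strand intron, here is a minimal sketch. It assumes 1-based inclusive exon coordinates and a plain string for the scaffold sequence; the function names are hypothetical and this is not how Apollo itself is implemented.

```python
from typing import Tuple


def intron_splice_sites(scaffold: str, exon_end: int, next_exon_start: int) -> Tuple[str, str]:
    """Return the (donor, acceptor) dinucleotides of the intron that lies
    between two forward-strand exons, using 1-based inclusive coordinates."""
    # The intron covers bases exon_end + 1 .. next_exon_start - 1 (1-based),
    # which is the Python slice [exon_end : next_exon_start - 1].
    intron = scaffold[exon_end:next_exon_start - 1].upper()
    return intron[:2], intron[-2:]


def is_canonical(donor: str, acceptor: str) -> bool:
    """True only for the canonical GT donor / AG acceptor pair."""
    return (donor, acceptor) == ("GT", "AG")


if __name__ == "__main__":
    # Toy scaffold: exon 1 = bases 1-6, intron = bases 7-14, exon 2 starts at 15.
    scaffold = "ATGGCC" + "GTAAAAAG" + "TTTTAA"
    donor, acceptor = intron_splice_sites(scaffold, 6, 15)
    print(donor, acceptor, is_canonical(donor, acceptor))
```

 A GC donor, as mentioned on the slide, would be reported as non-canonical by this check and would then need a comment rather than necessarily a correction.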
 64. SIMPLE CASES: ‘START’ AND ‘STOP’ SITES
 Web Apollo calculates the longest possible open reading frame (ORF) that includes canonical ‘Start’ and ‘Stop’ signals within the predicted exons.
 If the ‘Start’ appears to be incorrect, modify it by selecting an in-frame ‘Start’ codon further upstream or downstream, depending on the evidence (protein database, additional evidence tracks).
 The correct ‘Start’ may be present outside the predicted gene model, within a region supported by another evidence track.
 In very rare cases, the actual ‘Start’ codon may be non-canonical (non-ATG).
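 The "longest ORF" idea can be sketched as follows. This is a minimal illustration that assumes a spliced transcript given as a DNA string and canonical ATG start and TAA/TAG/TGA stop codons; it is not Apollo's actual code, and the function name is hypothetical.

```python
from typing import Optional, Tuple

STOP_CODONS = {"TAA", "TAG", "TGA"}


def longest_orf(transcript: str) -> Optional[Tuple[int, int]]:
    """Return (start, end) of the longest ATG...stop ORF as a 0-based,
    half-open interval on the transcript, or None if there is none.
    Only the three forward frames are scanned."""
    seq = transcript.upper()
    best = None
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first in-frame start since the last stop
            elif start is not None and codon in STOP_CODONS:
                if best is None or (i + 3 - start) > (best[1] - best[0]):
                    best = (start, i + 3)
                start = None
    return best


if __name__ == "__main__":
    transcript = "CCATGAAATAGATGCCCGGGTTTTGA"
    orf = longest_orf(transcript)
    print(orf, transcript[orf[0]:orf[1]])
```

 Choosing a different in-frame ‘Start’, as described on the slide, corresponds to overriding the start coordinate returned here when the evidence supports it.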
 65. Complex cases
 66. COMPLEX CASES: merge two gene predictions on the same scaffold
 Evidence may support joining two or more different gene models.
 Warning: protein alignments may have incorrect splice sites and lack non-conserved regions!
 1. In the ‘User-created Annotations’ area, shift-click to select an intron from each gene model, then right-click and select the ‘Merge’ option from the menu.
 2. Drag supporting evidence tracks over the candidate models to corroborate overlap, or review edge matching and coverage across models.
 3. Check the resulting translation by querying a protein database, e.g. UniProt or NCBI nr. Add comments to record that this annotation is the result of a merge.
 Red lines around exons: ‘edge-matching’ allows annotators to confirm whether the evidence is in agreement without examining each exon at the base level.
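 Conceptually, a merge combines the exons of two gene models on the same scaffold and strand into one model. The sketch below is only an illustration under that assumption: exons are (start, end) pairs, overlapping exons are fused, and the function name is hypothetical. How Apollo resolves overlaps internally is not described in these slides.

```python
from typing import List, Tuple

Exon = Tuple[int, int]  # (start, end), inclusive, same scaffold and strand (assumed)


def merge_models(model_a: List[Exon], model_b: List[Exon]) -> List[Exon]:
    """Combine the exons of two gene models into one sorted exon list,
    fusing any exons that overlap each other."""
    exons = sorted(model_a + model_b)
    if not exons:
        return []
    merged = [exons[0]]
    for start, end in exons[1:]:
        last_start, last_end = merged[-1]
        if start <= last_end:
            # Overlapping exons collapse into one exon spanning both.
            merged[-1] = (last_start, max(last_end, end))
        else:
            merged.append((start, end))
    return merged


if __name__ == "__main__":
    upstream = [(100, 200), (300, 400)]
    downstream = [(350, 450), (600, 700)]
    print(merge_models(upstream, downstream))
```

 After a merge, the translation of the combined model still needs to be checked against a protein database, as step 3 above recommends.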
 67. COMPLEX CASES: split a gene prediction
 One or more splits may be recommended when:
 – different segments of the predicted protein align to two or more different gene families
 – the predicted protein doesn’t align to known proteins over its entire length
 Transcript data may support a split, but first verify whether they are alternative transcripts.
 68. COMPLEX CASES: correcting frameshifts and single-base errors
 [Figure labels: DNA Track; ‘User-created Annotations’ Track]
 Always remember: when annotating gene models using Apollo, you are looking at a ‘frozen’ version of the genome assembly and you will not be able to modify the assembly itself.
 69. COMPLEX CASES: correcting selenocysteine containing proteins
 70. COMPLEX CASES: correcting selenocysteine containing proteins
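 Selenocysteine is encoded by a UGA (TGA in DNA) codon that would otherwise be read as a ‘Stop’, which is why such proteins look prematurely truncated in a standard translation. The sketch below illustrates the effect of marking a stop codon as read-through. It assumes the standard genetic code and an unambiguous A/C/G/T coding sequence; the function and parameter names are hypothetical and not Apollo's API.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(cds: str, readthrough_codons=frozenset()) -> str:
    """Translate a coding sequence up to the first stop codon.
    A TGA codon whose 0-based codon index is in readthrough_codons is
    translated as selenocysteine ('U') instead of ending the protein."""
    protein = []
    for index in range(len(cds) // 3):
        codon = cds[3 * index:3 * index + 3].upper()
        residue = CODON_TABLE[codon]
        if residue == "*":
            if codon == "TGA" and index in readthrough_codons:
                residue = "U"
            else:
                break
        protein.append(residue)
    return "".join(protein)


if __name__ == "__main__":
    cds = "ATGTGAAAATAA"
    print(translate(cds))        # stops at the in-frame TGA
    print(translate(cds, {1}))   # reads through the TGA at codon index 1
```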
 71. COMPLEX CASES: correcting frameshifts, single-base errors, and selenocysteines
 1. Apollo allows annotators to make single-base modifications or frameshifts that are reflected in the sequence and structure of any transcripts overlapping the modification. These manipulations do NOT change the underlying genomic sequence.
 2. If you determine that you need to make one of these changes, zoom in to the nucleotide level and right-click over a single nucleotide on the genomic sequence to access a menu that provides options for creating insertions, deletions or substitutions.
 3. The ‘Create Genomic Insertion’ feature will require you to enter the necessary string of nucleotide residues that will be inserted to the right of the cursor’s current location. The ‘Create Genomic Deletion’ option will require you to enter the length of the deletion, starting with the nucleotide where the cursor is positioned. The ‘Create Genomic Substitution’ feature asks for the string of nucleotide residues that will replace the ones on the DNA track.
 4. Once you have entered the modifications, Apollo will recalculate the corrected transcript and protein sequences, which will appear when you use the right-click menu ‘Get Sequence’ option. Since the modification is reflected in all annotations that include the modified region, you should alert the curators of your organism’s database using the ‘Comments’ section to report the CDS edits.
 5. In special cases such as selenocysteine-containing proteins (read-throughs), right-click over the offending/premature ‘Stop’ signal and choose the ‘Set readthrough stop codon’ option from the menu.
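 The three kinds of modification in step 3 can be pictured as operations that produce an altered copy of the sequence while the reference stays untouched. This is a minimal sketch under stated assumptions: positions are 0-based, "to the right of the cursor" is read as "immediately after the base at that position", and the function name is hypothetical rather than an Apollo call.

```python
def apply_alteration(reference: str, kind: str, position: int, value) -> str:
    """Return an altered copy of reference; the reference itself is not changed.

    insertion:    value is the string inserted just after the base at position.
    deletion:     value is the number of bases removed, starting at position.
    substitution: value is the string that replaces the bases starting at position.
    """
    if kind == "insertion":
        return reference[:position + 1] + value + reference[position + 1:]
    if kind == "deletion":
        return reference[:position] + reference[position + value:]
    if kind == "substitution":
        return reference[:position] + value + reference[position + len(value):]
    raise ValueError(f"unknown alteration kind: {kind}")


if __name__ == "__main__":
    reference = "ATGCCTAA"
    print(apply_alteration(reference, "insertion", 2, "A"))
    print(apply_alteration(reference, "deletion", 3, 2))
    print(apply_alteration(reference, "substitution", 3, "GG"))
    print(reference)  # unchanged, like the 'frozen' assembly
```

 Transcript and protein sequences would then be recalculated from the altered copy, which mirrors what step 4 describes for ‘Get Sequence’.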
 72. USER NAVIGATION
 • Annotation right-click menu
 73. USER NAVIGATION
 Annotations, annotation edits, and History are stored in a centralized database.
 74. COMPLETING THE ANNOTATION
 Follow the checklist until you are happy with the annotation! And remember to…
 – comment to validate your annotation, even if you made no changes to an existing model. Think of comments as your vote of confidence.
 – or add a comment to inform the community of unresolved issues you think this model may have.
 Always remember: Apollo curation is a community effort, so please use comments to communicate the reasons for your annotation. Your comments will be visible to everyone.
 75. USER NAVIGATION
 • Annotation right-click menu
 76. USER NAVIGATION: the Annotation Information Editor
 DBXRefs are database cross-references: if you have reason to believe that this gene is linked to a gene in a public database (including your own), then add it here.
 77. USER NAVIGATION: the Annotation Information Editor
 • Add PubMed IDs.
 • Include GO terms as appropriate from any of the three ontologies.
 • Write comments stating how you have validated each model.
 78. Checklist
 79. MANUAL ANNOTATION CHECKLIST: for accuracy and integrity
 • Check ‘Start’ and ‘Stop’ sites.
 • Check splice sites: most splice sites display these residues …]5’-GT/AG-3’[…
 • Check if you can annotate UTRs, for example using RNA-Seq data:
 – Align it against relevant genes/gene family
 – blastp against NCBI’s RefSeq or nr
 • Check for gaps in the genome.
 • Additional functionality may be necessary:
 – Merging 2 gene predictions on the same scaffold
 – Merging 2 gene predictions from different scaffolds
 – Splitting a gene prediction
 – Correcting frameshifts and other errors in the genome assembly
 – Annotating selenocysteines, correcting single-base errors, etc.
 • Add:
 – Important project information in the form of comments
 – IDs from public databases e.g. GenBank (via DBXRef), gene symbol(s), common name(s), synonyms, top BLAST hits, orthologs with species names, and everything else you can think of, because you are the expert.
 – Comments about the kinds of changes you made to the gene model of interest, if any.
 – Any appropriate functional assignments, e.g. via BLAST, RNA-Seq data, literature searches, etc.
 80. Example
 81. Example
 A public Apollo Demo using the Honey Bee genome is available at http://genomearchitect.org/WebApolloDemo
 – Curation example using the Hyalella azteca genome (amphipod crustacean).
 82. Example: What do we know about this genome?
 • Currently publicly available data at NCBI:
 – >37,000 nucleotide seqs → scaffolds, mitochondrial genes
 – 300 amino acid seqs → mitochondrion
 – 53 ESTs
 – 0 conserved domains identified
 – 0 “gene” entries submitted
 • Data at i5K Workspace@NAL (annotation hosted at USDA):
 – 10,832 scaffolds; 23,288 transcripts; 12,906 proteins
 83. Example: PubMed Search: what’s new?
 84. Example: PubMed Search: what’s new?
 “Ten populations (3 cultures, 7 from California water bodies) differed by at least 550-fold in sensitivity to pyrethroids.”
 “By sequencing the primary pyrethroid target site, the voltage-gated sodium channel (vgsc), we show that point mutations and their spread in natural populations were responsible for differences in pyrethroid sensitivity.”
 “The finding that a non-target aquatic species has acquired resistance to pesticides used only on terrestrial pests is troubling evidence of the impact of chronic pesticide transport from land-based applications into aquatic systems.”
 85. Example: How many sequences are there, publicly available, for our gene of interest? And what do we know about them?
 • Para (voltage-gated sodium channel alpha subunit; Nasonia vitripennis).
 • NaCP60E (Sodium channel protein 60 E; D. melanogaster).
 – MF: voltage-gated cation channel activity (IDA, GO:0022843).
 – BP: olfactory behavior (IMP, GO:0042048), sodium ion transmembrane transport (ISS, GO:0035725).
 – CC: voltage-gated sodium channel complex (IEA, GO:0001518).
 86. Example: Retrieving sequences for sequence similarity searches.
 >vgsc-Segment3-DomainII
 RVFKLAKSWPTLNLLISIMGKTVGALGNLTFVLCIIIFIFAVMGMQLFGKNYTEKVTKFKWSQDG
 QMPRWNFVDFFHSFMIVFRVLCGEWIESMWDCMYVGDFSCVPFFLATVVIGNLVVSFMHR
 87. Example: BLAT search input
 >vgsc-Segment3-DomainII
 RVFKLAKSWPTLNLLISIMGKTVGALGNLTFVLCIIIFIFAVMGMQLFGKNYTEKVTKFKWSQDG
 QMPRWNFVDFFHSFMIVFRVLCGEWIESMWDCMYVGDFSCVPFFLATVVIGNLVVSFMHR
 88. Example: BLAT search results
 • High-scoring segment pairs (hsp) are listed in tabulated format.
 • Clicking on one line of results sends you to those coordinates.
 89. Example: Creating a new gene model: drag and drop
 • Apollo automatically calculates the ORF. In this case, the ORF includes the high-scoring segment pairs (hsp), marked here in blue.
 90. Example: Available Tracks
 91. Example: Get Sequence
 http://blast.ncbi.nlm.nih.gov/Blast.cgi
 92. Example: Also, flanking sequences (other gene models) vs. NCBI nr
 In this case, two gene models upstream, at the 5’ end.
 [Figure label: BLAST hsps]
 93. Example: Review alignments
 HaztTmpM006234, HaztTmpM006233, HaztTmpM006232
 94. Example: Hypothesis for vgsc gene model
 95. Example: Editing: merge the three models
 Merge by dropping an exon or gene model onto another, or…
 Merge by selecting two exons (holding down “Shift”) and using the right-click menu.
 96. Example: Result of merging the three models.
 97. Example: Editing: correct boundaries
 Modify exon / intron boundary:
 – Drag the end of the exon to the nearest canonical splice site, or
 – Use the right-click menu.
 98. Example: Editing: set translation start
 99. Example: Editing: delete exon
 Delete the first exon from HaztTmpM006233
 100. Example: Editing: add an exon supported by RNAseq
 • RNAseq reads show evidence in support of a transcribed product, which was not predicted.
 • Add an exon at coordinates 97946-98012 by dragging up one of the RNAseq reads.
 101. Example: Editing: and adjust other exon boundary using evidence
 102. Example: Editing: adjust other boundaries supported by evidence
 103. Example: Finished model
 Corroborate integrity and accuracy of the model:
 – Start and Stop
 – Exon structure and splice sites …]5’-GT/AG-3’[…
 – Check the predicted protein product vs. NCBI nr, UniProt, etc.
 104. Example: Information Editor
 • DBXRefs: e.g. NP_001128389.1, N. vitripennis, RefSeq
 • PubMed identifier: PMID: 24065824
 • Gene Ontology IDs: GO:0022843, GO:0042048, GO:0035725, GO:0001518.
 • Comments (if applicable).
 • Name, Symbol.
 • Approve / Delete radio button.
 105. Demo
 106. DEMO: APOLLO demonstration
 Demo video is available at https://youtu.be/VgPtAP_fvxY
 107. OUTLINE: Web Apollo Collaborative Curation and Interactive Analysis of Genomes
 • BIO-REFRESHER: biological concepts for curation
 • ANNOTATION: automatic predictions
 • MANUAL ANNOTATION: necessary, collaborative
 • APOLLO: advancing collaborative curation
 • EXAMPLE: demos
 • EXERCISES
 108. Exercises
 109. Exercises: Live Demonstration using the Apis mellifera genome.
 1. Evidence in support of protein coding gene models.
 1.1 Consensus Gene Sets:
 Official Gene Set v3.2
 Official Gene Set v1.0
 1.2 Consensus Gene Sets comparison:
 OGSv3.2 genes that merge OGSv1.0 and RefSeq genes
 OGSv3.2 genes that split OGSv1.0 and RefSeq genes
 1.3 Protein Coding Gene Predictions Supported by Biological Evidence:
 NCBI Gnomon
 Fgenesh++ with RNASeq training data
 Fgenesh++ without RNASeq training data
 NCBI RefSeq Protein Coding Genes and Low Quality Protein Coding Genes
 1.4 Ab initio protein coding gene predictions:
 Augustus Set 12, Augustus Set 9, Fgenesh, GeneID, N-SCAN, SGP2
 1.5 Transcript Sequence Alignment:
 NCBI ESTs, Apis cerana RNA-Seq, Forager Bee Brain Illumina Contigs, Nurse Bee Brain Illumina Contigs, Forager RNA-Seq reads, Nurse RNA-Seq reads, Abdomen 454 Contigs, Brain and Ovary 454 Contigs, Embryo 454 Contigs, Larvae 454 Contigs, Mixed Antennae 454 Contigs, Ovary 454 Contigs, Testes 454 Contigs, Forager RNA-Seq HeatMap, Forager RNA-Seq XY Plot, Nurse RNA-Seq HeatMap, Nurse RNA-Seq XY Plot
 110. Exercises: Live Demonstration using the Apis mellifera genome.
 1. Evidence in support of protein coding gene models (Continued).
 1.6 Protein homolog alignment:
 Acep_OGSv1.2, Aech_OGSv3.8, Cflo_OGSv3.3, Dmel_r5.42, Hsal_OGSv3.3, Lhum_OGSv1.2, Nvit_OGSv1.2, Nvit_OGSv2.0, Pbar_OGSv1.2, Sinv_OGSv2.2.3, Znev_OGSv2.1, Metazoa_Swissprot
 2. Evidence in support of non protein coding gene models
 2.1 Non-protein coding gene predictions:
 NCBI RefSeq Noncoding RNA
 NCBI RefSeq miRNA
 2.2 Pseudogene predictions:
 NCBI RefSeq Pseudogene
 111. Instructions. APOLLO ON THE WEB: instructions (the same login table and server URLs listed earlier in this transcript)
 112. Thank you.
 • Berkeley Bioinformatics Open-source Projects (BBOP), Berkeley Lab: Apollo and Gene Ontology teams. Suzanna E. Lewis (PI).
 • § Christine G. Elsik (PI). University of Missouri.
 • * Ian Holmes (PI). University of California Berkeley.
 • Arthropod genomics community: i5K Steering Committee (esp. Sue Brown (Kansas State)), Alexie Papanicolaou (UWS), and the Honey Bee Genome Sequencing Consortium.
 • Stephen Ficklin, GenSAS, Washington State University
 • Apollo is supported by NIH grants 5R01GM080203 from NIGMS, and 5R01HG004483 from NHGRI. Both projects are also supported by the Director, Office of Science, Office of Basic Energy Sciences, of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231.
 • For your attention, thank you!
 BBOP
 Apollo: Nathan Dunn, Colin Diesh §, Deepak Unni §
 Gene Ontology: Chris Mungall, Seth Carbon, Heiko Dietze
 JBrowse: Eric Yao *
 NAL at USDA: Monica Poelchau, Christopher Childers, Gary Moore
 HGSC at BCM: fringy Richards, Kim Worley
 Apollo: http://GenomeArchitect.org
 GO: http://GeneOntology.org
 i5K: http://arthropodgenomes.org/wiki/i5K
