Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

Introduction to Apollo for i5k


Published on

Apollo is a web-based application that supports and enables collaborative genome curation in real time, allowing teams of curators to improve on existing automated gene models through an intuitive interface. Apollo allows researchers to break down large amounts of data into manageable portions to mobilize groups of researchers with shared interests.

The i5K, an initiative to sequence the genomes of 5,000 insect and related arthropod species, is a broad and inclusive effort that seeks to involve scientists from around the world in their genome curation process, and Apollo is serving as the platform to empower this community.

This presentation is an introduction to Apollo for the members of the i5K Pilot Project working on species of the order Hemiptera.

Published in: Science
  • Be the first to comment

Introduction to Apollo for i5k

  1. 1. Introduction to Apollo Collaborative genome annotation editing A webinar for the i5K Research Community - Hemiptera Monica Munoz-Torres | @monimunozto Berkeley Bioinformatics Open-Source Projects (BBOP) Environmental Genomics & Systems Biology Division, Lawrence Berkeley National Laboratory i5k Pilot Project Species Calls | 9 February, 2016
  2. 2. Outline •  Today you will discover effective ways to extract valuable information about a genome through curation efforts. Apollo  Collabora've  Cura'on  and     Interac've  Analysis  of  Genomes  
  3. 3. After this talk you will... •  Better understand ‘curation’ in the context of genome annotation: assembled genome à automated annotation à manual annotation •  Become familiar with Apollo’s environment and functionality. •  Learn to identify homologs of known genes of interest in your newly sequenced genome. •  Learn how to corroborate and modify automatically annotated gene models using all available evidence in Apollo.
  4. 4. Experimental design, sampling. Comparative analyses Official / Merged Gene Set Manual Annotation Automated Annotation Sequencing Assembly Synthesis & dissemination. This is our focus.
  5. 5. We must care about curation Marbach et al. 2011. Nature Methods | | Alexander Wild The gene set of an organism informs a variety of studies: •  Characterization: Gene number, GC%, TEs, repeats. •  Functional assignments. •  Molecular evolution, sequence conservation. •  Gene families. •  Metabolic pathways. •  What makes an organism what it is? What makes a bee a “bee”?
  6. 6. Genome Curation Identifies elements that best represent the underlying biology and eliminates elements that reflect systemic errors of automated analyses. Assigns function through comparative analysis of similar genome elements from closely related species using literature, databases, and experimental data. Apollo Gene Ontology Resources
  7. 7. A few things to remember
 when conducting manual annotation To  remember…  Biological  concepts  to  be;er   understand  manual  annota'on   7BIO-REFRESHER •  KEEP  A  GLOSSARY  HANDY     from  con$g  to  splice  site     •  WHAT  IS  A  GENE?   defining  your  goal   •  TRANSCRIPTION   mRNA  in  detail     •  TRANSLATION   reading  frames,  etc.   •  GENOME  CURATION   steps  involved  
  8. 8. The gene: a “moving target” “The gene is a union of genomic sequences encoding a coherent set of potentially overlapping functional products.” Gerstein et al., 2007. Genome Res
  9. 9. 9 "Gene structure" by Daycd- Wikimedia Commons BIO-REFRESHER mRNA •  Although of brief existence, understanding mRNAs is crucial, as they will become the center of your work.
  10. 10. 10BIO-REFRESHER Reading frames v  In eukaryotes, only one reading frame per section of DNA is biologically relevant at a time: it has the potential to be transcribed into RNA and translated into protein. This is called the OPEN READING FRAME (ORF) •  ORF = Start signal + coding sequence (divisible by 3) + Stop signal
  11. 11. 11BIO-REFRESHER Splice sites v  The spliceosome catalyzes the removal of introns and the ligation of flanking exons. v  Splicing signals (from the point of view of an intron): •  One splice signal (site) on the 5’ end: usually GT (less common: GC) •  And a 3’ end splice site: usually AG •  Canonical splice sites look like this: …]5’-GT/AG-3’[…
  12. 12. 12BIO-REFRESHER Exons and Introns v  Introns can interrupt the reading frame of a gene by inserting a sequence between two consecutive codons v  Between the first and second nucleotide of a codon v  Or between the second and third nucleotide of a codon "Exon and Intron classes”. Licensed under Fair use via Wikipedia
  13. 13. Predic'on  &  Annota'on  
  14. 14. 14GENE PREDICTION & ANNOTATION PREDICTION & ANNOTATION v  Iden'fica'on  and  annota'on  of  genome  features:     •  primarily  focuses  on  protein-­‐coding  genes.     •  also  iden'fies  RNAs  (tRNA,  rRNA,  long  and  small  non-­‐coding   RNAs  (ncRNA)),  regulatory  mo'fs,  repe''ve  elements,  etc.     •  happens  in  2  phases:   1.  Computa'on  phase     2.  Annota'on  phase  
  15. 15. 15GENE PREDICTION & ANNOTATION COMPUTATION PHASE a.  Experimental  data  are  aligned  to  the  genome:  expressed  sequence  tags,   RNA-­‐sequencing  reads,  proteins  (also  from  other  species).             b.  Gene  predic;ons  are  generated:      -­‐  ab  ini$o:  based  on  nucleo'de  sequence  and  composi'on    e.g.  Augustus,  GENSCAN,  geneid,  fgenesh,  etc.    -­‐  evidence-­‐driven:  iden'fying  also  domains  and  mo'fs    e.g.  SGP2,  JAMg,  fgenesh++,  etc.       Result:  the  single  most  likely  coding  sequence,  no  UTRs,  no  isoforms.   Yandell & Ence. Nature Rev 2012 doi:10.1038/nrg3174
  16. 16. 16GENE PREDICTION & ANNOTATION ANNOTATION PHASE Experimental  data  (evidence)  and  predic'ons  are  synthe'zed  into  gene   annota'ons.     Result:  gene  models  that  generally  include  UTRs,  isoforms,  evidence  trails.   Yandell & Ence. Nature Rev 2012 doi:10.1038/nrg3174 5’  UTR   3’  UTR  
  17. 17. 17 In  some  cases  algorithms  and  metrics  used  to  generate   consensus  sets  may  actually  reduce  the  accuracy  of  the  gene’s   representa'on.   CONSENSUS GENE SETS Gene  models  may  be  organized  into  sets  using:   v  combiners  for  automa'c  integra'on  of  predicted  sets     e.g:  GLEAN,  EvidenceModeler   or   v  tools  packaged  into  pipelines   e.g:  MAKER,  PASA,  Gnomon,  Ensembl,  etc.   GENE PREDICTION & ANNOTATION
  18. 18. ANNOTATION
 needs some refinement No one is perfect, least of all automated annotation. 18 New  technologies  bring  new  challenges:     •  Assembly  errors  can  cause  fragmented   annota'ons   •  Limited  coverage  makes  precise   iden'fica'on  a  difficult  task   Image:
 improving predictions Precise  elucida;on  of  biological  features   encoded  in  the  genome  requires  careful   examina;on  and  review.     Schiex  et  al.  Nucleic  Acids  2003  (31)  13:  3738-­‐3741   Automated Predictions Experimental Evidence Manual Annotation – to the rescue. 19 cDNAs,  HMM  domain  searches,  RNAseq,   genes  from  other  species.  
 an inherently collaborative task GENE PREDICTION & ANNOTATION 20 So  many  sequences,  not  enough  hands.   Apis  mellifera  |  Alexander  Wild  |  
  21. 21. We have provided continuous training and support for hundreds of geographically dispersed scientists to conduct manual annotations efforts in order to recover coding sequences in agreement with all available biological evidence. 21 Lessons learned APOLLO •  Collaborative work distills invaluable knowledge. •  A little training goes a long way! Wet lab scientists can easily learn to maximize the generation of accurate, biologically supported gene models.
  22. 22. Apollo  
  23. 23. APOLLO: versatile genome annotation editing •  Apollo is a web-based genome annotation editor, integrated with JBrowse •  Supports real time collaboration & generates analysis-ready data USER-CREATED ANNOTATIONS EVIDENCE TRACKS ANNOTATOR PANEL
  24. 24. BECOMING ACQUAINTED WITH APOLLO 24 General process of curation 1.  Select  or  find  a  region  of  interest,  e.g.  scaffold.   2.  Select  appropriate  evidence  tracks  to  review  the  gene  model.   3.  Determine  whether  a  feature  in  an  exis'ng  evidence  track   will  provide  a  reasonable  gene  model  to  start  working.   4.  If  necessary,  adjust  the  gene  model.   5.  Check  your  edited  gene  model  for  integrity  and  accuracy  by   comparing  it  with  available  homologs.   6.  Comment  and  finish.  
  25. 25. Apollo - version at i5K Workspace@NAL 254. Becoming Acquainted with Web Apollo. 25 The  Sequence  Selec'on  Window  
  26. 26. Sort Apollo - version at i5K Workspace@NAL 26 “Old  Track  Select  Page”   4. Becoming Acquainted with Web Apollo. 26
  27. 27. 27 APOLLO
 annotation editing environment BECOMING ACQUAINTED WITH APOLLO Color  by  CDS  frame,   toggle  strands,  set  color   scheme  and  highlights.   -­‐  Upload  evidence  files   (GFF3,  BAM,  BigWig),   -­‐  combina;on  track     -­‐  sequence  search  track   Query  the  genome  using   BLAT.   Naviga'on  and  zoom.   Search  for  a  gene   model  or  a  scaffold.   Get  coordinates  and  “rubber   band”  selec'on  for  zooming.   Login   User-­‐created   annota'ons.   New   annotator   panel.   Evidence   Tracks   Stage  and   cell-­‐type   specific   transcrip'on   data.    h;p://    
  28. 28. 28 | 28 BECOMING ACQUAINTED WITH APOLLO USER NAVIGATION Annotator   panel.   •  Choose  appropriate  evidence  from  list  of  “Tracks”  on  annotator  panel.       •  Select  &  drag  elements  from  evidence  track  into  the  ‘User-­‐created  Annota$ons’  area.     •  Hovering  over  annota'on  in  progress  brings  up  an  informa'on  pop-­‐up.   •  Crea'ng  a  new  annota'on  
  29. 29. Adding a gene model
  30. 30. Adding a gene model
  31. 31. Adding a gene model
  32. 32. Editing functionality
  33. 33. Editing functionality Example: Adding an exon supported by experimental data •  RNAseq reads show evidence in support of a transcribed product that was not predicted. •  Add exon by dragging up one of the RNAseq reads.
  34. 34. Editing functionality Example: Adjusting exon boundaries supported by experimental data
  35. 35. Cura'ng  with  Apollo  
  36. 36. 36 | 36 USER NAVIGATION BECOMING ACQUAINTED WITH APOLLO •  ‘Zoom  to  base  level’  reveals  the  DNA  Track.  
  37. 37. 37 | 37 USER NAVIGATION BECOMING ACQUAINTED WITH APOLLO •  Color  exons  by  CDS  from  the  ‘View’  menu.  
  38. 38. 38 | Zoom  in/out  with  keyboard:   shio  +  arrow  keys  up/down   38 USER NAVIGATION BECOMING ACQUAINTED WITH APOLLO •  Toggle  reference  DNA  sequence  and  transla;on  frames  in  forward   strand.  Toggle  models  in  either  direc'on.  
  39. 39. annota'ng  simple  cases  
  40. 40. “Simple  case”:      -­‐  the  predicted  gene  model  is  correct  or  nearly  correct,  and      -­‐  this  model  is  supported  by  evidence  that  completely  or  mostly   agrees  with  the  predic'on.      -­‐  evidence  that  extends  beyond  the  predicted  model  is  assumed   to  be  non-­‐coding  sequence.       The  following  are  simple  modifica'ons.       40 ANNOTATING SIMPLE CASES BECOMING ACQUAINTED WITH APOLLO SIMPLE CASES
  41. 41. •  A  confirma'on  box  will  warn  you  if  the  receiving  transcript  is  not  on  the   same  strand  as  the  feature  where  the  new  exon  originated.     •  Check  ‘Start’  and  ‘Stop’  signals  aoer  each  edit.   41 ADDING EXONS BECOMING ACQUAINTED WITH APOLLO SIMPLE CASES
  42. 42. If  transcript  alignment  data  are  available  &  extend  beyond  your  original  annota'on,     you  may  extend  or  add  UTRs.     1.  Right  click  at  the  exon  edge  and  ‘Zoom  to  base  level’.     2.  Place  the  cursor  over  the  edge  of  the  exon  un$l  it  becomes  a  black  arrow  then  click   and  drag  the  edge  of  the  exon  to  the  new  coordinate  posi'on  that  includes  the  UTR.     42 ADDING UTRs To  add  a  new  spliced  UTR  to  an  exis'ng     annota'on  also  follow  the  procedure  for  adding  an  exon.   BECOMING ACQUAINTED WITH APOLLO SIMPLE CASES
  43. 43. To  modify  an  exon  boundary  and  match   data   in   the   evidence   tracks:   select   both   the   [offending]   exon   and   the   feature  with  the  expected  boundary,   then  right  click  on  the  annota'on  to   select  ‘Set  3’  end’  or  ‘Set  5’  end’  as   appropriate.     In  some  cases  all  the  data  may  disagree  with  the  annota'on,  in   other  cases  some  data  support  the  annota'on  and  some  of  the   data  support  one  or  more  alterna've  transcripts.  Try  to  annotate   as  many  alterna've  transcripts  as  are  well  supported  by  the  data.   43 MATCHING EXON BOUNDARY TO EVIDENCE BECOMING ACQUAINTED WITH APOLLO SIMPLE CASES
  44. 44. Non-­‐canonical  splice  sites  flags.   Double  click:  selec'on  of   feature  and  sub-­‐features   Evidence  Tracks  Area   ‘User-­‐created  Annota$ons’  Track   Edge-­‐matching   Apollo’s  edi'ng  logic  (brain):     §  selects  longest  ORF  as  CDS   §  flags  non-­‐canonical  splice  sites   44 ORFs AND SPLICE SITES BECOMING ACQUAINTED WITH APOLLO SIMPLE CASES
  45. 45. Non-­‐canonical  splices  are  indicated  by   an   orange   circle   with   a   white   exclama'on  point  inside,  placed  over   the  edge  of  the  offending  exon.     Canonical  splice  sites:   3’-­‐…exon]GA  /  TG[exon…-­‐5’   5’-­‐…exon]GT  /  AG[exon…-­‐3’   reverse  strand,  not  reverse-­‐complemented:   forward  strand   45 SPLICE SITES Zoom  to  review  non-­‐canonical   splice  site  warnings.  Although   these  may  not  always  have  to  be   corrected  (e.g  GC  donor),  they   should  be  flagged  with  a   comment.     Exon/intron  splice  site  error  warning   Curated  model   BECOMING ACQUAINTED WITH APOLLO SIMPLE CASES
  46. 46. Apollo  calculates  the  longest  possible  open  reading   frame  (ORF)  that  includes  canonical  ‘Start’  and   ‘Stop’  signals  within  the  predicted  exons.     If  ‘Start’  appears  to  be  incorrect,  modify  it  by  selec'ng   an  in-­‐frame  ‘Start’  codon  further  up  or   downstream,  depending  on  evidence  (proteins,   RNAseq).       It  may  be  present  outside  the  predicted  gene   model,  within  a  region  supported  by  another   evidence  track.     In  very  rare  cases,  the  actual  ‘Start’  codon  may  be   non-­‐canonical  (non-­‐ATG).     46 ‘Start’ AND ‘Stop’ SITES BECOMING ACQUAINTED WITH APOLLO SIMPLE CASES
  47. 47. 1.  Two  exons  from  different  tracks  sharing  the  same  start/end  coordinates   display  a  red  bar  to  indicate  matching  edges.   2.  Selec'ng  the  whole  annota'on  or  one  exon  at  a  'me,  use  this  edge-­‐ matching  func'on  and  scroll  along  the  length  of  the  annota'on,   verifying  exon  boundaries  against  available  data.     Use  square  [  ]  brackets  to  scroll  from  exon  to  exon.   User  curly  {  }  brackets  to  scroll  from  annota'on  to  annota'on.   3.  Check  if  cDNA  /  RNAseq  reads  lack  one  or  more  of  the  annotated  exons   or  include  addi'onal  exons.     47 CHECKING EXON INTEGRITY BECOMING ACQUAINTED WITH APOLLO SIMPLE CASES
  48. 48. annota'ng  complex  cases  
  49. 49. Evidence  may  support  joining  two  or  more  different  gene  models.     Warning:  protein  alignments  may  have  incorrect  splice  sites  and  lack  non-­‐conserved  regions!     1.  In  ‘User-­‐created  Annota<ons’  area  shio-­‐click  to  select  an  intron  from  each  gene  model  and   right  click  to  select  the  ‘Merge’  op'on  from  the  menu.     2.  Drag  suppor'ng  evidence  tracks  over  the  candidate  models  to  corroborate  overlap,  or   review  edge  matching  and  coverage  across  models.   3.  Check  the  resul'ng  transla'on  by  querying  a  protein  database  e.g.  UniProt,  NCBI  nr.  Add   comments  to  record  that  this  annota'on  is  the  result  of  a  merge.   49 Red  lines  around  exons:   ‘edge-­‐matching’  allows  annotators  to  confirm  whether  the   evidence  is  in  agreement  without  examining  each  exon  at  the   base  level.   COMPLEX CASES merge two gene predictions on the same scaffold BECOMING ACQUAINTED WITH APOLLO COMPLEX CASES
  50. 50. One  or  more  splits  may  be  recommended  when:     -­‐  different  segments  of  the  predicted  protein  align  to  two  or  more  different   gene  families     -­‐  predicted  protein  doesn’t  align  to  known  proteins  over  its  en're  length   -­‐  Transcript  data  may  support  a  split,  but  first  verify  whether  they  are   alterna've  transcripts.     50 COMPLEX CASES split a gene prediction BECOMING ACQUAINTED WITH APOLLO COMPLEX CASES
  51. 51. DNA  Track   ‘User-­‐created  Annota;ons’  Track   51 COMPLEX CASES annotate frameshifts and correct single-base errors Always  remember:  when  annota'ng  gene  models  using  Apollo,  you  are  looking  at  a  ‘frozen’  version  of   the  genome  assembly  and  you  will  not  be  able  to  modify  the  assembly  itself.   BECOMING ACQUAINTED WITH APOLLO COMPLEX CASES
  52. 52. 52 COMPLEX CASES correcting selenocysteine containing proteins BECOMING ACQUAINTED WITH APOLLO COMPLEX CASES
  53. 53. 53 COMPLEX CASES correcting selenocysteine containing proteins BECOMING ACQUAINTED WITH APOLLO COMPLEX CASES
  54. 54. 1.  Apollo  allows  annotators  to  make  single  base  modifica'ons  or  frameshios  that  are  reflected  in   the  sequence  and  structure  of  any  transcripts  overlapping  the  modifica'on.  These   manipula'ons  do  NOT  change  the  underlying  genomic  sequence.     2.  If  you  determine  that  you  need  to  make  one  of  these  changes,  zoom  in  to  the  nucleo'de  level   and  right  click  over  a  single  nucleo'de  on  the  genomic  sequence  to  access  a  menu  that   provides  op'ons  for  crea'ng  inser'ons,  dele'ons  or  subs'tu'ons.     3.  The  ‘Create  Genomic  Inser<on’  feature  will  require  you  to  enter  the  necessary  string  of   nucleo'de  residues  that  will  be  inserted  to  the  right  of  the  cursor’s  current  loca'on.  The   ‘Create  Genomic  Dele<on’  op'on  will  require  you  to  enter  the  length  of  the  dele'on,  star'ng   with  the  nucleo'de  where  the  cursor  is  posi'oned.  The  ‘Create  Genomic  Subs<tu<on’  feature   asks  for  the  string  of  nucleo'de  residues  that  will  replace  the  ones  on  the  DNA  track.   4.  Once  you  have  entered  the  modifica'ons,  Apollo  will  recalculate  the  corrected  transcript  and   protein  sequences,  which  will  appear  when  you  use  the  right-­‐click  menu  ‘Get  Sequence’   op'on.  Since  the  underlying  genomic  sequence  is  reflected  in  all  annota'ons  that  include  the   modified  region  you  should  alert  the  curators  of  your  organisms  database  using  the   ‘Comments’  sec'on  to  report  the  CDS  edits.     5.  In  special  cases  such  as  selenocysteine  containing  proteins  (read-­‐throughs),  right-­‐click  over  the   offending/premature  ‘Stop’  signal  and  choose  the  ‘Set  readthrough  stop  codon’  op'on  from   the  menu.     54 COMPLEX CASES annotating frameshifts and correcting single-base errors & selenocysteines BECOMING ACQUAINTED WITH APOLLO COMPLEX CASES
  56. 56. 56 The  Annota'on  Informa;on  Editor   56 USER NAVIGATION BECOMING ACQUAINTED WITH APOLLO
  57. 57. 57 The  Annota'on  Informa;on  Editor   •  Add  PubMed  IDs   •  Include  GO  terms  as  appropriate   from  any  of  the  three  ontologies   •  Write  comments  sta'ng  how  you   have  validated  each  model.   57 USER NAVIGATION BECOMING ACQUAINTED WITH APOLLO
  58. 58. 58 | 58 USER NAVIGATION BECOMING ACQUAINTED WITH APOLLO •  Keeping track of each edit
  59. 59. 59 Annota'ons,  annota'on  edits,  and  History:  stored  in  a  centralized  database.   59 USER NAVIGATION BECOMING ACQUAINTED WITH APOLLO
  60. 60. Follow  the  checklist  un'l  you  are  happy  with  the  annota'on!   And  remember  to…   –  comment  to  validate  your  annota'on,  even  if  you  made  no  changes  to  an   exis'ng  model.  Think  of  comments  as  your  vote  of  confidence.     –  or  add  a  comment  to  inform  the  community  of  unresolved  issues  you   think  this  model  may  have.   60 | 60 Always  Remember:  Apollo  cura'on  is  a  community  effort  so  please   use  comments  to  communicate  the  reasons  for  your     annota'on.  Your  comments  will  be  visible  to  everyone.   COMPLETING THE ANNOTATION BECOMING ACQUAINTED WITH APOLLO
  61. 61. Checklist  
  62. 62. •  Check  ‘Start’  and  ‘Stop’  sites.   •  Check    splice  sites:  most  splice  sites  display   these  residues  …]5’-­‐GT/AG-­‐3’[…   •  Check  if  you  can  annotate  UTRs,  for  example   using  RNA-­‐Seq  data:   – align  it  against  relevant  genes/gene  family   – blastp  against  NCBI’s  RefSeq  or  nr   •  Check  for  gaps  in  the  genome.   •  Addi'onal  func'onality  may  be  necessary:   – merging  2  gene  predic'ons  -­‐  same  scaffold   – merging  2  gene  predic'ons  -­‐  different   scaffolds     – spli`ng  a  gene  predic'on   – annota'ng  frameshias   – annota'ng  selenocysteines,  correc'ng   single-­‐base  and  other  assembly  errors,  etc.   62 | 62 •  Add:   –  Important  project  informa'on  in  the  form  of   comments   –  IDs  from  public  databases  e.g.  GenBank  (via   DBXRef),  gene  symbol(s),  common  name(s),   synonyms,  top  BLAST  hits,  orthologs  with   species  names,  and  everything  else  you  can   think  of,  because  you  are  the  expert.   –  Comments  about  the  kinds  of  changes  you   made  to  the  gene  model  of  interest,  if  any.     –  Any  appropriate  func'onal  assignments,  e.g.   via  BLAST,  RNA-­‐Seq  data,  literature  searches,   etc.   CHECKLIST for accuracy and integrity MANUAL ANNOTATION CHECKLIST
  63. 63. Genome  cura'on  with  i5k  
  64. 64. 64i5K Workspace@NAL The collaborative curation process at i5k 1.  A  computa'onally  predicted  consensus  gene  set  has  been  generated   using  mul'ple  lines  of  evidence;  e.g.  HVIT_v0.5.3-­‐Models     2.  i5K  Projects  will  integrate  consensus  computa'onal  predic'ons  with   manual  annota'ons  to  produce  an  updated  Official  Gene  Set  (OGS):   Achtung!   •  If  it’s  not  on  either  track,  it  won’t  make  the  OGS!   •  If  it’s  there  and  it  shouldn’t,  it  will  s'll  make  the  OGS!  
  65. 65. 65 The ‘Replace Models’ rules 65 BECOMING ACQUAINTED WITH APOLLO
  66. 66. 66i5K Workspace@NAL 3.  In  some  cases  algorithms  and  metrics  used  to  generate  consensus  sets   may  actually  reduce  the  accuracy  of  the  gene’s  representa'on.  Use  your   judgment,  try  choosing  a  different  model  to  begin  the  annota'on.   4.  Isoforms:  drag  original  and  alterna'vely  spliced  form  to  ‘User-­‐created   Annota<ons’  area.   5.  If  an  annota'on  needs  to  be  removed  from  the  consensus  set,  drag  it  to   the  ‘User-­‐created  Annota<ons’  area  and  label  as  ‘Delete’  on  the   Informa$on  Editor.   6.  Overlapping  interests?  Collaborate  to  reach  agreement.   7.  Follow  guidelines  for  i5K  Pilot  Species  Projects,  at  h;p://   The collaborative curation process at i5k
  67. 67. Example  
  68. 68. What’s new?... 
 finding inspiration in PubMed. Example 68 “Molecular analysis of bed bug populations from across the USA and Europe found that >80% and >95% of the respective populations contained V419L and/ or L925I mutations in the voltage-gated sodium channel gene, indicating widespread distribution of target-site-based pyrethroid resistance.” Homalodisca vitripennis | Alexander Wild | www.alexanderwild.comHalyomorpha halys | Fondazione Edmund Mach - Italy Now for our species of interest. . .
  69. 69. Example Example 69  Cura'on  example  using  the  Hyalella  azteca   genome  (amphipod  crustacean).  
  70. 70. What do we know about this genome? •  Currently  publicly  available  data  at  NCBI:   •  >37,000    nucleo'de  seqsà  scaffolds,  mitochondrial  genes   •  344    amino  acid  seqsà  mitochondrion   •  47    ESTs   •  0      conserved  domains  iden'fied   •  0    “gene”  entries  submi;ed     •  Data  at  i5K  Workspace@NAL  (annota'on  hosted  at  USDA)     -­‐  10,832  scaffolds:  23,288  transcripts:  12,906  proteins   Example 70
  71. 71. PubMed Search: 
 what’s new? Example 71
  72. 72. PubMed Search: what’s new? Example 72 “Ten  popula'ons  (3  cultures,  7  from  California  water   bodies)  differed  by  at  least  550-­‐fold  in  sensi;vity  to   pyrethroids.”     “By  sequencing  the  primary  pyrethroid  target  site,  the   voltage-­‐gated  sodium  channel  (vgsc),  we  show  that   point  muta'ons  and  their  spread  in  natural  popula'ons   were  responsible  for  differences  in  pyrethroid   sensi'vity.”   “The  finding  that  a  non-­‐target  aqua'c  species  has   acquired  resistance  to  pes'cides  used  only  on  terrestrial   pests  is  troubling  evidence  of  the  impact  of  chronic   pes;cide  transport  from  land-­‐based  applica'ons  into   aqua'c  systems.”  
  73. 73. How many sequences are there, publicly available, for our gene of interest? Example 73 •  Para,  (voltage-­‐gated  sodium  channel  alpha   subunit;  Nasonia  vitripennis).     •  NaCP60E  (Sodium  channel  protein  60  E;  D.   melanogaster).   –  MF:  voltage-­‐gated  ca'on  channel  ac'vity   (IDA,  GO:0022843).   –  BP:  olfactory  behavior  (IMP,  GO: 0042048),  sodium  ion  transmembrane   transport  (ISS,GO:0035725).   –  CC:  voltage-­‐gated  sodium  channel   complex  (IEA,  GO:0001518).   And  what  do  we  know  about  them?  
  74. 74. Retrieving sequences for a 
  75. 75. BLAT search
  76. 76. BLAT search
 results   Example 76 •  High-­‐scoring  segment  pairs  (hsp)   are  listed  in  tabulated  format.   •  Clicking  on  one  line  of  results   sends  you  to  those  coordinates.  
  77. 77. BLAST at i5K 
  78. 78. BLAST at i5K 
 heps://   Example 78
  79. 79. BLAST at i5K: hsps  in  “BLAST+  Results”  track   Example 79
  80. 80. Creating a new gene model: drag and drop Example 80 •  Apollo  automa'cally  calculates  longest  ORF.     •  In  this  case,  ORF  includes  the  high-­‐scoring  segment  pairs  (hsp),   marked  here  in  blue.   •  Note  that  gene  is  transcribed  from  reverse  strand.  
  81. 81. Available Tracks Example 81
  82. 82. Get Sequence Example 82
  83. 83. Also, flanking sequences (other gene models) vs. NCBI nr Example 83 In  this  case,  two  gene   models  upstream,  at  5’   end.   BLAST  hsps  
  84. 84. Review alignments Example 84 HaztTmpM006234   HaztTmpM006233   HaztTmpM006232  
  85. 85. Hypothesis for vgsc gene model Example 85
  86. 86. Editing: merge the three models Example 86 Merge  by  dropping  an   exon  or  gene  model   onto  another.   Merge  by  selec'ng   two  exons  (holding   down  “Shio”)  and   using  the  right  click   menu.   or…  
  87. 87. Result of merging the gene models: Example 87
  88. 88. Editing: correct offending splice site Example 88 Modify  exon  /  intron   boundary:     -­‐  Drag  the  end  of  the   exon  to  the  nearest   canonical  splice  site.     or     -­‐  Use  right-­‐click  menu.  
  89. 89. Editing: set translation start Example 89
  90. 90. Editing: delete exon not supported by evidence Example 90 Delete  first  exon  from   HaztTmpM006233  
  91. 91. Editing: add an exon supported by RNAseq Example 91 •  RNAseq  reads  show  evidence  in  support  of  transcribed  product,  which  was  not  predicted.   •  Add  exon  at  coordinates  97946-­‐98012  by  dragging  up  one  of  the  RNAseq  reads.  
  92. 92. Editing: adjust offending splice site using evidence Example 92
  93. 93. Editing: adjust other boundaries supported by evidence Example 93
  94. 94. Finished model Example 94 Corroborate  integrity  and  accuracy  of  the  model:     -­‐  Start  and  Stop   -­‐  Exon  structure  and  splice  sites  …]5’-­‐GT/AG-­‐3’[…   -­‐  Check  the  predicted  protein  product  vs.  NCBI  nr,  UniProt,  etc.  
  95. 95. Information Editor •  DBXRefs:  e.g.  NP_001128389.1,  N.   vitripennis,  RefSeq   •  PubMed  iden'fier:  PMID:  24065824   •  Gene  Ontology  IDs:  GO:0022843,  GO: 0042048,  GO:0035725,  GO:0001518.   •  Comments   •  Name,  Symbol   •  Approve  /  Delete  radio  bu;on   Example 95 Comments   (if  applicable)  
  96. 96. Go  play!  
 instructions At  i5K   1.  Register  for  access  to  Apollo  at  the  i5K  Workspace@NAL  at   h;ps://­‐apollo-­‐registra'on     2.  Contact  the  coordinator  for  each  species  community  to  receive   more  informa'on  about  how  to  contribute.  Contact  info  is  available   on  each  organism’s  page.    
 instructions Public  Honey  bee  demo  available  at:     h;p://       Username:     Password:   demo  
  99. 99. APOLLO
 demonstration PUBLIC DEMO 99 Demonstra'on  video  is  available  at     h;ps://  
  100. 100. OUTLINE
 Apollo  Collabora've  Cura'on  and     Interac've  Analysis  of  Genomes   100OUTLINE •  BIO-­‐REFRESHER   biological  concepts  for  cura'on   •  ANNOTATION   automa'c  predic'ons   •  MANUAL  ANNOTATION   necessary,  collabora've     •  APOLLO   advancing  collabora've  cura'on     •  EXAMPLE   demos  
  101. 101. Apollo Development Nathan Dunn Eric Yao Christine Elsik’s Lab, University of Missouri Suzi Lewis Principal Investigator BBOP Moni Munoz-Torres Colin DieshDeepak Unni JBrowse. Ian Holmes’ Lab University of California, Berkeley
  102. 102. •  Berkeley Bioinformatics Open-source Projects (BBOP), Berkeley Lab: Apollo and Gene Ontology teams. Suzanna E. Lewis (PI). •  § Christine G. Elsik (PI). University of Missouri. •  * Ian Holmes (PI). University of California Berkeley. •  Arthropod genomics community & i5K Steering Committee. •  Stephen Ficklin, GenSAS, Washington State University •  Apollo is supported by NIH grants 5R01GM080203 from NIGMS, and 5R01HG004483 from NHGRI. Also supported by the Director, Office of Science, Office of Basic Energy Sciences, of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231 •  For your attention, thank you! Apollo Nathan Dunn Colin Diesh § Deepak Unni § Gene Ontology Chris Mungall Seth Carbon Heiko Dietze BBOP Learn more about Apollo at Thank you! NAL at USDA Monica Poelchau Mei-Ju Chen Christopher Childers Gary Moore HGSC at BCM fringy Richards Kim Worley JBrowse Eric Yao *