Will the correct ORFs please stand up?
tribulations of using genome data for
deep phylogeny of BACE1/2
Christopher Southan, Deanery of Biomedical Sciences
and TW2Informatics, Göteborg, Sweden
Presentation for the Barton Group, University of Dundee, Dec 2019
2
Outline
• BACE1 and BACE2 are protease targets for Alzheimer's and diabetes,
respectively, but their validation is now questioned
• Phylogenetic analysis can add functional insights
• This came up against two key problems
• A surprising prevalence of incorrect protein sequences predicted from genomes
• Many BACE1 and BACE2 orthologues had truncation and/or indel errors.
• Key phylogenetic representative genomes are languishing in an unfinished state
• Some options for amelioration of these problems will be described
• An update on the evolution of these enzymes will be shown
3
Context
4
BACE1 as a drug target
5
BACE1 normal function
Suggested normal involvement of the cleaved APP (P05067) ectodomain in nerve cells and
Aβ peptide dampening of neuronal hyperactivity
Voltage-gated sodium channel subunits (SCN4B, O60939) are substrates for regulation
of Nav1 channel metabolism
Neuregulin (NRG1, Q02297) is a substrate for control of nerve cell myelination
Amyloid-like protein 2 (APLP2, Q06481) is a substrate for ectodomain fragments
BACE2 paralogue has 50% sequence identity
6
Alzheimer’s secretases as drug targets
7
GtoPdb BACE1/2 inhibitor curation
BACE1
24 leads
BACE2
12 leads
8
The bad news in 2018
• There are 8741 compounds in ChEMBL (compared to 8363 for thrombin)
• 342 structures in PDB
• Most large pharma companies had programs
• But three Phase III candidates have failed, some did even worse than placebo
• De facto target de-validation?
• Sheds doubt on causality of amyloid
9
Ensembl: looking at the gift horse
10
Ensembl BACE2 paralogues ranked by % ID vs
human BACE2: ~ 50% erroneous ORFs
11
Ensembl BACE2 paralogues (II)
12
Poor old Lesser Hedgehog – lesser BACE2?
13
Poor old Horse – too much BACE2?
14
Danio rerio BACE2
• After 2.5 years of assembly curation, the 2017 zebrafish reference genome
(and UniProt) still has a clipped 372 aa BACE2
• NCBI has a 503 aa version
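A length comparison is one simple way to surface clipped ORFs like this one. The sketch below is illustrative only: the function name, the dictionary keys, and the 15% tolerance are assumptions, not part of the original analysis. Only the two lengths (372 and 503) come from the slide.

```python
def flag_length_outliers(lengths, reference_id, tolerance=0.15):
    """Flag sequences whose length deviates from a reference by more than `tolerance`.

    `lengths` maps a sequence label to its length in residues.
    The 15% default tolerance is an arbitrary, hypothetical cut-off.
    """
    reference_length = lengths[reference_id]
    flags = {}
    for label, length in lengths.items():
        ratio = length / reference_length
        if ratio < 1 - tolerance:
            flags[label] = "possibly truncated"
        elif ratio > 1 + tolerance:
            flags[label] = "possibly extended"
        else:
            flags[label] = "ok"
    return flags


# Lengths taken from the slide; the labels are made up for illustration.
zebrafish_bace2 = {"ncbi_version": 503, "reference_genome_version": 372}
flags = flag_length_outliers(zebrafish_bace2, reference_id="ncbi_version")
print(flags)
```

A length filter only catches truncations and large insertions. Internal indels of a few residues need an alignment-based check.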
15
TreeFam: A different way of looking at the bad news for the BACE2 ORFs
16
BACE evolution
17
BACE in Chordates (I)
18
BACE in Chordates (II)
• Amphioxus outgroup consistent with protochordate 2R whole-genome
duplication followed by paralog persistence
• Distinctly accelerated evolution of BACE2 (neofunctionalisation ?)
• Inferred neuronal role for BACE1 but no data outside human, mouse, fish
• Long branches are partial sequences
• Birds group with turtles
• Coelacanth groups with reptiles and tetrapods (not ray-finned fish)
• Xenopus laevis tetraploidisation has maintained “double” paralogs
• No evidence for pseudogenes or “dead” variants
• Implication of common origin between nerve and pancreatic cells
19
Finding the UrBACE:
Human BACE1 vs Monosiga (Choanoflagellate)
• 33% identity over 432 residues, 9% gaps
• Divergence time ~0.8 billion years
• Two paralogues?
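Figures such as "33% identity over 432 residues, 9% gaps" are typically read off a pairwise alignment. As a minimal sketch, assuming the BLAST-style convention of dividing by the number of alignment columns (gap columns included), the calculation looks like this. The two aligned strings are toy sequences, not the real BACE1 or Monosiga proteins, and the function name is hypothetical.

```python
def alignment_stats(aligned_a, aligned_b, gap_char="-"):
    """Return alignment length, percent identity and percent gaps.

    Both inputs must be rows of the same pairwise alignment (equal length).
    Percentages use the number of alignment columns as the denominator.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    # Ignore columns where both rows are gaps (possible in rows cut from an MSA).
    columns = [(x, y) for x, y in zip(aligned_a, aligned_b)
               if not (x == gap_char and y == gap_char)]
    length = len(columns)
    identical = sum(1 for x, y in columns if x == y)
    gaps = sum(1 for x, y in columns if gap_char in (x, y))
    return {
        "length": length,
        "identity_pct": 100.0 * identical / length,
        "gap_pct": 100.0 * gaps / length,
    }


# Toy alignment for illustration only.
stats = alignment_stats("MAQALPW-LLLW", "MAGALPWALL-W")
print(stats)
```

Other tools define identity over the shorter sequence or over ungapped columns only, so percentages from different sources are not always directly comparable.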
20
But Monosiga is still a draft shotgun sequence from 2008
It has two paralogues that we cannot position on an assembly
21
The placozoan: version 1 for 13 years
Three paralogues are not complete and cannot be positioned or nearest
22
UrBACEs and cathepsin paralogs
23
Ur-BACE evolutionary trajectory
• After duplication from a cathepsin ancestor the emergence of the
UrBACE protein sequence is distinct
• Cathepsin paralogs form a clear outgroup
• “Shuffling-in” of the CTM is the defining post-duplication shift in cellular
location and function
• Multiple divergent paralogs in basal phyla (2 in Monosiga, 3 in
Trichoplax) predate the nervous system
• Found in cnidaria with only nerve nets
• Long branch lengths mainly due to partial sequences
• Major orders now represented by draft genome assemblies
• Basal relationships of Eumetazoans still unresolved
• Beyond limited EST coverage no tissue distribution or functional data
24
Oily tails: key to secretase function
25
Invertebrate phyla without BACE homologues
Non-orthologous replacement of UrBACE for pre-neuronal RIP-related secretase functions?
26
Evolution updates
27
Fixing dodgy ORFs
o BLASTP on both sides of the Atlantic
o TBLASTN against genome reads, ESTs, and the HTS division
o Run gene prediction on contigs (e.g. GeneMark etc.)
o Stitch the bits together (by hand)
o Iterative repeat searching with those bits
o Run InterPro scan
o If you are still left with partials, they can still slot into trees in informative
positions
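The stitching step above was done by hand in the talk. As a rough illustration of the idea only, the sketch below joins two partial peptide fragments when the end of one exactly matches the start of the other. The function name, the minimum overlap of 5 residues, and the example fragments are all hypothetical. Real fragments from TBLASTN hits or gene predictions usually need alignment-aware merging and manual inspection.

```python
def stitch_fragments(left, right, min_overlap=5):
    """Merge two peptide fragments on their longest exact suffix/prefix overlap.

    Returns the merged sequence, or None if no overlap of at least
    `min_overlap` residues is found. Exact matching is a simplification:
    sequencing errors or polymorphisms in the overlap will defeat it.
    """
    longest_possible = min(len(left), len(right))
    for k in range(longest_possible, min_overlap - 1, -1):
        if left[-k:] == right[:k]:
            return left + right[k:]
    return None


# Made-up fragments sharing the six-residue overlap "QISFVK".
n_terminal_part = "MKTAYIAKQRQISFVK"
c_terminal_part = "QISFVKSHFSRQLEER"
merged = stitch_fragments(n_terminal_part, c_terminal_part)
print(merged)
```

When no overlap is found the function returns None, which matches the last bullet: fragments that cannot be joined can still be placed in trees as partials.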
28
Latest NCBI results: new creatures
29
New Ur-BACE
30
Rounding off
31
Summarising the ORF problem
o Protein analysis is difficult when ORFs are not correct
o Ensembl gene builds have significant levels of dodgy sequences
o These include many error types
o Searching BACE homologues at EBI or NCBI gives different results from
different pipelines (e.g. Ensembl, JGI and NCBI)
o Many eukaryotic genomes remain in the draft state for a decade or more
o Coordinating deep transcript coverage and a genome assembly seems rare
o Seems paradoxical now we can sequence a genome a day
o This is particularly problematic for species in key phylogenetic positions
32
Plans and questions
o Despite problems, make an update
o Crank up Jalview :)
o What is the Ur-BACE function in early metazoan?
o Could this be relevant to contemporary mammalian function?
o Could we identify the non-homologous replacement in the Ecdysozoans?
o Can we encourage anyone to try BACE1/2-specific inhibitors for
functional probing in model organisms in lower phyla (e.g. zebrafish)?
o Do BACE1 inhibitors have any clinical future?
33
Questions and endpiece
o https://sites.google.com/view/tw2informatics/home homepage
o https://europepmc.org/search?query=AUTH:%22Southan+C%22 publications
