DOMAINS AND DISORDER
TOWARDS A SUFFICIENT EVOLUTIONARY DESCRIPTION OF
PROTEIN STRUCTURE
Matt E. Oates
Department of Computer Science
University of Bristol
A dissertation submitted to the University of Bristol in accordance with the
requirements of the degree of Doctor of Philosophy (PhD) in the Faculty of
Engineering.
September 2, 2014 – version 1.2.0
Matt E. Oates: Domains and Disorder, Towards a Sufficient Evolutionary
Description of Protein Structure, © September 2nd 2014
DECLARATION
I declare that the work in this dissertation was carried out in
accordance with the Regulations of the University of Bristol. The work is
original, except where indicated by special reference in the text, and
no part of the dissertation has been submitted for any other academic
award. Any views expressed in the dissertation are those of the author.
Bristol, September 2nd 2014
Matt E. Oates
"Teaching should be such that what is offered is perceived as a valuable gift
and not as a hard duty." – Albert Einstein
Dedicated to my parents Stephen and Christine.
It is only through reflection as an adult that I understand I had an
extraordinary childhood filled with true teaching every day. This
writing is dedicated to: the days of fossil hunting, the afternoons at
the zoo recording observations, the hours under the stairs looking at
spectra, the minutes of French coming out of a home made crystal
set, the moments of discovery created from trashed record players
found in a hedgerow, and to the countless other kind acts that have
shaped my mind. Thank you.
ABSTRACT
The general title given to this thesis represents the underlying ethos
of my work that links most parts together, as well as being the
motivation I now have for future work. The main scientific concern I
present within is a more specific evolutionary theory on what has
happened in land plants to a well known Calcium cell-signalling pathway
found in mammals, namely the Inositol triphosphate mediated Calcium
release of ITPR channels in mammalian neural and muscle cells. This
is discussed at length in Part ii, Chapters 2-3 of this thesis. Chapter
3 contains a detailed discussion surrounding an already known and
characterised Calcium channel (TPC1) in Arabidopsis thaliana that was
found to be related, albeit very distantly, to ITPR channels. Additional
partner regulatory proteins are introduced and some justification is
made that they interact directly with TPC1, providing it with regulated
gated activity specifically in guard cells.
During the development of the theory presented in Part ii it became
essential to understand the location and function of disordered protein
regions in sequences over many species and genes. This led to
the production of a Database of Disordered Protein Predictions (D2P2)
described in Part iii, as well as methods for visualizing multiple
classes of protein annotations such as structural domains,
posttranslational modification sites, and regions of protein disorder
that fold on contact. Chapter 5 discusses the major results from
producing D2P2 and their implications for the evolution of the
disordered protein state.
Finally in Part iv I introduce some relatively unrelated work
investigating the domain content of a new genome being sequenced at the
King Abdullah University of Science and Technology for the
Dinoflagellate species Symbiodinium microadriaticum. One of my main tasks
in this collaboration was to identify proteins that mediate many
endosymbiotic relationships carried out by Symbiodinium. Finding an
example of a superfamily only found in a single clade of bacteria, I
identify a plausible target protein and mechanism for eluding Toll-like
Receptors of host species.
In the concluding Part v I summarise pieces of work that I have
yet to finish, but include here to give some impression of the sorts of
work I have been thinking about and hope to one day complete.
ACKNOWLEDGMENTS
I would like to thank the Engineering and Physical Sciences Research
Council for their funding (grant EP/E501214/1) of my PhD. The Bristol
Centre for Complexity Sciences were responsible for developing
my extended skill set, with their inspired interdisciplinary doctoral
training program. I am grateful for the guidance and camaraderie I
received during my time at the centre. Professor John Hogan especially
was forthcoming in advice and support, both financial and practical,
for continuing my PhD when I moved on after a year working on
evolutionary swarm robotics. The Department of Computer Science
have also been instrumental in the production of this thesis,
specifically their forward thinking policy towards freely available
coffee in the common room.
Professor Julian Gough as my primary supervisor blessed me with
his brand of supervision. Whilst being wholly supportive he afforded
me great freedom to pursue topics I found engaging on my own
terms. From having seen the many differing styles of PhD supervision
I know this to be a rare fit. I feel very lucky to be amongst those
graduating from the Gough lab. In time I’m sure I will find myself
to be one amongst a large extended superfamily of successful and
happy people Julian has helped to inspire and send on their way.
Anywhere throughout this thesis where I have referenced the
SUPERFAMILY database I am respectfully acknowledging years of hard
work by Professor Gough, for which I believe many would give thanks.
It goes without saying that large bodies of knowledge, insight
and ideas on protein evolution I espouse have been inherited from
Professor Gough through his diligent tutelage. I hope to be able to
repay this great gift in time.
Professor Alistair Hetherington I have to thank for introducing me
to the wonderful world of cell-signalling, which as a computer
scientist previously working in artificial life, robotics and collective
behaviour I have found most intriguing. Alistair has perhaps seen me at
both my most intellectually wretched and rich states of being, but has
always shown the same level of care and enthusiasm for my efforts.
After the first nine months of exhaustively searching for a plant ITPR
gene his excitement and thirst for a satisfactory answer were a core
motivation to continue. Alistair’s broad knowledge of the experimental
field and literature moved theory forward when I often ran out of
ideas. The findings by Larisch et al. (2012) were especially welcome
to my inbox when I started to doubt my own ideas on the importance
of the role of disorder in TPC1.
Much of my PhD has been spent in collaboration with other students,
postdoctoral researchers and primary investigators in other
labs. I would especially like to thank both Dr Elodie Marchadier and
Dr Jean-Charles Isner for their help. Without their insights and
experimental work I would not have continued to push the plant ITPR
theory in Part II towards its current state. In Part III I present work I
have carried out in collaboration with a select group of people in the
field of disorder prediction; amongst them I would especially like to
acknowledge Professor Keith Dunker’s support. Keith’s generous
attitude towards collaborative work and knowledge sharing has made
him the unwritten third supervisor of my PhD. I would not have the
data I present on disordered protein content, or have broken through
with my work on ITPR in plants, without his evangelical efforts to
spread knowledge of disordered protein structure. With respect to
my work on the Symbiodinium genome in Part IV, Dr Manuel Aranda
contributed to my understanding of the surrounding biology during
weekly video meetings with the King Abdullah University of Science
and Technology. I am grateful that he took the extra time to answer
my rather simplistic questions. There is much work I have not
presented in this thesis, and many others have worked with me on side
projects that have indirectly contributed to the work I have presented.
I would especially like to thank Dr Owen Rackham, Dr David Morais,
Dr Hai Fang, and Adam Sardar for all of their input and shared
knowledge. I give double thanks to Adam for being a good friend
when they were in short supply, and to Owen for always being one
step ahead enthusiastically showing the way forwards.
My endless thanks to Natasha for her ever present support and
companionship throughout my joint MRes and PhD. Living with me
during this period of my life has not always been desirable or easy,
and rarely fun and uplifting. I cannot imagine how she has managed
to remain optimistic with only her faith in me as a person to guide
her. With the end of this period I hope to revert back to the happier
and calmer man she once knew almost five years ago.
I’d also like to thank Ben Smithers, Jon Stahlhacke, Matthew
Orlinski and Michael Sheldon for feedback and discussion about this
manuscript and some welcome proof reading at the last minute.
CONTENTS
i background 1
1 introduction and assumed knowledge 3
1.1 Domain Description of Proteins 3
1.1.1 What is a protein 3
1.1.2 Proteins fold 3
1.1.3 Domains and evolution 5
1.1.4 Domain structure prediction 7
1.2 Intrinsically Unstructured Protein 8
1.2.1 What is disordered protein structure 8
1.2.2 Disordered regions as flexible domain linkers 10
1.2.3 Synergy between domains and disordered regions 12
ii inositol phosphate signalling in viridiplantae 15
2 inositol-1,4,5-triphosphate mediated calcium re-
lease in plants 17
2.1 Abstract 17
2.2 Introduction and Background 18
2.2.1 Inositol-1,4,5-triphosphate signalling at the heart
of Ca²⁺ ion release from internal stores in Mammals
and more broadly Metazoans 18
2.2.2 Physiological evidence for Inositol-1,4,5-triphosphate
mediated signalling in Viridiplantae 20
2.2.3 Known structure of the Mammalian ITPR protein
families 22
2.3 IP3BC and MIR containing proteins across the tree of
life 26
2.3.1 ITPR-like protein readily identifiable in Chloro-
phyta 26
2.4 Searching for proteins that have some homology to ITPR
in Embryophytes 28
2.4.1 Model building 28
2.4.2 Search in twenty-five Embryophytes 30
2.4.3 ITPR Calcium channel domain has homology to
Two-pore Channels of Embryophyta 33
2.4.4 Disordered binding core in algal ITPR-like pro-
teins 33
2.5 Exhaustive search of Embryophyta genomes for an ITPR-
like protein 35
2.5.1 Checking for ITPR-like pseudogenes 35
2.5.2 Checking for bad or incomplete gene predictions 36
2.6 Conclusion 36
3 an alternative hypothesis for inositol phosphate
induced calcium release in embryophytes 39
3.1 Introduction 39
3.2 Phosphoinositide metabolism in Embryophytes is very
different to that found in Mammals 39
3.3 TPCs as an alternative channel to ITPR mediated Inosi-
tol phosphate signalling in Embryophytes 40
3.3.1 Evidence for an ITPR-like protein specifically in
plants is weaker than reported 41
3.3.2 Two-pore Channel dynamics can only be explained
by additional protein interactions 42
3.3.3 Two-pore Channels of Embryophytes have a con-
served C-terminal region of protein disorder 43
3.3.4 The C-terminal region is both vital for channel
function, and appears to be a complex forming
region 45
3.4 What about IP/PIP binding and regulation? 48
3.4.1 Inositol phosphate and Phosphatidylinositol phos-
phate binding alternate domains 49
3.4.2 Arabidopsis has some unique expansion of pro-
teins using PH domains 50
3.5 AT1G58230.1 as a plausible TPC1 regulatory subunit
(TPC1R) 52
3.5.1 Co-expression with a putative PIK-like protein
kinase associates TPCR with the TPC1 although
indirectly 53
3.5.2 Is the C-terminal region of TPC1 a cryptic dimeris-
ing coiled-coil regulated by PIKK and other ki-
nase activity? 54
3.5.3 Guard cell Ca²⁺ dependent CO₂ response phenotype
experimentally associated with both TPCR
candidate and related PIKK through knockouts 56
3.6 Conclusion 57
iii protein disorder 61
4 creating a database of disordered protein
predictions d2p2 63
4.1 Abstract 63
4.2 Introduction and Background 63
4.2.1 Predicting disorder 63
4.2.2 Databases that already exist and why they are
not appropriate 66
4.3 Predicting Disorder 67
4.3.1 The SUPERFAMILY sequence library 67
4.3.2 Integrating predictions through consensus 68
4.3.3 Assessing predictor coverage with SCOP efficiently 69
4.4 Constructing a Database 70
4.4.1 Sequence 70
4.4.2 Predictions 71
4.4.3 Search 71
4.4.4 Statistics 71
4.4.5 Reports 72
4.5 Future Work 73
5 the importance of disordered protein in cellu-
lar life 75
5.1 Abstract 75
5.2 Disorder in each domain of life 75
5.3 Disorder and its association with posttranslational mod-
ifications 77
5.4 Disorder and Alternative Splicing 79
5.5 Disorder and Eukaryotic Linear Motifs 80
5.5.1 What are Linear Motifs 80
5.5.2 ELMs in D2P2 81
5.5.3 Calculating self-information of motifs 81
5.6 Disorder Predictors Compared 84
5.7 Community impact of D2P2 86
5.7.1 CASP10 86
5.7.2 Pfam 86
iv new genomes 87
6 protein domain analysis of the symbiodinium sp.
a1 genome 89
6.1 Abstract 89
6.2 Introduction and Background 90
6.2.1 KAUST and Reef Genomics 90
6.2.2 What is Symbiodinium 91
6.2.3 Relevance of Symbiodinium to global environ-
mental science 91
6.2.4 Unusual genetics of dinoflagellates 92
6.3 Assessing Gene Annotation Quality 93
6.3.1 Several rounds of gene-calling 93
6.3.2 Using SUPERFAMILY to evaluate the state of gene-
calling 94
6.4 Domain Content 98
6.4.1 Functional enrichment 98
6.4.2 Transcription factors 100
6.4.3 RISC, Dicer and Argonautes 101
6.4.4 Histone content 101
6.5 A FucT-like family domain identified for the first time
in Eukaryotes 102
6.5.1 FucT-like from Helicobacter pylori or just a FuT
from everywhere? 103
6.5.2 Structure of the putative FucT-like domain 104
6.5.3 Toxin reuse and host immune response evasion 107
6.6 Phylogenetic Placement 108
6.6.1 The sTOL method 108
6.6.2 Tree building 109
6.6.3 The Chromalveolate hypothesis 112
6.6.4 Relevance of Symbiodinium’s placement on the
tree 113
6.7 Conclusion 114
v future work 115
7 directions for future research 117
7.1 Protein domain decomposition 117
7.2 Conserved protein disorder 119
7.2.1 Conserved in sequence 119
7.2.2 Describing and categorising disordered regions
within conserved domains 121
7.2.3 Characterising the TPC1 C-terminal conserved dis-
order domain 122
7.3 Geographic distribution of protein structure 123
7.4 Closing remarks 126
bibliography 129
Bibliography 129
vi additional material 157
a itpr-like homologous proteins found in chlorophyta 159
b symbiodinium additional data 161
b.1 Domain Annotation 161
b.2 RISC, Dicer and Argonautes 166
b.3 RAxML Tree of FucT-like related proteins 166
vii published papers 171
c d2p2: database of disordered protein predictions 173
d a daily-updated tree of (sequenced) life as a ref-
erence for genome research 183
e the evolution of human cells in terms of pro-
tein innovation 195
LIST OF FIGURES
Figure 1 All states of Spinach Thylakoid soluble phos-
phoprotein (TSP9). 9
Figure 2 Schematic of GRB2 and its short type disordered
regions. 11
Figure 3 A sketch of the IP3 signal transduction cascade
found in Mammals. 19
Figure 4 The IP3 binding-core region of human ITPR1 in
the bound state. 22
Figure 5 Schematic of the Human ITPR1 protein as shown
in the D2P2 resource. 25
Figure 6 A phylogenetic tree of IP3BC presence in Plan-
tae from the SUPERFAMILY database. 27
Figure 7 V. carteri f. nagariensis disordered IP3BC insert
ELM assignment. 34
Figure 8 Phosphoinositide metabolism in plants versus
human 40
Figure 9 Disordered and structural architecture of the atTPC1
protein. Highlighted is an N-terminal trafficking
motif region, and a conserved C-terminal disor-
dered region. 44
Figure 10 Eukaryotic Linear Motif results for atTPC1 fil-
tered by relevancy to Arabidopsis thaliana. 45
Figure 11 Structure prediction of the atTPC1 disordered
tail yields multiple structural alignments. 47
Figure 12 TM-Alignments of the best I-TASSER model of
atTPC1 48
Figure 13 Novel domain adjacency network for Arabidop-
sis thaliana contains several Inositide binding re-
lated structures. 51
Figure 14 Structural depiction of a plausible transient state
after PIKK activated folding in the disordered
tail of atTPC1. 55
Figure 15 Putative TPCR knockout experimental results. 57
Figure 16 PIK-like Kinase knockout experiment results. 58
Figure 17 A to scale venn diagram of the SUPERFAMILY
and D2P2 sequence library in green versus all
UniProt sequences in red. 67
Figure 18 Schematic of how the D2P2 predictor consensus
is calculated (see Figure 21 on page 72 for a real
example). 68
Figure 19 Simple Python function to calculate the shared
coverage between two regions of assigned pro-
tein annotation. 69
Figure 20 A schematic for efficient calculation of overlap-
ping coverage between two protein annotations. 70
Figure 21 D2P2 graphical reports with yet to be released
data for the Ensembl Human genome annotated. 72
Figure 22 Disordered amino acid coverage per genome grouped
by domain of life for each prediction method in
D2P2. 75
Figure 23 Disordered amino acid coverage inside SUPER-
FAMILY predicted SCOP domains per genome,
grouped by domain of life for each prediction
method in D2P2. 76
Figure 24 Posttranslational modifications and their associ-
ation with disordered amino acids 78
Figure 25 Relevant equations for calculating entropy meas-
ures. 82
Figure 26 Distribution of self information for all Eukaryotic
Linear Motifs in whole genomes from D2P2. D
= 0.1734, p-value < 2.2e-16 84
Figure 27 Distribution of self information rate for all Eu-
karyotic Linear Motifs in whole genomes from
D2P2. D = 0.3349, p-value < 2.2e-16 85
Figure 28 A graph comparing the differing extents protein
predictors assign intrinsic disorder, or smaller
regions of disorder. 85
Figure 29 Microscope photography of Symbiodinium microadriaticum
Freudenthal (CCMP2467) 90
Figure 30 The number of genes called with protein an-
notation and domain assignment in each round
of gene calling. 95
Figure 31 A bar chart of unique domain assignment di-
versity during Symbiodinium gene calling. 96
Figure 32 Effect of calling longer genes later in the assembly
and the effect on domain assignment
length. 96
Figure 33 Coverage of domain assignment as the assembly
progressed. 97
Figure 34 FucT-like related sequences maximum likelihood
tree. 103
Figure 35 Structural alignment using TM-align of the I-
TASSER Symbiodinium FucT protein model 104
Figure 36 H. pylori FucT binding sites are conserved in the
I-TASSER model from Symbiodinium. 105
Figure 37 A subtree from the tree built with Symbiodinium
where placement is constrained to the NCBI clas-
sification polytopes. 110
Figure 38 A cladogram showing the subtree from the tree
built with Symbiodinium free to be placed any-
where in the whole sTOL tree of life. 111
Figure 39 Results of domain decomposition using TF-IDF
to rank importance. 118
Figure 40 Sequence 89704.m00121 from Trichomonas vaginalis
contains an example of the largest disorder cluster
at its N-terminus. 120
Figure 41 An evolutionary schematic of pre-BEACH PH-
like domains. 122
Figure 42 Map of all sequenced species from the GBIF re-
source 123
Figure 43 Global distribution of the domain architecture
of atTPC1 versus hsITPR1 for 106 Eukaryotic
genomes. 124
Figure 44 ITPR-like homologous proteins found in Chloro-
phyta. 160
Figure 45 Dicer domain architectures from the draft Sym-
biodinium genome. 167
Figure 46 Argonaute domain architectures from the draft
Symbiodinium genome. 168
Figure 47 Large scale sequence tree built for FucT-like re-
lated proteins. 169
LIST OF TABLES
Table 1 Five functional regions of ITPR as defined by
Bosanac et al. (2005) 23
Table 2 ITPR-like homologs found in Chlorophyta with
E-values reported for the IP3BC domain i.e. the
IP3 receptor type 1 binding core and the MIR
domain being assigned together. 29
Table 3 Seed sequences used to build both whole-protein
and IP3BC HMMER models. 30
Table 4 Summary results of searching ITPR-like protein
models against 25 plant genomes. All “MIR-like”
results are likely Mannosyltransferases. Those
genomes marked with a * are of draft quality
at the time of writing. 32
Table 5 Eukaryotic Linear Motifs assigned to the C-terminal
disordered region of atTPC1 near to the di-Arginine
WD40 binding motif. 46
Table 6 Alternate domains that bind Inositol phosphatides
in Arabidopsis thaliana. 50
Table 7 Putative PH+BEACH containing proteins of Ara-
bidopsis thaliana with domain combinations im-
plying a possible TPCR role. 53
Table 8 Amino acid probabilities used in self-information
calculations based on the D2P2 sequence library
propensities. 82
Table 9 Reproduction of Pfam coverage from Table 2 of
“The challenge of increasing Pfam coverage in
the human proteome” (Mistry et al., 2013) 86
Table 10 Molecular function GO-terms assigned using the
dcGO resource only found in Symbiodinium versus
an algal background. 99
Table 11 Table of transcription factor related domains for
each round of gene calling. 100
Table 12 List of Symbiodinium Histone proteins. 102
Table 13 Domains present in the Symbiodinium genome
after Modeller gene models were included in
with the data freeze release. 161
Part I
BACKGROUND
In this part I present some introduction and background
knowledge assumed throughout the rest of this thesis. Any
reader familiar with the general concept of what a protein
molecule is would not lose out skipping to the later
sections of this part discussing disordered proteins and
related properties.
1 INTRODUCTION AND ASSUMED KNOWLEDGE
“I still somewhat shudder at the thought that highly
efficient, purposive, organizational elements, like the
proteins, should originate in a random process. Yet many
efficient and purposive media, e.g., language, or the national
economy, also look statistically controlled, when viewed
from a suitably limited aspect. On balance, I would
therefore say that your argument is quite strong.”
From a personal letter from John von Neumann to George Gamow,
25th July 1955.
1.1 domain description of proteins
1.1.1 What is a protein
Proteins are biomolecules, polymers formed of amino acid residues
held together by peptide bonds. Each amino acid residue has a
one-to-one relationship with an equivalent triplet (or codon) of
nucleotides in the parent gene’s DNA, with the sequence of amino acids
being the same as the sequence of codons transcribed from a gene. The
directionality of information being stored in DNA, transcribed
to messenger RNA intermediates and ultimately translated to amino
acid sequence by a ribosome, was most famously codified in the
central dogma of molecular biology by Crick (1970). However, the
information content of a protein coding gene is only fully apparent at
the level of the expressed protein. Each protein has a unique and
specific three-dimensional structure that directly relates to its
biological function. The final shape or conformation a protein assumes
starting from a linear chain is referred to as a protein’s fold.
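The codon-to-residue relationship described above can be sketched in a few lines of Python. This is only an illustration, assuming the standard genetic code and reading the coding DNA strand directly (the messenger RNA intermediate is skipped); the names CODON_TABLE and translate are hypothetical and are not taken from this thesis.

```python
from itertools import product

# Standard genetic code written in the conventional T, C, A, G order.
# '*' marks a stop codon. Using the standard code is an assumption of
# this sketch; some organisms and organelles use variant codes.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): residue
    for codon, residue in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(coding_dna):
    """Translate a coding DNA sequence codon by codon until a stop codon."""
    residues = []
    for start in range(0, len(coding_dna) - 2, 3):
        residue = CODON_TABLE[coding_dna[start:start + 3].upper()]
        if residue == "*":  # stop codon: the protein ends here
            break
        residues.append(residue)
    return "".join(residues)


# Four codons: ATG (Met), GCC (Ala), TGG (Trp), TAA (stop).
print(translate("ATGGCCTGGTAA"))  # prints MAW
```

Each residue in the output corresponds to exactly one codon in the input, while several different codons can encode the same residue.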
1.1.2 Proteins fold
A nascent protein will self organise through self-interaction and
interaction with the surrounding solvent to reach a native ground state
on time scales ranging from milliseconds to seconds. The speed of
protein folding was surprising to early researchers given the size and
organisational complexity of known structures, especially given the
naive assumption that all conformational arrangements of a protein
are explored before randomly reaching a native state.
Both the problem with the speed of protein folding and the solution
are best communicated through Levinthal’s paradox (Levinthal, 1969).
Levinthal starts from the assumption that proteins fold through a
random process i.e. all proteins are random-coils until they reach their
native state. To find the minimum Gibbs free energy structure of a
150 amino acid protein having 450 degrees of freedom, a space of
~10^300 conformations would need to be inspected, given a bond angle
accuracy of one-tenth of a radian. The time for a protein to naturally
reach its native and stable fold is at a maximum on the order of two
seconds, so only ~10^8 conformations could be visited by a protein.
From this thought experiment Levinthal went on to conclude that
local short-range interaction and stability between amino acids must
limit the total number of possible conformations visited during the
folding process as it progresses.
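The scale of the mismatch in Levinthal's thought experiment is easier to appreciate as arithmetic. The sketch below takes the two figures quoted above (a space of about 10^300 conformations, and about 10^8 conformations visited in roughly two seconds) and works in base-10 logarithms to avoid overflow. The function name and the figure used for the age of the universe (about 13.8 billion years) are assumptions added for illustration.

```python
import math


def log10_search_time(log10_conformations, conformations_visited, seconds):
    """Return log10 of the seconds needed to visit every conformation once,
    at the sampling rate implied by `conformations_visited` in `seconds`."""
    rate_per_second = conformations_visited / seconds
    return log10_conformations - math.log10(rate_per_second)


# Figures from the text: ~10^300 conformations, ~10^8 visited in about 2 s.
log10_seconds = log10_search_time(300, 1e8, 2.0)

# Assumption for scale only: the universe is roughly 13.8 billion years old.
log10_universe_age = math.log10(13.8e9 * 365.25 * 24 * 3600)

print(f"exhaustive search: ~10^{log10_seconds:.1f} s")
print(f"age of universe:   ~10^{log10_universe_age:.1f} s")
```

Under these assumptions an exhaustive random search would take on the order of 10^292 seconds, which is why an unguided search cannot be how proteins reach their native state.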
Levinthal wasn’t wrong. In the simplified case of globular proteins
in solution, an exact fold is largely due to local interaction between
differing residues. However, early on in the folding process
conformational restriction is dominated by global hydrophobic effects
caused by the aqueous environment. The hydrophobic effect “packs”
residues with hydrophobic side chains towards the interior of the
protein to form a dry core, as this is a more energetically favourable
conformation. This idea is referred to more broadly as the hydrophobic
collapse hypothesis of protein folding, with the initial collapsed
intermediate referred to as a molten-globule. These two processes give
rise to what is known as a protein’s secondary and tertiary structure,
primary structure being the defined sequence of amino acids. Other
models and processes have been both observed and put forward for
protein folding. For example, many metal ion containing proteins are
driven by nucleated folding initiated by electrostatic interaction of
side chains with the ion.
The process of protein folding can be more broadly conceptualised
by the folding funnel hypothesis, which suggests that as a protein folds
through a space of intermediates it travels down a “pathway”
descending a funnel (or minimum) in the Gibbs free energy landscape of
possible conformations. The resultant amino acid arrangement
minimises the global free energy for the tertiary structure of the protein.
Small local minima along the sides of this theoretical folding funnel
produce a rough space of transient conformers with intermediate
ensembles of secondary structure similar to the thought experiments of
Levinthal.
Secondary structure arises from local effects due to backbone
hydrogen bonding between amino acids, after the initial collapse of a
protein. Backbone hydrogen bonding forms several recognisable
secondary structures such as turns, helix and strand (Kabsch and Sander,
1983). Although secondary structure is tightly related to exact
sequence, secondary structures can be arranged in alternate ensembles,
forming approximately the same tertiary structure when folded. Even
if portions of sequence are displaced relative to one another the same
resultant tertiary structure will be formed. It is this observation that
brings us to protein domains.
1.1.3 Domains and evolution
Protein domains are self-stable sub-units of proteins that fold to well
aligned tertiary structure given the same sets of secondary structure
motifs, but not necessarily recognisably similar sequences of amino
acids. Domains can be thought of as atomic parts of large scale
structure common between either different proteins or homologous
(evolutionarily related) proteins in different species. Domain structure
often relates to specific protein function. Individual domains can be
copied between two genes, resulting in the transfer of a functional
protein sub-unit. Common examples of the more promiscuous domains
of this kind transfer functions relating to the formation of protein
quaternary structure. Protein quaternary structure is formed when
several polypeptide chains bind together, at the interface between
well folded domains, to form a complex (Bennett et al., 1995; Heringa
and Taylor, 1997). The WD40-repeat [scop:50979] domain is a great
example: it forms a beta-propeller structure, facilitating protein
quaternary structure with various degrees of radial symmetry.
Domain tertiary structure as defined by the superfamily classification
is recognised by the conserved arrangement of a protein’s
hydrophobic core and the 3D arrangement of its secondary structures.
This is robust in terms of an evolutionary process by definition. Large
scale change to a protein’s core would prevent efficient folding to a
given tertiary structure. The specific definition of any superfamily
I mention throughout this thesis comes from the manually curated
Structural Classification of Proteins (SCOP) database. The concept of
how a domain arises has also been formally modelled in studies such
as Bornberg-Bauer and Chan (1999).
Bornberg-Bauer and Chan (B&C) postulate that thermodynamic stability
of a protein sequence is negatively correlated with its mutational
plasticity, or in other words positively correlated with the
sequence conservation. B&C propose the concept of “super funnels” in
the space of protein sequence similarity to define domains. Within
super funnels point mutations in protein sequence smoothly affect
thermodynamic stability. Well folded and efficient domain families
lie at local minima near the bottom of these selective funnels. Any
point mutation adversely affects thermodynamic stability from these
minima, so a sequence must be under selective pressure unrelated to
efficient folding to move away from this conserved sequence. The birth
of a new fold in this theoretical paradigm can be thought of as a random
sequence starting just within a selective horizon of a super funnel,
with any selection pressure to fold efficiently pushing a sequence
towards the canonical domain sequence.
In the SCOP definition and hierarchy of domain classification,
superfamilies are those domains that share recognisably common
ancestry for the folding of their hydrophobic core. Pethica et al. (2012)
demonstrated that domains at the family level of SCOP classification
in general agree with traditional sequence-only phylogenetic
reconstruction of protein families.
Domains although conserved in structure still undergo modification
and elaboration at the level of each amino acid, but more
important to evolutionary understanding are the genetic mechanisms
allowing for more rapid movement of whole domains between proteins
(Bork, 1991). A point mutation in DNA can result in a
non-synonymous change to a codon. Non-synonymous changes either
manifest as a single substitution of an amino acid species at the given
protein locus, or introduce a premature end to the protein by creating
a stop codon. Additional large-scale structural changes
in DNA such as chromosomal inversion and translocation can cause
less localised changes to a given gene, resulting in protein
truncation, elongation, fusion and fission. Another event that is common
to all life is that of DNA replication slippage: slippage of a DNA
polymerase can cause the extension of a protein. This is most readily
observed as terminal duplication in a protein sequence, and is especially
visible in proteins with conserved and repetitive C-terminal structure.
Finally the phenomenon of exon shuffling, seen exclusively in the
Eukarya domain, is the long range movement of exons from one gene to
an unrelated gene and can occur due to transposon activity or sexual
recombination.
When these events in DNA map to boundaries between domains
(specifically regions capable of self folding) the likelihood of a biolo-
gical function being retained is higher and thus selection for retaining
these events is also higher. If an event breaks up a region of protein
sequence that is capable of self folding, the structure can become
irreparably destabilised and unable to retain function in any gene
affected by the event. The same is true of point mutations made at the
surface of a protein structure within an active site. However, outside
of an active site structural changes can slowly accrue with minimal
cost to fitness due to a lack of selective pressure, plausibly
creating new active sites with time. When whole genes are duplicated,
this slow accrual is a key mechanism for functional repurposing of the
new gene.
Point mutations within the hydrophobic core of a protein fold could
disrupt the ability of a protein to fold, either by the introduction of
a hydrophilic species or of a species whose side chain cannot be
accommodated, such as Arginine. If folding is negatively affected to
the point of completely disrupting the ability of a domain to attain its
native state then it is highly likely all function is lost and the mutation
is selected against, or a domain is lost in the population. This is the
basis of the domain perspective of protein evolution. New domain
folds are created rarely at evolutionary time scales and by chance.
In general more events exist that duplicate or cause the loss of a do-
main. However, even with domain duplication it’s estimated that only
between 0.4 and 4% of SCOP domain combinations at the superfam-
ily level of classification are created convergently, ignoring tandem
repeats (Gough, 2005).
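The effect of a hydrophilic substitution in the hydrophobic core, raised earlier in this paragraph, can be roughed out numerically. The sketch below uses the Kyte-Doolittle hydropathy scale, which is my own choice for illustration and not a method used in this thesis; the function names and the cut-off of -4.0 are arbitrary assumptions. It ignores side chain size entirely, so it captures only the hydrophilicity half of the argument.

```python
# Kyte-Doolittle hydropathy values (higher = more hydrophobic).
KYTE_DOOLITTLE = {
    "I": 4.5, "V": 4.2, "L": 3.8, "F": 2.8, "C": 2.5, "M": 1.9, "A": 1.8,
    "G": -0.4, "T": -0.7, "S": -0.8, "W": -0.9, "Y": -1.3, "P": -1.6,
    "H": -3.2, "E": -3.5, "Q": -3.5, "D": -3.5, "N": -3.5, "K": -3.9,
    "R": -4.5,
}


def hydropathy_change(wild_type, mutant):
    """Hydropathy of the mutant residue minus that of the wild type."""
    return KYTE_DOOLITTLE[mutant] - KYTE_DOOLITTLE[wild_type]


def is_core_disrupting(wild_type, mutant, threshold=-4.0):
    """Flag substitutions that make a buried position much more hydrophilic.

    The threshold is an arbitrary illustrative cut-off, not a fitted value.
    """
    return hydropathy_change(wild_type, mutant) <= threshold


# Leucine to Arginine: a large drop in hydropathy.
leu_to_arg = hydropathy_change("L", "R")
# Leucine to Isoleucine: a conservative hydrophobic swap.
leu_to_ile = hydropathy_change("L", "I")
```

A real assessment would also need the burial of the position and the volume of the new side chain, which is why Arginine is singled out in the text.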
A final consideration is that of multi-domain composition and how
this can affect higher order selective processes of a protein. Vogel
et al. (2004) define the “supra-domain” by observing common
arrangements of domains in 131 sequenced genomes. They found 1,400
examples of domain pairs, and 166 examples of the less common domain
triplet, that are conserved alongside an abundance of additional
domain partners. What distinguishes a conserved supra-domain from a
pair or triplet that merely has a high copy number is the number of
different partners it is seen with: a random domain pair has on
average 3.0 other superfamily species of domain partner, whereas a
true duplet supra-domain has 5.5 superfamily partners on average, or
7.2 in the case of over-represented supra-domains such as those made
up of P-loop domains.
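As a rough illustration of the partner-count idea above, the following Python sketch counts, for each adjacent domain pair, how many distinct superfamilies are found immediately next to it across a set of architectures. The architectures, the single-letter superfamily labels and the function name are hypothetical, and this is only a toy version of the counting, not the procedure of Vogel et al. (2004).

```python
from collections import defaultdict


def pair_partner_counts(architectures):
    """Count distinct flanking superfamilies for each adjacent domain pair.

    Each architecture is a list of superfamily labels in N- to C-terminal
    order. A partner is any domain directly before or after the pair.
    """
    partners = defaultdict(set)
    for arch in architectures:
        for i in range(len(arch) - 1):
            pair = (arch[i], arch[i + 1])
            found = partners[pair]  # ensures pairs with no partner are kept
            if i > 0:
                found.add(arch[i - 1])
            if i + 2 < len(arch):
                found.add(arch[i + 2])
    return {pair: len(found) for pair, found in partners.items()}


# Hypothetical architectures, labelled by superfamily.
ARCHITECTURES = [
    ["A", "B", "C"],
    ["D", "A", "B"],
    ["A", "B", "E"],
    ["C", "D"],
]

counts = pair_partner_counts(ARCHITECTURES)
```

In this toy data the pair (A, B) is seen with three different partners while every other pair has at most one, which is the kind of contrast used to separate a supra-domain from a pair that is merely abundant.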
1.1.4 Domain structure prediction
Since secondary structure mostly arises from local effects between
nearby residues, it is possible to predict it from a given protein
sequence with reasonable accuracy, ~80% (Pollastri and McLysaght,
2005). If domains form from secondary structure motifs, it should be
possible to predict a classification of the domain from predicted
secondary structure. Work has been done on the structural
classification of existing proteins in SCOP (Murzin et al., 1995),
with the current version of the database (1.75, June 2009) taking
38,221 protein structures from the Protein Data Bank (PDB) (Rose et
al., 2011) and classifying 110,800 domains. SCOP defines a hierarchy
of classification that at the family level approximates relationships
found through traditional phylogenetic techniques based on protein
sequence alone.
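To make the accuracy figure concrete: secondary structure predictors are usually scored per residue over three states (helix, strand, coil), a measure commonly called Q3. The sketch below assumes that this is the kind of accuracy meant above; the function name and the two state strings are made up for illustration.

```python
def q3_accuracy(predicted, observed):
    """Fraction of residues whose predicted state matches the observed one.

    States are single characters: H = helix, E = strand, C = coil.
    """
    if len(predicted) != len(observed):
        raise ValueError("state strings must have the same length")
    if not observed:
        raise ValueError("state strings must not be empty")
    matches = sum(p == o for p, o in zip(predicted, observed))
    return matches / len(observed)


# Made-up ten-residue example with two mismatched positions.
observed = "CHHHHCEEEC"
predicted = "CHHHCCEEEE"
score = q3_accuracy(predicted, observed)
```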
The broadest category is the Class of a protein, which is related to
the secondary structures that are present in a protein domain, such
as alpha-helices or beta-sheets. Proteins that mostly or only contain
alpha-helices therefore come under the alpha-helix Class. Below Class
membership is the Fold, which separates protein space based on how
secondary structures fold together. Then, with an increasing level of
discrimination, the relatedness of the 3D position and orientation of
each secondary structure within a domain can be used to form the
Superfamily classification, where all proteins share a common
evolutionary origin. To further classify proteins, sequence or primary
structure is used. Domains that have similar 3D structure might have
several different orderings in sequence of the constituent secondary
structure motifs. It is from this detailed view that a Family level
classification is given, so proteins in the same family have very
close evolutionary history, and their included domains are equally
tightly related by association.
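The four levels can be read directly off a SCOP concise classification string (sccs) of the form class.fold.superfamily.family. The sketch below splits such a string into its levels and compares two domains at the superfamily level; the helper names and the example identifiers are chosen for illustration only.

```python
from typing import NamedTuple


class ScopLevels(NamedTuple):
    scop_class: str
    fold: str
    superfamily: str
    family: str


def parse_sccs(sccs):
    """Split an sccs such as 'c.37.1.8' into cumulative level identifiers."""
    parts = sccs.split(".")
    if len(parts) != 4 or not all(parts):
        raise ValueError(f"expected class.fold.superfamily.family, got {sccs!r}")
    return ScopLevels(
        scop_class=parts[0],
        fold=".".join(parts[:2]),
        superfamily=".".join(parts[:3]),
        family=sccs,
    )


def same_superfamily(a, b):
    """True when two sccs strings agree down to the superfamily level."""
    return parse_sccs(a).superfamily == parse_sccs(b).superfamily


levels = parse_sccs("c.37.1.8")
related = same_superfamily("c.37.1.8", "c.37.1.12")
unrelated = same_superfamily("c.37.1.8", "c.2.1.1")
```

Because each level is stored as a cumulative prefix, two domains can be compared at any depth of the hierarchy with a plain string comparison.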
From a detailed classification of all the protein domains currently
described, it is possible to see how often new structures fall into
already classified areas of this protein space. Chothia (1992) estim-
ated that approximately 1,000 superfamilies were required to com-
pletely classify all proteins at that time. Furthermore a third of all
newly described protein structures fall into previously recognised su-
perfamilies.
As new structures are described through X-ray crystallography and
similar means it should become increasingly possible to statistically
infer through homology of sequence alone what Family or Super-
family classification a protein should be given. Gough et al. (2001)
describe a method using Hidden Markov Models to predict SCOP
classifications through sequence homology. Following this, in 2002 the
method was released to the public domain in the form of the
SUPERFAMILY online resource (Gough, 2002b). Where sufficient sequence
homology exists, SUPERFAMILY annotates all proteins in all currently
sequenced genomes with relevant SCOP classifications (Gough, 2006).
This work and resource are instrumental to the methods described in
the rest of this thesis.
1.2 intrinsically unstructured protein
1.2.1 What is disordered protein structure
Intrinsically disordered or unstructured proteins (IDPs) exist in vivo
as highly flexible polypeptide chains, behaving as an ensemble of
conformational states with no stable tertiary structure (Uversky et
al., 2000). Regions of IDP can exist as unfolded chains or as molten
globules with well-developed secondary structure, and often function
through transition between differently folded states (Uversky, 2002).
To some the
concept of unstructured protein being functional might not fit well
with accepted theory on enzyme action, and the concept of exact
3D structure implying function. However, even theory surrounding
active site function in well structured compact globular enzymes in-
cludes the addition of flexibility and structural rearrangement with
the move from Emil Fischer’s lock and key model to Daniel
Koshland’s induced fit model. Here the dynamic switch in structure of
the enzyme when bound to its selected-for substrate induces the exact
3D structure needed to perform a necessary function. Essentially the
argument for functional unstructured or disordered protein is an
extension of this argument: fold transitions to an exact 3D
arrangement through time yield the desired function. The fold dynamics
of well structured proteins are already known to follow pathways of
folding that have many flexible intermediates; these are, however,
transient, rather than the protein permanently remaining in a
semi-molten state.
Disordered regions are highly enriched for many forms of
posttranslational modification. A good example is the Thylakoid
soluble phosphoprotein (TSP9) from Spinach (Figure 1), which was shown
to flip between a folded trans-thylakoid-membrane state and an
unfolded but stroma-accessible state with the addition of phosphate
groups in two key disordered regions (Song et al., 2006). This
bistable switch between the central helix being transmembrane or free
in the stroma also facilitates another key property of disordered
proteins, that of activated protein-complex formation. When the helix
is accessible a much larger protein complex is able to form at the
interface of the thylakoid membrane to perform a task.
Figure 1: A single stable helix in black separates two disordered segments
shown in blue and green-red of the Spinach Thylakoid soluble
phosphoprotein (TSP9). All known states from NMR structure 2fft
from PDB are shown superposed and aligned about the central
helix. This figure has previously appeared in a special issue of
Current Opinion in Structural Biology (Gough and Dunker, 2013).
Mechanisms for functional conformational transition include binding
with other proteins, nucleic acids and various small molecules, as
well as the addition of numerous posttranslational modifications such
as phosphorylation, acetylation and methylation. Phosphorylation
especially has been shown to be important for inducing fold
transitions (Iakoucheva et al., 2004; Song et al., 2006). A
particularly striking example is found in Cytoplasmic
Linker-associated Protein 2 (CLASP2), where a series of phosphorylated
sites disrupt the “molecular velcro” holding microtubule networks
together, through retraction of disordered regions from a series of
arginine residues interacting with the added phosphate groups (Kumar
et al., 2012).
Biological functions of known IDPs are varied and their roles in-
clude: instigation of protein complex formation, molecular recogni-
tion as seen in nucleoporins of the nuclear pore complex (Yamada
et al., 2010), signal transduction, transcriptional regulation and many
other functions related to interaction and regulated activity (Dunker
et al., 2008; Dyson and Wright, 2005).
Much work has been done on producing classification and annota-
tion of known unstructured regions from 3D experimental data found
in the PDB (Rose et al., 2011), such as the DisProt (Sickmeier et al.,
2007) and IDEAL (Fukuchi et al., 2012) resources. However, the past
focus on structured protein domains has limited the total number of
described IDP regions. For example: the current release of DisProt
(6.00 2012-07-01) describes 667 proteins containing 1,467 verified dis-
ordered regions; and the IDEAL database (as of 2012-05-09) describes
209 disordered proteins in detail, 97 of which have been experiment-
ally verified to be structured and disordered over the same region un-
der different conditions; and also MobiDB (Di Domenico et al., 2012)
has applied a method for identifying mobile regions from NMR struc-
tures (Martin et al., 2010) to 26,933 proteins (v1.2.1 as of 2012-11-01).
Due to the biases of structural resolution and the relative ease in
predicting disordered state compared with de-novo fold prediction,
many algorithms have been developed to discover novel regions of
disorder from amino acid sequence alone (He et al., 2009; Peng and
Kurgan, 2012b). For full discussion on protein disorder prediction
please see Part iii.
1.2.2 Disordered regions as flexible domain linkers
The concept of domain linkers is not new, and perhaps predates any
mention of protein disorder. However, their generality and importance
in protein regulation have only become apparent within the last
decade. Large scale attempts have been made at classifying and
characterising these regions, a notable example being George and
Heringa (2002). The properties of these linkers would
later become known as “short” type disorder, with lengths reaching
a maximum of around thirty-five amino acids. Although longer re-
gions of disorder can still act as flexible links between domains, they
usually contain additional regulatory and interaction motifs rather
than existing purely as spacers for improving fold dynamics of their
neighbouring domains.
Figure 2: A schematic of the domain arrangement of human GRB2
[UP:P62993]. The DisProt track at the top of the figure reflects
experimental results from published literature. We can see a slight
peak in flexibility predicted in DynaMine around the second
experimental disordered region, as well as some associated
phosphorylation sites following this region that regulate function.
Even very short flexible linkers, two to seven residues long, have
been found to be significant, as in the study by Yuzawa et al. (2001).
They showed that Growth factor receptor-bound protein 2 (GRB2) in
humans is a monomer in solution, with the C-terminal SH3 domain joined
to the middle SH2 domain by a flexible linker. The study compared the
authors' own NMR based structures to the previously attained X-ray
crystal structures that predicted GRB2 to exist in a single compact
dimeric state with both the N- and C-terminal SH3 domains being in
close contact. However, NMR evidence showed that the C-terminal
SH3 domain existed in a multitude of states from extended to bound
to the other SH3 domain.
GRB2 functions as an adapter protein mediating signal transduc-
tion from cell membrane bound receptors such as the epidermal growth
factor receptor (EGFR) to internal Ras/MAP kinase-signaling cas-
cades. The central SH2 domain binds to phosphotyrosine motifs (RXXK)
specific to its target protein kinases, the N-terminal SH3 domain binds
to the membrane bound receptor, with the C-terminal SH3 domain
dynamically folding to bring kinase, receptor and a target protein
into contact.
Single point mutations in the SH2 domain of GRB2 increase its affinity
of binding to given R-X-X-K targets by 40 times. This affords the SH2
domain specificity to a given kinase as well as selectivity through
small changes to these sequences. This is a desirable trait for
repurposing duplicate copies of a gene, as new GRB proteins could be
used to rewire receptors to other pathways with only minor mutation
events. Additionally, both SH3 domains bind Proline rich motifs in
both receptor and target proteins. These motifs are often referred to
as Eukaryotic Linear Motifs (ELMs) or Short Linear Motifs (SLiMs), and
the concept of modification to, or interaction with, these regions
causing structural rearrangement is often referred to as Molecular
Recognition Features (MoRFs). Disordered protein regions and these
introductory concepts are discussed further in Part iii.
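Linear motifs of this kind are typically written as short patterns over the amino acid alphabet, which makes them easy to scan for. The sketch below searches a sequence for the minimal proline-rich core P-x-x-P often associated with SH3 binding. The pattern is deliberately simplified, and the sequence and function name are invented for illustration rather than taken from the ELM resource.

```python
import re

# Simplified proline-rich core: P, any two residues, P.
# A lookahead is used so that overlapping instances are reported too.
PXXP = re.compile(r"(?=(P..P))")


def find_motif(sequence, pattern=PXXP):
    """Return (1-based start position, matched residues) for each instance."""
    return [(m.start() + 1, m.group(1)) for m in pattern.finditer(sequence)]


# Invented sequence, not a real protein.
hits = find_motif("MSAPPLPPRNTGS")
```

A pattern this short will match by chance in many sequences, which is one reason motif annotation is usually restricted to disordered, accessible regions.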
1.2.3 Synergy between domains and disordered regions
In the previous section I highlighted that the SH3 domain of GRB2 in-
teracted with a specific motif or pattern of residues in another protein,
in that example a kinase. However, these motifs, known by various
names such as Eukaryotic Linear Motifs (ELMs), Molecular Recognition
Features (MoRFs) or simply linear motifs, interact with domain
partners of various kinds. They do not always mediate specificity of
enzyme action, but also the formation of large protein complexes and
signal transduction through structured interaction. For additional
information on motifs and resources available to study them please see
Section 5.5.
An important area in which to consider the synergistic interactions
between disordered protein and globular domains is viral pathology. In
a recent study by Hagai et al. (2014), 2,208 viral genomes were
inspected, 536 from prokaryotic hosts and 1,672 from eukaryotic hosts.
The proteins of each virus were annotated with ELM motifs that
mimicked conserved motif arrangements of the host mediating protein
interactions. The incidence of linear motif mimicry was more enriched
in eukaryotic species, with especially higher enrichment in animals
than plants. Looking at the expectation of co-occurrence of motifs in
whole proteins, both in viral and host genomes, Hagai et al.
demonstrated that mimicked motifs co-occurred as much as would be
expected from randomly selected disordered segments. This can be
interpreted as motif mimicry within viruses being more likely a
convergent and competitive phenomenon than the product of horizontal
gene transfer from the host. In this paradigm ELM interaction provides
an easily evolvable and flexible strategy for viruses to interact with
and inhibit or promote specific processes of the host cell. This
highlights the importance of the properties of motif rich disorder to
the host too, as direct exploitation of this system of regulation and
interaction is being increasingly elaborated and incorporated in
metazoans.
Tight synergistic coupling of disordered regions and domains within
the same peptide is common too, with common examples described from
proteins involved in cell-signalling processes, especially where
auto-inhibition is observed. The disordered regions adjacent to
domains related to signal transduction, known as auto-inhibitory
modules (IMs), provide fine tuned inhibition of protein activation or
activity. This is achieved through alternative splicing of IM length
as well as the inclusion of conserved motif regions specific to the
active site of a neighbouring domain (Trudeau et al., 2013). The
non-binding loop portions of IM regions are highly enriched with sites
of posttranslational modification such as phosphorylation. In the case
of phosphorylation, this modification allows for further fine tuning
and regulation according to the profile of kinases co-expressed in a
given cell type. Additional activator proteins can interact directly
with the IM region through further conserved motifs, rather than
needing to be specific to each associated signalling domain. These
properties make the disordered IM regions relatively easy to adapt for
rewiring of pathways, both over evolutionary time for a given gene and
over short time periods at each regulatory stage of the expressed
protein in a cell: transcription, splicing, translation, and
posttranslational processing and modification. Although the
supra-domain concept was originally intended as a description of
evolutionarily coupled globular domains, disordered IMs form what
could also be considered supra-domains within signalling proteins.
A final general relationship between disordered regions and well
structured globular domains is through chaperone activity (function-
ing as a catalyst to the folding process of a macromolecule). More
than 33% of protein chaperones are known to be disordered with
more than 50% of RNA chaperones being found in the disordered
state too (van der Lee et al., 2014). An interesting concrete example
from Foit et al. (2013) presents the unusual properties of HdeA, one
of the most abundantly expressed proteins found in the periplasmic
space of the bacterium Escherichia coli. E. coli is routinely found
travelling through the digestive system of animals, and as such it
requires a high tolerance for acidic environments. At neutral pH HdeA
is found in a well folded dimeric state packed at high concentration
in the periplasm, where it does not function as a chaperone. On
entering the acidic environment inside a host, HdeA dimers separate
due to charge imbalance from the drop in pH and unfold. In this
unfolded state HdeA monomers “wrap” surrounding proteins, binding to
their surface and helping to maintain tertiary structure. On the
increase in pH when E. coli moves through to the intestinal
environment, HdeA
slowly releases its partner proteins. This slow release over several
minutes is implicated in the chaperone activity disfavouring aggrega-
tion and promoting correct folding of the target protein.
The phenomenon of including mixtures of disordered segments and
globular domains in a protein's structural architecture is not limited
to auto-inhibition and signalling. Babu et al. (2012) provide a concise
view on the continuum of structured state both for a single domain
transitioning between states, and for the composition of disorder and
domains. The addition of disordered segments affords diversity of
function and interaction of the parent protein, with minimal cost in
the transfer of material between genes due to the much reduced con-
straint of a disordered region not needing to fold. This general opin-
ion on the synergistic composability of both domains and disorder
provides the basis from which most of my thesis develops. For further
reading on how disorder functions on its own, and in the context of
structural domains I highly recommend reading the comprehensive
review “Classification of Intrinsically Disordered Regions and Pro-
teins” by van der Lee et al. (2014).
Part II
INOSITOL PHOSPHATE SIGNALLING IN VIRIDIPLANTAE
This part outlines a bioinformatic and theoretical under-
taking to identify and describe what has happened to the
Inositol triphosphate signalling pathway in plants. All work
is my own apart from any experimental laboratory result
in Chapter 3, which is presented here with the permission
of both Dr Jean-Charles Isner of the University of Bristol
Guard Cell Lab and Dr Elodie Marchadier, who was
previously a member of the Guard Cell Lab before moving
to the French National Institute for Agricultural Research
(INRA).
2 INOSITOL-1,4,5-TRIPHOSPHATE MEDIATED CALCIUM RELEASE IN PLANTS
2.1 abstract
In this chapter I present a search of vascular land plants for proteins
homologous to the Inositol-triphosphate Receptor (ITPR) of humans.
A specific focus is placed on Arabidopsis thaliana to facilitate experi-
mental validation. Both root hair growth and stomatal opening dy-
namics in plants are controlled and regulated by ionic Calcium (Ca2+)
release from internal stores. This cellular function is similar to the
Inositol-triphosphate mediated release of Ca2+ in human muscle and
neural tissues. In these mammalian systems the ITPR gene produces
tetrameric protein complexes that are the primary channel for gated
ion release from the endoplasmic reticulum.
Although plants appear to have similar releases of Calcium to the
ITPR mediated Calcium release of mammals, I demonstrate in this
chapter that no direct gene homolog for ITPR can be found. However,
ITPR-like proteins are readily identifiable in green algae and various
protists, suggesting an ancient origin for ITPR and a general trend
of loss throughout the eukaryotic domain. Unlike previous studies
seeking an ITPR homolog in plants, I return to the assembled genomes
of thirty-two plants from the Phytozome project, and check for bad
gene predictions where some fragmented homologous protein sequence is
identifiable.
The chapter concludes with the discovery that the tetrameric channel
domain of ITPR has distant homology to only a single protein in
Arabidopsis thaliana and other land plants. The TPC1 protein in plants
is already known to form dimeric Calcium ion channels in the
tonoplast; however, each channel is still formed from a tetramer of
the channel domain, as in ITPR. This deep homology detection of the
channel domain from ITPR raises the question: if TPCs have
functionally replaced ITPRs in plants, how are they regulated? One
plausible answer to this is developed in the next chapter.
2.2 introduction and background
“No matter how much evidence for the phosphoinos-
itide signalling pathway in plants has been gathered, the
final step has not been realized as the gene corresponding
to plant InsP3-R [ITPR] has not yet been identified.”
– Ondřej Krinke et al., 2007.
This chapter outlines a deep search for a gene that resembles the
mammalian Inositol-1,4,5-triphosphate receptor in plants. I will show
that in green algae (Chlorophyta) a homologous ITPR gene is readily
identifiable, but in land plants (Embryophyta) this gene is absent,
indicative of gene loss at the root of Embryophytes. I will
demonstrate that this gene loss is not due to poor gene calling, and
that not even a pseudogene of similar sequence and domain composition
to mammalian ITPR is likely to exist. Later, in Chapter 3 on page 39,
I develop my own hypothesis of what might have happened in land plants
to accommodate the loss of ITPR.
2.2.1 Inositol-1,4,5-triphosphate signalling at the heart of Ca2+ ion release
from internal stores in Mammals and more broadly Metazoans
Inositol 1,4,5-triphosphate (IP3) acts as the secondary messenger in
the signal transduction cascade of the canonical inositol-phospholipid
signalling pathway of Metazoans. This signalling mechanism was
first suggested by Berridge and Irvine (1984) in a review collecting
all work at that time related to various inositol-phosphate metabol-
ism and Ca2+ release. The inositol-phospholipid pathway takes an
extracellular signal at low concentration such as the presence of a hor-
mone, and converts it into an internal ionic Calcium signal that can
more broadly affect the internal physiology of the cell in response
to the extracellular signal. In Humans and other animals the Inositol-
triphosphate Receptor (ITPR) protein families form tetrameric com-
plexes (Maeda et al., 1991) that act as ligand gated Ca2+ channels in
the endoplasmic reticulum (ER).
The IP3 signal cascade begins with a transmembrane G protein-
coupled receptor (GPCR) binding to an extracellular small molecule
or protein. The bound GPCR, in the presence of Guanosine triphosphate
(GTP), activates the cytosolic facing Gq α-subunit. This in turn
activates Phospholipase Cβ (PLCB), which hydrolyses
Phosphatidylinositol-4,5-bisphosphate (PIP2) in the surrounding
membrane. The hydrolysis of PIP2 forms two products: IP3 and
Diacylglycerol (DAG). Conceptually we can imagine PLCB splitting the
hydrophilic head of PIP2 from its fatty acid tails (DAG) in the
membrane, leaving the soluble inositol-phosphatide ring IP3 to move
freely into the cytosol.
The liberated IP3 in the cytosol binds to each monomer of the ITPR
tetrameric complex (Meyer et al., 1990) inducing a ligand-gated re-
lease of stored Ca2+ from the lumen of the endoplasmic reticulum.
Finally this Ca2+ wave activates one of several specific calcium
dependent protein kinases (PKCs) that localise to the DAG left behind
in the plasma membrane. The PKC in question goes on to activate its
end effect target proteins through phosphorylation of specific Serine
and Threonine residues.
Figure 3: A sketch of the Inositol-triphosphate signal transduction cascade
pathway found in Metazoa. This figure is a reproduction of a fig-
ure by Alberts et al. (2002) on page 860 of Molecular Biology of the
Cell.
There are fifteen described families of PKC in the Human genome
(Nishizuka, 1995). Each has several mature splice products with tissue
specificity, enabling a wide set of physiological end effects for a
given signalling cascade. It’s worth noting that the chemistry taking
place at the end of the Calcium cascade is often localised into
nano-domains adjacent to the membrane, where a kinase is bound to DAG
from the initial onset of the extracellular signal. Please see Figure
3 on page 19 for a complete sketch of this signal transduction
pathway.
This system of extracellular signal transduction, the IP3 signalling
pathway, is important to health in humans. Of all FDA approved drugs
(as of 2006) 26.8%
target some form of GPCR and a further 7.9% target specific ligand-
gated ion channels of which ITPR is a single example (Overington
et al., 2006). So almost 35% of all known drugs target systems of a
similar nature to the Inositol-triphosphate signal cascade pathway in
humans, with much overlap/crosstalk in the pathways and proteins
involved in mediating a signal. The importance of these systems as
drug targets predominantly comes from their ubiquity in cells of dif-
ferent tissue types, and their modular reuse and altered dynamics
within distinct pathways from each cell type.
One might assume the diversity and specificity of Ca2+ signalling
seen in various cells comes purely from the GPCR association, or
from the presence of particular families of PKC as discussed previ-
ously. However, the observed dynamics of cellular Calcium release
are varied not just by the end effect or trigger of the signalling cas-
cade, but per tissue type even when a cascade is similar in function
(Berridge and Irvine, 1989).
The mature ITPR channel has a vast number of variants.
ITPR has been observed forming hetero-tetrameric complexes in
rat livers (Joseph et al., 1995; Monkawa et al., 1995) formed by the
mixing of translated transcripts of the three ITPR gene homologs
ITPR1-3. The transcriptional products for ITPR1 in lab rats have also
been found to produce seventeen distinct structural species of mature
mRNA. It is possible from this single ITPR gene that 23,001 hetero-
tetramer protein tertiary-structure variants could exist (Regan et al.,
2005). Given there are several homologous ITPR genes in both rat and
human, the total variation of ITPR tetrameric complexes becomes vast
and is selectively expressed in each tissue.
2.2.2 Physiological evidence for Inositol-1,4,5-triphosphate mediated sig-
nalling in Viridiplantae
There is conflicting and inconclusive physiological evidence for the
intermediate role of IP3 as a secondary messenger in higher land
plants, leading to recent questions over the validity of assuming this
system does exist. Of special note is that both Ryanodine receptor
(RyR) and ITPR channels are frequently mentioned in a very general and
collective manner in physiological experiments. These two genes also
share
very similar protein domain combinations and are similar enough
that at greater evolutionary distances from Human it is non-trivial to
distinguish homology.
Krinke et al. (2007) have written a near exhaustive and authoritat-
ive review of all experimental physiological evidence for IP3 medi-
ated Ca2+ release in higher plants from the early 1990s to late 2006.
Physiological evidence is abundant at the cellular level, but with only
a few experiments specifically characterising an ITPR-like protein me-
diating signal transduction.
Krinke et al. present doubt over a mammalian ITPR-like protein
being found in plants, and consider whether the reports are more
an artefact of the experimental methods. Their conclusion is that the
wealth of physiological evidence of Calcium release induced by IP3
microinjection in guard cells, as well as more direct immunoblot
evidence for structurally ITPR-like proteins such as that presented by
Muir and Sanders (1997), justifies the continued belief that an
ITPR-like ligand gated channel exists in land plants. Krinke et al.
briefly review previous domain homology analysis, and undertake their
own whole-protein search which they present as inconclusive, leaving
open the question of whether a plant ITPR homolog exists.
Krinke et al. also highlight the increased importance of IP3 as an in-
termediate in signalling involving further phosphorylation of IP3 to
IP5 and IP6 Inositol-phosphatides, as well as the alternate use of Di-
acylglycerol in downstream pathways through phosphorylation me-
diated by Diacylglycerol kinase. The latter is a system that competes
with PKC activity in Metazoa.
No conclusive molecular evidence for ITPR existing in plants exists.
Sanders et al. (2002) offer a less comprehensive review of Ca2+ re-
lease in plants, but include more discussion of the implications of
physiological experiments. They discuss a putative RyR channel
activated by Cyclic adenosine diphosphoribose (cADPR) and an unknown
Nicotinic acid adenine dinucleotide phosphate (NAADP) activated Ca2+
channel, both inferred to exist from physiological experiments but
with no candidate proteins yet described. Although these proposed ITPR
and RyR-like channels have been characterised in the endomembranes of
plants through electrophysiological and biochemical experiment,
nothing is described at the molecular level to implicate the specific
requirement of an ITPR-like ligand gated channel (Xiong et al., 2006).
The uncharacterised NAADP-activated Ca2+ channel mentioned by Sanders
et al. is also described in mammals, where the Two Pore Channel (TPC)
protein family was recently characterised as being responsible for this
function (Ruas et al., 2010; Calcraft et al., 2009).
In Arabidopsis thaliana a single TPC gene is known, atTPC1, and
this is a possible candidate for NAADP activity (Hedrich and Marten,
2011). In mammals NAADP and cADPR are produced by the
dual-function enzymes CD38 [ensembl:ENSP00000226279] (Lee, 2011)
and CD157 [ensembl:ENSP00000265016], both described as having this
dual synthesis function specific to RyR activity.
A plausible explanation for the lack of a RyR/ITPR candidate in
plants could be the coinciding lack of synthesis of cADPR by a plant
CD38-like protein. Using the SUPERFAMILY database we find that
both CD38 and CD157 come from a distinct structural superfamily
known as N-(deoxy)ribosyltransferase-like [scop:52309] in SCOP. We
find that there is no protein that contains an N-(deoxy)ribosyltransferase-
like (NRT) domain in any sequenced species of Viridiplantae. In-
stances of NRT are highly prevalent in more complex metazoans
and sporadic within fungi and some protists. This still leaves open the
question of whether a RyR-like channel is present in plants, and how both
NAADP and cADPR are synthesised. The key point we wish to com-
ment on is that atTPC1 could perhaps explain the unknown NAADP
activated Ca2+ release from the ER discussed by Sanders et al. rather
than a conserved pathway involving RyR. This will be an important
point expanded and discussed in Chapter 3.
2.2.3 Known structure of the Mammalian ITPR protein families
Figure 4: The IP3 binding-core region of human ITPR1 in the bound state
[pdb:1N4K]. Bottom, in yellow, is the beta trefoil of the MIR domain;
top, in red, are the side chains of an alpha helix from the armadillo-like
IP3 Binding Core domain.
ITPR proteins in mammals are on the order of ~2,700 amino acids in
length with both transmembrane and globular cytosolic domains. The
structure of a whole ITPR monomer has not been described to a high
accuracy. The specific region that binds IP3, the “IP3-binding domain”,
has been structurally characterised through X-ray crystallography to
a resolution of 2.2 Å by Bosanac et al. (2002). As shown in Figure 4 on
page 22, it comprises a beta-Trefoil domain and an Armadillo-like
multi alpha-helical domain. An additional high resolution structure
was resolved by Bosanac et al. (2005) for the most N-terminal “sup-
pressor domain” region of ITPR, finding that this region contained
an additional beta-Trefoil fold. In the SCOP classification of domains
both beta-Trefoil domains have been classified in the same family and
superfamily MIR domain [scop:82109]. The adjoining Armadillo-like
bundle found in the IP3-binding domain was classified as IP3 receptor
type 1 binding core, domain 2 (IP3BC) [scop:100909]. This arrangement
is more clearly understood from reviewing the domain architecture
schematic shown in Figure 5 on page 25.
IP3BC is always found alongside the MIR domain either forming
a functional unit under selection or a supra-domain as defined by
Vogel et al. (2004). IP3 is bound only at the interface of these two
conserved structures. See Figure 4 on page 22 for a detailed view of
this interface.
The MIR domain is not always found alongside the IP3BC and is
named after all the proteins it is present in: Mannosyltransferase
(EC 2.4.1.109), Inositol 1,4,5-triphosphate Receptor, and Ryanodine
Receptor (RyR). In Mannosyltransferase MIR occurs on its own as
a single domain protein without the IP3BC adjacent. For complete
details of the relation between SCOP/Pfam domains and the nomen-
clature used by Bosanac et al. please see Table 1 on page 23.
Bosanac et al. nomenclature         Residues    SCOP Superfamilies                                       Pfam Families
Suppressor domain                   1-226       MIR domain                                               Inositol 1,4,5-trisphosphate/ryanodine receptor
IP3-binding domain                  226-576     IP3 receptor type 1 binding core, domain 2; MIR domain   MIR domain; RIH domain
Modulatory and transducing domain   576-2100    IP3 receptor type 1 binding core, domain 2; ARM repeat   PB004081; PB012613; RIH domain; PB017883; PB007236; PB001126; RIH assoc.
Channel domain                      2100-2590   N/A                                                      PB001430; Ion transport protein
Coupling domain                     2590-2743   N/A                                                      PB000285; PB000188
Table 1: The five functional regions of ITPR1 as defined by Bosanac et al.
(2005), with the corresponding domain classifications from both
SUPERFAMILY and Pfam.
The in-complex structure of a whole ITPR channel tetramer has
been modelled at lower resolutions using Cryo-electron tomography
by various groups and is summarised well in a review by Taylor et al.
(2004). All models of complete channels agree on the IP3 binding core
being the tip of a large cytosolic domain covering the transmembrane
pore. From the tetrameric quaternary structure of ITPR it becomes
clearer how both the suppressor region and IP3 binding core operate.
Suppressor action arises from the suppressor domains of each
monomer reaching over the pore and binding to each other when in
contact with a secondary repressor protein. IP3 being in the bound
state also alters the angle between the IP3BC and MIR domains, affecting
the relative proximity of each copy of this region over the channel
pore.
Figure 5: A wrapped schematic of the SUPERFAMILY (above the central
black amino line) and Pfam (below the central amino line) domain
architecture assignment for the ITPR1 Human protein from En-
sembl [ENSP00000306253]. The five main functional regions as de-
scribed by Bosanac et al. are noted between red boundaries on the
amino acid ruler line. Additionally, the unmarked black and white
stripe demarcates the transcript exon boundaries for this protein
product.
2.3 IP3BC and MIR containing proteins across the tree of life
The majority of evidence for the presence or absence of ITPR proteins
in plants is based on whole-protein sequence homology using some
variant of BLAST. This strategy has several limitations: BLAST is
generally less sensitive than HMM methods, and traditional BLAST
variants do not account for the domain
assignment problem. Profile HMM search software such as HMMER
has much higher rates of true positives for finding multi domain pro-
teins than BLAST whilst accepting the same number of false positives
(Eddy, 2011). Previously Krinke et al. (2007) did attempt a HMMER
search for ITPR but used whole-protein similarity rather than focus-
sing on specific regions of ITPR required for IP3 gated activity.
From the known protein structure of the IP3 binding core region
of ITPR1, the presence of IP3BC is required for IP3 binding activity
and implies the protein is related to either ITPR or RyR channels. It
stands to reason that any distantly related channel that is IP3 gated
and evolutionarily related to ITPR must at least contain this domain.
Using the SUPERFAMILY (Gough et al., 2001) online resource one
can readily identify proteins containing both the IP3BC and MIR out-
side of Metazoans and more specifically within green plants.
2.3.1 ITPR-like protein readily identifiable in Chlorophyta
Focussing just on plant genomes available in SUPERFAMILY 1.75
as of May 17th 2013, two species and three genomes can be identi-
fied that contain the IP3BC domain. Moreover these proteins appear
to have more general domain architecture homology to ITPR/RyR-like
proteins. Figure 6 on page 27 shows the sTOL domain based
phylogenetic tree (Fang et al., 2013) pruned to show Viridiplantae
outgrouped by Cyanidioschyzon merolae, which is the only example
species from Rhodophyta in SUPERFAMILY.
Figure 6: A phylogenetic tree of IP3BC presence in Plantae from the
SUPERFAMILY database retrieved 2013-05-17. Cyanidioschyzon merolae
is an extremophilic red alga and is the only genome represented
from Rhodophyta. The next branch is between Embryophyta (top)
and Chlorophyta (bottom). Highlighted in red and green are the
three genomes where a homolog of mammalian ITPR or perhaps
RyR is readily identifiable from domain presence.
Table 2 on page 29 presents the domain assignments in Chloro-
phyta for the putative algal ITPR proteins highlighted in Figure 6
on page 27. Included are the expect values taken from the SUPER-
FAMILY resource: any assignment with an E-value less than 1E-4 is
considered as a significant hit. Interestingly the two strains of Vol-
vox carteri have differing domain assignments. It appears as if the
protein annotation in Volvox carteri v199 is truncated either as a res-
ult of poor assembly or a lack of conservation in ITPR even at the
level of strains of the same species in algae. I believe the former is
more likely as the SUPERFAMILY assignment statistics for V. carteri
v199 show that domain species diversity is reduced, there are fewer
protein sequences annotated, with a 3% reduction in the total num-
ber of those sequences with any domain assignment. Considering
the ITPR homolog just looks like one half of the protein from the V.
carteri f. nagariensis strain I suspect that longer genes have not been
correctly assembled or called for this genome. This is a common and
easily overlooked problem with low-depth short-read assembly using
a next-generation sequencing technology.
It is interesting to see an ITPR homolog in Chlorophyta as you
would expect to find a similarly well preserved gene in other clades of
Chlorophyta or even the Embryophyta if this gene had been strongly
conserved. Already this leads me to suspect that Embryophytes pro-
foundly lack a direct gene homolog to ITPR. Loss of the gene is a
likely hypothesis but other events could have led to ITPR being
highly altered or repurposed throughout all plants and only being
preserved in the Volvocales. To be sure that the ITPR gene was lost
I propose a full search for anything that resembles the various com-
ponents of the ITPR domain architecture to discount fission events or
large inserts and rearrangements in the IP3BC preventing a standard
HMM search from recognising this sequence.
2.4 Searching for proteins that have some homology to ITPR in Embryophytes
2.4.1 Model building
To perform a deep search of Embryophytes for an ITPR homolog I cre-
ated an ensemble of whole-protein and IP3BC region specific hidden
Markov models (HMM). The power of SUPERFAMILY to find dis-
tantly related domain homology is in its ability to use many HMM
models from multiple seed sequences to a single SCOP superfamily.
This increases the space of sequences recognisable as distantly related
improving sensitivity over building a single HMM model (Gough,
2002a).
In addition to the three Chlorophyta homologs mentioned in the
previous section, other non-mammalian ITPR homologs can be identified
using the SUPERFAMILY resource. From each of the sequences
the IP3BC region was cut out and used to build a single
HMM model, along with another whole-protein model. In total forty
models were built using ten different proteins from seven species, see
Table 3 on page 30 for a detailed list.
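The region cutting step can be sketched as follows. This is a minimal illustration in Python rather than the script used for this work: the function names cut_region and to_fasta and the toy sequence are hypothetical, and the coordinates are assumed to be 1-based and inclusive in the way the IP3BC regions are listed in Table 3.

```python
def cut_region(sequence: str, start: int, end: int) -> str:
    """Return residues start..end (1-based, inclusive) of a protein sequence."""
    if not 1 <= start <= end <= len(sequence):
        raise ValueError(
            f"region {start}-{end} is outside a sequence of length {len(sequence)}"
        )
    return sequence[start - 1:end]


def to_fasta(name: str, sequence: str, width: int = 60) -> str:
    """Format a sequence as a FASTA record wrapped at `width` residues."""
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return f">{name}\n" + "\n".join(lines) + "\n"


# Toy example only: a made-up 20 residue sequence, not a real ITPR protein.
toy = "MSDKMSSFLHIGDICSLYAE"
region = cut_region(toy, 5, 10)
print(to_fasta("toy/5-10", region), end="")
```

Each excised region written this way would give one seed FASTA file for the region specific model, with the uncut sequence serving as the seed for the whole-protein model.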
Genome: Volvox carteri f. nagariensis
Sequence ID: jgi|Volca1|121781|estExt_fgenesh5_synt.C_1280001
SCOP: MIR domain [scop:82109]; IP3 receptor type 1 binding core, domain 2 [scop:100909]; ARM repeat [scop:48371]
Pfam: Inositol 1,4,5-trisphosphate/ryanodine receptor [pfam:PF08709.6]; PB007702; RIH domain [pfam:PF01365.16]; PB006047; PF08454.6
E-value: 1.06E-16

Genome: Chlamydomonas reinhardtii
Sequence ID: jgi|Chlre4|153750|Chlre2_kg.scaffold_67000026
SCOP: MIR domain; IP3 receptor type 1 binding core, domain 2; Voltage-gated potassium channels [scop:81324]
Pfam: Inositol 1,4,5-trisphosphate/ryanodine receptor; RIH domain; PF08454.6
E-value: 1.03E-22

Genome: Volvox carteri v199
Sequence ID: Vocar20006779m|PACid:23131843
SCOP: ARM repeat; IP3 receptor type 1 binding core, domain 2
Pfam: RIH domain
E-value: N/A

Table 2: ITPR-like homologs found in Chlorophyta with E-values reported for
the IP3BC domain i.e. the IP3 receptor type 1 binding core and the MIR
domain being assigned together.
Species                        Sequence ID                                       Length  IP3BC region
Homo sapiens                   sp|Q14643|ITPR1_HUMAN                             2758    225-578
Homo sapiens                   sp|Q14571|ITPR2_HUMAN                             2701    225-577
Homo sapiens                   sp|Q14573|ITPR3_HUMAN                             2671    223-574
Amphimedon queenslandica       Aqu1.228673|PACid:15727201                        2654    222-561
Mucor circinelloides           jgi|Mucci1|82895|fgeneshMC_pg.5_#_852             2596    207-538
Phycomyces blakesleeanus       jgi|Phybl1|80438|estExt_fgeneshPB_pg.C_620017     2551    217-546
Emiliania huxleyi              jgi|Emihu1|209909|gm1.4000297                     2712    210-478
Volvox carteri f. nagariensis  jgi|Volca1|121781|estExt_fgenesh5_synt.C_1280001  3167    212-533
Chlamydomonas reinhardtii      jgi|Chlre4|153750|Chlre2_kg.scaffold_67000026     3140    173-514
Table 3: Seed sequences used to build both whole-protein and IP3BC
HMMER models.
For each seed sequence an HMM profile was iteratively built us-
ing the jackhmmer software from the HMMER3 suite (Eddy, 2009).
Jackhmmer was used with the --max option removing all heuristic
filtering which slows the model building time but improves sensitiv-
ity. For each seed sequence models from iteration three and five of
jackhmmer were used in any search to increase model diversity. The
UniRef100 non redundant sequence library was used as a background
for iterative alignment and profile building.
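The model building step can be sketched as a command constructed in Python. This is an illustration under assumptions, not the exact invocation used here: the --max option is the one described above, whereas the use of -N to cap the iterations and --chkhmm to save a checkpoint HMM per iteration (named PREFIX-N.hmm) is my assumption about how the iteration three and five models could be kept, and all file names are hypothetical.

```python
import shlex


def jackhmmer_command(seed_fasta, database, checkpoint_prefix, iterations=5):
    """Build a jackhmmer argument list that checkpoints an HMM per iteration."""
    return [
        "jackhmmer",
        "--max",                        # turn off all heuristic filters
        "-N", str(iterations),          # maximum number of iterations
        "--chkhmm", checkpoint_prefix,  # assumed to write PREFIX-N.hmm per round
        seed_fasta,
        database,
    ]


def checkpoint_files(checkpoint_prefix, wanted=(3, 5)):
    """Names of the checkpoint HMMs kept for searching (iterations 3 and 5)."""
    return [f"{checkpoint_prefix}-{n}.hmm" for n in wanted]


# Hypothetical file names; the command is only printed here, not executed.
cmd = jackhmmer_command("ITPR1_HUMAN.IP3BC.fa", "uniref100.fasta", "ITPR1_HUMAN.IP3BC")
print(shlex.join(cmd))
print(checkpoint_files("ITPR1_HUMAN.IP3BC"))
```

Note that jackhmmer stops early if the search converges, so a later checkpoint may not exist for every seed; a real pipeline would need to check for that.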
2.4.2 Search in twenty-five Embryophytes
Twenty-five plant genomes were available in the SUPERFAMILY protein
library at the time of writing. The majority of species represented
were Angiosperms: eighteen Eudicots and five Monocots, as well
as single examples from the Lycopodiophyta and Bryophyta. Many
of these came from the Phytozome collection (Goodstein et al., 2012);
others were taken from their respective source project websites.
Each HMM was searched against each genome using hmmsearch
with a relaxed expect value acceptance threshold of E ≤ 0.01. No
significant results were found for any IP3BC model. All whole-protein
models failed to find anything resembling the complete protein, with
no better hits to the IP3BC than the models specific to this region.
However, some homology to the full protein was found in every species
searched, specific to the MIR domain and the ‘Channel domain’
as defined by Bosanac et al. (2005) in mammalian ITPR.
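The acceptance threshold can be illustrated by filtering the per-sequence table that hmmsearch writes with its --tblout option. The sketch below is a hypothetical helper rather than the code used for these results: it assumes the standard whitespace separated tblout layout in which the fifth column is the full sequence E-value, and the two example rows are invented purely to exercise the parser.

```python
def parse_tblout(lines, max_evalue=0.01):
    """Yield (target, query, evalue, score) for hits at or below max_evalue."""
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment/header lines
        # The first 18 columns are fixed; the rest is free-text description.
        fields = line.split(maxsplit=18)
        target, query = fields[0], fields[2]
        evalue, score = float(fields[4]), float(fields[5])
        if evalue <= max_evalue:
            yield target, query, evalue, score


# Invented rows for illustration only; these are not results from this study.
EXAMPLE = """\
# target name  accession  query name  accession  E-value  score  bias ...
protA - ITPR_whole - 2.1e-05 30.2 0.1 3.0e-05 29.7 0.1 1.2 1 0 0 1 1 1 1 toy hit
protB - ITPR_whole - 0.5 5.0 0.0 0.7 4.5 0.0 1.0 1 0 0 1 1 1 0 toy miss
"""

for hit in parse_tblout(EXAMPLE.splitlines()):
    print(hit)
```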
From investigating all of the proteins with any homology to ITPR1
in mammals it’s clear they fall into two broad categories: those with a
single MIR domain that look to be likely Mannosyltransferases rather
than ITPR or RyR, and a second family of Two Pore Channel (TPC)
like Calcium ion channels. This second class is most interesting as
the channel domain as defined by Pfam in ITPR1 in all whole-protein
models aligns to the channel domains of the target plant TPC proteins.
More importantly, TPC-like proteins are the only proteins found in
plants to have homology to the ITPR channel; other Calcium ion
channels in the same Pfam family are not recovered with models built from
ITPR-like homologs from around the tree of life.
A complete listing of results and the domains recovered per spe-
cies can be found in Table 4 on page 32. Of special note is the re-
lationship between the number of TPC-like proteins to the number
of Ca2+ channel domains observed hitting the whole-protein ITPR
models. The TPC-like targets have two copies of the channel domain
found in ITPR; both are hit by the models, with a slight bias towards the
most C-terminal domain having the better alignment.
Genome                      Tax.     No Seq.  MIR  Ca2+  ARM  Other  TPC-like  Best Hit    Seq. ID
Arabidopsis thaliana        Eudi.    6        4    2     1    0      1         TPC-like    AT4G03560.1
Arabidopsis lyrata          Eudi.    5        3    3     0    0      1         TPC-like    jgi|Araly1|911837|scaffold_603894.1
Carica papaya               Eudi.    4        1    1     1    1      1         TPC-like    evm.TU.supercontig_848.1
Theobroma cacao             Eudi.    4        1    2     0    2      1         TPC-like    CGD0009922
Medicago truncatula         Eudi.    5        2    1     2    0      0         MIR-like    IMGA|CU019603_1.2
Glycine max                 Eudi.    15       4    2     4    5      1         TPC-like    Glyma13g37280.1
*Lotus japonicus            Eudi.    4        3    2     0    0      1         TPC-like    chr3.CM0160.230.nc
Fragaria vesca              Eudi.    12       2    2     1    8      1         MIR-like    gene15020
*Malus x domestica          Eudi.    11       2    1     0    8      0         Ca2+ chan.  MDP0000334484
Prunus persica              Eudi.    8        2    4     1    3      2         TPC-like    ppa001980m
Populus trichocarpa         Eudi.    8        2    2     3    2      1         TPC-like    jgi|Poptr1_1|837245|estExt_fgenesh4_pm.C_1420009
Manihot esculenta           Eudi.    9        8    2     0    1      1         TPC-like    cassava12501.valid.m1
*Ricinus communis           Eudi.    5        1    5     0    1      2         TPC-like    29983.m003190
*Solanum lycopersicum       Eudi.    3        1    2     0    1      1         TPC-like    SL1.00sc05390_201.1.1
Vitis vinifera              Eudi.    5        2    2     0    2      1         TPC-like    GSVIVT01020549001
Aquilegia coerulea          Eudi.    11       7    3     2    0      0         MIR-like    AcoGoldSmith_v1.011018m|PACid:18153340
Cucumis sativus             Eudi.    12       6    6     3    0      3         TPC-like    Cucsa.147640.1
Mimulus guttatus            Eudi.    10       7    2     0    1      1         TPC-like    mgf006798m
Brachypodium distachyon     Mono.    9        8    2     0    0      1         TPC-like    EG:BRADI2G46510.1
Oryza sativa ssp. japonica  Mono.    13       2    4     1    8      2         TPC-like    LOC_Os01g48680.2|13101.m05051|protein
Zea mays subsp. mays        Mono.    6        2    4     0    2      2         MIR-like    GRMZM2G144367_P01
Sorghum bicolor             Mono.    9        3    2     0    5      1         TPC-like    jgi|Sorbi1|5053513|Sb03g031110
*Phoenix dactylifera        Mono.    5        3    2     0    0      0         MIR-like    PDK_20s1454741g001
Selaginella moellendorffii  Spikem.  11       8    3     0    0      0         MIR-like    jgi|Selmo1|444987|estExt_fgenesh2_pg.C_500138
Physcomitrella patens       Mosses   17       3    20    1    3      9         TPC-like    jgi|Phypa1_1|94200|fgenesh1_pg.scaffold_256000041
Table 4: Summary results of searching ITPR-like protein models against 25
plant genomes. All “MIR-like” results are likely Mannosyltransferases.
Those genomes marked with a * are of draft quality at the
time of writing.
2.4.3 ITPR Calcium channel domain has homology to Two-pore Channels
of Embryophyta
So far I have focussed on the likelihood of the IP3BC+MIR supra-
domain being present in land plants, and that this form of IP3 bind-
ing activity exists. A key result from HMMER search of whole-protein
models of ITPR is that the Calcium channel domain is readily iden-
tifiable in the TPC-like family of proteins in most Embryophytes se-
quenced to date with results looking even stronger when taking into
account the draft quality of genomes not containing TPC-like pro-
teins.
It is clear from the results above that TPC proteins are a very good
candidate for finding a Calcium channel related to mammalian ITPR.
I will go into much more detail in the following chapter on the signi-
ficance of TPC proteins having remote homology to ITPR.
2.4.4 Disordered binding core in algal ITPR-like proteins
The longer length of the aligned IP3BC region in M. circinelloides, V.
carteri and C. reinhardtii seen in Table 3 is due to disordered inserts
between the MIR and IP3 binding core SCOP domains. Please see
Figure 44 on page 160 for a full domain architecture schematic of the
two Chlorophyta ITPR-like proteins, where the variable disordered
insert is clearly visible. Looking in greater detail at one of these insert
regions from V. carteri f. nagariensis one of the disordered regions is
of low sequence complexity and Glycine and Alanine rich:
ATGPGGAGGGGPGGEGAPAGAGGEFGGGAGAAAAAAAVSGGGGGEMQNVLHSADEDVLYDEA
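The Glycine and Alanine richness of this insert can be checked directly from the sequence. The following is a minimal sketch, assuming simple residue fractions and Shannon entropy over the whole insert as a crude stand-in for a proper low-complexity filter; the function names are hypothetical and this is not the method used to call the region disordered.

```python
from collections import Counter
from math import log2

# The disordered insert from V. carteri f. nagariensis quoted above.
INSERT = "ATGPGGAGGGGPGGEGAPAGAGGEFGGGAGAAAAAAAVSGGGGGEMQNVLHSADEDVLYDEA"


def composition(seq):
    """Fraction of the sequence made up by each residue type."""
    counts = Counter(seq)
    return {aa: n / len(seq) for aa, n in counts.items()}


def glycine_alanine_fraction(seq):
    """Combined fraction of Glycine (G) and Alanine (A) residues."""
    comp = composition(seq)
    return comp.get("G", 0.0) + comp.get("A", 0.0)


def shannon_entropy(seq):
    """Shannon entropy in bits; the maximum for 20 residue types is log2(20)."""
    return -sum(p * log2(p) for p in composition(seq).values())


print(f"G+A fraction: {glycine_alanine_fraction(INSERT):.2f}")
print(f"entropy: {shannon_entropy(INSERT):.2f} bits (max {log2(20):.2f})")
```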
From eukaryotic linear motif assignment via ELM (Dinkel et al.,
2012) shown in Figure 7 on page 34 I find that the final Serine of this
insert could be phosphorylated by a CK2 protein kinase, localising
to the sequence VLHSADE. CK2, like ITPR, is found as a tetrameric
assembly in vivo (Niefind et al., 2001), so is perhaps a key regulator of
this ITPR-like protein in V. carteri f. nagariensis. Although ELM has not
assigned a region of interaction at the point of the CK2 Serine phos-
phorylation, it’s quite likely this is a phosphorylation activated MoRF
region or “adhesive region” as expressed by Meggio and Pinna (2003)
in this context, perhaps interacting with a 14-3-3 protein or one of the
many other domains that interact with phosphoserine linear motifs
in signalling proteins.
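The position of this candidate site can be reproduced with a simple pattern scan. The sketch below assumes the classical minimal CK2 consensus of a Serine or Threonine followed three residues later by an acidic residue (S/T-x-x-D/E); this is a simplification I use for illustration and is looser than the pattern ELM applies, so it should not be read as the ELM assignment itself.

```python
import re

# The disordered insert from V. carteri f. nagariensis quoted above.
INSERT = "ATGPGGAGGGGPGGEGAPAGAGGEFGGGAGAAAAAAAVSGGGGGEMQNVLHSADEDVLYDEA"

# Minimal CK2 consensus (assumed simplification): S/T, any two residues, D/E.
# The lookahead lets overlapping candidate sites all be reported.
CK2_MINIMAL = re.compile(r"(?=[ST]..[DE])")


def ck2_candidate_sites(seq):
    """Return 1-based positions of S/T residues matching the minimal consensus."""
    return [match.start() + 1 for match in CK2_MINIMAL.finditer(seq)]


sites = ck2_candidate_sites(INSERT)
for site in sites:
    # Show the residue with three residues of context either side.
    print(site, INSERT[max(0, site - 4):site + 3])
```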
This is likely a novel interaction motif and worth investigating ex-
perimentally. See Yaffe and Elia (2001) for a review of the increasingly
diverse array of signalling domains associated with phosphoserine
other than 14-3-3. Conversely the SH2 ligand binding motif at the
end of this region requires a phosphotyrosine to be present, so either
a novel kinase binding adjacent to the YDEA motif needs to be present
or this is a false positive assignment with that region still plausibly
binding PDZ domain containing proteins.
Figure 7: V. carteri f. nagariensis disordered IP3BC insert ELM assignment,
much of which is predicted to be a MoRF region by ANCHOR as
well.
I have given some brief detail of this single disorder insert in the
IP3BC domain of V. carteri f. nagariensis. This region is not conserved
in sequence in M. circinelloides or C. reinhardtii but is conserved in
its disordered state predicted by the IUPred (Dosztányi et al., 2005)
disorder predictor. I suggest that these disordered insert regions are
an adaptation that has led to the continued inclusion of ITPR in those
specific species that still retain a recognisable ITPR-like protein.
One explanation for this is that these channels perhaps no longer
bind IP3 due to the separation of the MIR and IP3BC domains, but in-
stead allow the channel to be regulated by secondary signalling path-
ways mediated by MoRF interactions rather than IP3. Another hypo-
thesis is that with phosphorylation these disordered regions contract
due to favourable folding conditions and the formation of secondary
structure, leading to the MIR and IP3BC domains becoming adjacent
forming the IP3 binding site on activation.
These sorts of inserted-disorder functions are described for PIP2
binding activity in the signalling protein BIN1, given a ten residue
MoRF insert (Kojima et al., 2004; Weatheritt and Gibson, 2012) that
has an SH3 binding motif interlaced with a PIP2 binding region. How-
ever, this insertion is at the level of transcriptional variance within a
single organism, rather than divergence of sequence between species.
My expectation is that the overall mechanism is still relatively analog-
ous in the disparate ITPR-like proteins seen around the eukaryotic
tree of life.
2.5 Exhaustive search of Embryophyta genomes for an ITPR-like protein
Previously I discarded MIR containing proteins from Embryophytes
as likely originating from Mannosyltransferase coding genes. How-
ever, these could be bad gene calls, with larger ITPR-like genes be-
ing present at these positions. Many plant genome assemblies rely
on Arabidopsis models for their gene calling and annotation so it is
possible bias has slipped in and continued to cause problems. Addi-
tionally, TPC-like genes could perhaps be missing further regulatory
regions not currently seen. Neither of these scenarios is likely, but to
be sure that more identifiable ITPR-like genes do not exist both must
be ruled out.
I have already shown that a homologous protein to mammalian
ITPR can be readily found within green algae, specifically the order
Volvocales. It is possible some remnant of an ITPR gene still exists
within the full nucleotide sequence of vascular plants, either as a
pseudogene or a fragment that has not decayed beyond the detection
of sequence homology. Understanding how and approximately when
the ITPR gene was lost from vascular land plants is important in un-
derstanding what has happened in these organisms to accommodate
the loss of this gene.
I propose the use of HMM profile based search against all six-frame
translated open reading frames (ORFs) of a given genome as the least
conservative and most exhaustive method of locating any likely sequence
fragment matching the ITPR binding core. This is similar in
fashion to how the ‘protein2genome’ model from exonerate (Slater and
Birney, 2005) allows one to search a protein sequence against a genome,
independent of intron inserts and frame shifts. The advantage
of my proposed technique over exonerate is that it uses a more sensitive
profile technique for protein homology detection; the disadvantage
is that frame shifts and phase changes between ORFs will not be
accounted for. However, gene calling has already been performed on
genomes with completed protein annotation. This study is an attempt
to do the least conservative search to identify any small piece of evid-
ence to disprove a default assumption that ITPR has been truly lost
in Embryophyta.
2.5.1 Checking for ITPR-like pseudogenes
Margin note: Search of nucleotide sequence in plants supports absence of IP3 binding site domains.
Hard masked assembled nucleotide files were taken from release nine
of the Phytozome (Goodstein et al., 2012) plant genome collection [1].
These genomic DNA files were then six-frame translated to attain all
plausible ORFs that can be produced in or outside of a known gene.
To create the translated amino acid sequence the sixpack tool from
EMBOSS 6.3.1 (Rice et al., 2000) was used with a minimum ORF size
of seven amino acids, breaking on stop codons.
[1] Phytozome files: ftp.jgi-psf.org/pub/compgen/phytozome/v9.0/
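To make the six-frame translation concrete, the idea can be sketched in plain Python. This is an illustrative stand-in and not the sixpack implementation: it assumes the standard genetic code, treats an ORF as any stretch between stop codons, and renders codons containing masked bases as X; the function names are hypothetical.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(dna):
    """Reverse complement of an upper-case DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate from the first base; codons with masked bases become X."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_orfs(dna, min_length=7):
    """Yield (strand, frame, peptide) for stop-delimited ORFs in all six frames."""
    dna = dna.upper()
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            for orf in translate(seq[offset:]).split("*"):
                if len(orf) >= min_length:
                    yield strand, offset + 1, orf


# Toy input: a start codon, six Alanine codons and a stop codon.
for orf in six_frame_orfs("ATG" + "GCT" * 6 + "TAA"):
    print(orf)
```

The peptides produced this way would then be written out as FASTA and searched with the profile HMMs exactly as for annotated proteins.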
Each translated ORF was scored against the fifth iteration IP3BC
model, built from the Human ITPR1 seed sequence from previous
protein searches. hmmsearch from the HMMER3 suite (Eddy, 2009)
was used to perform the search, accepting sequence matches with an
expect value less than or equal to 0.01.
The results for the translated nucleotide search were consistent with
the search of protein annotations found in Table 4 on page 32, except
that many genomes either had no significant hits or only fragments
of the MIR region were recovered. There does not appear to be any
recognisable remnant of the IP3BC domain in any vascular plant included
in the study.
2.5.2 Checking for bad or incomplete gene predictions
Table 4 on page 32 shows that six sequences were found with domains
related to ITPR in Arabidopsis thaliana. Using the TAIR10 interactive
genome browser (Lamesch et al., 2012) the flanking region from
every target gene to its neighbouring genes, both upstream and downstream
within the same chromosome, was selected and saved as a nucleotide
FASTA file. Each of these chromosome regions was six-frame translated
using the procedure outlined in the previous section, then searched against
all ITPR models as used in previous protein sequence searches. Alignments
to whole-protein ITPR models showed no new homology from
the translated sequence; it is therefore highly unlikely that any additional
exons for a larger gene are present around MIR containing proteins
or TPC-like proteins.
2.6 Conclusion
From the widespread but sparse distribution of recognisable ITPR
homologs over all eukaryotes it appears that the IP3BC domain was
present in the last eukaryotic common ancestor (LECA).
Since this time ITPR has been lost multiple times early in the develop-
ment of each kingdom, other than Metazoa. I propose that ITPR was
lost approximately 725–1200 million years ago in the evolutionary his-
tory of modern land plants, with the divergence of Streptophyta from
Chlorophyta. This puts the loss of the IP3BC supra-domain coincident
with the development of green land plants (Becker and Marin,
2009). Within Chlorophyta the mode of evolution was still loss of
ITPR, with only the Chlorophyceae or perhaps just the Volvocales
specifically retaining a recognisable ITPR gene or IP3BC domain. The
phylogenetic placement of recognisable conserved homologous se-
quence supports this hypothesis, and the profound lack of any sig-
nal from any Embryophyte nucleotide sequence does not raise any
significant doubt of an ancient loss event common to all land plants.
All conserved and recognisable IP3BC regions outside of Metazoa
appear to have disordered inserts between the MIR and IP3 bind-
ing core SCOP domains, perhaps facilitating additional regulation
through disorder-order protein interaction. These regions are only
conserved as disorder, with different linear motifs recognisable in
different species, even in more closely related species such as C.
reinhardtii and V. carteri. This suggests that ITPR has been lost, apart
from in species that have repurposed this protein, perhaps facilitated
by additional protein interactions within the binding core allowing
cross talk with alternative signalling pathways in these species.
Further work is necessary to support this idea, but certainly it’s
clear that within Metazoa the IP3BC aligns free of these disordered
inserts, and the sparse representation of this domain elsewhere in
eukaryotes requires this extra structure to be present.
Physiological evidence for Ca2+ signalling in root hairs and sto-
matal guard cells suggests that Embryophytes still have a signalling
system relatively analogous to mammalian IP3 signalling. Which sys-
tem of proteins is performing this role and if IP3 specifically is in-
volved remains an open question. The next chapter goes on to sug-
gest my working hypothesis for what might have happened since the
loss of ITPR in ancestral land plants.
3 AN ALTERNATIVE HYPOTHESIS FOR INOSITOL PHOSPHATE INDUCED CALCIUM RELEASE IN EMBRYOPHYTES
3.1 Introduction
In the previous chapter I demonstrated there appears to be no direct
whole protein homolog of ITPR found in Embryophyta. There is how-
ever some evidence of distant relation between the channel domain
of ITPR and TPC-like proteins of plants. Based on a working hypo-
thesis that TPC is the channel providing the cell physiology seen in
the majority of studies reporting ITPR activity this chapter develops
an alternative theory for how inositide signalling can still be involved.
3.2 Phosphoinositide metabolism in Embryophytes is very different to that found in mammals
Previously in Chapter 2 I mentioned that there are 15 families of PKC
homologs; these can be subdivided into three groups based on their
requirements for activation. The conventional group of PKCs (PKCα-γ)
require both DAG and Ca2+ for activation. There is additionally a
novel group (PKCδ,ε,η,θ) that localise to DAG in the membrane but do not
require Ca2+ release for kinase activity. The third group of
PKC genes (PKCζ,ι,N1,N2,N3) are atypical in that the active sites associated
with both DAG and Ca2+ binding are either cleaved, alternately spliced
from the protein or otherwise non-functional. These PKCs are instead
involved in Phosphatidylinositol (3,4,5)-triphosphate (PIP3) related
signalling, as the C2 domain of PKCs has conserved affinity for
PIP3 in all families (Nishizuka, 1995).
All PKCs have affinity for PIP3 but no Phosphatidylinositide
3-kinases (PI3K) that can metabolise PIP2 to PIP3 have been
documented to date in plants (Lee et al., 2008). Additionally not
a single PKC from any of the ubiquitous protein families found in
Mammals has been identified in plants through sequence homology.
This brings into question the validity of even looking for an ITPR-
IP3-like Ca2+ release channel in higher plants as there are in theory
no end effect kinases that are associated with the canonical case in
mammals. For an overview of the PI to PIPx pathways that do and
don’t exist in plants please see Figure 8 on page 40.
Physiological experiments still show some form of IP3 mediated release
is perhaps present. In this chapter I call into question the causal
mechanism of IP3 as the direct mediator of Ca2+ release from internal
stores in plants. Instead I propose a novel mechanism for Ca2+ release
involving atTPC1 in-complex-with or downstream-of a novel
regulatory protein I will refer to as the Two Pore Channel Regulator
subunit protein or TPCR.
Figure 8: Phosphoinositide metabolism in plants versus human. Figure
reproduced from the book chapter “Phospholipid Signalling in Root
Hair Development” written by Takashi Aoyama (Aoyama, 2009).
Blue blocks represent metabolites present in both humans and
plants, green solid arrows and black text represent pathways and
associated enzymes present in both plants and humans. Arrows
dashed in red with accompanying grey text are those pathways
only found in humans.
3.3 TPCs as an alternative channel to ITPR mediated
Inositol phosphate signalling in Embryophytes
Previously in Chapter 2 I showed that TPCs are not only the proteins
with the highest sequence homology to ITPR in Embryophytes, but are
also related to it specifically through their channel forming domain,
albeit distantly. I believe that this relationship is more than just a
reflection of a shared domain, and specifically reflects a shared
context for these channels with respect to Inositolphosphatide
signalling.
This idea has already been postulated by Wheeler and Brownlee
(2008) in their paper “Ca2+ signalling in plants and green algae –
changing channels”, along with many other plausible candidates.
However, no concrete evidence has been presented as to which of the
many channel families has taken on the role of ITPR. I believe that
TPCs are a strong candidate, but unusually through an as yet unproven
relationship with a second, hypothetical regulatory protein. This
argument draws on several emerging opinions around the role of
intrinsic disorder and posttranslational modification. However, from
my development of the D2P2 database discussed in Part III of this
thesis, I feel increasingly reassured that this is not too far removed
conceptually from what is likely happening in plants.
MattOates_PhD_Domains_and_Disorder-Final-v1_2

  • 1. D O M A I N S A N D D I S O R D E R TO WA R D S A S U F F I C I E N T E V O L U T I O N A RY D E S C R I P T I O N O F P R O T E I N S T R U C T U R E Matt E. Oates Department of Computer Science University of Bristol A dissertation submitted to the University of Bristol in accordance with the requirements of the degree of Doctor of Philosophy (PhD) in the Faculty of Engineering. September 2, 2014 – version 1.2.0
  • 2. Matt E. Oates: Domains and Disorder, Towards a Sufficient Evolution- ary Description of Protein Structure, © September 2nd 2014
  • 3. D E C L A R AT I O N I declare that the work in this dissertation was carried out in accor- dance with the Regulations of the University of Bristol. The work is original, except where indicated by special reference in the text, and no part of the dissertation has been submitted for any other academic award. Any views expressed in the dissertation are those of the au- thor. Bristol, September 2nd 2014 Matt E. Oates
  • 4.
  • 5. "Teaching should be such that what is offered is perceived as a valuable gift and not as a hard duty." – Albert Einstein Dedicated to my parents Stephen and Christine. It is only through reflection as an adult that I understand I had an extraordinary childhood filled with true teaching every day. This writing is dedicated to: the days of fossil hunting, the afternoons at the zoo recording observations, the hours under the stairs looking at spectra, the minutes of French coming out of a home made crystal set, the moments of discovery created from trashed record players found in a hedgerow, and to the countless other kind acts that have shaped my mind. Thank you.
  • 6.
  • 7. A B S T R A C T The general title given to this thesis represents the underlying ethos of my work that links most parts together, as well as being the motiva- tion I now have for future work. The main scientific concern I present within is a more specific evolutionary theory on what has happened in land plants to a well known Calcium cell-signalling pathway found in mammals. Namely the Inositol triphosphate mediated Calcium re- lease of ITPR channels in mammalian neural and muscle cells. This is discussed at length in Part ii, Chapters 2-3 of this thesis. Chapter 3 contains a detailed discussion surrounding an already known and characterised Calcium channel (TPC1) in Arabidopsis thaliana that was found to be related albeit very distantly to ITPR channels. Additional partner regulatory proteins are introduced and some justification is made that they interact directly with TPC1 providing it with regu- lated gated activity specifically in guard cells. During the development of the theory presented in Part ii it became essential to understand the location and function of disordered pro- tein regions in sequences over many species and genes. This lead to the production of a Database of Disordered Protein Predictions (D2P2) described in Part iii, as well as methods for visualizing multiple classes of protein annotations such as structural domains, posttrans- lational modification sites, and regions of protein disorder that fold on contact. In Chapter 5 discussion surrounding major results from producing D2P2 and its implications on the evolution of disordered protein state are presented. Finally in Part iv I introduce some relatively unrelated work invest- igating the domain content of a new genome being sequenced at the King Abdullha University of Science and Technology for the Dinofla- gellate species Symbiodinium microadriaticum. 
One of my main tasks in this collaboration was to identify proteins that mediate many en- dosymbiotic relationships carried out by Symbiodinium. Finding an example of a superfamily only found in a single clade of bacteria I identify a plausible target protein and mechanism for eluding Toll- like Receptors of host species. In the concluding Part v I summarise pieces of work that I have yet to finish, but include here to give some impression of the sorts of work I have been thinking about and hope to one day complete. vii
  • 8.
  • 9. A C K N O W L E D G M E N T S I would like to thank the Engineering and Physical Sciences Research Council for their funding (grant EP/E501214/1) of my PhD. The Bris- tol Centre for Complexity Sciences were responsible for developing my extended skill set, with their inspired interdisciplinary doctoral training program. I am grateful for the guidance and camaraderie I re- ceived during my time at the centre. Professor John Hogan especially was forthcoming in advice and support, both financial and practical for continuing my PhD when I moved on after a year working on evolutionary swarm robotics. The Department of Computer Science have also been instrumental in the production of this thesis, specific- ally their forward thinking policy towards freely available coffee in the common room. Professor Julian Gough as my primary supervisor blessed me with his brand of supervision. Whilst being wholly supportive he afforded me great freedom to pursue topics I found engaging on my own terms. From having seen the many differing styles of PhD supervi- sion I know this to be a rare fit. I feel very lucky to be amongst those graduating from the Gough lab. In time I’m sure I will find myself to be one amongst a large extended superfamily of successful and happy people Julian has helped to inspire and send on their way. Anywhere throughout this thesis where I have referenced the SU- PERFAMILY database I am respectfully acknowledging years of hard work by Professor Gough; that I believe many would give thanks for. It goes without saying that large bodies of knowledge, insight and ideas on protein evolution I espouse have been inherited from Professor Gough’s through his diligent tutelage. I hope to be able to repay this great gift in time. 
Professor Alistair Hetherington I have to thank for introducing me to the wonderful world of cell-signalling which as a computer scient- ist previously working in artificial life, robotics and collective beha- viour I have found most intriguing. Alistair has perhaps seen me at both my most intellectually wretched and rich states of being, but has always shown the same level of care and enthusiasm for my efforts. After the first nine months of exhaustively searching for a plant ITPR gene his excitement and thirst for a satisfactory answer was a core mo- tivation to continue. Alistair’s broad knowledge of the experimental field and literature moved theory forward when I often ran out of ideas. The findings by Larisch et al. (2012) were especially welcome to my inbox when I started to doubt my own ideas on the importance of the role of disorder in TPC1. ix
  • 10. Much of my PhD has been spent in collaboration with other stu- dents, postdoctoral researchers and primary investigators in other labs. I would especially like to thank both Dr Elodie Marchadier and Dr Jean-Charles Isner for their help. Without their insights and ex- perimental work I would not have continued to push the plant ITPR theory in Part II towards it’s current state. In Part III I present work I have carried out in collaboration with a select group of people in the field of disorder prediction, amongst them I would especially like to acknowledge Professor Keith Dunker’s support. Keith’s generous at- titude towards collaborative work and knowledge sharing has made him the unwritten third supervisor of my PhD. I would not have the data I present on disordered protein content, or have broken through with my work on ITPR in plants without his evangelical efforts to spread knowledge of disordered protein structure. With respect to my work on the Symbiodinium genome in Part IV, Dr Manuel Aranda contributed to my understanding of the surrounding biology during weekly video meetings with the King Abdullah University of Science and Technology. I am grateful that he took the extra time to answer my rather simplistic questions. There is much work I have not presen- ted in this thesis, and many others have worked with me on side pro- jects that have indirectly contributed to the work I have presented. I would especially like to thank Dr Owen Rackham, Dr David Mo- rais, Dr Hai Fang, and Adam Sardar for all of their input and shared knowledge. I give double thanks to Adam for being a good friend when they were in short supply, and to Owen for always being one step ahead enthusiastically showing the way forwards. My endless thanks to Natasha for her ever present support and companionship throughout my joint MRes and PhD. Living with me during this period of my life has not always been desirable or easy, and rarely fun and uplifting. 
I cannot imagine how she has managed to remain optimistic with only her faith in me as a person to guide her. With the end of this period I hope to revert back to the happier and calmer man she once knew almost five years ago. I’d also like to thank Ben Smithers, Jon Stahlhacke, Matthew Or- linski and Michael Sheldon for feedback and discussion about this manuscript and some welcome proof reading at the last minute. x
  • 11. C O N T E N T S i background 1 1 introduction and assumed knowledge 3 1.1 Domain Description of Proteins 3 1.1.1 What is a protein 3 1.1.2 Proteins fold 3 1.1.3 Domains and evolution 5 1.1.4 Domain structure prediction 7 1.2 Intrinsically Unstructured Protein 8 1.2.1 What is disordered protein structure 8 1.2.2 Disordered regions as flexible domain linkers 10 1.2.3 Synergy between domains and disordered regions 12 ii inositol phosphate signalling in viridiplantae 15 2 inositol-1,4,5-triphosphate mediated calcium re- lease in plants 17 2.1 Abstract 17 2.2 Introduction and Background 18 2.2.1 Inositol-1,4,5-triphosphate signalling at the heart of Ca2+ ion release from internal stores in Mam- mals and more broadly Metazoans 18 2.2.2 Physiological evidence for Inositol-1,4,5-triphosphate mediated signalling in Viridiplantae 20 2.2.3 Known structure of the Mammalian ITPR protein families 22 2.3 IP3BC and MIR containing proteins across the tree of life 26 2.3.1 ITPR-like protein readily identifiable in Chloro- phyta 26 2.4 Searching for proteins that have some homology to ITPR in Embryophytes 28 2.4.1 Model building 28 2.4.2 Search in twenty-five Embryophytes 30 2.4.3 ITPR Calcium channel domain has homology to Two-pore Channels of Embryophyta 33 2.4.4 Disordered binding core in algal ITPR-like pro- teins 33 2.5 Exhaustive search of Embryophyta genomes for an ITPR- like protein 35 2.5.1 Checking for ITPR-like pseudogenes 35 2.5.2 Checking for bad or incomplete gene predictions 36 2.6 Conclusion 36 xi
  • 12. xii contents 3 an alternative hypothesis for inositol phosphate induced calcium release in embryophytes 39 3.1 Introduction 39 3.2 Phosphoinositide metabolism in Embryophytes is very different to that found in Mammals 39 3.3 TPCs as an alternative channel to ITPR mediated Inosi- tol phosphate signalling in Embryophytes 40 3.3.1 Evidence for an ITPR-like protein specifically in plants is weaker than reported 41 3.3.2 Two-pore Channel dynamics can only be explained by additional protein interactions 42 3.3.3 Two-pore Channels of Embryophytes have a con- served C-terminal region of protein disorder 43 3.3.4 The C-terminal region is both vital for channel function, and appears to be a complex forming region 45 3.4 What about IP/PIP binding and regulation? 48 3.4.1 Inositol phosphate and Phosphatidylinositol phos- phate binding alternate domains 49 3.4.2 Arabidopsis has some unique expansion of pro- teins using PH domains 50 3.5 AT1G58230.1 as a plausible TPC1 regulatory subunit (TPC1R) 52 3.5.1 Co-expression with a putative PIK-like protein kinase associates TPCR with the TPC1 although indirectly 53 3.5.2 Is the C-terminal region of TPC1 a cryptic dimeris- ing coiled-coil regulated by PIKK and other ki- nase activity? 54 3.5.3 Guard cell Ca2+ dependent CO2 response pheno- type experimentally associated with both TPCR candidate and related PIKK through knockouts 56 3.6 Conclusion 57 iii protein disorder 61 4 creating a database of disordered protein pre- dictions d2 p2 63 4.1 Abstract 63 4.2 Introduction and Background 63 4.2.1 Predicting disorder 63 4.2.2 Databases that already exist and why they are not appropriate 66 4.3 Predicting Disorder 67 4.3.1 The SUPERFAMILY sequence library 67 4.3.2 Integrating predictions through consensus 68 4.3.3 Assessing predictor coverage with SCOP efficiently 69 4.4 Constructing a Database 70
  • 13. contents xiii 4.4.1 Sequence 70 4.4.2 Predictions 71 4.4.3 Search 71 4.4.4 Statistics 71 4.4.5 Reports 72 4.5 Future Work 73 5 the importance of disordered protein in cellu- lar life 75 5.1 Abstract 75 5.2 Disorder in each domain of life 75 5.3 Disorder and its association with posttranslational mod- ifications 77 5.4 Disorder and Alternative Splicing 79 5.5 Disorder and Eukaryotic Linear Motifs 80 5.5.1 What are Linear Motifs 80 5.5.2 ELMs in D2P2 81 5.5.3 Calculating self-information of motifs 81 5.6 Disorder Predictors Compared 84 5.7 Community impact of D2P2 86 5.7.1 CASP10 86 5.7.2 Pfam 86 iv new genomes 87 6 protein domain analysis of the symbiodinium sp. a1 genome 89 6.1 Abstract 89 6.2 Introduction and Background 90 6.2.1 KAUST and Reef Genomics 90 6.2.2 What is Symbiodinium 91 6.2.3 Relevance of Symbiodinium to global environ- mental science 91 6.2.4 Unusual genetics of dinoflagellates 92 6.3 Assessing Gene Annotation Quality 93 6.3.1 Several rounds of gene-calling 93 6.3.2 Using SUPERFAMILY to evaluate the state of gene- calling 94 6.4 Domain Content 98 6.4.1 Functional enrichment 98 6.4.2 Transcription factors 100 6.4.3 RISC, Dicer and Argonautes 101 6.4.4 Histone content 101 6.5 A FucT-like family domain identified for the first time in Eukaryotes 102 6.5.1 FucT-like from Helicobacter pylori or just a FuT from everywhere? 103 6.5.2 Structure of the putative FucT-like domain 104 6.5.3 Toxin reuse and host immune response evasion 107
  • 14. xiv contents 6.6 Phylogenetic Placement 108 6.6.1 The sTOL method 108 6.6.2 Tree building 109 6.6.3 The Chromalveolate hypothesis 112 6.6.4 Relevance of Symbiodinium’s placement on the tree 113 6.7 Conclusion 114 v future work 115 7 directions for future research 117 7.1 Protein domain decomposition 117 7.2 Conserved protein disorder 119 7.2.1 Conserved in sequence 119 7.2.2 Describing and categorising disordered regions within conserved domains 121 7.2.3 Characterising the TPC1 C-terminal conserved dis- order domain 122 7.3 Geographic distribution of protein structure 123 7.4 Closing remarks 126 bibliography 129 Bibliography 129 vi additional material 157 a tpr-like homologous proteins found in chloro- phyta 159 b symbiodinium additional data 161 b.1 Domain Annotation 161 b.2 RISC, Dicer and Argonautes 166 b.3 RAxML Tree of FucT-like related proteins 166 vii published papers 171 c d2 p2 : database of disordered protein predictions 173 d a daily-updated tree of (sequenced) life as a ref- erence for genome research 183 e the evolution of human cells in terms of pro- tein innovation 195
  • 15. L I S T O F F I G U R E S Figure 1 All states of Spinach Thylakoid soluble phos- phoprotein (TSP9). 9 Figure 2 Schematic of GRB2 and its short type disordered regions. 11 Figure 3 A sketch of the IP3 signal transduction cascade found in Mammals. 19 Figure 4 The IP3 binding-core region of human ITPR1 in the bound state. 22 Figure 5 Schematic of the Human ITPR1 protein as shown in the D2P2 resource. 25 Figure 6 A phylogenetic tree of IP3BC presence in Plan- tae from the SUPERFAMILY database. 27 Figure 7 V. carteri f. nagariensis disordered IP3BC insert ELM assignment. 34 Figure 8 Phosphoinositide metabolism in plants versus human 40 Figure 9 Disordered and structural architecture of the atTPC1 protein. Highlighted is an N-terminal trafficking motif region, and a conserved C-terminal disor- dered region. 44 Figure 10 Eukaryotic Linear Motif results for atTPC1 fil- tered by relevancy to Arabidopsis thaliana. 45 Figure 11 Structure prediction of the atTPC1 disordered tail yields multiple structural alignments. 47 Figure 12 TM-Alignments of the best I-TASSER model of atTPC1 48 Figure 13 Novel domain adjacency network for Arabidop- sis thaliana contains several Inositide binding re- lated structures. 51 Figure 14 Structural depiction of a plausible transient state after PIKK activated folding in the disordered tail of atTPC1. 55 Figure 15 Putative TPCR knockout experimental results. 57 Figure 16 PIK-like Kinase knockout experiment results. 58 Figure 17 A to scale venn diagram of the SUPERFAMILY and D2P2 sequence library in green versus all UniProt sequences in red. 67 Figure 18 Schematic of how the D2P2 predictor consensus is calculated (see Figure 21 on page 72 for a real example). 68 xv
  • 16. List of Figures
Figure 19 Simple Python function to calculate the shared coverage between two regions of assigned protein annotation. 69
Figure 20 A schematic for efficient calculation of overlapping coverage between two protein annotations. 70
Figure 21 D2P2 graphical reports with yet to be released data for the Ensembl Human genome annotated. 72
Figure 22 Disordered amino acid coverage per genome grouped by domain of life for each prediction method in D2P2. 75
Figure 23 Disordered amino acid coverage inside SUPERFAMILY predicted SCOP domains per genome, grouped by domain of life for each prediction method in D2P2. 76
Figure 24 Posttranslational modifications and their association with disordered amino acids 78
Figure 25 Relevant equations for calculating entropy measures. 82
Figure 26 Distribution of self information for all Eukaryotic Linear Motifs in whole genomes from D2P2. D = 0.1734, p-value < 2.2e-16 84
Figure 27 Distribution of self information rate for all Eukaryotic Linear Motifs in whole genomes from D2P2. D = 0.3349, p-value < 2.2e-16 85
Figure 28 A graph comparing the differing extents protein predictors assign intrinsic disorder, or smaller regions of disorder. 85
Figure 29 Microscope photography of Symbiodinium microadriaticum Fredenthal (CCMP2467) 90
Figure 30 The number of genes called with protein annotation and domain assignment in each round of gene calling. 95
Figure 31 A bar chart of unique domain assignment diversity during Symbiodinium gene calling. 96
Figure 32 Effect of calling longer genes later in the assembly on domain assignment length. 96
Figure 33 Coverage of domain assignment as the assembly progressed. 97
Figure 34 FucT-like related sequences maximum likelihood tree. 103
Figure 35 Structural alignment using TM-align of the I-TASSER Symbiodinium FucT protein model 104
Figure 36 H. pylori FucT binding sites are conserved in the I-TASSER model from Symbiodinium. 105
  • 17. Figure 37 A subtree from the tree built with Symbiodinium where placement is constrained to the NCBI classification polytopes. 110
Figure 38 A cladogram showing the subtree from the tree built with Symbiodinium free to be placed anywhere in the whole sTOL tree of life. 111
Figure 39 Results of domain decomposition using TF-IDF to rank importance. 118
Figure 40 Sequence 89704.m00121 from Trichomonas vaginalis contains an example of the largest disorder cluster at its N-terminus. 120
Figure 41 An evolutionary schematic of pre-BEACH PH-like domains. 122
Figure 42 Map of all sequenced species from the GBIF resource 123
Figure 43 Global distribution of the domain architecture of atTPC1 versus hsITPR1 for 106 Eukaryotic genomes. 124
Figure 44 ITPR-like homologous proteins found in Chlorophyta. 160
Figure 45 Dicer domain architectures from the draft Symbiodinium genome. 167
Figure 46 Argonaute domain architectures from the draft Symbiodinium genome. 168
Figure 47 Large scale sequence tree built for FucT-like related proteins. 169

LIST OF TABLES
Table 1 Five functional regions of ITPR as defined by Bosanac et al. (2005) 23
Table 2 ITPR-like homologs found in Chlorophyta with E-values reported for the IP3BC domain, i.e. the IP3 receptor type 1 binding core and the MIR domain being assigned together. 29
Table 3 Seed sequences used to build both whole-protein and IP3BC HMMER models. 30
Table 4 Summary results of searching ITPR-like protein models against 25 plant genomes. All “MIR-like” results are likely Mannosyltransferases. Those genomes marked with a * are of draft quality at the time of writing. 32
  • 18. List of Tables
Table 5 Eukaryotic Linear Motifs assigned to the C-terminal disordered region of atTPC1 near to the di-Arginine WD40 binding motif. 46
Table 6 Alternate domains that bind Inositol phosphatides in Arabidopsis thaliana. 50
Table 7 Putative PH+BEACH containing proteins of Arabidopsis thaliana with domain combinations implying a possible TPCR role. 53
Table 8 Amino acid probabilities used in self-information calculations based on the D2P2 sequence library propensities. 82
Table 9 Reproduction of Pfam coverage from Table 2 of “The challenge of increasing Pfam coverage in the human proteome” (Mistry et al., 2013) 86
Table 10 Molecular function GO-terms assigned using the dcGO resource only found in Symbiodinium versus an algal background. 99
Table 11 Table of transcription factor related domains for each round of gene calling. 100
Table 12 List of Symbiodinium Histone proteins. 102
Table 13 Domains present in the Symbiodinium genome after Modeller gene models were included in with the data freeze release. 161
  • 19. Part I BACKGROUND In this part I present some introduction and background knowledge assumed throughout the rest of this thesis. Any reader familiar with the general concept of what a protein molecule is would not lose out by skipping to the later sections of this part discussing disordered proteins and related properties.
  • 20.
  • 21. 1 INTRODUCTION AND ASSUMED KNOWLEDGE

“I still somewhat shudder at the thought that highly efficient, purposive, organizational elements, like the proteins, should originate in a random process. Yet many efficient and purposive media, e.g., language, or the national economy, also look statistically controlled, when viewed from a suitably limited aspect. On balance, I would therefore say that your argument is quite strong.” From a personal letter from John von Neumann to George Gamow, 25th July 1955.

1.1 domain description of proteins

1.1.1 What is a protein

Proteins are biomolecules: polymers formed of amino acid residues held together by peptide bonds. Each amino acid residue corresponds to one triplet (or codon) of nucleotides in the parent gene’s DNA, and the sequence of amino acids follows the order of the codons transcribed from the gene. The flow of information, stored in DNA, transcribed to messenger RNA intermediates and finally translated to amino acid sequence by a ribosome, was most famously codified in the central dogma of molecular biology by Crick (1970). However, the information content of a protein-coding gene is only fully apparent at the level of the expressed protein. Each protein has a unique and specific three-dimensional structure that directly relates to its biological function. The final shape or conformation a protein assumes starting from a linear chain is referred to as the protein’s fold.

1.1.2 Proteins fold

A nascent protein will self-organise through self-interaction and interaction with the surrounding solvent to reach a native ground state on time scales ranging from milliseconds to around a second.
The speed of protein folding was surprising to early researchers given the size and organisational complexity of known structures, especially under the naive assumption that all conformational arrangements of a protein are explored before a native state is reached at random.
  • 22. Both the problem with the speed of protein folding and its solution are best communicated through Levinthal’s paradox (Levinthal, 1969). Levinthal starts from the assumption that proteins fold through a random process, i.e. all proteins are random coils until they reach their native state. To find the minimum Gibbs free energy structure of a 150 amino acid protein having 450 degrees of freedom, a space of ~10^300 conformations would need to be inspected, given a bond angle accuracy of one-tenth of a radian. The time for a protein to naturally reach its native and stable fold is at most on the order of two seconds, so only ~10^8 conformations could be visited by a protein. From this thought experiment Levinthal went on to conclude that local short-range interaction and stability between amino acids must limit the total number of possible conformations visited as the folding process progresses.

Levinthal wasn’t wrong. In the simplified case of globular proteins in solution, an exact fold is largely due to local interaction between differing residues. However, early in the folding process conformational restriction is dominated by global hydrophobic effects caused by the aqueous environment. The hydrophobic effect “packs” residues with hydrophobic side chains towards the interior of the protein to form a dry core, as this is a more energetically favourable conformation. This idea is referred to more broadly as the hydrophobic collapse hypothesis of protein folding, with the initial collapsed intermediate referred to as a molten globule. These two processes give rise to what is known as a protein’s secondary and tertiary structure, primary structure being the defined sequence of amino acids. Other models and processes have been both observed and put forward for protein folding.
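The scale of Levinthal's paradox can be made concrete with a short calculation. The sketch below uses only the figures quoted in the text (~10^300 conformations, ~10^8 visited in about two seconds); the age of the universe is an outside value added purely for comparison, and the function and constant names are illustrative.

```python
import math

# Figures quoted in the text for Levinthal's thought experiment.
TOTAL_CONFORMATIONS = 1e300   # ~10^300 conformations for a 150-residue protein
VISITED_CONFORMATIONS = 1e8   # ~10^8 conformations visited while folding
FOLDING_TIME_S = 2.0          # upper bound on folding time, in seconds

# Assumption (not from the text): approximate age of the universe in seconds.
AGE_OF_UNIVERSE_S = 4.35e17


def exhaustive_search_time(total, visited, folding_time):
    """Seconds needed to visit `total` conformations at the implied sampling rate."""
    rate = visited / folding_time  # conformations sampled per second
    return total / rate


search_time = exhaustive_search_time(
    TOTAL_CONFORMATIONS, VISITED_CONFORMATIONS, FOLDING_TIME_S
)
print(f"exhaustive search: ~10^{math.log10(search_time):.0f} s")
print(f"ages of the universe: ~10^{math.log10(search_time / AGE_OF_UNIVERSE_S):.0f}")
```

Even allowing for generous error in every input, the gap of hundreds of orders of magnitude is what rules out a purely random search.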
For example, the folding of many metal-ion-containing proteins is driven by nucleation, initiated by electrostatic interaction of side chains with the ion.

The process of protein folding can be more broadly conceptualised by the folding funnel hypothesis, which suggests that as a protein folds through a space of intermediates it travels down a “pathway” descending a funnel (or minimum) in the Gibbs free energy landscape of possible conformations. The resultant amino acid arrangement minimises the global free energy for the tertiary structure of the protein. Small local minima along the sides of this theoretical folding funnel produce a rough space of transient conformers with intermediate ensembles of secondary structure, similar to the thought experiments of Levinthal.

Secondary structure arises from local effects due to backbone hydrogen bonding between amino acids, after the initial collapse of a protein. Backbone hydrogen bonding forms several recognisable secondary structures such as turns, helices and strands (Kabsch and Sander, 1983). Although secondary structure is tightly related to exact sequence, secondary structures can be arranged in alternate ensembles,
  • 23. forming approximately the same tertiary structure when folded. Even if portions of sequence are displaced relative to one another the same resultant tertiary structure will be formed. It is this observation that brings us to protein domains.

1.1.3 Domains and evolution

Protein domains are self-stable subunits of proteins that fold to well-aligned tertiary structure given the same sets of secondary structure motifs, but not necessarily recognisably similar sequences of amino acids. Domains can be thought of as atomic parts of large-scale structure common between either different proteins or homologous (evolutionarily related) proteins in different species. Domain structure often relates to specific protein function. Individual domains can be copied between two genes, resulting in the transfer of a functional protein subunit. Common examples of the more promiscuous domains of this kind transfer functions relating to the formation of protein quaternary structure. Protein quaternary structure is formed when several polypeptide chains bind together, at the interface between well-folded domains, to form a complex (Bennett et al., 1995; Heringa and Taylor, 1997). The WD40-repeat [scop:50979] domain is a great example: it forms a beta-propeller structure, facilitating protein quaternary structure with various degrees of radial symmetry.

Domain tertiary structure as defined by the superfamily classification is recognised by the conserved arrangement of a protein’s hydrophobic core and the 3D arrangement of its secondary structures. This is robust in terms of an evolutionary process by definition. Large-scale change to a protein’s core would prevent efficient folding to a given tertiary structure. The specific definition of any superfamily I mention throughout this thesis comes from the manually curated Structural Classification of Proteins (SCOP) database.
The concept of how a domain arises has also been formally modelled in studies such as Bornberg-Bauer and Chan (1999). Bornberg-Bauer and Chan (B&C) postulate that the thermodynamic stability of a protein sequence is negatively correlated with its mutational plasticity, or in other words positively correlated with its sequence conservation. B&C propose the concept of “super funnels” in the space of protein sequence similarity to define domains. Within super funnels point mutations in protein sequence smoothly affect thermodynamic stability. Well-folded and efficient domain families lie at local minima near the bottom of these selective funnels. Any point mutation away from these minima adversely affects thermodynamic stability, so a sequence must be under selective pressure unrelated to efficient folding to move away from this conserved sequence. The birth of a new fold in this theoretical paradigm can be thought of as a random sequence starting just within a selective horizon of a super funnel,
  • 24. with any selection pressure to fold efficiently pushing a sequence towards the canonical domain sequence.

In the SCOP definition and hierarchy of domain classification, superfamilies are those domains that share recognisably common ancestry for the folding of their hydrophobic core. Pethica et al. (2012) demonstrated that domains at the family level of SCOP classification in general agree with traditional sequence-only phylogenetic reconstruction of protein families.

Domains, although conserved in structure, still undergo modification and elaboration at the level of each amino acid, but more important to evolutionary understanding are the genetic mechanisms allowing for more rapid movement of whole domains between proteins (Bork, 1991). A point mutation in DNA can result in a non-synonymous change to a codon. Non-synonymous changes manifest either as a single substitution of an amino acid species at the given protein locus, or as a premature end to the protein when a stop codon is introduced. Additional large-scale structural changes in DNA such as chromosomal inversion and translocation can cause less localised changes to a given gene, resulting in protein truncation, elongation, fusion and fission. Another event that is common to all life is DNA replication slippage: slippage of a DNA polymerase can cause the extension of a protein. This is most readily observed as terminal duplication in a protein sequence, and is especially visible in proteins with conserved and repetitive C-terminal structure. Finally, the phenomenon of exon shuffling, seen exclusively in the domain Eukarya, is the long-range movement of exons from one gene to an unrelated gene, and can occur due to transposon activity or sexual recombination.
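The distinction between synonymous changes, amino acid substitutions and premature stops described above can be illustrated with a small sketch. It assumes the standard genetic code on the coding DNA strand; the function name and the category labels ("missense", "nonsense", "stop-loss") are conventional terms chosen here for illustration rather than taken from the text.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, amino acids listed for codons in TCAG order; '*' is stop.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_point_mutation(codon, position, new_base):
    """Apply a single-base change to a codon and classify its effect.

    `position` is the 0-based index within the codon. Returns the mutated
    codon and one of: 'synonymous', 'missense', 'nonsense', 'stop-loss'.
    """
    mutated = codon[:position] + new_base + codon[position + 1:]
    before, after = CODON_TABLE[codon], CODON_TABLE[mutated]
    if before == after:
        kind = "synonymous"   # same residue, no change to the protein
    elif after == "*":
        kind = "nonsense"     # premature stop codon truncates the protein
    elif before == "*":
        kind = "stop-loss"    # stop codon lost, protein is elongated
    else:
        kind = "missense"     # single amino acid substitution
    return mutated, kind


for codon, position, base in [("GAA", 2, "G"), ("GAA", 1, "T"), ("TAC", 2, "A")]:
    print(codon, "->", *classify_point_mutation(codon, position, base))
```

Only the last three categories are non-synonymous in the sense used above; large-scale events such as inversion or exon shuffling are outside the scope of this per-codon view.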
When these events in DNA map to boundaries between domains (specifically regions capable of self-folding) the likelihood of a biological function being retained is higher and thus selection for retaining these events is also higher. If an event breaks up a region of protein sequence that is capable of self-folding, this structure can become irreparably destabilised and unable to retain function in any gene affected by the event. The same holds for point mutations made at the surface of a protein structure within an active site. However, outside of an active site structural changes can slowly accrue with minimal cost to fitness due to a lack of selective pressure, plausibly creating new active sites with time. When whole genes are duplicated this is a key mechanism for functional repurposing of the new gene. Point mutations within the hydrophobic core of a protein fold could disrupt the ability of a protein to fold, either by the introduction of a hydrophilic species or of a species whose side chain cannot be accommodated, such as Arginine. If folding is negatively affected to the point of completely disrupting the ability of a domain to attain its native state then it is highly likely all function is lost and the mutation
  • 25. is selected against, or a domain is lost in the population. This is the basis of the domain perspective of protein evolution. New domain folds are created rarely at evolutionary time scales and by chance. In general more events exist that duplicate or cause the loss of a domain. However, even with domain duplication it is estimated that only between 0.4 and 4% of SCOP domain combinations at the superfamily level of classification are created convergently, ignoring tandem repeats (Gough, 2005).

A final consideration is that of multi-domain composition and how this can affect higher-order selective processes of a protein. Vogel et al. (2004) define the “supra-domain” by observing, in 131 sequenced genomes, common arrangements of domain pairs, of which they found 1,400 examples, and 166 examples of the less common domain triplet, that are conserved with an abundance of additional domain partners. What distinguishes a conserved supra-domain from a pair or triplet that simply has a high copy number is the number of partners: any random domain pair might be seen with 3.0 other superfamily species of domain partner, whereas a true duplet supra-domain has 5.5 superfamily partners on average, or 7.2 in the case of over-represented supra-domains such as those made up of P-loop domains.

1.1.4 Domain structure prediction

Since secondary structure mostly arises from local effects between nearby residues it is possible to predict this with a reasonable accuracy of ~80% (Pollastri and McLysaght, 2005) from a given protein sequence. If domains form from secondary structure motifs, it should be possible from predicting secondary structure to predict a classification of the domain.
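The SCOP hierarchy discussed in this section (Class, Fold, Superfamily, Family) can be represented very simply in code. The sketch below assumes domains are labelled with a SCOP concise classification string of the form class.fold.superfamily.family, such as "a.1.1.2"; the type and function names are illustrative and are not part of any SCOP or SUPERFAMILY tooling.

```python
from typing import NamedTuple, Optional


class ScopClassification(NamedTuple):
    scop_class: str
    fold: int
    superfamily: int
    family: int


LEVELS = ("class", "fold", "superfamily", "family")


def parse_sccs(sccs: str) -> ScopClassification:
    """Parse a concise classification string such as 'a.1.1.2'."""
    scop_class, fold, superfamily, family = sccs.split(".")
    return ScopClassification(scop_class, int(fold), int(superfamily), int(family))


def deepest_shared_level(a: ScopClassification, b: ScopClassification) -> Optional[str]:
    """Most specific level at which two domains are classified together, if any."""
    shared = None
    for level, x, y in zip(LEVELS, a, b):
        if x != y:
            break  # levels are nested, so stop at the first disagreement
        shared = level
    return shared


print(deepest_shared_level(parse_sccs("a.1.1.2"), parse_sccs("a.1.1.1")))
```

Two domains that share a level down to "superfamily" are treated as sharing common ancestry in the sense used throughout this thesis; sharing only "class" says no more than that they are built from similar secondary structure.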
Work on the structural classification of existing proteins has produced SCOP (Murzin et al., 1995), with the current version of the database (1.75 June 2009) taking 38,221 protein structures from the Protein Data Bank (PDB) (Rose et al., 2011) and classifying 110,800 domains. SCOP defines a hierarchy of classification that at the family level approximates relationships found through traditional phylogenetic techniques based on protein sequence alone.

The broadest category is the Class of a protein, which is related to the secondary structures that are present in a protein domain, such as alpha-helices or beta-sheets. So proteins that mostly or only contain alpha helices come under the alpha-helix Class. Below Class membership is the Fold, which separates protein space based on how secondary structures fold together. Then, with an increasing level of discrimination, the relatedness of the 3D position and orientation of each secondary structure within a domain can be used to form the Superfamily classification, where all proteins share a common evolution. To further classify proteins, sequence or primary structure is used. Domains that have similar 3D structure might have several different
  • 26. orderings in a sequence of the constituent secondary structure motifs. It is from this detailed view that a Family level classification is given, so proteins in the same family have very close evolutionary history, and their included domains are equally tightly related by association.

From a detailed classification of all the protein domains currently described, it is possible to see how often new structures fall into already classified areas of this protein space. Chothia (1992) estimated that approximately 1,000 superfamilies were required to completely classify all proteins at that time. Furthermore, a third of all newly described protein structures fall into previously recognised superfamilies.

As new structures are described through X-ray crystallography and similar means it should become increasingly possible to statistically infer through homology of sequence alone what Family or Superfamily classification a protein should be given. Gough et al. (2001) describe a method using Hidden Markov Models to predict SCOP classifications through sequence homology. Following this, in 2002 the method was released to the public domain in the form of the SUPERFAMILY online resource (Gough, 2002b). When sufficient sequence homology exists SUPERFAMILY annotates all proteins in all currently sequenced genomes with relevant SCOP classifications (Gough, 2006). It is this work and resource that is instrumental to the methods described in the rest of this thesis.

1.2 intrinsically unstructured protein

1.2.1 What is disordered protein structure

Intrinsically disordered or unstructured proteins (IDPs) exist as highly flexible polypeptide chains in vivo, behaving as an ensemble of conformational states with no stable tertiary structure (Uversky et al., 2000).
Regions of IDP can exist as unfolded chains or molten globules with well-developed secondary structure and often function through transition between differently folded states (Uversky, 2002). To some, the concept of unstructured protein being functional might not fit well with accepted theory on enzyme action, and the concept of exact 3D structure implying function. However, even theory surrounding active site function in well-structured compact globular enzymes includes the addition of flexibility and structural rearrangement with the move from Emil Fischer’s lock and key model to Daniel Koshland’s induced fit model. Here the dynamic switch in structure of the enzyme in the bound state with the selected-for substrate induces the desired exact 3D structure to perform a necessary function. Essentially the argument for functional unstructured or disordered protein is an extension of this argument. Fold transitions to an exact
  • 27. 3D arrangement through time yields the desired function. The fold dynamics of well-structured proteins are already known to follow pathways of folding that have many flexible intermediates; these are, however, transient, rather than the protein permanently remaining in a semi-molten state.

Disordered regions are highly enriched for many forms of posttranslational modification. A good example is the Thylakoid soluble phosphoprotein (TSP9) from Spinach (Figure 1), which was shown to flip between a folded trans-thylakoid-membrane state and an unfolded but stroma-accessible state with the addition of phosphate groups in two key disordered regions (Song et al., 2006). This bistable state, between the central helix being transmembrane or free in the stroma, also facilitates another key property of disordered proteins, that of activated protein-complex formation. When the helix is accessible a much larger protein complex is able to form at the interface of the thylakoid membrane to perform a task.

Figure 1: A single stable helix in black separates two disordered segments shown in blue and green-red of the Spinach Thylakoid soluble phosphoprotein (TSP9). All known states from NMR structure 2fft from PDB are shown superposed and aligned about the central helix. This figure has previously appeared in a special issue of Current Opinion in Structural Biology (Gough and Dunker, 2013)

Mechanisms for functional conformational transition include: binding with other proteins, nucleic acids, and various small molecules; as well as the addition of numerous posttranslational modifications such as phosphorylation, acetylation and methylation. Phosphorylation especially has been shown to be important for inducing fold transitions (Iakoucheva et al., 2004; Song et al., 2006).
A particularly striking example is found in Cytoplasmic Linker-associated Protein 2 (CLASP2), where a series of phosphorylated sites disrupt the “molecular velcro” holding microtubule networks together through
  • 28. retraction of disordered regions from a series of arginine residues interacting with the added phosphate groups (Kumar et al., 2012).

Biological functions of known IDPs are varied and their roles include: instigation of protein complex formation, molecular recognition as seen in nucleoporins of the nuclear pore complex (Yamada et al., 2010), signal transduction, transcriptional regulation and many other functions related to interaction and regulated activity (Dunker et al., 2008; Dyson and Wright, 2005).

Much work has been done on producing classification and annotation of known unstructured regions from 3D experimental data found in the PDB (Rose et al., 2011), such as the DisProt (Sickmeier et al., 2007) and IDEAL (Fukuchi et al., 2012) resources. However, the past focus on structured protein domains has limited the total number of described IDP regions. For example: the current release of DisProt (6.00 2012-07-01) describes 667 proteins containing 1,467 verified disordered regions; the IDEAL database (as of 2012-05-09) describes 209 disordered proteins in detail, 97 of which have been experimentally verified to be structured and disordered over the same region under different conditions; and MobiDB (Di Domenico et al., 2012) has applied a method for identifying mobile regions from NMR structures (Martin et al., 2010) to 26,933 proteins (v1.2.1 as of 2012-11-01).

Due to the biases of structural resolution and the relative ease of predicting disordered state compared with de novo fold prediction, many algorithms have been developed to discover novel regions of disorder from amino acid sequence alone (He et al., 2009; Peng and Kurgan, 2012b). For full discussion on protein disorder prediction please see Part iii.

1.2.2 Disordered regions as flexible domain linkers

The concept of domain linkers is not new, and perhaps predates any mention of protein disorder.
However, their generality and importance in protein regulation have only become apparent within the last decade. Large-scale classification and characterisation of these regions has been attempted, a notable example being George and Heringa (2002). The properties of these linkers would later become known as “short” type disorder, with lengths reaching a maximum of around thirty-five amino acids. Although longer regions of disorder can still act as flexible links between domains, they usually contain additional regulatory and interaction motifs rather than existing purely as spacers for improving fold dynamics of their neighbouring domains.
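The distinction between “short” linker-like disorder and longer disordered regions can be sketched as a simple pass over per-residue disorder calls. This is only an illustration: treating thirty-five residues as a hard cut-off is an assumption made here (the text gives it as an approximate maximum), and the function name and input format are hypothetical rather than those of any particular predictor.

```python
from itertools import groupby

# Assumption: "around thirty-five" residues is used as a hard cut-off.
SHORT_MAX_LENGTH = 35


def disordered_regions(disorder_flags):
    """Yield (start, end, kind) for each contiguous disordered run.

    `disorder_flags` holds one boolean per residue (True = predicted
    disordered). Coordinates are 1-based and inclusive.
    """
    position = 0
    for is_disordered, run in groupby(disorder_flags):
        length = len(list(run))
        if is_disordered:
            kind = "short" if length <= SHORT_MAX_LENGTH else "long"
            yield position + 1, position + length, kind
        position += length


# Hypothetical protein: a 7-residue linker and a 40-residue disordered tail.
flags = [False] * 10 + [True] * 7 + [False] * 20 + [True] * 40
for region in disordered_regions(flags):
    print(region)
```

A length threshold alone cannot say whether a region acts purely as a spacer; as noted above, longer regions usually carry additional regulatory and interaction motifs.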
  • 29. Figure 2: A schematic of the domain arrangement of human GRB2 [UP:P62993]. The DisProt track at the top of the figure reflects experimental results from published literature. We can see a slight peak in flexibility predicted in DynaMine around the second experimental disordered region, as well as some associated phosphorylation sites following this region regulating function.

Even very short flexible linkers, two to seven residues long, have been found to be significant, such as in the study by Yuzawa et al. (2001). They showed that Growth factor receptor-bound protein 2 (GRB2) in humans is a monomer in solution with the C-terminal SH3 domain joined to the middle SH2 domain by a flexible linker. The study compared their own NMR-based structures to the previously attained X-ray crystal structures that predicted GRB2 to exist in a single compact dimeric state with both the N- and C-terminal SH3 domains being in close contact. However, NMR evidence showed that the C-terminal SH3 domain existed in a multitude of states, from extended to bound to the other SH3 domain.

GRB2 functions as an adapter protein mediating signal transduction from cell membrane bound receptors such as the epidermal growth factor receptor (EGFR) to internal Ras/MAP kinase-signaling cascades. The central SH2 domain binds to phosphotyrosine motifs (RXXK) specific to its target protein kinases, the N-terminal SH3 domain binds
  • 30. to the membrane bound receptor, with the C-terminal SH3 domain dynamically folding to bring kinase, receptor and a target protein into contact. Single point mutations in the SH2 domain of GRB2 increase its binding affinity for given R-X-X-K targets 40-fold; this affords the SH2 domain specificity to a given kinase as well as selectivity through small changes to these sequences. This is a desirable trait for repurposing duplicate copies of a gene, as new GRB proteins could be used to rewire receptors to other pathways with only minor mutation events. Additionally, both SH3 domains bind Proline-rich motifs in both receptor and target proteins. These motifs are often referred to as Eukaryotic Linear Motifs (ELMs) or Short Linear Motifs (SLiMs), and regions where modification or interaction causes structural rearrangement are often referred to as Molecular Recognition Features (MoRFs). More about disordered protein regions and these introductory concepts is discussed in Part iii.

1.2.3 Synergy between domains and disordered regions

In the previous section I highlighted that the SH3 domain of GRB2 interacts with a specific motif or pattern of residues in another protein, in that example a kinase. However, these motifs, known by various names such as Eukaryotic Linear Motifs (ELMs), Molecular Recognition Features (MoRFs) or simply linear motifs, interact with domain partners of various kinds, and not always to mediate specificity of enzyme action; they also mediate the formation of large protein complexes, and signal transduction through structured interaction. For additional information on motifs and resources available to study them please see Section 5.5.

An important area in which to consider the synergistic interactions between disordered protein and globular domains is viral pathology. In a recent study by Hagai et al. 
(2014) 2,208 viral genomes were inspected from 536 prokaryotic hosts and 1,672 eukaryotic hosts. The proteins of each virus were annotated with ELM motifs that mimicked conserved motif arrangements of the host mediating protein interactions. The incidence of linear motif mimicry was enriched more in eukaryotic species, with especially higher enrichment in animals than plants. Looking at the expectation of co-occurrence of motifs in whole proteins, both in viral and host genomes, Hagai et al. demonstrated that mimicked motifs co-occurred as much as would be expected from randomly selected disordered segments. This can be interpreted as motif mimicry within viruses being more likely a convergent and competitive phenomenon than the product of horizontal gene transfer from the host. In this paradigm ELM interaction provides an easily evolvable and flexible strategy for viruses
  • 31. to interact with and inhibit or promote specific processes of the host cell. This highlights the importance of the properties of motif-rich disorder to the host too, as direct exploitation of this system of regulation and interaction is being increasingly elaborated and incorporated in metazoans.

Tight synergistic coupling of disordered regions and domains within the same peptide is common too, with examples described from proteins involved in cell-signalling processes, especially where auto-inhibition is observed. The disordered regions adjacent to domains related to signal transduction provide fine-tuned inhibition of protein activation or activity. This is achieved through alternative splicing that varies the length of the disordered auto-inhibitory module (IM), as well as the inclusion of conserved motif regions specific to the active site of a neighbouring domain (Trudeau et al., 2013). The non-binding loop portions of IM regions are highly enriched with sites of posttranslational modification such as phosphorylation. This modification allows for further fine-tuning and regulation given, in the case of phosphorylation, profiles of cell-type-specific co-expressed kinases. Additional activator proteins can interact directly with the IM region through further conserved motifs, rather than needing to be specific to each associated signalling domain. These properties make the disordered IM regions relatively easy to adapt for rewiring of pathways, both over evolutionary time for a given gene, and over short time periods at each regulatory stage of the expressed protein in a cell: transcription, splicing, translation, and posttranslational processing and modification. Although the term was originally intended as a description of evolutionarily coupled globular domains, disordered auto-inhibitory modules form what could also be considered supra-domains within signalling proteins.
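The linear motifs discussed in this section are short residue patterns, and resources such as ELM describe motif classes as regular expressions. A minimal sketch of scanning a sequence for such a pattern is shown below, using the R-X-X-K pattern mentioned earlier written as the regular expression "R..K". The sequence is invented for illustration and the function name is hypothetical; a match only indicates a candidate site, not a functional motif.

```python
import re


def find_motif(sequence, pattern):
    """Return (start, matched_text) for every match, including overlapping ones.

    `pattern` is a regular expression over one-letter amino acid codes;
    start positions are 1-based.
    """
    # A lookahead lets matches overlap, which plain finditer would skip.
    lookahead = re.compile(f"(?=({pattern}))")
    return [(m.start() + 1, m.group(1)) for m in lookahead.finditer(sequence)]


# Invented sequence, for illustration only.
sequence = "MAPRSLKAARKQKPPLP"
print(find_motif(sequence, "R..K"))   # R-X-X-K pattern from the text
print(find_motif(sequence, "P..P"))   # a simple proline-rich pattern
```

In practice such matches are filtered further, for example by requiring the site to fall within a predicted disordered region, since short patterns occur frequently by chance.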
A final general relationship between disordered regions and well structured globular domains is through chaperone activity (functioning as a catalyst to the folding process of a macromolecule). More than 33% of protein chaperones are known to be disordered, with more than 50% of RNA chaperones being found in the disordered state too (van der Lee et al., 2014). An interesting concrete example from Foit et al. (2013) presents the unusual properties of HdeA, one of the most abundantly expressed proteins found in the periplasmic space of the bacterium Escherichia coli. E. coli is routinely found travelling through the digestive system of animals, and as such it requires a high tolerance for acidic environments. At neutral pH HdeA is found in a well folded dimeric state packed at high concentration in the periplasm, where it does not function as a chaperone. On entering the acidic environment inside a host, HdeA dimers separate due to charge imbalance from the drop in pH and unfold. In this unfolded state HdeA monomers “wrap” surrounding proteins, binding to their surface and helping to maintain tertiary structure. On the increase in pH when E. coli moves through to the intestinal environment HdeA
slowly releases its partner proteins. This slow release over several minutes is implicated in the chaperone activity, disfavouring aggregation and promoting correct folding of the target protein.

The phenomenon of including mixtures of disordered segments and globular domains in a protein's structural architecture is not limited to auto-inhibition and signalling. Babu et al. (2012) provide a concise view on the continuum of structured state, both for a single domain transitioning between states and for the composition of disorder and domains. The addition of disordered segments affords diversity of function and interaction to the parent protein, with minimal cost in the transfer of material between genes due to the much reduced constraint of a disordered region not needing to fold. This general opinion on the synergistic composability of both domains and disorder provides the basis from which most of my thesis develops. For further reading on how disorder functions on its own, and in the context of structural domains, I highly recommend the comprehensive review “Classification of Intrinsically Disordered Regions and Proteins” by van der Lee et al. (2014).
Part II
INOSITOL PHOSPHATE SIGNALLING IN VIRIDIPLANTAE

This part outlines a bioinformatic and theoretical undertaking to identify and describe what has happened to the Inositol triphosphate signalling pathway in plants. All work is my own apart from any experimental laboratory result in Chapter 3, which is presented here with the permission of both Dr Jean-Charles Isner of the University of Bristol Guard Cell Lab and Dr Elodie Marchadier, who was previously a member of the Guard Cell Lab before moving to the French National Institute for Agricultural Research (INRA).
2 INOSITOL-1,4,5-TRIPHOSPHATE MEDIATED CALCIUM RELEASE IN PLANTS

2.1 abstract

In this chapter I present a search of vascular land plants for proteins homologous to the Inositol-triphosphate Receptor (ITPR) of humans. A specific focus is placed on Arabidopsis thaliana to facilitate experimental validation. Both root hair growth and stomatal opening dynamics in plants are controlled and regulated by ionic Calcium (Ca2+) release from internal stores. This cellular function is similar to the Inositol-triphosphate mediated release of Ca2+ in human muscle and neural tissues. In these mammalian systems the ITPR gene produces tetrameric protein complexes that are the primary channel for gated ion release from the endoplasmic reticulum. Although plants appear to have similar releases of Calcium to the ITPR mediated Calcium release of mammals, I demonstrate in this chapter that no direct gene homolog for ITPR can be found. However, ITPR-like proteins are readily identifiable in green algae and various protists, suggesting an ancient origin for ITPR and a general trend of loss throughout the eukaryotic domain. Unlike previous studies seeking an ITPR homolog in plants, I return to the assembled genomes of thirty two plants from the Phytozome project, and also check for bad gene predictions where some fragmented homologous protein sequence is identifiable.

The chapter is concluded with the discovery that the tetrameric channel domain of ITPR has distant homology to only a single protein in Arabidopsis thaliana and other land plants. The TPC1 protein in plants is already known to form dimeric Calcium ion channels in the tonoplast; however, because each TPC1 monomer carries two copies of the channel domain, the pore is still formed from a tetramer of channel domains, as in ITPR. This deep homology detection of the channel domain from ITPR raises the question: if TPCs have functionally replaced ITPRs in plants, how are they regulated?
One plausible answer to this is developed in the next chapter.
2.2 introduction and background

“No matter how much evidence for the phosphoinositide signalling pathway in plants has been gathered, the final step has not been realized as the gene corresponding to plant InsP3-R [ITPR] has not yet been identified.” – Ondřej Krinke et al., 2007.

This chapter outlines a deep search for a gene that resembles the mammalian Inositol-1,4,5-triphosphate receptor in plants. I will show that in green algae (Chlorophyta) a homologous ITPR gene is readily identifiable, but in land plants (Embryophyta) this gene is absent, indicative of gene loss at the root of Embryophytes. I will demonstrate that this gene loss is not due to poor gene calling, and that not even a pseudogene of similar sequence and domain composition to mammalian ITPR is likely to exist. Later, in Chapter 3 on page 39, I develop my own hypothesis of what might have happened in land plants to accommodate the loss of ITPR.

2.2.1 Inositol-1,4,5-triphosphate signalling at the heart of Ca2+ ion release from internal stores in Mammals and more broadly Metazoans

Inositol 1,4,5-triphosphate (IP3) acts as the secondary messenger in the signal transduction cascade of the canonical inositol-phospholipid signalling pathway of Metazoans. This signalling mechanism was first suggested by Berridge and Irvine (1984) in a review collecting all work at that time related to various inositol-phosphate metabolism and Ca2+ release. The inositol-phospholipid pathway takes an extracellular signal at low concentration, such as the presence of a hormone, and converts it into an internal ionic Calcium signal that can more broadly affect the internal physiology of the cell in response to the extracellular signal. In humans and other animals the Inositol-triphosphate Receptor (ITPR) protein families form tetrameric complexes (Maeda et al., 1991) that act as ligand gated Ca2+ channels in the endoplasmic reticulum (ER).
The IP3 signal cascade begins with a transmembrane G protein-coupled receptor (GPCR) binding to an extracellular small molecule or protein. The bound GPCR, in the presence of Guanosine triphosphate (GTP), activates the cytosolic facing Gq α-subunit. This in turn activates Phospholipase Cβ (PLCB), which hydrolyses Phosphatidylinositol-4,5-bisphosphate (PIP2) in the surrounding membrane. The hydrolysis of PIP2 forms two products: IP3 and Diacylglycerol (DAG). Conceptually we can imagine PLCB splitting the hydrophilic head of PIP2 from its fatty acid tails (DAG) in the membrane, leaving the soluble inositol-phosphatide ring IP3 to move freely into the cytosol.
The liberated IP3 in the cytosol binds to each monomer of the ITPR tetrameric complex (Meyer et al., 1990), inducing a ligand-gated release of stored Ca2+ from the lumen of the endoplasmic reticulum. Finally this Ca2+ wave activates one of several specific calcium dependent protein-kinases (PKCs) that localise to the DAG left behind in the plasma membrane. The PKC in question goes on to activate its end effect target proteins through phosphorylation of specific Serine and Threonine residues.

Figure 3: A sketch of the Inositol-triphosphate signal transduction cascade pathway found in Metazoa. This figure is a reproduction of a figure by Alberts et al. (2002) on page 860 of Molecular Biology of the Cell.

There are fifteen described families of PKC in the Human genome (Nishizuka, 1995); each has several mature splice products with tissue specificity, enabling a wide set of physiological end effects for a given signalling cascade. It is worth noting that the chemistry taking place at the end of the Calcium cascade is often localised into nano-domains adjacent to the membrane, where a kinase is bound to DAG from the initial onset of the extracellular signal. Please see Figure 3 on page 19 for a complete sketch of this signal transduction pathway.

The IP3 signalling pathway.

This system of extracellular signal transduction is important to health in humans. Of all FDA approved drugs (as of 2006) 26.8% target some form of GPCR and a further 7.9% target specific ligand-gated ion channels, of which ITPR is a single example (Overington et al., 2006). So almost 35% of these drugs target systems of a similar nature to the Inositol-triphosphate signal cascade pathway in humans, with much overlap and crosstalk in the pathways and proteins involved in mediating a signal. The importance of these systems as drug targets predominantly comes from their ubiquity in cells of different tissue types, and their modular reuse and altered dynamics within distinct pathways from each cell type.

One might assume the diversity and specificity of Ca2+ signalling seen in various cells comes purely from the GPCR association, or from the presence of particular families of PKC as discussed previously. However, the observed dynamics of cellular Calcium release are varied not just by the end effect or trigger of the signalling cascade, but per tissue type even when a cascade is similar in function (Berridge and Irvine, 1989).

The mature ITPR channel has a vast number of variants. ITPR has been observed forming hetero-tetrameric complexes in rat livers (Joseph et al., 1995; Monkawa et al., 1995), formed by the mixing of translated transcripts of the three ITPR gene homologs ITPR1-3. The transcriptional products for ITPR1 in lab rats have also been found to produce seventeen distinct structural species of mature mRNA. It is possible that 23,001 hetero-tetramer protein tertiary-structure variants could exist from this single ITPR gene (Regan et al., 2005). Given there are several homologous ITPR genes in both rat and human, the total variation of ITPR tetrameric complexes becomes vast and is selectively expressed in each tissue.
2.2.2 Physiological evidence for Inositol-1,4,5-triphosphate mediated signalling in Viridiplantae

There is conflicting and inconclusive physiological evidence for the intermediate role of IP3 as a secondary messenger in higher land plants, leading to recent questions over the validity of assuming this system does exist. Of special note is that both Ryanodine receptor (RyR) and ITPR channels are frequently mentioned in a very general and collective manner in physiological experiments. These two genes also share very similar protein domain combinations, and are similar enough that at greater evolutionary distances from Human it is non-trivial to distinguish their homologs from one another.

Krinke et al. (2007) have written a near exhaustive and authoritative review of all experimental physiological evidence for IP3 mediated Ca2+ release in higher plants from the early 1990s to late 2006. Physiological evidence is abundant at the cellular level, but only a few experiments specifically characterise an ITPR-like protein mediating signal transduction.

Krinke et al. present doubt over a mammalian ITPR-like protein being found in plants, and consider whether the reports are more
an artefact of the experimental methods. Their conclusion is that the wealth of physiological evidence of Calcium release induced by IP3 microinjection in guard cells, as well as more direct immunoblot evidence for structurally ITPR-like proteins such as presented by Muir and Sanders (1997), justifies the continued belief that an ITPR-like ligand gated channel exists in land plants. Krinke et al. briefly review previous domain homology analysis, and undertake their own whole-protein search which they present as inconclusive, leaving open the question of whether a plant ITPR homolog exists. Krinke et al. also highlight the increased importance of IP3 as an intermediate in signalling involving further phosphorylation of IP3 to IP5 and IP6 Inositol-phosphatides, as well as the alternate use of Diacylglycerol in downstream pathways through phosphorylation mediated by Diacylglycerol kinase. The latter is a system that competes with PKC activity in Metazoa. No conclusive molecular evidence exists for ITPR in plants.

Sanders et al. (2002) offer a less comprehensive review of Ca2+ release in plants, but include more discussion of the implications of physiological experiments. They discuss results relating to a putative RyR channel activated by Cyclic adenosine diphosphoribose (cADPR) and an unknown Nicotinic acid adenine dinucleotide phosphate-activated (NAADP) Ca2+ channel, both inferred to exist from physiological experiments but with no candidate proteins yet described. Although these proposed ITPR and RyR-like channels have been characterised in the endomembranes of plants through electrophysiological and biochemical experiment, nothing is described at the molecular level to implicate the specific requirement of an ITPR-like ligand gated channel (Xiong et al., 2006).

The uncharacterised NAADP-activated Ca2+ channel mentioned by Sanders et al.
is also described in mammals, with a recent characterisation of the Two Pore Channel (TPC) protein family as being responsible for this function (Ruas et al., 2010; Calcraft et al., 2009). In Arabidopsis thaliana a single TPC gene is known, atTPC1, and this is a possible candidate for NAADP activity (Hedrich and Marten, 2011).

In mammals NAADP and cADPR are produced by the dual-function enzymes CD38 [ensembl:ENSP00000226279] (Lee, 2011) and CD157 [ensembl:ENSP00000265016], both described as having this dual synthesis function specific to RyR activity. A plausible explanation for the lack of a RyR/ITPR candidate in plants could be the coinciding lack of synthesis of cADPR by a plant CD38-like protein. Using the SUPERFAMILY database we find that both CD38 and CD157 come from a distinct structural superfamily known as N-(deoxy)ribosyltransferase-like [scop:52309] in SCOP. We find that there is no protein that contains an N-(deoxy)ribosyltransferase-like (NRT) domain in any sequenced species of Viridiplantae. Instances of NRT are highly prevalent in more complex metazoans and sporadic within fungi and some protists. This still leaves questions over the presence of a RyR-like channel in plants, and over how both NAADP and cADPR are synthesised. The key point we wish to comment on is that atTPC1 could perhaps explain the unknown NAADP activated Ca2+ release from the ER discussed by Sanders et al., rather than a conserved pathway involving RyR. This will be an important point expanded and discussed in Chapter 3.

2.2.3 Known structure of the Mammalian ITPR protein families

Figure 4: The IP3 binding-core region of human ITPR1 in the bound state [pdb:1N4K]. Bottom, in yellow, is the beta trefoil of the MIR domain; top, in red, are the side chains of an alpha helix from the armadillo-like IP3 Binding Core domain.

ITPR proteins in mammals are on the order of ~2,700 amino acids in length, with both transmembrane and globular cytosolic domains. The structure of a whole ITPR monomer has not been described to a high accuracy. The specific region that binds IP3, the “IP3-binding domain”, has been structurally characterised through X-ray crystallography to a resolution of 2.2 Å by Bosanac et al. (2002). It is shown in Figure 4 on page 22 and is comprised of a beta-Trefoil domain and an Armadillo-like multi alpha-helical domain. An additional high resolution structure was resolved by Bosanac et al. (2005) for the most N-terminal “suppressor domain” region of ITPR, finding that this region contained an additional beta-Trefoil fold. In the SCOP classification of domains both beta-Trefoil domains have been classified in the same family and superfamily, MIR domain [scop:82109]. The adjoining Armadillo-like
bundle found in the IP3-binding domain was classified as IP3 receptor type 1 binding core, domain 2 (IP3BC) [scop:100909]. This arrangement is more clearly understood from reviewing the domain architecture schematic shown in Figure 5 on page 25.

IP3BC is always found alongside the MIR domain, either forming a functional unit under selection or a supra-domain as defined by Vogel et al. (2004). IP3 is bound only at the interface of these two conserved structures. See Figure 4 on page 22 for a detailed view of this interface. The MIR domain is not always found alongside the IP3BC and is named after all the proteins it is present in: Mannosyltransferase (EC 2.4.1.109), Inositol 1,4,5-triphosphate Receptor, and Ryanodine Receptor (RyR). In Mannosyltransferase MIR occurs on its own as a single domain protein without the IP3BC adjacent. For complete details of the relation between SCOP/Pfam domains and the nomenclature used by Bosanac et al. please see Table 1 on page 23.

Bosanac et al. Nomenclature | Residues | SCOP Superfamilies | Pfam Families
Suppressor domain | 1-226 | MIR domain | Inositol 1,4,5-triphosphate/ryanodine receptor
IP3-binding domain | 226-576 | IP3 receptor type 1 binding core, domain 2; MIR domain | MIR domain; RIH domain
Modulatory and transducing domain | 576-2100 | IP3 receptor type 1 binding core, domain 2; ARM repeat | PB004081; PB012613; RIH domain; PB017883; PB007236; PB001126; RIH assoc.
Channel domain | 2100-2590 | N/A | PB001430; Ion transport protein
Coupling domain | 2590-2743 | N/A | PB000285; PB000188

Table 1: The five functional regions of ITPR1 as defined by Bosanac et al. (2005), with the corresponding domain classifications from both SUPERFAMILY and Pfam.

The in-complex structure of a whole ITPR channel tetramer has been modelled at lower resolutions using Cryo-electron tomography by various groups and is summarised well in a review by Taylor et al. (2004).
All models of complete channels agree on the IP3 binding core being the tip of a large cytosolic domain covering the transmembrane pore. From the tetrameric quaternary structure of ITPR it becomes clearer how both the suppressor region and IP3 binding core operate. Suppressor action arises from the suppressor domains of each monomer reaching over the pore and binding to each other when in contact with a secondary repressor protein. IP3 being in the bound state also alters the angle between the IP3BC and MIR domains, affecting the relative proximity of each copy of this region over the channel pore.
Figure 5: A wrapped schematic of the SUPERFAMILY (above the central black amino acid line) and Pfam (below the central amino acid line) domain architecture assignment for the human ITPR1 protein from Ensembl [ENSP00000306253]. The five main functional regions as described by Bosanac et al. are noted between red boundaries on the amino acid ruler line. Additionally, the unmarked black and white stripe demarcates the transcript exon boundaries for this protein product.
2.3 IP3BC and MIR containing proteins across the tree of life

The majority of evidence for the presence or absence of ITPR proteins in plants is based on whole-protein sequence homology using some variant of BLAST. This strategy has several limitations: BLAST is generally less sensitive than profile HMM methods, and traditional BLAST variants do not account for the domain assignment problem. Profile HMM search software such as HMMER has much higher rates of true positives for finding multi domain proteins than BLAST whilst accepting the same number of false positives (Eddy, 2011). Previously Krinke et al. (2007) did attempt a HMMER search for ITPR but used whole-protein similarity rather than focussing on specific regions of ITPR required for IP3 gated activity.

From the known protein structure of the IP3 binding core region of ITPR1, the presence of IP3BC is required for IP3 binding activity and implies the protein is related to either ITPR or RyR channels. It stands to reason that any distantly related channel that is IP3 gated and evolutionarily related to ITPR must at least contain this domain. Using the SUPERFAMILY (Gough et al., 2001) online resource one can readily identify proteins containing both the IP3BC and MIR outside of Metazoans, and more specifically within green plants.

2.3.1 ITPR-like protein readily identifiable in Chlorophyta

Focussing just on plant genomes available in SUPERFAMILY 1.75 as of May 17th 2013, two species and three genomes can be identified that contain the IP3BC domain. Moreover these proteins appear to have more general domain architecture homology to ITPR/RyR-like proteins. Figure 6 on page 27 shows the sTOL domain based phylogenetic tree (Fang et al., 2013) pruned to show Viridiplantae outgrouped by Cyanidioschyzon merolae, which is the only example species from Rhodophyta in SUPERFAMILY.
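The selection criterion described above (a protein must carry both the MIR and IP3BC superfamilies) can be sketched in a few lines. This is only an illustration: the dictionary layout and the function name `with_supra_domain` are hypothetical and not part of the SUPERFAMILY resource, while the sequence identifiers and superfamily assignments are those reported for the three Chlorophyta proteins in Table 2.

```python
# SCOP superfamily identifiers quoted in the text.
MIR = "scop:82109"      # MIR domain
IP3BC = "scop:100909"   # IP3 receptor type 1 binding core, domain 2
ARM = "scop:48371"      # ARM repeat
KCHAN = "scop:81324"    # Voltage-gated potassium channels

# Hypothetical in-memory layout: sequence ID -> set of assigned superfamilies.
# The assignments themselves follow Table 2.
assignments = {
    "jgi|Volca1|121781|estExt_fgenesh5_synt.C_1280001": {MIR, IP3BC, ARM},
    "jgi|Chlre4|153750|Chlre2_kg.scaffold_67000026": {MIR, IP3BC, KCHAN},
    "Vocar20006779m|PACid:23131843": {ARM, IP3BC},
}


def with_supra_domain(assignments, required=frozenset({MIR, IP3BC})):
    """Return the sequence IDs whose assignments include every required superfamily."""
    return sorted(
        seq_id for seq_id, domains in assignments.items() if required <= domains
    )


candidates = with_supra_domain(assignments)
for seq_id in candidates:
    print(seq_id)
```

Under these assumptions the truncated Volvox carteri v199 annotation is excluded because it lacks the MIR assignment, while it would still be recovered by a search requiring IP3BC alone.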
Figure 6: A phylogenetic tree of IP3BC presence in Plantae from the SUPERFAMILY database retrieved 2013-05-17. Cyanidioschyzon merolae is an extremophilic red alga and is the only genome represented from Rhodophyta. The next branch is between Embryophyta (top) and Chlorophyta (bottom). Highlighted in red and green are the three genomes where a homolog of mammalian ITPR or perhaps RyR is readily identifiable from domain presence.

Table 2 on page 29 presents the domain assignments in Chlorophyta for the putative algal ITPR proteins highlighted in Figure 6 on page 27. Included are the expect values taken from the SUPERFAMILY resource: any assignment with an E-value less than 1E-4 is considered as a significant hit. Interestingly the two strains of Volvox carteri have differing domain assignments. It appears as if the
protein annotation in Volvox carteri v199 is truncated, either as a result of poor assembly or a lack of conservation in ITPR even at the level of strains of the same species in algae. I believe the former is more likely, as the SUPERFAMILY assignment statistics for V. carteri v199 show that domain species diversity is reduced and there are fewer protein sequences annotated, with a 3% reduction in the total number of those sequences with any domain assignment. Considering that the ITPR homolog looks like just one half of the protein from the V. carteri f. nagariensis strain, I suspect that longer genes have not been correctly assembled or called for this genome. This is a common and easily overlooked problem with low-depth short-read assembly using a next-generation sequencing technology.

It is interesting to see an ITPR homolog in Chlorophyta, as one would expect to find a similarly well preserved gene in other clades of Chlorophyta or even the Embryophyta if this gene had been strongly conserved. Already this leads me to suspect that Embryophytes profoundly lack a direct gene homolog to ITPR. Loss of the gene is a likely hypothesis, but other events could have led to ITPR being highly altered or repurposed throughout all plants and only being preserved in the Volvocales. To be sure that the ITPR gene was lost I propose a full search for anything that resembles the various components of the ITPR domain architecture, to discount fission events or large inserts and rearrangements in the IP3BC preventing a standard HMM search from recognising this sequence.

2.4 searching for proteins that have some homology to ITPR in embryophytes

2.4.1 Model building

To perform a deep search of Embryophytes for an ITPR homolog I created an ensemble of whole-protein and IP3BC region specific hidden Markov models (HMM).
The power of SUPERFAMILY to find distantly related domain homology is in its ability to use many HMMs, built from multiple seed sequences, for a single SCOP superfamily. This increases the space of sequences recognisable as distantly related, improving sensitivity over building a single HMM (Gough, 2002a).

In addition to the three Chlorophyta homologs mentioned in the previous section, other non-mammalian ITPR homologs can be identified using the SUPERFAMILY resource. From each of the sequences the IP3BC region was cut out and used to build a single HMM, along with another whole-protein model. In total forty models were built using ten different proteins from seven species; see Table 3 on page 30 for a detailed list.
Genome: Volvox carteri f. nagariensis
Sequence ID: jgi|Volca1|121781|estExt_fgenesh5_synt.C_1280001
SCOP: MIR domain [scop:82109]; IP3 receptor type 1 binding core, domain 2 [scop:100909]; ARM repeat [scop:48371]
Pfam: Inositol 1,4,5-trisphosphate/ryanodine receptor [pfam:PF08709.6]; PB007702; RIH domain [pfam:PF01365.16]; PB006047; PF08454.6
E-value: 1.06E-16

Genome: Chlamydomonas reinhardtii
Sequence ID: jgi|Chlre4|153750|Chlre2_kg.scaffold_67000026
SCOP: MIR domain; IP3 receptor type 1 binding core, domain 2; Voltage-gated potassium channels [scop:81324]
Pfam: Inositol 1,4,5-trisphosphate/ryanodine receptor; RIH domain; PF08454.6
E-value: 1.03E-22

Genome: Volvox carteri v199
Sequence ID: Vocar20006779m|PACid:23131843
SCOP: ARM repeat; IP3 receptor type 1 binding core, domain 2
Pfam: RIH domain; N/A

Table 2: ITPR-like homologs found in Chlorophyta with E-values reported for the IP3BC domain, i.e. the IP3 receptor type 1 binding core and the MIR domain being assigned together.
Species                        Sequence ID                                       Length  IP3BC region
Homo sapiens                   sp|Q14643|ITPR1_HUMAN                             2758    225-578
Homo sapiens                   sp|Q14571|ITPR2_HUMAN                             2701    225-577
Homo sapiens                   sp|Q14573|ITPR3_HUMAN                             2671    223-574
Amphimedon queenslandica       Aqu1.228673|PACid:15727201                        2654    222-561
Mucor circinelloides           jgi|Mucci1|82895|fgeneshMC_pg.5_#_852             2596    207-538
Phycomyces blakesleeanus       jgi|Phybl1|80438|estExt_fgeneshPB_pg.C_620017     2551    217-546
Emiliania huxleyi              jgi|Emihu1|209909|gm1.4000297                     2712    210-478
Volvox carteri f. nagariensis  jgi|Volca1|121781|estExt_fgenesh5_synt.C_1280001  3167    212-533
Chlamydomonas reinhardtii      jgi|Chlre4|153750|Chlre2_kg.scaffold_67000026     3140    173-514

Table 3: Seed sequences used to build both whole-protein and IP3BC HMMER models.

For each seed sequence an HMM profile was iteratively built using the jackhmmer software from the HMMER3 suite (Eddy, 2009). Jackhmmer was used with the --max option, which removes all heuristic filtering; this slows model building but improves sensitivity. For each seed sequence, the models from iterations three and five of jackhmmer were both used in every search to increase model diversity. The UniRef100 non-redundant sequence library was used as a background for iterative alignment and profile building.

2.4.2 Search in twenty-five Embryophytes

Twenty-five plant genomes were available in the SUPERFAMILY protein library at the time of writing. The majority of species represented were Angiosperms: eighteen Eudicots and five Monocots, as well as single examples from the Lycopodiophyta and Bryophyta. Many of these came from the Phytozome collection (Goodstein et al., 2012); others were taken from their respective source project websites.

Each HMM was searched against each genome using hmmsearch with a relaxed expect value acceptance threshold of E ≤ 0.01. No significant results were found for any IP3BC model.
All whole-protein models failed to find anything resembling the complete protein, with no better hits to the IP3BC than the models specific to this region. However, some homology to the full protein was found in every species searched, specific to the MIR domain and the “Channel domain” as defined by Bosanac et al. (2005) in mammalian ITPR.

From investigating all of the proteins with any homology to ITPR1 in mammals it is clear they fall into two broad categories: those with a single MIR domain that look to be likely Mannosyltransferases rather than ITPR or RyR, and a second family of Two Pore Channel (TPC) like Calcium ion channels. This second class is most interesting, as the channel domain as defined by Pfam in ITPR1 in all whole-protein models aligns to the channel domains of the target plant TPC proteins. More importantly, TPC-like proteins are the only proteins found in plants to have homology to the ITPR channel; other Calcium ion channels in the same Pfam family are not recovered with models built from ITPR-like homologs from around the tree of life.

A complete listing of results and the domains recovered per species can be found in Table 4 on page 32. Of special note is the relationship between the number of TPC-like proteins and the number of Ca2+ channel domains observed hitting the whole-protein ITPR models. The TPC-like targets have two copies of the channel domain found in ITPR; both are hit by the models, with a slight bias towards the most C-terminal domain having better alignment.
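The model building and search procedure of Sections 2.4.1 and 2.4.2 can be sketched as follows. This is a minimal illustration under stated assumptions rather than the pipeline actually used: the thesis does not say how the models from iterations three and five were extracted, so the use of jackhmmer's `--chkhmm` checkpointing is an assumption, and all file names, the `<seed>.<region>` naming scheme, and the helper functions are hypothetical. The functions only assemble command lines and parse the `--tblout` table format of hmmsearch; nothing is executed.

```python
import itertools


def jackhmmer_command(seed_fasta, database, prefix, iterations=5):
    """Build a jackhmmer call that checkpoints the query profile at each iteration."""
    return [
        "jackhmmer",
        "--max",                # switch off all heuristic filters
        "-N", str(iterations),  # maximum number of iterations
        "--chkhmm", prefix,     # checkpoint profiles as <prefix>-<n>.hmm
        seed_fasta,
        database,
    ]


def model_files(seed_names, regions=("whole", "ip3bc"), iterations=(3, 5)):
    """Name the checkpoint profiles kept for searching (naming scheme is hypothetical)."""
    return [
        f"{seed}.{region}-{n}.hmm"
        for seed, region, n in itertools.product(seed_names, regions, iterations)
    ]


def hmmsearch_command(hmm_file, proteome_fasta, table_out, max_evalue=0.01):
    """Build an hmmsearch call reporting hits up to the relaxed E-value threshold."""
    return [
        "hmmsearch",
        "-E", str(max_evalue),
        "--tblout", table_out,
        hmm_file,
        proteome_fasta,
    ]


def significant_hits(tblout_lines, max_evalue=0.01):
    """Yield (target, query, E-value) from hmmsearch --tblout lines.

    In this format column 1 is the target name, column 3 the query name and
    column 5 the full-sequence E-value; comment lines start with '#'.
    """
    for line in tblout_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split()
        target, query, evalue = fields[0], fields[2], float(fields[4])
        if evalue <= max_evalue:
            yield target, query, evalue
```

With ten seed proteins, two regions per protein and two retained iterations, `model_files` enumerates the forty models described in the text; each would then be passed to `hmmsearch_command` once per genome.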
Genome                       Tax.     No Seq.  MIR  Ca2+  ARM  Other  TPC-like  Best Hit    Seq. ID
Arabidopsis thaliana         Eudi.    6        4    2     1    0      1         TPC-like    AT4G03560.1
Arabidopsis lyrata           Eudi.    5        3    3     0    0      1         TPC-like    jgi|Araly1|911837|scaffold_603894.1
Carica papaya                Eudi.    4        1    1     1    1      1         TPC-like    evm.TU.supercontig_848.1
Theobroma cacao              Eudi.    4        1    2     0    2      1         TPC-like    CGD0009922
Medicago truncatula          Eudi.    5        2    1     2    0      0         MIR-like    IMGA|CU019603_1.2
Glycine max                  Eudi.    15       4    2     4    5      1         TPC-like    Glyma13g37280.1
*Lotus japonicus             Eudi.    4        3    2     0    0      1         TPC-like    chr3.CM0160.230.nc
Fragaria vesca               Eudi.    12       2    2     1    8      1         MIR-like    gene15020
*Malus x domestica           Eudi.    11       2    1     0    8      0         Ca2+ chan.  MDP0000334484
Prunus persica               Eudi.    8        2    4     1    3      2         TPC-like    ppa001980m
Populus trichocarpa          Eudi.    8        2    2     3    2      1         TPC-like    jgi|Poptr1_1|837245|estExt_fgenesh4_pm.C_1420009
Manihot esculenta            Eudi.    9        8    2     0    1      1         TPC-like    cassava12501.valid.m1
*Ricinus communis            Eudi.    5        1    5     0    1      2         TPC-like    29983.m003190
*Solanum lycopersicum        Eudi.    3        1    2     0    1      1         TPC-like    SL1.00sc05390_201.1.1
Vitis vinifera               Eudi.    5        2    2     0    2      1         TPC-like    GSVIVT01020549001
Aquilegia coerulea           Eudi.    11       7    3     2    0      0         MIR-like    AcoGoldSmith_v1.011018m|PACid:18153340
Cucumis sativus              Eudi.    12       6    6     3    0      3         TPC-like    Cucsa.147640.1
Mimulus guttatus             Eudi.    10       7    2     0    1      1         TPC-like    mgf006798m
Brachypodium distachyon      Mono.    9        8    2     0    0      1         TPC-like    EG:BRADI2G46510.1
Oryza sativa ssp. japonica   Mono.    13       2    4     1    8      2         TPC-like    LOC_Os01g48680.2|13101.m05051|protein
Zea mays subsp. mays         Mono.    6        2    4     0    2      2         MIR-like    GRMZM2G144367_P01
Sorghum bicolor              Mono.    9        3    2     0    5      1         TPC-like    jgi|Sorbi1|5053513|Sb03g031110
*Phoenix dactylifera         Mono.    5        3    2     0    0      0         MIR-like    PDK_20s1454741g001
Selaginella moellendorffii   Spikem.  11       8    3     0    0      0         MIR-like    jgi|Selmo1|444987|estExt_fgenesh2_pg.C_500138
Physcomitrella patens        Mosses   17       3    20    1    3      9         TPC-like    jgi|Phypa1_1|94200|fgenesh1_pg.scaffold_256000041

Table 4: Summary results of searching ITPR-like protein models against 25 plant genomes. All “MIR-like” results are likely Mannosyltransferases. Those genomes marked with a * are of draft quality at the time of writing.
2.4.3 ITPR Calcium channel domain has homology to Two-pore Channels of Embryophyta

So far I have focussed on the likelihood of the IP3BC+MIR supra-domain being present in land plants, and of this form of IP3 binding activity existing. A key result from the HMMER search of whole-protein models of ITPR is that the Calcium channel domain is readily identifiable in the TPC-like family of proteins in most Embryophytes sequenced to date, with results looking even stronger when taking into account the draft quality of the genomes not containing TPC-like proteins. It is clear from the results above that TPC proteins are a very good candidate for finding a Calcium channel related to mammalian ITPR. I will go into much more detail in the following chapter on the significance of TPC proteins having remote homology to ITPR.

2.4.4 Disordered binding core in algal ITPR-like proteins

The longer length of the aligned IP3BC region in M. circinelloides, V. carteri and C. reinhardtii seen in Table 3 is due to disordered inserts between the MIR and IP3 binding core SCOP domains. Please see Figure 44 on page 160 for a full domain architecture schematic of the two Chlorophyta ITPR-like proteins, where the variable disordered insert is clearly visible. Looking in greater detail at the insert regions from V. carteri f. nagariensis, one of the disordered regions is of low sequence complexity and is Glycine and Alanine rich:

ATGPGGAGGGGPGGEGAPAGAGGEFGGGAGAAAAAAAVSGGGGGEMQNVLHSADEDVLYDEA

From eukaryotic linear motif assignment via ELM (Dinkel et al., 2012), shown in Figure 7 on page 34, I find that the final Serine of this insert could be phosphorylated by a CK2 protein kinase, localising to the sequence VLHSADE. CK2, like ITPR, is found as a tetrameric assembly in vivo (Niefind et al., 2001), so is perhaps a key regulator of this ITPR-like protein in V. carteri f. nagariensis.
Although ELM has not assigned a region of interaction at the point of the CK2 Serine phosphorylation, it is quite likely that this is a phosphorylation activated MoRF region, or “adhesive region” as expressed by Meggio and Pinna (2003) in this context, perhaps interacting with a 14-3-3 protein or one of the many other domains that interact with phosphoserine linear motifs in signalling proteins. This is likely a novel interaction motif and worth investigating experimentally. See Yaffe and Elia (2001) for a review of the increasingly diverse array of signalling domains, other than 14-3-3, associated with phosphoserine. Conversely, the SH2 ligand binding motif at the end of this region requires a phosphotyrosine to be present, so either a novel kinase binding adjacent to the YDEA motif needs to be present
  • 52. or this is a false positive assignment, with that region still plausibly binding PDZ domain containing proteins.

Figure 7: V. carteri f. nagariensis disordered IP3BC insert ELM assignment, much of which is also predicted to be a MoRF region by ANCHOR.

I have given some brief detail of this single disorder insert in the IP3BC domain of V. carteri f. nagariensis. This region is not conserved in sequence in M. circinelloides or C. reinhardtii, but it is conserved in its disordered state as predicted by the IUPred (Dosztányi et al., 2005) disorder predictor. I suggest that these disordered insert regions are an adaptation that has led to the continued inclusion of ITPR in those specific species that still retain a recognisable ITPR-like protein.

One explanation for this is that these channels perhaps no longer bind IP3, due to the separation of the MIR and IP3BC domains, and that the inserts instead allow the channel to be regulated by secondary signalling pathways mediated by MoRF interactions rather than IP3. Another hypothesis is that with phosphorylation these disordered regions contract, due to favourable folding conditions and the formation of secondary structure, bringing the MIR and IP3BC domains adjacent to one another and forming the IP3 binding site on activation.

These sorts of inserted-disorder functions are described for PIP2 binding activity in the signalling protein BIN1, given a ten residue MoRF insert (Kojima et al., 2004; Weatheritt and Gibson, 2012) that has an SH3 binding motif interlaced with a PIP2 binding region. However, this insertion is at the level of transcriptional variance within a single organism, rather than divergence of sequence between species. My expectation is that the overall mechanism is still relatively analogous in the disparate ITPR-like proteins seen around the eukaryotic tree of life.
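The two sequence-level observations made in Section 2.4.4 about the V. carteri f. nagariensis insert (Glycine/Alanine richness and a candidate CK2 site around VLHSADE) can be illustrated with a minimal Python sketch. The pattern used here, S/T-x-x-D/E, is a commonly quoted minimal CK2 consensus and is an assumption made for illustration; it is not the ELM motif definition used for Figure 7, and the function names are hypothetical.

```python
import re
from collections import Counter

# Disordered insert sequence quoted in Section 2.4.4.
INSERT = "ATGPGGAGGGGPGGEGAPAGAGGEFGGGAGAAAAAAAVSGGGGGEMQNVLHSADEDVLYDEA"

# Assumed minimal CK2 consensus: Ser/Thr, two arbitrary residues, then Asp/Glu.
# A lookahead is used so that overlapping candidate sites are all reported.
CK2_MINIMAL = re.compile(r"(?=([ST]..[DE]))")


def residue_fractions(seq: str) -> dict:
    """Return the fraction of each residue type, most common first."""
    counts = Counter(seq)
    return {aa: n / len(seq) for aa, n in counts.most_common()}


def find_ck2_sites(seq: str) -> list:
    """Return (1-based position, matched residues) for each candidate site."""
    return [(m.start() + 1, m.group(1)) for m in CK2_MINIMAL.finditer(seq)]


if __name__ == "__main__":
    fractions = residue_fractions(INSERT)
    print("G + A fraction:", round(fractions["G"] + fractions["A"], 2))
    for position, site in find_ck2_sites(INSERT):
        print("candidate CK2 site at residue", position, site)
```

A simple composition count such as this only hints at low complexity; it is no substitute for a disorder predictor such as IUPred or for the curated ELM patterns.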
  • 53. 2.5 Exhaustive search of Embryophyta genomes for an ITPR-like protein

Previously I discarded MIR containing proteins from Embryophytes as likely originating from Mannosyltransferase coding genes. However, these could be bad gene calls, with larger ITPR-like genes being present at these positions. Many plant genome assemblies rely on Arabidopsis models for their gene calling and annotation, so it is possible that bias has slipped in and continued to cause problems. Additionally, TPC-like genes could perhaps be missing further regulatory regions not currently seen. Neither of these scenarios is likely, but both must be ruled out to be sure that more identifiable ITPR-like genes do not exist.

I have already shown that a homologous protein to mammalian ITPR can be readily found within green algae, specifically the order Volvocales. It is possible that some remnant of an ITPR gene still exists within the full nucleotide sequence of vascular plants, either as a pseudogene or as a fragment that has not decayed beyond the detection of sequence homology. Understanding how and approximately when the ITPR gene was lost from vascular land plants is important in understanding what has happened in these organisms to accommodate the loss of this gene.

I propose the use of HMM profile based search against all six-frame translated open reading frames (ORFs) of a given genome as the least conservative and most exhaustive method of locating any likely sequence fragment matching the ITPR binding core. This is similar in fashion to how the ‘protein2genome’ model from exonerate (Slater and Birney, 2005) allows one to search a protein sequence against a genome, independent of intron inserts and frame shifts.
The advantage of my proposed technique over exonerate is that it uses a more sensitive profile technique for protein homology detection; the disadvantage is that frame shifts and phase changes between ORFs will not be accounted for. However, gene calling has already been performed on genomes with completed protein annotation. This study is an attempt to do the least conservative search to identify any small piece of evidence to disprove a default assumption that ITPR has been truly lost in Embryophyta.

2.5.1 Checking for ITPR-like pseudogenes

Search of nucleotide sequence in plants supports absence of IP3 binding site domains. Hard masked assembled nucleotide files were taken from release nine of the Phytozome (Goodstein et al., 2012) plant genome collection (Phytozome files: ftp.jgi-psf.org/pub/compgen/phytozome/v9.0/). These genomic DNA files were then six-frame translated to obtain all plausible ORFs that can be produced in or outside of a known gene. To create the translated amino acid sequence the sixpack tool from
  • 54. EMBOSS 6.3.1 (Rice et al., 2000) was used with a minimum ORF size of seven amino acids, breaking on stop codons. Each translated ORF was scored against the fifth iteration IP3BC model, built from the Human ITPR1 seed sequence from previous protein searches. hmmsearch from the HMMER3 suite (Eddy, 2009) was used to perform the search, accepting sequence matches with an expect value less than or equal to 0.01.

The results for the translated nucleotide search were consistent with the search of protein annotations found in Table 4 on page 32, except that many genomes either had no significant hits or yielded only fragments of the MIR region. There does not appear to be any recognisable remnant of the IP3BC domain in any vascular plant included in the study.

2.5.2 Checking for bad or incomplete gene predictions

Table 4 on page 32 shows that six sequences were found with domains related to ITPR in Arabidopsis thaliana. Using the TAIR10 interactive genome browser (Lamesch et al., 2012), the flanking region from every target gene to its neighbouring genes, both upstream and downstream within the same chromosome, was selected and saved as a nucleotide FASTA file. Each of these chromosome regions was six-frame translated using the procedure outlined in the previous section, but searched against all ITPR models as used in previous protein sequence searches. Alignments to whole-protein ITPR models showed no new homology from the translated sequence; it is therefore highly unlikely that any additional exons for a larger gene are present around MIR containing proteins or TPC-like proteins.

2.6 Conclusion

From the widespread but sparse distribution of recognisable ITPR homologs over all eukaryotes it appears that the IP3BC domain was present in the last eukaryotic common ancestor (LECA).
Since this time ITPR has been lost multiple times early in the development of each kingdom, other than Metazoa. I propose that ITPR was lost approximately 725–1200 million years ago in the evolutionary history of modern land plants, with the divergence of Streptophyta from Chlorophyta. This puts the loss of the IP3BC supra-domain coincident with the development of green land plants (Becker and Marin, 2009). Within Chlorophyta the mode of evolution was still loss of ITPR, with only the Chlorophyceae, or perhaps just the Volvocales specifically, retaining a recognisable ITPR gene or IP3BC domain. The phylogenetic placement of recognisable conserved homologous sequence supports this hypothesis, and the profound lack of any
  • 55. signal from any Embryophyte nucleotide sequence does not raise any significant doubt of an ancient loss event common to all land plants.

All conserved and recognisable IP3BC regions outside of Metazoa appear to have disordered inserts between the MIR and IP3 binding core SCOP domains, perhaps facilitating additional regulation through disorder-order protein interaction. These regions are only conserved as disorder, with different linear motifs recognisable in different species, even in more closely related species such as C. reinhardtii and V. carteri. This suggests that ITPR has been lost, apart from in species that have repurposed this protein, perhaps facilitated by additional protein interactions within the binding core allowing cross talk with alternative signalling pathways in these species.

Further work is necessary to support this idea, but it is certainly clear that within Metazoa the IP3BC aligns free of these disordered inserts, and that the sparse representation of this domain elsewhere in eukaryotes requires this extra structure to be present.

Physiological evidence for Ca2+ signalling in root hairs and stomatal guard cells suggests that Embryophytes still have a signalling system relatively analogous to mammalian IP3 signalling. Which system of proteins is performing this role, and whether IP3 specifically is involved, remain open questions. The next chapter goes on to suggest my working hypothesis for what might have happened since the loss of ITPR in ancestral land plants.
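As a rough illustration of the six-frame translation step described in Section 2.5.1, the following Python sketch enumerates every peptide of at least seven residues lying between stop codons in all six reading frames. It is a plain stand-in written for this purpose, not the EMBOSS sixpack tool used in the study. The function names are hypothetical, ORFs are not required to begin with a start codon, and hard-masked or ambiguous codons are translated to X; each of these choices is an assumption.

```python
from typing import Iterator, Tuple

# Standard genetic code, with bases ordered T, C, A, G at each codon position.
BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(dna: str) -> str:
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna: str) -> str:
    """Translate complete codons; unknown or masked codons become X."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_orfs(dna: str, min_len: int = 7) -> Iterator[Tuple[str, str]]:
    """Yield (frame label, peptide) for each stop-delimited ORF of min_len or more."""
    dna = dna.upper()
    for sign, strand in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            protein = translate(strand[offset:])
            for peptide in protein.split("*"):  # break on stop codons
                if len(peptide) >= min_len:
                    yield f"{sign}{offset + 1}", peptide


if __name__ == "__main__":
    example = "ATG" + "GCT" * 7 + "TAA" + "ATG" + "AAA" * 7
    for frame, peptide in six_frame_orfs(example):
        print(frame, peptide)
```

In a workflow like the one described, peptides produced this way would be written to a FASTA file and then searched with hmmsearch against the IP3BC model.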
  • 57. 3 AN ALTERNATIVE HYPOTHESIS FOR INOSITOL PHOSPHATE INDUCED CALCIUM RELEASE IN EMBRYOPHYTES

3.1 Introduction

In the previous chapter I demonstrated that there appears to be no direct whole protein homolog of ITPR in Embryophyta. There is however some evidence of a distant relation between the channel domain of ITPR and the TPC-like proteins of plants. Based on a working hypothesis that TPC is the channel providing the cell physiology seen in the majority of studies reporting ITPR activity, this chapter develops an alternative theory for how inositide signalling can still be involved.

3.2 Phosphoinositide metabolism in Embryophytes is very different to that found in mammals

Previously in Chapter 2 I mentioned that there are 15 families of PKC homologs; these can be subdivided into three groups based on their requirements for activation. The conventional group of PKCs (PKCα-γ) require both DAG and Ca2+ for activation. There is additionally a novel group (PKCδ,ε,η,θ) that localise to DAG in the membrane but do not require Ca2+ release for kinase activity. The third group of PKC genes (PKCζ,ι,N1,N2,N3) are atypical in that the active sites associated with both DAG and Ca2+ binding are either cleaved, alternately spliced from the protein or otherwise non-functional; these PKCs are instead involved in Phosphatidylinositol (3,4,5)-triphosphate (PIP3) related signalling, as the C2 domain of PKCs has conserved affinity for PIP3 in all families (Nishizuka, 1995).

All PKCs have affinity for PIP3, but no Phosphatidylinositide 3-kinases (PI3K) that can metabolise PIP2 to PIP3 have been documented to date in plants (Lee et al., 2008). Additionally, not a single PKC from any of the ubiquitous protein families found in Mammals has been identified in plants through sequence homology.
This brings into question the validity of even looking for an ITPR-IP3-like Ca2+ release channel in higher plants, as there are in theory no end effect kinases associated with the canonical case in mammals. For an overview of the PI to PIPx pathways that do and do not exist in plants please see Figure 8 on page 40.

Physiological experiments still show that some form of IP3 mediated release is perhaps present. In this chapter I call into question the causal mechanism of IP3 as the direct mediator of Ca2+ release from internal
  • 58. stores in plants. Instead I propose a novel mechanism for Ca2+ release involving atTPC1 in-complex-with or downstream-of a novel regulatory protein I will refer to as the Two Pore Channel Regulator subunit protein or TPCR.

Figure 8: Phosphoinositide metabolism in plants versus human. Figure reproduced from the book chapter “Phospholipid Signalling in Root Hair Development” written by Takashi Aoyama (Aoyama, 2009). Blue blocks represent metabolites present in both humans and plants; green solid arrows and black text represent pathways and associated enzymes present in both plants and humans. Arrows dashed in red with accompanying grey text are those pathways only found in humans.

3.3 TPCs as an alternative channel to ITPR mediated inositol phosphate signalling in Embryophytes

Previously in Chapter 2 I showed that TPCs are not only the proteins with the highest sequence homology to ITPR in Embryophytes, but that they are also related specifically through their channel forming domain, albeit distantly. I believe that this relationship is more than just a reflection of a shared domain, and specifically reflects a shared context of these channels with respect to Inositolphosphatide signalling.

This idea has already been postulated by Wheeler and Brownlee (2008) in their paper “Ca2+ signalling in plants and green algae – changing channels”, along with many other plausible candidates. However, no concrete evidence has been presented as to which of many channel families has taken on the role of ITPR. I believe that TPCs are a strong candidate, but unusually through an unproven relationship with a second hypothetical regulatory protein. This argument involves the development of several emerging current opinions around the role of intrinsic disorder and posttranslational modification.
However, from my development of the D2P2 database discussed in Part II of this thesis, I feel increasingly reassured that this is not too far removed conceptually from what is likely happening in plants.