An approach to describing and analysing bulk biological annotation quality: a case study using UniProtKB

Motivation:

Annotations are a key feature of many biological databases, used to convey our knowledge of a sequence to the reader. Ideally, annotations are curated manually; however, manual curation is costly, time-consuming and requires expert knowledge and training. Given these issues and the exponential increase of data, many databases implement automated annotation pipelines in an attempt to avoid un-annotated entries. Both manual and automated annotations vary in quality between databases and annotators, making assessment of annotation reliability problematic for users. The community lacks a generic measure for determining annotation quality and correctness, which we look at addressing within this work. Specifically, we investigate word reuse within bulk textual annotations and relate this to Zipf's Principle of Least Effort. We use the UniProt Knowledgebase (UniProtKB) as a case study to demonstrate this approach, since it allows us to compare annotation change both over time and between automated and manually curated annotations.

Results:

By applying power-law distributions to word reuse in annotation, we show clear trends in UniProtKB over time, which are consistent with existing studies of quality on free text English. Further, we show a clear distinction between manual and automated analysis and investigate cohorts of protein records as they mature. These results suggest that this approach holds distinct promise as a mechanism for judging annotation quality.


More information is available at the author's website: www.michaeljbell.co.uk

Usage Rights

© All Rights Reserved

  • For example, this is an analysis of Wikipedia. We find that word occurrence, plotted against word rank, broadly obeys a power law. Taken from: http://en.wikipedia.org/wiki/File:Wikipedia-n-zipf.png
  • We can relate these power laws to Zipf's principle of least effort. This states that... Point about reader and author. Different texts resolve this in different ways. By taking the exponent of the regression line, alpha, we can see that for Wikipedia the least effort is placed on the curator.
  • The first step of our approach is to extract the necessary data from UniProtKB. Our extraction process consists of four key steps. Firstly, we obtain each version of Swiss-Prot and TrEMBL and then extract just those lines that hold comments. We then extract all the words from this data, and remove topic and block headings. We can then count how frequently each word occurs, with the output being a list of all words and their occurrence counts.
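The extraction steps in the note above can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the function name `comment_word_counts`, the heading pattern and the tokenisation rule are all assumptions, and only topic headings of the form `-!- TOPIC:` are stripped (block headings are not handled here).

```python
import re
from collections import Counter

# Assumed pattern for topic headings such as "-!- FUNCTION:" or
# "-!- SUBCELLULAR LOCATION:" at the start of a comment line.
TOPIC_HEADING = re.compile(r"^-!-\s+[A-Z][A-Z ]*:\s*")

# Assumed tokenisation: a word starts with a letter and may contain
# letters, digits and hyphens.
WORD = re.compile(r"[a-z][a-z0-9-]*")


def comment_word_counts(flat_file_lines):
    """Count word occurrences in the comment (CC) lines of flat-file text."""
    counts = Counter()
    for line in flat_file_lines:
        # Keep only the lines that hold comments.
        if not line.startswith("CC"):
            continue
        text = line[2:].strip()
        # Drop the topic heading so it is not counted as annotation text.
        text = TOPIC_HEADING.sub("", text)
        counts.update(WORD.findall(text.lower()))
    return counts


example = [
    "ID   PAX6_RAT   Reviewed; 422 AA.",
    "CC   -!- FUNCTION: Transcription factor with important functions in the",
    "CC       development of the eye.",
    "CC   -!- SUBCELLULAR LOCATION: Nucleus (By similarity).",
    "DR   GO; GO:0000790; C:nuclear chromatin; IDA:BHF-UCL.",
]
word_counts = comment_word_counts(example)
```

Separator lines made only of dashes contribute no words under this tokenisation, while copyright lines inside the comment block are still counted, which is consistent with the copyright statements being visible in the later graphs.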
  • We can then fit a power-law distribution to this data. The result is a graph, as shown here. The graph is represented as a cumulative distribution function and is shown on logarithmic scales. Along the X axis we have the size of a word, that is, how frequently it occurs, whilst along the Y axis we have the probability of a word occurring X or more times. This graph isn’t straightforward, so as an example: the top left point shows that the probability of a word occurring once or more is 1, as only words that occur within the corpus are used. Conversely, the point at the bottom right shows that the probability of a word occurring over 100,000 times is very small. We can now apply this approach, initially to Swiss-Prot.
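The quantity plotted on the Y axis, the probability that a word occurs X or more times, can be computed directly from the word counts. The sketch below is illustrative only; `empirical_ccdf` is a hypothetical helper name, and the plotting on logarithmic axes is left out.

```python
from collections import Counter


def empirical_ccdf(word_counts):
    """Return (x, P(occurrences >= x)) pairs, one per distinct word size x."""
    sizes = list(word_counts.values())
    n = len(sizes)
    how_many_words_have_size = Counter(sizes)
    points = []
    remaining = n  # number of words with size >= the current x
    for x in sorted(how_many_words_have_size):
        points.append((x, remaining / n))
        remaining -= how_many_words_have_size[x]
    return points


toy_counts = {"the": 4, "protein": 2, "pax6": 1, "eye": 1}
points = empirical_ccdf(toy_counts)
```

The first point always has probability 1, because every word in the corpus occurs at least as often as the least frequent word, which matches the top left point described in the note.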
  • The first question to ask is: does Swiss-Prot obey a power law? It does broadly appear to, yes. However, there is a distinct structure, or kink, in the tail of the graph in a number of versions. So the question here is, what is this kink?
  • The kink turns out to be caused by copyright statements, which were added to every entry in a version. (Note: sort out graphs.) This shows that our approach can detect the introduction of data with no biological significance. It also shows that our approach is acting as a measure of quality, albeit by detecting artifacts.
  • Although Swiss-Prot obeys a power law, does it relate to the quality of biological knowledge? One way we can address this question is to compare automated and manual annotation. As shown previously, we can make the assumption that Swiss-Prot acts as a more mature resource than TrEMBL. So by analysing annotations at similar points in time between the two resources, we would expect Swiss-Prot to behave as the more mature resource.
  • By overlaying the graphs for TrEMBL and Swiss-Prot we can more clearly see how they mature over time. It is clear from this slide that they appear to diverge over time, with TrEMBL showing higher levels of re-use and Swiss-Prot showing a richer use of vocabulary. So it does indeed suggest our approach is acting as a measure of quality. However, the main analytical value of these graphs comes from the alpha value.
  • So we can show the alpha value over time for both Swiss-Prot and TrEMBL. This also provides a clearer image of effort over time. We can see how both databases show a decrease over time, that is, they appear to be becoming least effort for the annotator, although this progression is much more irregular in TrEMBL. This view shows two major disjunctions in TrEMBL, which appear to coincide with changes to the underlying annotation process in TrEMBL. One possible explanation for this decrease is the age of entries.
  • So one possible explanation for this decrease is that entries are, on average, getting older as the database gets older. However, this isn’t the case: the average age of entries is actually decreasing over time. This is mainly because new records are being added exponentially, outnumbering the old records.
  • So here we want to abstract from the size of the database and ask: how are individual records maturing? This isn’t straightforward; essentially we need a set of records which relate to a defined set of proteins. Therefore, we extract those entries that are common to SWP 9 and UPSP 15, providing a span of over 20 years.
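Selecting the cohort of entries common to two versions amounts to a set intersection. The sketch below assumes each version has already been parsed into a mapping from an entry identifier to its comment text; the identifiers, the mapping layout and the function name `cohort_annotations` are illustrative assumptions, and changes of identifier between versions are not handled.

```python
def cohort_annotations(early_version, late_version):
    """Restrict two versions to the entries present in both.

    Each argument maps an entry identifier to that entry's comment text.
    Returns the shared identifiers and the two restricted mappings.
    """
    shared = set(early_version) & set(late_version)
    early_cohort = {key: early_version[key] for key in shared}
    late_cohort = {key: late_version[key] for key in shared}
    return shared, early_cohort, late_cohort


# Toy data with made-up identifiers, for illustration only.
early = {"ENTRY_A": "binds dna", "ENTRY_B": "kinase"}
late = {"ENTRY_A": "binds dna in the nucleus", "ENTRY_C": "transporter"}
shared, early_cohort, late_cohort = cohort_annotations(early, late)
```

The word counts and the fitted alpha can then be computed on the restricted mappings for each version, so that changes reflect the same set of proteins rather than database growth.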
  • Again, highlight the slope of the graph we are looking at, and that we are looking at the subsets. As with the database as a whole, we again see a decrease. However, this isn’t as low as the remainder of the database.
  • Re-iterate the earlier question, and that this is another way to look at it. All annotations are of approximately the same age, as they are new.
  • As we see a decrease in the mature entries, how do the new entries fare? Similarly, they decrease over time, which is the same pattern as in all the other graphs. Why do we see all of these decreases...?
  • The protocol was recently published (2011) and again shows the advantage of using UniProtKB, as it is well documented. We need to be careful that we don't say “standardisation == poor quality”; it is just something that can explain the decrease, rather than a definitive answer. Rather, it is more likely that trying to be consistent has lost some of the more “personal” annotations to entries, which have thus become more generic. (Note: too wordy.)
  • Try to finish on a high here. Give a very quick and brief recap of the main idea and points, and how it is “easy” and could be useful for both curators and end users alike.
  • Word Cloud is from UniProtKB/Swiss-Prot Version 15
  • A number of competing models were considered. However, the power-law distribution gives a good balance between model parsimony and fit. We only deal with the discrete power-law distribution here, which has the probability mass function described. To fit the power-law distribution we followed a Bayesian paradigm. Xmin, determined using the BIC, was set to 50 throughout.
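For orientation, the usual discrete power law has probability mass function p(x) = x^(-α) / ζ(α, xmin) for x ≥ xmin, where ζ is the Hurwitz zeta function; it is assumed here that this is the form shown on the slide. The sketch below does not reproduce the Bayesian fit described in the note. It swaps in a widely used closed-form approximation to the maximum-likelihood estimate of α for discrete data, purely to illustrate how an exponent can be obtained from word sizes; the function name `approx_alpha` is hypothetical.

```python
import math


def approx_alpha(sizes, xmin=50):
    """Approximate maximum-likelihood exponent for a discrete power law.

    Uses alpha ~= 1 + n / sum(ln(x_i / (xmin - 0.5))) over the n values
    with x_i >= xmin. This is NOT the Bayesian procedure from the talk.
    """
    tail = [x for x in sizes if x >= xmin]
    if not tail:
        raise ValueError("no observations at or above xmin")
    log_sum = sum(math.log(x / (xmin - 0.5)) for x in tail)
    return 1.0 + len(tail) / log_sum


# Toy word sizes; values below xmin are ignored by the estimate.
alpha = approx_alpha([3, 12, 50, 100, 200, 400], xmin=50)
```

The default `xmin=50` mirrors the value stated in the note; the approximation is generally considered more reliable for larger xmin, so results for very small thresholds should be treated with caution.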

An approach to describing and analysing bulk biological annotation quality: a case study using UniProtKB Presentation Transcript

  • 1. An approach to describe and analyse bulk annotation quality. Michael J Bell*, Colin Gillespie, Daniel Swan and Phillip Lord. *m.j.bell1@ncl.ac.uk, www.michaeljbell.co.uk
  • 2. Talk Outline
      • Annotation Quality? Why UniProtKB?
      • Data extraction
      • Applying power laws
      • Analysing Swiss-Prot and TrEMBL annotation
      • Discussion and Conclusion
      • Questions
    Michael J Bell @mj_bell Newcastle University m.j.bell1@ncl.ac.uk
  • 3. Annotation Quality in UniProtKB
  • 4. Full UniProtKB/Swiss-Prot flat-file entry for PAX6_RAT, shown in three columns. The identification and comment lines are:
    ID   PAX6_RAT   Reviewed; 422 AA.
    AC   P63016; A1A5N7; P32117; P70601; Q62222; Q64037; Q6QHS5; Q701Q8;
    DT   31-AUG-2004, integrated into UniProtKB/Swiss-Prot.
    DT   31-AUG-2004, sequence version 1.
    DT   11-JUL-2012, entry version 74.
    DE   RecName: Full=Paired box protein Pax-6;
    DE   AltName: Full=Oculorhombin;
    GN   Name=Pax6; Synonyms=Pax-6, Sey;
    OS   Rattus norvegicus (Rat).
    CC   -!- FUNCTION: Transcription factor with important functions in the
    CC       development of the eye, nose, central nervous system and pancreas.
    CC       Required for the differentiation of pancreatic islet alpha cells.
    CC       Competes with PAX4 in binding to a common element in the glucagon,
    CC       insulin and somatostatin promoters (By similarity). Regulates
    CC       specification of the ventral neuron subtypes by establishing the
    CC       correct progenitor domains.
    CC   -!- SUBUNIT: Interacts with MAF and MAFB (By similarity). Interacts
    CC       with TRIM11; this interaction leads to ubiquitination and
    CC       proteasomal degradation, as well as inhibition of transactivation,
    CC       possibly in part by preventing PAX6 binding to consensus DNA
    CC       sequences (By similarity).
    CC   -!- SUBCELLULAR LOCATION: Nucleus (By similarity).
    CC   -!- ALTERNATIVE PRODUCTS:
    CC       Event=Alternative splicing; Named isoforms=2;
    CC       Name=1;
    CC         IsoId=P63016-1; Sequence=Displayed;
    CC       Name=5a; Synonyms=Pax6-5a;
    CC         IsoId=P63016-2; Sequence=VSP_011531;
    CC   -!- PTM: Ubiquitinated by TRIM11, leading to ubiquitination and
    CC       proteasomal degradation (By similarity).
    CC   -!- DISEASE: Note=Defects in Pax6 are the cause of a condition known
    CC       as small eye (Sey) which results in the complete lack of eyes and
    CC       nasal primordia.
    CC   -!- SIMILARITY: Belongs to the paired homeobox family.
    CC   -!- SIMILARITY: Contains 1 homeobox DNA-binding domain.
    CC   -!- SIMILARITY: Contains 1 paired domain.
    CC   -----------------------------------------------------------------------
    CC   Copyrighted by the UniProt Consortium, see http://www.uniprot.org/terms
    CC   Distributed under the Creative Commons Attribution-NoDerivs License
    CC   -----------------------------------------------------------------------
    [Remaining taxonomy (OC, OX), reference (RN to RL), cross-reference (DR), evidence (PE), keyword (KW), feature (FT) and sequence (SQ) lines of the entry are not reproduced here.]
  • 5. [Swiss-Prot flat-file entry PAX6_RAT (P63016); slide text identical to slide 4.]
  • 6. [Swiss-Prot flat-file entry PAX6_RAT (P63016); slide text identical to slide 4.]
  • 7. Functional Annotation• Annotation is overloaded: – Here we mean “high level” • Knowledge associated with the data • Aimed at the human readerMichael J Bell @mj_bell Newcastle University 7m.j.bell1@ncl.ac.uk
  • 8. Michael J Bell @mj_bell Newcastle University 8m.j.bell1@ncl.ac.uk
  • 9. Swiss-Prot Entry P26367 – PAX6_HUMAN (Homo sapiens) 43 SentencesMichael J Bell @mj_bell Newcastle University 9m.j.bell1@ncl.ac.uk
  • 11. TrEMBL Entry A4PBK5 – A4PBK5_9METZ (Ephydatia fluviatilis): 1 sentence
  • 12. Annotation Quality • Annotation is highly variable – E.g. automated vs. manual • Current approaches rely upon specific database structure/features – Ontology – Evidence codes • Can we develop a metric based on free text?
  • 13. Why UniProtKB? • UniProtKB is well known and established • A number of technical reasons: – UniProtKB is composed of TrEMBL and Swiss-Prot – Historical versions – Cross-species • Lack of a gold standard
  • 14. Applying Power Laws
  • 15. Investigating Word Occurrences • Extract word occurrences from all annotations
  • 16. Investigating Word Occurrences • Extract word occurrences from all annotations: 1. Protein 2. Proteins 3. Chains 4. Chain 5. Sequence 6. Enzyme 7. Complex
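The extraction step can be sketched as follows. This is a minimal illustration, assuming the annotation text is already available as plain strings; the tokenisation rule (lower-casing, splitting on non-letters) and the name `count_words` are assumptions for the example, not the authors' pipeline.

```python
import re
from collections import Counter

def count_words(annotations):
    """Count how often each word occurs across all annotation strings."""
    counts = Counter()
    for text in annotations:
        # Assumed tokenisation: lower-case and keep runs of letters only.
        # No stemming, so "protein" and "proteins" stay distinct,
        # matching the separate ranks shown on the slide.
        counts.update(re.findall(r"[a-z]+", text.lower()))
    return counts

# Toy input for illustration only, not real UniProtKB annotation.
example = [
    "Protein binds DNA. The protein forms a complex.",
    "Proteins of this family form a complex.",
]
word_counts = count_words(example)
for word, n in word_counts.most_common(3):
    print(word, n)
```

Only the occurrence counts matter for the later power-law analysis; the words themselves are discarded once counted.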
  • 17. Word Occurrences in Wikipedia (taken from: http://en.wikipedia.org/wiki/File:Wikipedia-n-zipf.png)
  • 18. Zipf’s Principle of Least Effort • Take word occurrences and apply Zipf’s Principle of Least Effort • It is human nature to take the path of least effort when achieving a goal
      α value      | Examples in literature                                                            | Least effort for
      α < 1.6      | Advanced schizophrenia, young children                                            | -
      1.6 < α < 2  | Military combat texts, Wikipedia, web pages listed on the Open Directory Project  | Annotator
      α = 2        | Single-author texts                                                               | Equal
      2 < α < 2.4  | Multi-author texts                                                                | Audience
      α > 2.4      | Fragmented discourse schizophrenia                                                | -
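The "least effort for" column of the table can be read as a simple lookup on α. The sketch below is illustrative only: the function name `least_effort_band` is hypothetical, and because the table leaves the exact boundary values 1.6 and 2.4 unassigned, placing them in the neighbouring bands is an assumption.

```python
import math

def least_effort_band(alpha, tol=1e-9):
    """Map an alpha value to the 'least effort for' column of the table."""
    if math.isclose(alpha, 2.0, abs_tol=tol):
        return "equal"        # single-author texts
    if alpha < 1.6:
        return "-"            # below the range covered by annotator/audience
    if alpha < 2.0:
        return "annotator"    # assumption: alpha == 1.6 is placed here
    if alpha < 2.4:
        return "audience"     # multi-author texts
    return "-"                # assumption: alpha == 2.4 is placed here

for value in (1.2, 1.8, 2.0, 2.2, 3.0):
    print(value, least_effort_band(value))
```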
  • 19. Data Extraction
  • 20. The Model & Resulting Graphs • Power-law distribution • Logarithmic scales • X-axis – size (number of occurrences) • Y-axis – probability • A point represents the probability that a word will occur X or more times • E.g. the upper-left-most point: – Probability that a word occurs one or more times = 10^0
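The points on such a graph can be computed as below. This is a minimal sketch, assuming a mapping from word to occurrence count is already available; `empirical_ccdf` and the toy counts are illustrative names and data, not taken from the talk.

```python
from collections import Counter

def empirical_ccdf(word_counts):
    """Return sorted (x, P(X >= x)) pairs, X being a word's occurrence count."""
    sizes = Counter(word_counts.values())   # occurrence count -> number of words
    total = sum(sizes.values())             # number of distinct words
    points = []
    remaining = total                       # words occurring at least x times
    for x in sorted(sizes):
        points.append((x, remaining / total))
        remaining -= sizes[x]
    return points

# Toy counts for illustration only.
toy_counts = {"protein": 4, "chain": 2, "enzyme": 2, "complex": 1}
ccdf = empirical_ccdf(toy_counts)
print(ccdf)
```

Every word occurs at least once, so the first point always has probability 1, which is the 10^0 point in the upper left of the graph. Plotting the pairs with both axes on a logarithmic scale gives a roughly straight line when the counts follow a power law.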
  • 21. Does UniProtKB obey a power-law? • Broadly, yes. However, distinct structure?
  • 22. The removal of copyright • Development of two slopes – As seen in mature resources
  • 23. Quality of Biological Knowledge? • How does automated annotation compare to manual annotation? – i.e. TrEMBL vs. Swiss-Prot • Assume Swiss-Prot acts as a more mature resource than TrEMBL • Analyse this by comparing annotations at equivalent points in time
  • 24. Viewing over time
  • 25. Viewing over time • Show just the alpha values • Annotation appears to be becoming optimised (least effort) for the annotator
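The slides do not say how the alpha values were fitted. As one possibility, the sketch below uses a standard closed-form approximation to the maximum-likelihood estimate of a discrete power-law exponent (the approximation given by Clauset, Shalizi and Newman). The function name `estimate_alpha`, the default `x_min=1`, and the toy counts are assumptions; the approximation is known to be less accurate for small `x_min`, so treat the result as indicative only.

```python
import math

def estimate_alpha(counts, x_min=1):
    """Approximate the power-law exponent for occurrence counts >= x_min."""
    if x_min < 1:
        raise ValueError("x_min must be at least 1")
    tail = [x for x in counts if x >= x_min]
    if not tail:
        raise ValueError("no counts at or above x_min")
    # Closed-form approximation to the discrete maximum-likelihood estimate:
    # alpha ~= 1 + n / sum(ln(x_i / (x_min - 0.5)))
    log_sum = sum(math.log(x / (x_min - 0.5)) for x in tail)
    return 1.0 + len(tail) / log_sum

# Toy occurrence counts for illustration only.
toy_counts = [1, 1, 1, 1, 2, 2, 3, 5, 8, 40]
alpha = estimate_alpha(toy_counts)
print(round(alpha, 3))
```

Repeating the estimate for each database release, using that release's word counts, would give one alpha value per version to plot over time.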
  • 26. Annotation Maturity • Does this decrease happen because entries are, on average, getting older?
  • 27. Annotation Maturity • Want to abstract from size and analyse how individual records are maturing • Essentially need a set of records that relate to a defined set of proteins • Therefore extract the entries common to both Swiss-Prot V9 and UniProtKB V15
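Selecting the common entries amounts to a set intersection. A minimal sketch, assuming each release has been loaded as a dict from accession to annotation text and that matching on accession is sufficient; accessions that were merged or renamed between releases are not handled here, and the annotation strings are placeholders.

```python
def common_entries(old_release, new_release):
    """Return the accessions present in both releases."""
    # Dict key views support set operations directly.
    return old_release.keys() & new_release.keys()

# Placeholder annotation text; accessions reused from earlier slides.
swissprot_old = {"P26367": "old annotation text", "P63016": "old annotation text"}
uniprotkb_new = {"P26367": "new annotation text", "A4PBK5": "new annotation text"}

shared = common_entries(swissprot_old, uniprotkb_new)
print(sorted(shared))
```

The word counts for the maturity analysis would then be taken only from the annotation of the shared accessions, once per release.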
  • 28. Annotation maturity
  • 29. Analysing new annotations • Mature entries are decreasing • How are new annotations impacted? • Take annotations from entries that appear for the first time in a given database version
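Finding the entries that first appear in each version can be sketched as a running set difference. The function name `first_appearances`, the version labels and the accessions below are hypothetical; the sketch assumes releases are supplied in chronological order.

```python
def first_appearances(releases):
    """Map each version to the accessions not seen in any earlier version.

    releases: chronologically ordered list of (version, set_of_accessions).
    """
    seen = set()
    result = {}
    for version, accessions in releases:
        current = set(accessions)
        result[version] = current - seen   # new in this version
        seen |= current
    return result

# Hypothetical releases for illustration only.
releases = [
    ("v1", {"A", "B"}),
    ("v2", {"A", "B", "C"}),
    ("v3", {"B", "C", "D"}),
]
new_by_version = first_appearances(releases)
print(new_by_version)
```

Note that every entry in the first release counts as new under this definition, so the earliest version is best treated as a baseline.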
  • 30. The impact of new annotations
  • 31. Explanation for the decrease? • Annotation curation involves identifying similar entries • Annotations between these entries are standardised • Is this standardisation changing the way entries are annotated? – Subsequently placing the least effort onto the annotator?
  • 32. Conclusions • Approach acting as a quality measure – Detection of artefacts – Distinction between TrEMBL and Swiss-Prot • Annotations in UniProtKB are becoming optimised for the annotator rather than the reader – Constant increase of data & pressure on curators – Also true for existing and new annotations
  • 33. Summary • The biological community lacks a generic quality metric that allows biological annotation to be quantitatively assessed and compared. • Here we investigated word reuse within bulk textual annotation and related it to Zipf’s Principle of Least Effort. • Straightforward approach once the data is extracted • Holds promise of being useful for curators and end users
  • 34. Thank You! • Colin Gillespie, Daniel Swan and Phillip Lord • Many thanks go to: Allyson Lister (1), Daniel Barrell (2), UniProt Helpdesk – (1) Newcastle University, UK – (2) EBI • Michael Bell, Newcastle University – @mj_bell – m.j.bell1@ncl.ac.uk – www.michaeljbell.co.uk