An approach to describing and analysing bulk biological annotation quality: a case study using UniProtKB

1. An approach to describe and analyse bulk annotation quality
Michael J Bell*, Colin Gillespie, Daniel Swan and Phillip Lord
*m.j.bell1@ncl.ac.uk www.michaeljbell.co.uk

2. Talk Outline
• Annotation Quality? Why UniProtKB?
• Data extraction
• Applying power laws
• Analysing Swiss-Prot and TrEMBL annotation
• Discussion and Conclusion
• Questions
Michael J Bell @mj_bell Newcastle University m.j.bell1@ncl.ac.uk

3. Annotation Quality in UniProtKB

4. ID PAX6_RAT Reviewed; 422 AA.
AC P63016; A1A5N7; P32117; P70601; Q62222; Q64037; Q6QHS5; Q701Q8;
DT 31-AUG-2004, integrated into UniProtKB/Swiss-Prot.
DT 31-AUG-2004, sequence version 1.
DT 11-JUL-2012, entry version 74.
DE RecName: Full=Paired box protein Pax-6;
DE AltName: Full=Oculorhombin;
GN Name=Pax6; Synonyms=Pax-6, Sey;
OS Rattus norvegicus (Rat).
OC Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
OC Mammalia; Eutheria; Euarchontoglires; Glires; Rodentia; Sciurognathi;
OC Muroidea; Muridae; Murinae; Rattus.
OX NCBI_TaxID=10116;
RN [1]
RP NUCLEOTIDE SEQUENCE [MRNA] (ISOFORM 1).
RA Gimlich R., Arnold G.S., Wawersik S., Maas R., Wong G.;
RT "Pax-6 is required for pancreatic islet development.";
RL Submitted (SEP-1996) to the EMBL/GenBank/DDBJ databases.
RN [2]
RP NUCLEOTIDE SEQUENCE [MRNA] (ISOFORM 5A).
RC STRAIN=New England Deaconess Hospital, and Sprague-Dawley;
RA Karkour A., Wolf G.M., Walther R.;
RL Submitted (FEB-2004) to the EMBL/GenBank/DDBJ databases.
RN [3]
RP NUCLEOTIDE SEQUENCE [MRNA] (ISOFORM 5A).
RC STRAIN=Sprague-Dawley; TISSUE=Brain;
RA Wei F.;
RT "Cloning the homologic isoform gene pax6 5a in the rat.";
RL Submitted (FEB-2004) to the EMBL/GenBank/DDBJ databases.
RN [4]
RP NUCLEOTIDE SEQUENCE [LARGE SCALE MRNA] (ISOFORM 1).
RC TISSUE=Heart;
RX PubMed=15489334; DOI=10.1101/gr.2596504;
RG The MGC Project Team;
RT "The status, quality, and expansion of the NIH full-length cDNA
RT project: the Mammalian Gene Collection (MGC).";
RL Genome Res. 14:2121-2127(2004).
RN [5]
RP PARTIAL NUCLEOTIDE SEQUENCE [MRNA], AND INVOLVEMENT IN SEY.
RC STRAIN=Sprague-Dawley; TISSUE=Embryo;
RX MEDLINE=95072652; PubMed=7981749; DOI=10.1038/ng0493-299;
RA Matsuo T., Osumi-Yamashita N., Noji S., Ohuchi H., Koyama E.,
RA Myokai F., Matsuo N., Taniguchi S., Doi H., Iseki S., Ninomiya Y.,
RA Fujiwara M., Wantanabe T., Eto K.;
RT "A mutation in the Pax-6 gene in rat small eye is associated with
RT impaired migration of midbrain crest cells.";
RL Nat. Genet. 3:299-304(1993).
RN [6]
RP FUNCTION.
RX MEDLINE=21869997; PubMed=11880342;
RA Takahashi M., Osumi N.;
RT "Pax6 regulates specification of ventral neurone subtypes in the
RT hindbrain by establishing progenitor domains.";
RL Development 129:1327-1338(2002).
CC -!- FUNCTION: Transcription factor with important functions in the
CC development of the eye, nose, central nervous system and pancreas.
CC Required for the differentiation of pancreatic islet alpha cells.
CC Competes with PAX4 in binding to a common element in the glucagon,
CC insulin and somatostatin promoters (By similarity). Regulates
CC specification of the ventral neuron subtypes by establishing the
CC correct progenitor domains.
CC -!- SUBUNIT: Interacts with MAF and MAFB (By similarity). Interacts
CC with TRIM11; this interaction leads to ubiquitination and
CC proteasomal degradation, as well as inhibition of transactivation,
CC possibly in part by preventing PAX6 binding to consensus DNA
CC sequences (By similarity).
CC -!- SUBCELLULAR LOCATION: Nucleus (By similarity).
CC -!- ALTERNATIVE PRODUCTS:
CC Event=Alternative splicing; Named isoforms=2;
CC Name=1;
CC IsoId=P63016-1; Sequence=Displayed;
CC Name=5a; Synonyms=Pax6-5a;
CC IsoId=P63016-2; Sequence=VSP_011531;
CC -!- PTM: Ubiquitinated by TRIM11, leading to ubiquitination and
CC proteasomal degradation (By similarity).
CC -!- DISEASE: Note=Defects in Pax6 are the cause of a condition known
CC as small eye (Sey) which results in the complete lack of eyes and
CC nasal primordia.
CC -!- SIMILARITY: Belongs to the paired homeobox family.
CC -!- SIMILARITY: Contains 1 homeobox DNA-binding domain.
CC -!- SIMILARITY: Contains 1 paired domain.
CC -----------------------------------------------------------------------
CC Copyrighted by the UniProt Consortium, see http://www.uniprot.org/terms
CC Distributed under the Creative Commons Attribution-NoDerivs License
CC -----------------------------------------------------------------------
DR EMBL; U69644; AAB09042.1; -; mRNA.
DR EMBL; AY540905; AAS48919.1; -; mRNA.
DR EMBL; AY540906; AAS48920.1; -; mRNA.
DR EMBL; AJ627631; CAF29075.1; -; mRNA.
DR EMBL; BC128741; AAI28742.1; -; mRNA.
DR EMBL; S74393; AAB32671.1; ALT_TERM; mRNA.
DR IPI; IPI00231698; -.
DR IPI; IPI00464480; -.
DR PIR; S36166; S36166.
DR RefSeq; NP_037133.1; NM_013001.2.
DR UniGene; Rn.89724; -.
DR ProteinModelPortal; P63016; -.
DR SMR; P63016; 4-136, 211-278.
DR STRING; P63016; -.
DR Ensembl; ENSRNOT00000005882; ENSRNOP00000005882; ENSRNOG00000004410.
DR Ensembl; ENSRNOT00000006302; ENSRNOP00000006302; ENSRNOG00000004410.
DR GeneID; 25509; -.
DR KEGG; rno:25509; -.
DR UCSC; RGD:3258; rat.
DR CTD; 5080; -.
DR RGD; 3258; Pax6.
DR eggNOG; NOG326044; -.
DR GeneTree; ENSGT00650000093130; -.
DR HOVERGEN; HBG009115; -.
DR KO; K08031; -.
DR GO; GO:0000790; C:nuclear chromatin; IDA:BHF-UCL.
DR GO; GO:0003680; F:AT DNA binding; IDA:RGD.
DR GO; GO:0003690; F:double-stranded DNA binding; IDA:RGD.
DR GO; GO:0000979; F:RNA polymerase II core promoter sequence-specific DNA binding; IC:BHF-UCL.
DR GO; GO:0000981; F:sequence-specific DNA binding RNA polymerase II transcription factor activity; IC:BHF-U
DR GO; GO:0004842; F:ubiquitin-protein ligase activity; ISS:UniProtKB.
DR GO; GO:0030902; P:hindbrain development; IDA:RGD.
DR GO; GO:0050768; P:negative regulation of neurogenesis; ISS:UniProtKB.
DR GO; GO:0001764; P:neuron migration; IMP:RGD.
DR GO; GO:0003322; P:pancreatic A cell development; IMP:BHF-UCL.
DR GO; GO:0042660; P:positive regulation of cell fate specification; IMP:RGD.
DR GO; GO:0045893; P:positive regulation of transcription, DNA-dependent; IC:BHF-UCL.
DR GO; GO:0050678; P:regulation of epithelial cell proliferation; IMP:RGD.
DR GO; GO:0045664; P:regulation of neuron differentiation; IDA:RGD.
DR Gene3D; G3DSA:1.10.10.60; Homeodomain-rel; 1.
DR Gene3D; G3DSA:1.10.10.10; Wing_hlx_DNA_bd; 2.
DR InterPro; IPR017970; Homeobox_CS.
DR InterPro; IPR001356; Homeodomain.
DR InterPro; IPR009057; Homeodomain-like.
DR InterPro; IPR001523; Paired_box_N.
DR InterPro; IPR011991; WHTH_trsnscrt_rep_DNA-bd.
DR Pfam; PF00046; Homeobox; 1.
DR Pfam; PF00292; PAX; 1.
DR PRINTS; PR00027; PAIREDBOX.
DR SMART; SM00389; HOX; 1.
DR SMART; SM00351; PAX; 1.
DR SUPFAM; SSF46689; Homeodomain_like; 2.
DR PROSITE; PS00027; HOMEOBOX_1; 1.
DR PROSITE; PS50071; HOMEOBOX_2; 1.
DR PROSITE; PS00034; PAIRED_1; 1.
DR PROSITE; PS51057; PAIRED_2; 1.
PE 2: Evidence at transcript level;
KW Alternative splicing; Complete proteome; Developmental protein;
KW Differentiation; DNA-binding; Homeobox; Nucleus; Paired box;
KW Reference proteome; Transcription; Transcription regulation;
KW Ubl conjugation.
FT CHAIN 1 422 Paired box protein Pax-6.
FT /FTId=PRO_0000050187.
FT DOMAIN 4 130 Paired.
FT DNA_BIND 210 269 Homeobox.
FT COMPBIAS 131 209 Gln/Gly-rich.
FT COMPBIAS 279 422 Pro/Ser/Thr-rich.
FT VAR_SEQ 47 47 Q -> QTHADAKVQVLDSEN (in isoform 5a).
FT /FTId=VSP_011531.
FT CONFLICT 159 159 R -> C (in Ref. 3; CAF29075).
FT CONFLICT 183 183 Q -> G (in Ref. 5; AAB32671).
SQ SEQUENCE 422 AA; 46754 MW; B0B2E5C176A518FE CRC64;
MQNSHSGVNQ LGGVFVNGRP LPDSTRQKIV ELAHSGARPC DISRILQVSN GCVSKILGRY
YETGSIRPRA IGGSKPRVAT PEVVSKIAQY KRECPSIFAW EIRDRLLSEG VCTNDNIPSV
SSINRVLRNL ASEKQQMGAD GMYDKLRMLN GQTGSWGTRP GWYPGTSVPG QPTQDGCQQQ
EGQGENTNSI SSNGEDSDEA QMRLQLKRKL QRNRTSFTQE QIEALEKEFE RTHYPDVFAR
ERLAAKIDLP EARIQVWFSN RRAKWRREEK LRNQRRQASN TPSHIPISSS FSTSVYQPIP
QPTTPVSSFT SGSMLGRTDT ALTNTYSALP PMPSFTMANN LPMQPPVPSQ TSSYSCMLPT
SPSVNGRSYD TYTPPHMQTH MNSQPMGTSG TTSTGLISPG VSVPVQVPGS EPDMSQYWPR
LQ
//

7. Functional Annotation
• The term “annotation” is overloaded:
  – Here we mean “high level” annotation
• Knowledge associated with the data
• Aimed at the human reader

9. Swiss-Prot Entry P26367 – PAX6_HUMAN (Homo sapiens): 43 sentences

11. TrEMBL Entry A4PBK5 – A4PBK5_9METZ (Ephydatia fluviatilis): 1 sentence

12. Annotation Quality
• Annotation is highly variable
  – e.g. automated vs. manual
• Current approaches rely upon specific database structures or features
  – Ontology
  – Evidence codes
• Can we develop a metric based on free text?

13. Why UniProtKB?
• UniProtKB is well known and established
• A number of technical reasons:
  – UniProtKB is composed of TrEMBL and Swiss-Prot
  – Historical versions
  – Cross-species
• Lack of a gold standard

14. Applying Power Laws

15. Investigating Word Occurrences
• Extract word occurrences from all annotation

16. Investigating Word Occurrences
• Extract word occurrences from all annotation
  1. Protein
  2. Proteins
  3. Chains
  4. Chain
  5. Sequence
  6. Enzyme
  7. Complex
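The extraction step could be sketched roughly as follows. This is only an illustration, not the pipeline used for the talk: it assumes flat-file entries as plain strings, treats every CC line as annotation, and uses a hypothetical tokenisation rule (case-sensitive, no stemming, which at least matches "Protein" and "Proteins" being ranked separately above). The function names are made up for this sketch.

```python
import re
from collections import Counter

def comment_text(flat_file_entry):
    """Join the free-text comment (CC) lines of one flat-file entry."""
    parts = []
    for line in flat_file_entry.splitlines():
        if line.startswith("CC"):
            parts.append(line[2:].strip())  # drop the two-letter line code
    return " ".join(parts)

def word_occurrences(entries):
    """Count how often each word occurs across the CC text of all entries."""
    counts = Counter()
    for entry in entries:
        # Assumed tokenisation: a letter followed by letters, digits or hyphens.
        counts.update(re.findall(r"[A-Za-z][A-Za-z0-9-]*", comment_text(entry)))
    return counts

entry = """ID PAX6_RAT Reviewed; 422 AA.
CC -!- SUBUNIT: Interacts with MAF and MAFB (By similarity). Interacts
CC with TRIM11; this interaction leads to ubiquitination and
DR GO; GO:0001764; P:neuron migration; IMP:RGD."""

counts = word_occurrences([entry])
print(counts.most_common(3))
```

Topic names such as SUBUNIT are counted as words here; whether they should be is a design choice the slides do not settle.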

17. Word Occurrences in Wikipedia
Taken from: http://en.wikipedia.org/wiki/File:Wikipedia-n-zipf.png

18. Zipf’s Principle of Least Effort
• Take word occurrences and apply Zipf’s Principle of Least Effort to them
• It is human nature to take the path of least effort when achieving a goal

α Value | Examples in literature | Least effort for
α < 1.6 | Advanced Schizophrenia, Young children | -
1.6 < α < 2 | Military Combat Texts, Wikipedia, Web pages listed on the open directory project | Annotator
α = 2 | Single author texts | Equal
2 < α < 2.4 | Multi author texts | Audience
α > 2.4 | Fragmented discourse schizophrenia | -
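The table on this slide can be read as a simple lookup from α to the party for whom the text is least effort. The sketch below is a hedged illustration: the function name is hypothetical, and because the table leaves the boundary values 1.6 and 2.4 unassigned, placing them in the neighbouring inner bands is an assumption of this sketch.

```python
import math

def least_effort_for(alpha, tol=1e-9):
    """Return who the text is least effort for, following the slide's table.

    Returns None where the table gives no entry (alpha < 1.6 or alpha > 2.4).
    """
    if math.isclose(alpha, 2.0, abs_tol=tol):
        return "equal"
    if 1.6 <= alpha < 2.0:   # boundary 1.6 included by assumption
        return "annotator"
    if 2.0 < alpha <= 2.4:   # boundary 2.4 included by assumption
        return "audience"
    return None

for value in (1.2, 1.8, 2.0, 2.2, 3.0):
    print(value, least_effort_for(value))
```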

19. Data Extraction

20. The Model & Resulting Graphs
• Power law distribution
• Logarithmic scales
• X-axis – Size
• Y-axis – Probability
• A point represents the probability that a word will occur X or more times
• E.g. the upper leftmost point:
  – Probability a word occurs once or more = 10^0
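The points on such a graph can be computed from the word counts alone. A minimal sketch, assuming a plain mapping from word to occurrence count (the toy counts and the name `ccdf` are illustrative only):

```python
from collections import Counter

def ccdf(word_counts):
    """Map each observed size x to P(a word occurs x or more times)."""
    sizes = list(word_counts.values())
    n = len(sizes)
    freq = Counter(sizes)          # how many words have each exact size
    result = {}
    remaining = n                  # words with size >= current x
    for x in sorted(freq):
        result[x] = remaining / n
        remaining -= freq[x]
    return result

counts = {"protein": 8, "chain": 4, "enzyme": 1, "complex": 1}
points = ccdf(counts)
print(points)  # {1: 1.0, 4: 0.5, 8: 0.25}
```

Every word occurs at least once, so the point at X = 1 always has probability 1 = 10^0, which is the upper leftmost point on the logarithmic axes.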

21. Does UniProtKB obey a power-law?
• Broadly, yes. However, distinct structure?

22. The removal of copyright
• Development of two slopes
  – As seen in mature resources
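One way the copyright statement could be stripped before counting words is sketched below. It assumes the statement sits between two dashed CC rules, as it does in the PAX6_RAT entry shown earlier in the talk; the function name is hypothetical and the authors' actual filtering is not described on the slide.

```python
def strip_copyright(cc_lines):
    """Drop the dashed copyright block from a list of CC lines."""
    kept = []
    in_block = False
    for line in cc_lines:
        body = line[2:].strip()            # text after the "CC" line code
        if body and set(body) == {"-"}:    # a rule made only of dashes
            in_block = not in_block        # the rule opens or closes the block
            continue
        if not in_block:
            kept.append(line)
    return kept

cc_lines = [
    "CC -!- SIMILARITY: Contains 1 paired domain.",
    "CC -----------------------------------------------------------------------",
    "CC Copyrighted by the UniProt Consortium, see http://www.uniprot.org/terms",
    "CC Distributed under the Creative Commons Attribution-NoDerivs License",
    "CC -----------------------------------------------------------------------",
]
print(strip_copyright(cc_lines))
```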

23. Quality of Biological Knowledge?
• How does automated annotation compare to manual annotation?
  – i.e. TrEMBL vs. Swiss-Prot
• Assume Swiss-Prot acts as a more mature resource than TrEMBL
• Analyse this by comparing annotations at equivalent points in time

24. Viewing over time

25. Viewing over time
• Show just the alpha values
• Annotation appears to be becoming optimised (least effort) for the annotator
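The slides do not say how each alpha value was fitted. As an assumption for illustration only, the sketch below uses the widely used approximate maximum-likelihood estimator for a discrete power law, α ≈ 1 + n / Σ ln(x_i / (x_min − 0.5)); the function name and the choice of `x_min` are hypothetical, and the approximation is known to be rough for small `x_min`.

```python
import math

def estimate_alpha(sizes, x_min=1):
    """Approximate MLE of the power-law exponent for counts >= x_min."""
    tail = [x for x in sizes if x >= x_min]
    if not tail:
        raise ValueError("no observations at or above x_min")
    # Each term is positive because x >= x_min > x_min - 0.5.
    log_sum = sum(math.log(x / (x_min - 0.5)) for x in tail)
    return 1.0 + len(tail) / log_sum

# Toy word-occurrence sizes, one value per distinct word.
sizes = [1, 1, 1, 1, 2, 2, 4, 8]
alpha = estimate_alpha(sizes)
print(round(alpha, 3))
```

Repeating this per database release would give one alpha value per version, which is the kind of series this slide plots.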

26. Annotation Maturity
• Does this decrease happen because entries are, on average, getting older?

27. Annotation Maturity
• We want to abstract from size and analyse how individual records are maturing
• This essentially needs a set of records that relate to a defined set of proteins
• Therefore, extract the entries common to both Swiss-Prot V9 and UniProtKB V15
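Selecting the entries common to two releases amounts to a set intersection. A minimal sketch, assuming each release has already been parsed into a mapping from accession to annotation text; matching on accession alone is an assumption (merged or renamed entries would need extra handling), and the accessions and texts below are toy data.

```python
def common_entries(old_release, new_release):
    """Pair up the annotation of entries present in both releases."""
    shared = old_release.keys() & new_release.keys()
    return {acc: (old_release[acc], new_release[acc]) for acc in sorted(shared)}

# Hypothetical parsed releases: accession -> annotation text.
swissprot_v9 = {"P00001": "old text A", "P00002": "old text B"}
uniprotkb_v15 = {"P00002": "new text B", "P00003": "new text C"}

pairs = common_entries(swissprot_v9, uniprotkb_v15)
print(pairs)
```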

28. Annotation maturity

29. Analysing new annotations
• Alpha values for mature entries are decreasing
• How are new annotations impacted?
• Take annotations from entries that appear for the first time in a given database version
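Picking out entries that appear for the first time in each version can be sketched as a single pass over the releases in chronological order. The data layout (a list of version and accession-to-text pairs) and the function name are assumptions for illustration.

```python
def first_appearances(releases):
    """Return, per version, the entries not seen in any earlier version.

    `releases` is a list of (version, {accession: text}), oldest first.
    """
    seen = set()
    new_by_version = {}
    for version, entries in releases:
        new_by_version[version] = {
            acc: text for acc, text in entries.items() if acc not in seen
        }
        seen.update(entries)
    return new_by_version

# Toy releases with made-up accessions.
releases = [
    ("v1", {"A1": "text", "A2": "text"}),
    ("v2", {"A1": "text", "A2": "revised", "A3": "text"}),
    ("v3", {"A2": "revised", "A3": "text", "A4": "text"}),
]
new_entries = first_appearances(releases)
print({version: sorted(accs) for version, accs in new_entries.items()})
```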

30. The impact of new annotations

31. Explanation for the decrease?
• Annotation curation involves identifying similar entries
• Annotations between these entries are standardised
• Is this standardisation changing the way entries are annotated?
  – Subsequently placing the least effort onto the annotator?

32. Conclusions
• The approach acts as a quality measure
  – Detection of artefacts
  – Distinction between TrEMBL and Swiss-Prot
• Annotations in UniProtKB are becoming optimised for the annotator rather than the reader
  – Constant increase of data and pressure on curators
  – True for both existing and new annotations

33. Summary
• The biological community lacks a generic quality metric that allows biological annotation to be quantitatively assessed and compared.
• Here we investigated word reuse within bulk textual annotation and related it to Zipf's Principle of Least Effort.
• The approach is straightforward once the data are extracted
• It holds promise of being useful for curators and end users

34. Thank You!
Colin Gillespie, Daniel Swan and Phillip Lord
Many thanks go to:
• Allyson Lister (1)
• Daniel Barrell (2)
• UniProt Helpdesk
(1) Newcastle University, UK (2) EBI
Michael Bell m.j.bell1@ncl.ac.uk www.michaeljbell.co.uk
