AbstractDB & ProteinComplexDB: A database of protein complexes and their abstracts
Wagied Davids, PhD
Banting & Best Dept. of Medical Research, Dept. of Medical Genetics and Microbiology, Donnelly CCBR, 160 College Street, University of Toronto
My Expertise
• Comparative evolutionary genomics: detection and identification of sequence homologues; analysis of mutation rates (dN/dS) and single nucleotide polymorphisms (SNPs); horizontal gene transfer in bacteria.
• Graph-theoretic analysis of biological and literature-derived gene networks.
• Analysis of sequence-structure of functional variants.
• Text mining: construction of literature-derived pathways and networks involving disease genes.
• Analysis of microarray gene expression: differential gene expression; gene-drug profiles; gene regulation network construction.
• Protein structure-function analysis of prioritized candidate disease genes by mapping mutation hotspots onto 3D protein structures.
Presentation Overview
• AbstractDB: database of abstracts pertaining to protein complexes
• Online PubMed abstract curation tool
• ProteinComplexDB: database of extracted protein complexes
Existing Protein Complex Databases
• Only 2 high-quality human-curated protein complex databases are available.
• Both are products of MIPS (Munich Information Centre for Protein Sequences, Germany) (http://mips.gsf.de/genre/proj/yeast/): the MIPS Yeast Protein Complex catalogue and CORUM, the Mammalian Protein Complex catalogue.
Importance of Network Biology, Protein Complexes and Disease
Proteins rarely function in isolation. Instead, proteins:
• participate in protein interactions, e.g. phosphorylation;
• form part of protein complexes, e.g. Mre11-Rad50-Nbs1;
• act together to form pathways, e.g. signalling cascades.
From a Systems Biology perspective: “Cancer – aberrant state of a biological network.”
Fanconi Anaemia Core Protein Complex
FA core protein complex: FANCA, B, C, E, F, G, M and L.
Ref: Youds et al. (2008) Mutation Research, doi:10.1016/j.mrfmm.2008.11.007
Fanconi anaemia
• FA is a severe human recessive disorder.
• Defects in FA genes cause chromosomal aberrations and sensitivity to DNA interstrand cross-links (ICLs).
• 13 FA proteins may constitute a pathway for the repair of DNA interstrand cross-links.
• FA genes are evolutionarily conserved from humans to worms and zebrafish.
• C. elegans functional homologs: brc-2 (FANCD1/BRCA2); fcd-2 (FANCD2); dog-1 (FANCJ/BRIP1).
• Gene deletion in C. elegans (worm) results in lethality, ICL sensitivity and sterility.
Project Conception
Q. Which search engine for PROTEIN COMPLEXES?
1. Relevant for protein complexes and their interactions.
2. Would be good if it could identify gene/protein names for me!
3. ...and experimental methods too!
4. ...mmh... If it could search & validate my curations... I would not do anything...!
Comparison criteria
• Relevance: protein complexes and protein interactions.
• Named Entity Recognition (NER): genes, proteins, cell lines, cell types, experimental methods, discriminatory words.
• User interactivity (UI): construct curations of protein complexes; validate by searching against known protein complex and protein interaction databases.
Q. Feasibility
Q1. How much information is contained within unstructured text from PubMed abstracts for extracting protein complexes?
Q2. In the absence of complete knowledge, is a perfect solution desired or a good starting point?
Q3. What about large-scale high-throughput studies which are not referenced in abstracts or text documents?
CORUM protein complex database
[Figure: bar chart of the count of PubMed identifiers (y-axis, 0 to 1200) per study category (x-axis: SSS, MSS, LSS).]
SSS: 2-5 protein complex members; MSS: 6-10 protein complex members; LSS: >= 11 protein complex members.
Small-scale studies (SSS) account for 76% (1024/1346) of protein complexes derived from the literature-curated CORUM database.
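The size categories above can be expressed as a small helper. This is only a sketch: the function name `size_category` is hypothetical, and the two counts are the ones quoted on the slide.

```python
def size_category(n_members: int) -> str:
    """Map a complex's member count to the slide's study-size category."""
    if n_members < 2:
        raise ValueError("a protein complex needs at least 2 members")
    if n_members <= 5:
        return "SSS"   # small-scale study: 2-5 members
    if n_members <= 10:
        return "MSS"   # medium-scale study: 6-10 members
    return "LSS"       # large-scale study: 11 or more members


# Counts quoted on the slide: 1024 of 1346 complexes fall in the SSS category.
sss_share = 1024 / 1346
print(f"{sss_share:.0%}")  # 76%
```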
Manual curation: steps involved
• Find all articles related to protein complexes.
• Identify gene/protein names by eye.
• Identify terms establishing a relationship between proteins.
• Make an inference on whether or not to include a new member in an existing protein complex.
Q. Why not use the PubMed search engine?
• The PubMed search engine's retrieval model is called pmra.
• pmra is a topic-based content similarity model.
• The PubMed search engine focuses on “relatedness” rather than relevance, i.e. the probability that a user wants to examine a particular document given known interest in another document.
Aim
Use literature-derived information to:
• Rank documents according to a protein complex relevance score.
• Assign confidence scores to protein interactions.
• Provide an updated catalogue of protein complexes.
Our initial step towards this goal is to develop a “Recommender system” for ranking abstracts by relevance to protein complexes.
Our hypothesis
Abstracts discussing protein complexes can be distinguished from non-relevant abstracts based on the frequency distribution of words in a hand-curated data set on protein complexes versus a data set of background word frequencies.
Our method
• Our method is based on a Naïve Bayesian classifier using discriminatory words [5].
• Discriminatory words: a selected subset of high-scoring words that characterize abstracts discussing protein complexes.
• The discriminatory words include both high- and low-frequency words that distinguish abstracts discussing protein complexes.
• Our use of a “stopword” list removes high-frequency non-informative words, e.g. “the”, “a”, “of”, “for”.
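The stopword filtering step might look like the sketch below. The tokenizer (lowercased alphanumeric runs) and the names `tokenize` and `word_counts` are assumptions for illustration. The stopword set holds only the four examples from the slide, and a real list would be much longer.

```python
import re
from collections import Counter

# Only the four examples given on the slide; a real stopword list is longer (assumption).
STOPWORDS = {"the", "a", "of", "for"}


def tokenize(text):
    """Lowercase the text, split it into alphanumeric tokens, drop stopwords."""
    return [w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOPWORDS]


def word_counts(abstracts):
    """Count the remaining words over a collection of abstracts."""
    counts = Counter()
    for abstract in abstracts:
        counts.update(tokenize(abstract))
    return counts


print(tokenize("The purification of a protein complex"))
# ['purification', 'protein', 'complex']
```

Word counts built this way, for the hand-curated set and for a background set, are the raw material from which discriminatory words would be selected.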
Our model
Assume a Poisson word model for the probability of observing a given word in a document, with:
n = count of word occurrences
N = total number of words in a set of training abstracts
f = dictionary word frequency
Using the 500 most significant words, we constructed a discriminatory word list of 80 words for scoring abstracts.
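The slide's formula is not preserved in this text. Assuming the standard Poisson form with expected count Nf, the probability of observing n occurrences of a word would read:

```latex
P(n \mid N, f) = \frac{(N f)^{n} \, e^{-N f}}{n!}
```

Here n, N and f are the symbols defined on the slide. Taking the rate parameter to be Nf is an assumption, not something the source states.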
Does the abstract discuss protein complexes or not?
Calculate a log-likelihood score for each individual abstract by summing over all discriminatory words, with:
F_{N,i}: dictionary frequency of discriminatory word i
F_{I,i}: frequency of discriminatory word i in the training abstracts
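One plausible reading of this score is sketched below. It assumes each occurrence of a discriminatory word contributes log(F_{I,i} / F_{N,i}), which is not confirmed by the slide. The function name and the frequency tables are hypothetical and hold made-up values for illustration only.

```python
import math
import re


def log_likelihood_score(text, f_train, f_dict):
    """Sum log(F_I,i / F_N,i) over every occurrence of a discriminatory word.

    f_train: word -> frequency in the training abstracts (F_I,i)
    f_dict:  word -> dictionary (background) frequency (F_N,i)
    """
    score = 0.0
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        if word in f_train and word in f_dict:
            score += math.log(f_train[word] / f_dict[word])
    return score


# Hypothetical frequencies, not taken from the source.
F_TRAIN = {"complex": 0.004, "subunit": 0.002, "patients": 0.0005}
F_DICT = {"complex": 0.001, "subunit": 0.0005, "patients": 0.002}

print(log_likelihood_score("A protein complex subunit", F_TRAIN, F_DICT) > 0)  # True
```

Under this reading, words that are more frequent in the training abstracts than in the background push the score up, and words that are rarer there push it down, so scores can be negative.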
Our system
Our system consists of the following components:
• A set of PubMed abstracts from 1965 - 2008 retrieved with the query “protein complex”;
• A Bayesian probabilistic method for calculating an article's relevance in discussing protein complexes, using word occurrences found in the training set;
• A method for extracting gene/protein names using a biological named entity recognizer, ABNER [6];
• A Wiki resource to enable scientists to evaluate and revise the data.
Query terms used for construction of protein complex abstract data sets
Query term                                       No. of abstracts retrieved
“protein complex”                                499918
“cell cycle” AND “protein complex”               19360
“chromatin remodeling” AND “protein complex”     238
“DNA repair” AND “protein complex”               325
(including abstracts published 1965 - 2008)
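The deck does not say how the abstracts were retrieved. As a sketch of one possible route, the queries above could be sent to the NCBI E-utilities `esearch` endpoint with a publication-date range. The helper name `build_esearch_url` and the `retmax` default are assumptions, and the code only builds URLs without fetching anything.

```python
from urllib.parse import urlencode

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

# Query terms listed on the slide.
QUERIES = [
    '"protein complex"',
    '"cell cycle" AND "protein complex"',
    '"chromatin remodeling" AND "protein complex"',
    '"DNA repair" AND "protein complex"',
]


def build_esearch_url(term, first_year=1965, last_year=2008, retmax=100):
    """Build a PubMed esearch URL restricted to a publication-date range."""
    params = {
        "db": "pubmed",
        "term": term,
        "datetype": "pdat",       # filter on publication date
        "mindate": str(first_year),
        "maxdate": str(last_year),
        "retmax": str(retmax),    # maximum number of PMIDs returned per call
    }
    return f"{ESEARCH}?{urlencode(params)}"


for query in QUERIES:
    print(build_esearch_url(query))
```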
Performance Evaluation
i. Apoptosis
ii. Cell cycle
iii. Chromatin remodeling
iv. DNA repair
A text-based Protein Assay
• Named Entity Recognition for identifying gene and protein names.
• A challenging task due to the irregularities and ambiguities in gene and protein nomenclature.
• Synonyms and versioning of dbxref.
Online Annotation Tool for PubMed abstracts
Biological entities recognised: Protein, DNA, RNA, Cell line, Cell type
PMID:10871607
SentenceId  Cscore  ABNER  GeneTagger  KEX   Sentence
1            1.5    0      0.12        0.08  The Rad51 protein in eukaryotic cells is a structural and functional homolog of Escherichia coli RecA with a role in DNA repair and genetic recombination.
2            0.62   0.06   0.06        0.12  Several proteins showing sequence similarity to Rad51 have previously been identified in both yeast and human cells.
3           -0.31   0.05   0.1         0.15  In Saccharomyces cerevisiae, two of these proteins, Rad55p and Rad57p, form a heterodimer that can stimulate Rad51-mediated DNA strand exchange.
4           -1.11   0      0.12        0.12  Here, we report the purification of one of the representatives of the RAD51 family in human cells.
5            1.25   0      0.14        0.17  We demonstrate that the purified RAD51L3 protein possesses single-stranded DNA binding activity and DNA-stimulated ATPase activity, consistent with the pre
6            2.01   0.06   0.17        0.22  We have identified a protein complex in human cells containing RAD51L3 and a second RAD51 family member, XRCC2.
7            3.47   0.13   0.13        0.2   By using purified proteins, we demonstrate that the interaction between RAD51L3 and XRCC2 is direct.
8            0.66   0.06   0.06        0.06  Given the requirements for XRCC2 in genetic recombination and protection against DNA-damaging agents, we suggest that the complex of RAD51L3 and XRC
[Figure: Cscore, ABNER, GeneTagger and KEX values plotted against Sentence Id (1 to 8).]
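As a small illustration of how these per-sentence scores could be used, the sketch below ranks the eight sentences by their Cscore. The values are copied from the slide; the helper name `top_sentences` and the choice of the top three are assumptions.

```python
# Cscore per sentence id for PMID:10871607, as listed on the slide.
CSCORES = {1: 1.5, 2: 0.62, 3: -0.31, 4: -1.11, 5: 1.25, 6: 2.01, 7: 3.47, 8: 0.66}


def top_sentences(scores, k=3):
    """Return the k sentence ids with the highest score, best first."""
    return sorted(scores, key=scores.get, reverse=True)[:k]


print(top_sentences(CSCORES))  # [7, 6, 1]
```

The two highest-scoring sentences (7 and 6) are the ones that state the RAD51L3 and XRCC2 interaction and complex explicitly.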
Syntax Parsing - semantic relations among words
Example Scenario
Q. What are the members of the FEAR complex?
1. Keyword: FEAR
2. List of abstracts relevant to the FEAR protein complex:
   • FEAR complex: cdc14, esp1, cdc5
   • Similar article (CONDENSIN): smc2-8 and smc4-1, from an explicit sentence
   • FEAR complex: cdc14, esp1, cdc5, spo12, fob1, from explicit sentences
Validation: ProteinComplexDB
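The comparison of member lists in this scenario can be sketched with plain set operations. The function name `compare_members` is hypothetical, and the two inputs are the FEAR member lists shown on the slide. The lookup against ProteinComplexDB itself is not shown, since its interface is not described in the deck.

```python
def compare_members(curated, candidate):
    """Compare a curated member list with members extracted from another abstract."""
    curated_n = {m.lower() for m in curated}
    candidate_n = {m.lower() for m in candidate}
    return {
        "confirmed": sorted(curated_n & candidate_n),  # present in both lists
        "new": sorted(candidate_n - curated_n),        # candidates to add
        "missing": sorted(curated_n - candidate_n),    # not seen in the candidate list
    }


first = ["cdc14", "esp1", "cdc5"]
second = ["cdc14", "esp1", "cdc5", "spo12", "fob1"]
result = compare_members(first, second)
print(result["new"])  # ['fob1', 'spo12']
```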
Conclusion
We have undertaken an initial step towards developing:
• a “Recommender system” for ranking abstracts by relevance to protein complexes;
• a curation tool for extracting protein complexes from the literature.
We are in the process of:
• constructing a database of protein complexes, and
• linking protein complexes to pathways and disease phenotypes.
Our ultimate aim is to understand the biological mechanisms behind complex disease phenotypes.
Acknowledgements
Zhang Zhang and lab members:
• Ivan Borozan
• Dong (Derek) Dong
• Matthew Fagnani
• Yunchen Gong
• Sumedha Gunewardena
• Gabe Musso
• Renqiang Min
• Sanaa Mahmood
• Jingjing Li
• Yu Liu
• Apostolos Lydakis
• Lee Zamparo