AbstractDB & ProteinComplexDB:
 A database of protein complexes
        and their abstracts

                               Wagied Davids, PhD
  Banting & Best Dept. of Medical Research,
  Dept. of Medical Genetics and Microbiology,
  Donnelly CCBR, 160 College Street,
  University of Toronto
My Expertise
 Comparative Evolutionary Genomics
        Detection and identification of sequence homologues
        Analysis of mutation rates (dN/dS) AND single nucleotide polymorphism (SNP)
        Horizontal Gene Transfer in Bacteria
        Graph-theoretic analysis of biological and literature-derived gene networks
        Analysis of Sequence-Structure of functional variants
 Text-mining:
        Construction of literature-derived pathways and networks involving disease
        genes.
 Analysis of microarray gene expression:
        Differential gene expression
        Gene-Drug profiles
        Gene regulation network construction.
 Protein Structure - Function analysis of prioritized candidate disease genes by mapping
mutation hotspots onto 3D protein structures.
Presentation Overview

AbstractDB – database of abstracts pertaining
to protein complexes
Online PubMed abstract curation tool.
ProteinComplexDB- database of extracted
protein complexes
Existing Protein Complex Databases

Only 2 high quality human-curated Protein
Complex databases available.
Both are products from MIPS - (Munich
Information Centre for Protein Sequences,
Germany)
     (http://mips.gsf.de/genre/proj/yeast/)
MIPS-Yeast Protein Complex catalogue
CORUM- Mammalian Protein Complex
catalogue.
Importance of Network Biology, Protein
       Complexes and Disease
 Proteins rarely function in isolation.
 Instead, proteins participate in:
  protein interactions e.g. phosphorylation
  form part of protein complexes e.g. mre11-rad50-nbs1
  act together forming pathways e.g. signalling cascades
 From a Systems Biology perspective:
 “Cancer – aberrant state of a biological network.”
Fanconi Anaemia Core Protein Complex
        FA core protein complex: (FANCA, B, C, E, F, G, M and L)

Ref: Youds et al. (2008) Mutation Research doi:10.1016/j.mrfmm.2008.11.007
Fanconi anaemia
 FA is a severe human recessive disorder.
 Defects in FA genes cause chromosomal aberrations and sensitivity to DNA
inter-strand cross-links (ICLs).
 13 FA proteins may constitute a pathway for repairing DNA
inter-strand cross-links.
 FA genes are evolutionarily conserved from humans to worms and
zebrafish.
 C. elegans functional homologs:
   brc-2 (FANCD1/BRCA2);
   fcd-2 (FANCD2);
   dog-1 (FANCJ/BRIP1).
 Gene deletion in C. elegans (worm) results in lethality, ICL sensitivity and
sterility.
Project Conception
   Q. Which search engine for PROTEIN COMPLEXES?
   1. Relevant for protein complexes and their interactions
   2. Would be good if it could identify gene/protein names for me!
   3. ....and experimental methods too!
   4. ...mmh ... If it could search & validate my curations... I would not do anything....!
Comparison criteria
Relevance:
 Protein complexes and protein interactions


Named Entity Recognition (NER):
 genes, proteins, cell lines, cell types, experimental
 methods, discriminatory words


User-interactivity (UI)
 Construct curations of protein complexes
 Validate by searching against known protein
 complex and protein interaction databases.
Q. Feasibility
Q1. How much information is contained within
unstructured text from PubMed abstracts for
extracting protein complexes?
Q2. In the absence of complete knowledge, is a
perfect solution desired or a good starting
point?
Q3. What about large-scale high-throughput
studies which are not referenced in abstracts or
text documents?
CORUM protein complex database
[Figure: bar chart of the count of PubMed identifiers (y-axis, 0 to 1200) by category (x-axis: SSS, MSS, LSS)]
SSS: 2-5 protein complex members
MSS: 6-10 protein complex members
LSS: >= 11 protein complex members
Small-scale studies (SSS) account for 76% (1024/1346) of protein
complexes derived from the literature-curated CORUM database.
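The size categories on this slide amount to a simple binning rule. The following is a minimal sketch of that rule; the function name `size_category` and the example member lists are hypothetical, and only the thresholds and the 1024/1346 counts come from the slide.

```python
def size_category(n_members: int) -> str:
    """Map a complex's member count to the slide's size category."""
    if n_members < 2:
        raise ValueError("a protein complex needs at least 2 members")
    if n_members <= 5:
        return "SSS"   # 2-5 members
    if n_members <= 10:
        return "MSS"   # 6-10 members
    return "LSS"       # >= 11 members

# Hypothetical complexes, used only to exercise the binning rule.
example_complexes = {
    "complex_a": ["P1", "P2", "P3"],
    "complex_b": ["P%d" % i for i in range(8)],
    "complex_c": ["P%d" % i for i in range(14)],
}
categories = {
    name: size_category(len(members))
    for name, members in example_complexes.items()
}

# Share of small-scale studies, using the counts reported on the slide.
sss_share = 1024 / 1346
print(categories)
print(f"SSS share: {sss_share:.0%}")
```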
Manual curation – Steps involved

Find all articles related to protein complexes.
Identify gene/protein names by eye.
Identify terms establishing a relationship
between proteins.
Infer whether or not to add a
new member to an existing protein complex.
Search using NCBI
     PubMed
Q. Why not use the PubMed search engine?
The PubMed search engine's retrieval model is
called pmra.
pmra is a topic-based content similarity
model.
The PubMed search engine focusses on
“relatedness” rather than relevance,
i.e. the probability that a user wants to examine a particular
   document given known interest in another document.
From Document clusters
   to Protein Clusters
Corpus of Documents -> Document Clusters -> Protein Clusters (Protein Complexes & their Interactions)
AbstractDb
User Interface - AbstractDb
Aim
Use literature-derived information to:
 Rank documents according to protein complex relevance score.
 Assign confidence scores to protein interactions.
 Provide an updated catalogue of protein complexes
Our initial step towards our goal is to develop a “Recommender system” for
ranking abstracts with relevance to protein complexes.




Our hypothesis
Abstracts discussing protein complexes can be distinguished from non-relevant
abstracts based on the frequency distribution of words in a hand-curated
data set on protein complexes versus a data set of background
word frequencies.
Our method

Our method is based on a Naïve Bayesian classifier using
discriminatory words [5].
Discriminatory words - a selected subset of high scoring words
that characterize abstracts discussing protein complexes.
The discriminatory words include both high and low frequency
words that distinguish abstracts discussing protein complexes.
Our use of a “stopword” list removes high frequency
non-informative words, e.g. “the”, “a”, “of”, “for”.
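The stopword step can be illustrated with a short sketch. This is not the presenter's code: the tokenizer, the function name `word_frequencies`, and the example sentence are assumptions, and the stopword set contains only the four words quoted on the slide.

```python
import re
from collections import Counter

# Only the four stopwords quoted on the slide; a real list would be longer.
STOPWORDS = {"the", "a", "of", "for"}

def word_frequencies(text: str) -> dict:
    """Lowercase, tokenize, drop stopwords, and return relative word frequencies."""
    tokens = [
        token
        for token in re.findall(r"[a-z0-9\-]+", text.lower())
        if token not in STOPWORDS
    ]
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}

# Hypothetical sentence, used only to show the effect of the filter.
freqs = word_frequencies("The subunits of the complex form a stable complex")
print(freqs)
```

Frequencies computed this way for a curated set and for a background set are the kind of input a discriminatory word list could be selected from.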
Our model
Assume Poisson word model:



Probability of observing a given word in a document:
n = Count of word occurrences
N = Total number of words in a set of training abstracts
f = Dictionary word frequency
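
The slide's own equation did not survive extraction. For reference, the standard Poisson form that matches the symbols defined here is sketched below; treating the expected count as N times f is an assumption based on those definitions, not a formula recovered from the slide.

```latex
P(n \mid N, f) = \frac{(N f)^{n} \, e^{-N f}}{n!}
```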


Using the 500 most significant words, we constructed
a discriminatory word list of 80 words for scoring abstracts.
Does the abstract discuss protein complexes or not?

Calculate the log-likelihood score for an individual abstract by summing over
all discriminatory words.

F_{N,i}: dictionary frequency of discriminatory word i

F_{I,i}: frequency of discriminatory word i in the training abstracts
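
The scoring formula itself was an image and is missing from this transcript. The sketch below shows one plausible reading, assuming the score is the sum, over discriminatory words, of the log ratio between the Poisson probability of the observed count under the training-abstract frequency and under the dictionary frequency. The function names, the example frequencies, and the use of the abstract's own word count as N are all hypothetical.

```python
import math

def poisson_log_pmf(n: int, total_words: int, freq: float) -> float:
    """Log Poisson probability of n occurrences with expected count total_words * freq."""
    rate = total_words * freq
    return n * math.log(rate) - rate - math.lgamma(n + 1)

def log_likelihood_score(counts, total_words, f_interest, f_dictionary):
    """Sum log-likelihood ratios over all discriminatory words.

    counts       -- word -> occurrences in the abstract being scored
    total_words  -- number of words in that abstract (assumed meaning of N here)
    f_interest   -- word -> frequency in the training abstracts
    f_dictionary -- word -> dictionary (background) frequency
    """
    score = 0.0
    for word, f_i in f_interest.items():
        n = counts.get(word, 0)
        score += poisson_log_pmf(n, total_words, f_i)
        score -= poisson_log_pmf(n, total_words, f_dictionary[word])
    return score

# Hypothetical frequencies for two discriminatory words.
f_interest = {"complex": 0.02, "subunit": 0.01}
f_dictionary = {"complex": 0.002, "subunit": 0.001}

score_pos = log_likelihood_score({"complex": 3, "subunit": 1}, 100, f_interest, f_dictionary)
score_neg = log_likelihood_score({}, 100, f_interest, f_dictionary)
print(score_pos, score_neg)
```

Under this reading a positive score favours the protein-complex model and a negative score favours the background, so abstracts can be ranked by score.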
Our system

Our system consists of the following components:
 A set of PubMed abstracts from 1965 - 2008 retrieved with the
 query “protein complex”;
 A Bayesian probabilistic method for calculating an article's
 relevance in discussing protein complexes, using word occurrences
 found in the training set;
 A method for extracting gene/protein names using a biological
 named entity recognizer – ABNER [6];
 A Wiki resource to enable scientists to evaluate and revise the data.
Query terms used for construction of protein
        complex abstract data sets

  Query Term                                     No. of abstracts retrieved

  “protein complex”                              499918
  “cell cycle” AND “protein complex”             19360
  “chromatin remodeling” AND “protein complex”   238
  “DNA repair” AND “protein complex”             325

  (including abstracts published 1965 - 2008)
Validation of Bayesian classification of PubMed abstracts
               using hand-curated data sets

  Data set                Positives   Negatives   Accuracy   Precision   Recall   F-measure

  Apoptosis                  138          94        0.89       0.93       0.89      0.91
  Cell cycle                 600         702        0.96       0.97       0.94      0.96
  Chromatin remodelling      155          81        0.83       0.93       0.84      0.88
  DNA repair                 203         122        0.90       0.96       0.88      0.92

  Accuracy = (TP+TN)/(TP+FP+FN+TN)
  Precision = TP/(TP+FP)
  Recall = TP/(TP+FN)
  F-measure = 2 * Precision * Recall / (Precision + Recall)
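
The four formulas can be checked with a few lines of code. This is a minimal sketch; the function name `classification_metrics` is hypothetical and the confusion-matrix counts are made up for illustration, since the slide reports only the resulting metrics and not the underlying TP/FP/FN/TN counts.

```python
def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Compute accuracy, precision, recall and F-measure from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
    }

# Hypothetical counts, not taken from the slide's data sets.
metrics = classification_metrics(tp=90, fp=10, fn=20, tn=80)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```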
Performance Evaluation
[Figure panels: i. Apoptosis; ii. Cell cycle; iii. Chromatin remodeling; iv. DNA repair]
A text-based Protein Assay
   Named Entity Recognition for identifying gene
    and protein names
   A challenging task due to the irregularities and
    ambiguities in gene and protein nomenclature.
   Synonyms and versioning of dbxref.
Online Annotation Tool for PubMed abstract


Biological entities recognised:
 Protein
 DNA
 RNA
 CELL LINE
 CELL TYPE
PMID:10871607
 SentenceId Cscore    ABNER GeneTagger KEX Sentence
      1         1.5      0     0.12    0.08 The Rad51 protein in eukaryotic cells is a structural and functional homolog of Escherichia coli RecA with a role in DNA repair and genetic recombination.
      2        0.62    0.06    0.06    0.12 Several proteins showing sequence similarity to Rad51 have previously been identified in both yeast and human cells.
      3       -0.31    0.05     0.1    0.15 In Saccharomyces cerevisiae, two of these proteins, Rad55p and Rad57p, form a heterodimer that can stimulate Rad51-mediated DNA strand exchange.
      4       -1.11      0     0.12    0.12 Here, we report the purification of one of the representatives of the RAD51 family in human cells.
      5        1.25      0     0.14    0.17 We demonstrate that the purified RAD51L3 protein possesses single-stranded DNA binding activity and DNA-stimulated ATPase activity, consistent with the pre
      6        2.01    0.06    0.17    0.22 We have identified a protein complex in human cells containing RAD51L3 and a second RAD51 family member, XRCC2.
      7        3.47    0.13    0.13     0.2 By using purified proteins, we demonstrate that the interaction between RAD51L3 and XRCC2 is direct.
      8        0.66    0.06    0.06    0.06 Given the requirements for XRCC2 in genetic recombination and protection against DNA-damaging agents, we suggest that the complex of RAD51L3 and XRC
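
One way to use the per-sentence scores in the table above is to rank sentences by Cscore. The sketch below assumes Cscore is a per-sentence relevance score where higher means more relevant; the function name `rank_sentences` and the threshold of 0 are hypothetical, while the score values are copied from the table.

```python
# (sentence id, Cscore) pairs copied from the PMID:10871607 table.
cscores = [
    (1, 1.5), (2, 0.62), (3, -0.31), (4, -1.11),
    (5, 1.25), (6, 2.01), (7, 3.47), (8, 0.66),
]

def rank_sentences(scores, threshold=0.0):
    """Return ids of sentences scoring above threshold, highest score first."""
    kept = [(sid, score) for sid, score in scores if score > threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return [sid for sid, _ in kept]

ranked = rank_sentences(cscores)
print(ranked)
```

With these values the top-ranked sentences are 7 and 6, the two that explicitly describe the RAD51L3 and XRCC2 complex.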




[Figure: Cscore, ABNER, GeneTagger and KEX values plotted against Sentence Id (1 to 8); left axis labelled Cscore (-2 to 4), right axis 0 to 0.25]
Syntax Parsing - semantic relations among words
Example Scenario
   Q. What are the members of the FEAR complex?

 1. Keyword: FEAR
 2. List of abstracts relevant to the FEAR protein complex:
      FEAR complex: cdc14, esp1, cdc5 (explicit sentence)
      FEAR complex: cdc14, esp1, cdc5, spo12, fob1 (explicit sentences)
      Similar article: CONDENSIN, smc2-8 and smc4-1

 Validation: ProteinComplexDB
Conclusion
We have undertaken an initial step towards developing:
 a “Recommender system” for ranking abstracts with relevance
 to protein complexes.
 a Curation Tool for extracting Protein Complexes from
 literature
We are in the process of:
 Constructing a database of Protein Complexes, and
 Linking Protein Complexes to Pathways and Disease
 phenotypes.


Ultimate aim of understanding biological mechanisms behind
complex Disease phenotypes
Acknowledgements
Zhang Zhang and lab members:
•   Ivan Borozan
•   Dong (Derek) Dong
•   Matthew Fagnani
•   Yunchen Gong
•   Sumedha Gunewardena
•   Gabe Musso
•   Renqiang Min
•   Sanaa Mahmood
•   Jingjing Li
•   Yu Liu
•   Apostolos Lydakis
•   Lee Zamparo

  • 1. AbstractDB & ProteinComplexDB: A database of protein complexes and their abstracts Wagied Davids, PhD Banting & Best Dept. of Medical Research, Dept. of Medical Genetics and Microbiology, Donnelly CCBR, 160 College Street, University of Toronto
  • 2. My Expertise Comparative Evolutionary Genomics Detection and Identification sequence homologues Analysis of mutation rates (dN/dS) AND single nucleotide polymorphism (SNP) Horizontal Gene Transfer in Bacteria Graph-theoretic analysis of biological and literature-derived gene networks Analysis of Sequence-Structure of functional variants Text-mining: Construction of literature-derived pathways and networks involving disease genes. Analysis of microarray gene expression: Differential gene expression Gene-Drug profiles Gene regulation network construction. Protein Structure - Function analysis of prioritized candidate disease genes by mapping mutation hotspots onto 3D protein structures.
  • 3. Presentation Overview AbstractDB – database of abstracts pertaining to protein complexes Online PubMed abstract curation tool. ProteinComplexDB- database of extracted protein complexes
  • 4. Existing Protein Complex Databases Only 2 high quality human-curated Protein Complex databases available. Both are products from MIPS - (Munich Information Centre for Protein Sequences, Germany) (http://mips.gsf.de/genre/proj/yeast/‫)‏‬ MIPS-Yeast Protein Complex catalogue CORUM- Mammalian Protein Complex catalogue.
  • 5. Importance of Network Biology, Protein Complexes and Disease Proteins rarely function in isolation. Instead, proteins participate in: protein interactions e.g. phosphorylation form part of protein complexes e.g. mre11-rad50- nsb1 act together forming pathways e.g. Signalling cascades From a System Biology perspective: “Cancer – aberrant state of a biological network.”
  • 6. Fanconi Anaeami Core Protein Complex FA core protein complex:(FANCA, B, C, E, F, G, M and L) Ref: Youds et al. (2008) Mutation Research doi:10.1016/ j.mrfmm.2008.11.007
  • 7. Fanconi anaeami FA severe human recessive disorder. Defect in genes chromosomal aberrations and sensitivity DNA intra- strand cross-links (ICLs). 13 FA proteins may constitute a pathway for dna damage repair of DNA intra-strand cross-links. Evolutionary conservation of FA genes from humans to worms and zebrafish. C. elegans Functional homologs: brc-2 (FANCD1/BRCA2); fcd-2 (FANCD-2); dog-1 (FANCJ/BRIP1); Gene deletion in C. elegans (worm) results in lethality, ICL sensitivity, sterility.
  • 8. Project Conception 3. ....and 2. Would be Experimental good if it good methods too! 1. Relevant for identify Protein Complexes gene/protein and their interactions names for me! 4. ...mmh ... If it could search & validate my curations... Q. Which search engine for ....I would not do anything....! PROTEIN COMPLEXES ?
  • 9. Comparison criteria Relevance: Protein complexes and protein interactions Named Entity Recognition (NER): genes, proteins, cell lines, cell types, experimental methods, discriminatory words User-interactivity (UI)‫‏‬ Construct curations of protein complexes Validate by searching against known protein complex and protein interaction databases.
  • 10. Q. Feasibility Q1. How much information is contained within unstructured text from PubMed abstracts for extracting protein complexes? Q2. In the absence of complete knowledge, is a perfect solution desired or a good starting point? Q3. What about large-scale high-throughput studies which are not referenced in abstracts or text documents?
  • 12. CORUM protein complex database 1200 1000 Count of PubMed Identifiers 800 600 400 200 0 SSS MSS LSS Category SSS: 2-5 protein complex MSS: 6-10 protein complex LSS: >= 11 protein complex members members members Small-scale studies (SSS) account for 76% (1024/1346) of protein complexes derived from the literature-curated CORUM database.
  • 13. Manual curation – Steps involved Find all articles related to protein complexes. Identify by eye gene/protein names. Identify terms establishing a relationship between proteins Make inference on whether or not to include a new member to an existing protein complex .
  • 15.
  • 16. Q. Why not use PubMed Search Engine ? PubMed search engine's retrieval model called pmra. pmra is a Topic-based content similarity model. PubMed search engine focusses on “relatedness” rather than relevance. i.e the probability a user wants to examine a particular document given known interest in another document
  • 17. From Document clusters to Protein Clusters
  • 18. Corpus of Documents Document Clusters Protein Clusters (Protein Complexes & their Interactions)‫‏‬
  • 20. User Interface - AbstractDb
  • 21.
  • 22.
  • 23. Aim Use literature-derived information to: Rank documents according to protein complex relevance score. Assign confidence scores to protein interactions. Provide an updated catalogue of protein complexes Our initial step towards our goal is to develop a “Recommender system” for ranking abstracts with relevance to protein complexes. Our hypothesis Abstracts discussing protein complexes can be distinguished from non- relevant abstracts based on the frequency distribution of words in a hand- curated data set on protein complexes versus a data set of background word frequencies
  • 24. Our method Our method is based on a Naïve Bayesian classifier using discriminatory words5. Discriminatory words - a selected subset of high scoring words that characterize abstracts discussing protein complexes. The discriminatory words include both high and low frequency words that distinguish abstracts discussing protein complexes. Our use of a “stopword” list removes high frequency non- informative words, e.g. “the”, “a”, “of”, “for”.
  • 25. Our model: Assume a Poisson word model for the probability of observing a given word in a document, where n = count of word occurrences, N = total number of words in a set of training abstracts, and f = dictionary word frequency. Using the 500 most significant words, we constructed a discriminatory word list of 80 words for scoring abstracts.
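Assuming the standard Poisson form, with expected count Nf for a word of dictionary frequency f among N words, the probability of observing n occurrences can be written as:

```latex
P(n \mid N, f) = \frac{(N f)^{n} \, e^{-N f}}{n!}
```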
  • 26. Does the abstract discuss protein complexes or not? Calculate a log-likelihood score for each individual abstract by summing over all discriminatory words, where F_{N,i} is the dictionary frequency of discriminatory word i and F_{I,i} is the frequency of discriminatory word i in the training abstracts.
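A hedged sketch of how such a score could be computed. It assumes the score is the sum, over discriminatory words found in the abstract, of the word count times log(F_{I,i} / F_{N,i}); the exact scoring formula is not given here, so this form, the function name and the example frequencies are all assumptions for illustration.

```python
import math
import re
from collections import Counter

def log_likelihood_score(abstract, f_train, f_dict):
    """Sum n_i * log(F_I,i / F_N,i) over discriminatory words in the abstract.

    f_train: word -> frequency in the training abstracts (F_I,i)
    f_dict:  word -> dictionary (background) frequency (F_N,i)
    Assumed simplification: no document-length term is included.
    """
    counts = Counter(re.findall(r"[a-z0-9]+", abstract.lower()))
    score = 0.0
    for word, f_i in f_train.items():
        n = counts.get(word, 0)
        if n and word in f_dict:
            score += n * math.log(f_i / f_dict[word])
    return score

# Hypothetical frequencies, for illustration only.
F_TRAIN = {"complex": 0.02, "subunit": 0.01, "patient": 0.001}
F_DICT = {"complex": 0.005, "subunit": 0.001, "patient": 0.004}
```

With these made-up frequencies, an abstract mentioning "complex" and "subunit" scores above zero and one mentioning only "patient" scores below zero, so sorting abstracts by score gives a relevance ranking.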
  • 27. Our system: Our system consists of the following components: (1) a set of PubMed abstracts from 1965 - 2008 retrieved with the query “protein complex”; (2) a Bayesian probabilistic method for calculating an article's relevance in discussing protein complexes, using word occurrences found in the training set; (3) a method for extracting gene/protein names using a biological named entity recognizer, ABNER [6]; (4) a Wiki resource to enable scientists to evaluate and revise the data.
  • 28. Query terms used for construction of protein complex abstract data sets (including abstracts published 1965 - 2008)
      Query term                                      No. of abstracts retrieved
      “protein complex”                               499918
      “cell cycle” AND “protein complex”              19360
      “chromatin remodeling” AND “protein complex”    238
      “DNA repair” AND “protein complex”              325
  • 29. Validation of Bayesian classification of PubMed abstracts using hand-curated data sets
      Data set               Positives  Negatives  Accuracy  Precision  Recall  F-measure
      Apoptosis              138        94         0.89      0.93       0.89    0.91
      Cell cycle             600        702        0.96      0.97       0.94    0.96
      Chromatin remodelling  155        81         0.83      0.93       0.84    0.88
      DNA repair             203        122        0.9       0.96       0.88    0.92
      Accuracy = (TP+TN)/(TP+FP+FN+TN)
      Precision = TP/(TP+FP)
      Recall = TP/(TP+FN)
      F-measure = 2 * Precision * Recall / (Precision + Recall)
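The four formulas above can be sketched in a few lines of Python. The function name and the confusion-matrix counts in the usage note are hypothetical; the slide reports only the resulting metrics, not the underlying TP/FP/FN/TN values.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and F-measure from confusion-matrix counts.

    Ratios with a zero denominator are returned as 0.0 (an assumed convention).
    """
    def ratio(num, den):
        return num / den if den else 0.0

    accuracy = ratio(tp + tn, tp + fp + fn + tn)
    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    f_measure = ratio(2 * precision * recall, precision + recall)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
    }
```

For example, `classification_metrics(tp=90, fp=10, fn=10, tn=90)` uses made-up counts in which every ratio reduces to 90/100.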
  • 30. Performance Evaluation [Figure panels: i. Apoptosis; ii. Cell cycle; iii. Chromatin remodeling; iv. DNA repair]
  • 31. A text-based Protein Assay: Named Entity Recognition for identifying gene and protein names. A challenging task due to the irregularities and ambiguities in gene and protein nomenclature. Synonyms and versioning of dbxref.
  • 32. Online Annotation Tool for PubMed abstracts. Biological entities recognised: Protein, DNA, RNA, Cell line, Cell type.
  • 33. PMID:10871607
      SentenceId  Cscore  ABNER  GeneTagger  KEX   Sentence
      1           1.5     0      0.12        0.08  The Rad51 protein in eukaryotic cells is a structural and functional homolog of Escherichia coli RecA with a role in DNA repair and genetic recombination.
      2           0.62    0.06   0.06        0.12  Several proteins showing sequence similarity to Rad51 have previously been identified in both yeast and human cells.
      3           -0.31   0.05   0.1         0.15  In Saccharomyces cerevisiae, two of these proteins, Rad55p and Rad57p, form a heterodimer that can stimulate Rad51-mediated DNA strand exchange.
      4           -1.11   0      0.12        0.12  Here, we report the purification of one of the representatives of the RAD51 family in human cells.
      5           1.25    0      0.14        0.17  We demonstrate that the purified RAD51L3 protein possesses single-stranded DNA binding activity and DNA-stimulated ATPase activity, consistent with the pre
      6           2.01    0.06   0.17        0.22  We have identified a protein complex in human cells containing RAD51L3 and a second RAD51 family member, XRCC2.
      7           3.47    0.13   0.13        0.2   By using purified proteins, we demonstrate that the interaction between RAD51L3 and XRCC2 is direct.
      8           0.66    0.06   0.06        0.06  Given the requirements for XRCC2 in genetic recombination and protection against DNA-damaging agents, we suggest that the complex of RAD51L3 and XRC
      [Chart: Cscore and ABNER, GeneTagger and KEX scores plotted against Sentence Id]
  • 34. Syntax Parsing - semantic relations among words
  • 35. Example Scenario. Q. What are the members of the FEAR complex? 1. Keyword: FEAR. 2. List of abstracts relevant to the FEAR protein complex. [Diagram: FEAR complex (cdc14, esp1, cdc5); similar article: CONDENSIN, explicit sentence (smc2-8 and smc4-1); FEAR complex (cdc14, esp1, cdc5, spo12, fob1), explicit sentences; validation; ProteinComplexDB]
  • 36. Conclusion: We have taken an initial step towards developing a “Recommender system” for ranking abstracts by relevance to protein complexes and a curation tool for extracting protein complexes from the literature. We are in the process of constructing a database of protein complexes and linking protein complexes to pathways and disease phenotypes. The ultimate aim is to understand the biological mechanisms behind complex disease phenotypes.
  • 37. Acknowledgements Zhang Zhang and lab members: • Ivan Borozan • Dong (Derek) Dong • Matthew Fagnani • Yunchen Gong • Sumedha Gunewardena • Gabe Musso • Renqiang Min • Sanaa Mahmood • Jingjing Li • Yu Liu • Apostolos Lydakis • Lee Zamparo