Research presentation-wd

AbstractDB & ProteinComplexDB:
A database of protein complexes
and their abstracts

Wagied Davids, PhD
Banting & Best Dept. of Medical Research,
Dept. of Medical Genetics and Microbiology,
Donnelly CCBR, 160 College Street,
University of Toronto

My Expertise
Comparative Evolutionary Genomics
Detection and identification of sequence homologues
Analysis of mutation rates (dN/dS) and single nucleotide polymorphisms (SNPs)
Horizontal Gene Transfer in Bacteria
Graph-theoretic analysis of biological and literature-derived gene networks
Analysis of Sequence-Structure of functional variants
Text-mining:
Construction of literature-derived pathways and networks involving disease
genes.
Analysis of microarray gene expression:
Differential gene expression
Gene-Drug profiles
Gene regulation network construction.
Protein Structure - Function analysis of prioritized candidate disease genes by mapping
mutation hotspots onto 3D protein structures.

Presentation Overview

AbstractDB – database of abstracts pertaining to protein complexes
Online PubMed abstract curation tool.
ProteinComplexDB – database of extracted protein complexes

Existing Protein Complex Databases

Only two high-quality, human-curated protein complex databases are available.
Both are products of MIPS (Munich Information Centre for Protein Sequences, Germany)
(http://mips.gsf.de/genre/proj/yeast/)
MIPS Yeast Protein Complex catalogue
CORUM: Mammalian Protein Complex catalogue.

Importance of Network Biology, Protein
Complexes and Disease
Proteins rarely function in isolation.
Instead, proteins participate in:
protein interactions e.g. phosphorylation
form part of protein complexes e.g. mre11-rad50-nbs1
act together forming pathways e.g. signalling cascades
From a Systems Biology perspective:
“Cancer – aberrant state of a biological network.”

Fanconi Anaemia Core Protein Complex
FA core protein complex:(FANCA, B, C, E, F, G, M and L)

Ref: Youds et al. (2008) Mutation Research doi:10.1016/j.mrfmm.2008.11.007

Fanconi anaemia
FA is a severe human recessive disorder.
Defects in FA genes cause chromosomal aberrations and sensitivity to DNA inter-strand cross-links (ICLs).
13 FA proteins may constitute a pathway for repairing DNA inter-strand cross-links.
Evolutionary conservation of FA genes from humans to worms and
zebrafish.
C. elegans Functional homologs:
brc-2 (FANCD1/BRCA2);
fcd-2 (FANCD2);
dog-1 (FANCJ/BRIP1);
Gene deletion in C. elegans (worm) results in lethality, ICL sensitivity,
sterility.

Project Conception
1. Relevant for Protein Complexes and their interactions
2. Would be good if it identify gene/protein names for me!
3. ....and good Experimental methods too!
4. ...mmh ... If it could search & validate my curations... I would not do anything....!

Q. Which search engine for PROTEIN COMPLEXES?

Comparison criteria
Relevance:
Protein complexes and protein interactions

Named Entity Recognition (NER):
genes, proteins, cell lines, cell types, experimental
methods, discriminatory words

User-interactivity (UI)
Construct curations of protein complexes
Validate by searching against known protein
complex and protein interaction databases.

Q. Feasibility
Q1. How much information is contained within
unstructured text from PubMed abstracts for
extracting protein complexes?
Q2. In the absence of complete knowledge, is a
perfect solution desired or a good starting
point?
Q3. What about large-scale high-throughput
studies which are not referenced in abstracts or
text documents?

CORUM protein complex database

[Figure: bar chart of the count of PubMed identifiers (y-axis, 0 to 1200) by category (x-axis: SSS, MSS, LSS)]

SSS: 2-5 protein complex members
MSS: 6-10 protein complex members
LSS: >= 11 protein complex members

Small-scale studies (SSS) account for 76% (1024/1346) of protein
complexes derived from the literature-curated CORUM database.
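
The size categories and the reported share can be sketched as follows. The function name `size_category` is hypothetical; only the thresholds and the 1024/1346 figure come from the slide.

```python
def size_category(n_members: int) -> str:
    """Map a complex's member count to the slide's SSS/MSS/LSS category."""
    if n_members < 2:
        raise ValueError("a protein complex needs at least 2 members")
    if n_members <= 5:
        return "SSS"   # 2-5 members
    if n_members <= 10:
        return "MSS"   # 6-10 members
    return "LSS"       # 11 or more members

# Share of small-scale studies reported on the slide.
sss_share = 1024 / 1346
print(f"{sss_share:.0%}")  # prints 76%
```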

Manual curation – Steps involved

Find all articles related to protein complexes.
Identify gene/protein names by eye.
Identify terms establishing a relationship between proteins.
Infer whether or not to include a new member in an existing protein complex.

Q. Why not use the PubMed Search Engine?
The PubMed search engine's retrieval model is called pmra.
pmra is a topic-based content similarity model.
The PubMed search engine focusses on “relatedness” rather than relevance,
i.e. the probability that a user wants to examine a particular document given known interest in another document.

From Document clusters
to Protein Clusters

Corpus of Documents → Document Clusters → Protein Clusters (Protein Complexes & their Interactions)

Aim
Use literature-derived information to:
Rank documents according to protein complex relevance score.
Assign confidence scores to protein interactions.
Provide an updated catalogue of protein complexes
Our initial step towards our goal is to develop a “Recommender system” for
ranking abstracts with relevance to protein complexes.

Our hypothesis
Abstracts discussing protein complexes can be distinguished from non-
relevant abstracts based on the frequency distribution of words in a hand-
curated data set on protein complexes versus a data set of background
word frequencies

Our method

Our method is based on a Naïve Bayesian classifier using
discriminatory words [5].
Discriminatory words - a selected subset of high scoring words
that characterize abstracts discussing protein complexes.
The discriminatory words include both high and low frequency
words that distinguish abstracts discussing protein complexes.
Our use of a “stopword” list removes high frequency non-
informative words, e.g. “the”, “a”, “of”, “for”.
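
A minimal sketch of the stopword step, assuming simple lowercase tokenization on alphanumeric runs. The function name `word_counts` is hypothetical, and the stopword set holds only the four examples named on the slide, not the full list used in the work.

```python
import re
from collections import Counter

# Illustrative stopword list: only the four examples given on the slide.
STOPWORDS = {"the", "a", "of", "for"}

def word_counts(abstract: str) -> Counter:
    """Lowercase the text, split it into alphanumeric tokens, drop stopwords."""
    tokens = re.findall(r"[a-z0-9]+", abstract.lower())
    return Counter(t for t in tokens if t not in STOPWORDS)

counts = word_counts("Purification of the RAD51L3 protein complex for the assay")
```

The resulting counts are the raw material for per-word frequencies in the training and background sets.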

Our model
Assume a Poisson word model:

Probability of observing a given word in a document:
n = Count of word occurrences
N = Total number of words in a set of training abstracts
f = Dictionary word frequency
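
The formula itself did not survive on the slide. As a sketch only, the code below assumes the standard Poisson form with expected count N * f, i.e. P(n | N, f) = exp(-N*f) * (N*f)^n / n!. The function name `poisson_word_prob` and the example values are hypothetical.

```python
import math

def poisson_word_prob(n: int, N: int, f: float) -> float:
    """Probability of seeing a word n times, assuming a Poisson mean of N * f."""
    lam = N * f  # expected number of occurrences (assumption, see lead-in)
    if lam == 0:
        return 1.0 if n == 0 else 0.0
    # Evaluate in log space so large n does not overflow the factorial.
    return math.exp(n * math.log(lam) - lam - math.lgamma(n + 1))

# Hypothetical example: a word with dictionary frequency 0.002 seen 3 times in 1000 words.
p = poisson_word_prob(n=3, N=1000, f=0.002)
```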

Using the 500 most significant words, we constructed
a discriminatory word list of 80 words for scoring abstracts.

Does the abstract discuss protein
complexes or Not?

Calculate a log-likelihood score for each abstract by summing over
all discriminatory words.

FN,i: dictionary frequency of discriminatory word i

FI,i: frequency of discriminatory word i in the training abstracts
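
The scoring formula is missing from the slide. One plausible form, assumed here and not confirmed by the source, adds ln(FI,i / FN,i) for every occurrence of discriminatory word i in the abstract, so that a positive total favours the protein-complex class. The function name, the two words, and their frequencies are hypothetical.

```python
import math
from collections import Counter

def log_likelihood_score(tokens, f_train, f_dict):
    """Sum ln(F_I,i / F_N,i) over occurrences of discriminatory words (assumed form)."""
    counts = Counter(tokens)
    score = 0.0
    for word, f_i in f_train.items():
        # Words absent from the abstract have count 0 and contribute nothing.
        score += counts[word] * math.log(f_i / f_dict[word])
    return score

# Hypothetical frequencies for two discriminatory words.
f_train = {"complex": 0.02, "subunit": 0.01}   # training-set frequencies (F_I,i)
f_dict = {"complex": 0.005, "subunit": 0.001}  # dictionary frequencies (F_N,i)
score = log_likelihood_score(["complex", "subunit", "binds"], f_train, f_dict)
```

Ranking abstracts by this score in descending order would give the recommender-style ordering described in the Aim.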

Our system

Our system consists of the following components:
A set of PubMed abstracts from 1965 - 2008 retrieved with the
query “protein complex”;
A Bayesian probabilistic method for calculating an article's
relevance in discussing protein complexes, using word occurrences
found in the training set;
A method for extracting gene/protein names using a biological
named entity recognizer – ABNER [6];
A Wiki resource to enable scientists to evaluate and revise the data.

Query terms used for construction of protein
complex abstract data sets

Query term                                      No. of abstracts retrieved
“protein complex”                               499918
“cell cycle” AND “protein complex”              19360
“chromatin remodeling” AND “protein complex”    238
“DNA repair” AND “protein complex”              325

(including abstracts published 1965 - 2008)

Validation of Bayesian classification of PubMed abstracts
using hand-curated data sets

Data set               Positives  Negatives  Accuracy  Precision  Recall  F-measure
Apoptosis              138        94         0.89      0.93       0.89    0.91
Cell cycle             600        702        0.96      0.97       0.94    0.96
Chromatin remodelling  155        81         0.83      0.93       0.84    0.88
DNA repair             203        122        0.9       0.96       0.88    0.92

Accuracy = (TP+TN)/(TP+FP+FN+TN)
Precision = TP/(TP+FP)
Recall = TP/(TP+FN)
F-measure = 2 * Precision * Recall / (Precision + Recall)
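
The four definitions translate directly into code. The function name `classification_metrics` and the confusion-matrix counts in the example are hypothetical; they are not taken from the validation table.

```python
def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Compute accuracy, precision, recall and F-measure from confusion counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
    }

# Hypothetical counts for illustration only.
m = classification_metrics(tp=90, fp=10, fn=20, tn=80)
```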

Performance Evaluation

[Figure: performance evaluation panels for i. Apoptosis, ii. Cell cycle, iii. Chromatin remodeling, iv. DNA repair]

A text-based Protein Assay
Named Entity Recognition for identifying gene and protein names
A challenging task due to the irregularities and ambiguities in gene and protein nomenclature.
Synonyms and versioning of database cross-references (dbxref).

Online Annotation Tool for PubMed abstracts

Biological entities recognised:
Protein
DNA
RNA
CELL LINE
CELL TYPE

PMID:10871607
SentenceId Cscore ABNER GeneTagger KEX Sentence
1 1.5 0 0.12 0.08 The Rad51 protein in eukaryotic cells is a structural and functional homolog of Escherichia coli RecA with a role in DNA repair and genetic recombination.
2 0.62 0.06 0.06 0.12 Several proteins showing sequence similarity to Rad51 have previously been identified in both yeast and human cells.
3 -0.31 0.05 0.1 0.15 In Saccharomyces cerevisiae, two of these proteins, Rad55p and Rad57p, form a heterodimer that can stimulate Rad51-mediated DNA strand exchange.
4 -1.11 0 0.12 0.12 Here, we report the purification of one of the representatives of the RAD51 family in human cells.
5 1.25 0 0.14 0.17 We demonstrate that the purified RAD51L3 protein possesses single-stranded DNA binding activity and DNA-stimulated ATPase activity, consistent with the pre
6 2.01 0.06 0.17 0.22 We have identified a protein complex in human cells containing RAD51L3 and a second RAD51 family member, XRCC2.
7 3.47 0.13 0.13 0.2 By using purified proteins, we demonstrate that the interaction between RAD51L3 and XRCC2 is direct.
8 0.66 0.06 0.06 0.06 Given the requirements for XRCC2 in genetic recombination and protection against DNA-damaging agents, we suggest that the complex of RAD51L3 and XRC

[Figure: Cscore (left axis, -2 to 4) and ABNER, GeneTagger and KEX scores (right axis, 0 to 0.25) plotted against Sentence Id (1 to 8)]
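
As a small illustration, the Cscore column for PMID:10871607 can be used to pick out candidate sentences, assuming that a higher Cscore marks a sentence as more relevant to protein complexes (the slide does not define the score). The function name `top_sentences` is hypothetical.

```python
# Cscore per sentence id, copied from the PMID:10871607 table.
cscores = {1: 1.5, 2: 0.62, 3: -0.31, 4: -1.11, 5: 1.25, 6: 2.01, 7: 3.47, 8: 0.66}

def top_sentences(scores: dict, k: int = 3) -> list:
    """Return the ids of the k highest-scoring sentences, best first."""
    return sorted(scores, key=scores.get, reverse=True)[:k]

print(top_sentences(cscores))  # prints [7, 6, 1]
```

Sentences 7 and 6 are the ones that state the RAD51L3 and XRCC2 interaction and complex explicitly.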

Syntax Parsing - semantic relations among words

Example Scenario
Q. What are the members of the FEAR complex?

1. Keyword: FEAR
2. List of abstracts relevant to the FEAR protein complex

FEAR complex: cdc14, esp1, cdc5 (explicit sentence)
Similar article: CONDENSIN: smc2-8 and smc4-1

FEAR complex: cdc14, esp1, cdc5, spo12, fob1 (explicit sentences)

Validation
ProteinComplexDB

Conclusion
We have undertaken an initial step towards developing:
a “Recommender system” for ranking abstracts with relevance
to protein complexes.
a Curation Tool for extracting Protein Complexes from
literature
We are in the process of:
Constructing a database of Protein Complexes, and
Linking Protein Complexes to Pathways and Disease
phenotypes.

Ultimate aim of understanding biological mechanisms behind
complex Disease phenotypes

Acknowledgements
Zhang Zhang and lab members:
• Ivan Borozan
• Dong (Derek) Dong
• Matthew Fagnani
• Yunchen Gong
• Sumedha Gunewardena
• Gabe Musso
• Renqiang Min
• Sanaa Mahmood
• Jingjing Li
• Yu Liu
• Apostolos Lydakis
• Lee Zamparo
