A Discrete Optimization Approach for SVD Best Truncation Choice based on ROC Curves

IEEE BIBE 2013

13th IEEE International Conference on
Bioinformatics and Bioengineering,
11th November, Chania, Greece, EU

A Discrete Optimization Approach
for SVD Best Truncation Choice based
on ROC Curves
Davide Chicco, Marco Masseroli
davide.chicco@elet.polimi.it

Summary
1. The context & the problem
• Biomolecular annotations

• Prediction of biomolecular annotations
• SVD (Singular Value Decomposition)
• SVD Truncation

2. The proposed solution
• ROC Area Under the Curve comparison
• Truncation level choices
3. Evaluation
• Evaluation data set & results

4. Conclusions

Biomolecular annotations
• The concept of annotation: association of nucleotide or amino
acid sequences with useful information describing their features

• This information is expressed through controlled vocabularies,
sometimes structured as ontologies, where every controlled
term of the vocabulary is associated with a unique
alphanumeric code
• The association of such a code with a gene or protein ID
constitutes an annotation
[Figure: a Gene / Protein linked to a Biological function feature through an Annotation (gene2bff)]

Biomolecular annotations (2)
• The association of a piece of information or a feature with a
gene or protein ID constitutes an annotation

• Annotation example:
• gene: GD4
• feature: “is present in the mitochondrial membrane”

Prediction of biomolecular annotations
• Many available annotations in different databanks
• However, available annotations are incomplete
• Only a few of them represent highly reliable, human-curated
information

• To support and speed up the time-consuming curation process,
prioritized lists of computationally predicted annotations
are extremely useful
• These lists could be generated by software that implements
Machine Learning algorithms



Annotation prediction through
Singular Value Decomposition – SVD

• Annotation matrix $A \in \{0, 1\}^{m \times n}$
− m rows: genes / proteins
− n columns: annotation terms

A(i,j) = 1 if gene / protein i is annotated to term j or to any
descendant of j in the considered ontology structure (true
path rule)
A(i,j) = 0 otherwise (it is unknown)
          term01   term02   term03   term04   …   termN
gene01    0        0        0        0        …   0
gene02    0        1        1        0        …   1
…         …        …        …        …        …   …
geneM     0        0        0        0        …   0
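As a minimal sketch of how such a matrix could be filled in under the true path rule, the Python snippet below propagates each direct annotation to all ancestors of the term. The function name `build_annotation_matrix`, the `parents` mapping and the toy gene and term names are assumptions made for illustration; they are not taken from the slides.

```python
import numpy as np

def build_annotation_matrix(genes, terms, direct, parents):
    """Build A in {0,1}^(m x n), propagating annotations to ancestor terms.

    `direct` is a list of (gene, term) pairs; `parents` maps a term to its
    parent terms in the ontology (hypothetical input format).
    """
    gene_idx = {g: i for i, g in enumerate(genes)}
    term_idx = {t: j for j, t in enumerate(terms)}
    A = np.zeros((len(genes), len(terms)), dtype=np.int8)
    for gene, term in direct:
        stack, seen = [term], set()
        while stack:  # walk up the ontology from the annotated term
            t = stack.pop()
            if t in seen:
                continue
            seen.add(t)
            A[gene_idx[gene], term_idx[t]] = 1
            stack.extend(parents.get(t, ()))
    return A

# Toy data (assumed): gene02 is directly annotated to the most specific term.
genes = ["gene01", "gene02"]
terms = ["root", "membrane", "mitochondrial membrane"]
parents = {"mitochondrial membrane": ["membrane"], "membrane": ["root"]}
direct = [("gene02", "mitochondrial membrane")]
A = build_annotation_matrix(genes, terms, direct, parents)
```

With this toy input, gene02 ends up annotated to the specific term and to both of its ancestors, while gene01 keeps an all-zero (unknown) row.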



Compute SVD:
$A = U \Sigma V^T$

Compute reduced rank approximation:
$A_k = U_k \Sigma_k V_k^T$

• An annotation prediction is performed by computing a reduced
rank approximation Ak of the annotation matrix A
(where 0 < k < r, with r the number of non zero singular values
of A, i.e. the rank of A)



Singular Value Decomposition – SVD (2)
• Ak contains real valued entries related to the likelihood that
gene i shall be annotated to term j
For a certain real threshold τ:
if Ak(i,j) > τ, gene i is predicted to be annotated to term j
− The threshold τ can be chosen in order to obtain the
best predicted annotations [Khatri et al., 2005]



Singular Value Decomposition – SVD (3)
• It is possible to rewrite the SVD in an equivalent form,
such that the predicted annotation profile is given by:
$a_{k,i}^T = a_i^T V_k V_k^T$
where $a_{k,i}^T$ is a row vector containing the predictions for gene i
• Note that $V_k$ depends on the whole set of genes
• Indeed, the columns of $V_k$ are a set of eigenvectors of the
global term-to-term correlation matrix $T = A^T A$, estimated from
the whole set of available annotations
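A short NumPy sketch can make the equivalence concrete: the truncated product of the first k singular triplets and the projection of A onto the first k right singular vectors give the same matrix. The function name `truncated_svd_prediction` and the small toy matrix are assumptions for illustration only.

```python
import numpy as np

def truncated_svd_prediction(A, k):
    """Return the rank-k approximation A_k and the matrix V_k."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    A_k = (U[:, :k] * s[:k]) @ Vt[:k, :]   # U_k * Sigma_k * V_k^T
    V_k = Vt[:k, :].T                      # first k right singular vectors
    return A_k, V_k

# Toy annotation matrix (assumed): 5 genes x 4 terms.
A = np.array([[1, 1, 0, 0],
              [1, 1, 1, 0],
              [0, 1, 1, 1],
              [0, 0, 1, 1],
              [1, 0, 0, 1]], dtype=float)

A_k, V_k = truncated_svd_prediction(A, k=2)
projected = A @ V_k @ V_k.T                # row i is a_i^T V_k V_k^T
```

Both `A_k` and `projected` hold real-valued scores, so each row can be read as the predicted annotation profile of one gene.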



Evaluation of the prediction
To evaluate the prediction, we compare each A(i,j) element to its
corresponding Ak(i,j) for each real threshold τ, with 0 ≤ τ ≤ 1.0

• if A(i,j) = 1 & Ak(i,j) > τ: AC: Annotation Confirmed (AC <- AC+1)
• if A(i,j) = 1 & Ak(i,j) ≤ τ: AR: Annotation to be Reviewed (AR <- AR+1)
• if A(i,j) = 0 & Ak(i,j) ≤ τ: NAC: No Annotation Confirmed (NAC <- NAC+1)
• if A(i,j) = 0 & Ak(i,j) > τ: AP: Annotation Predicted (AP <- AP+1)
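The four counters can be sketched in a few lines of NumPy for a single threshold τ. The function name `evaluate_prediction` and the toy matrices are assumptions for illustration; the slides do not show an implementation.

```python
import numpy as np

def evaluate_prediction(A, A_k, tau):
    """Count AC, AR, NAC and AP for one threshold tau."""
    annotated = A == 1      # input annotation present
    predicted = A_k > tau   # prediction above the threshold
    return {
        "AC": int(np.sum(annotated & predicted)),
        "AR": int(np.sum(annotated & ~predicted)),
        "NAC": int(np.sum(~annotated & ~predicted)),
        "AP": int(np.sum(~annotated & predicted)),
    }

# Toy input and prediction matrices (assumed values).
A = np.array([[1, 0, 0],
              [1, 1, 0]])
A_k = np.array([[0.9, 0.6, 0.1],
                [0.3, 0.8, -0.05]])
counts = evaluate_prediction(A, A_k, tau=0.5)
```

The four counters always sum to the number of entries of A, since every (gene, term) pair falls into exactly one case.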



SVD truncation
• The main problem of truncated SVD: how to choose the
truncation?
• Where to truncate?

How to choose the k here?


New concept: Receiver Operating Characteristic
(ROC) curve
Starting from the annotation prediction evaluation factors we just
introduced:

                                   Input   Output
• AC: Annotation Confirmed         Yes     Yes
• AR: Annotation to be Reviewed    Yes     No
• NAC: No Annotation Confirmed     No      No
• AP: Annotation Predicted         No      Yes

We can design the Receiver Operating Characteristic curves for
every prediction:

• On the x, the annotation to be reviewed rate: $\frac{AR}{AC + AR}$
• On the y, the annotation predicted rate: $\frac{AP}{AP + NAC}$


New concept: Receiver Operating Characteristic
(ROC) curve (2)

• On the y, the annotation confirmed rate: $\frac{AC}{AC + AR}$
• On the x, the annotation predicted rate: $\frac{AP}{AP + NAC}$
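As a rough sketch, the ROC points with these axes can be obtained by sweeping the threshold τ, and the area under them approximated with the trapezoidal rule. The function names `roc_points` and `auc_trapezoid`, the choice of thresholds and the toy matrices are assumptions; the sketch also assumes A contains at least one 1 and at least one 0, so that both rates are defined.

```python
import numpy as np

def roc_points(A, A_k, thresholds):
    """Return (AP rate, AC rate) pairs, one per threshold."""
    annotated = A == 1
    xs, ys = [], []
    for tau in thresholds:
        predicted = A_k > tau
        ac = np.sum(annotated & predicted)
        ar = np.sum(annotated & ~predicted)
        ap = np.sum(~annotated & predicted)
        nac = np.sum(~annotated & ~predicted)
        xs.append(ap / (ap + nac))   # x axis: annotation predicted rate
        ys.append(ac / (ac + ar))    # y axis: annotation confirmed rate
    return np.array(xs), np.array(ys)

def auc_trapezoid(xs, ys):
    """Trapezoidal area under the points, after sorting by x (then y)."""
    order = np.lexsort((ys, xs))
    x, y = xs[order], ys[order]
    return float(np.sum(np.diff(x) * (y[1:] + y[:-1]) / 2.0))

A = np.array([[1, 0, 0],
              [1, 1, 0]])
A_k = np.array([[0.9, 0.6, 0.1],
                [0.3, 0.8, -0.05]])
xs, ys = roc_points(A, A_k, thresholds=[0.0, 0.5, 1.0])
auc = auc_trapezoid(xs, ys)
```

With only three thresholds the curve is very coarse; a finer grid of τ values gives a smoother curve at the cost of more comparisons.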

SVD truncation choice
Algorithm:
1) Choose some possible truncation levels
2) Compute the Receiver Operating Characteristic for each
SVD prediction of those truncation levels
3) Compute the Area Under the Curve (AUC) of each ROC
4) Choose the truncation level of the ROC that has
maximum AUC
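The four steps above can be sketched as a small NumPy routine: compute the SVD once, rebuild A_k for each candidate level, score it with a ROC AUC and keep the level with the highest score. The names `roc_auc` and `best_truncation`, the random toy matrix, the candidate levels and the threshold grid are all assumptions; the sketch scores each A_k against A itself and assumes A contains both 0 and 1 entries.

```python
import numpy as np

def roc_auc(A, A_k, thresholds):
    """Trapezoidal AUC of (AP rate, AC rate) points over the thresholds."""
    annotated = A == 1
    pts = []
    for tau in thresholds:
        predicted = A_k > tau
        ac_rate = np.sum(annotated & predicted) / np.sum(annotated)
        ap_rate = np.sum(~annotated & predicted) / np.sum(~annotated)
        pts.append((ap_rate, ac_rate))
    pts.sort()
    x, y = np.array(pts).T
    return float(np.sum(np.diff(x) * (y[1:] + y[:-1]) / 2.0))

def best_truncation(A, candidates, thresholds):
    """Return the candidate k with maximum AUC, plus all the AUCs."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)  # computed only once
    aucs = {}
    for k in candidates:
        A_k = (U[:, :k] * s[:k]) @ Vt[:k, :]
        aucs[k] = roc_auc(A, A_k, thresholds)
    best_k = max(aucs, key=aucs.get)
    return best_k, aucs

# Toy data (assumed): a random sparse binary matrix, 30 genes x 20 terms.
rng = np.random.default_rng(0)
A = (rng.random((30, 20)) < 0.3).astype(float)
best_k, aucs = best_truncation(A, candidates=[2, 5, 10, 15],
                               thresholds=np.linspace(0.0, 1.0, 21))
```

Reusing one full decomposition for all candidates is a convenience of this sketch for small matrices; it does not remove the cost of scoring many truncation levels on large annotation matrices.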



SVD truncation choice (2)
[Figure: the algorithm steps above, annotated with the callout "Quite easy!"]



SVD truncation choice (3)
[Figure: the algorithm steps above, annotated with the callout "Quite challenging!"]



Minimum AUC between all the ROCs of various
truncation levels
We cannot compute the SVD, its ROC and its AUC for every
truncation value, because it would be too expensive (in time
and resources).
Algorithm:
1) Since the matrix A(i,j) has m rows (genes) and n columns
(annotation terms), we take p = min(m, n)
2) Since r ≤ p is the number of non-zero singular values
along the diagonal of Σ, the best truncation value is in the
interval [1; r]
3) newInterval = {1, r}
4) k = firstElement(newInterval)
5) step = length(newInterval) / numStep


Minimum AUC between all the ROCs of various
truncation levels (2)
4. We sample all the N non-null singular values at constant
intervals of size step (step = 10% * N)

5. For every sampled singular value, we compute the truncated SVD
and its corresponding ROC AUC, for ACrate in [0%, 100%]
and APrate in [0%, 1%]
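A minimal sketch of these two steps follows: sampling the truncation levels at a constant step, and restricting the AUC to the region where APrate is at most 1%. The function names `sample_truncation_levels` and `partial_auc`, the linear interpolation at the 1% boundary and the toy diagonal curve are assumptions for illustration.

```python
import numpy as np

def sample_truncation_levels(n_singular, fraction=0.10):
    """Sample levels 1..N with a constant step of about fraction * N."""
    step = max(1, int(round(fraction * n_singular)))
    return list(range(1, n_singular + 1, step)), step

def partial_auc(x, y, x_max=0.01):
    """Trapezoidal AUC of the curve restricted to x in [0, x_max]."""
    order = np.lexsort((y, x))
    x = np.asarray(x, dtype=float)[order]
    y = np.asarray(y, dtype=float)[order]
    y_end = np.interp(x_max, x, y)        # curve height at the cut-off
    keep = x < x_max
    x = np.append(x[keep], x_max)
    y = np.append(y[keep], y_end)
    return float(np.sum(np.diff(x) * (y[1:] + y[:-1]) / 2.0))

levels, step = sample_truncation_levels(100)
area = partial_auc([0.0, 0.5, 1.0], [0.0, 0.5, 1.0], x_max=0.01)
```

Because the x range is only 1% wide, the resulting areas are small numbers; what matters for the search is how they compare across truncation levels.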



Given the first AUC, if the AUCs of all three subsequent
samples decrease, we take its index for the next zoom step

[Figure: sampled AUC values, with the Local Best Index marked and the zoom around it]

This means we found a local maximum.



If the AUC differences of the last three singular values are
lower than gamma = 10%, we take that index for the next zoom step

[Figure: sampled AUC values, with the Chosen Index marked and the zoom around it]

This means that the AUCs do not grow enough



Once we have chosen the index where to zoom, we re-run the
algorithm in the sub-interval

[Figure: zoom into the chosen sub-interval]

until one of the previously described conditions is satisfied
or the maximum number of zooms (numZoom = 4) is reached
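One possible reading of the zoom procedure is sketched below. It is not the authors' implementation: the function `zoom_search`, the helper names, the priority given to the local-maximum rule over the gamma rule, the use of relative AUC differences, the fallback to the best sample, and the synthetic `toy_auc` function (which stands in for the expensive SVD, ROC and AUC computation) are all assumptions.

```python
def _local_maximum(vals):
    """Index i whose three following samples decrease one after another."""
    for i in range(len(vals) - 3):
        if vals[i] > vals[i + 1] > vals[i + 2] > vals[i + 3]:
            return i
    return None

def _plateau(vals, gamma):
    """Index i where the last three samples differ by less than gamma (relative)."""
    for i in range(2, len(vals)):
        a, b, c = vals[i - 2], vals[i - 1], vals[i]
        if abs(b - a) < gamma * abs(a) and abs(c - b) < gamma * abs(b):
            return i
    return None

def zoom_search(auc_of, r, num_step=10, gamma=0.10, num_zoom=4):
    """Search [1, r] for a good truncation level by repeated zooming."""
    lo, hi = 1, r
    cache = {}                       # AUC values already computed

    def auc(k):
        if k not in cache:
            cache[k] = auc_of(k)
        return cache[k]

    chosen_k = lo
    for _ in range(num_zoom + 1):    # first pass plus at most num_zoom zooms
        step = max(1, (hi - lo) // num_step)
        ks = list(range(lo, hi + 1, step))
        vals = [auc(k) for k in ks]
        chosen = _local_maximum(vals)
        if chosen is None:
            chosen = _plateau(vals, gamma)
        if chosen is None:           # assumed fallback: best sampled value
            chosen = max(range(len(vals)), key=vals.__getitem__)
        chosen_k = ks[chosen]
        if step == 1:                # cannot zoom any further
            break
        lo = ks[max(chosen - 1, 0)]
        hi = ks[min(chosen + 1, len(ks) - 1)]
    return chosen_k, cache

def toy_auc(k):
    """Synthetic single-peak AUC profile with its maximum at k = 137."""
    return 1.0 - ((k - 137) / 500.0) ** 2

best_k, evaluated = zoom_search(toy_auc, r=500)
```

On this synthetic profile the search reaches the peak while evaluating only a small fraction of the 500 possible levels, which is the saving the zoom strategy aims for; on real, noisier AUC profiles the rules may stop at a local maximum instead.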


Example
Dataset: annotations of Gallus gallus genes to Biological
Process Gene Ontology terms



Results
• To evaluate the performance of our method, we used
annotations of
− terms: Biological process (BP), Cellular component (CC) and
Molecular function (MF) GO features
− organisms: Bos taurus, Danio rerio, Gallus gallus genes
• Available in July 2009 in an old version of the Gene Ontology



Results (2)
We then compared the percentage of annotations predicted with
our SVD method and our optimized truncation level against the
percentage of annotations predicted by the SVD method with the
fixed truncation level (k=500) used by Draghici et al. in the
paper “A semantic analysis of the annotations of the human
genome” (2005)



Conclusions

Problem: SVD truncation in
the context of genomic
annotation prediction

Proposed solution: finding the
truncation level corresponding
to the maximum AUC of the
ROC curve, and it’s near to
zero



Conclusions (2)
• To avoid computing SVD for all the possible truncation levels
(too expensive!), we proposed an algorithm for the search of
local and global maxima, by zooming sub-intervals

• The best SVD truncation levels suggested by this algorithm for
our dataset (annotations of Bos taurus, Danio rerio, and Gallus
gallus genes, and GO terms) gave better results than other
truncation levels, in a reasonable time.



Future developments
• To obtain the best sampling, we could study the gradient
variations in the distribution of the AUC values for different
truncation levels and the histogram of the eigenvalues
• Our approach is not limited to the Gene Ontology and can be
applied to any controlled annotations



A Discrete Optimization Approach for SVD Best
Truncation Choice based on ROC Curves

Thanks for your attention!!!
www.DavideChicco.it

davide.chicco@elet.polimi.it

