The 24th International Conference on Genome
Informatics (GIW2013), Dec. 16 2013.
Scalable prediction of compound-protein interactions using minwise
hashing
Yasuo Tabei (PRESTO, JST)
Joint work with
Yoshihiro Yamanishi (Kyushu Univ.)
Drug-target interactions
• Most drugs are small molecules that interact with
one or several target proteins
• Analyzing functional interactions between small
compounds and proteins plays an important role
in genomic drug discovery
Genome-wide prediction of unknown
compound-protein interactions
• Yamanishi et al., Bioinformatics (ISMB2008), 24:i232-i240, 2008.
• Faulon et al., Bioinformatics, 24:225-233, 2008.
• Jacob et al., Bioinformatics, 24:2149-2156, 2008.
• Bleakley et al., Bioinformatics, 25:2397-2403, 2009.
Fingerprints (binary vectors) of
compound and protein
• Compounds represented by PubChem substructures
• Proteins represented by PFAM domains
4,137 elements
Fingerprint representation of
compound-protein pairs
• Tensor product of each compound and protein pair
– All possible products of compound substructures and
PFAM domains
• Observation: fingerprint representation
• Large number of high dimensional fingerprints:
– Number: 216 million (=35,366×6,111)
– Dimension: 771,756 (=881×876)
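The tensor-product construction can be sketched in a few lines of Python. This is a minimal illustration, assuming each fingerprint is stored as a set of active (1-valued) indices; the function name `pair_fingerprint` and the index layout `c * n_protein_dims + p` are assumptions made for this sketch, not taken from the authors' implementation.

```python
def pair_fingerprint(compound_bits, protein_bits, n_protein_dims=876):
    """Return the active indices of the tensor-product fingerprint.

    compound_bits and protein_bits are sets of active indices.
    Index layout (an assumption for this sketch): c * n_protein_dims + p.
    """
    return {c * n_protein_dims + p for c in compound_bits for p in protein_bits}


# Toy example: a compound with 3 active substructures, a protein with 2 domains.
compound = {0, 5, 880}   # indices in [0, 881)
protein = {1, 875}       # indices in [0, 876)
pair = pair_fingerprint(compound, protein)

print(len(pair))       # active bits: one per (substructure, domain) combination
print(max(pair) + 1)   # highest possible index + 1 for these inputs
print(881 * 876)       # dimension of the pair fingerprint
```

Storing only the active indices keeps the toy example small, but the number of active bits still grows as the product of the two fingerprints' active bits, which is what motivates the compact fingerprints introduced later.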
Existing methods
• Pairwise Kernel SVM [Faulon et al.,2008]
– Kernel matrix of inner products between each pair of
fingerprints of compound-protein pair
Large time complexity:
(n_c, n_p: the numbers of compounds and proteins)
Large working space:
• Linear SVM (e.g., LIBLINEAR (Lin et al., 2007))
– Use fingerprints of compound-protein pair as an input
Large training time and working space
• Challenge: Developing a scalable prediction of
large-scale compound-protein interactions
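To make the scaling problem concrete, the sketch below checks on toy data that the inner product of two tensor-product fingerprints equals the product of the compound and protein inner products, and then counts how many entries a full kernel matrix over all pairs would have. The set-based representation and the helper names `inner` and `tensor` are assumptions for illustration only.

```python
def inner(a, b):
    """Inner product of two binary vectors stored as sets of active indices."""
    return len(a & b)


def tensor(c_bits, p_bits):
    """Tensor-product fingerprint as a set of (compound index, protein index)."""
    return {(c, p) for c in c_bits for p in p_bits}


c1, p1 = {0, 2, 5}, {1, 3}
c2, p2 = {2, 5, 7}, {3, 4}

explicit = inner(tensor(c1, p1), tensor(c2, p2))   # on the expanded fingerprints
pairwise = inner(c1, c2) * inner(p1, p2)           # without expanding anything
print(explicit, pairwise)

# Size of the problem quoted on the slides.
n_c, n_p = 35366, 6111
n_pairs = n_c * n_p
print(n_pairs)         # number of compound-protein pairs
print(n_pairs ** 2)    # entries in a full kernel matrix over all pairs
```

The identity avoids building the 771,756-dimensional vectors, but a kernel method still has to touch pairs of training examples, so the cost grows with the square of the number of compound-protein pairs.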
Overview of our method
• Basic idea: build compact fingerprints from
fingerprints of compound-protein pairs
– Leverage an idea behind MinHash (Minwise Hashing)
[Broder et al., 2000]
• Train linear classifiers using compact fingerprints
– Smaller working space for training
– Short training time
– The same classification accuracy as previous
methods
– Interpretability of features
MinHash [Broder et al., 2000]
• Mapping a set S into a string of length l
1. Generate a random permutation π_k
2. Apply the permutation π_k to the set S
3. Take the minimum of π_k(S)
as the k-th integer
4. Iterate steps 1-3 for k = 1, ..., l
Ex) [Figure: worked example of steps 1-3]
• Conserve the Jaccard similarity in the original
space
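The MinHash steps can be sketched as follows. This is a minimal illustration with explicit random permutations, assuming sets of integers drawn from a small universe; the function names and parameters (`minhash`, `universe_size`, `length`, `seed`) are hypothetical and chosen for this example. Practical implementations usually replace explicit permutations with hash functions.

```python
import random


def minhash(items, universe_size, length, seed=0):
    """Map a non-empty set of integers in [0, universe_size) to `length` integers."""
    rng = random.Random(seed)
    signature = []
    for _ in range(length):
        perm = list(range(universe_size))
        rng.shuffle(perm)                               # step 1: random permutation
        signature.append(min(perm[x] for x in items))   # steps 2-3: permute, take min
    return signature


def jaccard(a, b):
    """Exact Jaccard similarity of two sets."""
    return len(a & b) / len(a | b)


def estimate_jaccard(sig_a, sig_b):
    """Fraction of positions at which two MinHash strings agree."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


s1 = set(range(0, 60))
s2 = set(range(30, 90))
# The same seed is used for both sets so that they share the same permutations.
sig1 = minhash(s1, universe_size=100, length=1000, seed=42)
sig2 = minhash(s2, universe_size=100, length=1000, seed=42)

print(jaccard(s1, s2))               # exact similarity
print(estimate_jaccard(sig1, sig2))  # estimate from the hashed strings
```

The two sets must be hashed with the same permutations; the probability that their minima agree under one permutation is the Jaccard similarity, so the fraction of agreeing positions estimates it.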
Saving memory by additional hashing
• Drawback of MinHash: Need large bits for
storing each hashed value
• Reduce the hashed value to a smaller value
– Apply a random hash function h: {1,...,M} → {1,...,N}
(N << M) to each hashed value
• Collision probability is derived as follows:
• J(Si,Sj): Jaccard similarity
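The effect of the additional hashing can be explored with a small simulation. The expression J + (1 - J)/N used below is an assumption: it is what one would expect if h behaves like a uniformly random function (the minima agree with probability J, and otherwise collide by chance with probability 1/N). The sketch also draws a fresh h for every position and uses 0-based values, both of which are simplifications for illustration.

```python
import random


def hashed_minhash(items, universe_size, length, n_buckets, seed=0):
    """MinHash followed by a random hash into [0, n_buckets) at each position."""
    rng = random.Random(seed)
    out = []
    for _ in range(length):
        perm = list(range(universe_size))
        rng.shuffle(perm)
        # Random function h: [0, universe_size) -> [0, n_buckets) (assumption).
        h = [rng.randrange(n_buckets) for _ in range(universe_size)]
        out.append(h[min(perm[x] for x in items)])
    return out


s1 = set(range(0, 60))
s2 = set(range(30, 90))
J = len(s1 & s2) / len(s1 | s2)
N = 4  # deliberately tiny so that chance collisions are visible

# Same seed for both sets: shared permutations and shared hash functions.
a = hashed_minhash(s1, 100, 2000, N, seed=7)
b = hashed_minhash(s2, 100, 2000, N, seed=7)

observed = sum(x == y for x, y in zip(a, b)) / len(a)
expected = J + (1 - J) / N   # assumed form, see the note above
print(observed, expected)
```

With a large N the chance-collision term becomes small, so the hashed strings keep most of the similarity information while each value needs only log2(N) bits.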
SVM using compact fingerprints
• Use L1- and L2-regularizations to prevent
overfitting
• MH-L1SVM (L1-regularization)
• MH-L2SVM (L2-regularization)
• Use LIBLINEAR (Lin et al., 2007) for efficient
optimization
Other details
• Linear SVM with compact fingerprints simulates
non-linear SVM with pairwise kernels
– Can simulate a non-linear SVM with a linear SVM
• Can extract important features for predicting
compound-protein interactions
– Use reverse hashing functions
• See our paper for more details
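One common way to feed hashed strings to a linear classifier is to expand each position into a one-hot block, so that the inner product of two expanded vectors counts the positions at which the strings agree. Whether the authors' implementation uses exactly this layout is an assumption; the function name `expand` and the index scheme are hypothetical.

```python
def expand(signature, n_buckets):
    """Active indices of the sparse binary vector for a hashed string.

    Position k with value v activates index k * n_buckets + v, so the vector
    has len(signature) * n_buckets dimensions and exactly len(signature) ones.
    """
    return {k * n_buckets + v for k, v in enumerate(signature)}


N = 4
sig_a = [3, 0, 2, 1]
sig_b = [3, 1, 2, 0]

xa, xb = expand(sig_a, N), expand(sig_b, N)
matches = sum(u == v for u, v in zip(sig_a, sig_b))

print(sorted(xa), sorted(xb))
print(len(xa & xb), matches)   # linear inner product vs. number of agreeing positions
```

Because the number of agreeing positions tracks the Jaccard similarity of the original fingerprints, a linear model on the expanded vectors can stand in for a kernel method on the original ones, and each learned weight is tied to one (position, value) feature.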
Experiments
• 216 million compound-protein pairs, including
300,202 interacting pairs
– Unbalanced data
• Use AUC score, training time and memory as
evaluation measures
• Compare MH-L1SVM and MH-L2SVM to L1SVM
and L2SVM
– L1- and L2-regularized SVMs with fingerprints
computed by tensor products
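Since the data are heavily unbalanced, AUC is a natural measure: it does not depend on a decision threshold or on the class ratio. A minimal sketch of the computation is given below, assuming labels of 1 for interacting pairs and 0 otherwise; the function name `auc_score` and the toy scores are made up for illustration, and the quadratic loop is only suitable for small examples.

```python
def auc_score(labels, scores):
    """AUC as the probability that a random positive outranks a random negative.

    Ties between a positive and a negative score count as one half.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y != 1]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data: 2 interacting pairs among 6 candidates.
labels = [1, 0, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.3, 0.2, 0.7, 0.1]
print(auc_score(labels, scores))
```

A score of 0.5 corresponds to random ranking and 1.0 to a ranking that places every interacting pair above every non-interacting pair.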
AUC score of MH-L1SVM by varying the
length of hashed strings l
Balanced dataset of 600,404 compound-protein pairs
Training time of MH-L1SVM for varying
the length of string (N=2^16)
Balanced dataset of 600,404 compound-protein pairs
Maximum AUC score
Memory for the number of compound-protein pairs (l=10, N=2^16)
AUC score and training time on 216 million
compound-protein pairs
(l=10, N=2^16)
Measure              MH-L1SVM   MH-L2SVM   L1SVM        L2SVM
AUC score            0.79       0.81       -            -
Training time (sec)  15,713     10,054     > 48 hours   > 48 hours
Summary
• Scalable prediction of compound-protein
interactions using minwise hashing
• Applicable to 216 million compound-protein pairs
• The same trends in the pair-wise cross
validation experiments can be observed in the
block-wise experiments (See our paper)
• Dataset and C++ implementation:
https://sites.google.com/site/interactminhash/
[Figure: The number of extracted features. x-axis: ratio of negative samples (log scale, base 10); y-axis: number of features; methods: L1SVM, L1LOG]
AUC score on pair-wise cross validation
experiment (l=10, N=2^16)
(Ratio of the number of non-interacting pairs to that of
interacting pairs)
Ratio  Number       MH-L1SVM  MH-L2SVM  L1SVM  L2SVM
1      600,404      0.78      0.79      0.79   0.80
5      1,801,212    0.79      0.80      0.81   0.81
10     3,302,222    0.79      0.80      0.81   0.81
25     7,805,252    0.79      0.80      0.81   0.81
50     15,310,302   0.79      0.81      0.81   0.81
100    30,320,402   0.79      0.81      -      0.81
250    75,350,702   0.79      0.81      -      0.81
Training time (sec) on pair-wise cross
validation experiments (l=10, N=2^16)
(Ratio of the number of non-interacting pairs to that of
interacting pairs)
Ratio  Number       MH-L1SVM  MH-L2SVM  L1SVM       L2SVM
1      600,404      29        28        188         387
5      1,801,212    172       38        1,655       963
10     3,302,222    448       2661      1,261       10,798
25     7,805,252    1,808     732       20,067      4,623
50     15,310,302   1,140     811       58,045      8,936
100    30,320,402   7,601     1,643     > 24 hours  16,608
250    75,350,702   25,060    4,631     > 24 hours  43,843
AUC score of MH-L2SVM by varying the
length l of hashed strings
Balanced dataset of 600,404 compound-protein pairs
Training time of MH-L2SVM for varying
the length of string
Balanced dataset of 600,404 compound-protein pairs
Maximum AUC score