GIW2013


  1. 1. The 24th International Conference on Genome Informatics (GIW2013), Dec. 16, 2013. Scalable prediction of compound-protein interactions using minwise hashing. Yasuo Tabei (PRESTO, JST). Joint work with Yoshihiro Yamanishi (Kyushu Univ.)
  2. 2. Drug target interactions • Most drugs are small molecules that interact with one or several target proteins • Analyzing functional interactions between small compounds and proteins plays an important role in genomic drug discovery
  3. 3. Genome-wide prediction of unknown compound-protein interactions • Yamanishi, Y., et al., Bioinformatics (ISMB2008), 24:i232-i240, 2008 • Faulon et al., Bioinformatics, 24:225-233, 2008 • Jacob et al., Bioinformatics, 24:2149-2156, 2008 • Bleakley et al., Bioinformatics, 25:2397-2403, 2009
  4. 4. Fingerprints (binary vectors) of compound and protein • Compounds represented by PubChem substructures • Proteins represented by PFAM domains 4,137 elements
  5. 5. Fingerprint representation of compound-protein pairs • Tensor product of each compound and protein pair – All possible products of compound substructures and PFAM domains • Observation: fingerprint representation • Large number of high-dimensional fingerprints: – Number: 216 million (= 35,366 × 6,111) – Dimension: 771,756 (= 881 × 876)
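The tensor-product construction can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function name `pair_fingerprint`, the set-of-active-indices representation, and the row-major flattening of a (substructure, domain) pair into one index are all assumptions made for the example. The sizes 881 and 876 are taken from the slide.

```python
from itertools import product

def pair_fingerprint(compound_bits, protein_bits, n_protein_features):
    """Return the active indices of the tensor-product fingerprint.

    compound_bits and protein_bits are sets of positions whose value is 1.
    The pair feature (i, j) is flattened to i * n_protein_features + j
    (an assumed layout, chosen only for this sketch).
    """
    return {i * n_protein_features + j
            for i, j in product(compound_bits, protein_bits)}

# Toy example: 3 active substructures and 2 active domains,
# in a space of 881 x 876 possible pair features.
compound = {0, 5, 880}
protein = {2, 875}
pair = pair_fingerprint(compound, protein, n_protein_features=876)
print(len(pair))  # 6 active pair features
print(max(pair))  # 771755, the last index of the 771,756-dimensional space
```

Storing only the active indices keeps each pair fingerprint small even though the full space has 771,756 dimensions.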
  6. 6. Existing methods • Pairwise Kernel SVM [Faulon et al., 2008] – Kernel matrix of inner products between each pair of compound-protein pair fingerprints – Large time complexity and large working space in terms of nc and np (the numbers of compounds and proteins) • Linear SVM (Ex: LIBLINEAR (Lin et al., 2007)) – Uses fingerprints of compound-protein pairs as input – Large training time and working space • Challenge: developing a scalable method for predicting large-scale compound-protein interactions
  7. 7. Overview of our method • Basic idea: build compact fingerprints from fingerprints of compound-protein pairs – Leverage an idea behind MinHash (Minwise Hashing) [Broder et al., 2000] • Train linear classifiers using compact fingerprints – Smaller working space for training – Short training time – The same classification accuracy as previous methods – Interpretability of features
  8. 8. MinHash [Broder et al., 2000] • Maps a set S to a string of l integers: 1. Generate a random permutation π_k of the feature indices 2. Apply π_k to every element of S 3. Take the minimum of π_k(S) as the k-th integer 4. Iterate steps 1-3 for k = 1, …, l • Preserves the Jaccard similarity of the original space: two sets agree at position k with probability equal to their Jaccard similarity
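The four steps can be sketched as follows. This is a toy version under stated assumptions: the names `minhash` and `jaccard` are hypothetical, and explicit permutations are materialised with `random.shuffle`, which is only feasible for a small universe. A shared seed makes both sets see the same permutations, which is what makes the signatures comparable.

```python
import random

def minhash(active, universe_size, length, seed=0):
    """Map a set of active indices to `length` integers via random permutations."""
    rng = random.Random(seed)  # same seed => same permutations for every set
    signature = []
    for _ in range(length):
        perm = list(range(universe_size))
        rng.shuffle(perm)                                # step 1: random permutation
        signature.append(min(perm[x] for x in active))   # steps 2-3: permute, take minimum
    return signature                                     # step 4: repeated `length` times

def jaccard(a, b):
    """Jaccard similarity |a & b| / |a | b| of two sets."""
    return len(a & b) / len(a | b)

a = set(range(0, 60))
b = set(range(30, 90))
sig_a = minhash(a, universe_size=100, length=2000)
sig_b = minhash(b, universe_size=100, length=2000)

# The fraction of matching positions estimates the Jaccard similarity.
estimate = sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)
print(jaccard(a, b), estimate)
```

With 2,000 positions the estimate should land close to the true value of 1/3; shorter signatures trade accuracy for space.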
  9. 9. Saving memory by additional hashing • Drawback of MinHash: Need large bits for storing each hashed value • Reduce the hashed value to a smaller value – Apply a random hash function h: {1,..,M} → {1,…,N} (N << M) to each hashed value • Collision probability is derived as follows: • J(Si,Sj): Jaccard similarity
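The formula itself did not survive in this transcript. As a hedged reconstruction, assume h behaves like a uniformly random function into N values, chosen independently of the permutation π. Under that assumption the collision probability follows from the MinHash property in two cases: either the two minima already coincide, or they differ and collide by chance under h.

```latex
\begin{align*}
\Pr\bigl[h(\min \pi(S_i)) = h(\min \pi(S_j))\bigr]
  &= \Pr\bigl[\min \pi(S_i) = \min \pi(S_j)\bigr]
   + \Pr\bigl[\min \pi(S_i) \neq \min \pi(S_j)\bigr] \cdot \frac{1}{N} \\
  &= J(S_i, S_j) + \frac{1 - J(S_i, S_j)}{N}
\end{align*}
```

For large N the second term vanishes, so the collision probability approaches the Jaccard similarity. The exact form used by the authors may differ from this sketch.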
  10. 10. Collision probability for various Jaccard similarities J and additional hashings N
  11. 11. Procedure for building compact fingerprints
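The slide for this step is a figure, so the following is only a plausible sketch of such a procedure, not the authors' code. It assumes that each of the l MinHash values is reduced by a random hash into N buckets and then written as one active position inside its own block of N, giving a sparse binary vector of dimension l × N. The name `compact_fingerprint`, the lookup-table hash, and the one-hot block layout are all assumptions made for the example.

```python
import random

def compact_fingerprint(active, universe_size, length, n_buckets, seed=0):
    """Return `length` active indices of an (length * n_buckets)-dimensional binary vector."""
    rng = random.Random(seed)  # fixed seed: permutations and hashes are shared across inputs
    indices = []
    for k in range(length):
        perm = list(range(universe_size))
        rng.shuffle(perm)                       # random permutation for position k
        # Assumed additional hash h: {0..M-1} -> {0..N-1}, stored as a random table.
        table = [rng.randrange(n_buckets) for _ in range(universe_size)]
        smallest = min(perm[x] for x in active)  # MinHash value for position k
        bucket = table[smallest]                 # reduced value in {0..N-1}
        indices.append(k * n_buckets + bucket)   # one active bit inside block k
    return indices

fp = compact_fingerprint({3, 7, 19}, universe_size=50, length=10, n_buckets=16)
print(fp)
```

Each input set yields exactly l active features regardless of the original dimension, which is what keeps the training data compact for a linear classifier.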
  12. 12. SVM using compact fingerprints • Use L1- and L2-regularization to prevent overfitting • MH-L1SVM (L1-regularization) • MH-L2SVM (L2-regularization) • Train with LIBLINEAR (Lin et al., 2007), an efficient library for large-scale linear classification
  13. 13. Other details • A linear SVM with compact fingerprints simulates a non-linear SVM with pairwise kernels • Can extract important features for predicting compound-protein interactions – Use reverse hashing functions • See our paper for more details
  14. 14. Experiments • 216 million compound-protein pairs that include 300,202 interacting pairs – Unbalanced data • Use AUC score, training time and memory as evaluation measures • Compare MH-L1SVM and MH-L2SVM to L1SVM and L2SVM – L1- and L2-regularized SVM with fingerprints computed by tensor products
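For readers unfamiliar with the AUC score, a minimal sketch follows. It uses the pairwise definition: the probability that a randomly chosen interacting pair is scored above a randomly chosen non-interacting pair, with ties counted as one half. The function name `auc_score` and the toy labels and scores are made up for illustration, and the quadratic loop is meant for small examples only.

```python
def auc_score(labels, scores):
    """AUC as the probability that a random positive outranks a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    # Count positive/negative score pairs ranked correctly; ties count as half.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy unbalanced data: 2 interacting pairs among 8.
labels = [1, 0, 0, 0, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.3, 0.2, 0.1, 0.4, 0.35, 0.05]
auc = auc_score(labels, scores)
print(auc)  # 11 of the 12 positive/negative pairs are ordered correctly
```

Because the score depends only on how positives rank against negatives, it does not change with the class ratio, which suits unbalanced data like this.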
  15. 15. Two types of 5-fold cross validation
  16. 16. AUC score of MH-L1SVM by varying the length of hashed strings l Balanced dataset of 600,404 compound-protein pairs
  17. 17. Training time of MH-L1SVM when varying the length l of hashed strings (N=2^16) Balanced dataset of 600,404 compound-protein pairs Maximum AUC score
  18. 18. Memory for the number of compound-protein pairs (l=10, N=2^16)
  19. 19. AUC score and training time on 216 million compound-protein pairs (l=10, N=2^16)
      Method    AUC score  Training time (sec)
      MH-L1SVM  0.79       15,713
      MH-L2SVM  0.81       10,054
      L1SVM     -          > 48 hours
      L2SVM     -          > 48 hours
  20. 20. The number of extracted features
  21. 21. Summary • Scalable prediction of compound-protein interactions using minwise hashing • Applicable to 216 million compound-protein pairs • The same trends in the pair-wise cross validation experiments can be observed in the block-wise experiments (See our paper) • Dataset and C++ implementation: https://sites.google.com/site/interactminhash/
  22. 22. The number of extracted features (figure: number of features, 0 to 6000, versus ratio of negative samples on a log scale base 10, 0.0 to 2.5, for L1SVM and L1LOG)
  23. 23. AUC score on pair-wise cross validation experiment (l=10, N=2^16). Ratio: ratio of the number of non-interacting pairs to that of interacting pairs. Number: number of compound-protein pairs.
      Ratio  Number      MH-L1SVM  MH-L2SVM  L1SVM  L2SVM
      1      600,404     0.78      0.79      0.79   0.8
      5      1,801,212   0.79      0.80      0.81   0.81
      10     3,302,222   0.79      0.80      0.81   0.81
      25     7,805,252   0.79      0.80      0.81   0.81
      50     15,310,302  0.79      0.81      0.81   0.81
      100    30,320,402  0.79      0.81      -      0.81
      250    75,350,702  0.79      0.81      -      0.81
  24. 24. Training time (sec) on pair-wise cross validation experiments (l=10, N=2^16). Ratio: ratio of the number of non-interacting pairs to that of interacting pairs. Number: number of compound-protein pairs.
      Ratio  Number      MH-L1SVM  MH-L2SVM  L1SVM       L2SVM
      1      600,404     29        28        188         387
      5      1,801,212   172       38        1,655       963
      10     3,302,222   448       2661      1,261       10,798
      25     7,805,252   1,808     732       20,067      4,623
      50     15,310,302  1,140     811       58,045      8,936
      100    30,320,402  7,601     1,643     > 24 hours  16,608
      250    75,350,702  25,060    4,631     > 24 hours  43,843
  25. 25. AUC score of MH-L2SVM by varying the length l of hashed strings Balanced dataset of 600,404 compound-protein pairs
  26. 26. Training time of MH-L2SVM when varying the length l of hashed strings Balanced dataset of 600,404 compound-protein pairs Maximum AUC score
