
Apache Hadoop India Summit 2011 talk "Framework for a Suite of Co-clustering Algorithms for Predictive Modeling on Hadoop" by Vaijanath Rao and Rohini Uppuluri


Framework for a suite of Co-clustering algorithms for predictive modeling on Hadoop<br />Vaijanath N. Rao<br />(vaijanath.rao@teamaol.com)<br />Rohini Uppuluri<br />(rohini.uppuluri@teamaol.com)<br />
Presentation for<br />[CLIENT] <br />Agenda<br /><ul><li>Introduction</li>
<li>Background</li>
<li>Some Approaches</li>
<li>Co-Clustering</li>
<li>Introduction</li>
<li>Related Work</li>
<li>Why Hadoop?</li>
<li>Goal</li>
<li>Our Framework</li>
<li>Conclusions and Future Work</li></ul>
Background<br />Modeling for Prediction<br /><ul><li>Will user A like this movie?</li>
<li>Will user B like this camera?</li>
<li>Customer purchase decisions in an e-commerce setting</li></ul>And tons of other things…<br />
Some Approaches<br /><ul><li>Collaborative filtering</li>
<li>User Based, Item Based, Model Based, Content Based, Hybrid (see [1], [2]), etc.</li>
<li>Latent Models</li>
<li>Probabilistic Latent Semantic Indexing [3,6]</li>
<li>Matrix Factorization [4,7,8]</li>
<li>Predictive Discrete Latent Factor [5]</li>
<li>Co-clustering</li>
<li>Clustering along multiple axes: [9,10], etc.; survey in [16]</li></ul>
Co-clustering<br />
Some Approaches<br /><ul><li>Bregman co-clustering framework [11]</li>
<li>Information-theoretic co-clustering [12]</li>
<li>Minimum sum-squared residue co-clustering [13]</li>
<li>Scalable framework based on the Bregman framework [14]</li>
<li>DisCo [15]</li></ul>
Why Hadoop<br /><ul><li>Real-world data is huge</li>
<li>Large matrix to operate on (millions and millions of rows, millions of columns!)</li>
<li>A lot of computation</li></ul>
Goal<br /><ul><li>There are a number of approaches, hence the need for a common framework</li>
<li>Build a framework that accommodates multiple algorithms on Hadoop</li>
<li>Make the framework easy for users to choose from and use</li></ul>
Overview<br />
Overview : Core Interfaces<br /><ul><li>Input vector (type, id, datavec, attributevec, cost, assignment)</li>
<li>Cluster (vector, len)</li>
<li>Row Cluster</li>
<li>Column Cluster</li>
<li>Error Function (vector1, vector2)</li>
<li>Model (matrix)</li>
<li>Row Model</li>
<li>Column Model</li>
<li>Group Model</li>
<li>Objective Function (Model1, Model2)</li></ul>
Overview : Core Utilities<br /><ul><li>Error Function:</li>
<li>Similarity</li>
<li>Cosine</li>
<li>Distance</li>
<li>Euclidean</li>
<li>Jaccard</li>
<li>Objective Function</li></ul>
Currently we have<br /><ul><li>Graph Based Bi-clustering</li>
<li>DisCo</li></ul>
DisCo Algorithm<br />
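The algorithm slide itself did not survive in this transcript, so the following is only a minimal sketch of the general alternating pattern that co-clustering methods of this kind follow: fix the column clusters and reassign rows, then fix the row clusters and reassign columns, until nothing changes. It is not the authors' implementation and not DisCo's actual cost function. The names `block_means`, `reassign_rows` and `co_cluster` are hypothetical, and squared Euclidean distance stands in for the pluggable Error Function (vector1, vector2) from the Core Interfaces.

```python
from typing import Callable, List, Sequence, Tuple

Matrix = List[List[float]]
ErrorFn = Callable[[Sequence[float], Sequence[float]], float]


def squared_euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Assumed error function: squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def block_means(matrix: Matrix, row_of: List[int], col_of: List[int],
                k: int, l: int) -> Matrix:
    """Mean value of every (row cluster, column cluster) block."""
    sums = [[0.0] * l for _ in range(k)]
    counts = [[0] * l for _ in range(k)]
    for i, row in enumerate(matrix):
        for j, value in enumerate(row):
            sums[row_of[i]][col_of[j]] += value
            counts[row_of[i]][col_of[j]] += 1
    return [[sums[r][c] / counts[r][c] if counts[r][c] else 0.0
             for c in range(l)] for r in range(k)]


def reassign_rows(matrix: Matrix, col_of: List[int], means: Matrix,
                  error: ErrorFn) -> List[int]:
    """Assign each row to the row cluster whose block means fit it best."""
    assignment = []
    for row in matrix:
        costs = [error(row, [proto[c] for c in col_of]) for proto in means]
        assignment.append(costs.index(min(costs)))
    return assignment


def transpose(matrix: Matrix) -> Matrix:
    return [list(column) for column in zip(*matrix)]


def co_cluster(matrix: Matrix, row_of: List[int], col_of: List[int],
               k: int, l: int, error: ErrorFn = squared_euclidean,
               max_iters: int = 20) -> Tuple[List[int], List[int]]:
    """Alternate row and column reassignment until the assignments are stable."""
    for _ in range(max_iters):
        means = block_means(matrix, row_of, col_of, k, l)
        new_rows = reassign_rows(matrix, col_of, means, error)
        means = block_means(matrix, new_rows, col_of, k, l)
        # Columns are handled as rows of the transposed problem.
        new_cols = reassign_rows(transpose(matrix), new_rows,
                                 transpose(means), error)
        if new_rows == row_of and new_cols == col_of:
            break
        row_of, col_of = new_rows, new_cols
    return row_of, col_of


ratings = [[5, 5, 1, 1],
           [5, 5, 1, 1],
           [1, 1, 5, 5],
           [1, 1, 5, 5]]
rows, cols = co_cluster(ratings, [0, 0, 0, 1], [0, 0, 0, 1], k=2, l=2)
```

Passing a different `error` callable (for example a cosine-based one) changes the fit criterion without touching the loop, which is the kind of pluggability the Core Interfaces suggest.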
In the Framework<br />
Row Cluster Updator Job<br />
Column Cluster Updator Job<br />
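The diagrams for the two updater jobs are missing from this transcript. As a rough illustration only, a row cluster update can be split into a map step (each row independently picks its closest row cluster and emits its partial block sums) and a reduce step (the partial sums of each cluster are combined into new block means). The sketch below uses plain Python functions as stand-ins for a mapper and a reducer. It does not use the Hadoop API, and `row_mapper`, `row_reducer` and `run_row_update` are hypothetical names, not the framework's.

```python
from collections import defaultdict


def squared_euclidean(a, b):
    """Assumed error function: squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def row_mapper(row_id, row, col_of, means, error=squared_euclidean):
    """Map step: choose the closest row cluster and emit partial block sums."""
    costs = [error(row, [proto[c] for c in col_of]) for proto in means]
    best = costs.index(min(costs))
    sums = [0.0] * len(means[0])
    counts = [0] * len(means[0])
    for value, c in zip(row, col_of):
        sums[c] += value
        counts[c] += 1
    # Key: chosen row cluster. Value: this row's contribution to that cluster.
    return best, (row_id, sums, counts)


def row_reducer(cluster_id, values):
    """Reduce step: combine partial sums into the cluster's new block means."""
    members, total_sums, total_counts = [], None, None
    for row_id, sums, counts in values:
        members.append(row_id)
        if total_sums is None:
            total_sums, total_counts = list(sums), list(counts)
        else:
            total_sums = [a + b for a, b in zip(total_sums, sums)]
            total_counts = [a + b for a, b in zip(total_counts, counts)]
    new_means = [s / n if n else 0.0 for s, n in zip(total_sums, total_counts)]
    return cluster_id, sorted(members), new_means


def run_row_update(matrix, col_of, means):
    """Simulate map, shuffle (group by key) and reduce on a single machine."""
    shuffled = defaultdict(list)
    for row_id, row in enumerate(matrix):
        key, value = row_mapper(row_id, row, col_of, means)
        shuffled[key].append(value)
    return [row_reducer(key, values) for key, values in sorted(shuffled.items())]


ratings = [[5, 5, 1, 1],
           [5, 5, 1, 1],
           [1, 1, 5, 5],
           [1, 1, 5, 5]]
result = run_row_update(ratings, col_of=[0, 0, 1, 1],
                        means=[[4.0, 2.0], [2.0, 4.0]])
```

Under the same assumptions, a column cluster update is the same computation applied to the transposed matrix with the roles of row and column assignments swapped.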
Conclusions and Future Work<br /><ul><li>Implementing more algorithms</li>
<li>Easy-to-use examples and more documentation</li></ul>
References<br />[1] J. Herlocker, J. Konstan, A. Borchers, and J. Riedl. An algorithmic framework for performing collaborative filtering. In SIGIR, pages 230–237, 1999.<br />
[2] J. Basilico and T. Hofmann. Unifying collaborative and content-based filtering. In ICML ’04, pages 65–72, 2004.<br />
[3] T. Hofmann and J. Puzicha. Latent class models for collaborative filtering. In Proc. IJCAI ’99, 1999.<br />
[4] R. Salakhutdinov and A. Mnih. Probabilistic matrix factorization. In NIPS ’07, 2007.<br />
[5] D. Agarwal and S. Merugu. Predictive discrete latent factor models for large scale dyadic data. In Proc. KDD ’07, pages 26–35, 2007.<br />
[6] A. Das, M. Datar, A. Garg, and S. Rajaram. Google news personalization: scalable online collaborative filtering. In WWW ’07, pages 271–280, 2007.<br />
[7] Y. Koren. Factorization meets the neighborhood: a multifaceted collaborative filtering model. In KDD ’08, pages 426–434, 2008.<br />
[8] H. Ma, H. Yang, M. Lyu, and I. King. SoRec: social recommendation using probabilistic matrix factorization. In CIKM ’08, pages 931–940, 2008.<br />
[9] Y. Cheng and G. M. Church. Biclustering of expression data. In Proc. ISMB ’00, pages 93–103, 2000.<br />
[10] T. George and S. Merugu. A scalable collaborative filtering framework based on co-clustering. In ICDM, pages 625–628, 2005.<br />
References (contd.)<br />[11] A. Banerjee, I. Dhillon, J. Ghosh, S. Merugu, and D. Modha. A generalized maximum entropy approach to Bregman co-clustering and matrix approximation. JMLR, 8:1919–1986, 2007.<br />
[12] I. Dhillon, S. Mallela, and D. Modha. Information-theoretic co-clustering. In Proc. KDD ’03, pages 89–98, 2003.<br />
[13] H. Cho, I. S. Dhillon, Y. Guan, and S. Sra. Minimum sum squared residue co-clustering of gene expression data. In Proc. SDM ’04, 2004.<br />
[14] M. Deodhar, G. Gupta, J. Ghosh, H. Cho, and I. Dhillon. A scalable framework for discovering coherent co-clusters in noisy data. In ICML ’08, 2008.<br />
[15] S. Papadimitriou and J. Sun. DisCo: Distributed co-clustering with MapReduce: A case study towards petabyte-scale end-to-end mining. In ICDM ’08, pages 512–521, 2008.<br />
[16] S. C. Madeira and A. L. Oliveira. Biclustering algorithms for biological data analysis: A survey. IEEE/ACM Trans. on Computational Biology and Bioinformatics, 1(1):24–45, 2004.<br />
Thank you<br />
