HBase for Dealing with Large Matrices

My presentation at HBaseCon 2013, entitled "Using Apache HBase for Large Matrices". It describes an HBase-backed Mahout matrix, used as input/output for machine learning algorithms.


  1. HBase for Dealing with Large Matrices
  2. Who am I? Leads the data team at Dilisim; researcher at Anadolu University
  3. Machine Learning. Some big problems: classifying huge text collections, recommending to millions of users, predicting links in a social network
  4. Recommender Systems. Recommenders take large sparse matrices as input. How would you input a millions-by-millions matrix?
  5. Recommender Systems. [Figure: the input as an m x n matrix of ratings, with m users as rows and n items as columns]
  6. Recommender Systems. State-of-the-art recommender systems learn large models: one factor vector per user and per item; one parameter vector (on side info) per user and per item
  7. Recommender Systems. [Figure: the m x n input matrix factored into a User Model (m x k) and an Item Model (n x k)]
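As a concrete sketch of how such a factor model produces a score: the prediction for a (user, item) pair is the dot product of the corresponding row of the m x k user model and the n x k item model. The class name and values below are made up for illustration.

```java
// Hypothetical sketch: predicting a rating from a k-dimensional factor model.
public class FactorModelDemo {
    // Dot product of one user-model row with one item-model row.
    static double predict(double[] userFactors, double[] itemFactors) {
        double score = 0.0;
        for (int f = 0; f < userFactors.length; f++) {
            score += userFactors[f] * itemFactors[f];
        }
        return score;
    }

    public static void main(String[] args) {
        double[] user = {0.54, 0.48, 0.83};  // one row of the m x k user model
        double[] item = {0.93, 0.78, 0.56};  // one row of the n x k item model
        System.out.println(predict(user, item));
    }
}
```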
  8. Learning Process. What does a machine learning algorithm require to do with that matrix?
  9. Machine Learning - Techniques. Batch learning: all parameters are updated once per iteration
  10. Machine Learning - Techniques. Batch learning: updates can be calculated in parallel using MapReduce (a SequenceFile might be enough)
  11. Machine Learning - Techniques. Batch learning: the output model should provide random access to rows
  12. Machine Learning - Techniques. Online learning: parameters are updated per training example
  13. Machine Learning - Techniques. Online learning: each training example results in updates to a row; needs random access while learning
  14. Machine Learning - Techniques. Online learning: the output model should provide random access to rows
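The online case can be sketched as an SGD-style factorization update: each observed rating touches only the matching user row and item row, which is exactly why the backing store needs random row access during learning. This is a generic illustration, not the talk's actual training code; the class name and hyperparameters are assumptions.

```java
// Hedged sketch of an online (per-example) matrix-factorization update.
public class OnlineUpdateDemo {
    // One SGD step for a single observed rating: update only the two rows.
    static void sgdStep(double[] u, double[] v, double rating,
                        double learningRate, double reg) {
        double err = rating - dot(u, v);
        for (int f = 0; f < u.length; f++) {
            double uf = u[f];  // save old value; v's update must not see the new u
            u[f] += learningRate * (err * v[f] - reg * uf);
            v[f] += learningRate * (err * uf - reg * v[f]);
        }
    }

    static double dot(double[] a, double[] b) {
        double s = 0.0;
        for (int f = 0; f < a.length; f++) s += a[f] * b[f];
        return s;
    }

    public static void main(String[] args) {
        double[] u = {0.1, 0.1};
        double[] v = {0.1, 0.1};
        double before = Math.abs(4.0 - dot(u, v));
        for (int step = 0; step < 200; step++) sgdStep(u, v, 4.0, 0.1, 0.01);
        double after = Math.abs(4.0 - dot(u, v));
        System.out.println(after < before);  // the prediction error shrinks
    }
}
```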
  15. Deployment Process. How do you decide to deploy a machine learning model in production?
  16. Machine Learning - Deployment. Usual process: experiment on a prototype; if it works well, deploy in production; if not, keep experimenting
  17. Machine Learning - Deployment. How would you turn your prototype into production easily? A common matrix interface for in-memory and persistent versions
  18. HBase Backed Matrix. Implements the Mahout Matrix interface; dense or sparse
  19. HBase Backed Matrix. Random access to cells; random access to rows; iteration over rows; lazy loading while iterating
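The lazy-loading iteration can be pictured as an iterator that materializes one row per next() call, the way a scan over the backing table would, instead of loading the whole matrix up front. The mock below stands in for the real HBase-backed iterator, which this sketch does not reproduce.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

// Hedged sketch of lazy row iteration over a matrix.
public class LazyRowIteratorDemo {
    static Iterator<double[]> lazyRows(int numRows, int numCols) {
        return new Iterator<double[]>() {
            int next = 0;

            public boolean hasNext() {
                return next < numRows;
            }

            public double[] next() {
                if (!hasNext()) throw new NoSuchElementException();
                // In the real backing, this is where the row would be
                // fetched from the store (e.g. from a scan result).
                double[] row = new double[numCols];
                row[0] = next;  // marker so the demo output is observable
                next++;
                return row;
            }
        };
    }

    public static void main(String[] args) {
        Iterator<double[]> it = lazyRows(3, 4);
        int count = 0;
        while (it.hasNext()) { it.next(); count++; }
        System.out.println(count);  // 3
    }
}
```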
  20. HBase Backed Matrix. Common interface for prototype and production; easy to deploy (the model is already persisted)
  21. HBase Backed Matrix. Matrix operations with the existing mahout-math library
  22. Logical Schema. Composite row keys: one HBase row per cell, with keys such as 12_0, 12_9, and 12_22000, each storing its cell value in the data:value column (e.g. 0.41)
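One way to realize such composite keys (an assumption of this sketch, not necessarily the talk's exact encoding) is a fixed-width big-endian (row, col) byte encoding, so that HBase's lexicographic row-key order matches (row, col) order and one matrix row is a contiguous scan range:

```java
import java.nio.ByteBuffer;

// Hypothetical composite-row-key encoding for the cell-per-row schema.
public class CompositeKeyDemo {
    // Big-endian fixed-width ints keep lexicographic byte order consistent
    // with numeric (row, col) order.
    static byte[] toRowKey(int row, int col) {
        return ByteBuffer.allocate(8).putInt(row).putInt(col).array();
    }

    static int[] fromRowKey(byte[] key) {
        ByteBuffer buf = ByteBuffer.wrap(key);
        return new int[]{buf.getInt(), buf.getInt()};
    }

    public static void main(String[] args) {
        byte[] key = toRowKey(12, 22000);
        int[] rc = fromRowKey(key);
        System.out.println(rc[0] + "_" + rc[1]);  // prints 12_22000
    }
}
```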
  23. Logical Schema. Composite row keys: row access by scan; cell access by get; atomic row updates must be handled in the application
  24. Logical Schema. Row indices as row keys: row key 12 holds the columns data:0 = 0.41, data:9 = 0.41, data:22000 = 0.41
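The row-index-as-row-key schema can be mocked with in-memory maps (a live HBase table is not required for the sketch): the matrix row index is the row key, and each stored column index becomes a qualifier in that row. In HBase, writing a whole matrix row would then be a single Put, hence the atomic row updates noted on the next slide.

```java
import java.util.Map;
import java.util.TreeMap;

// Hedged in-memory mock of the row-index-as-row-key schema.
public class RowKeySchemaDemo {
    // rowKey -> (column index/qualifier -> cell value)
    static final Map<Integer, Map<Integer, Double>> table = new TreeMap<>();

    static void putCell(int row, int col, double value) {
        table.computeIfAbsent(row, r -> new TreeMap<>()).put(col, value);
    }

    static double getCell(int row, int col) {
        Map<Integer, Double> r = table.get(row);
        Double v = (r == null) ? null : r.get(col);
        return v == null ? 0.0 : v;  // sparse: missing cells read as zero
    }

    public static void main(String[] args) {
        putCell(12, 0, 0.41);
        putCell(12, 9, 0.41);
        putCell(12, 22000, 0.41);
        System.out.println(getCell(12, 9));    // 0.41
        System.out.println(getCell(12, 100));  // 0.0 (not stored)
    }
}
```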
  25. Logical Schema. Row indices as row keys: atomic row updates are handled automatically
  26. Speed – Cell access/write. [Chart: GET and SET performance for cell access, comparing row-index row keys vs. composite row keys]
  27. Speed – Row access/write. [Chart: GET and SET performance for row access, comparing row-index row keys vs. composite row keys]
  28. Code: github.com/gcapan/mahout/tree/hbase-matrix
  29. Future Work. MatrixInputFormat: might replace SequenceFile-based MapReduce inputs
  30. Future Work – A little digression. Recommender systems: calculating the score for a user-item pair is easy with HBaseMatrix
  31. Future Work – A little digression. Recommender systems: top-N recommendation? Keep all candidate items for a user in the user row as a nested entity (see Ian Varley's HBase Schema Design)
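Given per-item scores (e.g. factor dot products read from the model), the top-N step itself can be sketched with a bounded min-heap over the candidate items; the names and scores below are made up for illustration, and this is not the nested-entity HBase layout the slide refers to.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

// Illustrative top-N selection over pre-computed candidate scores.
public class TopNDemo {
    static List<Integer> topN(double[] scores, int n) {
        // Min-heap of item ids ordered by score keeps the best n seen so far.
        PriorityQueue<Integer> heap =
            new PriorityQueue<>((a, b) -> Double.compare(scores[a], scores[b]));
        for (int item = 0; item < scores.length; item++) {
            heap.offer(item);
            if (heap.size() > n) heap.poll();  // drop the current worst
        }
        List<Integer> result = new ArrayList<>();
        while (!heap.isEmpty()) result.add(0, heap.poll());  // best first
        return result;
    }

    public static void main(String[] args) {
        double[] scores = {0.2, 0.9, 0.5, 0.7};
        System.out.println(topN(scores, 2));  // [1, 3]
    }
}
```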
  32. Thank you!
