
Machine Learning with Apache Mahout


Published in: Technology, Education


  1. 1. Machine Learning with Apache Mahout
  2. 2. http://twitter.com/danielglauser http://www.linkedin.com/in/danglauser danglauser@gmail.com
  3. 3. What is Machine Learning?
  7. 7. What is Machine Learning? A branch of Artificial Intelligence. Creative use of statistics. Smart decisions from large data sets. All of the above.
  8. 8. Common Applications
  9. 9. Common Applications ?
  10. 10. Spam Filtering
  11. 11. Credit Card Fraud
  12. 12. Medical Diagnostics
  13. 13. Search Engines
  14. 14. Sentiment Analysis
  15. 15. Math Alert: If you want to go big with Machine Learning, math is necessary. What math?
  16. 16. Statistics, Discrete Math, Linear Algebra, Probability
  17. 17. Apache Mahout: A platform for Machine Learning. Roll your own algorithm, use the platform. Easy integration with Hadoop.
  18. 18. History: 2005, the Taste framework; 2008, services built on Lucene
  19. 19. Mahout is composed of: Recommender Engines, Classification, Clustering, Frequent itemsets
  20. 20. A brief intro to: Recommender Engines, Classification, Clustering
  21. 21. Recommendations: For a given set of input, make a recommendation
  22. 22. Recommendations: Rank the best out of many possibilities
  23. 23. Recommenders are typically user based or item based
  24. 24. Neighborhood: Nearest N Users, Threshold
  25. 25. Similarity
  26. 26. PearsonCorrelationSimilarity: Produces a value between 1 and -1. Measures the tendency of two series to move together.
  27. 27. PearsonCorrelationSimilarity: 1 = the two series are similar; 0 = no similarity; -1 = opposite similarity
  28. 28. PearsonCorrelationSimilarity problems: Doesn’t take into account how many items overlap between users. Cannot find similarity between two users if they have only one item in common. Undefined if either user's preference values are all identical (zero variance).
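
The Pearson correlation described on the slides above can be sketched in plain Java. This is a minimal illustration and not Mahout's PearsonCorrelationSimilarity: the class name `PearsonSketch` and the method `pearson` are hypothetical, and the inputs are assumed to be the two users' preference values for the items they have both rated.

```java
import java.util.OptionalDouble;

// Hypothetical helper for illustration; not part of Mahout.
final class PearsonSketch {

    /**
     * Pearson correlation of two equal-length series of preference values.
     * Returns an empty result where the correlation is undefined.
     */
    static OptionalDouble pearson(double[] x, double[] y) {
        // With fewer than two co-rated items there is nothing to correlate.
        if (x.length != y.length || x.length < 2) {
            return OptionalDouble.empty();
        }
        int n = x.length;
        double meanX = 0, meanY = 0;
        for (int i = 0; i < n; i++) {
            meanX += x[i];
            meanY += y[i];
        }
        meanX /= n;
        meanY /= n;

        double cov = 0, varX = 0, varY = 0;
        for (int i = 0; i < n; i++) {
            double dx = x[i] - meanX;
            double dy = y[i] - meanY;
            cov += dx * dy;
            varX += dx * dx;
            varY += dy * dy;
        }
        // A series whose values are all identical has zero variance,
        // so the division below would be undefined.
        if (varX == 0 || varY == 0) {
            return OptionalDouble.empty();
        }
        return OptionalDouble.of(cov / Math.sqrt(varX * varY));
    }

    public static void main(String[] args) {
        System.out.println(pearson(new double[]{1, 2, 3}, new double[]{2, 4, 6}));
        System.out.println(pearson(new double[]{1, 2, 3}, new double[]{3, 2, 1}));
        System.out.println(pearson(new double[]{3, 3, 3}, new double[]{1, 2, 3}));
    }
}
```

Note that the result depends only on the co-rated values, not on how many there are, which is the first problem listed above.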
  29. 29. Similarity Algorithms: PearsonCorrelationSimilarity, EuclideanDistanceSimilarity, TanimotoCoefficientSimilarity, LogLikelihoodSimilarity
  30. 30. To the code!
  31. 31. How big is a Java Object?
  32. 32. GenericPreference: user id - long - 8 bytes; item id - long - 8 bytes; preference value - float - 4 bytes
  33. 33. PreferenceArray: Why not just use an array or an ArrayList? A little overhead x millions of items = a *lot* of overhead
  34. 34. GenericUserPreferenceArray: [item id - long - 8 bytes; preference value - float - 4 bytes] x millions; one user id - long - 8 bytes
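
The storage idea on the three slides above can be sketched as parallel primitive arrays. This is an illustration only, not Mahout's implementation: `PreferenceLayoutSketch` and `UserPreferences` are hypothetical names, and the per-object header and reference sizes are assumed round figures, since the real values depend on the JVM.

```java
// Hypothetical sketch; names and overhead figures are assumptions.
final class PreferenceLayoutSketch {

    // Assumed per-object header size; the real value is JVM-dependent.
    static final int ASSUMED_OBJECT_OVERHEAD_BYTES = 16;
    // Assumed size of the array slot that points at each object.
    static final int ASSUMED_REFERENCE_BYTES = 8;

    /** One user's preferences: a single user id plus two parallel arrays. */
    static final class UserPreferences {
        final long userId;
        final long[] itemIds;
        final float[] values;

        UserPreferences(long userId, int size) {
            this.userId = userId;
            this.itemIds = new long[size];
            this.values = new float[size];
        }

        void set(int i, long itemId, float value) {
            itemIds[i] = itemId;
            values[i] = value;
        }

        int length() {
            return itemIds.length;
        }
    }

    /** One object per preference: user id, item id, value, header, reference. */
    static long bytesAsObjects(long n) {
        return n * (8 + 8 + 4 + ASSUMED_OBJECT_OVERHEAD_BYTES + ASSUMED_REFERENCE_BYTES);
    }

    /** Parallel arrays: the user id is stored once (array headers ignored). */
    static long bytesAsParallelArrays(long n) {
        return 8 + n * (8 + 4);
    }

    public static void main(String[] args) {
        long n = 1_000_000L;
        System.out.println("objects:         " + bytesAsObjects(n) + " bytes");
        System.out.println("parallel arrays: " + bytesAsParallelArrays(n) + " bytes");
    }
}
```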
  35. 35. Phew!
  36. 36. Clustering
  37. 37. Clustering
  38. 38. Clustering: Surface naturally occurring groups of data. Requires a notion of similarity (and dissimilarity).
  39. 39. Clustering: Algorithms do not require training. Stopping condition: iterate until close enough.
  40. 40. Common Clustering Algorithms: K-Means, Fuzzy K-Means, Mean shift, Centroid generation, Dirichlet clustering
  41. 41. Representing Data: Feature Selection, Vectorization
  42. 42. Feature Selection: Figure out what features of your data are interesting
  43. 43. Vectorization: Represent the interesting features in an n-dimensional space
  44. 44. N-Dimensional Space: Every word in a group of documents. Size, shape, color of an object.
  46. 46. Representing Vectors: DenseVector, RandomAccessSparseVector, SequentialAccessSparseVector
  47. 47. Representing Vectors (with a "Random Seek" annotation): DenseVector, RandomAccessSparseVector, SequentialAccessSparseVector
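
To make vectorization concrete, here is a small sketch that treats every distinct word in a group of documents as one dimension and builds both a dense and a sparse representation. It uses only the JDK and does not use Mahout's vector classes; `VectorizationSketch` and its methods are hypothetical names, and splitting on whitespace is an assumed, deliberately naive tokenizer.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch; not Mahout's DenseVector or sparse vector classes.
final class VectorizationSketch {

    /** Gives each distinct word a dimension index, in order of first appearance. */
    static Map<String, Integer> buildVocabulary(List<String> documents) {
        Map<String, Integer> vocab = new LinkedHashMap<>();
        for (String doc : documents) {
            for (String word : doc.toLowerCase().split("\\s+")) {
                if (!word.isEmpty()) {
                    vocab.putIfAbsent(word, vocab.size());
                }
            }
        }
        return vocab;
    }

    /** Dense form: one slot per dimension, zeros included. */
    static double[] toDense(String doc, Map<String, Integer> vocab) {
        double[] vector = new double[vocab.size()];
        for (String word : doc.toLowerCase().split("\\s+")) {
            Integer index = vocab.get(word);
            if (index != null) {
                vector[index] += 1.0;
            }
        }
        return vector;
    }

    /** Sparse form: only non-zero dimensions are stored, keyed by index. */
    static Map<Integer, Double> toSparse(String doc, Map<String, Integer> vocab) {
        Map<Integer, Double> vector = new TreeMap<>();
        for (String word : doc.toLowerCase().split("\\s+")) {
            Integer index = vocab.get(word);
            if (index != null) {
                vector.merge(index, 1.0, Double::sum);
            }
        }
        return vector;
    }
}
```

With a large vocabulary most word counts are zero for any one document, which is why a sparse form can be much smaller than the dense one.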
  48. 48. Hadoop SequenceFiles: Input vectors as SequenceFile(s). Initial centroids as SequenceFile(s).
  49. 49. K-Means: 50+ years old, in common use for 25 years. Set the number of clusters, k. Works well even if you don’t pick a good k.
  50. 50. K-Means: Guess at initial placement of the centers (centroids). Expectation: assign the nearest points to each centroid. Maximization: reposition the centroid. Wash, rinse, repeat.
  51. 51. [Figure sequence, slides 51-63: K-Means animation showing centroids C1, C2 and C3 being repositioned over successive iterations, ending with "Stop!"]
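
The K-Means loop from the slides above (guess centroids, assign points, reposition, repeat until close enough) can be sketched for 2-D points in plain Java. This is a single-machine illustration and not Mahout's implementation: `KMeansSketch`, its method names, and the `tolerance` stopping parameter are assumptions made for the example.

```java
// Hypothetical single-machine sketch; not Mahout's K-Means job.
final class KMeansSketch {

    /**
     * Runs K-Means on 2-D points. The caller supplies the initial guess for
     * the centroids, which are updated in place. Returns, for each point,
     * the index of the centroid it was last assigned to.
     */
    static int[] cluster(double[][] points, double[][] centroids,
                         int maxIterations, double tolerance) {
        int[] assignment = new int[points.length];
        for (int iter = 0; iter < maxIterations; iter++) {
            // Expectation: assign each point to its nearest centroid.
            for (int p = 0; p < points.length; p++) {
                assignment[p] = nearest(points[p], centroids);
            }
            // Maximization: move each centroid to the mean of its points.
            double maxShift = 0;
            for (int c = 0; c < centroids.length; c++) {
                double sumX = 0, sumY = 0;
                int count = 0;
                for (int p = 0; p < points.length; p++) {
                    if (assignment[p] == c) {
                        sumX += points[p][0];
                        sumY += points[p][1];
                        count++;
                    }
                }
                if (count == 0) {
                    continue; // leave a centroid with no points where it is
                }
                double newX = sumX / count;
                double newY = sumY / count;
                maxShift = Math.max(maxShift,
                        Math.hypot(newX - centroids[c][0], newY - centroids[c][1]));
                centroids[c][0] = newX;
                centroids[c][1] = newY;
            }
            // Stopping condition: no centroid moved more than the tolerance.
            if (maxShift < tolerance) {
                break;
            }
        }
        return assignment;
    }

    /** Index of the centroid closest to the point (squared Euclidean distance). */
    static int nearest(double[] point, double[][] centroids) {
        int best = 0;
        double bestDistance = Double.POSITIVE_INFINITY;
        for (int c = 0; c < centroids.length; c++) {
            double dx = point[0] - centroids[c][0];
            double dy = point[1] - centroids[c][1];
            double distance = dx * dx + dy * dy;
            if (distance < bestDistance) {
                bestDistance = distance;
                best = c;
            }
        }
        return best;
    }
}
```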
  64. 64. Clustering
  65. 65. Clustering
  66. 66. Classification
  67. 67. Classification
  68. 68. Classification
  69. 69. Classification: BFF39D 577335 B3E631 D0F5B0 90B073 AFCF3C
  71. 71. Classification: BFF39D 577335 B3E631 D0F5B0 90B073 AFCF3C = Green
  72. 72. Attributes of Classification Algorithms: Require training (supervised). Make a single decision with a very limited set of outcomes.
  73. 73. Classification: Typical answers naturally fit into categories
  74. 74. Examples of Classification: Credit card fraud prediction, Customer attrition, Diabetes detector, Search Engine
  75. 75. Training: learned process that produces a model. Model: output of the training algorithm.
  76. 76. Predictor variable: input for classification model. Target variable: what we are trying to predict.
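
The terms above (training, model, predictor variable, target variable) can be illustrated with a toy classifier over the hex colors from the earlier slide. This sketch uses a nearest-mean rule chosen for brevity; it is not one of Mahout's algorithms. `ColorClassifierSketch` is a hypothetical name, and the "Red" training colors in the usage example are made up for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical toy classifier (nearest mean); not a Mahout algorithm.
final class ColorClassifierSketch {

    /** The model: one mean RGB vector per label, produced by training. */
    private final Map<String, double[]> meanByLabel = new HashMap<>();

    /** Predictor variables: the red, green and blue channels of a hex color. */
    static double[] rgb(String hex) {
        int value = Integer.parseInt(hex, 16);
        return new double[]{(value >> 16) & 0xFF, (value >> 8) & 0xFF, value & 0xFF};
    }

    /** Training: learn the mean color of each label (the target variable). */
    static ColorClassifierSketch train(String[] hexColors, String[] labels) {
        ColorClassifierSketch model = new ColorClassifierSketch();
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i < hexColors.length; i++) {
            double[] x = rgb(hexColors[i]);
            double[] sum = model.meanByLabel.computeIfAbsent(labels[i], k -> new double[3]);
            for (int c = 0; c < 3; c++) {
                sum[c] += x[c];
            }
            counts.merge(labels[i], 1, Integer::sum);
        }
        for (Map.Entry<String, double[]> entry : model.meanByLabel.entrySet()) {
            int n = counts.get(entry.getKey());
            for (int c = 0; c < 3; c++) {
                entry.getValue()[c] /= n;
            }
        }
        return model;
    }

    /** Prediction: the label whose mean color is closest to the input. */
    String predict(String hex) {
        double[] x = rgb(hex);
        String best = null;
        double bestDistance = Double.POSITIVE_INFINITY;
        for (Map.Entry<String, double[]> entry : meanByLabel.entrySet()) {
            double distance = 0;
            for (int c = 0; c < 3; c++) {
                double d = x[c] - entry.getValue()[c];
                distance += d * d;
            }
            if (distance < bestDistance) {
                bestDistance = distance;
                best = entry.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Green examples come from the slide; Red examples are made up.
        String[] colors = {"BFF39D", "577335", "B3E631", "D0F5B0", "90B073",
                           "E63131", "B03535", "F5B0B0"};
        String[] labels = {"Green", "Green", "Green", "Green", "Green",
                           "Red", "Red", "Red"};
        ColorClassifierSketch model = train(colors, labels);
        System.out.println(model.predict("AFCF3C"));
    }
}
```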
  77. 77. Classification
  78. 78. Common Algorithms: Stochastic Gradient Descent (SGD), Support Vector Machine (SVM), Naive Bayes, Complementary Naive Bayes, Random Forest
  79. 79. Going Distributed
  80. 80. Overhead: Parallel processing requires management overhead, especially when spread over multiple machines
  81. 81. Vector SequenceFile: Keys implement WritableComparable. Values implement Writable.
  82. 82. Java analogues: WritableComparable corresponds to Comparable; Writable corresponds to Serializable
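
The Writable idea can be sketched without Hadoop on the classpath. Hadoop's Writable interface declares `write(DataOutput)` and `readFields(DataInput)`; the `SimpleWritable` interface below is a local stand-in that mirrors those two methods, and `DenseVectorValue` is a hypothetical value type, not Mahout's vector wrapper.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical sketch; SimpleWritable stands in for Hadoop's Writable.
final class WritableSketch {

    /** Local stand-in mirroring the two methods of Hadoop's Writable. */
    interface SimpleWritable {
        void write(DataOutput out) throws IOException;
        void readFields(DataInput in) throws IOException;
    }

    /** A value that serializes itself field by field. */
    static final class DenseVectorValue implements SimpleWritable {
        double[] values = new double[0];

        DenseVectorValue(double... values) {
            this.values = values;
        }

        @Override
        public void write(DataOutput out) throws IOException {
            out.writeInt(values.length); // length prefix, then the elements
            for (double v : values) {
                out.writeDouble(v);
            }
        }

        @Override
        public void readFields(DataInput in) throws IOException {
            values = new double[in.readInt()];
            for (int i = 0; i < values.length; i++) {
                values[i] = in.readDouble();
            }
        }
    }

    /** Serializes a value to bytes, as a file writer would before storing it. */
    static byte[] toBytes(SimpleWritable value) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            value.write(new DataOutputStream(buffer));
            return buffer.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Refills an existing instance from bytes, reusing the object. */
    static void fromBytes(SimpleWritable target, byte[] bytes) {
        try {
            target.readFields(new DataInputStream(new ByteArrayInputStream(bytes)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Unlike java.io.Serializable, the class decides exactly which bytes are written, and `readFields` refills an existing instance rather than creating a new one.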
  83. 83. Recap
  84. 84. Recommender: Rank large datasets
  85. 85. Clustering: Group your data
  86. 86. Classification: Train me to think like you
  87. 87. Integration with Hadoop: Through SequenceFiles and Map/Reduce jobs
  88. 88. Resources
  89. 89. Resources
  90. 90. Thanks! Image credits: n-dimensional space http://en.wikipedia.org/wiki/File:Coord_system_CA_0.svg; Batman http://www.flickr.com/photos/farukahmet/3005752670/sizes/l/in/photostream/; duke http://kenai.com/projects/duke/pages/Home; mahout logo http://mahout.apache.org/; scalability diagram http://manning.com/owen/; classification diagram http://manning.com/owen/; phew http://www.flickr.com/photos/iain/1022210850/; clouds http://www.flickr.com/photos/spazzo_1493/3682989696/; spam http://www.flickr.com/photos/johotravels/4334224546/; credit card http://www.flickr.com/photos/thetruthabout/4542026865/; medical diagnostics http://www.flickr.com/photos/adrianclarkmbbs/3063516728/; search engines http://www.flickr.com/photos/enda/144377951/; angry http://www.flickr.com/photos/jmgasalla/3467458535/; crystal ball http://www.flickr.com/photos/mache/142561526/; glasses http://www.flickr.com/photos/nickwheeleroz/2220008689/; coffee http://www.flickr.com/photos/mr_t_in_dc/2818254382/
