Apache Mahout: Driving the Yellow Elephant
Grant Ingersoll
TriHUG http://www.trihug.org

Anyone Here Use Machine Learning?
  • Any users of:
      • Google?
          • Search?
          • Priority Inbox?
      • Facebook?
      • Twitter?
      • LinkedIn?

Topics
  • What is Machine Learning?
  • ML Use Cases
  • What is Mahout?
  • A Word on Scaling
  • What can I do with it right now?
  • Mahout and Hadoop: An Example

What is Machine Learning?
  • Amazon.com
  • Google News

Really it’s…
  • “Machine Learning is programming computers to optimize a performance criterion using example data or past experience”
      • Intro. to Machine Learning by E. Alpaydin
  • Subset of Artificial Intelligence
  • Lots of related fields:
      • Information Retrieval
      • Stats
      • Biology
      • Linear algebra
      • Many more

Common Use Cases
  • Recommend friends/dates/products
  • Classify content into predefined groups
  • Find similar content based on object properties
  • Find associations/patterns in actions/behaviors
  • Identify key topics in large collections of text
  • Detect anomalies in machine output
  • Rank search results
  • Others?

Apache Mahout
  • http://dictionary.reference.com/browse/mahout
  • An Apache Software Foundation project to create scalable machine learning libraries under the Apache Software License
      • http://mahout.apache.org
  • Why Mahout?
      • Many Open Source ML libraries either:
          • Lack Community
          • Lack Documentation and Examples
          • Lack Scalability
          • Lack the Apache License ;-)
          • Or are research-oriented

Who uses Mahout?
  • https://cwiki.apache.org/confluence/display/MAHOUT/Powered+By+Mahout

What does scalable mean?
  • Ted Dunning (Mahout committer):
      • As data grows linearly, either scale linearly in time or in machines
      • 2X data requires 2X time or 2X machines (or less!)
  • Goal: Be as fast and efficient as possible given the intrinsic design of the algorithm
      • Some algorithms won’t scale to massive machine clusters
      • Others fit logically on a Map Reduce framework like Apache Hadoop
      • Still others will need different distributed programming models
      • Be pragmatic

What Can I do with Mahout Right Now?
Recommendations
  • Extensive framework for collaborative filtering
  • Recommenders
      • User based
      • Item based
  • Online and Offline support
      • Offline can utilize Hadoop
  • Many different Similarity measures
      • Cosine, LLR, Tanimoto, Pearson, others
  • It’s Valentine’s Day soon!

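As a rough illustration of what a similarity measure computes, the sketch below implements three of the measures named above in plain Python. It is not Mahout's Java API: the function names, the dict-of-ratings representation, and the toy users are assumptions made for this example.

```python
import math

def cosine(a, b):
    """Cosine of the angle between two sparse rating vectors (dicts)."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def tanimoto(a, b):
    """Shared items divided by distinct items; ignores the rating values."""
    union = len(a.keys() | b.keys())
    return len(a.keys() & b.keys()) / union if union else 0.0

def pearson(a, b):
    """Pearson correlation over the items both users rated."""
    common = sorted(a.keys() & b.keys())
    n = len(common)
    if n < 2:
        return 0.0
    mean_a = sum(a[k] for k in common) / n
    mean_b = sum(b[k] for k in common) / n
    cov = sum((a[k] - mean_a) * (b[k] - mean_b) for k in common)
    var_a = sum((a[k] - mean_a) ** 2 for k in common)
    var_b = sum((b[k] - mean_b) ** 2 for k in common)
    return cov / math.sqrt(var_a * var_b) if var_a and var_b else 0.0

# Hypothetical users and ratings, for illustration only.
alice = {"item1": 5.0, "item2": 3.0, "item3": 4.0}
bob = {"item1": 4.0, "item2": 2.0, "item4": 5.0}
```

With these toy ratings, `tanimoto(alice, bob)` is 0.5 (two shared items out of four distinct ones) and `pearson(alice, bob)` is 1.0, because the two co-rated items rise and fall together.
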
Clustering
  • Document level
      • Group documents based on a notion of similarity
      • K-Means, Fuzzy K-Means, Dirichlet, Canopy, Mean-Shift
      • Distance Measures
          • Manhattan, Euclidean, other
  • Topic Modeling
      • Cluster words across documents to identify topics
      • Latent Dirichlet Allocation

Categorization
  • Place new items into predefined categories:
      • Sports, politics, entertainment
      • Recommenders
  • Implementations
      • Naïve Bayes
      • Complementary Naïve Bayes
      • Decision Forests
      • Linear Regression
  • See Chapter 17 of Mahout in Action for the Shop It To Me use case

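To show what "placing new items into predefined categories" looks like in practice, here is a minimal multinomial Naïve Bayes sketch in plain Python. It is an illustration only, not Mahout's implementation: the `train` and `classify` helpers, the add-one smoothing choice, and the four training documents are all assumptions.

```python
import math
from collections import Counter, defaultdict

def train(docs):
    """docs: list of (label, words). Collect the counts the classifier needs."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for label, words in docs:
        label_counts[label] += 1
        word_counts[label].update(words)
        vocab.update(words)
    return label_counts, word_counts, vocab

def classify(model, words):
    """Return the label with the highest log-probability for the given words."""
    label_counts, word_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label in sorted(label_counts):
        # Log prior for the category.
        score = math.log(label_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for w in words:
            # Add-one (Laplace) smoothing so unseen words do not zero out the score.
            score += math.log((word_counts[label][w] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical training data, for illustration only.
training = [
    ("sports", ["goal", "team", "score"]),
    ("sports", ["team", "coach", "win"]),
    ("politics", ["vote", "election", "senate"]),
    ("politics", ["senate", "bill", "vote"]),
]
model = train(training)
```

On this toy model, `classify(model, ["team", "win"])` returns `"sports"` and `classify(model, ["vote", "bill"])` returns `"politics"`.
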
Frequent Pattern Mining
  • Identify frequently co-occurring items
  • Useful for:
      • Query Recommendations
          • Apple -> iPhone, orange, OS X
      • Related product placement
          • Basket Analysis
  • http://awe.sm/5FyNe
  • http://www.amazon.com

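The idea of "frequently co-occurring items" can be sketched with simple pair counting. This is a deliberately naive stand-in written in plain Python, not the algorithm Mahout ships; the `frequent_pairs` helper, the `min_support` parameter name, and the baskets are assumptions for the example.

```python
from collections import Counter
from itertools import combinations

def frequent_pairs(baskets, min_support):
    """Count how often each pair of items shares a basket; keep the frequent ones."""
    counts = Counter()
    for basket in baskets:
        # Sort and de-duplicate so each pair has one canonical form per basket.
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_support}

# Hypothetical shopping baskets, for illustration only.
baskets = [
    ["beer", "diapers", "chips"],
    ["beer", "diapers"],
    ["milk", "diapers"],
    ["beer", "chips"],
]
pairs = frequent_pairs(baskets, min_support=2)
```

Here `pairs` keeps `("beer", "chips")` and `("beer", "diapers")`, each seen in two baskets, and drops the pairs seen only once. Counting every pair grows quadratically with basket size, which is one reason real frequent-pattern miners use more careful data structures.
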
Evolutionary
  • Map-Reduce ready fitness functions for genetic programming
  • Integration with Watchmaker
      • http://watchmaker.uncommons.org/index.php
  • Problems solved:
      • Traveling salesman
      • Class discovery
      • Many others
  • Caveat: Hasn’t received as much attention as others

Other
  • Primitive Collections!
  • Math library
      • Vectors, Matrices, etc.
  • Noise Reduction via Singular Value Decomposition
  • Export from Lucene/Solr and other formats

Mahout and Hadoop
  • Most Mahout implementations are built on Map-Reduce
  • Many also have sequential implementations
      • Linear Regression is blazingly fast without needing M/R
  • Let’s look at how K-Means is implemented in Mahout

K-Means
  • Clustering Algorithm
      • Nicely parallelizable!
  • http://en.wikipedia.org/wiki/K-means_clustering

K-Means in Map-Reduce
  • Input:
      • Mahout Vectors representing the original content
      • Either:
          • A predefined set of initial centroids (can be from Canopy)
          • --k: the number of clusters to produce
  • Iterate
      • Do the centroid calculation (more in a moment)
  • Clustering Step (optional)
  • Output
      • Centroids (as Mahout Vectors)
      • Points for each Centroid (if Clustering Step was taken)

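The overall flow above (start from initial centroids, iterate the centroid calculation, stop on convergence) can be sketched on a single machine as follows. This is plain Python rather than Mahout or Hadoop code; the `kmeans` function and the `max_iterations` and `convergence_delta` parameter names are assumptions for the example, and the optional clustering step is left out.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kmeans(points, centroids, max_iterations=10, convergence_delta=0.001):
    """Iterate the centroid calculation until no centroid moves more than the delta."""
    for _ in range(max_iterations):
        # Assign every point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: euclidean(p, centroids[i]))
            clusters[nearest].append(p)
        # Recompute each centroid as the mean of its points (keep it if it has none).
        new_centroids = []
        for old, members in zip(centroids, clusters):
            if members:
                new_centroids.append(tuple(sum(dim) / len(members) for dim in zip(*members)))
            else:
                new_centroids.append(old)
        converged = all(
            euclidean(o, n) <= convergence_delta for o, n in zip(centroids, new_centroids)
        )
        centroids = new_centroids
        if converged:
            break
    return centroids

# Hypothetical 2-D input vectors and initial centroids, for illustration only.
points = [(1.0, 1.0), (1.0, 2.0), (8.0, 8.0), (9.0, 8.0)]
final = kmeans(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
```

On this toy input the loop settles after two passes with centroids `(1.0, 1.5)` and `(8.5, 8.0)`. In the Map-Reduce version, each pass of this loop corresponds to one job over the data.
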
Map-Reduce Iteration
  • Each Iteration calculates the Centroids using:
      • KMeansMapper
      • KMeansCombiner
      • KMeansReducer
  • Clustering Step
      • Calculate the points for each Centroid using:
          • KMeansClusterMapper

KMeansMapper
  • During Setup:
      • Load the initial Centroids (or the Centroids from the last iteration)
  • Map Phase
      • For each input
          • Calculate its distance from each Centroid and output the closest one
  • Distance Measures are pluggable
      • Manhattan, Euclidean, Squared Euclidean, Cosine, others

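The map phase described above can be sketched as a single function that takes one input vector and emits the id of the closest centroid. This is plain Python for illustration, not the actual KMeansMapper source; `kmeans_map`, the `distance` parameter, and the centroid ids are assumptions.

```python
import math

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kmeans_map(point, centroids, distance=euclidean):
    """Emit (id of the nearest centroid, point). The distance measure is pluggable."""
    nearest = min(centroids, key=lambda cid: distance(point, centroids[cid]))
    return nearest, point

# Hypothetical centroids, as they would be loaded during setup.
centroids = {"c0": (0.0, 0.0), "c1": (10.0, 0.0)}
```

For example, `kmeans_map((2.0, 1.0), centroids)` emits `("c0", (2.0, 1.0))`, and passing `distance=manhattan` swaps the measure without touching the rest of the function.
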
KMeansReducer
  • Setup:
      • Load up clusters
      • Convergence information
      • Partial sums from KMeansCombiner (more in a moment)
  • Reduce Phase
      • Sum all the vectors in the cluster to produce a new Centroid
      • Check for Convergence
      • Output cluster

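A minimal sketch of the reduce phase, in plain Python rather than the real KMeansReducer: it assumes the partial sums arrive as `(count, vector_sum)` pairs, that the new centroid is the mean of the summed vectors (as in standard k-means), and that convergence means the centroid moved no more than a threshold. The names `kmeans_reduce` and `convergence_delta` are hypothetical.

```python
import math

def kmeans_reduce(old_centroid, partials, convergence_delta=0.001):
    """Combine partial sums for one cluster into a new centroid.

    partials: non-empty iterable of (count, vector_sum) pairs.
    Returns (new_centroid, converged).
    """
    total = 0
    sums = [0.0] * len(old_centroid)
    for count, vector_sum in partials:
        total += count
        sums = [s + v for s, v in zip(sums, vector_sum)]
    new_centroid = tuple(s / total for s in sums)
    # Convergence check: how far did the centroid move from the previous one?
    moved = math.sqrt(sum((o - n) ** 2 for o, n in zip(old_centroid, new_centroid)))
    return new_centroid, moved <= convergence_delta
```

For example, `kmeans_reduce((0.0, 0.0), [(2, (2.0, 3.0)), (1, (1.0, 3.0))])` returns `((1.0, 2.0), False)`: three points sum to `(3.0, 6.0)`, and the centroid has moved well past the threshold.
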
KMeansCombiner
  • A Combiner is like a Map-side Reducer which helps save on IO
  • Just like KMeansReducer, but only produces a partial sum of the cluster based on the data local to the Mapper

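The IO saving comes from collapsing many mapper outputs into one record per cluster before anything crosses the network. A plain-Python sketch of that idea, with a hypothetical `kmeans_combine` helper and an assumed `(count, vector_sum)` record shape:

```python
def kmeans_combine(mapper_output):
    """Collapse (cluster_id, vector) pairs from one mapper into partial sums.

    Returns {cluster_id: (count, vector_sum)}, one record per cluster.
    """
    partials = {}
    for cluster_id, vector in mapper_output:
        if cluster_id in partials:
            count, sums = partials[cluster_id]
            partials[cluster_id] = (count + 1, tuple(s + v for s, v in zip(sums, vector)))
        else:
            partials[cluster_id] = (1, tuple(vector))
    return partials

# Hypothetical output of a single mapper, for illustration only.
local_output = [("c0", (1.0, 1.0)), ("c0", (1.0, 2.0)), ("c1", (8.0, 8.0))]
combined = kmeans_combine(local_output)
```

Three mapper records become two: `"c0"` carries a count of 2 with sum `(2.0, 3.0)`, and `"c1"` carries a count of 1. Keeping the count alongside the sum is what lets a later stage compute a correct mean across mappers.
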
KMeansClusterMapper
  • Some applications only care about what the Centroids are, so this step is optional
  • Setup:
      • Load up the clusters and the DistanceMeasure used
  • Map Phase
      • Calculate which Cluster the point belongs to
      • Output <ClusterId, Vector>

Summary
  • Machine learning is all over the web today
  • Mahout is about scalable machine learning
  • Mahout has functionality for many of today’s common machine learning tasks
  • Many Mahout implementations use Hadoop
  • KMeans clustering is an example of a machine learning algorithm in Mahout that is implemented using Map Reduce

Resources
  • http://mahout.apache.org
  • http://cwiki.apache.org/MAHOUT
  • {user|dev}@mahout.apache.org
  • http://svn.apache.org/repos/asf/mahout/trunk
  • http://hadoop.apache.org

Resources
  • “Mahout in Action” by Owen, Anil, Dunning and Friedman
      • http://awe.sm/5FyNe
  • “Introducing Apache Mahout”
      • http://www.ibm.com/developerworks/java/library/j-mahout/
  • “Programming Collective Intelligence” by Toby Segaran
  • “Data Mining - Practical Machine Learning Tools and Techniques” by Ian H. Witten and Eibe Frank

References
  • HAL: http://en.wikipedia.org/wiki/File:Hal-9000.jpg
  • Terminator: http://en.wikipedia.org/wiki/File:Terminator1984movieposter.jpg
  • Matrix: http://en.wikipedia.org/wiki/File:The_Matrix_Poster.jpg
  • Google News: http://news.google.com
  • Amazon.com: http://www.amazon.com
  • Facebook: http://www.facebook.com
  • Couple: http://www.vlemx.com/
  • Beer and Diapers:
      • http://www.flickr.com/photos/baubcat/2484459070/
      • http://www.theregister.co.uk/2006/08/15/beer_diapers/
  • DMOZ: http://www.dmoz.org
  • Shopping Cart: http://themeanestmom.blogspot.com/2010/09/shopping-carts.html


Editor's Notes

  • #5 A few things come to mind
  • #23 Convergence just checks to see how far the centroid has moved from the previous centroid