Intro to Apache MahoutGrant IngersollLucid Imaginationhttp://www.lucidimagination.com
Anyone Here Use Machine Learning?Any users of:Google?Search?Priority Inbox?Facebook?Twitter?LinkedIn?
TopicsBackground and Use CasesWhat can you do in Mahout?Where’s the community at?ResourcesK-Means in Hadoop (time permitting)
Definition“Machine Learning is programming computers to optimize a performance criterion using example data or past experience”Intro. To Machine Learning by E. AlpaydinSubset of Artificial IntelligenceLots of related fields:Information RetrievalStatsBiologyLinear algebraMany more
Common Use CasesRecommend friends/dates/productsClassify content into predefined groupsFind similar contentFind associations/patterns in actions/behaviorsIdentify key topics/summarize textDocuments and CorporaDetect anomalies/fraudRanking search resultsOthers?
Apache MahoutAn Apache Software Foundation project to create scalable machine learning libraries under the Apache Software Licensehttp://mahout.apache.orgWhy Mahout?Many Open Source ML libraries either:Lack CommunityLack Documentation and ExamplesLack ScalabilityLack the Apache LicenseOr are research-orientedDefinition:http://dictionary.reference.com/browse/mahout
What does scalable mean to us?Goal: Be as fast and efficient as possible given the intrinsic design of the algorithmSome algorithms won’t scale to massive machine clustersOthers fit logically on a Map Reduce framework like Apache HadoopStill others will need different distributed programming modelsOthers are already fast (SGD)Be pragmatic
Sampling of Who uses Mahout?https://cwiki.apache.org/confluence/display/MAHOUT/Powered+By+Mahout
What Can I do with Mahout Right Now?3C + FPM + O = Mahout
Collaborative FilteringExtensive framework for collaborative filtering (recommenders)RecommendersUser basedItem basedOnline and Offline supportOffline can utilize HadoopMany different Similarity measuresCosine, LLR, Tanimoto, Pearson, others
ClusteringDocument levelGroup documents based on a notion of similarityK-Means, Fuzzy K-Means, Dirichlet, Canopy, Mean-Shift, EigenCuts (Spectral)All Map/ReduceDistance MeasuresManhattan, Euclidean, otherTopic Modeling Cluster words across documents to identify topicsLatent Dirichlet Allocation (M/R)
CategorizationPlace new items into predefined categories:Sports, politics, entertainmentRecommendersImplementationsNaïve Bayes (M/R)Compl. Naïve Bayes (M/R)Decision Forests (M/R)Linear Regression (Seq. but Fast!)See Chapter 17 of Mahout in Action for Shop It To Me use case:
http://awe.sm/5FyNeFreq. Pattern MiningIdentify frequently co-occurrent itemsUseful for:Query RecommendationsApple -> iPhone, orange, OS XRelated product placementBasket AnalysisMap/Reducehttp://www.amazon.com
OtherPrimitive Collections!Collocations (M/R)Math libraryVectors, Matrices, etc.Noise Reduction via Singular Value Decomp (M/R)
Prepare Data from Raw contentData Sources:Lucene integrationbin/mahout lucene.vector…Document Vectorizerbin/mahout seqdirectory …bin/mahout seq2sparse …ProgrammaticallySee the Utils module in Mahout and the Iterator<Vector> classesDatabaseFile system
How to: Command LineMost algorithms have a Driver program$MAHOUT_HOME/bin/mahout.shhelps with most tasksPrepare the DataDifferent algorithms require different setupRun the algorithmSingle NodeHadoopPrint out the results or incorporate into applicationSeveral helper classes: LDAPrintTopics, ClusterDumper, etc.
What’s Happening Now?Unified Framework for Clustering and Classification0.5 release on the horizon (May?)Working towards 1.0 release by focusing on:Tests, examples, documentationAPI cleanup and consistencyGearing up for Google Summer of CodeNew M/R work for Hidden Markov Models
SummaryMachine learning is all over the web todayMahout is about scalable machine learningMahout has functionality for many of today’s common machine learning tasksMany Mahout implementations use Hadoop
Resourceshttp://mahout.apache.orghttp://cwiki.apache.org/MAHOUT{user|dev}@mahout.apache.orghttp://svn.apache.org/repos/asf/mahout/trunkhttp://hadoop.apache.org
Resources“Mahout in Action” Owen, Anil, Dunning and Friedmanhttp://awe.sm/5FyNe“Introducing Apache Mahout” http://www.ibm.com/developerworks/java/library/j-mahout/“Taming Text” by Ingersoll, Morton, Farris“Programming Collective Intelligence” by Toby Segaran“Data Mining - Practical Machine Learning Tools and Techniques” by Ian H. Witten and Eibe Frank“Data-Intensive Text Processing with MapReduce” by Jimmy Lin and  Chris Dyer
K-MeansClustering AlgorithmNicely parallelizable!http://en.wikipedia.org/wiki/K-means_clustering
K-Means in Map-ReduceInput:Mahout Vectors representing the original contentEither:A predefined set of initial centroids (Can be from Canopy)--k – The number of clusters to produceIterateDo the centroid calculation (more in a moment)Clustering Step (optional)OutputCentroids (as Mahout Vectors)Points for each Centroid (if Clustering Step was taken)
Map-Reduce IterationEach Iteration calculates the Centroids using:KMeansMapperKMeansCombinerKMeansReducerClustering StepCalculate the points for each Centroid using:KMeansClusterMapper
KMeansMapperDuring Setup:Load the initial Centroids (or the Centroids from the last iteration)Map PhaseFor each inputCalculate it’s distance from each Centroid and output the closest oneDistance Measures are pluggableManhattan, Euclidean, Squared Euclidean, Cosine, others
KMeansReducerSetup:Load up clustersConvergence informationPartial sums from KMeansCombiner (more in a moment)Reduce PhaseSum all the vectors in the cluster to produce a new CentroidCheck for ConvergenceOutput cluster
KMeansCombinerJust like KMeansReducer, but only produces partial sum of the cluster based on the data local to the Mapper
KMeansClusterMapperSome applications only care about what the Centroids are, so this step is optionalSetup:Load up the clusters and the DistanceMeasure usedMap PhaseCalculate which Cluster the point belongs toOutput <ClusterId, Vector>

Intro to Mahout -- DC Hadoop

  • 1.
    Intro to ApacheMahoutGrant IngersollLucid Imaginationhttp://www.lucidimagination.com
  • 2.
    Anyone Here UseMachine Learning?Any users of:Google?Search?Priority Inbox?Facebook?Twitter?LinkedIn?
  • 3.
    TopicsBackground and UseCasesWhat can you do in Mahout?Where’s the community at?ResourcesK-Means in Hadoop (time permitting)
  • 4.
    Definition“Machine Learning isprogramming computers to optimize a performance criterion using example data or past experience”Intro. To Machine Learning by E. AlpaydinSubset of Artificial IntelligenceLots of related fields:Information RetrievalStatsBiologyLinear algebraMany more
  • 5.
    Common Use CasesRecommendfriends/dates/productsClassify content into predefined groupsFind similar contentFind associations/patterns in actions/behaviorsIdentify key topics/summarize textDocuments and CorporaDetect anomalies/fraudRanking search resultsOthers?
  • 6.
    Apache MahoutAn ApacheSoftware Foundation project to create scalable machine learning libraries under the Apache Software Licensehttp://mahout.apache.orgWhy Mahout?Many Open Source ML libraries either:Lack CommunityLack Documentation and ExamplesLack ScalabilityLack the Apache LicenseOr are research-orientedDefinition:http://dictionary.reference.com/browse/mahout
  • 7.
    What does scalablemean to us?Goal: Be as fast and efficient as possible given the intrinsic design of the algorithmSome algorithms won’t scale to massive machine clustersOthers fit logically on a Map Reduce framework like Apache HadoopStill others will need different distributed programming modelsOthers are already fast (SGD)Be pragmatic
  • 8.
    Sampling of Whouses Mahout?https://cwiki.apache.org/confluence/display/MAHOUT/Powered+By+Mahout
  • 9.
    What Can Ido with Mahout Right Now?3C + FPM + O = Mahout
  • 10.
    Collaborative FilteringExtensive frameworkfor collaborative filtering (recommenders)RecommendersUser basedItem basedOnline and Offline supportOffline can utilize HadoopMany different Similarity measuresCosine, LLR, Tanimoto, Pearson, others
  • 11.
    ClusteringDocument levelGroup documentsbased on a notion of similarityK-Means, Fuzzy K-Means, Dirichlet, Canopy, Mean-Shift, EigenCuts (Spectral)All Map/ReduceDistance MeasuresManhattan, Euclidean, otherTopic Modeling Cluster words across documents to identify topicsLatent Dirichlet Allocation (M/R)
  • 12.
    CategorizationPlace new itemsinto predefined categories:Sports, politics, entertainmentRecommendersImplementationsNaïve Bayes (M/R)Compl. Naïve Bayes (M/R)Decision Forests (M/R)Linear Regression (Seq. but Fast!)See Chapter 17 of Mahout in Action for Shop It To Me use case:
  • 13.
    http://awe.sm/5FyNeFreq. Pattern MiningIdentifyfrequently co-occurrent itemsUseful for:Query RecommendationsApple -> iPhone, orange, OS XRelated product placementBasket AnalysisMap/Reducehttp://www.amazon.com
  • 14.
    OtherPrimitive Collections!Collocations (M/R)MathlibraryVectors, Matrices, etc.Noise Reduction via Singular Value Decomp (M/R)
  • 15.
    Prepare Data fromRaw contentData Sources:Lucene integrationbin/mahout lucene.vector…Document Vectorizerbin/mahout seqdirectory …bin/mahout seq2sparse …ProgrammaticallySee the Utils module in Mahout and the Iterator<Vector> classesDatabaseFile system
  • 16.
    How to: CommandLineMost algorithms have a Driver program$MAHOUT_HOME/bin/mahout.shhelps with most tasksPrepare the DataDifferent algorithms require different setupRun the algorithmSingle NodeHadoopPrint out the results or incorporate into applicationSeveral helper classes: LDAPrintTopics, ClusterDumper, etc.
  • 17.
    What’s Happening Now?UnifiedFramework for Clustering and Classification0.5 release on the horizon (May?)Working towards 1.0 release by focusing on:Tests, examples, documentationAPI cleanup and consistencyGearing up for Google Summer of CodeNew M/R work for Hidden Markov Models
  • 18.
    SummaryMachine learning isall over the web todayMahout is about scalable machine learningMahout has functionality for many of today’s common machine learning tasksMany Mahout implementations use Hadoop
  • 19.
  • 20.
    Resources“Mahout in Action”Owen, Anil, Dunning and Friedmanhttp://awe.sm/5FyNe“Introducing Apache Mahout” http://www.ibm.com/developerworks/java/library/j-mahout/“Taming Text” by Ingersoll, Morton, Farris“Programming Collective Intelligence” by Toby Segaran“Data Mining - Practical Machine Learning Tools and Techniques” by Ian H. Witten and Eibe Frank“Data-Intensive Text Processing with MapReduce” by Jimmy Lin and Chris Dyer
  • 21.
  • 22.
    K-Means in Map-ReduceInput:MahoutVectors representing the original contentEither:A predefined set of initial centroids (Can be from Canopy)--k – The number of clusters to produceIterateDo the centroid calculation (more in a moment)Clustering Step (optional)OutputCentroids (as Mahout Vectors)Points for each Centroid (if Clustering Step was taken)
  • 23.
    Map-Reduce IterationEach Iterationcalculates the Centroids using:KMeansMapperKMeansCombinerKMeansReducerClustering StepCalculate the points for each Centroid using:KMeansClusterMapper
  • 24.
    KMeansMapperDuring Setup:Load theinitial Centroids (or the Centroids from the last iteration)Map PhaseFor each inputCalculate it’s distance from each Centroid and output the closest oneDistance Measures are pluggableManhattan, Euclidean, Squared Euclidean, Cosine, others
  • 25.
    KMeansReducerSetup:Load up clustersConvergenceinformationPartial sums from KMeansCombiner (more in a moment)Reduce PhaseSum all the vectors in the cluster to produce a new CentroidCheck for ConvergenceOutput cluster
  • 26.
    KMeansCombinerJust like KMeansReducer,but only produces partial sum of the cluster based on the data local to the Mapper
  • 27.
    KMeansClusterMapperSome applications onlycare about what the Centroids are, so this step is optionalSetup:Load up the clusters and the DistanceMeasure usedMap PhaseCalculate which Cluster the point belongs toOutput <ClusterId, Vector>

Editor's Notes

  • #10 3C: The three C’s: clustering, classification and collaborative filteringFPM: Frequent patternset miningO: Other (math, collections, etc.)
  • #26 Convergence just checks to see how far the centroid has moved from the previous centroid