Apache Mahout: Driving the Yellow Elephant

Grant Ingersoll
TriHUG, http://www.trihug.org

Anyone Here Use Machine Learning?
- Any users of:
  - Google?
    - Search?
    - Priority Inbox?
  - Facebook?
  - Twitter?
  - LinkedIn?

Topics
- What is Machine Learning?
- ML Use Cases
- What is Mahout?
- A Word on Scaling
- What can I do with it right now?
- Mahout and Hadoop: An Example

What is Machine Learning?
- Amazon.com
- Google News

Really it’s…
- “Machine Learning is programming computers to optimize a performance criterion using example data or past experience”
  - Intro. To Machine Learning by E. Alpaydin
- Subset of Artificial Intelligence
- Lots of related fields:
  - Information Retrieval
  - Stats
  - Biology
  - Linear algebra
  - Many more

Common Use Cases
- Recommend friends/dates/products
- Classify content into predefined groups
- Find similar content based on object properties
- Find associations/patterns in actions/behaviors
- Identify key topics in large collections of text
- Detect anomalies in machine output
- Ranking search results
- Others?

Apache Mahout
- http://dictionary.reference.com/browse/mahout
- An Apache Software Foundation project to create scalable machine learning libraries under the Apache Software License
  - http://mahout.apache.org
- Why Mahout?
  - Many Open Source ML libraries either:
    - Lack Community
    - Lack Documentation and Examples
    - Lack Scalability
    - Lack the Apache License ;-)
    - Or are research-oriented

Who uses Mahout?
- https://cwiki.apache.org/confluence/display/MAHOUT/Powered+By+Mahout

What does scalable mean?
- Ted Dunning (Mahout committer):
  - As data grows linearly, either scale linearly in time or in machines
  - 2X data requires 2X time or 2X machines (or less!)
- Goal: Be as fast and efficient as possible given the intrinsic design of the algorithm
  - Some algorithms won’t scale to massive machine clusters
  - Others fit logically on a Map Reduce framework like Apache Hadoop
  - Still others will need different distributed programming models
  - Be pragmatic

What Can I do with Mahout Right Now?

Recommendations
- Extensive framework for collaborative filtering
- Recommenders
  - User based
  - Item based
- Online and Offline support
  - Offline can utilize Hadoop
- Many different Similarity measures
  - Cosine, LLR, Tanimoto, Pearson, others
- It’s Valentine’s Day soon!
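
Three of the similarity measures named on this slide can be sketched in a few lines. This is a minimal illustration in plain Python, not Mahout's API: the function names and the sample ratings are assumptions made for the example, and LLR is left out for brevity.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two rating vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def tanimoto_similarity(items_a, items_b):
    """Size of the intersection over size of the union of two item sets."""
    a, b = set(items_a), set(items_b)
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def pearson_correlation(a, b):
    """Pearson correlation of two equal-length rating vectors."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)


# Hypothetical ratings by two users for the same three items.
alice = [5.0, 3.0, 4.0]
bob = [4.0, 2.0, 3.0]
print(cosine_similarity(alice, bob))
print(pearson_correlation(alice, bob))
print(tanimoto_similarity({"a", "b"}, {"b", "c"}))
```

Tanimoto ignores rating values and only looks at which items two users share, so it suits data where only "liked / did not like" is known; cosine and Pearson need the numeric ratings.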

Clustering
- Document level
  - Group documents based on a notion of similarity
  - K-Means, Fuzzy K-Means, Dirichlet, Canopy, Mean-Shift
  - Distance Measures
    - Manhattan, Euclidean, other
- Topic Modeling
  - Cluster words across documents to identify topics
  - Latent Dirichlet Allocation

Categorization
- Place new items into predefined categories:
  - Sports, politics, entertainment
  - Recommenders
- Implementations
  - Naïve Bayes
  - Complementary Naïve Bayes
  - Decision Forests
  - Linear Regression
- See Chapter 17 of Mahout in Action for Shop It To Me use case:
  - http://awe.sm/5FyNe

Freq. Pattern Mining
- Identify frequently co-occurring items
- Useful for:
  - Query Recommendations
    - Apple -> iPhone, orange, OS X
  - Related product placement
  - Basket Analysis
- http://www.amazon.com
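
To make "frequently co-occurring items" concrete, here is a minimal basket-analysis sketch. It counts item pairs by brute force, which is an illustration only and not the algorithm Mahout uses; the function name, the baskets, and the `min_support` threshold are assumptions made for the example.

```python
from collections import Counter
from itertools import combinations


def frequent_pairs(baskets, min_support):
    """Return item pairs that appear together in at least min_support baskets."""
    counts = Counter()
    for basket in baskets:
        # Sort and de-duplicate so ("beer", "diapers") and
        # ("diapers", "beer") are counted as the same pair.
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_support}


# Hypothetical shopping baskets.
baskets = [
    ["beer", "diapers", "chips"],
    ["beer", "diapers"],
    ["diapers", "wipes"],
    ["beer", "chips"],
]
pairs = frequent_pairs(baskets, min_support=2)
print(pairs)  # {('beer', 'chips'): 2, ('beer', 'diapers'): 2}
```

Brute-force pair counting grows quadratically with basket size, which is one reason a scalable implementation needs a smarter algorithm and a distributed framework.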

Evolutionary
- Map-Reduce ready fitness functions for genetic programming
- Integration with Watchmaker
  - http://watchmaker.uncommons.org/index.php
- Problems solved:
  - Traveling salesman
  - Class discovery
  - Many others
- Caveat: Hasn’t received as much attention as others

Other
- Primitive Collections!
- Math library
  - Vectors, Matrices, etc.
- Noise Reduction via Singular Value Decomposition
- Export from Lucene/Solr and other formats

Mahout and Hadoop
- Most Mahout implementations are built on Map-Reduce
  - Many also have sequential implementations
  - Linear Regression is blazingly fast without needing M/R
- Let’s look at how K-Means is implemented in Mahout

K-Means
- Clustering Algorithm
  - Nicely parallelizable!
- http://en.wikipedia.org/wiki/K-means_clustering

K-Means in Map-Reduce
- Input:
  - Mahout Vectors representing the original content
  - Either:
    - A predefined set of initial centroids (Can be from Canopy)
    - --k: The number of clusters to produce
- Iterate
  - Do the centroid calculation (more in a moment)
- Clustering Step (optional)
- Output
  - Centroids (as Mahout Vectors)
  - Points for each Centroid (if Clustering Step was taken)
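
The overall flow on this slide (input vectors and initial centroids, iterate, optional clustering step, output) can be sketched sequentially. This is a single-machine illustration in plain Python, not Mahout's driver: the function names, the `max_iterations` and `convergence_delta` parameters, and the sample points are all assumptions made for the example.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest(point, centroids):
    """Index of the centroid closest to point."""
    return min(range(len(centroids)), key=lambda i: euclidean(point, centroids[i]))


def kmeans(points, initial_centroids, max_iterations=10,
           convergence_delta=1e-6, cluster_points=True):
    """Iterate the centroid calculation, then optionally assign points."""
    centroids = [list(c) for c in initial_centroids]
    for _ in range(max_iterations):
        # Centroid calculation: group points by closest centroid, then average.
        groups = [[] for _ in centroids]
        for p in points:
            groups[nearest(p, centroids)].append(p)
        new_centroids = []
        for old, members in zip(centroids, groups):
            if members:
                new_centroids.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centroids.append(old)  # keep an empty cluster where it was
        # Converged when no centroid moved more than convergence_delta.
        converged = all(euclidean(o, n) <= convergence_delta
                        for o, n in zip(centroids, new_centroids))
        centroids = new_centroids
        if converged:
            break
    # Optional clustering step: record which centroid each point belongs to.
    assignments = [nearest(p, centroids) for p in points] if cluster_points else None
    return centroids, assignments


points = [[1.0, 1.0], [1.0, 2.0], [9.0, 9.0], [10.0, 9.0]]
centroids, assignments = kmeans(points, [[0.0, 0.0], [10.0, 10.0]])
print(centroids)    # [[1.0, 1.5], [9.5, 9.0]]
print(assignments)  # [0, 0, 1, 1]
```

Passing `cluster_points=False` skips the final assignment, mirroring the case where only the centroids are wanted.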

Map-Reduce Iteration
- Each Iteration calculates the Centroids using:
  - KMeansMapper
  - KMeansCombiner
  - KMeansReducer
- Clustering Step
  - Calculate the points for each Centroid using:
    - KMeansClusterMapper

KMeansMapper
- During Setup:
  - Load the initial Centroids (or the Centroids from the last iteration)
- Map Phase
  - For each input
    - Calculate its distance from each Centroid and output the closest one
- Distance Measures are pluggable
  - Manhattan, Euclidean, Squared Euclidean, Cosine, others
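
The map phase and the idea of a pluggable distance measure can be sketched as follows. This is plain Python rather than Mahout's KMeansMapper or Hadoop's Mapper interface: `kmeans_map`, the distance function names, and the centroid ids are assumptions made for the example.

```python
import math


def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))


def squared_euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def euclidean(a, b):
    return math.sqrt(squared_euclidean(a, b))


def cosine_distance(a, b):
    """1 - cosine similarity; treated as 1.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norms if norms else 1.0


def kmeans_map(point, centroids, distance=euclidean):
    """Emit (id of the closest centroid, point) for one input vector.

    centroids is a dict of centroid id -> vector, standing in for the
    centroids loaded during setup.
    """
    closest = min(centroids, key=lambda cid: distance(point, centroids[cid]))
    return closest, point


centroids = {"c0": [0.0, 0.0], "c1": [10.0, 10.0]}
inputs = [[1.0, 2.0], [9.0, 8.0]]
emitted = [kmeans_map(p, centroids, distance=manhattan) for p in inputs]
print(emitted)  # [('c0', [1.0, 2.0]), ('c1', [9.0, 8.0])]
```

Because the distance is just a function argument, swapping Manhattan for cosine changes which centroid counts as "closest" without touching the mapper logic.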

KMeansReducer
- Setup:
  - Load up clusters
  - Convergence information
  - Partial sums from KMeansCombiner (more in a moment)
- Reduce Phase
  - Sum all the vectors in the cluster to produce a new Centroid
  - Check for Convergence
  - Output cluster
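
A reduce step along these lines might look like the sketch below. It assumes the partial results arrive as (vector sum, point count) pairs and that the new centroid is the total sum divided by the total count; that input shape, the function name, and the `convergence_delta` parameter are assumptions for illustration, not Mahout's KMeansReducer.

```python
import math


def kmeans_reduce(cluster_id, partials, old_centroid, convergence_delta=0.5):
    """Merge partial sums for one cluster into a new centroid.

    partials is an iterable of (vector_sum, point_count) pairs.
    Returns (cluster_id, new_centroid, converged).
    """
    total = [0.0] * len(old_centroid)
    count = 0
    for vector_sum, n in partials:
        total = [t + v for t, v in zip(total, vector_sum)]
        count += n
    # An empty cluster keeps its previous centroid.
    new_centroid = [t / count for t in total] if count else list(old_centroid)
    # Converged when the centroid moved no more than convergence_delta.
    shift = math.sqrt(sum((o - n) ** 2 for o, n in zip(old_centroid, new_centroid)))
    return cluster_id, new_centroid, shift <= convergence_delta


partials = [([2.0, 3.0], 2), ([4.0, 3.0], 1)]
print(kmeans_reduce("c0", partials, old_centroid=[2.0, 2.0]))
# ('c0', [2.0, 2.0], True)
```

Carrying the point count alongside each sum is what lets the reducer compute a correct mean from partial results.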

KMeansCombiner
- A Combiner is like a Map-side Reducer which helps save on IO
- Just like KMeansReducer, but only produces partial sum of the cluster based on the data local to the Mapper
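
The IO saving comes from collapsing many (cluster, point) pairs into one partial sum per cluster before anything leaves the mapper's machine. A minimal sketch, with a hypothetical function name and sample data, and not Mahout's KMeansCombiner:

```python
def kmeans_combine(mapper_output):
    """Collapse one mapper's (cluster_id, point) pairs into partial sums.

    Returns a dict of cluster_id -> (vector_sum, point_count).
    """
    partials = {}
    for cluster_id, point in mapper_output:
        if cluster_id not in partials:
            partials[cluster_id] = (list(point), 1)
        else:
            vector_sum, count = partials[cluster_id]
            partials[cluster_id] = ([s + x for s, x in zip(vector_sum, point)], count + 1)
    return partials


# Hypothetical output of a single mapper: three pairs go in...
local_output = [("c0", [1.0, 1.0]), ("c1", [9.0, 9.0]), ("c0", [1.0, 2.0])]
partials = kmeans_combine(local_output)
# ...and only one record per cluster comes out.
print(partials)  # {'c0': ([2.0, 3.0], 2), 'c1': ([9.0, 9.0], 1)}
```

With millions of points per mapper and a handful of clusters, the data sent across the network shrinks from one record per point to one record per cluster.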

KMeansClusterMapper
- Some applications only care about what the Centroids are, so this step is optional
- Setup:
  - Load up the clusters and the DistanceMeasure used
- Map Phase
  - Calculate which Cluster the point belongs to
  - Output <ClusterId, Vector>
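
The clustering step is a map-only pass over the input using the final centroids. The sketch below emits <ClusterId, Vector> pairs and then groups them to show the "points for each Centroid" output; the function names, the Euclidean default, and the sample data are assumptions for illustration, not Mahout's KMeansClusterMapper.

```python
import math
from collections import defaultdict


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def kmeans_cluster_map(points, clusters, distance=euclidean):
    """Yield (cluster_id, vector) for every point, using the final centroids."""
    for point in points:
        cluster_id = min(clusters, key=lambda cid: distance(point, clusters[cid]))
        yield cluster_id, point


def points_by_cluster(pairs):
    """Group emitted pairs into cluster_id -> list of vectors."""
    grouped = defaultdict(list)
    for cluster_id, point in pairs:
        grouped[cluster_id].append(point)
    return dict(grouped)


# Hypothetical final centroids and input points.
final_clusters = {"c0": [1.0, 1.5], "c1": [9.5, 9.0]}
data = [[1.0, 1.0], [9.0, 9.0], [2.0, 1.0]]
grouped = points_by_cluster(kmeans_cluster_map(data, final_clusters))
print(grouped)  # {'c0': [[1.0, 1.0], [2.0, 1.0]], 'c1': [[9.0, 9.0]]}
```

No reducer is needed here because each point's assignment depends only on that point and the fixed centroids.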

Summary
- Machine learning is all over the web today
- Mahout is about scalable machine learning
- Mahout has functionality for many of today’s common machine learning tasks
- Many Mahout implementations use Hadoop
- KMeans clustering is an example of a machine learning algorithm in Mahout that is implemented using Map Reduce

Resources
- http://mahout.apache.org
- http://cwiki.apache.org/MAHOUT
- {user|dev}@mahout.apache.org
- http://svn.apache.org/repos/asf/mahout/trunk
- http://hadoop.apache.org

Resources
- “Mahout in Action” by Owen, Anil, Dunning and Friedman
  - http://awe.sm/5FyNe
- “Introducing Apache Mahout”
  - http://www.ibm.com/developerworks/java/library/j-mahout/
- “Programming Collective Intelligence” by Toby Segaran
- “Data Mining - Practical Machine Learning Tools and Techniques” by Ian H. Witten and Eibe Frank

References
- HAL: http://en.wikipedia.org/wiki/File:Hal-9000.jpg
- Terminator: http://en.wikipedia.org/wiki/File:Terminator1984movieposter.jpg
- Matrix: http://en.wikipedia.org/wiki/File:The_Matrix_Poster.jpg
- Google News: http://news.google.com
- Amazon.com: http://www.amazon.com
- Facebook: http://www.facebook.com
- Couple: http://www.vlemx.com/
- Beer and Diapers:
  - http://www.flickr.com/photos/baubcat/2484459070/
  - http://www.theregister.co.uk/2006/08/15/beer_diapers/
- DMOZ: http://www.dmoz.org
- Shopping Cart: http://themeanestmom.blogspot.com/2010/09/shopping-carts.html
