Mahout's presentation at AlphaCSP's The Edge 2010

Who’s this guy? Java Framework Team at IDI. 10 years of experience in IT. 6 years of experience in Java. Master's in Informatics Engineering, specializing in Artificial Intelligence. Has a weird accent. Aliyah http://www.flickr.com/photos/triphenawong/4752510292/

Agenda: Machine Learning, Mahout, Recommender Engines, Clustering, Categorization, Hadoop

Machine Learning Whatchatalkin' 'bout, Willis?

Machine Learning: Well-known use cases for Machine Learning: Recommender Engines, Clustering, Classification

Machine Learning Recommender Engines: Amazon

Machine Learning Recommender Engines: Facebook

Machine Learning Clustering: Google News

Machine Learning Classification: Spam Detection

Machine Learning Classification: Picasa face recognition

Machine Learning: Why learn “Machine Learning”? Because it’s interesting. Because it makes money.

Mahout: What is it? An open source project by the Apache Software Foundation. Goal: to build scalable machine learning libraries. Large data sets (Hadoop). Commercially friendly Apache Software license. Community.

Mahout: What’s that name? Mahout [muh-hout] (mə’haʊt): a mahout is a person who keeps and drives an elephant. The name Mahout comes from the project's use of Apache Hadoop, which has a yellow elephant as its logo, for scalability and fault tolerance.

Mahout Mahout and its related projects

Mahout History (timeline, 2008 to 2010): The Lucene Project Management Committee announces the creation of the Mahout subproject. Taste Collaborative Filtering donates its codebase to the Mahout project. Release 0.1. Release 0.2. Release 0.3. Mahout becomes an Apache top level project. Release 0.4. Mahout is presented at AlphaCSP’s The Edge 2010.

Mahout: Similar Products. Yet another framework? Weka (since 1999). 38 Java projects listed on mloss.org (Machine Learning Open Source Software).

Mahout: Machine Learning Challenges. Large amount of input data: techniques work better. Nature of the deploying context: must produce results quickly. The amount of input is so large that it is not feasible to process it all on one computer, even a powerful one.

Mahout: Scalability. Mahout core algorithms are implemented on top of Apache Hadoop using the map/reduce paradigm.

Mahout: MapReduce. Programming model introduced by Google in 2004. Many real world tasks are expressible in this model (“Map-Reduce for Machine Learning on Multicore”, Stanford CS Department’s paper, 2006). Provides automatic parallelization and distribution. Runs on large clusters of compute nodes. Highly scalable. Hadoop is Apache’s open source implementation.
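To make the MapReduce model on this slide concrete, here is a minimal single-process sketch in plain Java. It is not Hadoop API code: the class and method names (MapReduceSketch, map, reduce, run) are made up for illustration, and the word-count task is an assumed example rather than something taken from the talk.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Single-process illustration of the map/reduce idea; hypothetical names, no Hadoop involved.
class MapReduceSketch {

    // Map phase: turn one input line into (word, 1) pairs.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) {
                pairs.add(new AbstractMap.SimpleEntry<>(word, 1));
            }
        }
        return pairs;
    }

    // Reduce phase: combine all values emitted for one key.
    static int reduce(List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    // Driver: group mapper output by key (the "shuffle"), then reduce each group.
    static Map<String, Integer> run(List<String> lines) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : lines) {
            for (Map.Entry<String, Integer> pair : map(line)) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
            }
        }
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            counts.put(e.getKey(), reduce(e.getValue()));
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(run(Arrays.asList("mahout drives the elephant", "the elephant is yellow")));
    }
}
```

In a real cluster the framework runs many map calls in parallel on different nodes and routes each key to a reducer; the sketch only shows the shape of the two functions a programmer has to supply.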

Recommender Engines: Approaches. User based. Item based. Collaborative filtering vs content-based recommendation.

Recommender Engines: What do we need? Data model: Users, Items, Preferences (ratings). ItemSimilarity. UserSimilarity. UserNeighborhood. Recommender.

Recommender Engines: T-bone, Chocolate, Lettuce, Rump
http://www.flickr.com/photos/martinimike/3770274175/
http://www.flickr.com/photos/fotoosvanrobin/3182238046/
http://www.flickr.com/photos/this_girl_daydreams/3190110968/
http://www.flickr.com/photos/19998197@N00/3238445535/

Recommender Engines Kuki The Vegan Gilad Ariel

Recommender Engines
// We create a DataModel based on the information contained in food.csv
DataModel model = new FileDataModel(new File("food.csv"));
// We use one of the several user similarity functions we have available
UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
// Same thing with the UserNeighborhood definition
UserNeighborhood neighborhood = new NearestNUserNeighborhood(hoodSize, similarity, model);
// Finally we can build our recommender
Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);
// And ask for recommendations for a specific user
List<RecommendedItem> recommendations = recommender.recommend(userId, howMany);
for (RecommendedItem recommendation : recommendations) {
    System.out.println(recommendation);
}
Available similarity implementations: CachingUserSimilarity, EuclideanDistanceSimilarity, GenericUserSimilarity, LogLikelihoodSimilarity, PearsonCorrelationSimilarity, SpearmanCorrelationSimilarity, TanimotoCoefficientSimilarity, UncenteredCosineSimilarity
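The slide picks a Pearson correlation as the user similarity. As a rough idea of what such a similarity measures, the following plain-Java sketch computes the Pearson correlation between two users over the items both have rated. It is an illustration only, not Mahout's implementation: the class name, the NaN convention for "not rated", and the ratings in main are all assumptions.

```java
// Illustrative Pearson correlation between two users' ratings (hypothetical helper, not Mahout code).
class PearsonSketch {

    // a[i] and b[i] are the two users' ratings for item i; Double.NaN means "not rated".
    // Returns a value in [-1, 1], or NaN when the correlation is undefined.
    static double pearson(double[] a, double[] b) {
        int n = 0;
        double sumA = 0.0, sumB = 0.0;
        for (int i = 0; i < a.length; i++) {
            if (Double.isNaN(a[i]) || Double.isNaN(b[i])) continue; // skip items not rated by both
            n++;
            sumA += a[i];
            sumB += b[i];
        }
        if (n < 2) return Double.NaN; // need at least two co-rated items
        double meanA = sumA / n, meanB = sumB / n;
        double cov = 0.0, varA = 0.0, varB = 0.0;
        for (int i = 0; i < a.length; i++) {
            if (Double.isNaN(a[i]) || Double.isNaN(b[i])) continue;
            double da = a[i] - meanA, db = b[i] - meanB;
            cov += da * db;
            varA += da * da;
            varB += db * db;
        }
        if (varA == 0.0 || varB == 0.0) return Double.NaN; // a user with constant ratings
        return cov / Math.sqrt(varA * varB);
    }

    public static void main(String[] args) {
        // Made-up ratings for four items; the second user did not rate the third item.
        double[] first = {5.0, 1.0, 4.0, 5.0};
        double[] second = {4.0, 2.0, Double.NaN, 5.0};
        System.out.println(pearson(first, second));
    }
}
```

A value near 1 means the two users rate alike, so one user's highly rated items are good candidates for the other; a neighborhood is then just the N users with the highest similarity.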

Recommender Engines: What would we recommend to Ariel? Recommendation for Ariel: T-bone, rating 4.0

Recommender Engines: No initial information. 10 most popular. Random selection. What other customers are looking at right now. Bestsellers. Best prices. Nothing at all.

Clustering: Clustering is about drawing lines

Clustering: Possible weather conditions recognition. Inputs to clustering: temperature, wind direction, humidity, wind speed. http://www.icons-land.com

Clustering Vector representation 25 / 50 = 0.5
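The "25 / 50 = 0.5" on this slide reads like a raw measurement divided by a maximum so that it lands between 0 and 1. Assuming that interpretation, a minimal Java sketch of turning raw readings into a vector could look like this; the class name, the feature order, and every maximum except the 50 are hypothetical.

```java
import java.util.Arrays;

// Scales each raw feature by an assumed maximum so all vector components fall in [0, 1].
class VectorSketch {

    static double[] toVector(double[] raw, double[] max) {
        if (raw.length != max.length) {
            throw new IllegalArgumentException("raw and max must have the same length");
        }
        double[] vector = new double[raw.length];
        for (int i = 0; i < raw.length; i++) {
            vector[i] = raw[i] / max[i];
        }
        return vector;
    }

    public static void main(String[] args) {
        // Assumed order: temperature, humidity, wind speed, wind direction (degrees).
        double[] raw = {25.0, 60.0, 30.0, 90.0};
        double[] max = {50.0, 100.0, 120.0, 360.0};
        System.out.println(Arrays.toString(toVector(raw, max)));
    }
}
```

Scaling matters for clustering because distance measures would otherwise be dominated by whichever feature happens to have the largest numeric range.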

Clustering: Samples Generation. 300 samples, mean [0.0, 2.0], SD 0.1. 500 samples, mean [1.0, 1.0], SD 3.0. 300 samples, mean [1.0, 0.0], SD 0.5.
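The three sample groups listed on this slide can be reproduced with a few lines of plain Java. The sketch below draws 2D points from a normal distribution around each mean with the given standard deviation; the class and method names and the fixed seed are assumptions for illustration, not the code used in the talk.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Generates 2D Gaussian samples like the groups described on the slide (hypothetical helper).
class SampleGenerationSketch {

    // Returns count points, each coordinate drawn from N(mean, sd^2).
    static double[][] generate(int count, double meanX, double meanY, double sd, Random rnd) {
        double[][] points = new double[count][2];
        for (int i = 0; i < count; i++) {
            points[i][0] = meanX + sd * rnd.nextGaussian();
            points[i][1] = meanY + sd * rnd.nextGaussian();
        }
        return points;
    }

    public static void main(String[] args) {
        Random rnd = new Random(42); // fixed seed so runs are repeatable
        List<double[]> all = new ArrayList<>();
        for (double[] p : generate(300, 0.0, 2.0, 0.1, rnd)) all.add(p);
        for (double[] p : generate(500, 1.0, 1.0, 3.0, rnd)) all.add(p);
        for (double[] p : generate(300, 1.0, 0.0, 0.5, rnd)) all.add(p);
        System.out.println(all.size() + " samples generated");
    }
}
```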

Clustering Iterations with Fuzzy K-Means
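Unlike plain k-means, fuzzy k-means lets each point belong to every cluster to some degree. A minimal sketch of the standard membership formula, u_i = 1 / sum_k (d_i / d_k)^(2 / (m - 1)), is shown below in Java. The class name is hypothetical and the code is a textbook illustration, not Mahout's implementation; m is the fuzziness parameter and must be greater than 1.

```java
import java.util.Arrays;

// Textbook fuzzy k-means membership degrees for one point (illustrative, not Mahout code).
class FuzzyMembershipSketch {

    // distances[i] is the distance from the point to cluster center i; m > 1 is the fuzziness.
    static double[] memberships(double[] distances, double m) {
        double[] u = new double[distances.length];
        for (int i = 0; i < distances.length; i++) {
            if (distances[i] == 0.0) { // the point sits exactly on a center
                u[i] = 1.0;            // all other memberships stay 0
                return u;
            }
        }
        double exponent = 2.0 / (m - 1.0);
        for (int i = 0; i < distances.length; i++) {
            double sum = 0.0;
            for (int k = 0; k < distances.length; k++) {
                sum += Math.pow(distances[i] / distances[k], exponent);
            }
            u[i] = 1.0 / sum;
        }
        return u;
    }

    public static void main(String[] args) {
        // A point three times closer to the first center than to the second.
        System.out.println(Arrays.toString(memberships(new double[]{1.0, 3.0}, 2.0)));
    }
}
```

Each iteration recomputes these memberships and then moves every center to the membership-weighted mean of the points, which is why the clusters drift toward the data over successive iterations.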

Clustering: Clustering Discovery. Original data generation vs discovered clusters.

Clustering: available distance measures: CosineDistanceMeasure, EuclideanDistanceMeasure, MahalanobisDistanceMeasure, ManhattanDistanceMeasure, SquaredEuclideanDistanceMeasure, TanimotoDistanceMeasure, WeightedDistanceMeasure, WeightedEuclideanDistanceMeasure, WeightedManhattanDistanceMeasure
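To give a feel for how some of the distance measures named on this slide differ, here is a small plain-Java sketch of four of them using their usual textbook definitions. The class and method names are made up for illustration and the code does not use or reproduce Mahout's classes.

```java
// Textbook versions of four distance measures (hypothetical helper, not Mahout's classes).
class DistanceSketch {

    static double dot(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) sum += a[i] * b[i];
        return sum;
    }

    // Straight-line distance.
    static double euclidean(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) sum += (a[i] - b[i]) * (a[i] - b[i]);
        return Math.sqrt(sum);
    }

    // Sum of absolute coordinate differences.
    static double manhattan(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) sum += Math.abs(a[i] - b[i]);
        return sum;
    }

    // 1 minus the cosine of the angle between the vectors; both must be non-zero.
    static double cosine(double[] a, double[] b) {
        return 1.0 - dot(a, b) / (Math.sqrt(dot(a, a)) * Math.sqrt(dot(b, b)));
    }

    // 1 minus the Tanimoto coefficient a.b / (|a|^2 + |b|^2 - a.b); not defined for two zero vectors.
    static double tanimoto(double[] a, double[] b) {
        double ab = dot(a, b);
        return 1.0 - ab / (dot(a, a) + dot(b, b) - ab);
    }

    public static void main(String[] args) {
        double[] p = {0.0, 0.0};
        double[] q = {3.0, 4.0};
        System.out.println("euclidean: " + euclidean(p, q));
        System.out.println("manhattan: " + manhattan(p, q));
    }
}
```

Euclidean and Manhattan distances care about magnitude, while cosine distance only looks at direction, so the choice changes which points end up in the same cluster.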

Categorization Categorization Steps

Categorization: Our example: What do we want to do? A Classifier labels each Document as Java or Sport.

Categorization: Documents Preparation. Labeled Documents are split into Training and Test sets. BayesFileFormatter (Lucene’s Analyzers) writes each document as: Label <tab> evidence1 <space> evidence2
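The "Label <tab> evidence1 <space> evidence2" layout on this slide can be illustrated with a short plain-Java sketch. It stands in for what the slide attributes to BayesFileFormatter and Lucene's analyzers, but the tokenizer here (lowercase, split on anything that is not a letter or digit) is a deliberately crude assumption, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Writes one training line in "label<TAB>word word word" form (illustrative stand-in only).
class TrainingLineSketch {

    // Crude tokenizer: lowercase the text and split on non-alphanumeric characters.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String token : text.toLowerCase().split("[^a-z0-9]+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    static String format(String label, String text) {
        return label + "\t" + String.join(" ", tokenize(text));
    }

    public static void main(String[] args) {
        System.out.println(format("java", "Garbage collection tuning for the JVM"));
        System.out.println(format("sport", "The striker scored twice in the final"));
    }
}
```

A real analyzer would typically also drop stop words and may stem terms, so the evidence words produced by an actual pipeline will differ from this sketch.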

Categorization Using the classifier

Categorization: Categorization testing, the confusion matrix
Summary
-------------------------------------------------------
Correctly Classified Instances   : 93    93%
Incorrectly Classified Instances : 7     7%
Total Classified Instances       : 100
=======================================================
Confusion Matrix
-------------------------------------------------------
java   sport   <--Classified as
56     3       | 59   java
4      37      | 41   sport
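The summary figures on this slide follow directly from the confusion matrix: the diagonal holds the correctly classified documents. A minimal Java sketch of that arithmetic, using the slide's numbers and a hypothetical class name:

```java
// Derives the summary figures from a confusion matrix (rows: actual label, columns: classified as).
class ConfusionMatrixSketch {

    // Correct classifications sit on the diagonal.
    static int correct(int[][] matrix) {
        int sum = 0;
        for (int i = 0; i < matrix.length; i++) sum += matrix[i][i];
        return sum;
    }

    static int total(int[][] matrix) {
        int sum = 0;
        for (int[] row : matrix) {
            for (int cell : row) sum += cell;
        }
        return sum;
    }

    static double accuracy(int[][] matrix) {
        return (double) correct(matrix) / total(matrix);
    }

    public static void main(String[] args) {
        // Row 0: actual java documents; row 1: actual sport documents.
        int[][] matrix = {{56, 3}, {4, 37}};
        System.out.println("correct:   " + correct(matrix));
        System.out.println("incorrect: " + (total(matrix) - correct(matrix)));
        System.out.println("accuracy:  " + accuracy(matrix));
    }
}
```

With the slide's matrix this gives 56 + 37 = 93 correct out of 100, and the off-diagonal cells show that 3 java documents were classified as sport and 4 sport documents as java.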

Hadoop: Why do we need distributed computing? The size of our dataset can’t be handled by a single machine. Scale-up vs scale-out. We need the results in nearly real time.

Hadoop: Data goes into the Hadoop Compute Cluster, Results come out

Hadoop: Hadoop Jobs. We need to: configure the job, submit it, control its execution, query its state. We want to: just run our machine learning algorithm!

Hadoop: Mahout’s AbstractJob and Drivers. Mahout provides an out-of-the-box AbstractJob class and several Job and Driver implementations to run machine learning algorithms on the cluster without any hassle.

Hadoop: What we need. Our code, including a Job. Mahout jars. Hadoop jars. All of their dependency jars. Resources. The dataset.

Hadoop Packaging a Job – The Maven solution pom.xml

Hadoop: Job feeding. The Job and the Dataset are sent to the Hadoop Compute Cluster.

Hadoop We take the project’s dependencies

Hadoop Using an Ant task, we pack everything together
