Google knows what you want to search for, Amazon knows what you want to buy, and Facebook knows who your friends are before you connect with them. Are they reading your mind? The answer is yes.
In this session I'll teach you how to use Mahout, an Apache machine learning library, to read the minds of your users based on the information you have already gathered. And if your dataset is too big, we'll use Hadoop to analyze it in a distributed environment.
http://www.javaedge.com/jedge/abstracts.jsp
2. Who’s this guy? Java Framework Team at IDI. 10 years of experience in IT. 6 years of experience in Java. Master’s in Informatics Engineering, specializing in Artificial Intelligence. Has a weird accent. Aliyah. http://www.flickr.com/photos/triphenawong/4752510292/
14. What is it? Open source project by the Apache Software Foundation. Goal: to build scalable machine learning libraries. Large data sets (Hadoop). Commercially friendly Apache Software License. Community. Mahout
15. Mahout - [muh-hout] - (mə’haʊt) A mahout is a person who keeps and drives an elephant. The name Mahout comes from the project's use of Apache Hadoop — which has a yellow elephant as its logo — for scalability and fault tolerance. What’s that name? Mahout
18. Mahout History. 2008: The Lucene Project Management Committee announces the creation of the Mahout subproject; the Taste Collaborative Filtering project donates its codebase to Mahout. 2009: Releases 0.1 and 0.2. 2010: Releases 0.3 and 0.4; Mahout becomes an Apache top-level project; Mahout is presented at AlphaCSP’s The Edge 2010.
20. Yet another framework? Similar products: Weka (since 1999); 38 Java projects listed on mloss.org (Machine Learning Open Source Software). Mahout
21. Machine Learning Challenges. Large amounts of input data: the techniques work better with more data. Nature of the deployment context: we must produce results quickly. The amount of input is so large that it is not feasible to process it all on one computer, even a powerful one. Mahout
22. Mahout core algorithms are implemented on top of Apache Hadoop using the map/reduce paradigm. Scalability Mahout
23. Programming model introduced by Google in 2004 Many real world tasks are expressible in this model (“Map-Reduce for Machine Learning on Multicore”, Stanford CS Department’s paper, 2006) Provides automatic parallelization and distribution Runs on large clusters of compute nodes Highly scalable Hadoop is Apache’s open source implementation MapReduce Mahout
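The model above can be illustrated without a cluster at all. The sketch below is a minimal single-process simulation of MapReduce (plain Java, not the Hadoop API): map() emits (key, value) pairs, a grouping step plays the role of the shuffle, and reduce() folds each group. Word count, the classic example, stands in for the input data.

```java
import java.util.*;
import java.util.stream.*;

public class MapReduceSketch {
    // Map phase: each input line yields one (word, 1) pair per word.
    static List<Map.Entry<String, Integer>> map(String line) {
        return Arrays.stream(line.toLowerCase().split("\\s+"))
                     .map(w -> Map.entry(w, 1))
                     .collect(Collectors.toList());
    }

    // Reduce phase: sum all the values emitted for a single key.
    static int reduce(List<Integer> counts) {
        return counts.stream().mapToInt(Integer::intValue).sum();
    }

    public static void main(String[] args) {
        List<String> input = List.of("mahout rides hadoop", "hadoop scales mahout");

        // "Shuffle": group all emitted pairs by key, as the framework would
        // before handing each group to a reducer.
        Map<String, List<Integer>> grouped = input.stream()
                .flatMap(line -> map(line).stream())
                .collect(Collectors.groupingBy(Map.Entry::getKey,
                         Collectors.mapping(Map.Entry::getValue, Collectors.toList())));

        grouped.forEach((word, counts) ->
                System.out.println(word + "=" + reduce(counts)));
    }
}
```

On a real cluster, Hadoop distributes the map and reduce calls across nodes; the per-key independence of reduce() is what makes the parallelization automatic.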
32. Recommender Engines

// We create a DataModel based on the information contained in food.csv
DataModel model = new FileDataModel(new File("food.csv"));
// We use one of the several user similarity functions we have available
UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
// Same thing with the UserNeighborhood definition
UserNeighborhood neighborhood = new NearestNUserNeighborhood(hoodSize, similarity, model);
// Finally we can build our recommender
Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);
// And ask for recommendations for a specific user
List<RecommendedItem> recommendations = recommender.recommend(userId, howMany);
for (RecommendedItem recommendation : recommendations) {
  System.out.println(recommendation);
}

Available UserSimilarity implementations: CachingUserSimilarity, EuclideanDistanceSimilarity, GenericUserSimilarity, LogLikelihoodSimilarity, PearsonCorrelationSimilarity, SpearmanCorrelationSimilarity, TanimotoCoefficientSimilarity, UncenteredCosineSimilarity
33. Recommender Engines. What would we recommend to Ariel? Recommendation for Ariel: T-bone, rating 4.0.
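To show what drives a recommendation like this, here is the Pearson correlation computed by hand — the same measure PearsonCorrelationSimilarity uses to compare two users over the items both have rated. The user names and ratings are made up for illustration; they are not from the slide's dataset.

```java
import java.util.*;

public class PearsonSketch {
    // Pearson correlation between two equal-length rating vectors.
    static double pearson(double[] a, double[] b) {
        double meanA = Arrays.stream(a).average().orElse(0);
        double meanB = Arrays.stream(b).average().orElse(0);
        double cov = 0, varA = 0, varB = 0;
        for (int i = 0; i < a.length; i++) {
            double da = a[i] - meanA, db = b[i] - meanB;
            cov  += da * db;   // co-movement of the two users' ratings
            varA += da * da;
            varB += db * db;
        }
        return cov / Math.sqrt(varA * varB);
    }

    public static void main(String[] args) {
        // Hypothetical ratings by two users on the same three dishes.
        double[] ariel = {4.0, 3.0, 5.0};
        double[] dana  = {4.0, 2.0, 5.0};
        // A value near 1.0 means their tastes agree, so items the
        // similar user rated highly become candidates for Ariel.
        System.out.println(pearson(ariel, dana));
    }
}
```

Mahout's user-based recommender aggregates these similarity scores over a whole neighborhood of users rather than a single pair.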
35. Recommender Engines. No initial information — what do we recommend? 10 most popular; random selection; what other customers are looking at right now; bestsellers; best prices; nothing at all.
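The simplest of those cold-start strategies can be sketched in a few lines: when a user has no rating history, fall back to the globally most popular items. The item names and purchase counts below are hypothetical.

```java
import java.util.*;
import java.util.stream.*;

public class ColdStartFallback {
    public static void main(String[] args) {
        // item -> number of purchases across all users (made-up counts)
        Map<String, Integer> popularity = Map.of(
                "t-bone", 42, "falafel", 37, "sushi", 19, "salad", 8);

        // Top-2 bestsellers: a reasonable default when no ratings exist yet.
        List<String> fallback = popularity.entrySet().stream()
                .sorted(Map.Entry.<String, Integer>comparingByValue().reversed())
                .limit(2)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());

        System.out.println(fallback);
    }
}
```

In practice this fallback is served until the user accumulates enough ratings for collaborative filtering to take over.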
52. Why do we need distributed computing? The size of our dataset can’t be handled by a single machine (scale-up vs. scale-out). We need the results in near real time. Hadoop
54. Hadoop Jobs. We need to: configure the job, submit it, control its execution, and query its state. We want to: just run our machine learning algorithm! Hadoop
55. Mahout’s AbstractJob and Drivers. Mahout provides an out-of-the-box AbstractJob class and several Job and Driver implementations so we can run machine learning algorithms on the cluster without any hassle. Hadoop
56. What we need: our code, including a Job; the Mahout jars; the Hadoop jars; everyone’s dependency jars; resources; the dataset. Hadoop
62. Hadoop — Running our Job

Upload our job to HDFS:
$ hadoop fs -put myjob.job /myjob.job

Upload the dataset to HDFS:
$ hadoop fs -put dataset.dat /dataset.dat

Run the job (jar, class, input, output):
$ hadoop jar /myjob.job c.a.RecommenderJob /dataset.dat /output.dat
63. Summary. Our dataset is too big and we need the results fast. Mahout gives us, out of the box, all we need to run on Hadoop. We can package it all together with Maven. Our machine learning algorithms are running on the cluster! Hadoop
64. Mahout’s website Mahout in Action – May 2011 (est.) Introducing Apache Mahout @ IBM developerWorks References
65. Thank You Any questions? We appreciate your feedback
Editor's Notes
Machine learning
Make it clear that I don’t want the crowd to read the table, it’s only to generate an overwhelming sensation
Recommender systems
Strictly speaking, these are examples of “collaborative filtering” -- producing recommendations based on, and only based on, knowledge of users’ relationships to items. These techniques require no knowledge of the properties of the items themselves. This is, in a way, an advantage. This recommender framework couldn’t care less whether the “items” are books, theme parks, flowers, or even other people, since nothing about their attributes enters into any of the input.
UserSimilarity: way to compare users (user-based approach)
ItemSimilarity: way to compare items (item-based approach)
Recommender: interface for providing recommendations
UserNeighborhood: interface for computing a neighborhood of similar users that can then be used by the Recommenders