Jul. 27, 2013

This presentation gives an introduction to Apache Mahout and Machine Learning. It presents some of the important Machine Learning algorithms implemented in Mahout. Machine Learning is a vast subject; this presentation is only an introductory guide to Mahout and does not go into lower-level implementation details.

Varad Meru

- { “Mahout” : “Scalable Machine Learning Library” } { “Presented By” : “Varad Meru”, “Company” : “Orzota, Inc”, “Twitter” : “@vrdmr” }
- { “Mahout” : “Introduction” }
- { “Introduction” : “History and Etymology” }
  - A Scalable Machine Learning Library built on Hadoop, written in Java.
  - Driven by the paper “Map-Reduce for Machine Learning on Multicore” (Chu et al., from Andrew Ng's group).
  - Started as a Lucene sub-project. Became an Apache top-level project (TLP) in April 2010.
  - Latest version: 0.6 (released on 6th Feb 2012).
  - “Mahout” means keeper/driver of elephants. The name fits because many of the algorithms are implemented in MapReduce on Hadoop, whose mascot is an elephant.
  - Mahout was started by Isabel Drost, Grant Ingersoll, and Karl Wettin.
  - The Taste Recommendation Framework was added later by Sean Owen.
  - Figure 1.1 (from “Mahout in Action”): Apache Mahout and its related projects within the Apache Foundation. Accompanying excerpt: “Much of Mahout’s work has been to not only implement these algorithms conventionally, in an efficient and scalable way, but also to convert some of these algorithms to work at scale on top of Hadoop. Hadoop’s mascot is an elephant, which at last explains the project name! Mahout incubates a number of techniques and algorithms, many still in development or in an experimental phase. At this early stage in the project's life, three core themes are evident: collaborative filtering / recommender engines, clustering, and classification. This is by no means all that exists within Mahout, but are the most prominent and mature themes at the time of writing.”
- { “Mahout” : “Machine Learning” }
- { “Machine Learning” : “Introduction” }
  - “Machine Learning is Programming Computers to optimize a Performance Criterion using Example Data or Past Experience”
  - Branch of Artificial Intelligence
  - Design and Development of Algorithms
  - Computers Evolve Behavior based on Empirical Data.
  - Supervised Learning: using labeled training data to create a classifier that can predict output for unseen inputs.
  - Unsupervised Learning: using unlabeled data to discover structure, such as groups or patterns, with no known outputs to learn from.
  - Semi-Supervised Learning
- { “Machine Learning” : “Applications” }
  - Recommend friends, dates, and products to the end user.
  - Classify content into pre-defined groups.
  - Find similar content based on object properties.
  - Identify key topics in large collections of text.
  - Detect anomalies within given data.
  - Rank search results by learning from user feedback.
  - Classify DNA sequences.
  - Sentiment analysis / opinion mining.
  - Computer vision.
  - Natural language processing.
  - Bioinformatics.
  - Speech and handwriting recognition.
  - Others ...
- { “Machine Learning” : “Challenges” }
  - Big Data
  - Yesterday's processing on next-generation data.
  - Time for processing
  - Large and cheap storage

  | Size | Classification | Tools |
  |---|---|---|
  | Lines (sample data) | Analysis and Visualization | Whiteboard, bash, ... |
  | KBs to low MBs (prototype data) | Analysis and Visualization | Matlab, Octave, R, Processing, bash, ... |
  | MBs to low GBs (online data) | Storage | MySQL (DBs), ... |
  | MBs to low GBs (online data) | Analysis | NumPy, SciPy, Weka, BLAS/LAPACK, ... |
  | MBs to low GBs (online data) | Visualization | Flare, AmCharts, Raphael, Protovis, ... |
  | GBs, TBs, PBs (big data) | Storage | HDFS, HBase, Cassandra, ... |
  | GBs, TBs, PBs (big data) | Analysis | Hive, Mahout, Hama, Giraph, ... |
- { “Machine Learning” : “Mahout for Big Data” }
  - Goal: “Be as Fast and Efficient as possible given the intrinsic design of the Algorithm”.
    - Some algorithms won’t scale to massive machine clusters.
    - Others fit logically on a MapReduce framework like Apache Hadoop.
    - Most Mahout implementations are MapReduce enabled.
  - Focus: “Scalability with Hadoop’s MapReduce Processing Framework on BigData on Hadoop’s HDFS Storage”.
  - The only Machine Learning library built on a MapReduce framework. Other MapReduce frameworks such as Disco, Skynet, FileMap, Phoenix, and AEMR either don’t scale or don’t have any ML library.
  - The only scalable Machine Learning framework with MapReduce and Hadoop support (www.mloss.org: Machine Learning Open Source Software).
- { “Mahout” : “Internals” }
- { “Internals” : “Architecture” } [Architecture diagram. Components shown: Applications; Examples; Recommenders; Clustering; Classification; Freq. Pattern Mining; Evolutionary Algorithms; Regression; Dimension Reduction; Utilities (Lucene/Vectorizer); Math (Vectors/Matrices/SVD); Collections (primitives); Apache Hadoop.]
- { “Internals” : “Features” }
  - Scalable.
  - Dual-mode (sequential and MapReduce enabled).
  - Support for easy extension.
  - Supports a large number of data sources, including the newer NoSQL variants.
  - It is a Java library: a framework of tools intended to be used and adapted by developers.
  - Advanced implementations of Java’s Collections Framework for better performance.
- { “Mahout” : “Algorithms” }
- { “Algorithms” : “Recommender Systems”, “id” : “Introduction”}
  - Help users find items they might like based on historical behavior and preferences.
  - Top-level packages define the Mahout interfaces to these key abstractions:
    - DataModel: FileDataModel, MySQLJDBCDataModel, PostgreSQLJDBCDataModel, MongoDBDataModel, CassandraDataModel
    - UserSimilarity: Pearson Correlation, Tanimoto, Log-Likelihood, Uncentered Cosine Similarity, Euclidean Distance Similarity
    - ItemSimilarity: Pearson Correlation, Tanimoto, Log-Likelihood, Uncentered Cosine Similarity, Euclidean Distance Similarity
    - UserNeighborhood: Nearest N-User Neighborhood, Threshold User Neighborhood
    - Recommender: KNN Item-Based Recommender, Slope One Recommender, Tree Clustering Recommender
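The abstractions on this slide chain together as data model, similarity, neighborhood, recommender. Below is a minimal pure-Python sketch of that flow, not Mahout's Java API. The function names, the toy data, and the choice of Tanimoto similarity with a nearest-N neighborhood are all assumptions made for illustration.

```python
# Toy "data model": user -> set of items the user has shown a preference for.
# All names and data here are hypothetical; this is not Mahout code.
data_model = {
    "u1": {"a", "b", "c"},
    "u2": {"a", "b", "d"},
    "u3": {"a", "c", "e"},
    "u4": {"f"},
}


def user_similarity(model, u, v):
    """Tanimoto similarity between the item sets of two users."""
    union = model[u] | model[v]
    return len(model[u] & model[v]) / len(union) if union else 0.0


def nearest_n_neighborhood(model, user, n):
    """The n users most similar to `user` (ties broken by user id)."""
    others = [v for v in model if v != user]
    others.sort(key=lambda v: (-user_similarity(model, user, v), v))
    return others[:n]


def recommend(model, user, n_neighbors=2, how_many=3):
    """Score unseen items by the summed similarity of neighbors who have them."""
    scores = {}
    for v in nearest_n_neighborhood(model, user, n_neighbors):
        sim = user_similarity(model, user, v)
        for item in model[v] - model[user]:
            scores[item] = scores.get(item, 0.0) + sim
    ranked = sorted(scores, key=lambda item: (-scores[item], item))
    return ranked[:how_many]


print(nearest_n_neighborhood(data_model, "u1", 2))
print(recommend(data_model, "u1"))
```

Keeping the similarity and neighborhood steps as separate functions mirrors the slide's point that each abstraction can be swapped independently, for example a threshold neighborhood in place of nearest-N.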
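- { “Algorithms” : “Recommender Systems”, “id” : “Example”}
  - Binary Values Recommendation (1 = bought, 0 = not bought; P1 to P4 are the four products, in column order)

  | Customer | P1 | P2 | P3 | P4 |
  |---|---|---|---|---|
  | Alice | 0 | 1 | 1 | 1 |
  | Bob | 1 | 0 | 1 | 1 |
  | John | 0 | 1 | 0 | 0 |
  | Jane | 1 | 0 | 1 | 1 |
  | Bill | 1 | 1 | 1 | 1 |
  | Steve | 1 | 0 | 1 | 1 |
  | Larry | 1 | 0 | 0 | 0 |
  | Don | 1 | 1 | 1 | 0 |
  | Jack | 1 | 1 | 0 | 1 |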
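- { “Algorithms” : “Recommender Systems” , “Similarity” : “Tanimoto”}
  - Tanimoto Coefficient: T(A, B) = Nc / (NA + NB − Nc)
    - NA: number of customers who bought Product A
    - NB: number of customers who bought Product B
    - Nc: number of customers who bought both Product A and Product B
  - Item-item similarity for the four products of the example (P1 to P4, in column order):

  | | P1 | P2 | P3 | P4 |
  |---|---|---|---|---|
  | P1 | 1 | 1/3 ≈ 0.33 | 5/8 = 0.625 | 5/8 = 0.625 |
  | P2 | 1/3 ≈ 0.33 | 1 | 3/8 = 0.375 | 3/8 = 0.375 |
  | P3 | 5/8 = 0.625 | 3/8 = 0.375 | 1 | 5/7 ≈ 0.714 |
  | P4 | 5/8 = 0.625 | 3/8 = 0.375 | 5/7 ≈ 0.714 | 1 |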
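The Tanimoto values on this slide can be reproduced from the binary purchase data of the example slide. The sketch below assumes the flattened values are read four per row, one row per customer and one column per product. The helper names are made up for illustration.

```python
# Purchase rows in slide order, read four values per row (assumed layout):
# one row per customer, one column per product.
rows = [
    (0, 1, 1, 1),
    (1, 0, 1, 1),
    (0, 1, 0, 0),
    (1, 0, 1, 1),
    (1, 1, 1, 1),
    (1, 0, 1, 1),
    (1, 0, 0, 0),
    (1, 1, 1, 0),
    (1, 1, 0, 1),
]


def buyers(product):
    """Set of customer indices who bought the given product column."""
    return {i for i, row in enumerate(rows) if row[product]}


def tanimoto(a, b):
    """N_c / (N_a + N_b - N_c) for two product columns."""
    set_a, set_b = buyers(a), buyers(b)
    n_c = len(set_a & set_b)
    return n_c / (len(set_a) + len(set_b) - n_c)


# Print the 4 x 4 item-item similarity matrix, one row per product.
for a in range(4):
    print([round(tanimoto(a, b), 3) for b in range(4)])
```

Under that reading, the first and second product columns share 3 buyers out of 7 + 5 − 3 = 9 distinct buyers, which gives the 1/3 shown on the slide.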
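- { “Algorithms” : “Recommender Systems” , “Similarity” : “Cosine”}
  - Cosine Coefficient (binary data): cos(A, B) = Nc / sqrt(NA × NB)
    - NA: number of customers who bought Product A
    - NB: number of customers who bought Product B
    - Nc: number of customers who bought both Product A and Product B
  - Item-item similarity for the four products of the example (P1 to P4, in column order):

  | | P1 | P2 | P3 | P4 |
  |---|---|---|---|---|
  | P1 | 1 | 0.507 | 0.772 | 0.772 |
  | P2 | 0.507 | 1 | 0.548 | 0.548 |
  | P3 | 0.772 | 0.548 | 1 | 0.833 |
  | P4 | 0.772 | 0.548 | 0.833 | 1 |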
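For 0/1 purchase vectors the cosine coefficient reduces to a ratio of the three counts in the legend. The sketch below checks that shortcut against the general vector definition. It uses the first two product columns of the example slide, assuming the flattened values are read four per row; the function names are illustrative.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


def cosine_from_counts(n_a, n_b, n_c):
    """Binary special case: N_c / sqrt(N_a * N_b)."""
    return n_c / math.sqrt(n_a * n_b)


# First two product columns over the nine customers (1 = bought).
product_a = [0, 1, 0, 1, 1, 1, 1, 1, 1]  # N_a = 7
product_b = [1, 0, 1, 0, 1, 0, 0, 1, 1]  # N_b = 5, and N_c = 3 shared buyers

print(round(cosine(product_a, product_b), 3))
print(round(cosine_from_counts(7, 5, 3), 3))
```

Both calls give about 0.507, the value shown for this pair of products on the slide.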
- { “Algorithms” : “Classification” , “id” : “Introduction”}
  - Assigning data to discrete categories.
    - Train a model on labeled data.
    - Run the model on new, unlabeled data.
  - Classifier: an algorithm that implements classification, especially in a concrete implementation.
  - Classification algorithms
    - Maximum entropy classifier
    - Naïve Bayes classifier
    - Decision trees, decision lists
    - Support vector machines
    - Kernel estimation and K-nearest-neighbor algorithms
    - Perceptrons
    - Neural networks (multi-layer perceptrons)
  - [Figure: items labeled “Spam” and “Not spam”, and an unlabeled item marked “?”.]
- { “Algorithms” : “Classification” , “id” : “Naïve Bayes Example”} Train, “Not Spam”: President Obama’s Nobel Prize Speech
- { “Algorithms” : “Classification” , “id” : “Naïve Bayes Example”} Train, “Spam”: Spam Email Content
- { “Algorithms” : “Classification” , “id” : “Naïve Bayes Example”} Run: “Order a trial Adobe chicken daily EAB-List new summer savings, welcome!”
- { “Algorithms” : “Classification” , “id” : “Naïve Bayes in Mahout”}
  - Naïve Bayes is a fairly involved process in Mahout: training the classifier requires four separate Hadoop jobs.
  - Training:
    - Read the features
    - Calculate per-document statistics
    - Normalize across categories
    - Calculate the normalizing factor of each label
  - Testing:
    - Classification (fifth job, explicitly invoked)
  - Figure 13.2 (from “Mahout in Action”): How a classification system works. Inside the dotted lasso is the heart of the classification system, a training algorithm that learns a model to emulate human decisions. A copy of the model is then used in evaluation or in production with new input examples to estimate the target variable. The figure shows two phases of the classification process, with the upper path representing training of the classification model and the lower path providing new examples for which the model will assign categories (the target variables) as a way to emulate decisions.
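The train-then-run flow of the spam example can be sketched as a single-process multinomial Naïve Bayes classifier with Laplace smoothing. This is an illustration of the technique only: it does not reproduce Mahout's Hadoop jobs or its feature weighting. The training snippets are hypothetical stand-ins for the slide images.

```python
import math
from collections import Counter, defaultdict


def train(labeled_docs):
    """Count words per label and documents per label."""
    word_counts = defaultdict(Counter)
    doc_counts = Counter()
    for label, text in labeled_docs:
        doc_counts[label] += 1
        word_counts[label].update(text.lower().split())
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, doc_counts, vocab


def classify(model, text):
    """Return the label with the highest log-posterior (Laplace smoothing)."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(doc_counts):
        # Log prior for the label, then add one log-likelihood per known word.
        score = math.log(doc_counts[label] / total_docs)
        denom = sum(word_counts[label].values()) + len(vocab)
        for word in text.lower().split():
            if word in vocab:  # ignore words never seen in training
                score += math.log((word_counts[label][word] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical training snippets standing in for the slide's examples.
training = [
    ("spam", "order a trial now summer savings welcome"),
    ("spam", "new savings order daily trial"),
    ("not spam", "the prize speech on peace and war"),
    ("not spam", "the president gave a speech on peace"),
]
model = train(training)
print(classify(model, "Order a trial new summer savings, welcome!"))
print(classify(model, "a speech on peace"))
```

Tokenization here is a plain whitespace split, so tokens with attached punctuation such as "savings," are treated as unseen words and skipped. A real pipeline would normalize text before counting.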
- { “Algorithms” : “Clustering” , “id” : “Introduction”}
  - Grouping unstructured data without any training data.
  - Self-learning from experience.
  - Small intra-cluster distance (aiming for local and global minima).
  - Large inter-cluster distance.
  - Mahout’s Canopy Clustering MapReduce algorithm is often used to compute initial cluster centroids.
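Canopy clustering, mentioned above as a way to seed cluster centroids, can be sketched in a few lines. The version below is the textbook single-machine form with a loose radius `t1` and a tight radius `t2`. It is not Mahout's MapReduce implementation, and the thresholds and sample points are assumptions chosen for illustration.

```python
import math


def canopies(points, t1, t2):
    """Return (center, members) pairs using loose (t1) and tight (t2) radii."""
    if not t1 > t2 > 0:
        raise ValueError("expected t1 > t2 > 0")
    remaining = list(points)
    result = []
    while remaining:
        center = remaining.pop(0)
        members = [center]
        still_remaining = []
        for p in remaining:
            d = math.dist(center, p)
            if d < t1:
                members.append(p)  # loosely belongs to this canopy
            if d >= t2:
                still_remaining.append(p)  # may still seed or join another canopy
        remaining = still_remaining
        result.append((center, members))
    return result


def centroid(members):
    """Component-wise mean of a non-empty list of points."""
    dims = len(members[0])
    return tuple(sum(p[i] for p in members) / len(members) for i in range(dims))


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
found = canopies(points, t1=3.0, t2=1.5)
initial_centroids = [centroid(members) for _, members in found]
print(len(found), initial_centroids)
```

The canopy centroids can then serve as starting centroids for k-means, which also fixes the number of clusters without choosing k by hand.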
- { “Algorithms” : “Clustering” , “id” : “K-Means Clustering Example”} [Figure sequence over ten slides: a step-by-step k-means run. The final slide labels the resulting clusters “Cats” and “Dogs”.]
- { “Algorithms” : “Clustering” , “id” : “K-Means in Mahout”} [Diagram: the data is split into chunks C0 to C3, one per mapper M0 to M3 (map phase). The mapper outputs IO0 to IO3 are shuffled to reducers R0 and R1 (reduce phase), which produce FO0 and FO1.]
- { “Algorithms” : “Clustering” , “id” : “K-Means in Mahout”}
  - Assumption: the number of clusters is far smaller than the number of points, i.e. |Clusters| << |Points|.
  - Hadoop’s DistributedCache is used to give each mapper access to all the current cluster centroids.
  - Mappers M0 to M3 emit <clusterID, observation> pairs to reducers R0 and R1.
  - Important arguments: --maxIter, --convergenceDelta, --method
- { “Algorithms” : “Clustering” , “id” : “MapReduce KMeans Clustering”}
  - Map phase: assign cluster IDs
  - Reduce phase: reset centroids
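The two phases can be sketched as plain Python functions that run in one process. The map step emits <clusterID, observation> pairs and the reduce step recomputes each centroid. The parameters `max_iter` and `convergence_delta` mirror the --maxIter and --convergenceDelta arguments named on the previous slide, but their default values, the function names, and the sample data are assumptions. No Hadoop is involved.

```python
import math
from collections import defaultdict


def map_phase(points, centroids):
    """Emit (cluster_id, point) pairs: each point goes to its nearest centroid."""
    for p in points:
        cluster_id = min(range(len(centroids)),
                         key=lambda i: math.dist(p, centroids[i]))
        yield cluster_id, p


def reduce_phase(pairs, centroids):
    """Reset each centroid to the mean of its assigned points."""
    groups = defaultdict(list)
    for cluster_id, p in pairs:
        groups[cluster_id].append(p)
    new_centroids = list(centroids)  # an empty cluster keeps its old centroid
    for cluster_id, members in groups.items():
        dims = len(members[0])
        new_centroids[cluster_id] = tuple(
            sum(p[d] for p in members) / len(members) for d in range(dims))
    return new_centroids


def kmeans(points, centroids, max_iter=10, convergence_delta=0.001):
    """Iterate map and reduce until no centroid moves convergence_delta or more."""
    for _ in range(max_iter):
        updated = reduce_phase(map_phase(points, centroids), centroids)
        shift = max(math.dist(a, b) for a, b in zip(centroids, updated))
        centroids = updated
        if shift < convergence_delta:
            break
    return centroids


points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
final_centroids = kmeans(points, [(0.0, 0.0), (10.0, 10.0)])
print(final_centroids)
```

Every mapper needs the full list of current centroids, which is why the small-number-of-clusters assumption matters: the list must be cheap to ship to each mapper on every iteration.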
- { “Algorithms” : “Other Algorithms” }
  - Classification
    - Stochastic Gradient Descent
    - Support Vector Machines
    - Random Forests
  - Clustering
    - Latent Dirichlet Allocation: topic models
    - Fuzzy K-Means: points are assigned to multiple clusters
    - Canopy clustering: fast approximations of clusters
    - Spectral clustering: treat points as a graph
  - Evolutionary Algorithms: integration with Watchmaker for Genetic Programming fitness functions
  - Dimensionality Reduction
  - Regression
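To illustrate how a point can belong to several clusters at once in fuzzy k-means, the sketch below computes membership degrees with the standard fuzzy c-means formula. Whether Mahout uses exactly this form is not stated in the slides, so treat it as an assumption. The fuzziness parameter `m` and the sample values are also illustrative.

```python
import math


def memberships(point, centroids, m=2.0):
    """Degree of membership of `point` in each cluster; the degrees sum to 1."""
    dists = [math.dist(point, c) for c in centroids]
    if any(d == 0 for d in dists):
        # The point sits on a centroid: full membership there
        # (shared equally if several centroids coincide with it).
        hits = [1.0 if d == 0 else 0.0 for d in dists]
        return [h / sum(hits) for h in hits]
    exponent = 2.0 / (m - 1.0)
    return [1.0 / sum((d_i / d_k) ** exponent for d_k in dists) for d_i in dists]


u = memberships((1.0, 0.0), [(0.0, 0.0), (3.0, 0.0)])
print(u)
```

With distances 1 and 2 to the two centroids and m = 2, the point belongs to the nearer cluster with degree of about 0.8 and to the farther one with about 0.2.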
- { “Algorithms” : “Future” }
  - Classification
    - Decision trees such as J48 and ID3
  - Clustering
    - DBSCAN and COBWEB clustering techniques
  - Evolutionary Algorithms
    - Classical genetic algorithms
  - Association Rules
    - Apriori (Mahout already has an alternative frequent itemset algorithm implementation)
- { “Mahout” : “Summary” }
- { “Summary”: “Apache Mahout” }
  - Scalable Library
  - Three Primary Areas of Focus
  - Other Algorithms
  - All in your friendly neighborhood MapReduce
- { “Mahout” : “Demo” }
- { “Mahout” : “Questions” }
- { “Mahout” : “References” }
- { “References” : “Mahout Books, Tutorials, Links”, “id” : 1}
  - Books
    - “Mahout in Action”, Owen et al., Manning Pub.
    - “Pattern Recognition and Machine Learning”, Christopher Bishop, Springer Pub.
    - “The Elements of Statistical Learning: Data Mining, Inference, and Prediction”, Hastie et al., Springer Pub.
  - Videos
    - CS-229, Machine Learning at Stanford University - Prof. Andrew Ng
    - Collaborative filtering at scale - Sean Owen
    - Distributed Item-based Collaborative Filtering - Sebastian Schelter
    - EMail Classification with Mahout - Grant Ingersoll @ Lucid Imagination
- { “References” : “Mahout Books, Tutorials, Links”, “id” : 2}
  - WWW
    - http://mahout.apache.org - Mahout@Apache
    - http://hadoop.apache.org - Hadoop@Apache
    - dev@mahout.apache.org - Developer mailing list
    - user@mahout.apache.org - User mailing list
    - http://www.ibm.com/developerworks/java/library/j-mahout/ - Introducing Apache Mahout
- { “Mahout” : “The End” } { “Thank You” : “Have a Nice and Green Day” }