
Machine Learning and Apache Mahout : An Introduction

12,995 views


An introductory presentation on Machine Learning and Apache Mahout. I presented it at the first meetup of the BigData Meetup, Pune Chapter (http://www.meetup.com/Big-Data-Meetup-Pune-Chapter/).

Published in: Technology, Education


  1. 1. Machine Learning and Apache Mahout • Varad Meru, Software Development Engineer, Orzota, Inc. • about.me/vrdmr • © Varad Meru, 2013
  2. 2. Who Am I • Orzota, Inc.: Making BigData Easy; designing a cloud-based platform for ETL and analytics • Past Work Experience: Persistent Systems Ltd. (Recommendation Engines and User Behavior Analytics) • Areas of Interest: Machine Learning, Distributed Systems, Recommendation Engines
  3. 3. Outline • Introduction • Machine Learning: Introduction and History, Types of Learning Algorithms, Applications, What’s New • Apache Mahout: History, Architecture, Applications and Examples • Conclusion
  4. 4. Machine Learning: Rise of the Machine-Era
  5. 5. Introduction • “Machine Learning is programming computers to optimize a performance criterion using example data or past experience.” • Term coined by Arthur Samuel: “Field of study that gives computers the ability to learn without being explicitly programmed.” • Branch of Artificial Intelligence and Statistics • Focuses on prediction based on known properties • Used as a sub-process in Data Mining, which focuses on discovering new, unknown properties
  6. 6. Learning Algorithms • Supervised Learning: labelled input data; creates classifiers to predict unseen inputs • Unsupervised Learning: unlabelled input data; creates a function to predict the relation and output • Semi-Supervised Learning: combines Supervised and Unsupervised Learning methodology • Reinforcement Learning: reward-punishment based agent
  7. 7. Supervised Learning • Introduction: learn from the data; the data is already labelled (expert, crowd-sourced or case-based labelling) • Applications: Handwriting Recognition, Spam Detection, Information Retrieval (personalisation based on ranks), Speech Recognition
  8. 8. Supervised Learning Algorithms • Decision Trees • k-Nearest Neighbours • Naive Bayes • Logistic Regression • Perceptron and Multi-layer Perceptrons • Neural Networks • SVM and Kernel estimation
  9. 9. Supervised Learning Example: Naive Bayes Classifier • Word map of a speech by President Obama
  10. 10. Supervised Learning Example: Naive Bayes Classifier • Word map of a spam document
  11. 11. Supervised Learning Example: Naive Bayes Classifier • Running a test on the classifier: “Order a trial Adobe chicken daily EABList new summer savings, welcome!” → Classifier → Spam Bin
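The spam test on this slide can be sketched in a few lines of plain Python. This is a minimal multinomial Naive Bayes with add-one smoothing, not Mahout's implementation; the training messages, the labels `spam` and `ham`, and the helper names `train` and `classify` are all assumptions made up for illustration.

```python
import math
from collections import Counter, defaultdict


def train(labelled_docs):
    """Count words per label; returns (word_counts, doc_counts, vocabulary)."""
    word_counts = defaultdict(Counter)
    doc_counts = Counter()
    vocabulary = set()
    for text, label in labelled_docs:
        words = text.lower().split()
        word_counts[label].update(words)
        doc_counts[label] += 1
        vocabulary.update(words)
    return word_counts, doc_counts, vocabulary


def classify(text, model):
    """Return the label with the highest log-posterior under Naive Bayes."""
    word_counts, doc_counts, vocabulary = model
    total_docs = sum(doc_counts.values())
    scores = {}
    for label in doc_counts:
        # Log prior: share of training documents carrying this label.
        score = math.log(doc_counts[label] / total_docs)
        label_total = sum(word_counts[label].values())
        for word in text.lower().split():
            # Add-one (Laplace) smoothing so unseen words do not zero the product.
            count = word_counts[label][word] + 1
            score += math.log(count / (label_total + len(vocabulary)))
        scores[label] = score
    return max(scores, key=scores.get)


# Hypothetical toy training set, invented for illustration only.
TRAINING = [
    ("order a free trial now", "spam"),
    ("new summer savings welcome", "spam"),
    ("cheap savings order today", "spam"),
    ("meeting agenda for monday", "ham"),
    ("please review the project report", "ham"),
    ("lunch on monday with the team", "ham"),
]

model = train(TRAINING)
# The slide's test message, lower-cased and with punctuation removed by hand.
prediction = classify(
    "order a trial adobe chicken daily eablist new summer savings welcome", model
)
print(prediction)
```

Working in log space avoids numeric underflow when many small word probabilities are multiplied together, which matters once messages get longer than this toy example.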
  12. 12. Unsupervised Learning • Introduction: finding hidden structure in unlabelled data; subject-matter experts (SMEs) are needed in post-processing to verify, validate and use the output; used in exploratory analysis rather than predictive analytics • Applications: Pattern Recognition; groupings based on a distance measure (groups of people, objects, ...)
  13. 13. Unsupervised Learning Algorithms • Clustering: k-Means, MinHash, Hierarchical Clustering • Hidden Markov Models • Feature Extraction methods • Self-organizing Maps (Neural Nets)
  14. 14. Unsupervised Learning Example: K-Means • Source: http://apandre.wordpress.com/visible-data/cluster-analysis/
  15. 15. Learning Problem: Cat and Dog Problem • Humans can easily classify which is a cat and which is a dog • But how can a computer do that? • Some attempts used clustering mechanisms to solve it: Co-occurrence Clustering, Deep Learning
  16. 16. Apache Mahout: Scalable Machine Learning Library
  17. 17. History and Etymology • Inspired by “Map-Reduce for Machine Learning on Multicore”, Ng et al. • Written in Java; Apache License • Founders: Mahout (Isabel Drost, Grant Ingersoll, Karl Wettin), Taste (Sean Owen) • Mahout: keeper/driver of elephants • Current Release: 0.8 (stable)
  18. 18. Need • BigData: ever-growing data; yesterday’s methods to process tomorrow’s data; cheap storage • Scalable from the ground up: should be built on top of any existing distributed systems framework; should contain distributed versions of ML algorithms
      Size, classification and tools:
      • Lines: Sample Data. Analysis and Visualisation: Whiteboard, Bash, ...
      • KBs – low MBs: Prototype Data. Analysis and Visualisation: Matlab, Octave, R, Processing, Bash, ...
      • MBs – low GBs: Online Data. Storage: MySQL (DBs), ... Analysis: NumPy, SciPy, Pandas, Weka, ... Visualisation: Flare, AmCharts, Raphael
      • GBs – TBs – PBs: Storage: HDFS, HBase, Cassandra, ... Analysis: Hive, Giraph, Hama, Mahout
  19. 19. Mahout Modules • Applications • Evolutionary Algorithms, Classification, Clustering, Recommenders, Regression, FPM, Dimension Reduction • Utilities: Lucene/Vectorizer • Math: Vectors/Matrices/SVD • Collections (Primitives) • Hadoop
  20. 20. Recommender Systems
  21. 21. Recommender Systems • Introduction • Types of Recommender Systems: Content Based Recommendations; Collaborative Filtering Recommendations (User-User Recommendations, Item-Item Recommendations); Dimensionality Reduction (SVD) Recommendations • Applications: products you would like to buy; people you might want to connect with; potential life-partners; recommending songs you might like; ...
  22. 22. Recommender Systems: Collaborative Filtering in Action • Assuming people have seen at least one movie (1: seen, 0: not seen) • Cold Start?
  23. 23. Collaborative Filtering in Action • Tanimoto Coefficient: $T(a, b) = \frac{N_C}{N_A + N_B - N_C}$ • $N_A$: number of customers who bought A • $N_B$: number of customers who bought B • $N_C$: number of customers who bought both A and B
  24. 24. Collaborative Filtering in Action • Cosine Coefficient: $C(a, b) = \frac{N_C}{\sqrt{N_A}\,\sqrt{N_B}}$ • $N_A$: number of customers who bought A • $N_B$: number of customers who bought B • $N_C$: number of customers who bought both A and B
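Both coefficients need only three counts: the buyers of A, the buyers of B, and the buyers of both. The sketch below computes them in plain Python from a small purchase table; the table `PURCHASES` and the function names are hypothetical, chosen only to make the arithmetic concrete.

```python
import math

# Hypothetical purchase data: customer -> set of items bought (invented for illustration).
PURCHASES = {
    "u1": {"A", "B"},
    "u2": {"A"},
    "u3": {"A", "B", "C"},
    "u4": {"B"},
    "u5": {"A", "B"},
}


def buyers(item, purchases):
    """Set of customers who bought the given item."""
    return {customer for customer, items in purchases.items() if item in items}


def tanimoto(a, b, purchases):
    """N_C / (N_A + N_B - N_C): shared buyers over all buyers of either item."""
    buyers_a, buyers_b = buyers(a, purchases), buyers(b, purchases)
    n_c = len(buyers_a & buyers_b)
    denominator = len(buyers_a) + len(buyers_b) - n_c
    return n_c / denominator if denominator else 0.0


def cosine(a, b, purchases):
    """N_C / (sqrt(N_A) * sqrt(N_B)) for binary bought / not-bought data."""
    buyers_a, buyers_b = buyers(a, purchases), buyers(b, purchases)
    if not buyers_a or not buyers_b:
        return 0.0
    return len(buyers_a & buyers_b) / math.sqrt(len(buyers_a) * len(buyers_b))


# Here N_A = 4, N_B = 4 and N_C = 3.
print(tanimoto("A", "B", PURCHASES))  # 3 / (4 + 4 - 3) = 0.6
print(cosine("A", "B", PURCHASES))    # 3 / sqrt(4 * 4) = 0.75
```

On the same data the cosine value is never lower than the Tanimoto value, so thresholds tuned for one coefficient should not be reused for the other.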
  25. 25. Apache Mahout Recommender System Architecture • Two Modes: stand-alone, non-distributed (“Taste”); scalable, distributed algorithmic version for Collaborative Filtering • Top-level Packages: Data Model, User Similarity, Item Similarity, User Neighbourhood, Recommender
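To show how the listed pieces fit together, here is a rough plain-Python sketch of a user-based recommender with one function per component: a data model, a user similarity, a user neighbourhood and a recommender. It does not use Mahout's Java API; the data set `SEEN`, the function names and the scoring rule (sum of neighbour similarities) are assumptions for illustration.

```python
# Data model: user -> set of movies seen (hypothetical data; binary seen / not seen).
SEEN = {
    "alice": {"m1", "m2", "m3"},
    "bob": {"m1", "m2", "m4"},
    "carol": {"m2", "m3", "m5"},
    "dave": {"m6"},
}


def user_similarity(u, v, data):
    """User similarity: Tanimoto coefficient between two users' item sets."""
    union = len(data[u] | data[v])
    return len(data[u] & data[v]) / union if union else 0.0


def neighbourhood(user, data, n):
    """User neighbourhood: the n users most similar to `user`."""
    others = [v for v in data if v != user]
    others.sort(key=lambda v: user_similarity(user, v, data), reverse=True)
    return others[:n]


def recommend(user, data, n_neighbours=2, how_many=3):
    """Recommender: rank unseen items by the summed similarity of neighbours who saw them."""
    scores = {}
    for v in neighbourhood(user, data, n_neighbours):
        sim = user_similarity(user, v, data)
        if sim <= 0.0:
            continue  # a neighbour with nothing in common carries no signal
        for item in data[v] - data[user]:
            scores[item] = scores.get(item, 0.0) + sim
    # Highest score first; ties broken by item name for a stable order.
    ranked = sorted(scores, key=lambda item: (-scores[item], item))
    return ranked[:how_many]


print(recommend("alice", SEEN))
print(recommend("dave", SEEN))
```

In this toy data `dave` shares no movie with anyone, so the sketch returns an empty list for him rather than guessing. Swapping `user_similarity` for another coefficient changes the ranking without touching the other functions, which is the point of keeping the components separate.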
  26. 26. Naive Bayes Classifier • “Order a trial Adobe chicken daily EABList new summer savings, welcome!” → Classifier
  27. 27. Naive Bayes Classifier • Naive Bayes is a fairly complex process in Mahout: training the classifier requires four separate Hadoop jobs • Training: read the features; calculate per-document statistics; normalize across categories; calculate the normalizing factor of each label • Testing: classification (a fifth job, explicitly invoked)
  28. 28. K-Means Clustering: Iterations
  29. 29. K-Means Clustering: MapReduce Version
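The diagram for the MapReduce version is not part of this transcript. One common way to split a k-means iteration, assumed here, is a map step that assigns each point to its nearest centroid and a reduce step that averages the points per centroid. The sketch below imitates that split in plain Python on a single machine; it is not Mahout's Hadoop job, and the points, initial centroids and function names are hypothetical.

```python
from collections import defaultdict


def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def map_phase(points, centroids):
    """Map: emit (index of nearest centroid, point) for every point."""
    for point in points:
        nearest = min(
            range(len(centroids)),
            key=lambda i: squared_distance(point, centroids[i]),
        )
        yield nearest, point


def reduce_phase(pairs, centroids):
    """Reduce: replace each centroid by the mean of the points assigned to it."""
    groups = defaultdict(list)
    for index, point in pairs:
        groups[index].append(point)
    new_centroids = list(centroids)  # a centroid with no points keeps its position
    for index, members in groups.items():
        new_centroids[index] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return new_centroids


def kmeans(points, centroids, max_iterations=10):
    """Repeat map and reduce until the centroids stop moving."""
    for _ in range(max_iterations):
        updated = reduce_phase(map_phase(points, centroids), centroids)
        if updated == centroids:
            break
        centroids = updated
    return centroids


# Hypothetical 2-D points forming two obvious groups, plus rough starting centroids.
POINTS = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
INITIAL = [(0.0, 0.0), (10.0, 10.0)]

final = kmeans(POINTS, INITIAL)
print(final)
```

Each pass over the data corresponds to one iteration; in a distributed setting every iteration would be a separate job, with the current centroids shared with all mappers.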
  30. 30. Summary • Machine Learning: Learning Algorithms, Varied Applications • Mahout: Scaling to Giga/Tera/Peta Scale; Free and Open Source
  31. 31. More Info • 1. “Scalable Similarity-Based Neighborhood Methods with MapReduce” by Sebastian Schelter, Christoph Boden and Volker Markl, RecSys 2012 • 2. “Case Study Evaluation of Mahout as a Recommender Platform” by Carlos E. Seminario and David C. Wilson, Workshop on Recommendation Utility Evaluation: Beyond RMSE (RUE 2012) • 3. http://mahout.apache.org/ (Apache Mahout Project Page) • 4. http://www.ibm.com/developerworks/java/library/j-mahout/ (Introducing Apache Mahout) • 5. [VIDEO] “Collaborative filtering at scale” by Sean Owen • 6. [BOOK] “Mahout in Action” by Owen et al., Manning Publications
  32. 32. Questions?
  33. 33. Thank You • Go BigData!!! • © Varad Meru, 2014
