Machine Learning and Apache Mahout : An Introduction

An introductory presentation on Machine Learning and Apache Mahout, presented at the first meetup of the BigData Meetup - Pune Chapter (http://www.meetup.com/Big-Data-Meetup-Pune-Chapter/).

Transcript

  • 1. Machine Learning and Apache Mahout. Varad Meru, Software Development Engineer, Orzota, Inc. about.me/vrdmr © Varad Meru, 2013
  • 2. Who Am I
      • Orzota, Inc.
          • Making BigData Easy
          • Designing a cloud-based platform for ETL and analytics
      • Past Work Experience
          • Persistent Systems Ltd.
          • Recommendation engines and user behavior analytics
      • Areas of Interest
          • Machine Learning
          • Distributed Systems
          • Recommendation Engines
  • 3. Outline
      • Introduction
      • Machine Learning
          • Introduction and History
          • Types of Learning
          • Algorithms
          • Applications
          • What's New
      • Apache Mahout
          • History
          • Architecture
          • Applications and Examples
      • Conclusion
  • 4. Machine Learning: Rise of the Machine-Era
  • 5. Introduction
      • "Machine Learning is programming computers to optimize a performance criterion using example data or past experience."
      • Term coined by Arthur Samuel: "Field of study that gives computers the ability to learn without being explicitly programmed."
      • Branch of Artificial Intelligence and Statistics
      • Focuses on prediction based on known properties
      • Used as a sub-process in Data Mining
          • Data Mining focuses on discovering new, unknown properties.
  • 6. Learning Algorithms
      • Supervised Learning
          • Labelled input data.
          • Creating classifiers to predict unseen inputs.
      • Unsupervised Learning
          • Unlabelled input data.
          • Creating a function to predict the relation and output.
      • Semi-Supervised Learning
          • Combines Supervised and Unsupervised Learning methodology.
      • Reinforcement Learning
          • Reward-punishment based agent.
  • 7. Supervised Learning: Introduction
      • Learn from the data
      • Data is already labelled
          • Expert, crowd-sourced or case-based labelling of data
      • Applications
          • Handwriting Recognition
          • Spam Detection
          • Information Retrieval
              • Personalisation based on ranks
          • Speech Recognition
  • 8. Supervised Learning: Algorithms
      • Decision Trees
      • k-Nearest Neighbours
      • Naive Bayes
      • Logistic Regression
      • Perceptron and Multi-level Perceptrons
      • Neural Networks
      • SVM and Kernel estimation
  • 9. Supervised Learning Example: Naive Bayes Classifier
      • Word map of President Obama's speech
  • 10. Supervised Learning Example: Naive Bayes Classifier
      • Word map of a spam document
  • 11. Supervised Learning Example: Naive Bayes Classifier
      • Running a test on the classifier
      • Input: "Order a trial Adobe chicken daily EABList new summer savings, welcome!"
      • The classifier places the input in the Spam bin.
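To make the spam example in the slides above concrete, here is a minimal sketch of a multinomial Naive Bayes text classifier in plain Python with add-one (Laplace) smoothing. The training messages, the labels "spam" and "ham", and the function names `train` and `classify` are all invented for illustration; this is not Mahout's implementation, and the slides do not say which variant or smoothing was used.

```python
import math
from collections import Counter, defaultdict


def train(labelled_docs):
    """Count words per label and documents per label."""
    word_counts = defaultdict(Counter)
    doc_counts = Counter()
    for text, label in labelled_docs:
        doc_counts[label] += 1
        word_counts[label].update(text.lower().split())
    vocabulary = {w for counts in word_counts.values() for w in counts}
    return word_counts, doc_counts, vocabulary


def classify(text, model):
    """Return the label with the highest log-posterior (add-one smoothing)."""
    word_counts, doc_counts, vocabulary = model
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, float("-inf")
    for label in doc_counts:
        # Log prior of the label.
        score = math.log(doc_counts[label] / total_docs)
        label_total = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocabulary:
                continue  # ignore words never seen in training
            count = word_counts[label][word]
            score += math.log((count + 1) / (label_total + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy training set, invented for this sketch.
TRAINING = [
    ("order a trial today new summer savings", "spam"),
    ("welcome savings order now daily deals", "spam"),
    ("meeting agenda for the project review", "ham"),
    ("please review the attached project report", "ham"),
]

model = train(TRAINING)
label = classify("order a trial new summer savings welcome", model)
print(label)
```

Working in log space avoids numeric underflow when many small word probabilities are multiplied together.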
  • 12. Unsupervised Learning: Introduction
      • Finding hidden structure in data
      • Unlabelled data
      • SMEs (subject-matter experts) are needed after processing to verify, validate and use the output
      • Used in exploratory analysis rather than predictive analytics
      • Applications
          • Pattern Recognition
          • Groupings based on a distance measure
              • Groups of people, objects, ...
  • 13. Unsupervised Learning: Algorithms
      • Clustering
          • k-Means, MinHash, Hierarchical Clustering
      • Hidden Markov Models
      • Feature Extraction methods
      • Self-organizing Maps (Neural Nets)
  • 14. Unsupervised Learning Example: K-Means. Source: http://apandre.wordpress.com/visible-data/cluster-analysis/
  • 15. Learning Problem: Cat and Dog Problem
      • Humans can easily tell which is a cat and which is a dog.
      • But how can a computer do that?
      • Some attempts used clustering mechanisms to solve it: Co-occurrence Clustering, Deep Learning
  • 16. Apache Mahout: Scalable Machine Learning Library
  • 17. History and Etymology
      • Inspired by "Map-Reduce for Machine Learning on Multicore", Ng et al.
      • Written in Java. Apache License.
      • Founders
          • Mahout: Isabel Drost, Grant Ingersoll, Karl Wettin
          • Taste: Sean Owen
      • Mahout: keeper/driver of elephants
      • Current release: 0.8 (stable)
  • 18. Need
      • BigData
          • Ever-growing data
          • Yesterday's methods to process tomorrow's data
          • Cheap storage
      • Scalable from the ground up
          • Should be built on top of any existing distributed systems framework
          • Should contain distributed versions of ML algorithms
      • Tools by data size (Size / Classification / Tools)
          • Lines (sample data)
              • Analysis and Visualisation: Whiteboard, Bash, ...
          • KBs – low MBs (prototype data)
              • Analysis and Visualisation: Matlab, Octave, R, Processing, Bash, ...
          • MBs – low GBs (online data)
              • Storage: MySQL (DBs), ...
              • Analysis: NumPy, SciPy, Pandas, Weka, ...
              • Visualisation: Flare, AmCharts, Raphael
          • GBs – TBs – PBs
              • Storage: HDFS, HBase, Cassandra, ...
              • Analysis: Hive, Giraph, Hama, Mahout
  • 19. Mahout Modules
      • Applications
      • Algorithms: Evolutionary Algorithms, Classification, Clustering, Recommenders, Regression, FPM, Dimension Reduction
      • Utilities: Lucene/Vectorizer
      • Math: Vectors/Matrices/SVD
      • Collections (Primitives)
      • Hadoop
  • 20. Recommender Systems
  • 21. Recommender Systems: Introduction
      • Types of Recommender Systems
          • Content Based Recommendations
          • Collaborative Filtering Recommendations
              • User-User Recommendations
              • Item-Item Recommendations
          • Dimensionality Reduction (SVD) Recommendations
      • Applications
          • Products you would like to buy
          • People you might want to connect with
          • Potential life-partners
          • Recommending songs you might like
          • ...
  • 22. Recommender Systems: Collaborative Filtering in Action
      • Assuming people have seen at least one movie.
      • Cold Start?
      • Matrix legend: 1 = seen, 0 = not seen
  • 23. Collaborative Filtering in Action: Tanimoto Coefficient
      • $T(a, b) = \frac{N_C}{N_A + N_B - N_C}$
      • $N_A$: number of customers who bought A
      • $N_B$: number of customers who bought B
      • $N_C$: number of customers who bought both A and B
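As a small worked example of the Tanimoto coefficient, the sketch below counts N_A, N_B and N_C from a seen/not-seen matrix and computes N_C / (N_A + N_B - N_C). The customer IDs, the matrix values and the helper names `pair_counts` and `tanimoto` are made up for illustration.

```python
# One row per customer; 1 = bought, 0 = not bought. Invented data.
PURCHASES = {
    "u1": {"A": 1, "B": 1},
    "u2": {"A": 1, "B": 0},
    "u3": {"A": 1, "B": 1},
    "u4": {"A": 0, "B": 1},
    "u5": {"A": 1, "B": 0},
}


def pair_counts(purchases, a, b):
    """Return (N_A, N_B, N_C) for items a and b."""
    n_a = sum(row[a] for row in purchases.values())
    n_b = sum(row[b] for row in purchases.values())
    n_c = sum(1 for row in purchases.values() if row[a] and row[b])
    return n_a, n_b, n_c


def tanimoto(n_a, n_b, n_c):
    """Customers who bought both, divided by customers who bought either."""
    either = n_a + n_b - n_c
    return n_c / either if either else 0.0


n_a, n_b, n_c = pair_counts(PURCHASES, "A", "B")
similarity = tanimoto(n_a, n_b, n_c)
print(n_a, n_b, n_c, similarity)
```

With this data, four customers bought A, three bought B and two bought both, so the coefficient is 2 / (4 + 3 - 2) = 0.4.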
  • 24. Collaborative Filtering in Action: Cosine Coefficient
      • $C(a, b) = \frac{N_C}{\sqrt{N_A \, N_B}}$
      • $N_A$: number of customers who bought A
      • $N_B$: number of customers who bought B
      • $N_C$: number of customers who bought both A and B
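The count-based cosine coefficient, N_C / sqrt(N_A * N_B), is the ordinary cosine of the angle between two binary purchase vectors. The sketch below computes it both ways on invented data to show that they agree; the vector values and function names are assumptions made for this example.

```python
import math

# Binary purchase vectors over the same five customers (invented data):
# entry i is 1 if customer i bought the item, else 0.
ITEM_A = [1, 1, 1, 0, 1]
ITEM_B = [1, 0, 1, 1, 0]


def cosine_from_counts(a, b):
    """Cosine coefficient from the counts N_A, N_B and N_C."""
    n_a = sum(a)
    n_b = sum(b)
    n_c = sum(x * y for x, y in zip(a, b))
    return n_c / math.sqrt(n_a * n_b) if n_a and n_b else 0.0


def cosine_from_vectors(a, b):
    """Generic cosine similarity: dot product over the product of norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


by_counts = cosine_from_counts(ITEM_A, ITEM_B)
by_vectors = cosine_from_vectors(ITEM_A, ITEM_B)
print(by_counts, by_vectors)
```

The two agree because, for 0/1 vectors, the dot product equals N_C and each squared norm equals the number of buyers of that item.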
  • 25. Apache Mahout Recommender System: Architecture
      • Two modes
          • Stand-alone, non-distributed ("Taste")
          • Scalable, distributed algorithmic version for Collaborative Filtering
      • Top-level packages
          • Data Model
          • User Similarity
          • Item Similarity
          • User Neighbourhood
          • Recommender
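To show how the pieces named on this slide fit together, here is a toy user-based recommender in plain Python, organised as data model, user similarity, user neighbourhood and recommender. It only mirrors the roles of those packages; none of the names below are Mahout classes or Taste API calls, and the data, the choice of Tanimoto similarity and the scoring rule are assumptions made for this sketch.

```python
from collections import defaultdict

# "Data model": user -> set of items seen. Invented data.
DATA_MODEL = {
    "alice": {"m1", "m2", "m3"},
    "bob": {"m1", "m2", "m4"},
    "carol": {"m2", "m3", "m5"},
    "dave": {"m6"},
}


def user_similarity(model, u, v):
    """Tanimoto similarity between two users' item sets."""
    union = model[u] | model[v]
    return len(model[u] & model[v]) / len(union) if union else 0.0


def user_neighbourhood(model, user, n):
    """Up to n most similar users, skipping users with zero similarity."""
    scored = [(user_similarity(model, user, other), other)
              for other in model if other != user]
    scored = [(sim, other) for sim, other in scored if sim > 0]
    return sorted(scored, reverse=True)[:n]


def recommend(model, user, n_neighbours=2, how_many=3):
    """Score unseen items by the summed similarity of neighbours who saw them."""
    scores = defaultdict(float)
    for sim, neighbour in user_neighbourhood(model, user, n_neighbours):
        for item in model[neighbour] - model[user]:
            scores[item] += sim
    # Highest score first; ties broken alphabetically for a stable order.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, _ in ranked[:how_many]]


recommendations = recommend(DATA_MODEL, "alice")
print(recommendations)
```

A user with no overlapping neighbours, such as "dave" here, gets an empty list, which is the cold-start situation raised on slide 22.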
  • 26. Naive Bayes Classifier
      • Input to the classifier: "Order a trial Adobe chicken daily EABList new summer savings, welcome!"
  • 27. Naive Bayes Classifier
      • Naive Bayes is a fairly complex process in Mahout: training the classifier requires four separate Hadoop jobs.
      • Training
          • Read the features
          • Calculate per-document statistics
          • Normalize across categories
          • Calculate the normalizing factor of each label
      • Testing
          • Classification (fifth job, explicitly invoked)
  • 28. K-Means Clustering: Iterations
  • 29. K-Means Clustering: MapReduce Version
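The slide shows the MapReduce version of k-means only as a figure, so the following is a sketch of one common way to phrase a k-means iteration as a map step and a reduce step, run in a single Python process. The map step assigns each point to its nearest centroid and the reduce step averages the points per centroid. The function names, the sample points and the starting centroids are invented, and this is not a description of how Mahout's own job is structured.

```python
from collections import defaultdict


def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def map_phase(points, centroids):
    """Emit (index of nearest centroid, point) pairs."""
    for point in points:
        nearest = min(range(len(centroids)),
                      key=lambda i: squared_distance(point, centroids[i]))
        yield nearest, point


def reduce_phase(pairs, centroids):
    """Average the points assigned to each centroid."""
    groups = defaultdict(list)
    for key, point in pairs:
        groups[key].append(point)
    new_centroids = list(centroids)  # a centroid with no points stays put
    for key, members in groups.items():
        dims = len(members[0])
        new_centroids[key] = tuple(
            sum(m[d] for m in members) / len(members) for d in range(dims))
    return new_centroids


def kmeans(points, centroids, max_iterations=10):
    """Repeat map and reduce until the centroids stop changing."""
    for _ in range(max_iterations):
        updated = reduce_phase(map_phase(points, centroids), centroids)
        if updated == centroids:
            break
        centroids = updated
    return centroids


# Two obvious groups of 2-D points, invented for this sketch.
POINTS = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0),
          (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
final = kmeans(POINTS, [(0.0, 0.0), (10.0, 10.0)])
print(final)
```

In a distributed setting the map step can run independently on each block of points, because it only needs the current centroids, which is what makes the algorithm a natural fit for this model.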
  • 30. Summary
      • Machine Learning
          • Learning Algorithms
          • Varied Applications
      • Mahout
          • Scaling to Giga/Tera/Peta Scale
          • Free and Open Source
  • 31. More Info
      1. "Scalable Similarity-Based Neighborhood Methods with MapReduce" by Sebastian Schelter, Christoph Boden and Volker Markl. RecSys 2012.
      2. "Case Study Evaluation of Mahout as a Recommender Platform" by Carlos E. Seminario and David C. Wilson. Workshop on Recommendation Utility Evaluation: Beyond RMSE (RUE 2012).
      3. Apache Mahout Project Page: http://mahout.apache.org/
      4. Introducing Apache Mahout: http://www.ibm.com/developerworks/java/library/j-mahout/
      5. [VIDEO] "Collaborative filtering at scale" by Sean Owen.
      6. [BOOK] "Mahout in Action" by Owen et al., Manning Pub.
  • 32. Questions?
  • 33. Thank You. Go BigData!!!