Machine Learning and Apache Mahout : An Introduction


An Introductory presentation on Machine Learning and Apache Mahout. I presented it at the BigData Meetup - Pune Chapter's first meetup (http://www.meetup.com/Big-Data-Meetup-Pune-Chapter/).


Presentation Transcript

  • Machine Learning and Apache Mahout. Varad Meru, Software Development Engineer, Orzota, Inc. about.me/vrdmr © Varad Meru, 2013
  • Who Am I
      • Orzota, Inc.
          • Making BigData Easy
          • Designing a Cloud-based platform for ETL, Analytics
      • Past Work Experience
          • Persistent Systems Ltd.
          • Recommendation Engines and User Behavior Analytics
      • Area of Interest
          • Machine Learning
          • Distributed Systems
          • Recommendation Engines
  • Outline
      • Introduction
      • Machine Learning
          • Introduction and History
          • Types of Learning Algorithms
          • Applications
      • Apache Mahout
          • What’s New
          • History
          • Architecture
          • Applications and Examples
      • Conclusion
  • Machine Learning: Rise of the Machine-Era
  • Introduction
      • “Machine Learning is Programming Computers to optimize a Performance Criterion using Example Data or Past Experience”
      • Term coined by Arthur Samuel
          • “Field of study that gives computers the ability to learn without being explicitly programmed.”
      • Branch of Artificial Intelligence and Statistics
      • Focuses on prediction based on known properties
      • Used as a sub-process in Data Mining
          • Data Mining focuses on discovering new, unknown properties
  • Learning Algorithms
      • Supervised Learning
          • Labelled input data
          • Creating classifiers to predict unseen inputs
      • Unsupervised Learning
          • Unlabelled input data
          • Creating a function to predict the relation and output
      • Semi-Supervised Learning
          • Combines Supervised and Unsupervised Learning methodology
      • Reinforcement Learning
          • Reward-Punishment based agent
  • Supervised Learning: Introduction
      • Learn from the Data
      • Data is already labelled
          • Expert, Crowd-sourced or case-based labelling of data
      • Applications
          • Handwriting Recognition
          • Spam Detection
          • Information Retrieval
              • Personalisation based on ranks
          • Speech Recognition
  • Supervised Learning: Algorithms
      • Decision Trees
      • k-Nearest Neighbours
      • Naive Bayes
      • Logistic Regression
      • Perceptron and Multi-layer Perceptrons
      • Neural Networks
      • SVM and Kernel estimation
  • Supervised Learning Example: Naive Bayes Classifier
      • Word map of a speech by President Obama
  • Supervised Learning Example: Naive Bayes Classifier
      • Word map of a spam document
  • Supervised Learning Example: Naive Bayes Classifier
      • Running a test on the Classifier
      • Input: “Order a trial Adobe chicken daily EABList new summer savings, welcome!”
      • Input → Classifier → Spam Bin
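The word-map idea on the Naive Bayes slides can be sketched as a small multinomial Naive Bayes classifier. This is only an illustration: the toy training sentences, the helper names `tokenize`, `train` and `classify`, and the use of add-one (Laplace) smoothing are assumptions made for the example, not material from the talk and not Mahout's implementation.

```python
import math
from collections import Counter

def tokenize(text):
    """Lower-case the text, split on whitespace and strip basic punctuation."""
    words = (w.strip('.,!?"').lower() for w in text.split())
    return [w for w in words if w]

def train(labelled_docs):
    """Build one word-frequency map per label and count documents per label."""
    word_counts = {}          # label -> Counter of word frequencies
    doc_counts = Counter()    # label -> number of training documents
    for text, label in labelled_docs:
        doc_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(tokenize(text))
    return word_counts, doc_counts

def classify(text, word_counts, doc_counts):
    """Return the label with the highest log-posterior score."""
    vocab = set()
    for counts in word_counts.values():
        vocab.update(counts)
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, -math.inf
    for label, counts in word_counts.items():
        score = math.log(doc_counts[label] / total_docs)   # log prior
        total_words = sum(counts.values())
        for word in tokenize(text):
            # Add-one smoothing so unseen words do not zero out the product.
            score += math.log((counts[word] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical toy training set, made up for this example.
TRAINING = [
    ("order a trial now and get summer savings", "spam"),
    ("new savings daily, welcome to our trial offer", "spam"),
    ("we will create jobs and strengthen the economy", "speech"),
    ("the american people deserve a strong economy", "speech"),
]

model = train(TRAINING)
message = "Order a trial Adobe chicken daily EABList new summer savings, welcome!"
print(classify(message, *model))
```

With this toy data the slide's test message shares many words with the spam word map and almost none with the speech word map, so it lands in the spam bin.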
  • Unsupervised Learning: Introduction
      • Finding hidden structure in data
      • Unlabelled Data
      • Subject-matter experts (SMEs) are needed after processing to verify, validate and use the output
      • Used in exploratory analysis rather than predictive analytics
      • Applications
          • Pattern Recognition
          • Groupings based on a distance measure
              • Groups of People, Objects, ...
  • Unsupervised Learning: Algorithms
      • Clustering
          • k-Means, MinHash, Hierarchical Clustering
      • Hidden Markov Models
      • Feature Extraction methods
      • Self-organizing Maps (Neural Nets)
  • Unsupervised Learning Example: K-Means
      • Source: http://apandre.wordpress.com/visible-data/cluster-analysis/
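As a rough illustration of what k-Means does with unlabelled data, here is a minimal sketch of the standard assign-then-update loop (Lloyd's algorithm) on 2-D points. The sample points, the function name `kmeans`, and the choice to seed the centroids with the first k points are assumptions made for the example.

```python
import math

def kmeans(points, k, iterations=10):
    """Plain k-means on 2-D points; returns (centroids, assignments)."""
    centroids = list(points[:k])   # assumption: seed with the first k points
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = (
                    sum(x for x, _ in members) / len(members),
                    sum(y for _, y in members) / len(members),
                )
    return centroids, assignments

# Hypothetical sample: two visually separate groups of points.
POINTS = [(1.0, 1.0), (9.0, 9.0), (1.5, 2.0), (8.0, 8.5), (1.0, 0.5), (9.5, 8.0)]
centroids, assignments = kmeans(POINTS, k=2)
print(centroids)
print(assignments)
```

No labels are supplied anywhere; the grouping comes only from the distance measure, which is why the output still needs a person to interpret what each cluster means.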
  • Learning Problem: Cat and Dog Problem
      • Humans can easily tell which is a cat and which is a dog.
      • But how can a computer do that?
      • Some attempts used clustering mechanisms to solve it: Co-occurrence Clustering, Deep Learning
  • Apache Mahout: Scalable Machine Learning Library
  • History and Etymology
      • Inspired by “MapReduce for Machine Learning on Multicore”, Ng et al.
      • Written in Java. Apache License.
      • Founders
          • Mahout: Isabel Drost, Grant Ingersoll, Karl Wettin
          • Taste: Sean Owen
      • Mahout: Keeper/Driver of Elephants
      • Current Release: 0.8 (stable)
  • Need
      • BigData
          • Ever-growing data
          • Yesterday’s methods to process tomorrow’s data
          • Cheap Storage
      • Scalable from Ground Up
          • Should be built on top of any existing distributed systems framework
          • Should contain distributed versions of ML algorithms
      • Tools by data size (columns: Size, Classification, Tools)
          • Lines (Sample Data): Analysis and Visualisation with Whiteboard, Bash, ...
          • KBs – low MBs (Prototype Data): Analysis and Visualisation with Matlab, Octave, R, Processing, Bash, ...
          • MBs – low GBs (Online Data): Storage with MySQL (DBs), ...; Analysis with NumPy, SciPy, Pandas, Weka, ...
          • GBs – TBs – PBs: Visualisation with Flare, AmCharts, Raphael; Storage with HDFS, HBase, Cassandra, ...; Analysis with Hive, Giraph, Hama, Mahout
  • Mahout Modules (diagram)
      • Applications
      • Evolutionary Algorithms, Classification, Clustering, Recommenders, Regression, FPM, Dimension Reduction
      • Utilities: Lucene/Vectorizer
      • Math: Vectors/Matrices/SVD
      • Collections (Primitives)
      • Hadoop
  • Recommender Systems
  • Recommender Systems: Introduction
      • Types of Recommender Systems
          • Content Based Recommendations
          • Collaborative Filtering Recommendations
              • User-User Recommendations
              • Item-Item Recommendations
          • Dimensionality Reduction (SVD) Recommendations
      • Applications
          • Products you would like to buy
          • People you might want to connect with
          • Potential Life-Partners
          • Recommending Songs you might like
          • ...
  • Recommender Systems: Collaborative Filtering in Action
      • Assuming people have seen at least one movie
      • Cold Start?
      • Legend: 1 = seen, 0 = not seen
  • Collaborative Filtering in Action: Tanimoto Coefficient
      • $T(a, b) = \dfrac{N_C}{N_A + N_B - N_C}$
      • $N_A$: Number of Customers who bought A
      • $N_B$: Number of Customers who bought B
      • $N_C$: Number of Customers who bought A and B
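A small worked example of the Tanimoto coefficient, using the slide's definitions of N_A, N_B and N_C. The customer IDs and the function name `tanimoto` are made up for illustration.

```python
def tanimoto(buyers_a, buyers_b):
    """Tanimoto coefficient between two items, given the sets of customers who bought each."""
    n_a = len(buyers_a)                 # N_A: customers who bought A
    n_b = len(buyers_b)                 # N_B: customers who bought B
    n_c = len(buyers_a & buyers_b)      # N_C: customers who bought both
    union = n_a + n_b - n_c
    return n_c / union if union else 0.0

# Hypothetical purchase data.
buyers_a = {"u1", "u2", "u3", "u4"}
buyers_b = {"u3", "u4", "u5"}
print(tanimoto(buyers_a, buyers_b))  # 2 / (4 + 3 - 2) = 0.4
```

The denominator counts every customer who bought at least one of the two items exactly once, so the score is 1 only when both items have identical buyer sets.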
  • Collaborative Filtering in Action: Cosine Coefficient
      • $C(a, b) = \dfrac{N_C}{\sqrt{N_A}\,\sqrt{N_B}}$
      • $N_A$: Number of Customers who bought A
      • $N_B$: Number of Customers who bought B
      • $N_C$: Number of Customers who bought A and B
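A matching worked example for the cosine coefficient on binary purchase data, where the coefficient reduces to N_C divided by the square root of N_A times N_B. The customer IDs and the function name `cosine` are made up for illustration.

```python
import math

def cosine(buyers_a, buyers_b):
    """Cosine coefficient between two items, given the sets of customers who bought each."""
    n_a = len(buyers_a)                 # N_A: customers who bought A
    n_b = len(buyers_b)                 # N_B: customers who bought B
    n_c = len(buyers_a & buyers_b)      # N_C: customers who bought both
    if n_a == 0 or n_b == 0:
        return 0.0                      # no buyers, no evidence of similarity
    return n_c / math.sqrt(n_a * n_b)

# Hypothetical purchase data.
buyers_a = {"u1", "u2", "u3", "u4"}
buyers_b = {"u3", "u4", "u5", "u6"}
print(cosine(buyers_a, buyers_b))  # 2 / sqrt(4 * 4) = 0.5
```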
  • Apache Mahout: Recommender System Architecture
      • Two Modes
          • Stand-alone, non-distributed (“Taste”)
          • Scalable, distributed algorithmic version for Collaborative Filtering
      • Top-level Packages
          • Data Model
          • User Similarity
          • Item Similarity
          • User Neighbourhood
          • Recommender
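To show how the pieces named on this slide fit together (data model, user similarity, user neighbourhood, recommender), here is a minimal user-based sketch in plain Python. It is not the Mahout or Taste API: every name below (`DATA_MODEL`, `user_similarity`, `user_neighbourhood`, `recommend`) is hypothetical, the data is invented, and Tanimoto similarity over seen/not-seen data is just one possible choice.

```python
# Data model: user -> set of items the user has seen (binary preferences).
DATA_MODEL = {
    "alice": {"m1", "m2", "m3"},
    "bob":   {"m1", "m2", "m4"},
    "carol": {"m2", "m3", "m5"},
    "dave":  {"m6"},
}

def user_similarity(model, u, v):
    """Tanimoto similarity between the item sets of two users."""
    shared = len(model[u] & model[v])
    union = len(model[u] | model[v])
    return shared / union if union else 0.0

def user_neighbourhood(model, user, n):
    """The n users most similar to `user`, excluding the user."""
    others = [v for v in model if v != user]
    others.sort(key=lambda v: user_similarity(model, user, v), reverse=True)
    return others[:n]

def recommend(model, user, n_neighbours=2, how_many=3):
    """Rank unseen items by the summed similarity of the neighbours who have them."""
    scores = {}
    for v in user_neighbourhood(model, user, n_neighbours):
        sim = user_similarity(model, user, v)
        if sim <= 0.0:
            continue  # a neighbour with nothing in common carries no signal
        for item in model[v] - model[user]:
            scores[item] = scores.get(item, 0.0) + sim
    ranked = sorted(scores, key=lambda item: (-scores[item], item))
    return ranked[:how_many]

print(recommend(DATA_MODEL, "alice"))
print(recommend(DATA_MODEL, "dave"))
```

In this toy data "dave" shares no items with anyone, so nothing can be recommended to him, which is a small instance of the cold-start question raised earlier in the deck.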
  • Naive Bayes Classifier
      • Input: “Order a trial Adobe chicken daily EABList new summer savings, welcome!”
      • Input → Classifier
  • Naive Bayes Classifier
      • Naive Bayes is a pretty complex process in Mahout: training the classifier requires four separate Hadoop jobs.
      • Training
          • Read the Features
          • Calculate per-Document Statistics
          • Normalize across Categories
          • Calculate normalizing factor of each label
      • Testing
          • Classification (fifth job, explicitly invoked)
  • K-Means Clustering: Iterations
  • K-Means Clustering: MapReduce Version
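One common way to express a single k-means iteration in MapReduce terms is: the map step emits each point keyed by its nearest centroid, and the reduce step averages the points that share a key. The sketch below simulates that on one machine in plain Python. It is an assumption-laden illustration, not Mahout's actual job code: the function names and sample data are invented, and a dictionary stands in for the shuffle phase.

```python
import math
from collections import defaultdict

def map_point(point, centroids):
    """Mapper: emit (index of the nearest centroid, point)."""
    nearest = min(range(len(centroids)),
                  key=lambda c: math.dist(point, centroids[c]))
    return nearest, point

def reduce_cluster(points):
    """Reducer: average all points that share a cluster key."""
    n = len(points)
    return tuple(sum(coord) / n for coord in zip(*points))

def kmeans_iteration(points, centroids):
    """One map -> shuffle -> reduce pass; returns the updated centroids."""
    groups = defaultdict(list)            # stands in for the shuffle phase
    for point in points:
        key, value = map_point(point, centroids)
        groups[key].append(value)
    return [
        reduce_cluster(groups[c]) if groups[c] else centroids[c]
        for c in range(len(centroids))
    ]

# Hypothetical sample data and starting centroids.
POINTS = [(1.0, 1.0), (2.0, 1.0), (9.0, 9.0), (8.0, 10.0)]
centroids = [(0.0, 0.0), (10.0, 10.0)]
for _ in range(5):                        # each pass would be one job in a cluster setting
    centroids = kmeans_iteration(POINTS, centroids)
print(centroids)
```

Each mapper only needs the current centroids and its own slice of the points, which is what makes the iteration easy to distribute; the driver loop then repeats the pass until the centroids stop moving.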
  • Summary
      • Machine Learning
          • Learning Algorithms
          • Varied Applications
      • Mahout
          • Scaling to Giga/Tera/Peta Scale
          • Free and Open Source
  • More Info
      1. “Scalable Similarity-Based Neighborhood Methods with MapReduce” by Sebastian Schelter, Christoph Boden and Volker Markl. RecSys 2012.
      2. “Case Study Evaluation of Mahout as a Recommender Platform” by Carlos E. Seminario and David C. Wilson. Workshop on Recommendation Utility Evaluation: Beyond RMSE (RUE 2012).
      3. http://mahout.apache.org/ (Apache Mahout Project Page)
      4. http://www.ibm.com/developerworks/java/library/j-mahout/ (Introducing Apache Mahout)
      5. [VIDEO] “Collaborative filtering at scale” by Sean Owen
      6. [BOOK] “Mahout in Action” by Owen et al., Manning Pub.
  • Questions?
  • Thank You. Go BigData!!!