Machine Learning and Apache Mahout : An Introduction


An introductory presentation on Machine Learning and Apache Mahout, presented at the first meetup of the BigData Meetup - Pune Chapter (http://www.meetup.com/Big-Data-Meetup-Pune-Chapter/).

Published in: Technology, Education

Transcript of "Machine Learning and Apache Mahout : An Introduction"

  1. Machine Learning and Apache Mahout
     Varad Meru, Software Development Engineer, Orzota, Inc.
     about.me/vrdmr
     © Varad Meru, 2013
  2. Who Am I
     - Orzota, Inc.
       - Making BigData Easy
       - Designing a Cloud-based platform for ETL, Analytics
     - Past Work Experience
       - Persistent Systems Ltd.
       - Recommendation Engines and User Behavior Analytics
     - Area of Interest
       - Machine Learning
       - Distributed Systems
       - Recommendation Engines
  3. Outline
     - Introduction
     - Machine Learning
       - Introduction and History
       - Types of Learning Algorithms
       - Applications
       - What's New
     - Apache Mahout
       - History
       - Architecture
       - Applications and Examples
     - Conclusion
  4. Machine Learning: Rise of the Machine-Era
  5. Introduction
     “Machine Learning is Programming Computers to optimize a Performance Criterion using Example Data or Past Experience”
     - Term coined by Arthur Samuel
     - "Field of study that gives computers the ability to learn without being explicitly programmed".
     - Branch of Artificial Intelligence and Statistics
     - Focuses on prediction based on known properties
     - Used as a sub-process in Data Mining.
     - Data Mining focuses on discovering new, unknown properties.
  6. Learning Algorithms
     - Supervised Learning
       - Labelled input data.
       - Creating classifiers to predict unseen inputs.
     - Unsupervised Learning
       - Unlabelled input data.
       - Creating a function to predict the relation and output
     - Semi-Supervised Learning
       - Combines Supervised and Unsupervised Learning methodology
     - Reinforcement Learning
       - Reward-Punishment based agent.
  7. Supervised Learning: Introduction
     - Learn from the Data
     - Data is already labelled
       - Expert, crowd-sourced or case-based labelling of data.
     - Applications
       - Handwriting Recognition
       - Spam Detection
       - Information Retrieval
         - Personalisation based on ranks
       - Speech Recognition
  8. Supervised Learning: Algorithms
     - Decision Trees
     - k-Nearest Neighbours
     - Naive Bayes
     - Logistic Regression
     - Perceptron and Multi-layer Perceptrons
     - Neural Networks
     - SVM and Kernel estimation
  9. Supervised Learning: Example: Naive Bayes Classifier
     - President Obama’s Speech’s Word Map [figure]
  10. Supervised Learning: Example: Naive Bayes Classifier
     - A Spam Document’s Word Map [figure]
  11. Supervised Learning: Example: Naive Bayes Classifier
     - Running a test on the Classifier
     - Input: “Order a trial Adobe chicken daily EABList new summer savings, welcome!”
     - Classifier output: Spam Bin
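The idea behind slides 9 to 11 can be sketched in a few lines. This is a minimal illustration only, assuming a multinomial Naive Bayes with add-one smoothing; the function names and the tiny training set are invented for this example and are not taken from the talk or from Mahout.

```python
import math
import re
from collections import Counter

# Hypothetical training data, invented for illustration (not from the slides).
TRAINING_DOCS = [
    ("spam", "order a trial now new summer savings"),
    ("spam", "daily savings welcome order now"),
    ("ham", "the president spoke about the economy"),
    ("ham", "congress and the president discussed jobs"),
]

def tokenize(text):
    """Lower-case the text and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", text.lower())

def train(labelled_docs):
    """Count documents per label and words per label (the 'word maps')."""
    doc_counts = Counter()
    word_counts = {}
    vocabulary = set()
    for label, text in labelled_docs:
        words = tokenize(text)
        doc_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(words)
        vocabulary.update(words)
    return doc_counts, word_counts, vocabulary

def classify(text, model):
    """Return the label with the highest log-probability for the text."""
    doc_counts, word_counts, vocabulary = model
    total_docs = sum(doc_counts.values())
    scores = {}
    for label in doc_counts:
        total_words = sum(word_counts[label].values())
        score = math.log(doc_counts[label] / total_docs)  # class prior
        for word in tokenize(text):
            if word in vocabulary:  # words never seen in training are skipped
                count = word_counts[label][word]
                # Add-one smoothing keeps unseen (label, word) pairs non-zero.
                score += math.log((count + 1) / (total_words + len(vocabulary)))
        scores[label] = score
    return max(scores, key=scores.get)

model = train(TRAINING_DOCS)
print(classify("Order a trial Adobe chicken daily EABList new summer savings, welcome!", model))
```

With this toy training set the test sentence from the slide shares most of its known words with the spam documents, so it lands in the spam bin.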
  12. Unsupervised Learning: Introduction
     - Finding hidden structure in data
     - Unlabelled Data
     - SMEs are needed in post-processing to verify, validate and use the output
     - Used in exploratory analysis rather than predictive analytics
     - Applications
       - Pattern Recognition
       - Groupings based on a distance measure
       - Group of People, Objects, ...
  13. Unsupervised Learning: Algorithms
     - Clustering
       - k-Means, MinHash, Hierarchical Clustering
     - Hidden Markov Models
     - Feature Extraction methods
     - Self-organizing Maps (Neural Nets)
  14. Unsupervised Learning: Example K-Means [figure]
     Source: http://apandre.wordpress.com/visible-data/cluster-analysis/
  15. Learning Problem: Cat and Dog Problem
     - Humans can easily classify which is a cat and which is a dog.
     - But how can a computer do that?
     - Some attempts used Clustering Mechanisms to solve it: Co-occurrence Clustering, Deep Learning
  16. Apache Mahout: Scalable Machine Learning Library
  17. History and Etymology
     - Inspired by “Map-Reduce for Machine Learning on Multicore”, Ng et al.
     - Written in Java. Apache License.
     - Founders
       - Mahout: Isabel Drost, Grant Ingersoll, Karl Wettin.
       - Taste: Sean Owen
     - Mahout: Keeper/Driver of Elephants.
     - Current Release: 0.8 (stable)
  18. Need
     - BigData
       - Ever-growing data.
       - Yesterday’s methods to process tomorrow’s data
       - Cheap Storage
     - Scalable from Ground Up
       - Should be built on top of any existing Distributed Systems framework
       - Should contain distributed versions of ML algorithms
     Size, Classification and Tools [table]
     - Lines (Sample Data): Analysis and Visualisation: Whiteboard, Bash, ...
     - KBs – low MBs (Prototype Data): Analysis and Visualisation: Matlab, Octave, R, Processing, Bash, ...
     - MBs – low GBs (Online Data): Storage: MySQL (DBs), ...; Analysis: NumPy, SciPy, Pandas, Weka, ...
     - GBs – TBs – PBs: Visualisation: Flare, AmCharts, Raphael; Storage: HDFS, HBase, Cassandra, ...; Analysis: Hive, Giraph, Hama, Mahout
  19. Mahout Modules [diagram]
     - Applications
     - Evolutionary Algorithms, Classification, Clustering, Recommenders, Regression, FPM, Dimension Reduction
     - Utilities: Lucene/Vectorizer
     - Math: Vectors/Matrices/SVD
     - Collections (Primitives)
     - Hadoop
  20. Recommender Systems
  21. Recommender Systems: Introduction
     - Types of Recommender Systems
       - Content Based Recommendations
       - Collaborative Filtering Recommendations
         - User-User Recommendations
         - Item-Item Recommendations
       - Dimensionality Reduction (SVD) Recommendations
     - Applications
       - Products you would like to buy
       - People you might want to connect with
       - Potential Life-Partners
       - Recommending Songs you might like
       - ...
  22. Recommender Systems: Collaborative Filtering in Action
     - Assuming people have seen at least one movie.
     - Cold Start?
     - [figure: user/movie matrix; 1: seen, 0: not seen]
  23. Collaborative Filtering in Action
     - Tanimoto Coefficient: $T(a, b) = \frac{N_C}{N_A + N_B - N_C}$
     - $N_A$: Number of Customers who bought A
     - $N_B$: Number of Customers who bought B
     - $N_C$: Number of Customers who bought A and B
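The Tanimoto coefficient divides the number of customers who bought both items by the number who bought at least one of them. A minimal sketch, assuming each item is represented by the set of customer IDs who bought it (the function name and the sample IDs are illustrative, not from the talk):

```python
def tanimoto(buyers_a, buyers_b):
    """Tanimoto similarity between two items, given their sets of buyers."""
    n_a = len(buyers_a)             # customers who bought A
    n_b = len(buyers_b)             # customers who bought B
    n_c = len(buyers_a & buyers_b)  # customers who bought both A and B
    denominator = n_a + n_b - n_c   # customers who bought A or B
    return n_c / denominator if denominator else 0.0

# Hypothetical purchase data: 4 buyers of A, 3 buyers of B, 2 in common.
buyers_a = {"u1", "u2", "u3", "u4"}
buyers_b = {"u3", "u4", "u5"}
print(tanimoto(buyers_a, buyers_b))  # 2 / (4 + 3 - 2)
```

The value ranges from 0 (no shared buyers) to 1 (identical buyer sets); returning 0.0 for two empty sets is a convention chosen here to avoid division by zero.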
  24. Collaborative Filtering in Action
     - Cosine Coefficient: $C(a, b) = \frac{N_C}{\sqrt{N_A \times N_B}}$
     - $N_A$: Number of Customers who bought A
     - $N_B$: Number of Customers who bought B
     - $N_C$: Number of Customers who bought A and B
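For binary "bought / did not buy" data, the cosine coefficient reduces to the number of shared buyers divided by the square root of the product of the two buyer counts. A minimal sketch under that assumption (function name and sample IDs are illustrative):

```python
import math

def cosine_coefficient(buyers_a, buyers_b):
    """Cosine similarity between two items, given their sets of buyers."""
    n_a = len(buyers_a)             # customers who bought A
    n_b = len(buyers_b)             # customers who bought B
    n_c = len(buyers_a & buyers_b)  # customers who bought both A and B
    if n_a == 0 or n_b == 0:
        return 0.0                  # convention: no buyers means no similarity
    return n_c / math.sqrt(n_a * n_b)

# Hypothetical purchase data: 4 buyers each, 2 in common.
buyers_a = {"u1", "u2", "u3", "u4"}
buyers_b = {"u3", "u4", "u5", "u6"}
print(cosine_coefficient(buyers_a, buyers_b))  # 2 / sqrt(4 * 4)
```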
  25. Apache Mahout Recommender System Architecture
     - Two Modes
       - Stand-alone, non-distributed (“Taste”)
       - Scalable Distributed Algorithmic version for Collaborative Filtering
     - Top-level Packages
       - Data Model
       - User Similarity
       - Item Similarity
       - User Neighbourhood
       - Recommender
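To show how the pieces named on this slide fit together, here is a small user-based recommender in plain Python. It only mirrors the roles of the components (data model, user similarity, user neighbourhood, recommender); it is not Mahout's Java API, and all names and the viewing data are hypothetical.

```python
# Data model: user -> set of movies seen (the 1s of a seen/not-seen matrix).
DATA_MODEL = {
    "alice": {"Matrix", "Inception"},
    "bob": {"Matrix", "Inception", "Interstellar"},
    "carol": {"Matrix", "Titanic"},
    "dave": {"Titanic", "Notebook"},
}

def user_similarity(data, user, other):
    """User similarity: Tanimoto coefficient over the two users' item sets."""
    seen_either = len(data[user] | data[other])
    return len(data[user] & data[other]) / seen_either if seen_either else 0.0

def user_neighbourhood(data, user, size):
    """User neighbourhood: the `size` users most similar to `user`."""
    others = [u for u in data if u != user]
    others.sort(key=lambda u: user_similarity(data, user, u), reverse=True)
    return others[:size]

def recommend(data, user, neighbours=2, how_many=3):
    """Recommender: rank unseen items by the summed similarity of neighbours."""
    scores = {}
    for other in user_neighbourhood(data, user, neighbours):
        weight = user_similarity(data, user, other)
        for item in data[other] - data[user]:
            scores[item] = scores.get(item, 0.0) + weight
    # Highest score first; ties broken alphabetically for a stable order.
    return sorted(scores, key=lambda item: (-scores[item], item))[:how_many]

print(recommend(DATA_MODEL, "alice"))
```

For "alice", the nearest neighbours are "bob" and "carol", so the movies they have seen and she has not are ranked by how similar each neighbour is to her. A user with no history has no neighbours to draw on, which is the cold-start problem raised on slide 22.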
  26. Naive Bayes Classifier
     - Input: “Order a trial Adobe chicken daily EABList new summer savings, welcome!”
     - [figure: Classifier]
  27. Naive Bayes Classifier
     - Naive Bayes is a fairly involved process in Mahout: training the classifier requires four separate Hadoop jobs.
     - Training:
       - Read the Features
       - Calculate per-Document Statistics
       - Normalize across Categories
       - Calculate normalizing factor of each label
     - Testing
       - Classification (fifth job, explicitly invoked)
  28. K-Means Clustering: Iterations [figure]
  29. K-Means Clustering: MapReduce Version [figure]
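The figures for these two slides did not survive, so the following is only a sketch of how k-means is commonly split into map and reduce steps: the map step assigns each point to its nearest centroid, and the reduce step averages the points of each cluster into a new centroid. It runs in a single Python process and is not Mahout's implementation; the function names and sample points are assumptions.

```python
from collections import defaultdict

def nearest_centroid(point, centroids):
    """Index of the centroid closest to `point` (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centroids[i])),
    )

def map_phase(points, centroids):
    """Map: emit (cluster_id, point) for every input point."""
    for point in points:
        yield nearest_centroid(point, centroids), point

def reduce_phase(pairs, centroids):
    """Reduce: average the points of each cluster into a new centroid."""
    clusters = defaultdict(list)
    for cluster_id, point in pairs:
        clusters[cluster_id].append(point)
    updated = list(centroids)  # clusters with no points keep their old centroid
    for cluster_id, members in clusters.items():
        updated[cluster_id] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return updated

def kmeans(points, centroids, max_iterations=10):
    """Repeat map + reduce until the centroids stop moving."""
    for _ in range(max_iterations):
        updated = reduce_phase(map_phase(points, centroids), centroids)
        if updated == centroids:
            break
        centroids = updated
    return centroids

# Hypothetical 2-D points forming two obvious groups.
points = [(1.0, 1.0), (1.0, 2.0), (9.0, 9.0), (10.0, 9.0)]
print(kmeans(points, [(0.0, 0.0), (5.0, 5.0)]))
```

In a distributed setting each iteration would typically be one MapReduce job, with the current centroids shared with every mapper; here the loop in `kmeans` plays that role.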
  30. Summary
     - Machine Learning
       - Learning Algorithms
       - Varied Applications
     - Mahout
       - Scaling to Giga/Tera/Peta Scale
       - Free and Open Source
  31. More Info.
     1. “Scalable Similarity-Based Neighborhood Methods with MapReduce” by Sebastian Schelter, Christoph Boden and Volker Markl. RecSys 2012.
     2. “Case Study Evaluation of Mahout as a Recommender Platform” by Carlos E. Seminario and David C. Wilson. Workshop on Recommendation Utility Evaluation: Beyond RMSE (RUE 2012).
     3. http://mahout.apache.org/ (Apache Mahout Project Page)
     4. http://www.ibm.com/developerworks/java/library/j-mahout/ (Introducing Apache Mahout)
     5. [VIDEO] “Collaborative filtering at scale” by Sean Owen
     6. [BOOK] “Mahout in Action” by Owen et al., Manning Pub.
  32. Questions?
  33. Thank You. Go BigData!!!
     © Varad Meru, 2014