MapReduce for Machine Learning
by
Pranya Prabhakar
S4 MCA
05
CONTENTS
• Introduction
• Machine Learning
• MapReduce
• ML on MapReduce
• Apache Mahout and its installation steps
• Conclusion
Introduction
• Data is increasing rapidly
• It is necessary to process and analyze the data
• Machines should analyze the data as a human being would. …Different
Machine Learning
• Supervised Learning:
Generates a function, based on assigned labels, that maps inputs to desired outputs.
• Unsupervised Learning:
Looks for patterns native to a dataset and models them, for example by clustering (e.g. data mining and knowledge discovery).
• Reinforcement Learning:
Learns how to act given reward (or punishment) from the world.
Types of problems
• Classification
Data is labeled, i.e. each item is assigned a class
- Learn a model from manually classified data
- Predict the class of a new object based on its features and the learned model
e.g. spam/non-spam, fraud/non-fraud
• Clustering
Data is not labeled, but can be divided into groups based on similarity
- Group similar-looking objects
- Notion of similarity: a distance measure
e.g. organizing pictures by faces without names
• Regression
Data is labeled with a real value rather than a class
e.g. time series data such as the price of a stock over time
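As a rough illustration of classification driven by a distance measure, the sketch below labels a new object with the class of its nearest training example (1-nearest-neighbour). The feature values and the names `euclidean`, `predict_1nn`, and `train` are made up for this example and are not taken from the slides.

```python
import math

def euclidean(a, b):
    """Distance measure: a smaller value means the two objects are more similar."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def predict_1nn(train, query):
    """Return the class of the labeled example closest to `query`."""
    _features, label = min(train, key=lambda example: euclidean(example[0], query))
    return label

# Hypothetical, manually classified data: two numeric features per message.
train = [
    ((0.0, 1.0), "non-spam"),
    ((1.0, 0.0), "non-spam"),
    ((8.0, 9.0), "spam"),
    ((9.0, 7.0), "spam"),
]

print(predict_1nn(train, (7.0, 8.0)))
print(predict_1nn(train, (0.5, 0.5)))
```

Clustering would use the same distance function but without the labels, and regression would return a real value instead of a class.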
Supervised Learning
Algorithms
• Decision Trees
• k-Nearest Neighbours
• Naive Bayes
• Logistic Regression
• Perceptron and Multi-layer Perceptrons
• Neural Networks
• SVM and Kernel estimation
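To make one entry of this list concrete, here is a minimal sketch of the classic perceptron learning rule on a toy problem (the logical AND of two inputs). The data set, learning rate, and epoch count are assumptions chosen only so the example is small and converges.

```python
def predict(weights, bias, x):
    """Output 1 if the weighted sum of the inputs exceeds 0, else 0."""
    activation = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if activation > 0 else 0

def train_perceptron(examples, epochs=20, lr=1.0):
    """Nudge the weights toward each misclassified example."""
    weights = [0.0] * len(examples[0][0])
    bias = 0.0
    for _ in range(epochs):
        for x, target in examples:
            error = target - predict(weights, bias, x)  # -1, 0 or +1
            weights = [w + lr * error * xi for w, xi in zip(weights, x)]
            bias += lr * error
    return weights, bias

# Toy labeled data (assumed): inputs mapped to the desired output.
and_gate = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
weights, bias = train_perceptron(and_gate)
print([predict(weights, bias, x) for x, _ in and_gate])
```

The rule only converges when the classes are linearly separable, which is one reason multi-layer networks appear in the same list.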
Unsupervised Learning
Algorithms
 Clustering
◦ k-Means, MinHash, Hierarchical Clustering
 Hidden Markov Models
 Feature Extraction methods
 Self-organizing Maps (Neural Nets)
Uses
• Spam filtering
• Credit card fraud detection
• Face recognition (computer vision)
• Speech understanding
• Medical diagnosis
and so on…
Current state of ML libraries
• Lack scalability
• Lack documentation and examples
• Lack Apache licensing
• Are not well tested
• Are research oriented
• Are not built on existing production-quality libraries
• Lack "deployability"
MapReduce
• A programming framework
• Used for parallel processing of large data sets
• An application is divided into small fragments of work that are distributed across the cluster
• The computation unit of Hadoop
• Two functions: Map(), which emits intermediate key/value pairs for each input record, and Reduce(), which combines all values that share a key
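The Map() and Reduce() pair can be sketched on a single machine with the usual word-count example. This is a simulation in plain Python, not Hadoop's API: `run_mapreduce` is a hypothetical helper that stands in for the framework's grouping ("shuffle") step, which a real cluster performs across nodes.

```python
from collections import defaultdict

def map_fn(_key, line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1

def reduce_fn(word, counts):
    """Reduce: sum all counts that were emitted for the same word."""
    return word, sum(counts)

def run_mapreduce(records, map_fn, reduce_fn):
    """Simulate the framework: map every record, group by key, then reduce."""
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in map_fn(key, value):
            groups[out_key].append(out_value)
    return dict(reduce_fn(k, vs) for k, vs in sorted(groups.items()))

# Input records are (line number, line text) pairs.
lines = list(enumerate(["map reduce map", "reduce the data"]))
word_counts = run_mapreduce(lines, map_fn, reduce_fn)
print(word_counts)
```

Because each call to `map_fn` sees only one record and each call to `reduce_fn` sees only one key, both phases can run in parallel on different machines.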
Apache Mahout
• The starting place for MapReduce-based machine learning
• A disparate collection of algorithms for
◦ Recommendation
◦ Clustering
◦ Classification
◦ Frequent itemset mining
Mahout installation
• Prerequisites
Java
Hadoop
Maven
• Java installation
1. sudo apt-get install sun java jdk (use the exact Sun Java JDK package name for your distribution)
2. gedit ~/.bashrc
Set JAVA_HOME in the .bashrc file
• Installation of Maven
1. sudo apt-get install maven2
2. Open .bashrc and add the lines
############## Apache-Maven #########
export M2_HOME=/usr/local/apache-maven-3.0.4
export M2=$M2_HOME/bin
export PATH=$M2:$PATH
export JAVA_HOME=$HOME/programs/jdk
Contd..
 Run mvn --version to verify that it is
correctly installed.
 Hadoop installation
A single-node Hadoop cluster is set up, following the same pattern as the Java installation
 Installation of Mahout
1. Download Mahout from http://www.apache.org/dyn/closer.cgi/lucene/mahout/
2. Create a folder and move the downloaded file into it, say, mkdir /usr/local/mahout
3. Run mvn install in that folder (screenshot of the build output not reproduced)
Example using the 20 Newsgroups dataset
Applications of Mahout
• Collaborative Filtering
◦ Matrix factorization based recommenders
◦ A user-based recommender
• Clustering
◦ Canopy Clustering
◦ K-Means Clustering
◦ Fuzzy K-Means
◦ Affinity Propagation Clustering
• Classification
◦ Naive Bayes
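The idea behind a user-based recommender can be sketched in a few lines. This is an illustrative toy in plain Python, not Mahout's API: the rating data, the cosine similarity choice, and the unnormalised weighted-sum scoring are all assumptions made for the example.

```python
import math

def cosine(u, v):
    """Cosine similarity between two users' rating dictionaries."""
    dot = sum(u[item] * v[item] for item in u if item in v)
    norm_u = math.sqrt(sum(r * r for r in u.values()))
    norm_v = math.sqrt(sum(r * r for r in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def recommend(user, ratings, top_n=2):
    """Rank items the user has not rated by similarity-weighted ratings of others."""
    scores = {}
    for other, other_ratings in ratings.items():
        if other == user:
            continue
        sim = cosine(ratings[user], other_ratings)
        for item, rating in other_ratings.items():
            if item not in ratings[user]:
                scores[item] = scores.get(item, 0.0) + sim * rating
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

# Hypothetical ratings: user -> {item: rating on a 1-5 scale}.
ratings = {
    "alice": {"A": 5, "B": 3},
    "bob": {"A": 5, "B": 3, "C": 4},
    "carol": {"A": 1, "B": 5, "D": 5},
}
print(recommend("alice", ratings))
```

Here bob's tastes match alice's more closely than carol's do, so the item bob rated ranks ahead of the item carol rated.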
Conclusion
• By using the MapReduce framework, we can parallelize a wide range of machine learning algorithms, and Apache Mahout provides a platform for machine learning in the MapReduce paradigm.
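As one example of how a learning algorithm fits this paradigm, a single k-means iteration can be written as a map step (assign each point to its nearest centroid) and a reduce step (average the points of each cluster). The sketch below runs on one machine with made-up points; the function names are hypothetical and it does not reproduce Mahout's implementation.

```python
from collections import defaultdict

def nearest(point, centroids):
    """Index of the centroid closest to `point` (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centroids[i])),
    )

def kmeans_map(point, centroids):
    """Map: emit (cluster id, (point, 1)) for one input point."""
    return nearest(point, centroids), (point, 1)

def kmeans_reduce(cluster_id, values):
    """Reduce: average all points that were assigned to one cluster."""
    total = [0.0] * len(values[0][0])
    count = 0
    for point, n in values:
        total = [t + p for t, p in zip(total, point)]
        count += n
    return cluster_id, tuple(t / count for t in total)

def kmeans_iteration(points, centroids):
    """One full iteration: map every point, group by cluster, reduce each group."""
    groups = defaultdict(list)
    for point in points:
        cluster_id, value = kmeans_map(point, centroids)
        groups[cluster_id].append(value)
    new_centroids = list(centroids)  # a cluster with no points keeps its centroid
    for cluster_id, values in groups.items():
        _, new_centroids[cluster_id] = kmeans_reduce(cluster_id, values)
    return new_centroids

points = [(1.0, 1.0), (1.0, 3.0), (9.0, 9.0), (11.0, 9.0)]
centroids = [(0.0, 0.0), (10.0, 10.0)]
print(kmeans_iteration(points, centroids))
```

In a cluster setting the iteration would be repeated as successive MapReduce jobs until the centroids stop moving.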
