MapReduce for Machine Learning
by Pranya Prabhakar
S4 MCA
05

CONTENTS
 Introduction
 Machine Learning
 MapReduce
 ML on MapReduce
 Apache Mahout and its installation steps
 Conclusion

Introduction
• Data is increasing rapidly
• It is necessary to process and analyze this data
• Machines analyze the data as a human being would. …Different

Machine Learning
 Supervised Learning:
Generates a function, based on assigned labels, that maps inputs to desired outputs.
 Unsupervised Learning:
Looks for patterns native to a dataset and models them, as in clustering (e.g. data mining and knowledge discovery).
 Reinforcement Learning:
Learns how to act given reward (or punishment) from the world.

Types of problems
 Classification:
Data is labeled, meaning each item is assigned a class
- Learn a model from manually classified data
- Predict the class of a new object based on its features and the learned model
e.g. spam/non-spam, fraud/non-fraud
 Clustering:
Data is not labeled, but can be divided into groups based on similarity
- Group similar-looking objects
- Notion of similarity: a distance measure
e.g. organizing pictures by faces without names
 Regression:
Data is labeled with a real value rather than a class
e.g. time series data such as the price of a stock over time
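
The classification steps above (learn from labeled data, then predict the class of a new object using a distance measure) can be sketched with a 1-nearest-neighbour classifier. This is a minimal illustration only: the feature vectors, the labels, and the function names euclidean and predict_1nn are assumptions made up for this example and do not come from the slides.

```python
import math


def euclidean(a, b):
    """Distance measure: straight-line distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def predict_1nn(training, features):
    """Predict the label of `features` as the label of the closest training item.

    `training` is a list of (feature_vector, label) pairs, i.e. manually
    classified data.
    """
    nearest = min(training, key=lambda item: euclidean(item[0], features))
    return nearest[1]


# Hypothetical, hand-labeled examples with two numeric features each.
TRAINING = [
    ((0.9, 0.8), "spam"),
    ((0.8, 0.9), "spam"),
    ((0.1, 0.2), "non-spam"),
    ((0.2, 0.1), "non-spam"),
]

if __name__ == "__main__":
    print(predict_1nn(TRAINING, (0.85, 0.7)))
```

The same distance function could serve as the notion of similarity for clustering; only the use of labels differs.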

Supervised Learning
Algorithms
 Decision Trees
 k-Nearest Neighbours
 Naive Bayes
 Logistic Regression
 Perceptron and Multi-layer Perceptrons
 Neural Networks
 SVM and Kernel estimation

Unsupervised Learning
Algorithms
 Clustering
◦ k-Means, MinHash, Hierarchical Clustering
 Hidden Markov Models
 Feature Extraction methods
 Self-organizing Maps (Neural Nets)

Uses
 Spam filtering
 Credit card Fraud detection
 Face recognition (computer vision)
 Speech understanding
 Medical diagnosis
and so on…

Current state of ML libraries
 Lack scalability
 Lack documentation and examples
 Lack Apache licensing
 Are not well tested
 Are research oriented
 Are not built on existing production-quality libraries
 Lack “deployability”

MapReduce
 A programming framework
 Used for parallel processing over large data sets
 The application is divided into small fragments of work that are distributed across the cluster
 The computation unit of Hadoop
 Two functions: Map() and Reduce()
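
To make the Map() and Reduce() roles concrete, here is a minimal single-process sketch of a word count, the usual introductory MapReduce job. It only imitates the data flow (map, group by key, reduce) in plain Python; it does not use Hadoop, and the names map_fn, shuffle, reduce_fn and run_job are hypothetical helpers chosen for this example.

```python
from collections import defaultdict


def map_fn(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group all emitted values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_fn(key, values):
    """Reduce: combine all values seen for one key into a single count."""
    return key, sum(values)


def run_job(lines):
    """Run map, shuffle and reduce in sequence over a list of input lines."""
    mapped = [pair for line in lines for pair in map_fn(line)]
    grouped = shuffle(mapped)
    return dict(reduce_fn(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(run_job(["big data needs big clusters", "data is distributed"]))
```

On a real cluster each call to map_fn would run on the node holding that fragment of the input, and each reduce_fn call would run wherever the framework routes that key.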

Apache Mahout
 The starting place for MapReduce-based machine learning
 A disparate collection of algorithms for
 Recommendation
 Clustering
 Classification
 Frequent itemset mining

Mahout installation
 Prerequisites
Java
Hadoop
Maven
 Java installation
1. sudo apt-get install sun java jdk
2. sudo gedit .bashrc
Set JAVA_HOME in the .bashrc file
 Installation of Maven
1. sudo apt-get install maven2
2. Open .bashrc and add the lines
############## Apache-Maven #########
export M2_HOME=/usr/local/apache-maven-3.0.4
export M2=$M2_HOME/bin
export PATH=$M2:$PATH
export JAVA_HOME=$HOME/programs/jdk

Mahout installation (contd.)
 Run mvn --version to verify that Maven is correctly installed.

 Hadoop installation
A single-node Hadoop cluster is set up in the same way as the Java installation.
 Installation of Mahout
1. Download Mahout from http://www.apache.org/dyn/closer.cgi/lucene/mahout/
2. Create a folder and move the downloaded file into it,
say, mkdir usr/local/mahout
3. Run mvn install. [Screenshot: mvn install output]

Example showing the 20 Newsgroups dataset

Application of Mahout
 Collaborative Filtering
Matrix factorization-based recommenders
A user-based recommender
 Clustering
Canopy Clustering
K-Means Clustering
Fuzzy K-Means
Affinity Propagation Clustering
 Classification
Naive Bayes
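
As an illustration of how one of the listed algorithms fits the MapReduce paradigm, the sketch below expresses a single k-means iteration as a map step (assign each point to its nearest centroid) and a reduce step (average the points assigned to each centroid). It is a plain-Python sketch under assumed names (map_point, reduce_cluster, kmeans_iteration) and made-up 2-D points; it is not Mahout's implementation or API.

```python
from collections import defaultdict


def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def map_point(point, centroids):
    """Map: emit (index of the nearest centroid, point)."""
    nearest = min(
        range(len(centroids)),
        key=lambda i: squared_distance(point, centroids[i]),
    )
    return nearest, point


def reduce_cluster(points):
    """Reduce: the new centroid is the mean of all points assigned to it."""
    count = len(points)
    return tuple(sum(coords) / count for coords in zip(*points))


def kmeans_iteration(points, centroids):
    """Run one map/group/reduce pass and return the updated centroids."""
    groups = defaultdict(list)
    for point in points:
        index, value = map_point(point, centroids)
        groups[index].append(value)
    # A centroid that attracted no points keeps its previous position.
    return [
        reduce_cluster(groups[i]) if i in groups else centroids[i]
        for i in range(len(centroids))
    ]


if __name__ == "__main__":
    data = [(1.0, 1.0), (1.0, 3.0), (9.0, 9.0), (11.0, 9.0)]
    print(kmeans_iteration(data, [(0.0, 0.0), (10.0, 10.0)]))
```

Because each point is assigned independently, the map step can be spread across the cluster; repeating the iteration until the centroids stop moving gives the full algorithm.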

Conclusion
 Using the MapReduce framework, a wide range of machine learning algorithms can be parallelized, and Apache Mahout provides a platform for machine learning in the MapReduce paradigm.
