Brief Introduction:
“Scalable machine learning library”
Mahout is a solid Java framework in the Data Mining/Artificial Intelligence area. It is a machine
learning project by the Apache Software Foundation that tries to build intelligent algorithms that
learn from some data input.
What is special about Mahout is that it is a scalable library, prepared to deal with huge datasets. Its
algorithms are built on top of the Apache Hadoop project and, so, they work with distributed
It’s also scalable. Mahout aims to be the machine learning tool of choice when the collection of data
to be processed is very large, perhaps far too large for a single machine.
Finally, it’s a Java library. It doesn’t provide a user interface, a prepackaged server, or an installer.
It’s a framework of tools intended to be used and adapted by developers.
Although Mahout is, in theory, a project open to implementations of all kinds of machine learning
techniques, it’s in practice a project that focuses on three key areas of machine learning at the
moment. They are:-
1. Recommender engines
2. Clustering
3. Classification
Some examples where these are used:
1. Recommender engines: e.g., social networking sites like Facebook use variants on recommender
techniques to identify people most likely to be as-yet-unconnected friends.
2. Clustering: e.g., Google News groups news articles by topic using clustering techniques, in order
to present news grouped by logical story, rather than presenting a raw listing of all articles.
3. Classification: e.g., Yahoo! Mail decides whether or not incoming messages are spam based on
prior emails and spam reports from users, as well as on characteristics of the
Each of these techniques works best when provided with a large amount of good input data. In some
cases, these techniques must not only work on large amounts of input, but must produce results
quickly, and these factors make scalability a major issue. And, as mentioned before, one of Mahout’s
key reasons for being is to produce implementations of these techniques that do scale up to huge
Our focus here is the classification technique, so we now move on to classification using
Mahout.
Classification is a simplified form of decision making that gives discrete answers to an individual
Machine-based classification is an automation of this decision making process that learns from
examples of correct decision making and emulates those decisions automatically—a core concept in
Mahout can be used on a wide range of classification projects, but the advantage of Mahout over
other approaches becomes striking as the number of training examples gets extremely large. What
large means can vary enormously. Up to about 100,000 examples, other classification systems can
be efficient and accurate. But generally, as the input exceeds 1 to 10 million training examples,
something scalable like Mahout is needed.
The reason Mahout has an advantage with larger data sets is that as input data increases, the time
or memory requirements for training may not increase linearly in a non-scalable system. A system
that slows by a factor of 2 with twice the data may be acceptable, but if 5 times as much data input
results in the system taking 100 times as long to run, another solution must be found. This is the sort
of situation in which Mahout shines.
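To make the scaling argument concrete, here is a small sketch in plain Python (not Mahout code). It assumes, purely for illustration, that training time grows like n^k for some exponent k, and it computes the k implied by the two slowdowns mentioned above. The power-law model and the function name are assumptions, not something Mahout defines.

```python
import math


def implied_exponent(data_factor: float, time_factor: float) -> float:
    """Return k such that data_factor ** k == time_factor.

    Assumes (for illustration only) that training time grows like n ** k.
    """
    return math.log(time_factor) / math.log(data_factor)


# Twice the data, twice the time: linear growth (k = 1).
linear = implied_exponent(2, 2)

# Five times the data, 100 times the time: far worse than linear.
steep = implied_exponent(5, 100)

print(f"linear case: k = {linear:.2f}")
print(f"steep case:  k = {steep:.2f}")
```

Under this assumed model, the second case corresponds to an exponent of nearly 3, which is why such a system quickly becomes impractical as the data grows.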
The following table shows where Mahout is the best choice:-
System size in number     Choice of classification
< 100,000                 Traditional, non-Mahout approaches should work very well. Mahout may
                          even be slower for training.
100,000 to 1 million      Mahout begins to be a good choice. The flexible API may make Mahout a
                          preferred choice, even though there is no performance
1 million to 10 million   Mahout is an excellent choice in
> 10 million              Mahout excels where others fail.
Classification algorithms are at the heart of what is called predictive analytics. The goal of predictive
analytics is to build automated systems that can make decisions to replicate human judgment.
Classification algorithms are a fundamental tool for meeting that goal. One example of predictive
analytics is spam detection. A computer uses the details of user history and features of email
messages to determine whether new messages are spam or are relatively welcome email. Another
example is credit card fraud detection. A computer uses the recent history of an account and the
details of the current transaction to determine whether the transaction is fraudulent.
There are two main phases involved in building a classification system:
1. the creation of a model produced by a learning algorithm,
2. the use of that model to assign new data to categories.
The first phase involves several tasks, such as selecting the training data, the output categories (the
targets), the algorithm through which the system will learn, and the variables used as input.
Some terms are worth knowing before we go deeper into classification:
Model               A computer program that makes decisions; in classification, the output of
                    the training algorithm is a model.
Training data       A subset of training examples labelled with the value of the target variable
                    and used as input to the learning algorithm to produce the model.
Test data           A withheld portion of the training data with the value of the target variable
                    hidden so that it can be used to evaluate
Training            The learning process that uses training data to produce a model. That model
                    can then compute estimates of the target variable given the predictor
                    variables as
Training example    An entity with features that will be used as input for the learning algorithm.
Feature             A known characteristic of a training or a new example; a feature is
                    equivalent to a
Variable            In this context, the value of a feature or a function of several features.
                    This usage is somewhat different from the use of variable in a computer
                    program.
Record              A container where an example is stored; such a record is composed of fields.
Field               Part of a record that contains the value of a feature (a variable).
Predictor variable  A feature selected for use as input to a classification model. Not all
                    features need be used. Some features may be algorithmic combinations of
                    other
Target variable     A feature that the classification model is attempting to estimate: the
                    target variable is categorical, and its determination is the aim of the
                    classification system.
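The terms above can be illustrated with a small sketch in plain Python (not Mahout code). The email records, field names, and the helper `split_example` are all invented for illustration.

```python
# Each record is a container for one example; its keys are the fields.
# All field names and values here are hypothetical.
records = [
    {"sender_known": 1, "num_links": 0, "subject": "Lunch tomorrow?", "is_spam": "no"},
    {"sender_known": 0, "num_links": 7, "subject": "WIN A PRIZE NOW", "is_spam": "yes"},
    {"sender_known": 0, "num_links": 3, "subject": "Cheap meds", "is_spam": "yes"},
    {"sender_known": 1, "num_links": 1, "subject": "Meeting notes", "is_spam": "no"},
]

# The target variable is categorical: "yes" or "no".
TARGET = "is_spam"

# Predictor variables: only some of the available features are used.
PREDICTORS = ["sender_known", "num_links"]


def split_example(record):
    """Return (predictor values, target value) for one record."""
    return [record[name] for name in PREDICTORS], record[TARGET]


# Training data keeps the target visible; test data is withheld for evaluation.
training_data = [split_example(r) for r in records[:3]]
test_data = [split_example(r) for r in records[3:]]

print(training_data)
print(test_data)
```

Note that the `subject` field is a feature of each example but is not used as a predictor variable here.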
Workflow of a typical classification project, in brief:
1. Training the model
   Define target variable.
   Collect historical data.
   Define predictor variables.
   Select a learning algorithm.
   Use the learning algorithm to train the
2. Evaluating the model
   Run test data.
   Adjust the input (use different predictor variables, different algorithms, or both).
3. Using the model in production
   Input new examples to estimate unknown target values.
   Retrain the model as needed.
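The three stages above can be sketched end to end in plain Python. The learner below is a deliberately trivial stand-in (a single threshold on one predictor), not one of Mahout's algorithms. The data values are made up, and the sketch assumes that spam examples tend to have larger predictor values than ham.

```python
def train(examples):
    """Stage 1: learn a threshold halfway between the two class means.

    Each example is (number_of_links, label). Assumes spam has larger values.
    """
    spam = [x for x, label in examples if label == "spam"]
    ham = [x for x, label in examples if label == "ham"]
    return (sum(spam) / len(spam) + sum(ham) / len(ham)) / 2


def classify(model, x):
    """Apply the model (a threshold) to one new predictor value."""
    return "spam" if x > model else "ham"


def accuracy(model, examples):
    """Stage 2: fraction of held-out examples classified correctly."""
    correct = sum(1 for x, label in examples if classify(model, x) == label)
    return correct / len(examples)


# Historical data with known target values (hypothetical).
historical = [(0, "ham"), (1, "ham"), (2, "ham"),
              (6, "spam"), (8, "spam"), (9, "spam"),
              (1, "ham"), (7, "spam")]
training_data, test_data = historical[:6], historical[6:]

model = train(training_data)        # 1. Training the model
score = accuracy(model, test_data)  # 2. Evaluating the model
prediction = classify(model, 5)     # 3. Using the model on a new example

print(model, score, prediction)
```

If the score in stage 2 were too low, the usual response is to go back and change the predictor variables, the algorithm, or both before moving to stage 3.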
Brief Study of the Workflow:-
Workflow for Stage 1:
1. Define Categories for Target Variable:-
The target variable can’t have an open-ended set of possible values. Your choice of
categories, in turn, affects your choices for possible learning algorithms, because some
algorithms are limited to binary target variables. Although you can have any number of categories,
limiting them to just two gives you more options for learning algorithms.
2. Collect Historical Data:-
The source of historical data you choose will be directed in part by the need to collect
historical data with known values for the target variable.
3. Define Predictor Variables:
These variables are the concrete encoding of the features extracted from the training and
test examples. The predictor variables appear in records for the training and test data and
for the production data.
4. Select a learning algorithm for training the model:
This is one of the most important parts. There are a number of algorithms, such as:
a) Logistic Regression (SGD)
c) Support Vector Machines (SVM)
d) Perceptron and Winnow
e) Neural Network
f) Random Forests
g) Restricted Boltzmann Machines
h) Online Passive Aggressive
j) Hidden Markov Models (HMM) - Training is done in Map-Reduce
Workflow for Stage 2: Evaluating the classification model
An essential step before using the classification system in production is to find out
how well it’s likely to work. To do this, you must evaluate the accuracy of the model
and make large or small adjustments as needed before you begin classification.
Workflow for Stage 3: Using the model in production
Once the model’s output has reached an acceptable level of accuracy, classification of new data can
begin. The performance of the classification system in production will depend on several factors, one
of the most important being the quality of the input data. If the new data to be analyzed has
inaccuracies in the values of predictor variables, or if the new data isn’t an appropriate match to the
training data, or if external conditions change over time, the quality of the classification model’s
output will degrade. In order to guard against this problem, periodic retesting of the model is useful,
and retraining may be necessary.
Points about the different steps, in detail, that you must know before starting:-
1. Training the classifier:-
In training, the most important part is the feature-extraction part, from which we find out the predictor
Note: Your classifier can only be as good as the training data lets it be.
– If you don’t do good data prep, everything will perform poorly
– Data collection and pre-processing take the bulk of the time
Preparing data for the training algorithm consists of two main steps:
1. Preprocessing raw data: Raw data is rearranged into records with identical fields.
These fields can be of four types (continuous, categorical, word-like, or text-like)
in order to be classifiable.
2. Converting data to vectors: Classifiable data is parsed and vectorized using custom
code or tools such as Lucene analyzers and Mahout vector encoders. Some
Mahout classifiers also include vectorization code.
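The two preparation steps can be sketched in plain Python as follows. This is custom illustrative code, not Mahout's vector encoders: it parses raw lines into records with identical fields, then turns one continuous and one categorical field into a fixed-length vector using a dictionary of category values. The field names and sample data are invented.

```python
def build_dictionary(records, field):
    """Give each distinct categorical value a fixed index.

    This needs a full pass over the data before any vector can be built.
    """
    values = sorted({record[field] for record in records})
    return {value: index for index, value in enumerate(values)}


def to_vector(record, color_index):
    """Encode one record as [price, one-hot color...]."""
    vector = [0.0] * (1 + len(color_index))
    vector[0] = float(record["price"])                # continuous field
    vector[1 + color_index[record["color"]]] = 1.0    # categorical field
    return vector


# Step 1: raw data rearranged into records with identical fields.
raw = ["19.99,red", "5.00,blue", "7.50,red"]
records = [{"price": price, "color": color}
           for price, color in (line.split(",") for line in raw)]

# Step 2: records converted to vectors of known, consistent length.
color_index = build_dictionary(records, "color")
vectors = [to_vector(record, color_index) for record in records]

print(color_index)
print(vectors)
```

Word-like and text-like fields need a different treatment, since their set of possible values is large or open-ended.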
The features should be chosen very carefully, as they are the basis for the performance of any
classification model. For example:
Sometimes age is better for classification, and sometimes birth
date is better. For instance, in the case of insurance data on car accidents,
age will be a better variable to use because having car accidents is more
related to life-stage than it is to the generation a person belongs to. On
the other hand, in the case of music purchases, birth date might be more
interesting because people often retain early music preferences as they
get older. Their tastes often reflect those of their generation.
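As a rough illustration of this choice, the sketch below derives both candidate features from the same raw birth date. It is plain Python, and the helper names `age_on` and `birth_decade` are hypothetical. Using the decade as a stand-in for "generation" is also just an assumption.

```python
from datetime import date


def age_on(birth_date: date, event_date: date) -> int:
    """Age in whole years on event_date (a life-stage feature)."""
    years = event_date.year - birth_date.year
    # Subtract one if the birthday has not yet occurred in the event year.
    if (event_date.month, event_date.day) < (birth_date.month, birth_date.day):
        years -= 1
    return years


def birth_decade(birth_date: date) -> int:
    """Decade of birth, e.g. 1980 (a rough generation feature)."""
    return birth_date.year // 10 * 10


born = date(1980, 6, 15)

# For accident data, the age at the time of the accident may be more useful.
print(age_on(born, date(2010, 6, 14)))

# For music purchases, the generation may be more useful.
print(birth_decade(born))
```

The age feature changes with each event date, while the birth-decade feature stays fixed for a person, which is exactly the distinction the example above relies on.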
How to convert data into vectors:-
Approach: Represent vectors implicitly as bags of words.
Used: In the Bayesian classifier method.
Benefit: Involves one pass and no collisions, and it avoids the need for a dictionary, but it means that it’s
difficult to make use of Mahout’s linear algebra capabilities that require known and consistent
lengths for the Vector objects involved.
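A bag of words can be sketched in plain Python as a mapping from word to count. This is only an illustration of the idea, not Mahout's implementation, and the whitespace tokenizer is a simplifying assumption.

```python
from collections import Counter


def bag_of_words(text):
    """Count word occurrences in a single pass; no dictionary is needed."""
    return Counter(text.lower().split())


doc_a = bag_of_words("Win a prize win now")
doc_b = bag_of_words("Meeting notes attached")

print(doc_a)
print(doc_b)

# The two "vectors" have different keys and different sizes, so they
# cannot be fed directly to code that expects fixed-length vectors.
print(len(doc_a), len(doc_b))
```

Each distinct word gets its own key, so there are no collisions, but the size of each bag depends on the document.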
There are other techniques, such as feature hashing, which is used with SGD (Stochastic Gradient
Descent) in algorithms such as logistic regression.
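Feature hashing can be sketched as follows in plain Python. This is a minimal illustration of the general technique, not Mahout's encoder classes. The vector size of 16 and the use of CRC32 as the hash function are arbitrary assumptions.

```python
import zlib


def hashed_vector(text, size=16):
    """Map words straight to vector positions by hashing.

    No dictionary is built, and every vector has the same length.
    Different words may collide on the same position.
    """
    vector = [0.0] * size
    for word in text.lower().split():
        # CRC32 is stable across runs, unlike Python's built-in hash() for str.
        index = zlib.crc32(word.encode("utf-8")) % size
        vector[index] += 1.0
    return vector


v1 = hashed_vector("win a prize win now")
v2 = hashed_vector("meeting notes attached")

print(v1)
print(v2)
```

The trade-off relative to a bag of words is that collisions are possible, but every vector has a known and consistent length, which is what the linear algebra code needs.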
Choosing an algorithm to train the classifier:
The following helps you choose the algorithm according to the size of the training data:
The algorithms differ somewhat in the overhead or cost of training, the size of the data set for which
they’re most efficient, and the complexity of analyses they can deliver.
We will learn about the algorithms in a later section.
2. Evaluating the classifier:-
To evaluate classifiers, Mahout offers a variety of performance metrics. The main approaches are
percent correct, confusion matrix, AUC, and log likelihood. The naive Bayes and complementary
naive Bayes classifier algorithms are best evaluated using percent correct and confusion matrix. Any
of these methods will work with the SGD algorithm; AUC or log likelihood may be particularly useful,
because they provide insight into the model’s confidence level.
Mahout already provides classes for all of these metrics, so no extra effort is needed on our part;
we can use the Mahout classes directly.
Metric              Supported by Mahout class
Percent correct     CrossFoldLearner
Confusion matrix    ConfusionMatrix, Auc
Entropy matrix      Auc
AUC                 Auc, OnlineAuc, CrossFoldLearner, AdaptiveLogisticRegression
Log likelihood      CrossFoldLearner
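To show what these metrics measure, here is a from-scratch sketch in plain Python. It does not use or imitate the Mahout classes listed above. The function names, the sample scores, and the 0.5 decision threshold are all assumptions made for illustration.

```python
import math
from collections import Counter


def percent_correct(actual, predicted):
    """Percentage of examples whose predicted label matches the actual one."""
    return 100.0 * sum(a == p for a, p in zip(actual, predicted)) / len(actual)


def confusion_matrix(actual, predicted):
    """Counts of (actual label, predicted label) pairs."""
    return Counter(zip(actual, predicted))


def auc(actual, scores, positive="spam"):
    """Probability that a random positive scores above a random negative."""
    pos = [s for a, s in zip(actual, scores) if a == positive]
    neg = [s for a, s in zip(actual, scores) if a != positive]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_log_likelihood(actual, scores, positive="spam"):
    """Mean log of the probability assigned to the correct label."""
    total = sum(math.log(s if a == positive else 1.0 - s)
                for a, s in zip(actual, scores))
    return total / len(actual)


# Hypothetical test data: true labels and the model's P(spam) for each example.
actual = ["spam", "spam", "ham", "ham"]
scores = [0.9, 0.4, 0.6, 0.1]
predicted = ["spam" if s >= 0.5 else "ham" for s in scores]

print(percent_correct(actual, predicted))
print(confusion_matrix(actual, predicted))
print(auc(actual, scores))
print(average_log_likelihood(actual, scores))
```

Percent correct and the confusion matrix only look at the final labels, while AUC and log likelihood use the scores themselves, which is why they give insight into the model's confidence.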
3. Deploying the classifier:-
The deployment process can be broken down into these steps:
1. Scope out the problem
2. Optimize feature extraction as needed
3. Optimize vector extraction as needed
4. Deploy the scalable classifier service
Naive Bayes:
• Called Naïve Bayes because it’s based on “Bayes’ Rule” and “naively” assumes independence
given the label
– It is only valid to multiply probabilities when the events are independent
– Simplistic assumption in real life
– Despite the name, Naïve Bayes works well on actual datasets
• Simple probabilistic classifier based on
– applying Bayes’ theorem (from Bayesian statistics)
– strong (naive) independence assumptions.
– A more descriptive term for the underlying probability model would be
“independent feature model”.
The Naive Bayes algorithm is a probabilistic classification algorithm. It makes its decisions about
which class to assign to an input document using probabilities derived from training data. The
training process analyzes the relationship between words in the training documents and categories,
and then categories and the entire training set. The available facts are collected using calculations
based on Bayes’ Theorem to produce the probability that a collection of words (a document) belongs
in a certain class.
Bayes’ Theorem states that the probability of a category given a document is equal to the probability
of a document given a category multiplied by the probability of the category divided by the
probability of a document. This can be expressed as:
P(Category | Document) = P(Document | Category) x P(Category) / P(Document)
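The formula can be worked through on a tiny, made-up training set. The sketch below is plain Python, not Mahout's Naive Bayes implementation. It estimates P(Document | Category) by multiplying per-word probabilities (the naive independence assumption) and uses add-one smoothing so unseen words do not produce zero probabilities. Both the smoothing and the training sentences are assumptions introduced for illustration.

```python
import math
from collections import Counter


def train_naive_bayes(documents):
    """documents is a list of (text, category) pairs."""
    category_counts = Counter(category for _, category in documents)
    word_counts = {category: Counter() for category in category_counts}
    for text, category in documents:
        word_counts[category].update(text.lower().split())
    vocabulary = {word for counts in word_counts.values() for word in counts}
    # P(Category): share of training documents in each category.
    priors = {c: n / len(documents) for c, n in category_counts.items()}
    return priors, word_counts, vocabulary


def posterior(text, priors, word_counts, vocabulary):
    """Return P(Category | Document) for every category."""
    log_joint = {}
    for category, prior in priors.items():
        total_words = sum(word_counts[category].values())
        score = math.log(prior)
        for word in text.lower().split():
            # P(word | Category) with add-one smoothing (an assumption here).
            count = word_counts[category][word]
            score += math.log((count + 1) / (total_words + len(vocabulary)))
        log_joint[category] = score  # log of P(Document | Category) x P(Category)
    # P(Document) is the sum of the joint probabilities over all categories.
    evidence = sum(math.exp(score) for score in log_joint.values())
    return {c: math.exp(score) / evidence for c, score in log_joint.items()}


training = [
    ("win a prize now", "spam"),
    ("cheap prize offer", "spam"),
    ("lunch meeting tomorrow", "ham"),
    ("meeting notes attached", "ham"),
]
priors, word_counts, vocabulary = train_naive_bayes(training)
result = posterior("win prize", priors, word_counts, vocabulary)

print(result)
```

Working in log space avoids numeric underflow when many small word probabilities are multiplied together for long documents.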