Mahout Classification
Brief Introduction :
“Scalable machine learning library”
Mahout is a solid Java framework in the Data Mining/Artificial Intelligence area. It is a machine
learning project by the Apache Software Foundation that tries to build intelligent algorithms that
learn from some data input.
What is special about Mahout is that it is a scalable library, prepared to deal with huge datasets. Its
algorithms are built on top of the Apache Hadoop project, so they work with distributed
computing. Mahout aims to be the machine learning tool of choice when the collection of data
to be processed is very large, perhaps far too large for a single machine.
Finally, it’s a Java library. It doesn’t provide a user interface, a prepackaged server, or an installer.
It’s a framework of tools intended to be used and adapted by developers.
Although Mahout is, in theory, a project open to implementations of all kinds of machine learning
techniques, it’s in practice a project that focuses on three key areas of machine learning at the
moment. They are :-
1. Recommender Engines
2. Clustering
3. Classification
Some examples where these are used :
1. Recommender Engines : e.g. Social networking sites like Facebook use variants on recommender
techniques to identify people most likely to be as-yet-unconnected friends.
2. Clustering : e.g. Google News groups news articles by topic using clustering techniques, in order
to present news grouped by logical story, rather than presenting a raw listing of all articles.
3. Classification : e.g. Yahoo! Mail decides whether or not incoming messages are spam based on
prior emails and spam reports from users, as well as on characteristics of the
email itself.
Each of these techniques works best when provided with a large amount of good input data. In some
cases, these techniques must not only work on large amounts of input, but must produce results
quickly, and these factors make scalability a major issue. And, as mentioned before, one of Mahout’s
key reasons for being is to produce implementations of these techniques that do scale up to huge
input.
The focus here is the classification technique, so the rest of this document covers
classification using Mahout.
Classification :
Classification is a simplified form of decision making that gives discrete answers to an individual
question.
Machine-based classification is an automation of this decision making process that learns from
examples of correct decision making and emulates those decisions automatically—a core concept in
predictive analytics.
Mahout can be used on a wide range of classification projects, but the advantage of Mahout over
other approaches becomes striking as the number of training examples gets extremely large. What
large means can vary enormously. Up to about 100,000 examples, other classification systems can
be efficient and accurate. But generally, as the input exceeds 1 to 10 million training examples,
something scalable like Mahout is needed.
The reason Mahout has an advantage with larger data sets is that as input data increases, the time
or memory requirements for training may not increase linearly in a non-scalable system. A system
that slows by a factor of 2 with twice the data may be acceptable, but if 5 times as much data input
results in the system taking 100 times as long to run, another solution must be found. This is the sort
of situation in which Mahout shines.
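To make the arithmetic above concrete, here is a minimal sketch. It assumes, purely for illustration, that training time grows as a power of the input size (time ~ n^k); the function name scaling_exponent is hypothetical and not part of Mahout.

```python
import math

def scaling_exponent(data_factor, time_factor):
    """Return k such that time_factor == data_factor ** k (power-law assumption)."""
    return math.log(time_factor) / math.log(data_factor)

# Twice the data, twice the time: linear scaling (k = 1).
linear = scaling_exponent(2, 2)

# Five times the data, one hundred times the time: k is close to 3.
steep = scaling_exponent(5, 100)

print(round(linear, 2), round(steep, 2))
```

Under that assumption, the second case corresponds to nearly cubic growth, which is why adding more data quickly becomes impractical on a non-scalable system.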
The following table shows where Mahout is the best choice :-
System size (number of examples)   Choice of classification approach
< 100,000                          Traditional, non-Mahout approaches should work very well. Mahout may even be slower for training.
100,000 to 1 million               Mahout begins to be a good choice. The flexible API may make Mahout a preferred choice, even though there is no performance advantage.
1 million to 10 million            Mahout is an excellent choice in this range.
> 10 million                       Mahout excels where others fail.
Classification algorithms are at the heart of what is called predictive analytics. The goal of predictive
analytics is to build automated systems that can make decisions to replicate human judgment.
Classification algorithms are a fundamental tool for meeting that goal. One example of predictive
analytics is spam detection. A computer uses the details of user history and features of email
messages to determine whether new messages are spam or are relatively welcome email. Another
example is credit card fraud detection. A computer uses the recent history of an account and the
details of the current transaction to determine whether the transaction is fraudulent.
There are two main phases involved in building a classification system:
1. the creation of a model produced by a learning algorithm,
2. the use of that model to assign new data to categories.
The first phase involves several choices: the selection of training data, the output categories (the
targets), the algorithm through which the system will learn, and the variables used as input.
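The two phases can be sketched in a few lines. This is only an illustration with a deliberately trivial learner (it remembers the most common target for each feature value); the names train and classify and the sample data are made up and are not Mahout APIs.

```python
from collections import Counter, defaultdict

def train(examples):
    """Phase 1: build a model from (feature_value, target) pairs."""
    counts = defaultdict(Counter)
    for feature_value, target in examples:
        counts[feature_value][target] += 1
    # Fallback category for feature values never seen in training.
    overall = Counter(target for _, target in examples)
    default = overall.most_common(1)[0][0]
    table = {value: c.most_common(1)[0][0] for value, c in counts.items()}
    return {"table": table, "default": default}

def classify(model, feature_value):
    """Phase 2: use the model to assign a new example to a category."""
    return model["table"].get(feature_value, model["default"])

# Made-up training data: one predictor variable, target is spam or ham.
training_data = [
    ("contains_link", "spam"),
    ("contains_link", "spam"),
    ("contains_link", "ham"),
    ("plain_text", "ham"),
    ("plain_text", "ham"),
]
model = train(training_data)
print(classify(model, "contains_link"), classify(model, "plain_text"))
```

The point is the separation: train runs once over historical data, and classify is then applied to each new example.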
Some terms are worth knowing before going deeper into classification :
Term                 Meaning
Model                A computer program that makes decisions; in classification, the output of the training algorithm is a model.
Training data        A subset of training examples labelled with the value of the target variable and used as input to the learning algorithm to produce the model.
Test data            A withheld portion of the training data with the value of the target variable hidden so that it can be used to evaluate the model.
Training             The learning process that uses training data to produce a model. That model can then compute estimates of the target variable given the predictor variables as inputs.
Training example     An entity with features that will be used as input for the learning algorithm.
Feature              A known characteristic of a training or a new example; a feature is equivalent to a characteristic.
Variable             In this context, the value of a feature or a function of several features. This usage is somewhat different from the use of variable in a computer program.
Record               A container where an example is stored; such a record is composed of fields.
Field                Part of a record that contains the value of a feature (a variable).
Predictor variable   A feature selected for use as input to a classification model. Not all features need be used. Some features may be algorithmic combinations of other features.
Target variable      A feature that the classification model is attempting to estimate: the target variable is categorical, and its determination is the aim of the classification system.
Workflow of a typical classification project in brief :
Stage                               Step
1. Training the model               Define target variable.
                                    Collect historical data.
                                    Define predictor variables.
                                    Select a learning algorithm.
                                    Use the learning algorithm to train the model.
2. Evaluating the model             Run test data.
                                    Adjust the input (use different predictor variables, different algorithms, or both).
3. Using the model in production    Input new examples to estimate unknown target values.
                                    Retrain the model as needed.
Brief Study of the Workflow :-
Work Flow for Stage 1 :
1. Define Categories for Target Variable :-
The target variable can’t have an open-ended set of possible values. Your choice of
categories, in turn, affects your choices for possible learning algorithms, because some
algorithms are limited to binary target variables. You can have any number of categories,
but if you can limit the categories to just two, you will have more options for learning algorithms.
2. Collect Historical Data:-
The source of historical data you choose will be directed in part by the need to collect
historical data with known values for the target variable.
3. Define Predictor Variables :-
These variables are the concrete encoding of the features extracted from the training and
test examples. The predictor variables appear in records for the training and test data and
for the production data.
4. Select a learning algorithm for training the model :-
This is one of the most important parts. There are a number of algorithms, such as:
a) Logistic Regression (SGD)
b) Bayesian
c) Support Vector Machines (SVM)
d) Perceptron and Winnow
e) Neural Network
f) Random Forests
g) Restricted Boltzmann Machines
h) Online Passive Aggressive
i) Boosting
j) Hidden Markov Models (HMM) - Training is done in Map-Reduce
Work Flow for Stage 2 : evaluating the classification model
An essential step before using the classification system in production is to find out
how well it’s likely to work. To do this, you must evaluate the accuracy of the model
and make large or small adjustments as needed before you begin classification.
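As a rough sketch of this evaluation step, the code below withholds part of the labelled data, trains on the rest, and reports percent correct on the withheld part. Everything here is an assumption made for illustration: the data is synthetic, the "model" is a single learned threshold, and the split takes every fifth example so the result is reproducible (a random split is more usual in practice).

```python
def split_train_test(examples, every=5):
    """Withhold every `every`-th example as test data; the rest is training data."""
    train = [e for i, e in enumerate(examples) if i % every != 0]
    test = [e for i, e in enumerate(examples) if i % every == 0]
    return train, test

def train_threshold(train):
    """Learn a cut-off halfway between the largest 'low' and smallest 'high' value."""
    max_low = max(x for x, target in train if target == "low")
    min_high = min(x for x, target in train if target == "high")
    return (max_low + min_high) / 2

def percent_correct(threshold, test):
    """Share of withheld examples whose predicted category matches the known one."""
    hits = sum(1 for x, target in test
               if ("high" if x > threshold else "low") == target)
    return 100.0 * hits / len(test)

# Synthetic labelled data: values of 50 and above are "high".
examples = [(x, "high" if x >= 50 else "low") for x in range(100)]
train_set, test_set = split_train_test(examples)
threshold = train_threshold(train_set)
score = percent_correct(threshold, test_set)
print(threshold, score)
```

The model never saw the boundary value 50 during training, so it misclassifies that one withheld example. This is the kind of weakness that only shows up when the model is tested on data it was not trained on.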
Work Flow for Stage 3 : using the model in production
Once the model’s output has reached an acceptable level of accuracy, classification of new data can
begin. The performance of the classification system in production will depend on several factors, one
of the most important being the quality of the input data. If the new data to be analyzed has
inaccuracies in the values of predictor variables, or if the new data isn’t an appropriate match to the
training data, or if external conditions change over time, the quality of the classification model’s
output will degrade. In order to guard against this problem, periodic retesting of the model is useful,
and retraining may be necessary.
Points about the different steps you should know in detail before starting :-
1. Training the classifier :-
In training, the most important part is feature extraction, from which we derive the predictor
variables.
Note : Your classifier can only be as good as the training data lets it be…
– If you don’t do good data prep, everything will perform poorly
– Data collection and pre-processing takes the bulk of the time
Preparing data for the training algorithm consists of two main steps:
1. Preprocessing raw data: Raw data is rearranged into records with identical fields.
These fields can be of four types: continuous, categorical, word-like, or text-like
in order to be classifiable.
2. Converting data to vectors: Classifiable data is parsed and vectorized using custom
code or tools such as Lucene analyzers and Mahout vector encoders. Some
Mahout classifiers also include vectorization code.
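The two steps can be sketched with custom code as follows. This is a hand-rolled illustration, not Mahout's vector encoders: the field names (amount, country, subject), the category list and the vocabulary are all made up for the example.

```python
def vectorize(record, categories, vocabulary):
    """Turn one preprocessed record into a fixed-length list of numbers."""
    # Continuous field: parse the number as-is.
    vector = [float(record["amount"])]
    # Categorical field: one slot per known category (one-hot encoding).
    vector += [1.0 if record["country"] == c else 0.0 for c in categories]
    # Text-like field: count how often each vocabulary word occurs.
    words = record["subject"].lower().split()
    vector += [float(words.count(w)) for w in vocabulary]
    return vector

# A record with identical fields for every example (step 1 is assumed done).
record = {"amount": "19.99", "country": "DE", "subject": "Free offer free shipping"}
vec = vectorize(record,
                categories=["US", "DE", "FR"],
                vocabulary=["free", "offer", "invoice"])
print(vec)
```

Every record yields a vector of the same length (here 1 + 3 + 3 = 7), which is what vector-based learning algorithms need.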
The features should be chosen very carefully, as they are the basis for the performance of any
classification model. For example :
Sometimes age is better for classification, and sometimes birth
date is better. For instance, in the case of insurance data on car accidents,
age will be a better variable to use because having car accidents is more
related to life-stage than it is to the generation a person belongs to. On
the other hand, in the case of music purchases, birth date might be more
interesting because people often retain early music preferences as they
get older. Their tastes often reflect those of their generation.
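Both variables can be derived from the same raw field. A small sketch, assuming the record holds a birth date and the date of the event being classified (the function names and dates are made up for illustration):

```python
from datetime import date

def age_in_years(birth_date, on_date):
    """Life-stage feature: completed years of age on the date of the event."""
    years = on_date.year - birth_date.year
    # Subtract one if the birthday has not yet happened in that year.
    if (on_date.month, on_date.day) < (birth_date.month, birth_date.day):
        years -= 1
    return years

def birth_decade(birth_date):
    """Generation feature: the decade in which the person was born."""
    return birth_date.year // 10 * 10

born = date(1985, 7, 14)
accident = date(2010, 3, 2)
print(age_in_years(born, accident), birth_decade(born))
```

Which of the two is used as a predictor variable is a modelling decision, as the examples above suggest.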
How to convert data into Vectors :-
Approach : Represent vectors implicitly as bags of words.
Used : In the Bayesian classifier method.
Benefit : Involves one pass and no collisions, and it avoids the need for a dictionary, but it means that it’s
difficult to make use of Mahout’s linear algebra capabilities that require known and consistent
lengths for the Vector objects involved.
There are other techniques, such as feature hashing, which is used with SGD (Stochastic Gradient
Descent) in algorithms such as logistic regression.
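The idea behind feature hashing can be sketched as below. This is a generic illustration, not Mahout's encoder classes: the bucket count of 16 and the use of CRC32 as the hash function are assumptions chosen for the example.

```python
import zlib

def hash_features(tokens, num_buckets=16):
    """Map tokens into a fixed-length vector without building a dictionary."""
    vector = [0.0] * num_buckets
    for token in tokens:
        # A stable hash picks the slot, so the same word always lands in the same place.
        index = zlib.crc32(token.encode("utf-8")) % num_buckets
        vector[index] += 1.0
    return vector

vec = hash_features("free offer free shipping".split())
print(len(vec), sum(vec))
```

Unlike the bag-of-words approach, the vector length is known in advance, but two different words can hash to the same slot (a collision), so a larger number of buckets is usually chosen to keep collisions rare.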
Choosing an algorithm to train the classifier :
The choice of algorithm should follow the size of the training data :
The algorithms differ somewhat in the overhead or cost of training, the size of the data set for which
they’re most efficient, and the complexity of analyses they can deliver.
The algorithms themselves are covered in a later section.
2. Evaluating the classifier :-
To evaluate classifiers, Mahout offers a variety of performance metrics. The main approaches are
percent correct, confusion matrix, AUC, and log likelihood. The naive Bayes and complementary
naive Bayes classifier algorithms are best evaluated using percent correct and confusion matrix. Any
of these methods will work with the SGD algorithm; AUC or log likelihood may be particularly useful,
because they provide insight into the model’s confidence level.
Mahout already provides classes for all of these metrics, so no extra effort is needed
on our part; we can use the Mahout classes directly.
Metric             Supported by Mahout class
Percent correct    CrossFoldLearner
Confusion matrix   ConfusionMatrix, Auc
Entropy matrix     Auc
AUC                Auc, OnlineAuc, CrossFoldLearner, AdaptiveLogisticRegression
Log likelihood     CrossFoldLearner
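To show what these metrics measure, here is a plain sketch that computes them by hand for a binary classifier. It does not use the Mahout classes listed above; the labels, scores and the 0.5 cut-off are made-up values for illustration.

```python
import math
from collections import Counter

def confusion_matrix(actual, predicted):
    """Count how often each (actual, predicted) pair of categories occurs."""
    return Counter(zip(actual, predicted))

def auc(labels, scores):
    """Chance that a random positive example scores above a random negative one."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def mean_log_likelihood(labels, scores):
    """Average log of the probability the model gave to the correct category."""
    return sum(math.log(s if l == 1 else 1.0 - s)
               for l, s in zip(labels, scores)) / len(labels)

labels = [1, 1, 0, 0]              # known target values
scores = [0.9, 0.4, 0.6, 0.2]      # model's estimated probability of category 1
predicted = [1 if s >= 0.5 else 0 for s in scores]

matrix = confusion_matrix(labels, predicted)
percent_correct = 100.0 * sum(a == p for a, p in zip(labels, predicted)) / len(labels)
print(dict(matrix), percent_correct, auc(labels, scores),
      round(mean_log_likelihood(labels, scores), 4))
```

Percent correct and the confusion matrix only look at the final category, while AUC and log likelihood use the scores themselves, which is why they say more about the model's confidence.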
3 .Deploying the classifier :-
The deployment process can be broken down into these steps:
1. Scope out the problem
2. Optimize feature extraction as needed
3. Optimize vector extraction as needed
4. Deploy the scalable classifier service
Naive Bayes :
• Called Naïve Bayes because it’s based on “Bayes’ Rule” and “naively” assumes independence
given the label
– It is only valid to multiply probabilities when the events are independent
– Simplistic assumption in real life
– Despite the name, Naïve Bayes works well on actual datasets
• Simple probabilistic classifier based on
– applying Bayes’ theorem (from Bayesian statistics)
– strong (naive) independence assumptions.
– A more descriptive term for the underlying probability model would be
“independent feature model”.
The Naive Bayes algorithm is a probabilistic classification algorithm. It makes its decisions about
which class to assign to an input document using probabilities derived from training data. The
training process analyzes the relationship between words in the training documents and categories,
and then categories and the entire training set. The available facts are collected using calculations
based on Bayes’ Theorem to produce the probability that a collection of words (a document) belongs
in a certain class.
Bayes’ Theorem states that the probability of a category given a document is equal to the probability
of a document given a category multiplied by the probability of the category, divided by the
probability of a document. This can be expressed as:
P(Category | Document) = P(Document | Category) x P(Category) / P(Document)
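A small worked example of this formula, with the naive assumption that P(Document | Category) is the product of the per-word probabilities. The priors and word probabilities are invented numbers for illustration, not values estimated from real data, and the function name posterior is hypothetical.

```python
import math

def posterior(priors, word_likelihoods, document):
    """Return P(Category | Document) for every category."""
    joint = {}
    for category, prior in priors.items():
        # Naive assumption: multiply the probabilities of the individual words.
        likelihood = math.prod(word_likelihoods[category][w] for w in document)
        joint[category] = likelihood * prior   # P(Document | Category) x P(Category)
    evidence = sum(joint.values())             # P(Document)
    return {category: j / evidence for category, j in joint.items()}

priors = {"spam": 0.4, "ham": 0.6}
word_likelihoods = {
    "spam": {"free": 0.5, "offer": 0.1},
    "ham": {"free": 0.1, "offer": 0.1},
}
result = posterior(priors, word_likelihoods, ["free", "offer"])
print({category: round(p, 3) for category, p in result.items()})
```

With these numbers, P(Document | spam) x P(spam) = 0.5 x 0.1 x 0.4 = 0.02 and P(Document | ham) x P(ham) = 0.1 x 0.1 x 0.6 = 0.006, so P(Document) = 0.026 and P(spam | Document) = 0.02 / 0.026, about 0.77.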

More Related Content

What's hot

Machine Learning Interview Questions and Answers
Machine Learning Interview Questions and AnswersMachine Learning Interview Questions and Answers
Machine Learning Interview Questions and AnswersSatyam Jaiswal
 
detailed Presentation on supervised learning
 detailed Presentation on supervised learning detailed Presentation on supervised learning
detailed Presentation on supervised learningZAMANCHBWN
 
Supervised Machine Learning Techniques common algorithms and its application
Supervised Machine Learning Techniques common algorithms and its applicationSupervised Machine Learning Techniques common algorithms and its application
Supervised Machine Learning Techniques common algorithms and its applicationTara ram Goyal
 
Supervised learning
Supervised learningSupervised learning
Supervised learningAlia Hamwi
 
End-to-End Machine Learning Project
End-to-End Machine Learning ProjectEnd-to-End Machine Learning Project
End-to-End Machine Learning ProjectEng Teong Cheah
 
Machine learning overview
Machine learning overviewMachine learning overview
Machine learning overviewprih_yah
 
Machine learning - session 4
Machine learning - session 4Machine learning - session 4
Machine learning - session 4Luis Borbon
 
Machine Learning
Machine LearningMachine Learning
Machine LearningRahul Kumar
 
Machine Learning Interview Questions
Machine Learning Interview QuestionsMachine Learning Interview Questions
Machine Learning Interview QuestionsRock Interview
 
Supervised Machine Learning Techniques
Supervised Machine Learning TechniquesSupervised Machine Learning Techniques
Supervised Machine Learning TechniquesTara ram Goyal
 
Machine Learning
Machine LearningMachine Learning
Machine LearningShrey Malik
 
Introduction to ML (Machine Learning)
Introduction to ML (Machine Learning)Introduction to ML (Machine Learning)
Introduction to ML (Machine Learning)SwatiTripathi44
 
activelearning.ppt
activelearning.pptactivelearning.ppt
activelearning.pptbutest
 
Machine Learning Unit 2 Semester 3 MSc IT Part 2 Mumbai University
Machine Learning Unit 2 Semester 3  MSc IT Part 2 Mumbai UniversityMachine Learning Unit 2 Semester 3  MSc IT Part 2 Mumbai University
Machine Learning Unit 2 Semester 3 MSc IT Part 2 Mumbai UniversityMadhav Mishra
 

What's hot (20)

Machine Learning Interview Questions and Answers
Machine Learning Interview Questions and AnswersMachine Learning Interview Questions and Answers
Machine Learning Interview Questions and Answers
 
detailed Presentation on supervised learning
 detailed Presentation on supervised learning detailed Presentation on supervised learning
detailed Presentation on supervised learning
 
Supervised Machine Learning Techniques common algorithms and its application
Supervised Machine Learning Techniques common algorithms and its applicationSupervised Machine Learning Techniques common algorithms and its application
Supervised Machine Learning Techniques common algorithms and its application
 
Supervised learning
Supervised learningSupervised learning
Supervised learning
 
End-to-End Machine Learning Project
End-to-End Machine Learning ProjectEnd-to-End Machine Learning Project
End-to-End Machine Learning Project
 
Machine learning overview
Machine learning overviewMachine learning overview
Machine learning overview
 
Machine learning
Machine learningMachine learning
Machine learning
 
Machine learning - session 4
Machine learning - session 4Machine learning - session 4
Machine learning - session 4
 
Machine Learning
Machine LearningMachine Learning
Machine Learning
 
supervised learning
supervised learningsupervised learning
supervised learning
 
Machine Learning Interview Questions
Machine Learning Interview QuestionsMachine Learning Interview Questions
Machine Learning Interview Questions
 
Machine learning
Machine learningMachine learning
Machine learning
 
Supervised Machine Learning Techniques
Supervised Machine Learning TechniquesSupervised Machine Learning Techniques
Supervised Machine Learning Techniques
 
C3 w4
C3 w4C3 w4
C3 w4
 
INTERNSHIP
INTERNSHIPINTERNSHIP
INTERNSHIP
 
Machine Learning
Machine LearningMachine Learning
Machine Learning
 
Introduction to ML (Machine Learning)
Introduction to ML (Machine Learning)Introduction to ML (Machine Learning)
Introduction to ML (Machine Learning)
 
activelearning.ppt
activelearning.pptactivelearning.ppt
activelearning.ppt
 
Machine Learning
Machine LearningMachine Learning
Machine Learning
 
Machine Learning Unit 2 Semester 3 MSc IT Part 2 Mumbai University
Machine Learning Unit 2 Semester 3  MSc IT Part 2 Mumbai UniversityMachine Learning Unit 2 Semester 3  MSc IT Part 2 Mumbai University
Machine Learning Unit 2 Semester 3 MSc IT Part 2 Mumbai University
 

Viewers also liked

Intro to Mahout -- DC Hadoop
Intro to Mahout -- DC HadoopIntro to Mahout -- DC Hadoop
Intro to Mahout -- DC HadoopGrant Ingersoll
 
Apache Mahout Tutorial - Recommendation - 2013/2014
Apache Mahout Tutorial - Recommendation - 2013/2014 Apache Mahout Tutorial - Recommendation - 2013/2014
Apache Mahout Tutorial - Recommendation - 2013/2014 Cataldo Musto
 
Orchestrating the Intelligent Web with Apache Mahout
Orchestrating the Intelligent Web with Apache MahoutOrchestrating the Intelligent Web with Apache Mahout
Orchestrating the Intelligent Web with Apache Mahoutaneeshabakharia
 
Text Analytics Online Knowledge Base / Database
Text Analytics Online Knowledge Base / DatabaseText Analytics Online Knowledge Base / Database
Text Analytics Online Knowledge Base / DatabaseNaveen Kumar
 
SDEC2011 Mahout - the what, the how and the why
SDEC2011 Mahout - the what, the how and the whySDEC2011 Mahout - the what, the how and the why
SDEC2011 Mahout - the what, the how and the whyKorea Sdec
 
Characterization of a dielectric barrier discharge (DBD) for waste gas treatment
Characterization of a dielectric barrier discharge (DBD) for waste gas treatmentCharacterization of a dielectric barrier discharge (DBD) for waste gas treatment
Characterization of a dielectric barrier discharge (DBD) for waste gas treatmentDevansh Sharma
 
Sentiment Analysis in Twitter with Lightweight Discourse Analysis
Sentiment Analysis in Twitter with Lightweight Discourse AnalysisSentiment Analysis in Twitter with Lightweight Discourse Analysis
Sentiment Analysis in Twitter with Lightweight Discourse Analysis Naveen Kumar
 
Hands on Mahout!
Hands on Mahout!Hands on Mahout!
Hands on Mahout!OSCON Byrum
 
Principal Component Analysis(PCA) understanding document
Principal Component Analysis(PCA) understanding documentPrincipal Component Analysis(PCA) understanding document
Principal Component Analysis(PCA) understanding documentNaveen Kumar
 
Machine Learning with Apache Mahout
Machine Learning with Apache MahoutMachine Learning with Apache Mahout
Machine Learning with Apache MahoutDaniel Glauser
 
Yahoo! Mail antispam - Bay area Hadoop user group
Yahoo! Mail antispam - Bay area Hadoop user groupYahoo! Mail antispam - Bay area Hadoop user group
Yahoo! Mail antispam - Bay area Hadoop user groupHadoop User Group
 
A Quick Tutorial on Mahout’s Recommendation Engine (v 0.4)
A Quick Tutorial on Mahout’s Recommendation Engine (v 0.4)A Quick Tutorial on Mahout’s Recommendation Engine (v 0.4)
A Quick Tutorial on Mahout’s Recommendation Engine (v 0.4)Jee Vang, Ph.D.
 
Mahout Tutorial and Hands-on (version 2015)
Mahout Tutorial and Hands-on (version 2015)Mahout Tutorial and Hands-on (version 2015)
Mahout Tutorial and Hands-on (version 2015)Cataldo Musto
 
Buidling large scale recommendation engine
Buidling large scale recommendation engineBuidling large scale recommendation engine
Buidling large scale recommendation engineKeeyong Han
 
Tutorial Mahout - Recommendation
Tutorial Mahout - RecommendationTutorial Mahout - Recommendation
Tutorial Mahout - RecommendationCataldo Musto
 
How to Build a Recommendation Engine on Spark
How to Build a Recommendation Engine on SparkHow to Build a Recommendation Engine on Spark
How to Build a Recommendation Engine on SparkCaserta
 
Introduction to Mahout and Machine Learning
Introduction to Mahout and Machine LearningIntroduction to Mahout and Machine Learning
Introduction to Mahout and Machine LearningVarad Meru
 

Viewers also liked (18)

Intro to Mahout -- DC Hadoop
Intro to Mahout -- DC HadoopIntro to Mahout -- DC Hadoop
Intro to Mahout -- DC Hadoop
 
Apache Mahout Tutorial - Recommendation - 2013/2014
Apache Mahout Tutorial - Recommendation - 2013/2014 Apache Mahout Tutorial - Recommendation - 2013/2014
Apache Mahout Tutorial - Recommendation - 2013/2014
 
Orchestrating the Intelligent Web with Apache Mahout
Orchestrating the Intelligent Web with Apache MahoutOrchestrating the Intelligent Web with Apache Mahout
Orchestrating the Intelligent Web with Apache Mahout
 
Text Analytics Online Knowledge Base / Database
Text Analytics Online Knowledge Base / DatabaseText Analytics Online Knowledge Base / Database
Text Analytics Online Knowledge Base / Database
 
SDEC2011 Mahout - the what, the how and the why
SDEC2011 Mahout - the what, the how and the whySDEC2011 Mahout - the what, the how and the why
SDEC2011 Mahout - the what, the how and the why
 
Characterization of a dielectric barrier discharge (DBD) for waste gas treatment
Characterization of a dielectric barrier discharge (DBD) for waste gas treatmentCharacterization of a dielectric barrier discharge (DBD) for waste gas treatment
Characterization of a dielectric barrier discharge (DBD) for waste gas treatment
 
Mahout part2
Mahout part2Mahout part2
Mahout part2
 
Sentiment Analysis in Twitter with Lightweight Discourse Analysis
Sentiment Analysis in Twitter with Lightweight Discourse AnalysisSentiment Analysis in Twitter with Lightweight Discourse Analysis
Sentiment Analysis in Twitter with Lightweight Discourse Analysis
 
Hands on Mahout!
Hands on Mahout!Hands on Mahout!
Hands on Mahout!
 
Principal Component Analysis(PCA) understanding document
Principal Component Analysis(PCA) understanding documentPrincipal Component Analysis(PCA) understanding document
Principal Component Analysis(PCA) understanding document
 
Machine Learning with Apache Mahout
Machine Learning with Apache MahoutMachine Learning with Apache Mahout
Machine Learning with Apache Mahout
 
Yahoo! Mail antispam - Bay area Hadoop user group
Yahoo! Mail antispam - Bay area Hadoop user groupYahoo! Mail antispam - Bay area Hadoop user group
Yahoo! Mail antispam - Bay area Hadoop user group
 
A Quick Tutorial on Mahout’s Recommendation Engine (v 0.4)
A Quick Tutorial on Mahout’s Recommendation Engine (v 0.4)A Quick Tutorial on Mahout’s Recommendation Engine (v 0.4)
A Quick Tutorial on Mahout’s Recommendation Engine (v 0.4)
 
Mahout Tutorial and Hands-on (version 2015)
Mahout Tutorial and Hands-on (version 2015)Mahout Tutorial and Hands-on (version 2015)
Mahout Tutorial and Hands-on (version 2015)
 
Buidling large scale recommendation engine
Buidling large scale recommendation engineBuidling large scale recommendation engine
Buidling large scale recommendation engine
 
Tutorial Mahout - Recommendation
Tutorial Mahout - RecommendationTutorial Mahout - Recommendation
Tutorial Mahout - Recommendation
 
How to Build a Recommendation Engine on Spark
How to Build a Recommendation Engine on SparkHow to Build a Recommendation Engine on Spark
How to Build a Recommendation Engine on Spark
 
Introduction to Mahout and Machine Learning
Introduction to Mahout and Machine LearningIntroduction to Mahout and Machine Learning
Introduction to Mahout and Machine Learning
 

Similar to Understanding Mahout classification documentation

Initializing & Optimizing Machine Learning Models
Initializing & Optimizing Machine Learning ModelsInitializing & Optimizing Machine Learning Models
Initializing & Optimizing Machine Learning ModelsEng Teong Cheah
 
Supervised learning techniques and applications
Supervised learning techniques and applicationsSupervised learning techniques and applications
Supervised learning techniques and applicationsBenjaminlapid1
 
Machine Learning with Python- Methods for Machine Learning.pptx
Machine Learning with Python- Methods for Machine Learning.pptxMachine Learning with Python- Methods for Machine Learning.pptx
Machine Learning with Python- Methods for Machine Learning.pptxiaeronlineexm
 
Deep Learning Vocabulary.docx
Deep Learning Vocabulary.docxDeep Learning Vocabulary.docx
Deep Learning Vocabulary.docxjaffarbikat
 
Types of Machine Learning- Tanvir Siddike Moin
Types of Machine Learning- Tanvir Siddike MoinTypes of Machine Learning- Tanvir Siddike Moin
Types of Machine Learning- Tanvir Siddike MoinTanvir Moin
 
Data Mining methodology
 Data Mining methodology  Data Mining methodology
Data Mining methodology rebeccatho
 
Machine Learning Contents.pptx
Machine Learning Contents.pptxMachine Learning Contents.pptx
Machine Learning Contents.pptxNaveenkushwaha18
 
machine learning.docx
machine learning.docxmachine learning.docx
machine learning.docxJadhavArjun2
 
IRJET- Machine Learning Techniques for Code Optimization
IRJET-  	  Machine Learning Techniques for Code OptimizationIRJET-  	  Machine Learning Techniques for Code Optimization
IRJET- Machine Learning Techniques for Code OptimizationIRJET Journal
 
Identifying and classifying unknown Network Disruption
Identifying and classifying unknown Network DisruptionIdentifying and classifying unknown Network Disruption
Identifying and classifying unknown Network Disruptionjagan477830
 
Classification of Machine Learning Algorithms
Classification of Machine Learning AlgorithmsClassification of Machine Learning Algorithms
Classification of Machine Learning AlgorithmsAM Publications
 
Net campus2015 antimomusone
Net campus2015 antimomusoneNet campus2015 antimomusone
Net campus2015 antimomusoneDotNetCampus
 
PREDICT THE FUTURE , MACHINE LEARNING & BIG DATA
PREDICT THE FUTURE , MACHINE LEARNING & BIG DATAPREDICT THE FUTURE , MACHINE LEARNING & BIG DATA
PREDICT THE FUTURE , MACHINE LEARNING & BIG DATADotNetCampus
 
Foundational Methodology for Data Science
Foundational Methodology for Data ScienceFoundational Methodology for Data Science
Foundational Methodology for Data ScienceJohn B. Rollins, Ph.D.
 
Rachit Mishra_stock prediction_report
Rachit Mishra_stock prediction_reportRachit Mishra_stock prediction_report
Rachit Mishra_stock prediction_reportRachit Mishra
 
Using machine learning in anti money laundering part 2
Using machine learning in anti money laundering   part 2Using machine learning in anti money laundering   part 2
Using machine learning in anti money laundering part 2Naveen Grover
 
Algorithm ExampleFor the following taskUse the random module .docx
Algorithm ExampleFor the following taskUse the random module .docxAlgorithm ExampleFor the following taskUse the random module .docx
Algorithm ExampleFor the following taskUse the random module .docxdaniahendric
 
Chapter 05 Machine Learning.pptx
Chapter 05 Machine Learning.pptxChapter 05 Machine Learning.pptx
Chapter 05 Machine Learning.pptxssuser957b41
 

Similar to Understanding Mahout classification documentation (20)

Initializing & Optimizing Machine Learning Models
Initializing & Optimizing Machine Learning ModelsInitializing & Optimizing Machine Learning Models
Initializing & Optimizing Machine Learning Models
 
Supervised learning techniques and applications
Supervised learning techniques and applicationsSupervised learning techniques and applications
Supervised learning techniques and applications
 
Machine Learning with Python- Methods for Machine Learning.pptx
Machine Learning with Python- Methods for Machine Learning.pptxMachine Learning with Python- Methods for Machine Learning.pptx
Machine Learning with Python- Methods for Machine Learning.pptx
 
Understanding Mahout classification documentation

  • 1. Mahout Classification
Brief Introduction: “Scalable machine learning library”
Mahout is a Java framework in the data mining and artificial intelligence area. It is a machine learning project of the Apache Software Foundation that builds algorithms which learn from input data.
What is special about Mahout is that it is a scalable library, prepared to deal with huge datasets. Its algorithms are built on top of the Apache Hadoop project, so they work with distributed computing. Mahout aims to be the machine learning tool of choice when the collection of data to be processed is very large, perhaps far too large for a single machine.
Finally, it is a Java library. It doesn't provide a user interface, a prepackaged server, or an installer. It is a framework of tools intended to be used and adapted by developers.
Although Mahout is, in theory, a project open to implementations of all kinds of machine learning techniques, in practice it currently focuses on three key areas of machine learning:
1. Recommender engines
2. Clustering
3. Classification
Some examples of where these are used:
1. Recommender engines: social networking sites like Facebook use variants on recommender techniques to identify people most likely to be as-yet-unconnected friends.
2. Clustering: Google News groups news articles by topic using clustering techniques, in order to present news grouped by logical story rather than as a raw listing of all articles.
3. Classification: Yahoo! Mail decides whether or not incoming messages are spam based on prior emails and spam reports from users, as well as on characteristics of the email itself.
Each of these techniques works best when provided with a large amount of good input data. In some cases, these techniques must not only work on large amounts of input but must also produce results quickly, and these factors make scalability a major issue. As mentioned before, one of Mahout's key reasons for being is to produce implementations of these techniques that do scale up to huge input.
  • 2. We focus here on the classification technique, so we now move on to classification using Mahout.
Classification:
Classification is a simplified form of decision making that gives discrete answers to an individual question. Machine-based classification automates this decision making: it learns from examples of correct decisions and emulates those decisions automatically, a core concept in predictive analytics.
Mahout can be used on a wide range of classification projects, but its advantage over other approaches becomes striking as the number of training examples gets extremely large. What "large" means can vary enormously. Up to about 100,000 examples, other classification systems can be efficient and accurate. But generally, as the input exceeds 1 to 10 million training examples, something scalable like Mahout is needed.
The reason Mahout has an advantage with larger data sets is that, in a non-scalable system, the time or memory required for training may grow faster than linearly as the input data increases. A system that slows by a factor of 2 with twice the data may be acceptable, but if 5 times as much input makes the system take 100 times as long to run, another solution must be found. This is the sort of situation in which Mahout shines.
The following table shows where Mahout is the best choice:
System size (number of examples) | Choice of classification approach
< 100,000 | Traditional, non-Mahout approaches should work very well. Mahout may even be slower for training.
100,000 to 1 million | Mahout begins to be a good choice. The flexible API may make Mahout a preferred choice, even though there is no performance advantage.
1 million to 10 million | Mahout is an excellent choice in this range.
> 10 million | Mahout excels where others fail.
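To put the scaling remark above in numbers: if we assume, purely for illustration, that training time grows as a power of the number of examples (time proportional to n^k), the exponent k can be recovered from how much slower the system gets when the data grows. The sketch below is plain Python rather than Mahout code, and the function name `scaling_exponent` is hypothetical.

```python
import math

def scaling_exponent(data_factor, time_factor):
    """Return k such that data_factor ** k == time_factor.

    Assumes (for illustration only) that training time is proportional
    to n ** k, where n is the number of training examples.
    """
    return math.log(time_factor) / math.log(data_factor)

# Twice the data, twice the time: linear scaling (k = 1).
linear_k = scaling_exponent(2, 2)

# Five times the data, one hundred times the time.
bad_k = scaling_exponent(5, 100)

print(f"acceptable case: k = {linear_k:.2f}")
print(f"problem case:    k = {bad_k:.2f}")
```

Under this assumption the second case corresponds to an exponent between 2 and 3, so every further fivefold increase in data would again multiply the running time by about 100.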
  • 3. Classification algorithms are at the heart of what is called predictive analytics. The goal of predictive analytics is to build automated systems that can make decisions to replicate human judgment. Classification algorithms are a fundamental tool for meeting that goal.
One example of predictive analytics is spam detection. A computer uses the details of user history and features of email messages to determine whether new messages are spam or are relatively welcome email. Another example is credit card fraud detection. A computer uses the recent history of an account and the details of the current transaction to determine whether the transaction is fraudulent.
There are two main phases involved in building a classification system:
1. the creation of a model produced by a learning algorithm,
2. the use of that model to assign new data to categories.
The first phase involves a lot of work, such as selecting the training data, the output categories (the targets), the algorithm through which the system will learn, and the variables used as input.
We should know some terms before going deeper into classification:
Term | Meaning
Model | A computer program that makes decisions; in classification, the output of the training algorithm is a model.
Training data | A subset of training examples labelled with the value of the target variable and used as input to the learning algorithm to produce the model.
Test data | A withheld portion of the training data with the value of the target variable hidden so that it can be used to evaluate the model.
Training | The learning process that uses training data to produce a model. That model can then compute estimates of the target variable given the predictor variables as inputs.
Training example | An entity with features that will be used as input for the learning algorithm.
Feature | A known characteristic of a training or a new example; a feature is equivalent to a characteristic.
Variable | In this context, the value of a feature or a function of several features. This usage is somewhat different from the use of variable in a computer program.
  • 4. Record | A container where an example is stored; such a record is composed of fields.
Field | Part of a record that contains the value of a feature (a variable).
Predictor variable | A feature selected for use as input to a classification model. Not all features need be used. Some features may be algorithmic combinations of other features.
Target variable | A feature that the classification model is attempting to estimate: the target variable is categorical, and its determination is the aim of the classification system.
Workflow of a typical classification project in brief:
Stage | Steps
1. Training the model | Define target variable. Collect historical data. Define predictor variables. Select a learning algorithm. Use the learning algorithm to train the model.
2. Evaluating the model | Run test data. Adjust the input (use different predictor variables, different algorithms, or both).
3. Using the model in production | Input new examples to estimate unknown target values. Retrain the model as needed.
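The three stages above can be sketched end to end in a few lines. This is a toy illustration in plain Python, not Mahout's API: the "learning algorithm" is a deliberately trivial majority-class rule, and the names `split_examples`, `train_majority_model`, and `percent_correct`, along with the made-up examples, are assumptions for this example only.

```python
import random
from collections import Counter

def split_examples(examples, test_fraction=0.2, seed=42):
    """Withhold a portion of the labelled examples as test data."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]  # (training data, test data)

def train_majority_model(training_data):
    """Stage 1: 'train' a model that always predicts the most common target value."""
    counts = Counter(label for _features, label in training_data)
    most_common_label = counts.most_common(1)[0][0]
    return lambda features: most_common_label

def percent_correct(model, test_data):
    """Stage 2: run the test data through the model and score it."""
    hits = sum(1 for features, label in test_data if model(features) == label)
    return 100.0 * hits / len(test_data)

# Each record holds predictor variables (a dict) and the target variable.
examples = [({"length": i}, "spam" if i % 4 == 0 else "ham") for i in range(20)]

train, test = split_examples(examples)
model = train_majority_model(train)
accuracy = percent_correct(model, test)

# Stage 3: estimate the unknown target value of a new example.
prediction = model({"length": 123})
print(len(train), len(test), prediction, accuracy)
```

A real project would replace the majority rule with one of the learning algorithms discussed later; the split, train, evaluate, and predict steps keep the same shape.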
  • 5. Brief Study of the Workflow:
Workflow for Stage 1:
1. Define categories for the target variable: The target variable can't have an open-ended set of possible values. Your choice of categories, in turn, affects your choice of learning algorithms, because some algorithms are limited to binary target variables. You can have any number of categories, but if you can limit them to just two, you will have more options for learning algorithms.
2. Collect historical data: The source of historical data you choose will be directed in part by the need to collect historical data with known values for the target variable.
3. Define predictor variables: These variables are the concrete encoding of the features extracted from the training and test examples. The predictor variables appear in records for the training and test data and for the production data.
4. Select a learning algorithm for training the model: This is one of the most important parts. There are a number of algorithms, such as:
a) Logistic Regression (SGD)
b) Bayesian
c) Support Vector Machines (SVM)
d) Perceptron and Winnow
e) Neural Network
f) Random Forests
g) Restricted Boltzmann Machines
h) Online Passive Aggressive
i) Boosting
j) Hidden Markov Models (HMM) - Training is done in Map-Reduce
Workflow for Stage 2: evaluating the classification model
An essential step before using the classification system in production is to find out how well it is likely to work. To do this, you must evaluate the accuracy of the model and make large or small adjustments as needed before you begin classification.
  • 6. Workflow for Stage 3: using the model in production
Once the model's output has reached an acceptable level of accuracy, classification of new data can begin. The performance of the classification system in production will depend on several factors, one of the most important being the quality of the input data. If the new data to be analyzed has inaccuracies in the values of predictor variables, or if the new data isn't an appropriate match to the training data, or if external conditions change over time, the quality of the classification model's output will degrade. To guard against this problem, periodic retesting of the model is useful, and retraining may be necessary.
Points about the different steps to know in detail before starting:
1. Training the classifier:
In training, the most important part is feature extraction, from which we obtain the predictor variables.
Note: Your classifier can only be as good as the training data lets it be.
– If you don't do good data preparation, everything will perform poorly
– Data collection and pre-processing take the bulk of the time
Preparing data for the training algorithm consists of two main steps:
1. Preprocessing raw data: Raw data is rearranged into records with identical fields. To be classifiable, these fields can be of four types: continuous, categorical, word-like, or text-like.
2. Converting data to vectors: Classifiable data is parsed and vectorized using custom code or tools such as Lucene analyzers and Mahout vector encoders. Some Mahout classifiers also include vectorization code.
The features should be chosen very carefully, as they are the basis for the performance of any classification model. For example, sometimes age is better for classification, and sometimes birth date is better.
For instance, in the case of insurance data on car accidents, age will be a better variable to use because having car accidents is more related to life stage than it is to the generation a person belongs to. On the other hand, in the case of music purchases, birth date might be more interesting because people often retain early music preferences as they get older. Their tastes often reflect those of their generation.
How to convert data into vectors:
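As a rough illustration of what converting a text-like field to a vector can look like, the sketch below hashes each word to a slot in a fixed-length list (feature hashing). It is plain Python, not Mahout's vector encoder classes; the function name `hash_text_to_vector` and the vector size of 16 are assumptions chosen for the example.

```python
import zlib

def hash_text_to_vector(text, size=16):
    """Encode a text-like field as a fixed-length vector of word counts."""
    vector = [0.0] * size
    for word in text.lower().split():
        # crc32 is deterministic across runs, unlike Python's built-in hash().
        index = zlib.crc32(word.encode("utf-8")) % size
        vector[index] += 1.0
    return vector

vec = hash_text_to_vector("free offer free money")
print(vec)
```

No dictionary of known words is needed and the length is always the same, but two different words can land in the same slot; that collision risk is the price of the fixed size.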
  • 7. Approach: Represent vectors implicitly as bags of words.
Used: In the Bayesian classifier method.
Benefit: It involves one pass and no collisions, and it avoids the need for a dictionary. The drawback is that it is difficult to make use of Mahout's linear algebra capabilities, which require known and consistent lengths for the Vector objects involved.
There are other techniques, such as feature hashing, which is used with SGD (Stochastic Gradient Descent) in algorithms such as logistic regression.
Choosing an algorithm to train the classifier:
Choose the algorithm according to the size of the training data. The algorithms differ somewhat in the overhead or cost of training, the size of the data set for which they are most efficient, and the complexity of analyses they can deliver. We will learn about the algorithms in a later section.
2. Evaluating the classifier:
To evaluate classifiers, Mahout offers a variety of performance metrics. The main approaches are percent correct, confusion matrix, AUC, and log likelihood. The naive Bayes and complementary naive Bayes classifier algorithms are best evaluated using percent correct and confusion matrix. Any of these methods will work with the SGD algorithm; AUC or log likelihood may be particularly useful, because they provide insight into the model's confidence level.
Mahout already provides classes for all of these, so no extra effort is needed on our part: we can use the Mahout classes directly.
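To make three of the metrics named above concrete, here is a minimal sketch of a confusion matrix, percent correct, and average log likelihood. It is plain Python for illustration and does not use Mahout's own evaluation classes; the function names and the sample labels and probabilities are assumptions invented for the example.

```python
import math
from collections import Counter

def confusion_matrix(actual, predicted):
    """Count (actual label, predicted label) pairs."""
    return Counter(zip(actual, predicted))

def percent_correct(matrix):
    """Percentage of examples on the diagonal (actual == predicted)."""
    total = sum(matrix.values())
    correct = sum(n for (a, p), n in matrix.items() if a == p)
    return 100.0 * correct / total

def average_log_likelihood(true_label_probs):
    """Mean log of the probability the model gave to the correct label."""
    return sum(math.log(p) for p in true_label_probs) / len(true_label_probs)

actual = ["spam", "spam", "ham", "ham", "ham"]
predicted = ["spam", "ham", "ham", "ham", "spam"]
matrix = confusion_matrix(actual, predicted)
accuracy = percent_correct(matrix)

# Two models that are both always right, but with different confidence.
confident = average_log_likelihood([0.9, 0.9, 0.9])
hesitant = average_log_likelihood([0.6, 0.6, 0.6])

print(dict(matrix), accuracy, confident, hesitant)
```

Percent correct cannot tell the last two models apart, while log likelihood (closer to zero is better) rewards the one that is right with more confidence. This is the sense in which it gives insight into the model's confidence level.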
  • 8. Metric | Supported by Mahout class
Percent correct | CrossFoldLearner
Confusion matrix | ConfusionMatrix, Auc
Entropy matrix | Auc
AUC | Auc, OnlineAuc, CrossFoldLearner, AdaptiveLogisticRegression
Log likelihood | CrossFoldLearner
3. Deploying the classifier:
The deployment process can be broken down into these steps:
1. Scope out the problem
2. Optimize feature extraction as needed
3. Optimize vector extraction as needed
4. Deploy the scalable classifier service
Naive Bayes:
• Called naive Bayes because it is based on Bayes' rule and "naively" assumes independence given the label
– It is only valid to multiply probabilities when the events are independent
– This is a simplistic assumption in real life
– Despite the name, naive Bayes works well on actual datasets
  • 9. • Simple probabilistic classifier based on
– applying Bayes' theorem (from Bayesian statistics)
– strong (naive) independence assumptions
– A more descriptive term for the underlying probability model would be "independent feature model".
The naive Bayes algorithm is a probabilistic classification algorithm. It makes its decisions about which class to assign to an input document using probabilities derived from training data. The training process analyzes the relationship between words in the training documents and categories, and then between categories and the entire training set. The available facts are combined using calculations based on Bayes' theorem to produce the probability that a collection of words (a document) belongs to a certain class.
Bayes' theorem states that the probability of a category given a document is equal to the probability of the document given the category, multiplied by the probability of the category, divided by the probability of the document. This can be expressed as:
P(Category | Document) = P(Document | Category) x P(Category) / P(Document)
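The formula above can be worked through on a tiny example. The sketch below is plain Python rather than Mahout's naive Bayes implementation; it applies the naive independence assumption by multiplying per-word probabilities, and it omits the smoothing a practical classifier would need. The function name `posterior`, the priors, and the word probabilities are all invented for illustration.

```python
from math import prod

def posterior(word_probs, priors, words):
    """Return P(category | document) for every category.

    P(document | category) is taken as the product of P(word | category),
    which is only valid under the naive independence assumption.
    """
    scores = {}
    for category, prior in priors.items():
        likelihood = prod(word_probs[category][w] for w in words)
        scores[category] = likelihood * prior  # numerator of Bayes' theorem
    evidence = sum(scores.values())            # P(document)
    return {category: s / evidence for category, s in scores.items()}

# Hypothetical probabilities, as if estimated from training documents.
priors = {"spam": 0.4, "ham": 0.6}
word_probs = {
    "spam": {"free": 0.5, "meeting": 0.1},
    "ham": {"free": 0.1, "meeting": 0.4},
}

result = posterior(word_probs, priors, ["free", "free"])
print(result)
```

For the document "free free" the numerators are 0.4 x 0.5 x 0.5 = 0.1 for spam and 0.6 x 0.1 x 0.1 = 0.006 for ham, so P(Document) is 0.106 and the document is assigned to spam with probability of about 0.94.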