SlideShare a Scribd company logo
Logistic RegressionLogistic Regression
UsingUsing
Apache MahoutApache Mahout
CS 267 : Data Mining Presentation
Guided by : Dr. Tran
-Tanuvir Singh
OutlineOutline
• Logistic Regression Model
• Mahout Overview
• Logistic Regression using Mahout
• Training and Testing ML Model
• Demo
Logistic Regression ModelLogistic Regression Model
• Logistic regression is a model used for prediction of
the probability of occurrence of an event. It makes
use of several predictor variables that may be either
numerical or categories.
• It is used to predict a binary response from a
binary predictor, used for predicting the outcome of
a categorical dependent variable (i.e., a class
label) based on one or more predictor variables
(features).
• Logistic regression is the standard industry workhorse
that underlies many production fraud detection and
advertising quality and targeting products.
MahoutMahout
• Apache Mahout is a project of the Apache
Software Foundation to produce free
implementations of distributed or otherwise scalable
machine learning algorithms focused primarily in
the areas of collaborative filtering, clustering and
classification.
• The advantage of Mahout over other approaches
becomes striking as the number of training
examples gets extremely large. What large means
can vary enormously.
• As the input exceeds 1 to 10 million training
examples, something scalable like Mahout is
needed.
Logistic Regression Using MahoutLogistic Regression Using Mahout
• Mahout’s implementation of Logistic regression uses
Stochastic Gradient Descent (SGD) algorithm
• This algorithm is a sequential (nonparallel) algorithm,
but it’s fast.
• While working with large data, the SGD algorithm
uses a constant amount of memory regardless of
the size of the input.
• Mahout includes a command line example of
logistic regression program.
• For production use, the logistic regression stuff
mostly is not run from the command line, but is
integrated more tightly into some data flow.
• Mahout’s implementation of Logistic Regression
using SGD supports the following command line
program names:
Valid program names are:
ocat : Print a file or resource as the logistic regression models would see it
orunlogistic : Run a logistic regression model against CSV data
otrainlogistic : Train a logistic regression using stochastic gradient descent
Logistic Regression Using MahoutLogistic Regression Using Mahout
BUILDING THE MODEL
You can build a model to determine the REQUIRED field from the selected features
with the
following command:
$ bin/mahout trainlogistic --input Input.csv 
--output ./model 
--target TARGETVARIABLE--categories 2 
--predictors Predictor1 Predictor2 . . . --types numeric 
--features 20 --passes 100 --rate 50
......
This command specifies that the input comes from the resource named Input.csv,
that the resulting model is stored in the file ./model, that the target variable is in the
field named TARGETVARIABLE and that it has two possible values. The command also
specifies that the algorithm should use variables x and y as predictors, both with
numerical types. The remaining options specify internal parameters for the learning
algorithm.
Logistic Regression Using MahoutLogistic Regression Using Mahout
Handle What it does
--quiet Produces less status and progress output.
--input <file-or-resource> Uses the specified file or resource as input.
--output <file-for-model> Puts the model into the specified file.
--target <variable> Uses the specified variable as the target.
--categories <n> Specifies how many categories the target variable has.
--predictors <v1> ... <vn> Specifies the names of the predictor variables.
--types
<t1> ... <tm> Gives a list of the types of the predictor variables. Each type should be one of numeric,
word, or text. Types can be abbreviated to their first letter. If too few types are given, the last one is
used again as necessary. Use word for categorical variables.
--passes
Specifies the number of times the input data should be reexamined during training. Small input files
may need to be examined dozens of times. Very large input files probably don’t even need to be
completely examined.
--lambda
Controls how much the algorithm tries to eliminate variables from the final model. A value of 0
indicates no effort is made. Typical values are on the order of 0.00001 or less.
--rate
Sets the initial learning rate. This can be large if you have lots of data or use lots of passes because it’s
decreased progressively as data is examined.
--noBias
Eliminates the intercept term (a built-in constant predictor variable) from the model. Occasionally this
is a good idea, but generally it isn’t since the SGD learning algorithm can usually eliminate the
intercept term if warranted.
--features
Sets the size of the internal feature vector to use in building the model. A larger value here can be
helpful, especially with text-like input data.
Logistic Regression Using MahoutLogistic Regression Using Mahout
Handles for Trainlogistic:
EVALUATING THE MODEL
You can evaluate the model by running it on the same training dataset as:
$ bin/mahout runlogistic --input test.csv --model ./model
--auc --confusion
......
This command takes as input the model created in our trainlogistic command
and produces the Area Under the Curve and Confusion Matrix.
for the learning algorithm.
Logistic Regression Using MahoutLogistic Regression Using Mahout
Handle What it does
--quiet Produces less status and progress output.
--auc Prints AUC score for model versus input data after reading data.
--scores Prints target variable value and scores for each input example.
--
confusion Prints confusion matrix for a particular threshold (see --threshold).
--input <input> Reads data records from specified file or resource.
--model <model> Reads model from specified file.
Demo - DatasetDemo - Dataset
• Let’s take an example data set:
"x","y","shape","color","k","k0","xx","xy","yy","a","b","c","bias"
0.923307513352484,0.0135197141207755,21,2,4,8,0.852496764213146,...,1
0.711011884035543,0.909141522599384,22,2,3,9,0.505537899239772,...,1
. . .
0.67132937326096,0.571220482233912,23,1,5,2,0.450683127402953,...,1
0.548616112209857,0.405350996181369,24,1,5,3,0.300979638576258,...,1
0.677980388281867,0.993355110753328,25,2,3,9,0.459657406894831,...,1
• Now we will run our Logistic regression model to predict the value of
Color based on the predictors
"x","y","shape","k","k0","xx","xy","yy","a","b","c","bias"
Demo – Dataset ExplainedDemo – Dataset Explained
Variable Description Possible values
x The x coordinate of a point Numerical from 0 to 1
y The y coordinate of a point Numerical from 0 to 1
shape The shape of a point Shape code from 21 to 25
color Whether the point is filled or not 1=empty, 2=filled
k The k-means cluster ID derived using only x and y Integer cluster code from 1 to 10
k0
The k-means cluster ID derived using x, y, and color Integer
cluster code from 1 to 10
xx The square of the x coordinate Numerical from 0 to 1
xy The product of the x and y coordinates Numerical from 0 to 1
yy The square of the y coordinate Numerical from 0 to 1
a The distance from the point (0,0) Numerical from 0 to
b The distance from the point (1,0) Numerical from 0 to
c The distance from the point (0.5,0.5) Numerical from 0 to
bias A constant 1
Demo – Training the ModelDemo – Training the Model
• To determine the color field from the x and y features with the
following command:
$ bin/mahout trainlogistic --input color.csv 
--output ./model 
--target color --categories 2 
--predictors x y --types numeric 
--features 20 --passes 100 --rate 50
...
• After the model is trained, you can run the model on the training data again
to evaluate how it does (even though we know it won’t do well).
$ bin/mahout runlogistic --input donut.csv --model ./model 
--auc --confusion
AUC = 0.57
confusion: [[27.0, 13.0], [0.0, 0.0]]
• AUC can range from 0 to 0.5 for a model that’s no better than random to 1 for
a perfect model.
• The value here of 0.57 indicates a model that’s hardly better than random.
Demo – Selecting FeaturesDemo – Selecting Features
• You can get more interesting results if you use additional variables for training.
For instance, this command allows the model to use x and y as well as a, b,
and c to build the model:
$ bin/mahout trainlogistic --input .colorcsv --output model 
--target color --categories 2 
--predictors x y a b c --types numeric 
--features 20 --passes 100 --rate 50
• If you run this revised model on the training data, you’ll get a very different
result than with the previous model:
$ bin/mahout runlogistic --input color.csv --model model 
--auc –confusion
Result:
AUC = 1.00
confusion: [[27.0, 0.0], [0.0, 13.0]]
entropy: [[-0.1, -1.5], [-4.0, -0.2]]
• Now you have a perfect value of 1 for AUC, and you can see from the
confusion matrix that the model was able to classify all of the training
examples perfectly.
Demo – EvaluationDemo – Evaluation
• You can get more interesting results if you use additional variables for
training. Now you can run this same model on additional data in the
donut-test.csv resource.
• Because this new data was not used to train the model, there are a
few surprises in store for the model.
$ bin/mahout runlogistic --input color-test.csv --model model 
--auc –confusion
Result
AUC = 0.97
confusion: [[24.0, 2.0], [3.0, 11.0]]
entropy: [[-0.2, -2.8], [-4.1, -0.1]]
• On this withheld data, you can see that the AUC has dropped to
0.97, which is still quite good.
ReferenceReference
• https://
mahout.apache.org/users/classification/logistic-regression.html
• https://blog.trifork.com/2014/02/04/an-introduction-to-mahouts-
logistic-regression-sgd-classifier/
• For incorporating your code by using the libraries:
http://skife.org/mahout/2013/02/14/first_steps_with_mahout.html
• Mahout in Action
Sean Owen, Robin Anil, Ted Dunning, and Ellen Friedman

More Related Content

What's hot

Machine Learning - Simple Linear Regression
Machine Learning - Simple Linear RegressionMachine Learning - Simple Linear Regression
Machine Learning - Simple Linear Regression
Siddharth Shrivastava
 
Classification and Regression
Classification and RegressionClassification and Regression
Classification and Regression
Megha Sharma
 
Introduction to Machine Learning with SciKit-Learn
Introduction to Machine Learning with SciKit-LearnIntroduction to Machine Learning with SciKit-Learn
Introduction to Machine Learning with SciKit-Learn
Benjamin Bengfort
 
Distributed machine learning
Distributed machine learningDistributed machine learning
Distributed machine learning
Stanley Wang
 
DPLYR package in R
DPLYR package in RDPLYR package in R
DPLYR package in R
Bimba Pawar
 
GNU Make, Autotools, CMake 簡介
GNU Make, Autotools, CMake 簡介GNU Make, Autotools, CMake 簡介
GNU Make, Autotools, CMake 簡介
Wen Liao
 
Exploratory data analysis using r
Exploratory data analysis using rExploratory data analysis using r
Exploratory data analysis using r
Tahera Shaikh
 
Machine learning Algorithms
Machine learning AlgorithmsMachine learning Algorithms
Machine learning Algorithms
Walaa Hamdy Assy
 
Introduction to NumPy
Introduction to NumPyIntroduction to NumPy
Introduction to NumPy
Huy Nguyen
 
Analysis Of Attribute Revelance
Analysis Of Attribute RevelanceAnalysis Of Attribute Revelance
Analysis Of Attribute Revelance
pradeepa velmurugan
 
An Overview of Naïve Bayes Classifier
An Overview of Naïve Bayes Classifier An Overview of Naïve Bayes Classifier
An Overview of Naïve Bayes Classifier
ananth
 
Visual Explanation of Ridge Regression and LASSO
Visual Explanation of Ridge Regression and LASSOVisual Explanation of Ridge Regression and LASSO
Visual Explanation of Ridge Regression and LASSO
Kazuki Yoshida
 
002.decision trees
002.decision trees002.decision trees
002.decision trees
hoangminhdong
 
Data visualization tools & techniques - 1
Data visualization tools & techniques - 1Data visualization tools & techniques - 1
Data visualization tools & techniques - 1
Korivi Sravan Kumar
 
Introduction to machine learning-2023-IT-AI and DS.pdf
Introduction to machine learning-2023-IT-AI and DS.pdfIntroduction to machine learning-2023-IT-AI and DS.pdf
Introduction to machine learning-2023-IT-AI and DS.pdf
SisayNegash4
 
01 Data Mining: Concepts and Techniques, 2nd ed.
01 Data Mining: Concepts and Techniques, 2nd ed.01 Data Mining: Concepts and Techniques, 2nd ed.
01 Data Mining: Concepts and Techniques, 2nd ed.
Institute of Technology Telkom
 
Developing R Graphical User Interfaces
Developing R Graphical User InterfacesDeveloping R Graphical User Interfaces
Developing R Graphical User Interfaces
Setia Pramana
 
Feature selection
Feature selectionFeature selection
Feature selection
Dong Guo
 
Star schema PPT
Star schema PPTStar schema PPT
Star schema PPT
Swati Kulkarni Jaipurkar
 

What's hot (20)

Machine Learning - Simple Linear Regression
Machine Learning - Simple Linear RegressionMachine Learning - Simple Linear Regression
Machine Learning - Simple Linear Regression
 
Classification and Regression
Classification and RegressionClassification and Regression
Classification and Regression
 
Introduction to Machine Learning with SciKit-Learn
Introduction to Machine Learning with SciKit-LearnIntroduction to Machine Learning with SciKit-Learn
Introduction to Machine Learning with SciKit-Learn
 
Distributed machine learning
Distributed machine learningDistributed machine learning
Distributed machine learning
 
DPLYR package in R
DPLYR package in RDPLYR package in R
DPLYR package in R
 
GNU Make, Autotools, CMake 簡介
GNU Make, Autotools, CMake 簡介GNU Make, Autotools, CMake 簡介
GNU Make, Autotools, CMake 簡介
 
Exploratory data analysis using r
Exploratory data analysis using rExploratory data analysis using r
Exploratory data analysis using r
 
Machine learning Algorithms
Machine learning AlgorithmsMachine learning Algorithms
Machine learning Algorithms
 
Introduction to NumPy
Introduction to NumPyIntroduction to NumPy
Introduction to NumPy
 
Analysis Of Attribute Revelance
Analysis Of Attribute RevelanceAnalysis Of Attribute Revelance
Analysis Of Attribute Revelance
 
An Overview of Naïve Bayes Classifier
An Overview of Naïve Bayes Classifier An Overview of Naïve Bayes Classifier
An Overview of Naïve Bayes Classifier
 
Visual Explanation of Ridge Regression and LASSO
Visual Explanation of Ridge Regression and LASSOVisual Explanation of Ridge Regression and LASSO
Visual Explanation of Ridge Regression and LASSO
 
002.decision trees
002.decision trees002.decision trees
002.decision trees
 
Data visualization tools & techniques - 1
Data visualization tools & techniques - 1Data visualization tools & techniques - 1
Data visualization tools & techniques - 1
 
Introduction to machine learning-2023-IT-AI and DS.pdf
Introduction to machine learning-2023-IT-AI and DS.pdfIntroduction to machine learning-2023-IT-AI and DS.pdf
Introduction to machine learning-2023-IT-AI and DS.pdf
 
01 Data Mining: Concepts and Techniques, 2nd ed.
01 Data Mining: Concepts and Techniques, 2nd ed.01 Data Mining: Concepts and Techniques, 2nd ed.
01 Data Mining: Concepts and Techniques, 2nd ed.
 
Developing R Graphical User Interfaces
Developing R Graphical User InterfacesDeveloping R Graphical User Interfaces
Developing R Graphical User Interfaces
 
Feature selection
Feature selectionFeature selection
Feature selection
 
Reading Data into R
Reading Data into RReading Data into R
Reading Data into R
 
Star schema PPT
Star schema PPTStar schema PPT
Star schema PPT
 

Viewers also liked

Random forest using apache mahout
Random forest using apache mahoutRandom forest using apache mahout
Random forest using apache mahout
Gaurav Kasliwal
 
Sdforum 11-04-2010
Sdforum 11-04-2010Sdforum 11-04-2010
Sdforum 11-04-2010Ted Dunning
 
MAHOUT classifier tour
MAHOUT classifier tourMAHOUT classifier tour
MAHOUT classifier tour
Ted Dunning
 
Text Classification with Lucene/Solr, Apache Hadoop and LibSVM
Text Classification with Lucene/Solr, Apache Hadoop and LibSVMText Classification with Lucene/Solr, Apache Hadoop and LibSVM
Text Classification with Lucene/Solr, Apache Hadoop and LibSVM
lucenerevolution
 
Machine Learning with Apache Mahout
Machine Learning with Apache MahoutMachine Learning with Apache Mahout
Machine Learning with Apache MahoutDaniel Glauser
 
Analysis Andromeda Galaxy Data Using Spark: Spark Summit East Talk by Jose Na...
Analysis Andromeda Galaxy Data Using Spark: Spark Summit East Talk by Jose Na...Analysis Andromeda Galaxy Data Using Spark: Spark Summit East Talk by Jose Na...
Analysis Andromeda Galaxy Data Using Spark: Spark Summit East Talk by Jose Na...
Spark Summit
 
Spark + Flashblade: Spark Summit East talk by Brian Gold
Spark + Flashblade: Spark Summit East talk by Brian GoldSpark + Flashblade: Spark Summit East talk by Brian Gold
Spark + Flashblade: Spark Summit East talk by Brian Gold
Spark Summit
 
Data-Driven Water Security with Bluemix Apache Spark Service: Spark Summit Ea...
Data-Driven Water Security with Bluemix Apache Spark Service: Spark Summit Ea...Data-Driven Water Security with Bluemix Apache Spark Service: Spark Summit Ea...
Data-Driven Water Security with Bluemix Apache Spark Service: Spark Summit Ea...
Spark Summit
 
Fighting Cybercrime: A Joint Task Force of Real-Time Data and Human Analytics...
Fighting Cybercrime: A Joint Task Force of Real-Time Data and Human Analytics...Fighting Cybercrime: A Joint Task Force of Real-Time Data and Human Analytics...
Fighting Cybercrime: A Joint Task Force of Real-Time Data and Human Analytics...
Spark Summit
 
Optimizing Spark Deployments for Containers: Isolation, Safety, and Performan...
Optimizing Spark Deployments for Containers: Isolation, Safety, and Performan...Optimizing Spark Deployments for Containers: Isolation, Safety, and Performan...
Optimizing Spark Deployments for Containers: Isolation, Safety, and Performan...
Spark Summit
 
Kerberizing Spark: Spark Summit East talk by Abel Rincon and Jorge Lopez-Malla
Kerberizing Spark: Spark Summit East talk by Abel Rincon and Jorge Lopez-MallaKerberizing Spark: Spark Summit East talk by Abel Rincon and Jorge Lopez-Malla
Kerberizing Spark: Spark Summit East talk by Abel Rincon and Jorge Lopez-Malla
Spark Summit
 
Using Spark and Riak for IoT Apps—Patterns and Anti-Patterns: Spark Summit Ea...
Using Spark and Riak for IoT Apps—Patterns and Anti-Patterns: Spark Summit Ea...Using Spark and Riak for IoT Apps—Patterns and Anti-Patterns: Spark Summit Ea...
Using Spark and Riak for IoT Apps—Patterns and Anti-Patterns: Spark Summit Ea...
Spark Summit
 
Horizontally Scalable Relational Databases with Spark: Spark Summit East talk...
Horizontally Scalable Relational Databases with Spark: Spark Summit East talk...Horizontally Scalable Relational Databases with Spark: Spark Summit East talk...
Horizontally Scalable Relational Databases with Spark: Spark Summit East talk...
Spark Summit
 
Improving Python and Spark Performance and Interoperability: Spark Summit Eas...
Improving Python and Spark Performance and Interoperability: Spark Summit Eas...Improving Python and Spark Performance and Interoperability: Spark Summit Eas...
Improving Python and Spark Performance and Interoperability: Spark Summit Eas...
Spark Summit
 
Predictive Analytics for IoT Network Capacity Planning: Spark Summit East tal...
Predictive Analytics for IoT Network Capacity Planning: Spark Summit East tal...Predictive Analytics for IoT Network Capacity Planning: Spark Summit East tal...
Predictive Analytics for IoT Network Capacity Planning: Spark Summit East tal...
Spark Summit
 
Spark SQL: Another 16x Faster After Tungsten: Spark Summit East talk by Brad ...
Spark SQL: Another 16x Faster After Tungsten: Spark Summit East talk by Brad ...Spark SQL: Another 16x Faster After Tungsten: Spark Summit East talk by Brad ...
Spark SQL: Another 16x Faster After Tungsten: Spark Summit East talk by Brad ...
Spark Summit
 
Logistic regression
Logistic regressionLogistic regression
Logistic regression
Khaled Abd Elaziz
 
Logistic regression with SPSS examples
Logistic regression with SPSS examplesLogistic regression with SPSS examples
Logistic regression with SPSS examples
Gaurav Kamboj
 
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
Spark Summit
 

Viewers also liked (20)

Random forest using apache mahout
Random forest using apache mahoutRandom forest using apache mahout
Random forest using apache mahout
 
Sdforum 11-04-2010
Sdforum 11-04-2010Sdforum 11-04-2010
Sdforum 11-04-2010
 
MAHOUT classifier tour
MAHOUT classifier tourMAHOUT classifier tour
MAHOUT classifier tour
 
Text Classification with Lucene/Solr, Apache Hadoop and LibSVM
Text Classification with Lucene/Solr, Apache Hadoop and LibSVMText Classification with Lucene/Solr, Apache Hadoop and LibSVM
Text Classification with Lucene/Solr, Apache Hadoop and LibSVM
 
Machine Learning with Apache Mahout
Machine Learning with Apache MahoutMachine Learning with Apache Mahout
Machine Learning with Apache Mahout
 
Analysis Andromeda Galaxy Data Using Spark: Spark Summit East Talk by Jose Na...
Analysis Andromeda Galaxy Data Using Spark: Spark Summit East Talk by Jose Na...Analysis Andromeda Galaxy Data Using Spark: Spark Summit East Talk by Jose Na...
Analysis Andromeda Galaxy Data Using Spark: Spark Summit East Talk by Jose Na...
 
Spark + Flashblade: Spark Summit East talk by Brian Gold
Spark + Flashblade: Spark Summit East talk by Brian GoldSpark + Flashblade: Spark Summit East talk by Brian Gold
Spark + Flashblade: Spark Summit East talk by Brian Gold
 
Data-Driven Water Security with Bluemix Apache Spark Service: Spark Summit Ea...
Data-Driven Water Security with Bluemix Apache Spark Service: Spark Summit Ea...Data-Driven Water Security with Bluemix Apache Spark Service: Spark Summit Ea...
Data-Driven Water Security with Bluemix Apache Spark Service: Spark Summit Ea...
 
Fighting Cybercrime: A Joint Task Force of Real-Time Data and Human Analytics...
Fighting Cybercrime: A Joint Task Force of Real-Time Data and Human Analytics...Fighting Cybercrime: A Joint Task Force of Real-Time Data and Human Analytics...
Fighting Cybercrime: A Joint Task Force of Real-Time Data and Human Analytics...
 
Optimizing Spark Deployments for Containers: Isolation, Safety, and Performan...
Optimizing Spark Deployments for Containers: Isolation, Safety, and Performan...Optimizing Spark Deployments for Containers: Isolation, Safety, and Performan...
Optimizing Spark Deployments for Containers: Isolation, Safety, and Performan...
 
Kerberizing Spark: Spark Summit East talk by Abel Rincon and Jorge Lopez-Malla
Kerberizing Spark: Spark Summit East talk by Abel Rincon and Jorge Lopez-MallaKerberizing Spark: Spark Summit East talk by Abel Rincon and Jorge Lopez-Malla
Kerberizing Spark: Spark Summit East talk by Abel Rincon and Jorge Lopez-Malla
 
Using Spark and Riak for IoT Apps—Patterns and Anti-Patterns: Spark Summit Ea...
Using Spark and Riak for IoT Apps—Patterns and Anti-Patterns: Spark Summit Ea...Using Spark and Riak for IoT Apps—Patterns and Anti-Patterns: Spark Summit Ea...
Using Spark and Riak for IoT Apps—Patterns and Anti-Patterns: Spark Summit Ea...
 
Horizontally Scalable Relational Databases with Spark: Spark Summit East talk...
Horizontally Scalable Relational Databases with Spark: Spark Summit East talk...Horizontally Scalable Relational Databases with Spark: Spark Summit East talk...
Horizontally Scalable Relational Databases with Spark: Spark Summit East talk...
 
Logistic regression
Logistic regressionLogistic regression
Logistic regression
 
Improving Python and Spark Performance and Interoperability: Spark Summit Eas...
Improving Python and Spark Performance and Interoperability: Spark Summit Eas...Improving Python and Spark Performance and Interoperability: Spark Summit Eas...
Improving Python and Spark Performance and Interoperability: Spark Summit Eas...
 
Predictive Analytics for IoT Network Capacity Planning: Spark Summit East tal...
Predictive Analytics for IoT Network Capacity Planning: Spark Summit East tal...Predictive Analytics for IoT Network Capacity Planning: Spark Summit East tal...
Predictive Analytics for IoT Network Capacity Planning: Spark Summit East tal...
 
Spark SQL: Another 16x Faster After Tungsten: Spark Summit East talk by Brad ...
Spark SQL: Another 16x Faster After Tungsten: Spark Summit East talk by Brad ...Spark SQL: Another 16x Faster After Tungsten: Spark Summit East talk by Brad ...
Spark SQL: Another 16x Faster After Tungsten: Spark Summit East talk by Brad ...
 
Logistic regression
Logistic regressionLogistic regression
Logistic regression
 
Logistic regression with SPSS examples
Logistic regression with SPSS examplesLogistic regression with SPSS examples
Logistic regression with SPSS examples
 
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
 

Similar to Logistic Regression using Mahout

maxbox starter60 machine learning
maxbox starter60 machine learningmaxbox starter60 machine learning
maxbox starter60 machine learning
Max Kleiner
 
mc_simulation documentation
mc_simulation documentationmc_simulation documentation
mc_simulation documentation
Carlo Parodi
 
Competition 1 (blog 1)
Competition 1 (blog 1)Competition 1 (blog 1)
Competition 1 (blog 1)
TarunPaparaju
 
Lab 2: Classification and Regression Prediction Models, training and testing ...
Lab 2: Classification and Regression Prediction Models, training and testing ...Lab 2: Classification and Regression Prediction Models, training and testing ...
Lab 2: Classification and Regression Prediction Models, training and testing ...
Yao Yao
 
Start machine learning in 5 simple steps
Start machine learning in 5 simple stepsStart machine learning in 5 simple steps
Start machine learning in 5 simple steps
Renjith M P
 
maXbox starter65 machinelearning3
maXbox starter65 machinelearning3maXbox starter65 machinelearning3
maXbox starter65 machinelearning3
Max Kleiner
 
XGBoost: the algorithm that wins every competition
XGBoost: the algorithm that wins every competitionXGBoost: the algorithm that wins every competition
XGBoost: the algorithm that wins every competition
Jaroslaw Szymczak
 
Software Testing Techniques
Software Testing TechniquesSoftware Testing Techniques
Software Testing Techniques
Kiran Kumar
 
Test Techniques
Test TechniquesTest Techniques
Test Techniques
nazeer pasha
 
BPstudy sklearn 20180925
BPstudy sklearn 20180925BPstudy sklearn 20180925
BPstudy sklearn 20180925
Shintaro Fukushima
 
Machine Learning Guide maXbox Starter62
Machine Learning Guide maXbox Starter62Machine Learning Guide maXbox Starter62
Machine Learning Guide maXbox Starter62
Max Kleiner
 
Oracle_Analytical_function.pdf
Oracle_Analytical_function.pdfOracle_Analytical_function.pdf
Oracle_Analytical_function.pdf
KalyankumarVenkat1
 
Ludwig: A code-free deep learning toolbox | Piero Molino, Uber AI
Ludwig: A code-free deep learning toolbox | Piero Molino, Uber AILudwig: A code-free deep learning toolbox | Piero Molino, Uber AI
Ludwig: A code-free deep learning toolbox | Piero Molino, Uber AI
Data Science Milan
 
Predicting Employee Attrition
Predicting Employee AttritionPredicting Employee Attrition
Predicting Employee Attrition
Shruti Mohan
 
wk5ppt1_Titanic
wk5ppt1_Titanicwk5ppt1_Titanic
wk5ppt1_Titanic
AliciaWei1
 
Compiler Construction for DLX Processor
Compiler Construction for DLX Processor Compiler Construction for DLX Processor
Compiler Construction for DLX Processor
Soham Kulkarni
 
Two methods for optimising cognitive model parameters
Two methods for optimising cognitive model parametersTwo methods for optimising cognitive model parameters
Two methods for optimising cognitive model parameters
University of Huddersfield
 
Higgs Boson Challenge
Higgs Boson ChallengeHiggs Boson Challenge
Higgs Boson Challenge
Raouf KESKES
 
AnalyticOps - Chicago PAW 2016
AnalyticOps - Chicago PAW 2016AnalyticOps - Chicago PAW 2016
AnalyticOps - Chicago PAW 2016
Robert Grossman
 
Progamming Primer Polymorphism (Method Overloading) VB
Progamming Primer Polymorphism (Method Overloading) VBProgamming Primer Polymorphism (Method Overloading) VB
Progamming Primer Polymorphism (Method Overloading) VB
sunmitraeducation
 

Similar to Logistic Regression using Mahout (20)

maxbox starter60 machine learning
maxbox starter60 machine learningmaxbox starter60 machine learning
maxbox starter60 machine learning
 
mc_simulation documentation
mc_simulation documentationmc_simulation documentation
mc_simulation documentation
 
Competition 1 (blog 1)
Competition 1 (blog 1)Competition 1 (blog 1)
Competition 1 (blog 1)
 
Lab 2: Classification and Regression Prediction Models, training and testing ...
Lab 2: Classification and Regression Prediction Models, training and testing ...Lab 2: Classification and Regression Prediction Models, training and testing ...
Lab 2: Classification and Regression Prediction Models, training and testing ...
 
Start machine learning in 5 simple steps
Start machine learning in 5 simple stepsStart machine learning in 5 simple steps
Start machine learning in 5 simple steps
 
maXbox starter65 machinelearning3
maXbox starter65 machinelearning3maXbox starter65 machinelearning3
maXbox starter65 machinelearning3
 
XGBoost: the algorithm that wins every competition
XGBoost: the algorithm that wins every competitionXGBoost: the algorithm that wins every competition
XGBoost: the algorithm that wins every competition
 
Software Testing Techniques
Software Testing TechniquesSoftware Testing Techniques
Software Testing Techniques
 
Test Techniques
Test TechniquesTest Techniques
Test Techniques
 
BPstudy sklearn 20180925
BPstudy sklearn 20180925BPstudy sklearn 20180925
BPstudy sklearn 20180925
 
Machine Learning Guide maXbox Starter62
Machine Learning Guide maXbox Starter62Machine Learning Guide maXbox Starter62
Machine Learning Guide maXbox Starter62
 
Oracle_Analytical_function.pdf
Oracle_Analytical_function.pdfOracle_Analytical_function.pdf
Oracle_Analytical_function.pdf
 
Ludwig: A code-free deep learning toolbox | Piero Molino, Uber AI
Ludwig: A code-free deep learning toolbox | Piero Molino, Uber AILudwig: A code-free deep learning toolbox | Piero Molino, Uber AI
Ludwig: A code-free deep learning toolbox | Piero Molino, Uber AI
 
Predicting Employee Attrition
Predicting Employee AttritionPredicting Employee Attrition
Predicting Employee Attrition
 
wk5ppt1_Titanic
wk5ppt1_Titanicwk5ppt1_Titanic
wk5ppt1_Titanic
 
Compiler Construction for DLX Processor
Compiler Construction for DLX Processor Compiler Construction for DLX Processor
Compiler Construction for DLX Processor
 
Two methods for optimising cognitive model parameters
Two methods for optimising cognitive model parametersTwo methods for optimising cognitive model parameters
Two methods for optimising cognitive model parameters
 
Higgs Boson Challenge
Higgs Boson ChallengeHiggs Boson Challenge
Higgs Boson Challenge
 
AnalyticOps - Chicago PAW 2016
AnalyticOps - Chicago PAW 2016AnalyticOps - Chicago PAW 2016
AnalyticOps - Chicago PAW 2016
 
Progamming Primer Polymorphism (Method Overloading) VB
Progamming Primer Polymorphism (Method Overloading) VBProgamming Primer Polymorphism (Method Overloading) VB
Progamming Primer Polymorphism (Method Overloading) VB
 

Recently uploaded

The French Revolution Class 9 Study Material pdf free download
The French Revolution Class 9 Study Material pdf free downloadThe French Revolution Class 9 Study Material pdf free download
The French Revolution Class 9 Study Material pdf free download
Vivekanand Anglo Vedic Academy
 
Unit 2- Research Aptitude (UGC NET Paper I).pdf
Unit 2- Research Aptitude (UGC NET Paper I).pdfUnit 2- Research Aptitude (UGC NET Paper I).pdf
Unit 2- Research Aptitude (UGC NET Paper I).pdf
Thiyagu K
 
The Accursed House by Émile Gaboriau.pptx
The Accursed House by Émile Gaboriau.pptxThe Accursed House by Émile Gaboriau.pptx
The Accursed House by Émile Gaboriau.pptx
DhatriParmar
 
"Protectable subject matters, Protection in biotechnology, Protection of othe...
"Protectable subject matters, Protection in biotechnology, Protection of othe..."Protectable subject matters, Protection in biotechnology, Protection of othe...
"Protectable subject matters, Protection in biotechnology, Protection of othe...
SACHIN R KONDAGURI
 
BÀI TẬP BỔ TRỢ TIẾNG ANH GLOBAL SUCCESS LỚP 3 - CẢ NĂM (CÓ FILE NGHE VÀ ĐÁP Á...
BÀI TẬP BỔ TRỢ TIẾNG ANH GLOBAL SUCCESS LỚP 3 - CẢ NĂM (CÓ FILE NGHE VÀ ĐÁP Á...BÀI TẬP BỔ TRỢ TIẾNG ANH GLOBAL SUCCESS LỚP 3 - CẢ NĂM (CÓ FILE NGHE VÀ ĐÁP Á...
BÀI TẬP BỔ TRỢ TIẾNG ANH GLOBAL SUCCESS LỚP 3 - CẢ NĂM (CÓ FILE NGHE VÀ ĐÁP Á...
Nguyen Thanh Tu Collection
 
Francesca Gottschalk - How can education support child empowerment.pptx
Francesca Gottschalk - How can education support child empowerment.pptxFrancesca Gottschalk - How can education support child empowerment.pptx
Francesca Gottschalk - How can education support child empowerment.pptx
EduSkills OECD
 
CACJapan - GROUP Presentation 1- Wk 4.pdf
CACJapan - GROUP Presentation 1- Wk 4.pdfCACJapan - GROUP Presentation 1- Wk 4.pdf
CACJapan - GROUP Presentation 1- Wk 4.pdf
camakaiclarkmusic
 
1.4 modern child centered education - mahatma gandhi-2.pptx
1.4 modern child centered education - mahatma gandhi-2.pptx1.4 modern child centered education - mahatma gandhi-2.pptx
1.4 modern child centered education - mahatma gandhi-2.pptx
JosvitaDsouza2
 
Model Attribute Check Company Auto Property
Model Attribute  Check Company Auto PropertyModel Attribute  Check Company Auto Property
Model Attribute Check Company Auto Property
Celine George
 
678020731-Sumas-y-Restas-Para-Colorear.pdf
678020731-Sumas-y-Restas-Para-Colorear.pdf678020731-Sumas-y-Restas-Para-Colorear.pdf
678020731-Sumas-y-Restas-Para-Colorear.pdf
CarlosHernanMontoyab2
 
Supporting (UKRI) OA monographs at Salford.pptx
Supporting (UKRI) OA monographs at Salford.pptxSupporting (UKRI) OA monographs at Salford.pptx
Supporting (UKRI) OA monographs at Salford.pptx
Jisc
 
How to Make a Field invisible in Odoo 17
How to Make a Field invisible in Odoo 17How to Make a Field invisible in Odoo 17
How to Make a Field invisible in Odoo 17
Celine George
 
Sha'Carri Richardson Presentation 202345
Sha'Carri Richardson Presentation 202345Sha'Carri Richardson Presentation 202345
Sha'Carri Richardson Presentation 202345
beazzy04
 
Operation Blue Star - Saka Neela Tara
Operation Blue Star   -  Saka Neela TaraOperation Blue Star   -  Saka Neela Tara
Operation Blue Star - Saka Neela Tara
Balvir Singh
 
Acetabularia Information For Class 9 .docx
Acetabularia Information For Class 9  .docxAcetabularia Information For Class 9  .docx
Acetabularia Information For Class 9 .docx
vaibhavrinwa19
 
How libraries can support authors with open access requirements for UKRI fund...
How libraries can support authors with open access requirements for UKRI fund...How libraries can support authors with open access requirements for UKRI fund...
How libraries can support authors with open access requirements for UKRI fund...
Jisc
 
2024.06.01 Introducing a competency framework for languag learning materials ...
2024.06.01 Introducing a competency framework for languag learning materials ...2024.06.01 Introducing a competency framework for languag learning materials ...
2024.06.01 Introducing a competency framework for languag learning materials ...
Sandy Millin
 
Biological Screening of Herbal Drugs in detailed.
Biological Screening of Herbal Drugs in detailed.Biological Screening of Herbal Drugs in detailed.
Biological Screening of Herbal Drugs in detailed.
Ashokrao Mane college of Pharmacy Peth-Vadgaon
 
The Roman Empire A Historical Colossus.pdf
The Roman Empire A Historical Colossus.pdfThe Roman Empire A Historical Colossus.pdf
The Roman Empire A Historical Colossus.pdf
kaushalkr1407
 
Additional Benefits for Employee Website.pdf
Additional Benefits for Employee Website.pdfAdditional Benefits for Employee Website.pdf
Additional Benefits for Employee Website.pdf
joachimlavalley1
 

Recently uploaded (20)

The French Revolution Class 9 Study Material pdf free download
The French Revolution Class 9 Study Material pdf free downloadThe French Revolution Class 9 Study Material pdf free download
The French Revolution Class 9 Study Material pdf free download
 
Unit 2- Research Aptitude (UGC NET Paper I).pdf
Unit 2- Research Aptitude (UGC NET Paper I).pdfUnit 2- Research Aptitude (UGC NET Paper I).pdf
Unit 2- Research Aptitude (UGC NET Paper I).pdf
 
The Accursed House by Émile Gaboriau.pptx
The Accursed House by Émile Gaboriau.pptxThe Accursed House by Émile Gaboriau.pptx
The Accursed House by Émile Gaboriau.pptx
 
"Protectable subject matters, Protection in biotechnology, Protection of othe...
"Protectable subject matters, Protection in biotechnology, Protection of othe..."Protectable subject matters, Protection in biotechnology, Protection of othe...
"Protectable subject matters, Protection in biotechnology, Protection of othe...
 
BÀI TẬP BỔ TRỢ TIẾNG ANH GLOBAL SUCCESS LỚP 3 - CẢ NĂM (CÓ FILE NGHE VÀ ĐÁP Á...
BÀI TẬP BỔ TRỢ TIẾNG ANH GLOBAL SUCCESS LỚP 3 - CẢ NĂM (CÓ FILE NGHE VÀ ĐÁP Á...BÀI TẬP BỔ TRỢ TIẾNG ANH GLOBAL SUCCESS LỚP 3 - CẢ NĂM (CÓ FILE NGHE VÀ ĐÁP Á...
BÀI TẬP BỔ TRỢ TIẾNG ANH GLOBAL SUCCESS LỚP 3 - CẢ NĂM (CÓ FILE NGHE VÀ ĐÁP Á...
 
Francesca Gottschalk - How can education support child empowerment.pptx
Francesca Gottschalk - How can education support child empowerment.pptxFrancesca Gottschalk - How can education support child empowerment.pptx
Francesca Gottschalk - How can education support child empowerment.pptx
 
CACJapan - GROUP Presentation 1- Wk 4.pdf
CACJapan - GROUP Presentation 1- Wk 4.pdfCACJapan - GROUP Presentation 1- Wk 4.pdf
CACJapan - GROUP Presentation 1- Wk 4.pdf
 
1.4 modern child centered education - mahatma gandhi-2.pptx
1.4 modern child centered education - mahatma gandhi-2.pptx1.4 modern child centered education - mahatma gandhi-2.pptx
1.4 modern child centered education - mahatma gandhi-2.pptx
 
Model Attribute Check Company Auto Property
Model Attribute  Check Company Auto PropertyModel Attribute  Check Company Auto Property
Model Attribute Check Company Auto Property
 
678020731-Sumas-y-Restas-Para-Colorear.pdf
678020731-Sumas-y-Restas-Para-Colorear.pdf678020731-Sumas-y-Restas-Para-Colorear.pdf
678020731-Sumas-y-Restas-Para-Colorear.pdf
 
Supporting (UKRI) OA monographs at Salford.pptx
Supporting (UKRI) OA monographs at Salford.pptxSupporting (UKRI) OA monographs at Salford.pptx
Supporting (UKRI) OA monographs at Salford.pptx
 
How to Make a Field invisible in Odoo 17
How to Make a Field invisible in Odoo 17How to Make a Field invisible in Odoo 17
How to Make a Field invisible in Odoo 17
 
Sha'Carri Richardson Presentation 202345
Sha'Carri Richardson Presentation 202345Sha'Carri Richardson Presentation 202345
Sha'Carri Richardson Presentation 202345
 
Operation Blue Star - Saka Neela Tara
Operation Blue Star   -  Saka Neela TaraOperation Blue Star   -  Saka Neela Tara
Operation Blue Star - Saka Neela Tara
 
Acetabularia Information For Class 9 .docx
Acetabularia Information For Class 9  .docxAcetabularia Information For Class 9  .docx
Acetabularia Information For Class 9 .docx
 
How libraries can support authors with open access requirements for UKRI fund...
How libraries can support authors with open access requirements for UKRI fund...How libraries can support authors with open access requirements for UKRI fund...
How libraries can support authors with open access requirements for UKRI fund...
 
2024.06.01 Introducing a competency framework for languag learning materials ...
2024.06.01 Introducing a competency framework for languag learning materials ...2024.06.01 Introducing a competency framework for languag learning materials ...
2024.06.01 Introducing a competency framework for languag learning materials ...
 
Biological Screening of Herbal Drugs in detailed.
Biological Screening of Herbal Drugs in detailed.Biological Screening of Herbal Drugs in detailed.
Biological Screening of Herbal Drugs in detailed.
 
The Roman Empire A Historical Colossus.pdf
The Roman Empire A Historical Colossus.pdfThe Roman Empire A Historical Colossus.pdf
The Roman Empire A Historical Colossus.pdf
 
Additional Benefits for Employee Website.pdf
Additional Benefits for Employee Website.pdfAdditional Benefits for Employee Website.pdf
Additional Benefits for Employee Website.pdf
 

Logistic Regression using Mahout

  • 1. Logistic RegressionLogistic Regression UsingUsing Apache MahoutApache Mahout CS 267 : Data Mining Presentation Guided by : Dr. Tran -Tanuvir Singh
  • 2. OutlineOutline • Logistic Regression Model • Mahout Overview • Logistic Regression using Mahout • Training and Testing ML Model • Demo
  • 3. Logistic Regression ModelLogistic Regression Model • Logistic regression is a model used for prediction of the probability of occurrence of an event. It makes use of several predictor variables that may be either numerical or categories. • It is used to predict a binary response from a binary predictor, used for predicting the outcome of a categorical dependent variable (i.e., a class label) based on one or more predictor variables (features). • Logistic regression is the standard industry workhorse that underlies many production fraud detection and advertising quality and targeting products.
  • 4. MahoutMahout • Apache Mahout is a project of the Apache Software Foundation to produce free implementations of distributed or otherwise scalable machine learning algorithms focused primarily in the areas of collaborative filtering, clustering and classification. • The advantage of Mahout over other approaches becomes striking as the number of training examples gets extremely large. What large means can vary enormously. • As the input exceeds 1 to 10 million training examples, something scalable like Mahout is needed.
  • 5. Logistic Regression Using MahoutLogistic Regression Using Mahout • Mahout’s implementation of Logistic regression uses Stochastic Gradient Descent (SGD) algorithm • This algorithm is a sequential (nonparallel) algorithm, but it’s fast. • While working with large data, the SGD algorithm uses a constant amount of memory regardless of the size of the input. • Mahout includes a command line example of logistic regression program. • For production use, the logistic regression stuff mostly is not run from the command line, but is integrated more tightly into some data flow.
  • 6. • Mahout’s implementation of Logistic Regression using SGD supports the following command line program names: Valid program names are: ocat : Print a file or resource as the logistic regression models would see it orunlogistic : Run a logistic regression model against CSV data otrainlogistic : Train a logistic regression using stochastic gradient descent Logistic Regression Using MahoutLogistic Regression Using Mahout
  • 7. BUILDING THE MODEL You can build a model to determine the REQUIRED field from the selected features with the following command: $ bin/mahout trainlogistic --input Input.csv --output ./model --target TARGETVARIABLE--categories 2 --predictors Predictor1 Predictor2 . . . --types numeric --features 20 --passes 100 --rate 50 ...... This command specifies that the input comes from the resource named Input.csv, that the resulting model is stored in the file ./model, that the target variable is in the field named TARGETVARIABLE and that it has two possible values. The command also specifies that the algorithm should use variables x and y as predictors, both with numerical types. The remaining options specify internal parameters for the learning algorithm. Logistic Regression Using MahoutLogistic Regression Using Mahout
  • 8. Handle What it does --quiet Produces less status and progress output. --input <file-or-resource> Uses the specified file or resource as input. --output <file-for-model> Puts the model into the specified file. --target <variable> Uses the specified variable as the target. --categories <n> Specifies how many categories the target variable has. --predictors <v1> ... <vn> Specifies the names of the predictor variables. --types <t1> ... <tm> Gives a list of the types of the predictor variables. Each type should be one of numeric, word, or text. Types can be abbreviated to their first letter. If too few types are given, the last one is used again as necessary. Use word for categorical variables. --passes Specifies the number of times the input data should be reexamined during training. Small input files may need to be examined dozens of times. Very large input files probably don’t even need to be completely examined. --lambda Controls how much the algorithm tries to eliminate variables from the final model. A value of 0 indicates no effort is made. Typical values are on the order of 0.00001 or less. --rate Sets the initial learning rate. This can be large if you have lots of data or use lots of passes because it’s decreased progressively as data is examined. --noBias Eliminates the intercept term (a built-in constant predictor variable) from the model. Occasionally this is a good idea, but generally it isn’t since the SGD learning algorithm can usually eliminate the intercept term if warranted. --features Sets the size of the internal feature vector to use in building the model. A larger value here can be helpful, especially with text-like input data. Logistic Regression Using MahoutLogistic Regression Using Mahout Handles for Trainlogistic:
  • 9. EVALUATING THE MODEL You can evaluate the model by running it on the same training dataset as: $ bin/mahout runlogistic --input test.csv --model ./model --auc --confusion ...... This command takes as input the model created in our trainlogistic command and produces the Area Under the Curve and Confusion Matrix. for the learning algorithm. Logistic Regression Using MahoutLogistic Regression Using Mahout Handle What it does --quiet Produces less status and progress output. --auc Prints AUC score for model versus input data after reading data. --scores Prints target variable value and scores for each input example. -- confusion Prints confusion matrix for a particular threshold (see --threshold). --input <input> Reads data records from specified file or resource. --model <model> Reads model from specified file.
  • 10. Demo - DatasetDemo - Dataset • Let’s take an example data set: "x","y","shape","color","k","k0","xx","xy","yy","a","b","c","bias" 0.923307513352484,0.0135197141207755,21,2,4,8,0.852496764213146,...,1 0.711011884035543,0.909141522599384,22,2,3,9,0.505537899239772,...,1 . . . 0.67132937326096,0.571220482233912,23,1,5,2,0.450683127402953,...,1 0.548616112209857,0.405350996181369,24,1,5,3,0.300979638576258,...,1 0.677980388281867,0.993355110753328,25,2,3,9,0.459657406894831,...,1 • Now we will run our Logistic regression model to predict the value of Color based on the predictors "x","y","shape","k","k0","xx","xy","yy","a","b","c","bias"
  • 11. Demo – Dataset ExplainedDemo – Dataset Explained Variable Description Possible values x The x coordinate of a point Numerical from 0 to 1 y The y coordinate of a point Numerical from 0 to 1 shape The shape of a point Shape code from 21 to 25 color Whether the point is filled or not 1=empty, 2=filled k The k-means cluster ID derived using only x and y Integer cluster code from 1 to 10 k0 The k-means cluster ID derived using x, y, and color Integer cluster code from 1 to 10 xx The square of the x coordinate Numerical from 0 to 1 xy The product of the x and y coordinates Numerical from 0 to 1 yy The square of the y coordinate Numerical from 0 to 1 a The distance from the point (0,0) Numerical from 0 to b The distance from the point (1,0) Numerical from 0 to c The distance from the point (0.5,0.5) Numerical from 0 to bias A constant 1
  • 12. Demo – Training the ModelDemo – Training the Model • To determine the color field from the x and y features with the following command: $ bin/mahout trainlogistic --input color.csv --output ./model --target color --categories 2 --predictors x y --types numeric --features 20 --passes 100 --rate 50 ... • After the model is trained, you can run the model on the training data again to evaluate how it does (even though we know it won’t do well). $ bin/mahout runlogistic --input donut.csv --model ./model --auc --confusion AUC = 0.57 confusion: [[27.0, 13.0], [0.0, 0.0]] • AUC can range from 0 to 0.5 for a model that’s no better than random to 1 for a perfect model. • The value here of 0.57 indicates a model that’s hardly better than random.
  • 13. Demo – Selecting FeaturesDemo – Selecting Features • You can get more interesting results if you use additional variables for training. For instance, this command allows the model to use x and y as well as a, b, and c to build the model: $ bin/mahout trainlogistic --input .colorcsv --output model --target color --categories 2 --predictors x y a b c --types numeric --features 20 --passes 100 --rate 50 • If you run this revised model on the training data, you’ll get a very different result than with the previous model: $ bin/mahout runlogistic --input color.csv --model model --auc –confusion Result: AUC = 1.00 confusion: [[27.0, 0.0], [0.0, 13.0]] entropy: [[-0.1, -1.5], [-4.0, -0.2]] • Now you have a perfect value of 1 for AUC, and you can see from the confusion matrix that the model was able to classify all of the training examples perfectly.
  • 14. Demo – EvaluationDemo – Evaluation • You can get more interesting results if you use additional variables for training. Now you can run this same model on additional data in the donut-test.csv resource. • Because this new data was not used to train the model, there are a few surprises in store for the model. $ bin/mahout runlogistic --input color-test.csv --model model --auc –confusion Result AUC = 0.97 confusion: [[24.0, 2.0], [3.0, 11.0]] entropy: [[-0.2, -2.8], [-4.1, -0.1]] • On this withheld data, you can see that the AUC has dropped to 0.97, which is still quite good.
  • 15. ReferenceReference • https:// mahout.apache.org/users/classification/logistic-regression.html • https://blog.trifork.com/2014/02/04/an-introduction-to-mahouts- logistic-regression-sgd-classifier/ • For incorporating your code by using the libraries: http://skife.org/mahout/2013/02/14/first_steps_with_mahout.html • Mahout in Action Sean Owen, Robin Anil, Ted Dunning, and Ellen Friedman