STATISTICAL LEARNING PROJECT
SUPERVISED LEARNING - DECISION TREES AND LINEAR REGRESSION
AMANPREET SINGH
942091
Contents
1 Introduction
  1.1 What is a Decision Tree?
  1.2 Structure of a Decision Tree
2 Introduction
  2.1 The Algorithm behind Decision Trees
  2.2 How a Decision Tree Algorithm Generates a Tree
3 Introduction
  3.1 Predictions from Decision Trees
  3.2 Classification Trees
  3.3 Advantages of Decision Trees
4 Introduction
  4.1 Measures Used for Split
5 Cal Housing Data
6 Decision Tree with Data on R
  6.1 Analyzing the Data
    6.1.1 Applying Linear Regression to the problem
    6.1.2 Applying Regression Trees to the problem
    6.1.3 Making a Regression Tree Model
    6.1.4 Conclusions
  6.2 Appendix
Chapter 1
Introduction
1.1 What is a Decision Tree?
A Decision Tree is a supervised learning predictive model that uses a set of binary rules to calculate a target
value. It is used for either classification (categorical target variable) or regression (continuous target variable);
hence it is also known as CART (Classification and Regression Trees). Some real-life applications of Decision
Trees include credit scoring models, in which the criteria that cause an applicant to be rejected need to be
clearly documented and free from bias, and marketing studies of customer behaviour such as satisfaction or
churn, which are shared with management or advertising agencies.
1.2 Structure of a Decision Tree
Decision trees have three main parts:
• Root Node: the node that performs the first split, on the most important variable.
• Terminal Nodes/Leaves: the nodes that predict the outcome.
• Branches: arrows connecting nodes, showing the flow from question to answer.
The root node is the starting point of the tree, and the root and internal nodes contain the questions or
criteria to be answered. Each node typically has two or more nodes extending from it. For example, if the
question in the first node requires a "yes" or "no" answer, there will be one node for a "yes" response
and another node for "no."
Figure 1.1: Decision Tree Nodes
Chapter 2
Introduction
2.1 The Algorithm behind Decision Trees
The decision tree algorithm works by repeatedly partitioning the data into multiple sub-spaces
so that the outcomes in each final sub-space are as homogeneous as possible. This approach is technically
called recursive partitioning. The result is a set of rules used for predicting the outcome variable,
which can be either a continuous variable (for regression trees) or a categorical variable (for classification
trees). The decision rules generated by the CART (Classification and Regression Trees) predictive model are
generally visualized as a binary tree.
CART tries to split the data into subsets so that each subset is as pure, or homogeneous, as possible. When
a new observation falls into one of the subsets, its prediction is decided by the majority of the observations
in that particular subset.
2.2 How a Decision Tree Algorithm Generates a Tree
Suppose we want to predict red or grey. The first split tests whether the variable x is less than 60. If yes, the
model predicts red; if no, the model moves on to the next split. The second split checks whether the variable
y is less than 20. If no, the model predicts grey; if yes, the model moves on to the next split. The third split
checks whether the variable x is less than 85. If yes, the model predicts red; if no, it predicts grey.
Figure 2.1: Decision Tree Nodes
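The rules described above can be written directly as a small prediction function. The sketch below is only an
illustrative encoding of this example tree: the thresholds 60, 20 and 85 come from the description, while the
function name predict_colour is a hypothetical helper, not part of the project code.

# Illustrative encoding of the example tree described above (hypothetical helper).
predict_colour <- function(x, y) {
  if (x < 60) {
    "red"               # first split: x < 60 -> red
  } else if (y >= 20) {
    "grey"              # second split: y >= 20 -> grey
  } else if (x < 85) {
    "red"               # third split: x < 85 -> red
  } else {
    "grey"
  }
}

predict_colour(50, 10)   # "red"
predict_colour(90, 10)   # "grey"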
Chapter 3
Introduction
3.1 Predictions from Decision Trees
In the above example we discussed classification trees, i.e. trees whose output is a factor/category: red or
grey. Trees can also be used for regression, where the output at each leaf of the tree is no longer a category
but a number. These are called Regression Trees.
3.2 Classification Trees
With Regression Trees, we report the average outcome at each leaf of the tree. With Classification Trees,
instead of just taking the majority outcome to be the prediction, we can compute the percentage of data of
each type of outcome within the leaf.
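As a hedged illustration (using rpart on a made-up binary outcome, not the project data), a classification tree
built with rpart can return either the majority class or the class proportions at each leaf:

# Hedged illustration: majority class vs class proportions at the leaves (toy data).
library(rpart)
set.seed(1)
toy <- data.frame(x = runif(200), y = runif(200))
toy$colour <- factor(ifelse(toy$x < 0.6, "red", "grey"))   # made-up labelling rule
ctree <- rpart(colour ~ x + y, data = toy, method = "class")
predict(ctree, type = "class")[1:5]    # majority outcome at the leaf
predict(ctree, type = "prob")[1:5, ]   # percentage of each outcome at the leaf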
3.3 Advantages of Decision Trees
• It is quite interpretable and easy to understand.
• It can also be used to identify the most significant variables in a data set.
Figure 3.1: Splits in CART
Chapter 4
Introduction
4.1 Measures Used for Split
There are different measures that can be used to decide how a node is split.
• Gini Index: a measure of the inequality of a distribution. It relates to the probability that two items
selected at random from a population belong to the same class; this probability is 1 if the population is
pure. It works with a categorical target variable ("Success" or "Failure") and performs only binary splits.
The lower the value of Gini, the higher the homogeneity. CART uses the Gini method to create binary
splits. A small sketch of how these measures can be computed is given after this list.
• Entropy: a way to measure impurity. Less impure nodes require less information to describe them,
and more impure nodes require more information. If the sample is completely homogeneous, the entropy
is zero; if the sample is equally divided between two classes, its entropy is one.
• Information Gain: a mathematical way to capture the amount of information gained (or the reduction
in randomness) by splitting on a particular attribute. In a decision tree algorithm, we start at the tree
root and split the data on the feature that results in the largest information gain (IG). In other words,
IG tells us how important a given attribute is. The Information Gain of a split can be defined as the
entropy of the parent node minus the weighted average entropy of its child nodes:
IG = Entropy(parent) - sum_j (N_j / N) * Entropy(child_j)
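As a hedged illustration of these measures, the short R sketch below computes the Gini impurity, the entropy
and the information gain of a candidate split on a small made-up label vector; the helper functions and the
toy data are my own, not part of the project code.

# Toy illustration of the split measures above (hypothetical helpers, made-up labels).
gini <- function(y) {
  p <- table(y) / length(y)
  1 - sum(p^2)                       # Gini impurity: 0 for a pure node
}
entropy <- function(y) {
  p <- table(y) / length(y)
  p <- p[p > 0]
  -sum(p * log2(p))                  # 0 for a pure node, 1 for a 50/50 binary node
}
info_gain <- function(parent, children) {
  # entropy of the parent minus the weighted average entropy of the children
  n <- length(parent)
  weighted <- sum(sapply(children, function(ch) length(ch) / n * entropy(ch)))
  entropy(parent) - weighted
}
parent <- c("red", "red", "red", "grey", "grey", "grey")
left   <- c("red", "red", "red")      # candidate split: left child
right  <- c("grey", "grey", "grey")   # candidate split: right child
gini(parent)                          # 0.5
entropy(parent)                       # 1
info_gain(parent, list(left, right))  # 1: the split separates the classes perfectly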
Chapter 5
Cal Housing Data
I worked on the Cal Housing (California housing) dataset provided by Prof. Cesa-Bianchi. The data contains
various variables that may or may not affect the prices of houses in a certain area. I explored the relationship
between house prices and these variables with the aid of decision trees and regression trees, and initially tried
to build a model of how prices vary by location across a region. The variables are listed below, followed by a
short sketch of how the data can be loaded.
longitude and latitude = location of the area of the particular houses
housing median age = median age of the houses in that area
total rooms = total number of rooms of the houses in that area
total bedrooms = total number of bedrooms out of those total rooms
population = population living in that area
households = number of households in that area
median income = median income of the households living in the area
median house value = median house value in that area
ocean proximity = the houses' proximity to the ocean, classified into:
< 1H OCEAN
NEAR BAY
INLAND
ISLAND
NEAR OCEAN
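As a hedged sketch of loading and inspecting the data (the file name cal_housing.csv is an assumption; the
actual file provided for the course may be named differently):

# Load the housing data; the file name is an assumption made for illustration.
dat <- read.csv("cal_housing.csv", stringsAsFactors = TRUE)
str(dat)                          # 20640 observations of the variables listed above
summary(dat$median_house_value)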
Chapter 6
Decision Tree with Data on R
6.1 Analyzing the Data
Figure 6.1: Appendix Ref 1
There are 20640 observations corresponding to 20640 census tracts. I am interested in building a model of
how prices vary by location across a region. So, let's first see how the points are laid out.
Figure 6.2: Appendix Ref 2
Now let's plot the different region types on the map. From studying the data and from real-estate knowledge,
we know that houses near the ocean tend to be more expensive, and houses on islands, being more exclusive,
tend to be high luxury.
< 1H OCEAN =Purple
NEAR BAY =Blue
INLAND =Black
ISLAND =Green
NEAR OCEAN=Pink
Figure 6.3: Appendix Ref 3
6.1.1 Applying Linear Regression to the problem
Since this is a regression problem, with a continuous target value (house price), it is natural to turn first to a
Linear Regression algorithm. However, as seen in the last figure, the house prices are distributed over the area
in an interesting way according to ocean proximity, certainly not in a linear way, so I expect Linear Regression
not to work very well here.
Figure 6.4: Appendix Ref 4
• R-squared is around 0.24, which is not great.
• Latitude is not significant, which means the north-south location differences aren't going to be used at
all. This seems unlikely.
• Longitude is significant but negative, which means that house prices decrease linearly as we move east,
which is also unlikely.
To see how this linear regression model looks on a plot, I plot the census tracts again and mark the tracts with
above-median house prices with bright red dots. The red dots show the actual positions in the data where
houses are costly. I then compare them with the Linear Regression predictions, marked with a blue sign.
Figure 6.5: Appendix Ref 5
The linear regression model plots a blue euro sign every time it predicts that a census tract is above a value
of 300,000. It is almost a sharp line that the linear regression defines. Also notice the shape is almost vertical,
since the latitude variable was not very significant in the regression. The blue and the red dots do not overlap,
especially in the east and north-west. It turns out the linear regression model isn't really doing a good job,
and it has completely ignored everything to the right and left side of the picture. So that's interesting and
pretty wrong.
6.1.2 Applying Regression Trees to the problem
I built a regression tree using the rpart command, predicting median house value as a function of latitude
and longitude over the whole dataset.
Figure 6.6: Appendix Ref 6
The leaves of the tree are important.
• In a classification tree, the leaves are the classes that the tree assigns.
• In a regression tree, each leaf instead predicts a number. That number here is the average of the median
house values in that bucket.
Figure 6.7: Appendix Ref 7
Where the fitted values are greater than 300,000, the colour is blue. The regression tree has done a much
better job and largely overlaps the red dots. It has left the low-value area out, and has correctly managed to
classify some of those points in the bottom right and top right, but it is still not accurate, as it missed the
houses in the centre. I now extend the regression tree to improve the model and its predictions.
6.1.3 Making a Regression Tree Model
• Splitting the data into training and test sets on median house value
• Adding the important variables to the regression tree: latitude, longitude, median income and population
• Building the regression tree
Figure 6.8: Appendix Ref 8
Results: median income is the most important variable and the first split, so the decision starts from there;
the algorithm then further distinguishes between income values, and finally makes decisions based on the
locations, longitude and latitude.
The sum of squared errors on the test set is tree.sse = 3.269545e+13.
Figure 6.9: Appendix Ref 9
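To put the raw SSE figure in some context, one hedged option (not part of the analysis above) is to compare
it against a trivial baseline that always predicts the mean house value of the training set:

# Hedged sketch: compare the tree's test SSE against a constant-mean baseline.
baseline.pred <- mean(train$median_house_value)
baseline.sse  <- sum((baseline.pred - test$median_house_value)^2)
baseline.sse              # sum of squared errors of the baseline
tree.sse / baseline.sse   # fraction of baseline error left by the tree (lower is better)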
6.1.4 Conclusions
Even though decision trees appear to work very well in certain conditions, they come with their own drawbacks.
The model has a very high chance of over-fitting: if no limit is set, in the worst case it will end up putting
each observation into its own leaf node. If I further use PCA and cross-validation together with the regression,
I may be able to reduce the over-fitting problem that I can see with the decision tree right now. I am going
to carry out that further implementation of PCA and cross-validation in the project on UNSUPERVISED
LEARNING with this same data.
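As a hedged sketch of one standard remedy (not part of the analysis above), rpart already performs internal
cross-validation over the complexity parameter cp, which can be used to prune the tree and limit over-fitting:

# Hedged sketch: prune the regression tree using rpart's built-in cross-validation.
library(rpart)
bigtree <- rpart(median_house_value ~ latitude + longitude + median_income + population ,
                 data=train , cp=0.001)    # grow a deliberately large tree
printcp(bigtree)                           # cross-validated error for each value of cp
best.cp <- bigtree$cptable[which.min(bigtree$cptable[, "xerror"]), "CP"]
pruned  <- prune(bigtree, cp=best.cp)      # keep only splits that help on held-out folds
pruned.sse <- sum((predict(pruned , newdata=test) - test$median_house_value)^2)
pruned.sse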
Appendix
1
#Analyze the data
str(dat)
2
#Plot the house locations (longitude vs latitude)
plot(dat$longitude ,dat$latitude)
3
#Marking the different regions: <1H OCEAN = purple, NEAR BAY = blue, INLAND = black, ISLAND = green, NEAR OCEAN = pink
plot(dat$longitude ,dat$latitude)
points(dat$longitude[dat$ocean_proximity =="NEAR BAY"], dat$latitude[dat$ocean_proximity =="NEAR BAY"], col="blue", pch=19)
points(dat$longitude[dat$ocean_proximity =="ISLAND"], dat$latitude[dat$ocean_proximity =="ISLAND"], col="green", pch=19)
points(dat$longitude[dat$ocean_proximity =="NEAR OCEAN"], dat$latitude[dat$ocean_proximity =="NEAR OCEAN"], col="pink", pch=19)
points(dat$longitude[dat$ocean_proximity =="INLAND"], dat$latitude[dat$ocean_proximity =="INLAND"], col="black", pch=19)
points(dat$longitude[dat$ocean_proximity =="<1H OCEAN"], dat$latitude[dat$ocean_proximity =="<1H OCEAN"], col="purple", pch=19)
4
#Performing linear regression (model reconstructed from the discussion: latlonlm is used in item 5 below)
latlonlm = lm(median_house_value ~ latitude + longitude , data=dat)
summary(latlonlm)
5
#Comparing linear regression results with real data
plot(dat$latitude ,dat$longitude)
points(dat$latitude[dat$median_house_value >=300000], dat$longitude[dat$median_house_value >=300000], col="red", pch=20)
points(dat$latitude[latlonlm$fitted.values >= 300000], dat$longitude[latlonlm$fitted.values >= 300000], col="blue", pch="€")
6
#Making Decision Tree
library(rpart)
library(rpart.plot)
latlontree = rpart(median_house_value ~ latitude + longitude , data=dat)
prp(latlontree)
7
#Comparing Decision Tree Results with Real Data
plot(dat$latitude ,dat$longitude)
points( dat$latitude[dat$median_house_value >=300000],dat$longitude[dat$median_house_value >=300000], col="red", pch=20)
fittedvalues = predict(latlontree)
points(dat$latitude[fittedvalues >= 300000], dat$longitude[fittedvalues >= 300000], col="blue", pch="$")
#Training and Splitting the Data
library(caTools)
set.seed(123)
split = sample.split(dat$median_house_value , SplitRatio = 0.7)
train = subset(dat , split == TRUE)
test = subset(dat , split == FALSE)
8
# Create a CART model
tree = rpart(median_house_value ~ latitude + longitude + median_income + population , data=train)
prp(tree)
# Finding results for Regression Tree
tree.pred = predict(tree , newdata=test)
tree.sse = sum(( tree.pred - test$median_house_value )^2)
tree.sse
9
# Comparing results of Regression Tree with Real data
fittedvalues2 <- tree.pred
plot(dat$latitude ,dat$longitude)
points( dat$latitude[dat$median_house_value >=300000],dat$longitude[dat$median_house_value >=300000], col="red", pch=20)
# tree.pred has one prediction per row of the test set, so index test rather than dat
points(test$latitude[fittedvalues2 >= 300000], test$longitude[fittedvalues2 >= 300000], col="blue", pch="$")
List of Figures
1.1 Decision Tree Nodes
2.1 Decision Tree Nodes
3.1 Splits in CART
6.1 Appendix Ref 1
6.2 Appendix Ref 2
6.3 Appendix Ref 3
6.4 Appendix Ref 4
6.5 Appendix Ref 5
6.6 Appendix Ref 6
6.7 Appendix Ref 7
6.8 Appendix Ref 8
6.9 Appendix Ref 9