Exploratory Data Analysis (EDA)
by Melvin Ott, PhD
September, 2017
Introduction
The Masters in Predictive Analytics program at Northwestern University offers graduate courses that cover predictive modeling using several software products such as SAS, R, and Python. The Predict 410 course is one of the core courses, and this section focuses on using Python.

Predict 410 will follow a sequence in the assignments. The first assignment will ask you to perform an EDA (see Ratner1, Chapters 1 and 2) on the Ames Housing Data dataset to determine the best single-variable model. It will be followed by an assignment to expand to a multivariable model. Python software for boxplots, scatterplots, and more will help you identify the single variable. However, it is easy to get lost in the programming and lose sight of the objective: namely, which of the variable choices best explains the variability in the response variable?

(You will need to be familiar with the data types and levels of measurement. This will be critical in deciding when to use a dummy variable for model building. If this topic is new to you, review the definitions under Types of Data before reading further.)

This report will help you become familiar with some of the tools for EDA and will let you interact with the data through links to a software product, Shiny, that produces various plots of the data interactively. Shiny is located on a cloud server and will allow you to make choices in looking at the plots for the data. Study the plots carefully. This is your initial EDA tool, and it leads to your model building and your overall understanding of predictive analytics.
Single Variable Linear Regression EDA
1. Become Familiar With the Data
Identify the variables that are categorical and the variables that are quantitative. For the Ames Housing Data, you should review the Ames Data Description pdf file.
2. Look at Plots of the Data
For the variables that are quantitative, you should look at scatter plots against the response variable saleprice. For the categorical variables, look at boxplots against saleprice. You have sample Python code to help with the EDA, and below are some links that will demonstrate the relationships for a different building_prices dataset.
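The plots described above can be sketched in Python with pandas and matplotlib. This is a minimal sketch: the column names (grlivarea, neighborhood, saleprice) and the tiny DataFrame are illustrative stand-ins for the Ames data, not the actual file.

```python
import matplotlib
matplotlib.use('Agg')  # render off-screen so the script runs anywhere
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical stand-in for the Ames training data
df = pd.DataFrame({
    'grlivarea':    [850, 1200, 1500, 1750, 2100, 2400],
    'neighborhood': ['A', 'A', 'B', 'B', 'C', 'C'],
    'saleprice':    [105000, 140000, 168000, 185000, 230000, 255000],
})

# Scatter plot: a quantitative predictor versus the response
fig, ax = plt.subplots()
ax.scatter(df['grlivarea'], df['saleprice'])
ax.set_xlabel('grlivarea')
ax.set_ylabel('saleprice')
fig.savefig('scatter_grlivarea.png')

# Boxplot: the response grouped by a categorical predictor
fig2, ax2 = plt.subplots()
df.boxplot(column='saleprice', by='neighborhood', ax=ax2)
fig2.savefig('boxplot_neighborhood.png')
```

With the real data, replacing the DataFrame literal with pd.read_csv on the training file gives the same two plots.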
For the boxplots with Shiny: http://melvin.shinyapps.io/SboxPlot
For the scatterplots with Shiny: http://melvin.shinyapps.io/SScatter/

3. Begin Writing Python Code
Start with the shell code and improve on the model provided.
Single Variable Logistic Regression EDA
1. Become Familiar With the Data
In 411 you will have an introduction to logistic regression, and it will again ask you to perform an EDA. See the file credit data for more info. Make sure you recognize which variables are quantitative and which are categorical. And, for several of these variables, what is the level of measurement?
2. Look at Plots of the Data
For logistic regression, the response variable is of the yes/no type. In this dataset it is coded as good/bad. So the EDA may include histograms for quantitative variables, with a separate histogram for each of the response values. For numerically coded explanatory categorical variables, if the response good/bad is recoded as 0/1, then the mean of the response variable within each of the categories will indicate whether there is a relationship.
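The group-means check described above can be sketched as follows. The tiny DataFrame is made up for illustration; the real credit file would be loaded with pd.read_csv instead.

```python
import pandas as pd

# Made-up rows standing in for the CREDIT data
credit = pd.DataFrame({
    'good_bad': ['good', 'bad', 'good', 'good', 'bad', 'good'],
    'checking': [4, 1, 3, 4, 1, 2],
})

# Recode the response good/bad as 0/1 (1 = bad credit)
credit['bad01'] = (credit['good_bad'] == 'bad').astype(int)

# Mean of the 0/1 response within each category of a coded
# categorical predictor; differing means suggest a relationship
print(credit.groupby('checking')['bad01'].mean())
```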
For the histograms with Shiny: http://melvin.shinyapps.io/SHisto
For the means with Shiny: http://melvin.shinyapps.io/SCRMeans

3. Begin Writing Python Code
OK, you have looked at the plots; which variable do you think will be most useful for predicting or explaining bad credit? After you answer this question, begin writing Python code to see if you can replicate these plots.
The data set CREDIT contains information on 1000 customers. There are 21 variables in the data set, listed as Name (Model Role, Measurement Level): Description.
AGE (Input, Interval): Age in years
AMOUNT (Input, Interval): Amount of credit requested
CHECKING (Input, Nominal or Ordinal): Balance in existing checking account:
  1 = less than 0 DM
  2 = more than 0 but less than 200 DM
  3 = at least 200 DM
  4 = no checking account
COAPP (Input, Nominal): Other debtors or guarantors:
  1 = none
  2 = co-applicant
  3 = guarantor
DEPENDS (Input, Interval): Number of dependents
DURATION (Input, Interval): Length of loan in months
EMPLOYED (Input, Ordinal): Time at present employment:
  1 = unemployed
  2 = less than 1 year
  3 = at least 1, but less than 4 years
  4 = at least 4, but less than 7 years
  5 = at least 7 years
EXISTCR (Input, Interval): Number of existing accounts at this bank
FOREIGN (Input, Binary): Foreign worker:
  1 = yes
  2 = no
GOOD_BAD (Target, Binary): Credit rating status (good or bad)
HISTORY (Input, Ordinal): Credit history:
  0 = no loans taken / all loans paid back in full and on time
  1 = all loans at this bank paid back in full and on time
  2 = all loans paid back on time until now
  3 = late payments on previous loans
  4 = critical account / loans in arrears at other banks
HOUSING (Input, Nominal): Rent/own:
  1 = rent
  2 = own
  3 = free housing
INSTALLP (Input, Interval): Debt as a percent of disposable income
JOB (Input, Ordinal): Employment status:
  1 = unemployed / unskilled non-resident
  2 = unskilled resident
  3 = skilled employee / official
  4 = management / self-employed / highly skilled employee / officer
MARITAL (Input, Nominal): Marital status and gender:
  1 = male, divorced/separated
  2 = female, divorced/separated/married
  3 = male, single
  4 = male, married/widowed
  5 = female, single
OTHER (Input, Nominal or Ordinal): Other installment loans:
  1 = bank
  2 = stores
  3 = none
PROPERTY (Input, Nominal or Ordinal): Collateral property for loan:
  1 = real estate
  2 = if not 1, building society savings agreement / life insurance
  3 = if not 1 or 2, car or others
  4 = unknown / no property
PURPOSE (Input, Nominal): Reason for loan request:
  0 = new car
  1 = used car
  2 = furniture/equipment
  3 = radio/television
  4 = domestic appliances
  5 = repairs
  6 = education
  7 = vacation
  8 = retraining
  9 = business
  x = other
RESIDENT (Input, Interval): Years at current address
SAVINGS (Input, Nominal or Ordinal): Savings account balance:
  1 = less than 100 DM
  2 = at least 100, but less than 500 DM
  3 = at least 500, but less than 1000 DM
  4 = at least 1000 DM
  5 = unknown / no savings account
TELEPHON (Input, Binary): Telephone:
  1 = none
  2 = yes, registered under the customer's name
Exploratory Data Analysis (EDA)
Ratner1 describes ‘data mining’ “as any process that finds
unexpected structures in
data and uses the EDA framework to ensure that the process
explores the data,
not exploits it.” Unexpected suggests that the word exploratory
is very
appropriate to this process.
Tukey2 in his book and in many presentations gave structure to
EDA. Others have
extended it to include ‘big’ data. Big data has occurred due to
our ability to
capture huge datasets, store it on servers cost effectively, and
analyze it with
software that will handle it.
Shiny App s
To learn more about Shiny applications with RStudio click on
the link below:
13. http://rstudio.github.io/shiny/tutorial/
http://rstudio.github.io/shiny/tutorial/
Types of Data
Quantitative data are numeric and represent counts or measurements.
Categorical data are names or labels such as a, b, c, but can often be shown as 1, 2, 3. They do not suggest counts or measurements.
Discrete data are finite or countable numeric data.
Continuous data are values that represent a continuous scale of measurement.
A nominal level of measurement suggests names or categories. There is no apparent order suggested.
Ordinal level data suggest a sequential ordering, but mathematical calculations should not be performed on this data.
Interval level data are ordinal, plus the difference between two data values is meaningful. However, there is no true zero level.
Ratio level data are interval and have a true zero level, so differences and ratios may be calculated.
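These distinctions drive the dummy-variable decision mentioned in the introduction: nominal codes such as 1/2/3 are labels, not magnitudes, so they are usually expanded into indicator (dummy) columns before regression. A small sketch, using a hypothetical housing column:

```python
import pandas as pd

# Nominal codes: 1 = rent, 2 = own, 3 = free housing
df = pd.DataFrame({'housing': [1, 2, 3, 2]})

# One indicator column per category, rather than treating
# the codes as if they were measurements
dummies = pd.get_dummies(df['housing'], prefix='housing')
print(dummies)
```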
References:
1. Ratner, B. (2012). Statistical and Machine-Learning Data Mining: Techniques for Better Predictive Modeling and Analysis of Big Data (2nd ed.). New York: CRC Press. [ISBN-13: 9781439860915]
2. Tukey, J.W. (1977). Exploratory Data Analysis. Addison-Wesley.
Ames Housing OLS Regression Project (300 Points)
The ames_train data set contains approximately 2039 records. See the data description in the file Introduction_to_Ames_Housing_Data. This is a random selection of training data drawn from the full dataset. Note that the index numbers have been randomized and the split between train and test is also random, so you will not be able to match the test data with sale price values. You are to use OLS ("Linear") Regression to predict the sale price for homes in the ames_test_sfam dataset by building two models using the ames_train data. Note that the test data set is single-family homes, while the training data is all homes.
DELIVERABLES
... zip files). Your write up should have five sections. Each section should have enough detail so that I can follow your logic and someone else can replicate your work. (150 Points)
... analysis. I should be able to run this file and get all the output that you got.
... ames_test_sfam. There will be only two columns in this file: index and p_saleprice. You will be graded on how your model performs versus my model and those of other students in the class.
... submitting your csv file to Kaggle at https://www.kaggle.com/t/0415308f8dd54fc4abed54bef75448bf
It is OK to submit to Kaggle many times. You will have to tell me your alias for Kaggle so I can see the score.
WRITE UP (200 POINTS)
1. First Steps (40 points)
Describe the ames_train data set so that I am convinced you understand it. Use my shell code as a start to explore the data. Apply your creativity and go from there.
If you know how to do pivot tables in Excel, it is a great tool for Exploratory Data Analysis (EDA).
EDA was well established by John Tukey. He was a great advocate for it and developed much of what we do today.
Knowing your data typically consists of three components: (a) a data survey, (b) a data quality check, and (c) an initial exploratory data analysis.

(a) A Data Survey
- Take a broad overview of the Ames housing data set. Read over the data documentation. What data do you have, and what is it supposed to represent?
- In the linear regression component of this course you build linear regression models to predict the value of a property (single family home). Do you have the right data to properly address the problem? Are there observations in the data that should be excluded?
- What kinds of problems can you properly address given the data that you have? In particular, if you were to build a regression model with the variable SalePrice as the response variable, what types of properties would you be valuing? Be careful about what you are doing here.
(b) Define the Sample Population
- When building statistical models you have to define the population of interest, and then sample from THAT population. Frequently you will not actively perform the sampling function. Instead, the data will be made available and you will have to sample from it retrospectively, i.e. you will need to carve out the population of interest. In this assignment the objective is to be able to provide estimates of home values for 'typical' homes in Ames, Iowa. You may not be able to define what 'typical' is, but you can use the data to find out what is atypical. Any values which are not atypical are then considered to be typical.
- Define the appropriate sample population for your statistical problem. Hint: You are building regression models for the response variable SalePrice. Are all properties the same? Would you want to include an apartment building in the same sample as a single family residence? Would you want to include a warehouse or a shopping center in the same sample as a single family residence? Would you want to include condominiums in the same sample as a single family residence?
- Define your sample using 'drop conditions'. Create a list of the drop conditions and include it in your report so that it is clear to any reader what you are excluding from the data set when defining your sample population. The definition of your sample data should be clearly noted in your assignment report.
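One way to implement drop conditions is with explicit boolean masks, so the report can state exactly what was excluded. The column names (bldgtype, saleprice) and values below are hypothetical stand-ins for the Ames fields:

```python
import pandas as pd

# Hypothetical slice of the training data
train = pd.DataFrame({
    'bldgtype':  ['1Fam', '2fmCon', '1Fam', 'Twnhs'],
    'saleprice': [150000, 90000, 0, 120000],
})

# Drop anything that is not a single-family home, or has a
# non-positive sale price; each condition is written out explicitly
drop_condition = (train['bldgtype'] != '1Fam') | (train['saleprice'] <= 0)
sample = train[~drop_condition]

print(len(train), '->', len(sample))  # rows before and after the drops
```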
(c) A Data Quality Check
- In practice your data will not be 'clean'. You will need to examine your data for errors and outliers. Errors will not always show as outliers, and outliers are not necessarily errors.
- If you have a data dictionary that states the set of proper values for each field, then you will want to check your data against the data dictionary.
- If you do not have a data dictionary, then you will need to reason and explore your way to a proper data set.
Example 1: In this project you will be modeling the sales price of housing transactions. It should be obvious that none of these sales prices should be zero or negative. Observations with a zero or negative sales price should logically be considered to be errors.
Example 2: Suppose we had a 'small' number of housing transactions with a sale price over one million dollars. Should we consider these sales prices to be valid? In this case these values could be valid data points, which would make them outliers, or they could be errors, such as 140,000.00 entered as 1,400,000. In either case they are not relevant data points if the objective is to model the 'typical' home price for the area.
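The two examples above can be sketched as simple flagging checks, assuming the price column is named saleprice; flagged rows are listed for review rather than silently deleted:

```python
import pandas as pd

# Made-up sale prices, including one error and one possible outlier
train = pd.DataFrame({'saleprice': [150000, -5, 0, 1400000, 185000]})

errors   = train[train['saleprice'] <= 0]          # logically impossible
outliers = train[train['saleprice'] > 1_000_000]   # valid but atypical?

print('possible errors:', len(errors))
print('possible outliers:', len(outliers))
```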
2. EDA (30 Points)
Pick ten variables from the data quality check to explore in your initial exploratory data analysis. Perform an initial exploratory data analysis. How do you perform an exploratory data analysis for continuous versus discrete (or categorical) data? Consider the use of scatterplots, scatterplot smoothers such as LOESS, and boxplots to produce relevant graphics when appropriate.
Note that you are particularly interested in the relationships between the response variable and the predictor variables.
I suggest you split your EDA into two sections in your report: one section for continuous variables and one section for discrete variables.
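A LOESS-style smoother of the kind mentioned above is available in statsmodels as lowess. The data here are synthetic, purely for illustration:

```python
import numpy as np
from statsmodels.nonparametric.smoothers_lowess import lowess

# Synthetic, roughly linear data with noise
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2 * x + rng.normal(0, 1, 50)

# Locally weighted smoother; returns an array whose columns are
# the sorted x values and the smoothed y values
smoothed = lowess(y, x, frac=0.5)
print(smoothed[:3])
```

Overlaying the smoothed curve on the raw scatter plot helps reveal curvature that a straight-line fit would miss.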
3. BUILD MODELS (100 Points)
Build at least four different LINEAR REGRESSION models.
The first model should be a simple (single predictor variable) model. Find the best single-variable model.
The next model should be a multiple regression model with two predictor variables. Find the best two-variable model.
You do not need to build more complex models for this assignment. More complex models will be the topic for hw02.
Show all of your models and the statistical significance of the input variables. Discuss the quality of fit, R-squared and adjusted R-squared, parsimony, and anything else you can think of that might be of value to share.
Discuss the coefficients in the model you select. Do they make sense? Are you keeping the model even though it is counterintuitive?
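Fitting and comparing the single- and two-variable models can be sketched with statsmodels. The variable names (grlivarea, overallqual) and the tiny DataFrame are illustrative, not the actual Ames columns:

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical training slice (prices in thousands)
train = pd.DataFrame({
    'saleprice':   [105, 140, 168, 185, 230, 255],
    'grlivarea':   [850, 1200, 1500, 1750, 2100, 2400],
    'overallqual': [4, 5, 6, 6, 8, 9],
})

m1 = smf.ols('saleprice ~ grlivarea', data=train).fit()
m2 = smf.ols('saleprice ~ grlivarea + overallqual', data=train).fit()

# Compare fit and parsimony: R-squared, adjusted R-squared, AIC
print(m1.rsquared, m1.rsquared_adj, m1.aic)
print(m2.rsquared, m2.rsquared_adj, m2.aic)

# Statistical significance of the input variables
print(m1.pvalues)
```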
4. SELECT MODELS (20 Points)
Decide on the criteria for selecting the "Best Model". Will you use a metric such as Adjusted R-Squared or AIC? Will you select a model with slightly worse performance if it makes more sense or is more parsimonious? Discuss why you selected your model. Put the metrics in a table to display the results.
5. WRITE MODEL FORMULA (10 Points)
Write a mathematical formula that shows the model you selected. Explain your formula.
Make sure you include this as a section in your report. Do not expect that I will search your report to find it. This step should allow someone else to deploy your model.
The variable with the predicted saleprice should be named: p_saleprice
SCORED DATA FILE (100 POINTS)
Use the Python model that you selected. Score the data file ames_test_sfam. Overall scoring for your model is based on providing a prediction for every record in the test data. Make sure you have not deleted any records in the test data and that none of your predictions are out of range. Create a file that has only TWO variables for each record:
index
p_saleprice
The first variable, index, will allow me to match my grading key to your predicted value. If I cannot do this, you won't get a grade. So please include this value. The second value, p_saleprice, is the predicted price for a property per your model.
Your values will be compared against ...
Predict the Average value for everybody (MEAN)
If your model is not better than simply using an AVERAGE value, you will lose points.
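Writing the two-column scored file can be sketched as follows. The predictions here are hypothetical placeholders; in your own script they would come from calling .predict on your fitted model with the test data:

```python
import pandas as pd

# Hypothetical predictions standing in for model output on ames_test_sfam
scored = pd.DataFrame({
    'index': [101, 102, 103],
    'p_saleprice': [182000.0, 141500.0, 229900.0],
})

# Only the two required columns, no row labels
scored.to_csv('scored.csv', index=False, columns=['index', 'p_saleprice'])

print(pd.read_csv('scored.csv').shape)
```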
BONUS
If you want Bonus Points, write a brief section at the top of your Write Up document and tell me exactly what you did and how many points you are attempting. If I cannot see your Bonus work, I cannot give you credit. Bonus is difficult to grade and I don't have time to go back looking for it. If you don't tell me it's there, I cannot give you points.
The policy with Bonus is: All Sales are Final!
... the results the same? Are there any differences?
... run with it. I might give you points.
PENALTY BOX
... the file names of any files you hand in
... you hand in
Assignment Template
New and Revised for September 2017
In the real world, you will be building predictive models and doing analytic work. But that is not your only function. After you do the work, you need to explain it to other people (most of whom will not understand analytics). Therefore, it is critical that you are able to explain your results in such a way that non-analytic people can understand them. If you dump 20 or 30 pages of output on a person and say "it's all in here", then they won't read it. In fact, that person will likely just ignore your results and go about their day-to-day business without giving your work a second thought. This is not a desirable outcome. You must write your report so that it can be understood by others, and it must contain enough detail that it can be replicated.
In my work I am often handed the work of others and asked to provide a critique. If I am unable to replicate their work because it is lacking in detail, then the critique will be very negative.
It is not enough that you build a great model. You also have to sell it.
DOs AND DON'Ts:
... your document in PDF format
Example Report
(with a lot of commentary)
Assignment #3
Fred Smith PREDICT 410 Section 58
INTRODUCTION
The introduction should describe the purpose of the assignment and what you are going to do in order to complete the assignment. It should be clear that you understand why you are performing certain steps in an analysis.
BAD INTRODUCTION
The purpose of this report is to analyze baseball data.
GOOD INTRODUCTION
The purpose of the assignment is to analyze data from somewhere in order to predict the number of something. This will be accomplished by generating simple and multivariate regression models using different variable selection techniques including, but not limited to, Forward, Stepwise, and Backward regression. From these techniques, the best model will be selected. This best model will then be further analyzed to determine whether it is an adequate model for prediction or whether further analysis is necessary.

Make sure you follow the assignment instructions. To get points for each of these sections, you have to show them in your report. Each assignment will require a different type of report. This template is fairly generic, so adjust it to the assignment instructions. If I don't see the section in your report, you will get 0 points for it.
1. Data Exploration
Important step. This is where you make or break model building. Spend time on this.
... many charts: Bar Charts, Box Plots, Scatter Plots of the data
... variables?)
... be imputed "fixed"?
... will cause test records to be deleted; fix them.
2. Data Preparation
Also a critical section. Experiment with this step. Be creative. I like creative ideas even if they don't work.
Fix outliers
... to create new variables
3. Build Models
These are instructions from Assignment 1 but will be similar in the other assignments.
Build at least two different LINEAR REGRESSION models using different variables. Show all of your models and the statistical significance of the input variables.
Discuss the coefficients in the model. Do they make sense? Are you keeping the model even though it is counterintuitive? Why?
Display the Python results for your assignment and comment on the results. Your discussion of the results should be intertwined with (or linked to) the Python output, i.e. the discussion should be on or near the page containing the output. You should not be showing a lot of unnecessary Python output.
Discuss the results thoroughly. Include such discussion points as:
What is observed in the graph / table / output?
Does it make sense?
Should anything different be done?
GOOD DESCRIPTION OF A DIAGRAM
The analysis continues by examining the plot of the residual values versus the predicted values given in Figure 1. In this type of analysis, a visual inspection of the chart is conducted to determine whether or not any patterns exist in the residuals. Some patterns might include errors that increase or decrease with larger predicted values, or some other type of pattern such as a curve. In an ideal situation, the data will appear to be random. An inspection of Figure 1 suggests that the data points are randomly distributed and no obvious patterns exist in the data. Therefore, there are no immediate concerns with the distribution of the errors.
Figure 1 Housing Data Predicted vs Residual Graph
BAD DESCRIPTION OF A DIAGRAM
I examined the output at the end of the document. There are no patterns in the data.
GOOD DESCRIPTION OF AN EQUATION
The model chosen from the different candidates was the XXX model because it had the highest Adjusted R-Squared value and the lowest AIC and SBC values. Using these metrics, it was far superior to the other models. The formula given for the predicted sale price is:
p_saleprice = 50000
+ 5000 * X1 (LotFrontage)
+ 6000 * X2 (LotArea)
+ 3000 * X3 (OverallCond)
The formula makes intuitive sense for the most part because the sale price coefficients reflect that size and condition add to the value of a property. However, the data should be analyzed for multicollinearity, which can result in sign changes. Also, it might be wise to remove a variable from the model if no explanation can be found.
BAD DESCRIPTION OF AN EQUATION
This is the formula I chose.
p_saleprice = 10.4901
+ 3.11867 * X1 + 5.24082 * X2 + 1.76700 * X4 + 2.65534 * X5
- 3.21636 * X6 - 1.94656 * X8 + 2.35175 * X9
Additionally, it is important to note that this model was developed on data from XXXX years, so it is unknown whether it will translate into years in the future. Further analysis will need to be done to determine whether this model will be robust and translate outside the XXXX year time window.
NOTE: This is a made-up formula, so don't go investing in housing in New York based on this model. Come to think of it, it's probably not a good idea to invest in New York unless you are very familiar with New York.
4. Select Models
Decide on the criteria for selecting the "Best Model". Will you use a metric such as Adjusted R-Squared or AIC? Will you select a model with slightly worse performance if it makes more sense or is more parsimonious? Discuss why you selected your model. Put the results in a table to display and discuss.
5. Model Formula
If you expect points for this step, show it in your report and explain it. You will get 0 points if it is somewhere in your code and left out of the report. Don't expect that I will search your code for it.
Write Python code that will score new data and predict the sale price. The variable with the predicted sale price should be named: p_saleprice
6. Scored Data File
Make sure you submit as a csv file.
Use the stand-alone program that you wrote in the previous section. Score the data file ames_test. Create a file that has only TWO variables for each record:
index
p_saleprice
The first variable, index, will allow me to match my grading key to your predicted value. If I cannot do this, you won't get a grade. The second value, p_saleprice, is the predicted sale price of a home based on the data given to you.
Your values will be compared against ...
Predict the Average value for everybody (MEAN)
If your model is not better than simply using an AVERAGE value, you will lose points.
CONCLUSION:
A short wrap-up of the assignment including a discussion of results and what was learned.
GOOD CONCLUSION:
Several models were developed to predict the sale price of a home using Ames Housing data. The best model was derived using XXXX. Although there were no problems with the model from a statistical standpoint, the winning model did have a sign issue with one of the variables, where seemingly bad construction would result in a higher sale price. This issue needs further investigation but is beyond the scope of this document.
BAD CONCLUSION:
I built some models that were good and I learned a lot.
CODE:
Attach as a separate file or paste your code in at the end.
BONUS
Place all bonus work at the end of the document. Clearly identify what you are doing and how many points you are trying to earn.
Exploratory Data Analysis (EDA)
Assignment #1 Jahee Koo PREDICT 410 Section 58
(It is just a template example, so should change all contents
based on written instructions.)
INTRODUCTION
The purpose of the assignment is to analyze data from
somewhere in order to predict the number of something. This
will be accomplished by generating simple and multivariate
regression models using different variable selection techniques
including, but not limited to, Forward, Stepwise, and Backward
regression. From these techniques, the best model will be
selected. This best model will then be further analyzed to
determine if it is an adequate model to predict or if further
analysis is necessary.
Make sure you follow the assignment instructions. To get points
for each of these sections, you have to show them in your
report. Each assignment will require a different type report.
This template is fairly generic so adjust to assignment
instructions.
1. Data Exploration
of the
data
variables?)
be imputed “fixed”?
44. fix them.
2. Data Preparation
variables (such as ratios or adding or multiplying)
to create new variables
3. Build Models
These are instructions from Assignment 1 but will be similar in
the other assignments.
Build at least two different LINEAR REGRESSION models
using different variables. Show all of your models and the
statistical significance of the input variables.
Discuss the coefficients in the model, do they make sense? Are
you keeping the model even though it is counter intuitive?
Why?
Display the Python results for your assignment and comment on
the results. Your discussion of the results should be intertwined
with (or linked to) the Python output, i.e. the discussion should
be on or near the page containing the output. You should not be
showing a lot of unnecessary Python output.
Discuss the results thoroughly. Include such discussion points
What is observed in the graph /
table / output
nse?
45. GOOD DESCRIPTION OF A DIAGRAM
The analysis continues by examining the plot of the residual
values versus the predicted variables given in Figure 1. In this
type of analysis, a visual inspection of the chart is conducted to
determine whether or not any patterns exist in the residuals.
Some patterns might include errors that increase or decrease
with larger predictive variables or some other type of pattern
such as a curve. In an ideal situation, the data will appear to be
random. An inspection of Figure 1 suggests that the data points
are randomly distributed and no obvious patterns exist in the
data. Therefore, there are no immediate concerns with the
distribution of the errors.
Figure 1 Housing Data Predicted vs Residual Graph
GOOD DESCRIPTION OF AN EQUATION
The model chosen from the different candidates was the XXX
model because it had the highest Adjusted R-Squared value and
the lowest AIC and SBC values. Using these metrics, it was far
superior to the other models. The formula given for the
predicted sale price is:
p_saleprice = 50000
+ 5000 * X1 LotFrontage
+ 6000 * X2 LotArea
+ 3000 * X3 OverallCond 5
The formula makes intuitive sense for the most part because
sale price coefficients reflect that size and condition add to the
value of a property.
However, the data should be analyzed for multi-collinearity
which can result in sign changes. Also, it might be wise to
remove the variable from the model if no explanation can be
found.
4. Select Models
46. Decide on the criteria for selecting the “Best Model”. Will you
use a metric such as Adjusted R-Square or AIC? Will you select
a model with slightly worse performance if it makes more sense
or is more parsimonious? Discuss why you selected your model.
Put the results in a table to display and discuss.
5. Model Formula
Write Python code that will score new data and predict the sale
price. The variable holding the predicted sale price should be
named:
p_saleprice
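Scoring can be done by applying the fitted model's formula directly to the new data. The sketch below uses the placeholder intercept and coefficients from the example equation earlier in this report, with assumed lower-case column names; your own estimates and columns will differ.

```python
import pandas as pd

def score(df):
    """Apply a single-equation linear model to new data.
    The intercept and coefficients are placeholders from the
    example formula, not actual fitted estimates."""
    out = df.copy()
    out['p_saleprice'] = (50000
                          + 5000 * out['lotfrontage']
                          + 6000 * out['lotarea']
                          + 3000 * out['overallcond'])
    return out

# scored = score(test)
```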
6. Scored Data File
Make sure you submit as a csv file.
Use the standalone program that you wrote in the previous
section. Score the data file ames_test. Create a file that has only
TWO variables for each record:
index
p_saleprice
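Producing the two-column submission file is a short pandas step. The sketch below assumes a scored DataFrame that already carries `index` and `p_saleprice` columns, as the assignment specifies; the function name is illustrative.

```python
import pandas as pd

def write_submission(scored, path):
    """Keep only the record index and predicted sale price,
    then save as CSV. Assumes 'index' and 'p_saleprice' columns
    exist (names follow the assignment specification)."""
    submission = scored[['index', 'p_saleprice']]
    submission.to_csv(path, index=False)
    return submission

# write_submission(scored, 'hw01_predictions.csv')
```

Passing `index=False` prevents pandas from writing its own row labels as an extra third column.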
CONCLUSION
Several models were developed to predict the sale price of a
home using Ames Housing data. The best model was derived
using XXXX. Although there were no problems with the model
from a statistical standpoint, the winning model did have a …
CODE:
Attach as a separate file or paste your code in at the end.
BONUS
Place all bonus work at the end of the document. Clearly
identify what you are doing and how many points you are trying
to earn.
# -*- coding: utf-8 -*-
"""
Created on Tue Jan 16 22:58:46 2018
@author: Paul Lee
"""
# Using Linear Regression to predict
# family home sale prices in Ames, Iowa
# Packages
import pandas as pd
import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf
import matplotlib.pyplot as plt
from scipy import stats
from sklearn import linear_model, metrics
# Set some options for the output
pd.set_option('display.notebook_repr_html', False)
pd.set_option('display.max_columns', 40)
pd.set_option('display.max_rows', 10)
pd.set_option('display.width', 120)
# Read in the data
train = pd.read_csv('C:/Users/Jahee Koo/Desktop/AMES_TRAIN.csv')
test = pd.read_csv('C:/Users/Jahee Koo/Desktop/AMES_TEST_SFAM.csv')
# Convert all variable names to lower case
train.columns = [col.lower() for col in train.columns]
test.columns = [col.lower() for col in test.columns]
# EDA
print('\n----- Summary of Train Data -----\n')
print('Object type: ', type(train))
print('Number of observations & variables: ', train.shape)
# Variable names and information
print(train.info())
print(train.dtypes.value_counts())
# Descriptive statistics
print(train.describe())
# show a portion of the beginning of the DataFrame
print(train.head(10))
print(train.shape)
train.loc[:, train.isnull().any()].isnull().sum().sort_values(ascending=False)
train[train == 0].count().sort_values(ascending=False)
t_null = train.isnull().sum()
t_zero = train[train == 0].count()
t_good = train.shape[0] - (t_null + t_zero)
xx = range(train.shape[1])
plt.figure(figsize=(8,8))
plt.bar(xx, t_good, color='g', width=1, bottom=t_null+t_zero)
plt.bar(xx, t_zero, color='y', width=1, bottom=t_null)
plt.bar(xx, t_null, color='r', width=1)
plt.show()
print(t_null[t_null > 1000].sort_values(ascending=False))
print(t_zero[t_zero > 1900].sort_values(ascending=False))
drop_cols = (t_null > 1000) | (t_zero > 1900)
train = train.loc[:, ~drop_cols]
# Some quick plots of the data
train.hist(figsize=(18,14))
train.plot(
kind='box',
subplots=True,
layout=(5,9),
sharex=False,
sharey=False,
figsize=(18,14)
)
train.plot.scatter(x='grlivarea', y='saleprice')
train.boxplot(column='saleprice', by='yrsold')
train.plot.scatter(x='subclass', y='saleprice')
train.boxplot(column='saleprice', by='overallqual')
train.boxplot(column='saleprice', by='overallcond')
train.plot.scatter(x='overallcond', y='saleprice')
train.plot.scatter(x='lotarea', y='saleprice')
# Replace NaN values with medians in train data
train = train.fillna(train.median(numeric_only=True))
# Fill remaining (non-numeric) missing values with each column's mode
train = train.apply(lambda col: col.fillna(col.value_counts().index[0]))
train.head()
t_null = train.isnull().sum()
t_zero = train[train == 0].count()
t_good = train.shape[0] - (t_null + t_zero)
xx = range(train.shape[1])
plt.figure(figsize=(14,14))
plt.bar(xx, t_good, color='g', width=.8, bottom=t_null+t_zero)
plt.bar(xx, t_zero, color='y', width=.8, bottom=t_null)
plt.bar(xx, t_null, color='r', width=.8)
plt.show()
train.bldgtype.unique()
train.housestyle.unique()
# Goal is typical family home
# Drop observations too far above typical (upper 1.5*IQR fence)
for col in ['saleprice', 'grlivarea', 'lotarea', 'totalbsmtsf']:
    q1, q3 = np.percentile(train[col], [25, 75])
    iqr = q3 - q1
    drop_rows = train[col] > q3 + 1.5 * iqr
    train = train.loc[~drop_rows, :]
# Replace 0 values in living area with the median in train data
m = np.median(train.grlivarea[train.grlivarea > 0])
train = train.replace({'grlivarea': {0: m}})
train.plot.scatter(x='saleprice', y='pred', title='model' + str(i+1))
plt.show()
out_df['model' + str(i+1)] = [
model.rsquared.round(3),
model.rsquared_adj.round(3),
model.fvalue.round(3),
model.aic.round(3)
]
i += 1
train['predictions'] = model5.fittedvalues
print(train['predictions'])
# Clean test data
test.info()
test[3:] = test[3:].fillna(test[3:].median())
test["kitchenqual"] = test["kitchenqual"].fillna(test["kitchenqual"].value_counts().index[0])
test["exterqual"] = test["exterqual"].fillna(test["exterqual"].value_counts().index[0])
m = np.median(test.grlivarea[test.grlivarea > 0])
test = test.replace({'grlivarea': {0: m}})
print(test)
# Convert the array of predictions to a data frame, then merge with the index from the test data
test_predictions = model5.predict(test)
test_predictions[test_predictions < 0] = train['saleprice'].min()
print(test_predictions)
dat = {'p_saleprice': test_predictions}
df1 = test[['index']]
df2 = pd.DataFrame(data=dat)
# join_axes was removed from pd.concat in newer pandas;
# both frames share the test index, so a plain concat aligns them
submission = pd.concat([df1, df2], axis=1)
print(submission)
submission.to_csv('C:/Users/Jahee Koo/Desktop/hw01_predictions.csv')
NAME: AmesHousing.txt
TYPE: Population
SIZE: 2930 observations, 82 variables
ARTICLE TITLE: Ames Iowa: Alternative to the Boston Housing Data Set
DESCRIPTIVE ABSTRACT: Data set contains information from the Ames
Assessor's Office used in computing assessed values for individual
residential properties sold in Ames, IA from 2006 to 2010.
SOURCES: Ames, Iowa Assessor's Office
VARIABLE DESCRIPTIONS:
Tab characters are used to separate variables in the data file.
The data has 82 columns which include 23 nominal, 23 ordinal,
14 discrete, and 20 continuous variables (and 2 additional
observation identifiers).
Order (Discrete): Observation number
PID (Nominal): Parcel identification number - can be used with city web site for parcel review.
MS SubClass (Nominal): Identifies the type of dwelling involved in the sale.
020 1-STORY 1946 & NEWER ALL STYLES
030 1-STORY 1945 & OLDER
040 1-STORY W/FINISHED ATTIC ALL AGES
045 1-1/2 STORY - UNFINISHED ALL AGES
050 1-1/2 STORY FINISHED ALL AGES
060 2-STORY 1946 & NEWER
070 2-STORY 1945 & OLDER
075 2-1/2 STORY ALL AGES
080 SPLIT OR MULTI-LEVEL
085 SPLIT FOYER
090 DUPLEX - ALL STYLES AND AGES
120 1-STORY PUD (Planned Unit Development) - 1946 & NEWER
150 1-1/2 STORY PUD - ALL AGES
160 2-STORY PUD - 1946 & NEWER
180 PUD - MULTILEVEL - INCL SPLIT LEV/FOYER
190 2 FAMILY CONVERSION - ALL STYLES AND AGES
MS Zoning (Nominal): Identifies the general zoning classification of the sale.
A Agriculture
C Commercial
FV Floating Village Residential
I Industrial
RH Residential High Density
RL Residential Low Density
RP Residential Low Density Park
RM Residential Medium Density
Lot Frontage (Continuous): Linear feet of street connected to property
Lot Area (Continuous): Lot size in square feet
Street (Nominal): Type of road access to property
Grvl Gravel
Pave Paved
Alley (Nominal): Type of alley access to property
Grvl Gravel
Pave Paved
NA No alley access
Lot Shape (Ordinal): General shape of property
Reg Regular
IR1 Slightly irregular
IR2 Moderately Irregular
IR3 Irregular
Land Contour (Nominal): Flatness of the property
Lvl Near Flat/Level
Bnk Banked - Quick and significant rise from street grade to building
HLS Hillside - Significant slope from side to side
Low Depression
Utilities (Ordinal): Type of utilities available
AllPub All public Utilities (E,G,W,& S)
NoSewr Electricity, Gas, and Water (Septic Tank)
NoSeWa Electricity and Gas Only
ELO Electricity only
Lot Config (Nominal): Lot configuration
Inside Inside lot
Corner Corner lot
CulDSac Cul-de-sac
FR2 Frontage on 2 sides of property
FR3 Frontage on 3 sides of property
Land Slope (Ordinal): Slope of property
Gtl Gentle slope
Mod Moderate Slope
Sev Severe Slope
Neighborhood (Nominal): Physical locations within Ames city limits (map available)
Blmngtn Bloomington Heights
Blueste Bluestem
BrDale Briardale
BrkSide Brookside
ClearCr Clear Creek
CollgCr College Creek
Crawfor Crawford
Edwards Edwards
Gilbert Gilbert
Greens Greens
GrnHill Green Hills
IDOTRR Iowa DOT and Rail Road
Landmrk Landmark
MeadowV Meadow Village
Mitchel Mitchell
NAmes North Ames
NoRidge Northridge
NPkVill Northpark Villa
NridgHt Northridge Heights
NWAmes Northwest Ames
OldTown Old Town
SWISU South & West of Iowa State University
Sawyer Sawyer
SawyerW Sawyer West
Somerst Somerset
StoneBr Stone Brook
Timber Timberland
Veenker Veenker
Condition 1 (Nominal): Proximity to various conditions
Artery Adjacent to arterial street
Feedr Adjacent to feeder street
Norm Normal
RRNn Within 200' of North-South Railroad
RRAn Adjacent to North-South Railroad
PosN Near positive off-site feature--park, greenbelt, etc.
PosA Adjacent to positive off-site feature
RRNe Within 200' of East-West Railroad
RRAe Adjacent to East-West Railroad
Condition 2 (Nominal): Proximity to various conditions (if more than one is present)
Artery Adjacent to arterial street
Feedr Adjacent to feeder street
Norm Normal
RRNn Within 200' of North-South Railroad
RRAn Adjacent to North-South Railroad
PosN Near positive off-site feature--park, greenbelt, etc.
PosA Adjacent to positive off-site feature
RRNe Within 200' of East-West Railroad
RRAe Adjacent to East-West Railroad
Bldg Type (Nominal): Type of dwelling
1Fam Single-family Detached
2FmCon Two-family Conversion; originally built as one-family dwelling
Duplx Duplex
TwnhsE Townhouse End Unit
TwnhsI Townhouse Inside Unit
House Style (Nominal): Style of dwelling
1Story One story
1.5Fin One and one-half story: 2nd level finished
1.5Unf One and one-half story: 2nd level unfinished
2Story Two story
2.5Fin Two and one-half story: 2nd level finished
2.5Unf Two and one-half story: 2nd level unfinished
SFoyer Split Foyer
SLvl Split Level
Overall Qual (Ordinal): Rates the overall material and finish of the house
10 Very Excellent
9 Excellent
8 Very Good
7 Good
6 Above Average
5 Average
4 Below Average
3 Fair
2 Poor
1 Very Poor
Overall Cond (Ordinal): Rates the overall condition of the house
10 Very Excellent
9 Excellent
8 Very Good
7 Good
6 Above Average
5 Average
4 Below Average
3 Fair
2 Poor
1 Very Poor
Year Built (Discrete): Original construction date
Year Remod/Add (Discrete): Remodel date (same as construction date if no remodeling or additions)
Roof Style (Nominal): Type of roof
Flat Flat
Gable Gable
Gambrel Gambrel (Barn)
Hip Hip
Mansard Mansard
Shed Shed
Roof Matl (Nominal): Roof material
ClyTile Clay or Tile
CompShg Standard (Composite) Shingle
Membran Membrane
Metal Metal
Roll Roll
Tar&Grv Gravel & Tar
WdShake Wood Shakes
WdShngl Wood Shingles
Exterior 1 (Nominal): Exterior covering on house
AsbShng Asbestos Shingles
AsphShn Asphalt Shingles
BrkComm Brick Common
BrkFace Brick Face
CBlock Cinder Block
CemntBd Cement Board
HdBoard Hard Board
ImStucc Imitation Stucco
MetalSd Metal Siding
Other Other
Plywood Plywood
PreCast PreCast
Stone Stone
Stucco Stucco
VinylSd Vinyl Siding
Wd Sdng Wood Siding
WdShing Wood Shingles
Exterior 2 (Nominal): Exterior covering on house (if more than one material)
AsbShng Asbestos Shingles
AsphShn Asphalt Shingles
BrkComm Brick Common
BrkFace Brick Face
CBlock Cinder Block
CemntBd Cement Board
HdBoard Hard Board
ImStucc Imitation Stucco
MetalSd Metal Siding
Other Other
Plywood Plywood
PreCast PreCast
Stone Stone
Stucco Stucco
VinylSd Vinyl Siding
Wd Sdng Wood Siding
WdShing Wood Shingles
Mas Vnr Type (Nominal): Masonry veneer type
BrkCmn Brick Common
BrkFace Brick Face
CBlock Cinder Block
None None
Stone Stone
Mas Vnr Area (Continuous): Masonry veneer area in square feet
Exter Qual (Ordinal): Evaluates the quality of the material on the exterior
Ex Excellent
Gd Good
TA Average/Typical
Fa Fair
Po Poor
Exter Cond (Ordinal): Evaluates the present condition of the material on the exterior
Ex Excellent
Gd Good
TA Average/Typical
Fa Fair
Po Poor
Foundation (Nominal): Type of foundation
BrkTil Brick & Tile
CBlock Cinder Block
PConc Poured Concrete
Slab Slab
Stone Stone
Wood Wood
Bsmt Qual (Ordinal): Evaluates the height of the basement
Ex Excellent (100+ inches)
Gd Good (90-99 inches)
TA Typical (80-89 inches)
Fa Fair (70-79 inches)
Po Poor (<70 inches)
NA No Basement
Bsmt Cond (Ordinal): Evaluates the general condition of the basement
Ex Excellent
Gd Good
TA Typical - slight dampness allowed
Fa Fair - dampness or some cracking or settling
Po Poor - Severe cracking, settling, or wetness
NA No Basement
Bsmt Exposure (Ordinal): Refers to walkout or garden level walls
Gd Good Exposure
Av Average Exposure (split levels or foyers typically score average or above)
Mn Minimum Exposure
No No Exposure
NA No Basement
BsmtFin Type 1 (Ordinal): Rating of basement finished area
GLQ Good Living Quarters
ALQ Average Living Quarters
BLQ Below Average Living Quarters
Rec Average Rec Room
LwQ Low Quality
Unf Unfinished
NA No Basement
BsmtFin SF 1 (Continuous): Type 1 finished square feet
BsmtFin Type 2 (Ordinal): Rating of basement finished area (if multiple types)
GLQ Good Living Quarters
ALQ Average Living Quarters
BLQ Below Average Living Quarters
Rec Average Rec Room
LwQ Low Quality
Unf Unfinished
NA No Basement
BsmtFin SF 2 (Continuous): Type 2 finished square feet
Bsmt Unf SF (Continuous): Unfinished square feet of basement area
Total Bsmt SF (Continuous): Total square feet of basement area
Heating (Nominal): Type of heating
Floor Floor Furnace
GasA Gas forced warm air furnace
GasW Gas hot water or steam heat
Grav Gravity furnace
OthW Hot water or steam heat other than gas
Wall Wall furnace
HeatingQC (Ordinal): Heating quality and condition
Ex Excellent
Gd Good
TA Average/Typical
Fa Fair
Po Poor
Central Air (Nominal): Central air conditioning
N No
Y Yes
Electrical (Ordinal): Electrical system
SBrkr Standard Circuit Breakers & Romex
FuseA Fuse Box over 60 AMP and all Romex wiring
(Average)
FuseF 60 AMP Fuse Box and mostly Romex wiring (Fair)
FuseP 60 AMP Fuse Box and mostly knob & tube wiring
(poor)
Mix Mixed
1st Flr SF (Continuous): First Floor square feet
2nd Flr SF (Continuous) : Second floor square feet
Low Qual Fin SF (Continuous): Low quality finished square feet (all floors)
Gr Liv Area (Continuous): Above grade (ground) living area square feet
Bsmt Full Bath (Discrete): Basement full bathrooms
Bsmt Half Bath (Discrete): Basement half bathrooms
Full Bath (Discrete): Full bathrooms above grade
Half Bath (Discrete): Half baths above grade
Bedroom (Discrete): Bedrooms above grade (does NOT include basement bedrooms)
Kitchen (Discrete): Kitchens above grade
KitchenQual (Ordinal): Kitchen quality
Ex Excellent
Gd Good
TA Typical/Average
Fa Fair
Po Poor
TotRmsAbvGrd (Discrete): Total rooms above grade (does not include bathrooms)
Functional (Ordinal): Home functionality (Assume typical unless deductions are warranted)
Typ Typical Functionality
Min1 Minor Deductions 1
Min2 Minor Deductions 2
Mod Moderate Deductions
Maj1 Major Deductions 1
Maj2 Major Deductions 2
Sev Severely Damaged
Sal Salvage only
Fireplaces (Discrete): Number of fireplaces
FireplaceQu (Ordinal): Fireplace quality
Ex Excellent - Exceptional Masonry Fireplace
Gd Good - Masonry Fireplace in main level
TA Average - Prefabricated Fireplace in main living area or Masonry Fireplace in basement
Fa Fair - Prefabricated Fireplace in basement
Po Poor - Ben Franklin Stove
NA No Fireplace
Garage Type (Nominal): Garage location
2Types More than one type of garage
Attchd Attached to home
Basment Basement Garage
BuiltIn Built-In (Garage part of house - typically has room above garage)
CarPort Car Port
Detchd Detached from home
NA No Garage
Garage Yr Blt (Discrete): Year garage was built
Garage Finish (Ordinal) : Interior finish of the garage
Fin Finished
RFn Rough Finished
Unf Unfinished
NA No Garage
Garage Cars (Discrete): Size of garage in car capacity
Garage Area (Continuous): Size of garage in square feet
Garage Qual (Ordinal): Garage quality
Ex Excellent
Gd Good
TA Typical/Average
Fa Fair
Po Poor
NA No Garage
Garage Cond (Ordinal): Garage condition
Ex Excellent
Gd Good
TA Typical/Average
Fa Fair
Po Poor
NA No Garage
Paved Drive (Ordinal): Paved driveway
Y Paved
P Partial Pavement
N Dirt/Gravel
Wood Deck SF (Continuous): Wood deck area in square feet
Open Porch SF (Continuous): Open porch area in square feet
Enclosed Porch (Continuous): Enclosed porch area in square feet
3-Ssn Porch (Continuous): Three season porch area in square feet
Screen Porch (Continuous): Screen porch area in square feet
Pool Area (Continuous): Pool area in square feet
Pool QC (Ordinal): Pool quality
Ex Excellent
Gd Good
TA Average/Typical
Fa Fair
NA No Pool
Fence (Ordinal): Fence quality
GdPrv Good Privacy
MnPrv Minimum Privacy
GdWo Good Wood
MnWw Minimum Wood/Wire
NA No Fence
Misc Feature (Nominal): Miscellaneous feature not covered in other categories
Elev Elevator
Gar2 2nd Garage (if not described in garage section)
Othr Other
Shed Shed (over 100 SF)
TenC Tennis Court
NA None
Misc Val (Continuous): $Value of miscellaneous feature
Mo Sold (Discrete): Month Sold (MM)
Yr Sold (Discrete): Year Sold (YYYY)
Sale Type (Nominal): Type of sale
WD Warranty Deed - Conventional
CWD Warranty Deed - Cash
VWD Warranty Deed - VA Loan
New Home just constructed and sold
COD Court Officer Deed/Estate
Con Contract 15% Down payment regular terms
ConLw Contract Low Down payment and low interest
ConLI Contract Low Interest
ConLD Contract Low Down
Oth Other
Sale Condition (Nominal): Condition of sale
Normal Normal Sale
Abnorml Abnormal Sale - trade, foreclosure, short sale
AdjLand Adjoining Land Purchase
Alloca Allocation - two linked properties with separate deeds, typically condo with a garage unit
Family Sale between family members
Partial Home was not completed when last assessed (associated with New Homes)
SalePrice (Continuous): Sale price $$