Exploratory Data Analysis (EDA)
by Melvin Ott, PhD
September, 2017
Introduction
The Masters in Predictive Analytics program at Northwestern
University offers
graduate courses that cover predictive modeling using several
software products
such as SAS, R and Python. The Predict 410 course is one of
the core courses and
this section focuses on using Python.
Predict 410 will follow a sequence in the assignments. The first
assignment will ask
you to perform an EDA (see Ratner [1], Chapters 1 & 2) for the
Ames Housing Data
dataset to determine the best single variable model. It will be
followed by an
assignment to expand to a multivariable model. Python
software for boxplots,
scatterplots and more will help you identify the single variable.
However, it is easy
to get lost in the programming and lose sight of the objective.
Namely, which of
the variable choices best explains the variability in the response
variable?
(You will need to be familiar with the data types and level of
measurement. This
will be critical in determining the choice of when to use a
dummy variable for model
building. If this topic is new to you review the definitions at
Types of Data before
reading further.)
This report will help you become familiar with some of the
tools for EDA and allow
you to interact with the data by using links to a software
product, Shiny, that will
demonstrate and interact with you to produce various plots of
the data. Shiny is
located on a cloud server and will allow you to make choices in
looking at the plots
for the data. Study the plots carefully. This is your initial EDA
tool and leads to
your model building and your overall understanding of
predictive analytics.
Single Variable Linear Regression EDA
1. Become Familiar With the Data
Identify the variables that are categorical and the variables that
are quantitative.
For the Ames Housing Data, you should review the Ames Data
Description pdf file.
2. Look at Plots of the Data
For the variables that are quantitative, you should look at
scatter plots vs the
response variable saleprice. For the categorical variables, look
at boxplots vs
saleprice. You have sample Python code to help with the EDA, and below are some
links that demonstrate the relationships for a different dataset,
building_prices.
For the boxplots with Shiny:
http://melvin.shinyapps.io/SboxPlot
For the scatterplots with Shiny:
http://melvin.shinyapps.io/SScatter/
3. Begin Writing Python Code
Start with the shell code and improve on the model provided.
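As a starting point for step 2, here is a minimal sketch of a scatter plot for a quantitative variable and a boxplot for a categorical one. The data are synthetic and invented for illustration; only the column names (grlivarea, overallqual, saleprice) are borrowed from the Ames data, and the numeric summaries at the end are one possible way to back up what the plots show.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line to see plot windows
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in for the housing data (values are invented)
rng = np.random.default_rng(0)
n = 200
homes = pd.DataFrame({
    "grlivarea": rng.uniform(600, 3000, n),   # quantitative
    "overallqual": rng.integers(1, 11, n),    # categorical, coded 1-10
})
homes["saleprice"] = (20000 + 80 * homes["grlivarea"]
                      + 9000 * homes["overallqual"]
                      + rng.normal(0, 15000, n))

fig, axes = plt.subplots(1, 2, figsize=(12, 5))
# Quantitative predictor: scatter plot vs the response
homes.plot.scatter(x="grlivarea", y="saleprice", ax=axes[0])
# Categorical predictor: one box per category vs the response
homes.boxplot(column="saleprice", by="overallqual", ax=axes[1])

# Numeric counterparts of the two plots
corr = homes["grlivarea"].corr(homes["saleprice"])
medians = homes.groupby("overallqual")["saleprice"].median()
print(round(corr, 3))
print(medians)
```

A strong linear pattern in the scatter plot, or boxes that shift clearly from category to category, both suggest a variable worth trying in a single variable model.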
Single Variable Logistic Regression EDA
1. Become Familiar With the Data
In 411 you will have an introduction to logistic regression and
will again be asked to
perform an EDA. See the file credit data for more info. Make
sure you recognize
which variables are quantitative and which are categorical.
And, for several of
these variables, what is the level of measurement?
2. Look at Plots of the Data
For logistic regression, the response variable is of the type
yes/no. In this
dataset it is coded as good/bad. So, the EDA may include
histograms for
quantitative variables with a separate histogram for each of the
response values.
For numeric coded explanatory categorical variables, if the
response good/bad is
recoded as 0/1 then the mean for the response variable for each
of the categories
will indicate if there is a relationship.
For the histograms with Shiny:
http://melvin.shinyapps.io/SHisto
For the means with Shiny:
http://melvin.shinyapps.io/SCRMeans
3. Begin Writing Python Code
OK. You have looked at the plots. Which variable do you think
will be most useful
for predicting or explaining bad credit? After you answer this
question, begin
writing Python code to see if you can replicate these plots.
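The two ideas in step 2 can be sketched as follows. The data frame is synthetic (the bad-credit rates are invented), and the column names checking, duration and good_bad are lower-case versions of names in the CREDIT data; treat this as one possible approach rather than the code behind the Shiny apps.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the CREDIT data (values are invented)
rng = np.random.default_rng(1)
n = 1000
credit = pd.DataFrame({
    "checking": rng.integers(1, 5, n),    # categorical, coded 1-4
    "duration": rng.integers(6, 61, n),   # quantitative, months
})
# Assumption for illustration: category 1 has a higher rate of bad credit
p_bad = np.where(credit["checking"] == 1, 0.5, 0.2)
credit["good_bad"] = np.where(rng.random(n) < p_bad, "bad", "good")

# Recode the response as 0/1; the mean per category is then the bad rate
credit["bad"] = (credit["good_bad"] == "bad").astype(int)
bad_rate = credit.groupby("checking")["bad"].mean()
print(bad_rate)

# A separate histogram (bin counts) of a quantitative variable
# for each response value
bins = np.arange(0, 66, 6)
hist_by_response = {
    label: np.histogram(group["duration"], bins=bins)[0]
    for label, group in credit.groupby("good_bad")
}
print(hist_by_response)
```

If the mean of the 0/1 response differs noticeably between categories, or the two histograms have clearly different shapes, the variable is a candidate predictor.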
The data set CREDIT contains information on 1000 customers.
There are 21 variables in
the data set:
Name      Model Role  Measurement Level   Description
AGE       Input       Interval            Age in years
AMOUNT    Input       Interval            Amount of credit requested
CHECKING  Input       Nominal or Ordinal  Balance in existing checking account:
                                            1 = less than 0 DM
                                            2 = more than 0 but less than 200 DM
                                            3 = at least 200 DM
                                            4 = no checking account
COAPP     Input       Nominal             Other debtors or guarantors:
                                            1 = none
                                            2 = co-applicant
                                            3 = guarantor
DEPENDS   Input       Interval            Number of dependents
DURATION  Input       Interval            Length of loan in months
EMPLOYED  Input       Ordinal             Time at present employment:
                                            1 = unemployed
                                            2 = less than 1 year
                                            3 = at least 1, but less than 4 years
                                            4 = at least 4, but less than 7 years
                                            5 = at least 7 years
EXISTCR   Input       Interval            Number of existing accounts at this bank
FOREIGN   Input       Binary              Foreign worker:
                                            1 = Yes
                                            2 = No
GOOD_BAD  Target      Binary              Credit Rating Status (good or bad)
HISTORY   Input       Ordinal             Credit History:
                                            0 = no loans taken / all loans paid back in full and on time
                                            1 = all loans at this bank paid back in full and on time
                                            2 = all loans paid back on time until now
                                            3 = late payments on previous loans
                                            4 = critical account / loans in arrears at other banks
HOUSING   Input       Nominal             Rent/Own:
                                            1 = rent
                                            2 = own
                                            3 = free housing
INSTALLP  Input       Interval            Debt as a percent of disposable income
JOB       Input       Ordinal             Employment status:
                                            1 = unemployed / unskilled non-resident
                                            2 = unskilled resident
                                            3 = skilled employee / official
                                            4 = management / self-employed / highly skilled employee / officer
MARITAL   Input       Nominal             Marital status and gender:
                                            1 = male – divorced/separated
                                            2 = female – divorced/separated/married
                                            3 = male – single
                                            4 = male – married/widowed
                                            5 = female – single
OTHER     Input       Nominal or Ordinal  Other installment loans:
                                            1 = bank
                                            2 = stores
                                            3 = none
PROPERTY  Input       Nominal or Ordinal  Collateral property for loan:
                                            1 = real estate
                                            2 = if not 1, building society savings agreement / life insurance
                                            3 = if not 1 or 2, car or others
                                            4 = unknown / no property
PURPOSE   Input       Nominal             Reason for loan request:
                                            0 = new car
                                            1 = used car
                                            2 = furniture/equipment
                                            3 = radio / television
                                            4 = domestic appliances
                                            5 = repairs
                                            6 = education
                                            7 = vacation
                                            8 = retraining
                                            9 = business
                                            x = other
RESIDENT  Input       Interval            Years at current address
SAVINGS   Input       Nominal or Ordinal  Savings account balance:
                                            1 = less than 100 DM
                                            2 = at least 100, but less than 500 DM
                                            3 = at least 500, but less than 1000 DM
                                            4 = at least 1000 DM
                                            5 = unknown / no savings account
TELEPHON  Input       Binary              Telephone:
                                            1 = none
                                            2 = yes, registered under the customer’s name
Exploratory Data Analysis (EDA)
Ratner [1] describes ‘data mining’ “as any process that finds
unexpected structures in
data and uses the EDA framework to ensure that the process
explores the data,
not exploits it.” The word “unexpected” suggests that
“exploratory” is a very
appropriate description of this process.
Tukey [2], in his book and in many presentations, gave structure to
EDA. Others have
extended it to include ‘big’ data. Big data has arisen from
our ability to
capture huge datasets, store them on servers cost effectively, and
analyze them with
software that can handle them.
Shiny Apps
To learn more about Shiny applications with RStudio click on
the link below:
http://rstudio.github.io/shiny/tutorial/
Types of Data
Quantitative data are numeric and represent counts or
measurements.
Categorical data are names or labels such as a,b,c but can often
be shown as 1,2,3.
They do not suggest counts or measurements.
Discrete data are finite or countable numeric data.
Continuous data are values that represent a continuous scale of
measurement.
A nominal level of measurement suggests names or categories.
There is no
apparent order suggested.
Ordinal level data suggest a sequential ordering but
mathematical calculations
should not be performed on this data.
Interval level data are ordinal, and in addition the difference
between two data values is
meaningful. There is no natural zero point.
Ratio level data are interval level data that also have a natural
zero point, so both differences and ratios are meaningful.
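The level of measurement drives the choice of when to use a dummy variable. Here is a minimal sketch, assuming a tiny invented data frame; the HOUSING codes (1 = rent, 2 = own, 3 = free housing) follow the CREDIT data description, while the rows themselves are made up. A nominal variable is expanded into 0/1 dummy columns, because its numeric codes carry no order or distance.

```python
import pandas as pd

# Tiny invented example; only the HOUSING coding comes from the data description
df = pd.DataFrame({
    "housing": [1, 2, 3, 2, 1],      # nominal: codes are labels, not quantities
    "employed": [1, 3, 5, 2, 4],     # ordinal: order matters, spacing does not
    "age": [23, 41, 35, 58, 30],     # quantitative
})

# Replace the numeric codes with labels so they cannot be mistaken for numbers
df["housing"] = df["housing"].map({1: "rent", 2: "own", 3: "free"})

# One dummy column per category, dropping the first to avoid redundancy
dummies = pd.get_dummies(df["housing"], prefix="housing",
                         drop_first=True, dtype=int)
model_df = pd.concat([df[["age", "employed"]], dummies], axis=1)
print(model_df)
```

With drop_first=True the omitted category ("free" here, the first in alphabetical order) becomes the baseline that the other dummy coefficients are compared against.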
References:
1. Ratner, B. (2012). Statistical and Machine-Learning Data Mining: Techniques for Better Predictive Modeling and Analysis of Big Data (2nd ed.). New York: CRC Press. [ISBN-13: 9781439860915]
2. Tukey, J.W. (1977). Exploratory Data Analysis. Addison-Wesley.
Ames Housing OLS Regression Project (300 Points)
The ames_train data set contains approximately 2039 records.
See the data
description in the file Introduction_to_Ames_Housing_Data.
This is a random
selection of training data selected from the full dataset. Note,
the index numbers
have been randomized and the split between train and test is
also random so you
will not be able to match the test data with sale price values.
You are to use OLS
(“Linear”) Regression to predict the sale price for homes in the
ames_test_sfam
dataset by building two models using the ames_train data.
Note, the test data set
is single family homes, the training data is all homes.
DELIVERABLES
zip files). Your
write up should have five sections. Each section should have
enough detail so
that I can follow your logic and someone else can replicate your
work. (150
Points)
analysis. I should be
able to run this file and get all the output that you got.
ames_test_sfam.
There will be only two columns in this file: index and
p_saleprice. You will be
graded on how your model performs versus my model and those
of other
students in the class.
submitting your csv
file to kaggle at
https://www.kaggle.com/t/0415308f8dd54fc4abed54bef75448bf
It is OK to submit to Kaggle many times.
You will have to tell me your alias for Kaggle so I can see the
score.
WRITE UP (200 POINTS)
1. First Steps (40 points)
Describe the ames_train data set so that I am convinced you
understand it.
Use my shell code as a start to explore the data. Apply your
creativity and go
from there.
If you know how to do pivot tables in Excel, it is a great tool
for Exploratory Data
Analysis (EDA).
EDA was well established by John Tukey. He was a great
advocate for it and
developed much of what we do today.
Knowing your data typically consists of three components: (a) a
data survey, (b) a
data quality check, and (c) an initial exploratory data analysis.
(a) A Data Survey
- Take a broad overview of the Ames housing data set. Read
over the data
documentation. What data do you have, and what is it supposed
to represent?
- In the linear regression component of this course you build
linear regression
models to predict the value of a property (single family home).
Do you have the
right data to properly address the problem? Are there
observations in the data
that should be excluded?
- What kinds of problems can you properly address given the
data that you have?
In particular if you were to build a regression model with the
variable SalePrice as
the response variable, what types of properties would you be
valuing? Be careful
about what you are doing here.
(b) Define the Sample Population
- When building statistical models you have to define the
population of interest,
and then sample from THAT population. Frequently you will
not actively perform
the sampling function. Instead, the data will be made available
and you will have to
sample from it retrospectively, i.e. you will need to carve out
the population of
interest. In this assignment the objective is to be able to
provide estimates of
home values for 'typical' homes in Ames, Iowa. You may not be
able to define what
'typical' is, but you can use the data to find out what is atypical. Any
values which are
not atypical are then considered to be typical.
- Define the appropriate sample population for your statistical
problem. Hint: You
are building regression models for the response variable
SalePrice. Are all
properties the same? Would you want to include an apartment
building in the same
sample as a single family residence? Would you want to include
a warehouse or a
shopping center in the same sample as a single family
residence? Would you want to
include condominiums in the same sample as a single family
residence?
- Define your sample using ‘drop conditions’. Create for the
drop conditions and
include it in your report so that it is clear to any reader what
you are excluding
from the data set when defining your sample population.
The definition of your sample data should be clearly noted in
your assignment
report.
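One way to document drop conditions is a running table that shows how many records each condition removes. This is a sketch only: the rows are invented, the category labels ("1Fam", "RL" and so on) are assumed for illustration and should be checked against the data documentation, and the three conditions are examples, not the required set.

```python
import pandas as pd

# Invented rows; labels are assumptions for illustration
homes = pd.DataFrame({
    "bldgtype": ["1Fam", "1Fam", "Duplex", "1Fam", "TwnhsE", "1Fam"],
    "zoning": ["RL", "C (all)", "RL", "RL", "RM", "RL"],
    "saleprice": [180000, 95000, 140000, 0, 210000, 255000],
})

# Each drop condition is a label plus a boolean mask of rows to exclude
drop_conditions = [
    ("not a single family home", homes["bldgtype"] != "1Fam"),
    ("non-residential zoning", ~homes["zoning"].isin(["RL", "RM"])),
    ("sale price <= 0", homes["saleprice"] <= 0),
]

keep = pd.Series(True, index=homes.index)
rows = []
for label, cond in drop_conditions:
    dropped = int((keep & cond).sum())   # rows newly removed by this condition
    keep &= ~cond
    rows.append({"condition": label, "dropped": dropped,
                 "remaining": int(keep.sum())})

drop_table = pd.DataFrame(rows)
sample = homes[keep]
print(drop_table)
```

Because the conditions are applied in order, each row of the table counts only the records not already removed by an earlier condition, so the reader can follow exactly how the sample was carved out.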
(c) A Data Quality Check
- In practice your data will not be 'clean'. You will need to
examine your data for
errors and outliers. Errors will not always show as outliers, and
outliers are not
necessarily errors.
- If you have a data dictionary that states the set of proper
values for each field,
then you will want to check your data against the data
dictionary.
- If you do not have a data dictionary, then you will need to
reason and explore
your way to a proper data set.
Example 1: In this project you will be modeling the sales price
of housing
transactions. It should be obvious that none of these sales prices
should be zero or
negative. Observations with a zero or negative sales price
should logically be
considered to be errors.
Example 2: Suppose we had a 'small' number of housing
transactions with a sale
price over one million dollars, should we consider these sales
prices to be valid? In
this case these values could be valid data points, which would
make them outliers,
or they could be errors, such as 140,000.00 entered as
1,400,000. In either case
they are not relevant data points if the objective is to model the
'typical' home
price for the area.
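The two examples can be sketched in a few lines. The prices below are invented, and the outlier rule (more than 1.5 times the interquartile range above the 75th percentile) is the same one used in the sample code at the end of this document; it is one common convention, not the only defensible choice.

```python
import numpy as np
import pandas as pd

# Invented sale prices, including two impossible values and one suspicious one
prices = pd.Series(
    [125000, 140000, 0, 162500, 189000, 1400000, 210000, -5, 175000, 151000],
    name="saleprice",
)

# Example 1: zero or negative prices are logically errors
errors = prices <= 0
valid = prices[~errors]

# Example 2: flag unusually high prices with the 1.5 * IQR rule
q1, q3 = np.percentile(valid, [25, 75])
upper_fence = q3 + 1.5 * (q3 - q1)
outliers = valid > upper_fence

print("errors:", int(errors.sum()))
print("upper fence:", upper_fence)
print("flagged:", valid[outliers].tolist())
```

Note that the rule cannot tell you whether 1,400,000 is a valid sale or a typing error. It only tells you the value is atypical, which is enough when the objective is the 'typical' home.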
2. EDA (30 Points)
Pick ten variables from the data quality check to explore in your
initial exploratory
data analysis. Perform an initial exploratory data analysis. How
do you perform an
exploratory data analysis for continuous versus discrete (or
categorical) data?
Consider the use of scatterplots, scatterplot smoothers such as
LOESS, and
boxplots to produce relevant graphics when appropriate.
Note that you are particularly interested in the relationships
between the
response variable and the predictor variables.
Suggest you split your EDA into two sections in your report –
one section for
continuous variables and one section for discrete variables.
3. BUILD MODELS (100 Points)
Build at least four different LINEAR REGRESSION models.
The first model should be a simple (single prediction variable)
model. Find the best
single variable model.
The next model should be a multiple regression model with two
predictor
variables. Find the best two variable model.
You do not need to build more complex models for this
assignment. More complex
models will be the topic for hw02.
Show all of your models and the statistical significance of the
input variables.
Discuss the quality of fit, R squared and adjusted R squared,
parsimony and
anything else you can think of that might be of value to share.
Discuss the coefficients in the model you select, do they make
sense? Are you
keeping the model even though it is counter intuitive?
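A minimal sketch of the search for the best single variable model, using plain NumPy least squares so that R squared and adjusted R squared are computed explicitly. The helper fit_ols is hypothetical (it is not part of the shell code), the data are synthetic, and the column names are borrowed from the Ames data.

```python
import numpy as np
import pandas as pd

def fit_ols(X, y):
    """Least-squares fit with an intercept; returns (coefficients, R^2, adjusted R^2)."""
    X = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sse = float(resid @ resid)
    sst = float(((y - y.mean()) ** 2).sum())
    n, k = X.shape                       # k counts the intercept
    r2 = 1 - sse / sst
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k)
    return beta, r2, adj_r2

# Synthetic data: saleprice depends on grlivarea and lotarea, not on yrsold
rng = np.random.default_rng(2)
n = 300
homes = pd.DataFrame({
    "grlivarea": rng.uniform(600, 3000, n),
    "lotarea": rng.uniform(2000, 20000, n),
    "yrsold": rng.integers(2006, 2011, n).astype(float),
})
y = (30000 + 90 * homes["grlivarea"] + 1.5 * homes["lotarea"]
     + rng.normal(0, 20000, n)).to_numpy()

# Fit one single variable model per candidate and compare R^2
results = {col: fit_ols(homes[col].to_numpy(), y)[1] for col in homes.columns}
best = max(results, key=results.get)
print(results)
print("best single variable:", best)
```

In practice you would get the same quantities, plus the significance tests, from the statsmodels summary output; the point here is only to show what the numbers mean.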
4. SELECT MODELS (20 Points)
Decide on the criteria for selecting the “Best Model”. Will you
use a metric such as
Adjusted R-Square or AIC? Will you select a model with
slightly worse
performance if it makes more sense or is more parsimonious?
Discuss why you
selected your model. Put the metrics in a table to display the
results.
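A sketch of a metrics table for candidate models. The helper ols_metrics and the data are invented for illustration. AIC is computed here as n * ln(SSE / n) + 2k, which differs from the value some packages report by a constant that depends only on n, so the ranking of models fitted to the same data is unaffected.

```python
import numpy as np
import pandas as pd

def ols_metrics(X, y):
    """Fit OLS with an intercept and return R^2, adjusted R^2 and AIC."""
    X = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    sse = float(((y - X @ beta) ** 2).sum())
    sst = float(((y - y.mean()) ** 2).sum())
    n, k = X.shape                       # k counts the intercept
    r2 = 1 - sse / sst
    return {
        "r2": r2,
        "adj_r2": 1 - (1 - r2) * (n - 1) / (n - k),
        "aic": n * np.log(sse / n) + 2 * k,   # up to an additive constant
    }

# Synthetic data in which both predictors matter
rng = np.random.default_rng(3)
n = 300
x1 = rng.uniform(600, 3000, n)
x2 = rng.integers(1, 11, n).astype(float)
y = 20000 + 80 * x1 + 9000 * x2 + rng.normal(0, 15000, n)

candidates = {
    "x1": x1,
    "x2": x2,
    "x1 + x2": np.column_stack([x1, x2]),
}
table = pd.DataFrame(
    {name: ols_metrics(X, y) for name, X in candidates.items()}
).T
print(table.round(3))
```

Higher adjusted R squared and lower AIC are better. When the metrics disagree, or differ only slightly, that is where the discussion of parsimony belongs.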
5. WRITE MODEL FORMULA (10 Points)
Write a mathematical formula that will show the model you
selected. Explain your
formula.
Make sure you include this as a section in your report. Do not
expect that I
will search your report to find it. This step should allow
someone else to deploy
your model.
The variable with the predicted saleprice should be named:
p_saleprice
SCORED DATA FILE (100 POINTS)
Use the python model that you selected. Score the data file
ames_test_sfam.
Overall scoring for your model is based on providing a
prediction for every record
in the test data. Make sure you have not deleted any records in
the test data
and that none of your predictions are out of range. Create a
file that has only
TWO variables for each record:
index
p_saleprice
The first variable, index, will allow me to match my grading
key to your predicted
value. If I cannot do this, you won’t get a grade. So please
include this value. The
second value, p_saleprice is the predicted price for a property
per your model.
Your values will be compared against …
Predict the Average value for everybody (MEAN)
If your model is not better than simply using an AVERAGE
value, you will lose
points.
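The scoring step can be sketched as below. Everything here is synthetic: the train and test frames are invented stand-ins, and the one-variable model is only a placeholder for whatever model you select. The checks mirror the requirements above: one prediction per test record, none missing or out of range, and exactly two columns.

```python
import numpy as np
import pandas as pd

# Synthetic stand-ins for the training and test data
rng = np.random.default_rng(4)
train = pd.DataFrame({"grlivarea": rng.uniform(600, 3000, 200)})
train["saleprice"] = 25000 + 85 * train["grlivarea"] + rng.normal(0, 15000, 200)
test = pd.DataFrame({"index": np.arange(1000, 1050),
                     "grlivarea": rng.uniform(600, 3000, 50)})

# Placeholder model: simple linear regression (polyfit returns slope first)
slope, intercept = np.polyfit(train["grlivarea"], train["saleprice"], deg=1)

scored = pd.DataFrame({
    "index": test["index"],
    "p_saleprice": intercept + slope * test["grlivarea"],
})

# Sanity checks before handing anything in
assert len(scored) == len(test)              # no test records deleted
assert scored["p_saleprice"].notna().all()   # a prediction for every record
assert (scored["p_saleprice"] > 0).all()     # no out-of-range predictions

# Pass a file name to to_csv to write the file; without one it returns text
csv_text = scored.to_csv(index=False)

# Compare against the "predict the average for everybody" baseline
pred_train = intercept + slope * train["grlivarea"]
rmse_model = np.sqrt(((train["saleprice"] - pred_train) ** 2).mean())
rmse_mean = np.sqrt(((train["saleprice"] - train["saleprice"].mean()) ** 2).mean())
print(rmse_model, rmse_mean)
```

The baseline comparison here uses the training data because the test sale prices are not available to you; the graded comparison is made against the held-out values.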
BONUS
If you want Bonus Points, write a brief section at the top of
your Write Up
document and tell me exactly what you did and how many
points you are attempting.
If I cannot see your Bonus work, I cannot give you credit.
Bonus is difficult to
grade and I don’t have time to go back looking for it. If you
don’t tell me it’s there,
I cannot give you points.
The policy with Bonus is: All Sales are Final !
the results the
same? Are there any differences?
run with it. I might
give you points.
PENALTY BOX
e file
names of any files you
hand in
you hand in
Assignment Template
New and Revised for
September 2017
In the real world, you will be building predictive models and
doing analytic work.
But that is not your only function. After you do the work, you
need to explain it to
other people (most of whom will not understand analytics).
Therefore, it is critical
that you are able to explain your results in such a way that non
analytic people can
understand it. If you dump 20 or 30 pages of output on the
person and say “it’s all
in here”, then they won’t read it. In fact, that person will likely
just ignore your
results and go about their day to day business without giving
your work a second
thought. This is not a desirable outcome. You must write your
report so that it can
be understood by others and it must contain enough detail that it
can be replicated.
In my work I am often handed the work of others and asked to
provide a critique.
If I am unable to replicate their work because it is lacking in
detail then the
critique will be very negative.
It is not enough that you build a great model. You also have to
sell it.
DOs AND DON’Ts:
our document in PDF Format
“Homework_03_Fred_Smith.pdf”)
plaining the output
discussion
format
mework_03.pdf”
scroll through it
discussing it
diagram at the end of
the document …
UNLESS IT IS ABSOLUTELY NECESSARY to do that)
Example Report
(with a lot of commentary)
Assignment #3
Fred Smith PREDICT 410 Section 58
INTRODUCTION
The introduction should describe the purpose of the assignment
and what you are
going to do in order to complete the assignment. It should be
clear that you
understand why you are performing certain steps in an analysis.
BAD INTRODUCTION
The purpose of this report is to analyze baseball data.
GOOD INTRODUCTION
The purpose of the assignment is to analyze data from
somewhere in order to
predict the number of something. This will be accomplished by
generating simple
and multivariate regression models using different variable
selection techniques
including, but not limited to, Forward, Stepwise, and Backward
regression. From
these techniques, the best model will be selected. This best
model will then be
further analyzed to determine if it is an adequate model to
predict or if further
analysis is necessary.
Make sure you follow the assignment instructions. To get
points for each of these
sections, you have to show them in your report. Each
assignment will require a
different type report. This template is fairly generic so adjust
to assignment
instructions.
If I don’t see the section in your report you will get 0 points for
it.
1. Data Exploration
Important step. This is where you make or break model
building. Spend time on
this.
o many charts, Bar Charts, Box Plots, Scatter Plots of the
data
variables?)
be
imputed “fixed”?
will cause test records to be deleted,
fix
them.
2. Data Preparation
Also, a critical section. Experiment with this step. Be creative.
I like creative
ideas even if they don’t work.
Fix outliers
to create
new variables
3. Build Models
These are instructions from Assignment 1 but will be similar in
the other
assignments.
Build at least two different LINEAR REGRESSION models
using different
variables. Show all of your models and the statistical
significance of the input
variables.
Discuss the coefficients in the model, do they make sense? Are
you keeping the
model even though it is counter intuitive? Why?
Display the Python results for your assignment and comment on
the results. Your
discussion of the results should be intertwined with (or linked
to) the Python
output, i.e. the discussion should be on or near the page
containing the output.
You should not be showing a lot of unnecessary Python output.
Discuss the results thoroughly. Include such discussion points
as:
g different be done?
GOOD DESCRIPTION OF A DIAGRAM
The analysis continues by examining the plot of the residual
values versus the
predicted variables given in Figure 1. In this type of analysis, a
visual inspection of
the chart is conducted to determine whether or not any patterns
exist in the
residuals. Some patterns might include errors that increase or
decrease with
larger predictive variables or some other type of pattern such as
a curve. In an
ideal situation, the data will appear to be random. An inspection
of Figure 1
suggests that the data points are randomly distributed and no
obvious patterns
exist in the data. Therefore, there are no immediate concerns
with the
distribution of the errors.
Figure 1 Housing Data Predicted vs Residual Graph
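A plot like the one described can be produced along these lines. This is a sketch on synthetic data with a placeholder one-variable model; it is not the code that generated Figure 1.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line to see the plot
import matplotlib.pyplot as plt
import numpy as np

# Synthetic data with constant-variance errors (the "ideal situation")
rng = np.random.default_rng(5)
x = rng.uniform(600, 3000, 250)
y = 25000 + 85 * x + rng.normal(0, 15000, 250)

slope, intercept = np.polyfit(x, y, deg=1)
predicted = intercept + slope * x
residuals = y - predicted

fig, ax = plt.subplots(figsize=(7, 5))
ax.scatter(predicted, residuals, s=10)
ax.axhline(0, color="k", linewidth=1)   # reference line at zero error
ax.set_xlabel("Predicted sale price")
ax.set_ylabel("Residual")
ax.set_title("Predicted vs Residual")
# Call fig.savefig(...) or plt.show() to view the chart
```

For a least-squares fit with an intercept, the residuals always average to zero and are uncorrelated with the predictions, so what you are looking for in the chart is shape: a funnel (errors growing with the prediction) or a curve.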
BAD DESCRIPTION OF A DIAGRAM
I examined the output at the end of the document. There are no
patterns in the
data.
GOOD DESCRIPTION OF AN EQUATION
The model chosen from the different candidates was the XXX
model because it
had the highest Adjusted R-Squared value and the lowest AIC
and SBC values.
Using these metrics, it was far superior to the other models. The
formula given for
the predicted sale price is:
p_saleprice = 50000
+ 5000 * X1 LotFrontage
+ 6000 * X2 LotArea
+ 3000 * X3 OverallCond
The formula makes intuitive sense for the most part because
sale price
coefficients reflect that size and condition add to the value of a
property.
However, the data should be analyzed for multi-collinearity
which can result in sign
changes. Also, it might be wise to remove the variable from the
model if no
explanation can be found.
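One common check for multi-collinearity is the variance inflation factor (VIF): regress each predictor on the others and compute 1 / (1 - R squared). The sketch below uses a hypothetical helper vif and synthetic data in which lotarea is built to be nearly a multiple of lotfrontage; the variable names come from the example formula, the values do not.

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of a 2-D predictor array."""
    X = np.asarray(X, dtype=float)
    out = []
    for j in range(X.shape[1]):
        target = X[:, j]
        # Regress column j on an intercept plus all the other columns
        others = np.column_stack([np.ones(len(X)), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ beta
        r2 = 1 - (resid @ resid) / ((target - target.mean()) ** 2).sum()
        out.append(1 / (1 - r2))
    return np.array(out)

# Synthetic predictors: lotarea is nearly collinear with lotfrontage by construction
rng = np.random.default_rng(6)
lotfrontage = rng.uniform(30, 120, 300)
lotarea = 100 * lotfrontage + rng.normal(0, 300, 300)
overallcond = rng.integers(1, 10, 300).astype(float)

vifs = vif(np.column_stack([lotfrontage, lotarea, overallcond]))
print(dict(zip(["lotfrontage", "lotarea", "overallcond"], vifs.round(1))))
```

A frequently quoted rule of thumb treats a VIF above about 10 as a warning sign; coefficients on such variables can be unstable and may even change sign.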
BAD DESCRIPTION OF AN EQUATION
This is the formula I chose.
p_saleprice = 10.4901
+ 3.11867 * X1 + 5.24082 * X2 + 1.76700 * X4 + 2.65534 * X5
- 3.21636 * X6 - 1.94656 * X8 + 2.35175 * X9
Additionally, it is important to note that this model was
developed on data from
XXXX years, so it is unknown whether it will
translate into years in
the future. Further analysis will need to be done to determine
whether this model
will be robust and translate outside the XXXX year time
window.
NOTE: This is a made up formula, so don’t go investing in
housing in New York
based on this model. Come to think of it, it’s probably not a
good idea to invest in
New York unless you are very familiar with New York.
4. Select Models
Decide on the criteria for selecting the “Best Model”. Will you
use a metric such as
Adjusted R-Square or AIC? Will you select a model with
slightly worse
performance if it makes more sense or is more parsimonious?
Discuss why you
selected your model. Put the results in a table to display and
discuss.
5. Model Formula
If you expect points for this step, show it in your report and
explain it. You
will get 0 points if it is somewhere in your code and left out of
the report.
Don’t expect that I will search your code for it.
Write python code that will score new data and predict the sale
price. The variable
with the predicted sale price should be named:
p_saleprice
6. Scored Data File
Make sure you submit as a csv file.
Use the stand alone program that you wrote in the previous
section. Score the data
file ames_test. Create a file that has only TWO variables for
each record:
index
p_saleprice
The first variable, index, will allow me to match my grading
key to your predicted
value. If I cannot do this, you won’t get a grade. The second
value, p_saleprice is
the predicted sale price of a home based on the data given to
you.
Your values will be compared against …
Predict the Average value for everybody (MEAN)
If your model is not better than simply using an AVERAGE
value, you will lose
points.
CONCLUSION:
A short wrap up of the assignment including a discussion of
results and what was
learned.
GOOD CONCLUSION:
Several models were developed to predict the sale price of a
home using Ames
Housing data. The best model was derived using XXXX.
Although there were no
problems with the model from a statistical standpoint, the
winning model did have a
sign issue with one of the variables where seemingly bad
construction would result
in a higher sale price. This issue needs further investigation but
is beyond the
scope of this document.
BAD CONCLUSION:
I built some models that were good and I learned a lot.
CODE:
Attach as a separate file or paste your code in at the end.
BONUS
Place all bonus work at the end of the document. Clearly
identify what you are
doing and how many points you are trying to earn.
Exploratory Data Analysis (EDA)
Assignment #1 Jahee Koo PREDICT 410 Section 58
(This is just a template example; change all contents
based on the written instructions.)
INTRODUCTION
The purpose of the assignment is to analyze data from
somewhere in order to predict the number of something. This
will be accomplished by generating simple and multivariate
regression models using different variable selection techniques
including, but not limited to, Forward, Stepwise, and Backward
regression. From these techniques, the best model will be
selected. This best model will then be further analyzed to
determine if it is an adequate model to predict or if further
analysis is necessary.
Make sure you follow the assignment instructions. To get points
for each of these sections, you have to show them in your
report. Each assignment will require a different type report.
This template is fairly generic so adjust to assignment
instructions.
1. Data Exploration
of the
data
variables?)
be imputed “fixed”?
fix them.
2. Data Preparation
variables (such as ratios or adding or multiplying)
to create new variables
3. Build Models
These are instructions from Assignment 1 but will be similar in
the other assignments.
Build at least two different LINEAR REGRESSION models
using different variables. Show all of your models and the
statistical significance of the input variables.
Discuss the coefficients in the model, do they make sense? Are
you keeping the model even though it is counter intuitive?
Why?
Display the Python results for your assignment and comment on
the results. Your discussion of the results should be intertwined
with (or linked to) the Python output, i.e. the discussion should
be on or near the page containing the output. You should not be
showing a lot of unnecessary Python output.
Discuss the results thoroughly. Include such discussion points
What is observed in the graph /
table / output
nse?
GOOD DESCRIPTION OF A DIAGRAM
The analysis continues by examining the plot of the residual
values versus the predicted variables given in Figure 1. In this
type of analysis, a visual inspection of the chart is conducted to
determine whether or not any patterns exist in the residuals.
Some patterns might include errors that increase or decrease
with larger predictive variables or some other type of pattern
such as a curve. In an ideal situation, the data will appear to be
random. An inspection of Figure 1 suggests that the data points
are randomly distributed and no obvious patterns exist in the
data. Therefore, there are no immediate concerns with the
distribution of the errors.
Figure 1 Housing Data Predicted vs Residual Graph
GOOD DESCRIPTION OF AN EQUATION
The model chosen from the different candidates was the XXX
model because it had the highest Adjusted R-Squared value and
the lowest AIC and SBC values. Using these metrics, it was far
superior to the other models. The formula given for the
predicted sale price is:
p_saleprice = 50000
+ 5000 * X1 LotFrontage
+ 6000 * X2 LotArea
+ 3000 * X3 OverallCond
The formula makes intuitive sense for the most part because
sale price coefficients reflect that size and condition add to the
value of a property.
However, the data should be analyzed for multi-collinearity
which can result in sign changes. Also, it might be wise to
remove the variable from the model if no explanation can be
found.
4. Select Models
Decide on the criteria for selecting the “Best Model”. Will you
use a metric such as Adjusted R-Square or AIC? Will you select
a model with slightly worse performance if it makes more sense
or is more parsimonious? Discuss why you selected your model.
Put the results in a table to display and discuss.
5. Model Formula
Write python code that will score new data and predict the sale
price. The variable with the predicted sale price should be
named:
p_saleprice
6. Scored Data File
Make sure you submit as a csv file.
Use the stand alone program that you wrote in the previous
section. Score the data file ames_test. Create a file that has only
TWO variables for each record:
index
p_saleprice
CONCLUSION
Several models were developed to predict the sale price of a
home using Ames Housing data. The best model was derived
using XXXX. Although there were no problems with the model
from a statistical standpoint, the winning model did have a …
CODE:
Attach as a separate file or paste your code in at the end.
BONUS
Place all bonus work at the end of the document. Clearly
identify what you are doing and how many points you are trying
to earn.
# -*- coding: utf-8 -*-
"""
Created on Tue Jan 16 22:58:46 2018
@author: Paul Lee
"""
# Using Linear Regression to predict
# family home sale prices in Ames, Iowa
# Packages
import pandas as pd
import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf
import matplotlib.pyplot as plt
from scipy import stats
from sklearn import linear_model, metrics
# Set some options for the output
pd.set_option('display.notebook_repr_html', False)
pd.set_option('display.max_columns', 40)
pd.set_option('display.max_rows', 10)
pd.set_option('display.width', 120)
# Read in the data
train = pd.read_csv('C:/Users/Jahee Koo/Desktop/AMES_TRAIN.csv')
test = pd.read_csv('C:/Users/Jahee Koo/Desktop/AMES_TEST_SFAM.csv')
# Convert all variable names to lower case
train.columns = [col.lower() for col in train.columns]
test.columns = [col.lower() for col in test.columns]
# EDA
print('\n----- Summary of Train Data -----\n')
print('Object type: ', type(train))
print('Number of observations & variables: ', train.shape)
# Variable names and information
print(train.info())
print(train.dtypes.value_counts())
# Descriptive statistics
print(train.describe())
# show a portion of the beginning of the DataFrame
print(train.head(10))
print(train.shape)
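# Count missing values per column (only columns that have any)
print(train.loc[:, train.isnull().any()].isnull().sum().sort_values(ascending=False))
# Count zero values per column
print(train[train == 0].count().sort_values(ascending=False))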
t_null = train.isnull().sum()
t_zero = train[train == 0].count()
t_good = train.shape[0] - (t_null + t_zero)
xx = range(train.shape[1])
plt.figure(figsize=(8,8))
plt.bar(xx, t_good, color='g', width=1,
bottom=t_null+t_zero)
plt.bar(xx, t_zero, color='y', width=1,
bottom=t_null)
plt.bar(xx, t_null, color='r', width=1)
plt.show()
print(t_null[t_null > 1000].sort_values(ascending=False))
print(t_zero[t_zero > 1900].sort_values(ascending=False))
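# Drop columns that are mostly missing or mostly zero
drop_cols = (t_null > 1000) | (t_zero > 1900)
# Use ~ (not unary minus) to invert a boolean mask
train = train.loc[:, ~drop_cols]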
# Some quick plots of the data
train.hist(figsize=(18,14))
train.plot(
kind='box',
subplots=True,
layout=(5,9),
sharex=False,
sharey=False,
figsize=(18,14)
)
train.plot.scatter(x='grlivarea', y='saleprice')
train.boxplot(column='saleprice', by='yrsold')
train.plot.scatter(x='subclass', y='saleprice')
train.boxplot(column='saleprice', by='overallqual')
train.boxplot(column='saleprice', by='overallcond')
train.plot.scatter(x='overallcond', y='saleprice')
train.plot.scatter(x='lotarea', y='saleprice')
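# Replace NaN values with medians in train data (numeric columns)
train = train.fillna(train.median(numeric_only=True))
# Fill any remaining NaN values with each column's most frequent value
train = train.apply(lambda col: col.fillna(col.value_counts().index[0]))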
train.head()
t_null = train.isnull().sum()
t_zero = train[train == 0].count()
t_good = train.shape[0] - (t_null + t_zero)
xx = range(train.shape[1])
plt.figure(figsize=(14,14))
plt.bar(xx, t_good, color='g', width=.8,
bottom=t_null+t_zero)
plt.bar(xx, t_zero, color='y', width=.8,
bottom=t_null)
plt.bar(xx, t_null, color='r', width=.8)
plt.show()
train.bldgtype.unique()
train.housestyle.unique()
# Goal is typical family home
# Drop observations too far from typical
iqr = np.percentile(train.saleprice, 75) - np.percentile(train.saleprice, 25)
drop_rows = train.saleprice > iqr * 1.5 + np.percentile(train.saleprice, 75)
train = train.loc[~drop_rows, :]
iqr = np.percentile(train.grlivarea, 75) - np.percentile(train.grlivarea, 25)
drop_rows = train.grlivarea > iqr * 1.5 + np.percentile(train.grlivarea, 75)
train = train.loc[~drop_rows, :]
iqr = np.percentile(train.lotarea, 75) - np.percentile(train.lotarea, 25)
drop_rows = train.lotarea > iqr * 1.5 + np.percentile(train.lotarea, 75)
train = train.loc[~drop_rows, :]
iqr = np.percentile(train.totalbsmtsf, 75) - np.percentile(train.totalbsmtsf, 25)
drop_rows = train.totalbsmtsf > iqr * 1.5 + np.percentile(train.totalbsmtsf, 75)
train = train.loc[~drop_rows, :]
# Replace 0 values with median to living area in train data
m = np.median(train.grlivarea[train.grlivarea > 0])
train = train.replace({'grlivarea': {0: m}})
# Discrete variables
plt.figure()
g = sns.PairGrid(train,
                 x_vars=["bldgtype",
                         "exterqual",
                         "centralair",
                         "kitchenqual",
                         "salecondition"],
                 y_vars=["saleprice"],
                 aspect=.75, height=3.5)
g.map(sns.violinplot, palette="pastel")
# Print correlations
corr_matrix = train.corr(numeric_only=True)
print(corr_matrix["saleprice"].sort_values(ascending=False).head(10))
print(corr_matrix["saleprice"].sort_values(ascending=True).head(10))
# Pick 10 variables to focus on (plus the response, saleprice)
pick_10 = [
'saleprice',
'grlivarea',
'overallqual',
'garagecars',
'yearbuilt',
'totalbsmtsf',
'salecondition',
'bldgtype',
'kitchenqual',
'exterqual',
'centralair'
]
# Correlations among the numeric picks (text-valued columns are skipped)
corr = train[pick_10].corr(numeric_only=True)
blank = np.zeros_like(corr, dtype=bool)
blank[np.triu_indices_from(blank)] = True
fig, ax = plt.subplots(figsize=(10, 10))
corr_map = sns.diverging_palette(255, 133, l=60, n=7,
                                 center="dark", as_cmap=True)
sns.heatmap(corr, mask=blank, cmap=corr_map, square=True,
            vmax=.3, linewidths=0.25, cbar_kws={"shrink": .5})
# Quick plots
for variable in pick_10[1:]:
    if train[variable].dtype.name == 'object':
        # Categorical predictor: strip plot, then box plot
        plt.figure()
        sns.stripplot(y="saleprice", x=variable, data=train, jitter=True)
        plt.show()
        # catplot replaces the older factorplot and creates its own figure
        sns.catplot(y="saleprice", x=variable, data=train, kind="box")
        plt.show()
    else:
        # Numeric predictor: scatter plot against sale price
        fig, ax = plt.subplots()
        ax.set_ylabel('Sale Price')
        ax.set_xlabel(variable)
        scatter_plot = ax.scatter(
            y=train['saleprice'],
            x=train[variable],
            facecolors='none',
            edgecolors='blue'
        )
        plt.show()
sns.catplot(x="bldgtype", y="saleprice", col="exterqual", row="kitchenqual",
            hue="overallqual", data=train, kind="swarm")
plt.figure()
sns.countplot(y="overallqual", hue="exterqual", data=train,
              palette="Greens_d")
# Run simple models
model1 = smf.ols(formula='saleprice ~ grlivarea',
                 data=train).fit()
model2 = smf.ols(formula='saleprice ~ grlivarea + overallqual',
                 data=train).fit()
model3 = smf.ols(formula='saleprice ~ grlivarea + overallqual + garagecars',
                 data=train).fit()
model4 = smf.ols(formula='saleprice ~ grlivarea + overallqual + garagecars + yearbuilt',
                 data=train).fit()
model5 = smf.ols(formula='saleprice ~ grlivarea + overallqual + garagecars + yearbuilt'
                         ' + totalbsmtsf + kitchenqual + exterqual + centralair',
                 data=train).fit()
print('\n\nmodel 1----------\n', model1.summary())
print('\n\nmodel 2----------\n', model2.summary())
print('\n\nmodel 3----------\n', model3.summary())
print('\n\nmodel 4----------\n', model4.summary())
print('\n\nmodel 5----------\n', model5.summary())
out = [model1,
model2,
model3,
model4,
model5]
out_df = pd.DataFrame()
out_df['labels'] = ['rsquared', 'rsquared_adj', 'fstatistic', 'aic']
i = 0
for model in out:
    # Plot fitted values against actual sale prices for each model
    train['pred'] = model.fittedvalues
    plt.figure()
    train.plot.scatter(x='saleprice', y='pred', title='model' + str(i+1))
    plt.show()
    # Collect the fit statistics named in out_df['labels']
    out_df['model' + str(i+1)] = [
        model.rsquared.round(3),
        model.rsquared_adj.round(3),
        model.fvalue.round(3),
        model.aic.round(3)
    ]
    i += 1
train['predictions'] = model5.fittedvalues
print(train['predictions'])
# Clean test data
test.info()
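# Replace NaN values in numeric columns with medians, as for the train data
test = test.fillna(test.median(numeric_only=True))
# Replace NaN values in the two quality columns with the most frequent level
test["kitchenqual"] = test["kitchenqual"].fillna(test["kitchenqual"].value_counts().index[0])
test["exterqual"] = test["exterqual"].fillna(test["exterqual"].value_counts().index[0])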
m = np.median(test.grlivarea[test.grlivarea > 0])
test = test.replace({'grlivarea': {0: m}})
print(test)
# Convert the array predictions to a data frame, then merge with
# the index for the test data
test_predictions = model5.predict(test)
test_predictions[test_predictions < 0] = train['saleprice'].min()
print(test_predictions)
dat = {'p_saleprice': test_predictions}
df1 = test[['index']]
df2 = pd.DataFrame(data=dat)
# Combine side by side and align on the test index
submission = pd.concat([df1, df2], axis=1).reindex(df1.index)
print(submission)
submission.to_csv('C:/Users/Jahee Koo/Desktop/hw01_predictions.csv')
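The script above applies the same upper-fence rule (third quartile plus 1.5 times the interquartile range) to four columns in turn. As a minimal sketch, that rule could be wrapped in one function. The name `drop_upper_outliers` and the toy values below are illustrative assumptions, not part of the original assignment code.

```python
import numpy as np
import pandas as pd


def drop_upper_outliers(df, columns, k=1.5):
    """Drop rows above Q3 + k * IQR, one column at a time.

    Hypothetical helper. The quartiles are recomputed after each
    column's filter, so the result depends on the column order,
    just as it does when the filters are written out one by one.
    """
    for col in columns:
        q1, q3 = np.percentile(df[col], [25, 75])
        upper_fence = q3 + k * (q3 - q1)
        drop_rows = df[col] > upper_fence
        df = df.loc[~drop_rows, :]
    return df


# Toy data (made up): one extreme price and one extreme living area
demo = pd.DataFrame({
    'saleprice': [100, 110, 120, 130, 140, 1000],
    'grlivarea': [900, 950, 1000, 1050, 5000, 1100],
})
trimmed = drop_upper_outliers(demo, ['saleprice', 'grlivarea'])
print(trimmed)
```

Only the upper tail is trimmed here, mirroring the script. Whether low-end observations should also be dropped is a modeling decision left to the analyst.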
NAME: AmesHousing.txt
TYPE: Population
SIZE: 2930 observations, 82 variables
ARTICLE TITLE: Ames Iowa: Alternative to the Boston Housing Data Set
DESCRIPTIVE ABSTRACT: Data set contains information from the Ames Assessor’s Office used in computing assessed values for individual residential properties sold in Ames, IA from 2006 to 2010.
SOURCES:
Ames, Iowa Assessor’s Office
VARIABLE DESCRIPTIONS:
Tab characters are used to separate variables in the data file. The data has 82 columns which include 23 nominal, 23 ordinal, 14 discrete, and 20 continuous variables (and 2 additional observation identifiers).
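Because the data file is tab-separated, it can be read with `pd.read_csv` and `sep='\t'`. The sketch below uses two made-up rows in an in-memory buffer as a stand-in for AmesHousing.txt. The column subset and values are assumptions for illustration only, and the course files (AMES_TRAIN.csv) may use different column names.

```python
import io

import pandas as pd

# Two made-up rows standing in for the tab-separated data file
sample = io.StringIO(
    "Order\tPID\tMS SubClass\tLot Area\tSalePrice\n"
    "1\t900000001\t20\t9000\t150000\n"
    "2\t900000002\t60\t11000\t210000\n"
)

ames = pd.read_csv(sample, sep='\t')

# Remove spaces and lower-case the names so they work as attributes
# and inside model formulas
ames.columns = [col.replace(' ', '').lower() for col in ames.columns]
print(ames.columns.tolist())
print(ames.shape)
```

With the real file, the buffer would be replaced by the path to AmesHousing.txt.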
Order (Discrete): Observation number
PID (Nominal): Parcel identification number - can be used with city web site for parcel review.
MS SubClass (Nominal): Identifies the type of dwelling involved in the sale.
020 1-STORY 1946 & NEWER ALL STYLES
030 1-STORY 1945 & OLDER
040 1-STORY W/FINISHED ATTIC ALL AGES
045 1-1/2 STORY - UNFINISHED ALL AGES
050 1-1/2 STORY FINISHED ALL AGES
060 2-STORY 1946 & NEWER
070 2-STORY 1945 & OLDER
075 2-1/2 STORY ALL AGES
080 SPLIT OR MULTI-LEVEL
085 SPLIT FOYER
090 DUPLEX - ALL STYLES AND AGES
120 1-STORY PUD (Planned Unit Development) - 1946 & NEWER
150 1-1/2 STORY PUD - ALL AGES
160 2-STORY PUD - 1946 & NEWER
180 PUD - MULTILEVEL - INCL SPLIT LEV/FOYER
190 2 FAMILY CONVERSION - ALL STYLES AND AGES
MS Zoning (Nominal): Identifies the general zoning classification of the sale.
A Agriculture
C Commercial
FV Floating Village Residential
I Industrial
RH Residential High Density
RL Residential Low Density
RP Residential Low Density Park
RM Residential Medium Density
Lot Frontage (Continuous): Linear feet of street connected to property
Lot Area (Continuous): Lot size in square feet
Street (Nominal): Type of road access to property
Grvl Gravel
Pave Paved
Alley (Nominal): Type of alley access to property
Grvl Gravel
Pave Paved
NA No alley access
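For variables such as Alley, the code NA is a real level ("No alley access") rather than a missing value. By default, pandas parses the string "NA" as missing when reading a file, so these levels show up as NaN. A minimal sketch of the effect and one way to handle it follows. The rows are made up and the replacement label `NoAlley` is a hypothetical choice.

```python
import io

import pandas as pd

# Made-up rows; "NA" here means "No alley access", not a missing value
sample = io.StringIO(
    "Alley\tLot Area\n"
    "Grvl\t8000\n"
    "NA\t9500\n"
    "Pave\t7200\n"
)

# Default parsing turns the string "NA" into NaN
default_read = pd.read_csv(sample, sep='\t')
n_missing = default_read['Alley'].isnull().sum()

# Relabel so that "no alley access" remains a level of its own
# ('NoAlley' is a hypothetical label)
alley = default_read['Alley'].fillna('NoAlley')
print(n_missing)
print(alley.tolist())
```

This relabeling would also absorb any genuinely missing entries in the column. Passing `keep_default_na=False` to `pd.read_csv` is an alternative, but it then affects every column in the file.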
Lot Shape (Ordinal): General shape of property
Reg Regular
IR1 Slightly irregular
IR2 Moderately Irregular
IR3 Irregular
Land Contour (Nominal): Flatness of the property
Lvl Near Flat/Level
Bnk Banked - Quick and significant rise from street grade to building
HLS Hillside - Significant slope from side to side
Low Depression
Utilities (Ordinal): Type of utilities available
AllPub All public Utilities (E,G,W,& S)
NoSewr Electricity, Gas, and Water (Septic Tank)
NoSeWa Electricity and Gas Only
ELO Electricity only
Lot Config (Nominal): Lot configuration
Inside Inside lot
Corner Corner lot
CulDSac Cul-de-sac
FR2 Frontage on 2 sides of property
FR3 Frontage on 3 sides of property
Land Slope (Ordinal): Slope of property
Gtl Gentle slope
Mod Moderate Slope
Sev Severe Slope
Neighborhood (Nominal): Physical locations within Ames city limits (map available)
Blmngtn Bloomington Heights
Blueste Bluestem
BrDale Briardale
BrkSide Brookside
ClearCr Clear Creek
CollgCr College Creek
Crawfor Crawford
Edwards Edwards
Gilbert Gilbert
Greens Greens
GrnHill Green Hills
IDOTRR Iowa DOT and Rail Road
Landmrk Landmark
MeadowV Meadow Village
Mitchel Mitchell
NAmes North Ames
NoRidge Northridge
NPkVill Northpark Villa
NridgHt Northridge Heights
NWAmes Northwest Ames
OldTown Old Town
SWISU South & West of Iowa State University
Sawyer Sawyer
SawyerW Sawyer West
Somerst Somerset
StoneBr Stone Brook
Timber Timberland
Veenker Veenker
Condition 1 (Nominal): Proximity to various conditions
Artery Adjacent to arterial street
Feedr Adjacent to feeder street
Norm Normal
RRNn Within 200' of North-South Railroad
RRAn Adjacent to North-South Railroad
PosN Near positive off-site feature--park, greenbelt, etc.
PosA Adjacent to positive off-site feature
RRNe Within 200' of East-West Railroad
RRAe Adjacent to East-West Railroad
Condition 2 (Nominal): Proximity to various conditions (if more than one is present)
Artery Adjacent to arterial street
Feedr Adjacent to feeder street
Norm Normal
RRNn Within 200' of North-South Railroad
RRAn Adjacent to North-South Railroad
PosN Near positive off-site feature--park, greenbelt, etc.
PosA Adjacent to positive off-site feature
RRNe Within 200' of East-West Railroad
RRAe Adjacent to East-West Railroad
Bldg Type (Nominal): Type of dwelling
1Fam Single-family Detached
2FmCon Two-family Conversion; originally built as one-family dwelling
Duplx Duplex
TwnhsE Townhouse End Unit
TwnhsI Townhouse Inside Unit
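A nominal variable such as Bldg Type has no natural order, so it enters a regression through dummy (indicator) variables rather than as a single number. The sketch below shows one way to build them with `pd.get_dummies`. The four rows are made up, and dropping the first level as the reference category is an assumption about how the model would be set up.

```python
import pandas as pd

# Made-up rows using level codes from the Bldg Type list
homes = pd.DataFrame({
    'bldgtype': ['1Fam', 'Duplx', 'TwnhsE', '1Fam'],
    'saleprice': [200000, 140000, 160000, 185000],
})

# One 0/1 column per level; the first level (1Fam) is dropped and
# becomes the reference category
dummies = pd.get_dummies(homes['bldgtype'], prefix='bldgtype',
                         drop_first=True, dtype=int)
design = pd.concat([homes[['saleprice']], dummies], axis=1)
print(design)
```

When a model is fit through the statsmodels formula interface, text-valued columns are typically expanded into dummy variables automatically, so the explicit step is mainly useful for inspecting what the model sees.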
House Style (Nominal): Style of dwelling
1Story One story
1.5Fin One and one-half story: 2nd level finished
1.5Unf One and one-half story: 2nd level unfinished
2Story Two story
2.5Fin Two and one-half story: 2nd level finished
2.5Unf Two and one-half story: 2nd level unfinished
SFoyer Split Foyer
SLvl Split Level
Overall Qual (Ordinal): Rates the overall material and finish of the house
10 Very Excellent
9 Excellent
8 Very Good
7 Good
6 Above Average
5 Average
4 Below Average
3 Fair
2 Poor
1 Very Poor
Overall Cond (Ordinal): Rates the overall condition of the house
10 Very Excellent
9 Excellent
8 Very Good
7 Good
6 Above Average
5 Average
4 Below Average
3 Fair
2 Poor
1 Very Poor
Year Built (Discrete): Original construction date
Year Remod/Add (Discrete): Remodel date (same as construction date if no remodeling or additions)
Roof Style (Nominal): Type of roof
Flat Flat
Gable Gable
Gambrel Gambrel (Barn)
Hip Hip
Mansard Mansard
Shed Shed
Roof Matl (Nominal): Roof material
ClyTile Clay or Tile
CompShg Standard (Composite) Shingle
Membran Membrane
Metal Metal
Roll Roll
Tar&Grv Gravel & Tar
WdShake Wood Shakes
WdShngl Wood Shingles
Exterior 1 (Nominal): Exterior covering on house
AsbShng Asbestos Shingles
AsphShn Asphalt Shingles
BrkComm Brick Common
BrkFace Brick Face
CBlock Cinder Block
CemntBd Cement Board
HdBoard Hard Board
ImStucc Imitation Stucco
MetalSd Metal Siding
Other Other
Plywood Plywood
PreCast PreCast
Stone Stone
Stucco Stucco
VinylSd Vinyl Siding
Wd Sdng Wood Siding
WdShing Wood Shingles
Exterior 2 (Nominal): Exterior covering on house (if more than one material)
AsbShng Asbestos Shingles
AsphShn Asphalt Shingles
BrkComm Brick Common
BrkFace Brick Face
CBlock Cinder Block
CemntBd Cement Board
HdBoard Hard Board
ImStucc Imitation Stucco
MetalSd Metal Siding
Other Other
Plywood Plywood
PreCast PreCast
Stone Stone
Stucco Stucco
VinylSd Vinyl Siding
Wd Sdng Wood Siding
WdShing Wood Shingles
Mas Vnr Type (Nominal): Masonry veneer type
BrkCmn Brick Common
BrkFace Brick Face
CBlock Cinder Block
None None
Stone Stone
Mas Vnr Area (Continuous): Masonry veneer area in square feet
Exter Qual (Ordinal): Evaluates the quality of the material on the exterior
Ex Excellent
Gd Good
TA Average/Typical
Fa Fair
Po Poor
Exter Cond (Ordinal): Evaluates the present condition of the material on the exterior
Ex Excellent
Gd Good
TA Average/Typical
Fa Fair
Po Poor
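The quality and condition ratings are ordinal: Po, Fa, TA, Gd and Ex have a clear order even though they are stored as text. One common option, sketched below, is to map the codes to integer scores so they can be plotted or correlated with the sale price. The 1 to 5 scale and the sample rows are assumptions, and treating the steps as equally spaced is a modeling choice that may not hold.

```python
import pandas as pd

# Assumed scoring: higher means better quality
quality_scale = {'Po': 1, 'Fa': 2, 'TA': 3, 'Gd': 4, 'Ex': 5}

# Made-up ratings using the level codes listed above
homes = pd.DataFrame({'exterqual': ['TA', 'Gd', 'Ex', 'Fa', 'TA']})
homes['exterqual_score'] = homes['exterqual'].map(quality_scale)
print(homes)
```

If equal spacing is not a reasonable assumption, the same column can instead be treated as categorical and expanded into dummy variables.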
Foundation (Nominal): Type of foundation
BrkTil Brick & Tile
CBlock Cinder Block
PConc Poured Concrete
Slab Slab
Stone Stone
Wood Wood
Bsmt Qual (Ordinal): Evaluates the height of the basement
Ex Excellent (100+ inches)
Gd Good (90-99 inches)
TA Typical (80-89 inches)
Fa Fair (70-79 inches)
Po Poor (<70 inches)
NA No Basement
Bsmt Cond (Ordinal): Evaluates the general condition of the basement
Ex Excellent
Gd Good
TA Typical - slight dampness allowed
Fa Fair - dampness or some cracking or settling
Po Poor - Severe cracking, settling, or wetness
NA No Basement
Bsmt Exposure (Ordinal): Refers to walkout or garden level walls
Gd Good Exposure
Av Average Exposure (split levels or foyers typically score average or above)
Mn Minimum Exposure
No No Exposure
NA No Basement
BsmtFin Type 1 (Ordinal): Rating of basement finished area
GLQ Good Living Quarters
ALQ Average Living Quarters
BLQ Below Average Living Quarters
Rec Average Rec Room
LwQ Low Quality
Unf Unfinished
NA No Basement
BsmtFin SF 1 (Continuous): Type 1 finished square feet
BsmtFin Type 2 (Ordinal): Rating of basement finished area (if multiple types)
GLQ Good Living Quarters
ALQ Average Living Quarters
BLQ Below Average Living Quarters
Rec Average Rec Room
LwQ Low Quality
Unf Unfinished
NA No Basement
BsmtFin SF 2 (Continuous): Type 2 finished square feet
Bsmt Unf SF (Continuous): Unfinished square feet of basement area
Total Bsmt SF (Continuous): Total square feet of basement area
Heating (Nominal): Type of heating
Floor Floor Furnace
GasA Gas forced warm air furnace
GasW Gas hot water or steam heat
Grav Gravity furnace
OthW Hot water or steam heat other than gas
Wall Wall furnace
HeatingQC (Ordinal): Heating quality and condition
Ex Excellent
Gd Good
TA Average/Typical
Fa Fair
Po Poor
Central Air (Nominal): Central air conditioning
N No
Y Yes
Electrical (Ordinal): Electrical system
SBrkr Standard Circuit Breakers & Romex
FuseA Fuse Box over 60 AMP and all Romex wiring (Average)
FuseF 60 AMP Fuse Box and mostly Romex wiring (Fair)
FuseP 60 AMP Fuse Box and mostly knob & tube wiring (poor)
Mix Mixed
1st Flr SF (Continuous): First Floor square feet
2nd Flr SF (Continuous): Second floor square feet
Low Qual Fin SF (Continuous): Low quality finished square feet (all floors)
Gr Liv Area (Continuous): Above grade (ground) living area square feet
Bsmt Full Bath (Discrete): Basement full bathrooms
Bsmt Half Bath (Discrete): Basement half bathrooms
Full Bath (Discrete): Full bathrooms above grade
Half Bath (Discrete): Half baths above grade
Bedroom (Discrete): Bedrooms above grade (does NOT include basement bedrooms)
Kitchen (Discrete): Kitchens above grade
KitchenQual (Ordinal): Kitchen quality
Ex Excellent
Gd Good
TA Typical/Average
Fa Fair
Po Poor
TotRmsAbvGrd (Discrete): Total rooms above grade (does not include bathrooms)
Functional (Ordinal): Home functionality (Assume typical unless deductions are warranted)
Typ Typical Functionality
Min1 Minor Deductions 1
Min2 Minor Deductions 2
Mod Moderate Deductions
Maj1 Major Deductions 1
Maj2 Major Deductions 2
Sev Severely Damaged
Sal Salvage only
Fireplaces (Discrete): Number of fireplaces
FireplaceQu (Ordinal): Fireplace quality
Ex Excellent - Exceptional Masonry Fireplace
Gd Good - Masonry Fireplace in main level
TA Average - Prefabricated Fireplace in main living area or Masonry Fireplace in basement
Fa Fair - Prefabricated Fireplace in basement
Po Poor - Ben Franklin Stove
NA No Fireplace
Garage Type (Nominal): Garage location
2Types More than one type of garage
Attchd Attached to home
Basment Basement Garage
BuiltIn Built-In (Garage part of house - typically has room above garage)
CarPort Car Port
Detchd Detached from home
NA No Garage
Garage Yr Blt (Discrete): Year garage was built
Garage Finish (Ordinal): Interior finish of the garage
Fin Finished
RFn Rough Finished
Unf Unfinished
NA No Garage
Garage Cars (Discrete): Size of garage in car capacity
Garage Area (Continuous): Size of garage in square feet
Garage Qual (Ordinal): Garage quality
Ex Excellent
Gd Good
TA Typical/Average
Fa Fair
Po Poor
NA No Garage
Garage Cond (Ordinal): Garage condition
Ex Excellent
Gd Good
TA Typical/Average
Fa Fair
Po Poor
NA No Garage
Paved Drive (Ordinal): Paved driveway
Y Paved
P Partial Pavement
N Dirt/Gravel
Wood Deck SF (Continuous): Wood deck area in square feet
Open Porch SF (Continuous): Open porch area in square feet
Enclosed Porch (Continuous): Enclosed porch area in square feet
3-Ssn Porch (Continuous): Three season porch area in square feet
Screen Porch (Continuous): Screen porch area in square feet
Pool Area (Continuous): Pool area in square feet
Pool QC (Ordinal): Pool quality
Ex Excellent
Gd Good
TA Average/Typical
Fa Fair
NA No Pool
Fence (Ordinal): Fence quality
GdPrv Good Privacy
MnPrv Minimum Privacy
GdWo Good Wood
MnWw Minimum Wood/Wire
NA No Fence
Misc Feature (Nominal): Miscellaneous feature not covered in other categories
Elev Elevator
Gar2 2nd Garage (if not described in garage section)
Othr Other
Shed Shed (over 100 SF)
TenC Tennis Court
NA None
Misc Val (Continuous): $Value of miscellaneous feature
Mo Sold (Discrete): Month Sold (MM)
Yr Sold (Discrete): Year Sold (YYYY)
Sale Type (Nominal): Type of sale
WD Warranty Deed - Conventional
CWD Warranty Deed - Cash
VWD Warranty Deed - VA Loan
New Home just constructed and sold
COD Court Officer Deed/Estate
Con Contract 15% Down payment regular terms
ConLw Contract Low Down payment and low interest
ConLI Contract Low Interest
ConLD Contract Low Down
Oth Other
Sale Condition (Nominal): Condition of sale
Normal Normal Sale
Abnorml Abnormal Sale - trade, foreclosure, short sale
AdjLand Adjoining Land Purchase
Alloca Allocation - two linked properties with separate deeds, typically condo with a garage unit
Family Sale between family members
Partial Home was not completed when last assessed (associated with New Homes)
SalePrice (Continuous): Sale price $$
I have to complete an EDA assignment using Python before Jan. 20th.
I have approximate Python code and a report template example, and I would like to ask you to complete the report based on these.
I will attach the necessary files for analysis and report generation.
After completion, please send me a .doc and a .py file.
NR631 Concluding Graduate Experience - Scope Project Managemen.docx
 
Number 11. Describe at least five populations who are vulner.docx
Number 11. Describe at least five populations who are vulner.docxNumber 11. Describe at least five populations who are vulner.docx
Number 11. Describe at least five populations who are vulner.docx
 
ntertainment, the media, and sometimes public leaders can perpetuate.docx
ntertainment, the media, and sometimes public leaders can perpetuate.docxntertainment, the media, and sometimes public leaders can perpetuate.docx
ntertainment, the media, and sometimes public leaders can perpetuate.docx
 
Now that you have  completed Lesson 23 & 24 and have thought a.docx
Now that you have  completed Lesson 23 & 24 and have thought a.docxNow that you have  completed Lesson 23 & 24 and have thought a.docx
Now that you have  completed Lesson 23 & 24 and have thought a.docx
 
nothing wrong with the paper, my professor just wants it to be in an.docx
nothing wrong with the paper, my professor just wants it to be in an.docxnothing wrong with the paper, my professor just wants it to be in an.docx
nothing wrong with the paper, my professor just wants it to be in an.docx
 

1 Exploratory Data Analysis (EDA) by Melvin Ott, PhD.docx

  • 1. Exploratory Data Analysis (EDA) by Melvin Ott, PhD September, 2017 Introduction The Masters in Predictive Analytics program at Northwestern University offers graduate courses that cover predictive modeling using several software products such as SAS, R and Python. The Predict 410 course is one of the core courses and this section focuses on using Python. Predict 410 will follow a sequence in the assignments. The first assignment will ask you to perform an EDA (see Ratner [1], Chapters 1 & 2) for the Ames Housing Data dataset to determine the best single variable model. It will be followed by an
  • 2. assignment to expand to a multivariable model. Python software for boxplots, scatterplots and more will help you identify the single variable. However, it is easy to get lost in the programming and lose sight of the objective. Namely, which of the variable choices best explains the variability in the response variable? (You will need to be familiar with the data types and levels of measurement. This will be critical in determining when to use a dummy variable for model building. If this topic is new to you, review the definitions at Types of Data before reading further.) This report will help you become familiar with some of the tools for EDA and allow you to interact with the data by using links to a software product, Shiny, that will demonstrate and interact with you to produce various plots of the data. Shiny is located on a cloud server and will allow you to make choices in looking at the plots for the data. Study the plots carefully. This is your initial EDA
  • 3. tool and leads to your model building and your overall understanding of predictive analytics. Single Variable Linear Regression EDA 1. Become Familiar With the Data Identify the variables that are categorical and the variables that are quantitative. For the Ames Housing Data, you should review the Ames Data Description pdf file. 2. Look at Plots of the Data For the variables that are quantitative, you should look at scatter plots vs the response variable saleprice. For the categorical variables, look at boxplots vs saleprice. You have sample Python code to help with the EDA and below are some links that will demonstrate the relationships for a different dataset, building_prices.
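The screening step described above can also be done numerically. With a single predictor, the R squared of the linear regression equals the squared Pearson correlation, so ranking quantitative variables by r squared gives a rough first answer to which one explains the most variability in saleprice. The following is a minimal sketch using only the Python standard library; the column names and the five rows are made up for illustration and are not taken from the Ames data or from the course shell code.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def rank_single_predictors(columns, response):
    """Return (name, r_squared) pairs, largest share of variance explained first."""
    scores = [(name, pearson_r(values, response) ** 2)
              for name, values in columns.items()]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

# Made-up illustrative values, NOT real Ames records.
saleprice = [120000, 150000, 180000, 210000, 260000]
candidates = {
    "living_area": [900, 1100, 1400, 1600, 2100],  # hypothetical column name
    "lot_frontage": [60, 55, 80, 62, 75],          # hypothetical column name
}

ranking = rank_single_predictors(candidates, saleprice)
for name, r_squared in ranking:
    print(f"{name}: r^2 = {r_squared:.3f}")
```

A ranking like this is only a screen. The scatter plots still matter, because a curved relationship or a few extreme points can make r squared misleading.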
  • 4. For the boxplots with Shiny: http://melvin.shinyapps.io/SboxPlot For the scatterplots with Shiny: http://melvin.shinyapps.io/SScatter/ 3. Begin Writing Python Code Start with the shell code and improve on the model provided. Single Variable Logistic Regression EDA 1. Become Familiar With the Data In 411 you will have an introduction to logistic regression and will again be asked to perform an EDA. See the file credit data for more info. Make sure you recognize which variables are quantitative and which are categorical. And, for several of
  • 5. these variables, what is the level of measurement? 2. Look at Plots of the Data For logistic regression, the response variable is of the type yes/no. In this dataset it is coded as good/bad. So, the EDA may include histograms for quantitative variables with a separate histogram for each of the response values. For numerically coded explanatory categorical variables, if the response good/bad is recoded as 0/1 then the mean of the response variable for each of the categories will indicate if there is a relationship. For the histograms with Shiny: Click here For the means with Shiny: Click here 3. Begin Writing Python Code OK. You have looked at the plots. Which variable do you think will be most useful for predicting or explaining bad credit? After you answer this question, begin
  • 6. writing Python code to see if you can replicate these plots. Histograms with Shiny: http://melvin.shinyapps.io/SHisto Means with Shiny: http://melvin.shinyapps.io/SCRMeans

The data set CREDIT contains information on 1000 customers. There are 21 variables in the data set:

Name | Model Role | Measurement Level | Description
AGE | Input | Interval | Age in years
AMOUNT | Input | Interval | Amount of credit requested
CHECKING | Input | Nominal or Ordinal | Balance in existing checking account: 1 = less than 0 DM; 2 = more than 0 but less than 200 DM; 3 = at least 200 DM; 4 = no checking account
COAPP | Input | Nominal | Other debtors or guarantors: 1 = none; 2 = co-applicant; 3 = guarantor
DEPENDS | Input | Interval | Number of dependents
DURATION | Input | Interval | Length of loan in months
EMPLOYED | Input | Ordinal | Time at present employment: 1 = unemployed; 2 = less than 1 year; 3 = at least 1, but less than 4 years; 4 = at least 4, but less than 7 years; 5 = at least 7 years
EXISTCR | Input | Interval | Number of existing accounts at this bank
FOREIGN | Input | Binary | Foreign worker: 1 = Yes; 2 = No
GOOD_BAD | Target | Binary | Credit Rating Status (good or bad)
HISTORY | Input | Ordinal | Credit History: 0 = no loans taken / all loans paid back in full and on time; 1 = all loans at this bank paid back in full and on time; 2 = all loans paid back on time until now; 3 = late payments on previous loans; 4 = critical account / loans in arrears at other banks
HOUSING | Input | Nominal | Rent/Own: 1 = rent; 2 = own; 3 = free housing
INSTALLP | Input | Interval | Debt as a percent of disposable income
JOB | Input | Ordinal | Employment status: 1 = unemployed / unskilled non-resident; 2 = unskilled resident; 3 = skilled employee / official; 4 = management / self-employed / highly skilled employee / officer
MARITAL | Input | Nominal | Marital status and gender: 1 = male – divorced/separated; 2 = female – divorced/separated/married; 3 = male – single; 4 = male – married/widowed; 5 = female – single
OTHER | Input | Nominal or Ordinal | Other installment loans: 1 = bank; 2 = stores; 3 = none
PROPERTY | Input | Nominal or Ordinal | Collateral property for loan: 1 = real estate; 2 = if not 1, building society savings agreement / life insurance; 3 = if not 1 or 2, car or others; 4 = unknown / no property
PURPOSE | Input | Nominal | Reason for loan request: 0 = new car; 1 = used car; 2 = furniture/equipment; 3 = radio / television; 4 = domestic appliances; 5 = repairs; 6 = education; 7 = vacation; 8 = retraining; 9 = business; x = other
RESIDENT | Input | Interval | Years at current address
SAVINGS | Input | Nominal or Ordinal | Savings account balance: 1 = less than 100 DM; 2 = at least 100, but less than 500 DM; 3 = at least 500, but less than 1000 DM; 4 = at least 1000 DM; 5 = unknown / no savings account
TELEPHON | Input | Binary | Telephone: 1 = none; 2 = yes, registered under the customer’s name

  • 12. Exploratory Data Analysis (EDA) Ratner [1] describes ‘data mining’ “as any process that finds unexpected structures in data and uses the EDA framework to ensure that the process explores the data, not exploits it.” Unexpected suggests that the word exploratory is very appropriate to this process. Tukey [2], in his book and in many presentations, gave structure to EDA. Others have extended it to include ‘big’ data. Big data has arisen from our ability to capture huge datasets, store them on servers cost effectively, and analyze them with software that can handle them. Shiny Apps To learn more about Shiny applications with RStudio click on the link below:
  • 13. http://rstudio.github.io/shiny/tutorial/ Types of Data Quantitative data are numeric and represent counts or measurements. Categorical data are names or labels such as a,b,c but can often be shown as 1,2,3. They do not suggest counts or measurements. Discrete data are finite or countable numeric data. Continuous data are values that represent a continuous scale of measurement. A nominal level of measurement suggests names or categories. There is no apparent order suggested. Ordinal level data suggest a sequential ordering but mathematical calculations should not be performed on this data. Interval level data are ordinal plus the difference between two data values is
  • 14. meaningful, but there is no true zero level. Ratio level data are interval data with a true zero level, so both differences and ratios may be calculated. References: 1. Ratner, B. (2012). Statistical and Machine-Learning Data Mining: Techniques for Better Predictive Modeling and Analysis of Big Data (2nd ed.). New York: CRC Press [ISBN-13: 9781439860915] 2. Tukey, J.W. (1977). Exploratory Data Analysis. Addison-Wesley. Ames Housing OLS Regression Project (300 Points) The ames_train data set contains approximately 2039 records. See the data
  • 15. description in the file Introduction_to_Ames_Housing_Data. This is a random selection of training data selected from the full dataset. Note, the index numbers have been randomized and the split between train and test is also random so you will not be able to match the test data with sale price values. You are to use OLS (“Linear”) Regression to predict the sale price for homes in the ames_test_sfam dataset by building two models using the ames_train data. Note, the test data set is single family homes, the training data is all homes. DELIVERABLES zip files). Your write up should have five sections. Each section should have enough detail so that I can follow your logic and someone else can replicate your work. (150
  • 16. Points) analysis. I should be able to run this file and get all the output that you got. ames_test_sfam. There will be only two columns in this file: index and p_saleprice. You will be graded on how your model performs versus my model and those of other students in the class. submitting your csv file to kaggle at https://www.kaggle.com/t/0415308f8dd54fc4abed54bef75448bf It is OK to submit to Kaggle many times. You will have to tell me your alias for Kaggle so I can see the score.
  • 17. WRITE UP (200 POINTS) 1. First Steps (40 points) Describe the ames_train data set so that I am convinced you understand it. Use my shell code as a start to explore the data. Apply your creativity and go from there. If you know how to do pivot tables in Excel, they are a great tool for Exploratory Data Analysis (EDA). EDA was well established by John Tukey. He was a great advocate for it and developed much of what we do today.
  • 18. Knowing your data typically consists of three components: (a) a data survey, (b) a data quality check, and (c) an initial exploratory data analysis. (a) A Data Survey - Take a broad overview of the Ames housing data set. Read over the data documentation. What data do you have, and what is it supposed to represent? - In the linear regression component of this course you build linear regression models to predict the value of a property (single family home). Do you have the right data to properly address the problem? Are there observations in the data that should be excluded? - What kinds of problems can you properly address given the data that you have? In particular if you were to build a regression model with the variable SalePrice as the response variable, what types of properties would you be valuing? Be careful about what you are doing here.
  • 19. (b) Define the Sample Population - When building statistical models you have to define the population of interest, and then sample from THAT population. Frequently you will not actively perform the sampling function. Instead, the data will be made available and you will have to sample from it retrospectively, i.e. you will need to carve out the population of interest. In this assignment the objective is to be able to provide estimates of home values for 'typical' homes in Ames, Iowa. You may not be able to define what 'typical' is, but you can use the data to find out what is atypical. Any values which are not atypical are then considered to be typical. - Define the appropriate sample population for your statistical problem. Hint: You are building regression models for the response variable
  • 20. SalePrice. Are all properties the same? Would you want to include an apartment building in the same sample as a single family residence? Would you want to include a warehouse or a shopping center in the same sample as a single family residence? Would you want to include condominiums in the same sample as a single family residence? - Define your sample using ‘drop conditions’. Create a waterfall for the drop conditions and include it in your report so that it is clear to any reader what you are excluding from the data set when defining your sample population. The definition of your sample data should be clearly noted in your assignment report. (c) A Data Quality Check - In practice your data will not be 'clean'. You will need to examine your data for errors and outliers. Errors will not always show as outliers, and outliers are not
  • 21. necessarily errors. - If you have a data dictionary that states the set of proper values for each field, then you will want to check your data against the data dictionary. - If you do not have a data dictionary, then you will need to reason and explore your way to a proper data set. Example 1: In this project you will be modeling the sales price of housing transactions. It should be obvious that none of these sales prices should be zero or negative. Observations with a zero or negative sales price should logically be considered to be errors. Example 2: Suppose we had a 'small' number of housing transactions with a sale price over one million dollars, should we consider these sales prices to be valid? In this case these values could be valid data points, which would make them outliers, or they could be errors, such as 140,000.00 entered as 1,400,000. In either case
  • 22. they are not relevant data points if the objective is to model the 'typical' home price for the area. 2. EDA (30 Points) Pick ten variables from the data quality check to explore in your initial exploratory data analysis. Perform an initial exploratory data analysis. How do you perform an exploratory data analysis for continuous versus discrete (or categorical) data? Consider the use of scatterplots, scatterplot smoothers such as LOESS, and boxplots to produce relevant graphics when appropriate. Note that you are particularly interested in the relationships between the response variable and the predictor variables. Suggest you split your EDA into two sections in your report – one section for continuous variables and one section for discrete variables.
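For the discrete (or categorical) half of the EDA, a boxplot of the response within each category can be backed up with the numbers the boxplot draws. The sketch below groups a response by category level and reports the count, quartiles and extremes using only the Python standard library. The category labels and prices are made up for illustration, and the choice of the "inclusive" quartile method is an assumption; plotting libraries may compute quartiles slightly differently.

```python
from collections import defaultdict
from statistics import median, quantiles

def boxplot_summary(categories, response):
    """Group the response by category level and return boxplot-style statistics."""
    groups = defaultdict(list)
    for level, value in zip(categories, response):
        groups[level].append(value)

    summary = {}
    for level, values in groups.items():
        if len(values) >= 2:
            q1, _, q3 = quantiles(values, n=4, method="inclusive")
        else:
            # A single observation has no spread; report it for both quartiles.
            q1 = q3 = values[0]
        summary[level] = {
            "n": len(values),
            "min": min(values),
            "q1": q1,
            "median": median(values),
            "q3": q3,
            "max": max(values),
        }
    return summary

# Made-up illustrative values, NOT real Ames records.
neighborhood = ["A", "B", "A", "B", "A", "B", "B"]  # hypothetical category levels
saleprice = [100000, 200000, 120000, 220000, 140000, 260000, 300000]

summary = boxplot_summary(neighborhood, saleprice)
for level in sorted(summary):
    print(level, summary[level])
```

If the medians differ clearly across levels and the boxes overlap little, the categorical variable is a candidate predictor. Levels with very small counts deserve caution.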
  • 23. 3. BUILD MODELS (100 Points) Build at least four different LINEAR REGRESSION models. The first model should be a simple (single prediction variable) model. Find the best single variable model. The next model should be a multiple regression model with two predictor variables. Find the best two variable model. You do not need to build more complex models for this assignment. More complex models will be the topic for hw02. Show all of your models and the statistical significance of the input variables. Discuss the quality of fit, R squared and adjusted R squared, parsimony and anything else you can think of that might be of value to share. Discuss the coefficients in the model you select, do they make
  • 24. sense? Are you keeping the model even though it is counterintuitive? 4. SELECT MODELS (20 Points) Decide on the criteria for selecting the “Best Model”. Will you use a metric such as Adjusted R-Square or AIC? Will you select a model with slightly worse performance if it makes more sense or is more parsimonious? Discuss why you selected your model. Put the metrics in a table to display the results. 5. WRITE MODEL FORMULA (10 Points) Write a mathematical formula that will show the model you selected. Explain your formula. Make sure you include this as a section in your report. Do not
  • 25. expect that I will search your report to find it. This step should allow someone else to deploy your model. The variable with the predicted saleprice should be named: p_saleprice SCORED DATA FILE (100 POINTS) Use the python model that you selected. Score the data file ames_test_sfam. Overall scoring for your model is based on providing a prediction for every record in the test data. Make sure you have not deleted any records in the test data and that none of your predictions are out of range. Create a file that has only TWO variables for each record: index p_saleprice
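The scored file requirements above (one row per test record, exactly two columns, no out-of-range predictions) are easy to check in code before submitting. The following is a minimal sketch using the standard csv module. The function name write_scored_file is hypothetical, the sample index values and predictions are made up, and treating any non-positive prediction as out of range is an assumption, since the assignment does not define the valid range.

```python
import csv
import io

def write_scored_file(handle, test_indexes, predictions, low=0.0, high=None):
    """Write index/p_saleprice rows after checking coverage and range.

    test_indexes: every index value found in the test data, in order.
    predictions:  dict mapping index -> predicted sale price.
    low, high:    assumed plausibility bounds (not given by the assignment).
    """
    missing = [i for i in test_indexes if i not in predictions]
    if missing:
        raise ValueError(f"no prediction for index values: {missing}")

    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(["index", "p_saleprice"])
    for i in test_indexes:
        value = predictions[i]
        if value <= low or (high is not None and value > high):
            raise ValueError(f"prediction {value} for index {i} is out of range")
        writer.writerow([i, round(value, 2)])

# Made-up illustrative values; a real run would write to a .csv file instead.
buffer = io.StringIO()
write_scored_file(buffer, [101, 102], {101: 152340.5, 102: 98760.0})
print(buffer.getvalue())
```

To write an actual file, pass a handle opened with open(path, "w", newline="") in place of the in-memory buffer.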
  • 26. The first variable, index, will allow me to match my grading key to your predicted value. If I cannot do this, you won’t get a grade. So please include this value. The second value, p_saleprice is the predicted price for a property per your model. Your values will be compared against … redict the Average value for everybody (MEAN) If your model is not better than simply using an AVERAGE value, you will lose points. BONUS If you want Bonus Points, write a brief section at the top of your Write Up document and tell me exactly what you did and how many points you are attempting.
  • 27. If I cannot see your Bonus work, I cannot give you credit. Bonus is difficult to grade and I don’t have time to go back looking for it. If you don’t tell me it’s there, I cannot give you points. The policy with Bonus is: All Sales are Final ! the results the same? Are there any differences? run with it. I might give you points. PENALTY BOX e file
  • 28. names of any files you hand in you hand in 1 Assignment Template New and Revised for September 2017 In the real world, you will be building predictive models and doing analytic work. But that is not your only function. After you do the work, you need to explain it to other people (most of whom will not understand analytics). Therefore, it is critical that you are able to explain your results in such a way that non analytic people can
  • 29. understand it. If you dump 20 or 30 pages of output on the person and say “it’s all in here”, then they won’t read it. In fact, that person will likely just ignore your results and go about their day-to-day business without giving your work a second thought. This is not a desirable outcome. You must write your report so that it can be understood by others and it must contain enough detail that it can be replicated. In my work I am often handed the work of others and asked to provide a critique. If I am unable to replicate their work because it is lacking in detail then the critique will be very negative. It is not enough that you build a great model. You also have to sell it. DOs AND DON’Ts: our document in PDF Format
  • 30. “Homework_03_Fred_Smith.pdf”) plaining the output discussion format mework_03.pdf” scroll through it discussing it diagram at the end of the document … UNLESS IT IS ABSOLUTELY NECESSARY to do that) 2
  • 31. Example Report (with a lot of commentary) Assignment #3 Fred Smith PREDICT 410 Section 58 INTRODUCTION The introduction should describe the purpose of the assignment and what you are going to do in order to complete the assignment. It should be clear that you understand why you are performing certain steps in an analysis. BAD INTRODUCTION The purpose of this report is to analyze baseball data. GOOD INTRODUCTION The purpose of the assignment is to analyze data from somewhere in order to predict the number of something. This will be accomplished by generating simple and multivariate regression models using different variable
  • 32. selection techniques including, but not limited to, Forward, Stepwise, and Backward regression. From these techniques, the best model will be selected. This best model will then be further analyzed to determine if it is an adequate model to predict or if further analysis is necessary. Make sure you follow the assignment instructions. To get points for each of these sections, you have to show them in your report. Each assignment will require a different type report. This template is fairly generic so adjust to assignment instructions. If I don’t see the section in your report you will get 0 points for it. 1. Data Exploration Important step. This is where you make or break model building. Spend time on this.
  • 33. 3 o many charts, Bar Charts, Box Plots, Scatter Plots of the data variables?) be imputed “fixed”? will cause test records to be deleted, fix them. 2. Data Preparation Also, a critical section. Experiment with this step. Be creative. I like creative ideas even if they don’t work. Fix outliers
  • 34. to create new variables 3. Build Models These are instructions from Assignment 1 but will be similar in the other assignments. Build at least two different LINEAR REGRESSION models using different variables. Show all of your models and the statistical significance of the input variables. Discuss the coefficients in the model, do they make sense? Are you keeping the model even though it is counter intuitive? Why? Display the Python results for your assignment and comment on the results. Your
  • 35. discussion of the results should be intertwined with (or linked to) the Python output, i.e. the discussion should be on or near the page containing the output. You should not be showing a lot of unnecessary Python output. Discuss the results thoroughly. Include such discussion points as: g different be done? GOOD DESCRIPTION OF A DIAGRAM The analysis continues by examining the plot of the residual values versus the
  • 36. predicted values given in Figure 1. In this type of analysis, a visual inspection of the chart is conducted to determine whether or not any patterns exist in the residuals. Some patterns might include errors that increase or decrease with larger predicted values or some other type of pattern such as a curve. In an ideal situation, the residuals will appear to be random. An inspection of Figure 1 suggests that the data points are randomly distributed and no obvious patterns exist in the data. Therefore, there are no immediate concerns with the distribution of the errors. Figure 1 Housing Data Predicted vs Residual Graph BAD DESCRIPTION OF A DIAGRAM I examined the output at the end of the document. There are no patterns in the data. GOOD DESCRIPTION OF AN EQUATION
  • 37. The model chosen from the different candidates was the XXX model because it had the highest Adjusted R-Squared value and the lowest AIC and SBC values. Using these metrics, it was far superior to the other models. The formula given for the predicted sale price is: p_saleprice = 50000 + 5000 * X1 + 6000 * X2 + 3000 * X3, where X1 = LotFrontage, X2 = LotArea and X3 = OverallCond. The formula makes intuitive sense for the most part because sale price coefficients reflect that size and condition add to the value of a property. However, the data should be analyzed for multi-collinearity
  • 38. which can result in sign changes. Also, it might be wise to remove the variable from the model if no explanation can be found. BAD DESCRIPTION OF AN EQUATION This is the formula I chose. p_saleprice = 10.4901 + 3.11867 * X1 + 5.24082 * X2 + 1.76700 * X4 + 2.65534 * X5 - 3.21636 * X6 - 1.94656 * X8 + 2.35175 * X9 Additionally, it is important to note that this model was developed on data from XXXX years, so it is unknown whether the model will translate into future years. Further analysis will need to be done to determine whether this model will be robust and translate outside the XXXX year time window. NOTE: This is a made-up formula, so don’t go investing in housing in New York
  • 39. based on this model. Come to think of it, it’s probably not a good idea to invest in New York unless you are very familiar with New York. 4. Select Models Decide on the criteria for selecting the “Best Model”. Will you use a metric such as Adjusted R-Square or AIC? Will you select a model with slightly worse performance if it makes more sense or is more parsimonious? Discuss why you selected your model. Put the results in a table to display and discuss. 5. Model Formula If you expect points for this step, show it in your report and explain it. You will get 0 points if it is somewhere in your code and left out of the report. Don’t expect that I will search your code for it. Write python code that will score new data and predict the sale price. The variable with the predicted sale price should be named:
  • 40. p_saleprice 6. Scored Data File Make sure you submit as a CSV file. Use the standalone program that you wrote in the previous section. Score the data file ames_test. Create a file that has only TWO variables for each record: index p_saleprice The first variable, index, will allow me to match my grading key to your predicted value. If I cannot do this, you won’t get a grade. The second variable, p_saleprice, is the predicted sale price of a home based on the data given to you.
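As a minimal sketch of the file layout described above, the following writes a CSV with only the two required columns. The function name `write_scored_file` and the sample values are hypothetical; only the column names `index` and `p_saleprice` come from the assignment text.

```python
import csv
import io


def write_scored_file(handle, indices, predictions):
    """Write exactly two columns, index and p_saleprice, one row per record."""
    if len(indices) != len(predictions):
        raise ValueError("indices and predictions must have the same length")
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(["index", "p_saleprice"])
    for idx, price in zip(indices, predictions):
        writer.writerow([idx, round(float(price), 2)])


# Hypothetical values, for illustration only
buffer = io.StringIO()
write_scored_file(buffer, [1, 2, 3], [150000.0, 212345.678, 98000])
scored_text = buffer.getvalue()
print(scored_text)
```

To write to disk instead of an in-memory buffer, pass a file opened with `open(path, "w", newline="")` as `handle`.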
  • 41. Your values will be compared against … body (MEAN) If your model is not better than simply using an AVERAGE value, you will lose points. CONCLUSION: A short wrap up of the assignment including a discussion of results and what was learned. GOOD CONCLUSION: Several models were developed to predict the sale price of a home using Ames
  • 42. Housing data. The best model was derived using XXXX. Although there were no problems with the model from a statistical standpoint, the winning model did have a sign issue with one of the variables where seemingly bad construction would result in a higher sale price. This issue needs further investigation but is beyond the scope of this document. BAD CONCLUSION: I built some models that were good and I learned a lot. CODE: Attach as a separate file or paste your code in at the end. BONUS Place all bonus work at the end of the document. Clearly identify what you are
  • 43. doing and how many points you are trying to earn. Exploratory Data Analysis (EDA) Assignment #1 Jahee Koo PREDICT 410 Section 58 (This is just a template example, so all contents should be changed based on the written instructions.) INTRODUCTION The purpose of the assignment is to analyze data from somewhere in order to predict the number of something. This will be accomplished by generating simple and multivariate regression models using different variable selection techniques including, but not limited to, Forward, Stepwise, and Backward regression. From these techniques, the best model will be selected. This best model will then be further analyzed to determine if it is an adequate model to predict or if further analysis is necessary. Make sure you follow the assignment instructions. To get points for each of these sections, you have to show them in your report. Each assignment will require a different type of report. This template is fairly generic, so adjust it to the assignment instructions. 1. Data Exploration of the data variables?) be imputed “fixed”?
  • 44. fix them. 2. Data Preparation variables (such as ratios or adding or multiplying) to create new variables 3. Build Models These are instructions from Assignment 1 but will be similar in the other assignments. Build at least two different LINEAR REGRESSION models using different variables. Show all of your models and the statistical significance of the input variables. Discuss the coefficients in the model, do they make sense? Are you keeping the model even though it is counter intuitive? Why? Display the Python results for your assignment and comment on the results. Your discussion of the results should be intertwined with (or linked to) the Python output, i.e. the discussion should be on or near the page containing the output. You should not be showing a lot of unnecessary Python output. Discuss the results thoroughly. Include such discussion points What is observed in the graph / table / output nse?
  • 45. GOOD DESCRIPTION OF A DIAGRAM The analysis continues by examining the plot of the residual values versus the predicted values given in Figure 1. In this type of analysis, a visual inspection of the chart is conducted to determine whether or not any patterns exist in the residuals. Some patterns might include errors that increase or decrease with larger predicted values or some other type of pattern such as a curve. In an ideal situation, the residuals will appear to be random. An inspection of Figure 1 suggests that the data points are randomly distributed and no obvious patterns exist in the data. Therefore, there are no immediate concerns with the distribution of the errors. Figure 1 Housing Data Predicted vs Residual Graph GOOD DESCRIPTION OF AN EQUATION The model chosen from the different candidates was the XXX model because it had the highest Adjusted R-Squared value and the lowest AIC and SBC values. Using these metrics, it was far superior to the other models. The formula given for the predicted sale price is: p_saleprice = 50000 + 5000 * X1 + 6000 * X2 + 3000 * X3, where X1 = LotFrontage, X2 = LotArea and X3 = OverallCond. The formula makes intuitive sense for the most part because sale price coefficients reflect that size and condition add to the value of a property. However, the data should be analyzed for multi-collinearity which can result in sign changes. Also, it might be wise to remove the variable from the model if no explanation can be found. 4. Select Models
  • 46. Decide on the criteria for selecting the “Best Model”. Will you use a metric such as Adjusted R-Square or AIC? Will you select a model with slightly worse performance if it makes more sense or is more parsimonious? Discuss why you selected your model. Put the results in a table to display and discuss. 5. Model Formula Write python code that will score new data and predict the sale price. The variable with the predicted sale price should be named: p_saleprice 6. Scored Data File Make sure you submit as a csv file. Use the stand alone program that you wrote in the previous section. Score the data file ames_test. Create a file that has only TWO variables for each record: index p_saleprice CONCLUSION Several models were developed to predict the sale price of a home using Ames Housing data. The best model was derived using XXXX. Although there were no problems with the model from a statistical standpoint, the winning model did have a … CODE: Attach as a separate file or paste your code in at the end. BONUS Place all bonus work at the end of the document. Clearly identify what you are doing and how many points you are trying to earn.
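The grading note earlier in the template says a model should do better than simply using an average value. One rough way to check this before submitting is to compare the model's error on held-out records with the error of a constant prediction equal to the mean training sale price. The sketch below uses only the standard library; the function names and numbers are hypothetical, and RMSE is an assumed choice of metric, since the template does not say which metric the grader uses.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def beats_mean_baseline(train_prices, holdout_actual, holdout_predicted):
    """Compare model RMSE with the RMSE of always predicting the training mean."""
    baseline = sum(train_prices) / len(train_prices)
    baseline_rmse = rmse(holdout_actual, [baseline] * len(holdout_actual))
    model_rmse = rmse(holdout_actual, holdout_predicted)
    return model_rmse < baseline_rmse, model_rmse, baseline_rmse


# Made-up numbers, for illustration only
train_prices = [100000, 150000, 200000, 250000]
holdout_actual = [120000, 180000, 240000]
holdout_predicted = [125000, 170000, 230000]

better, model_rmse, baseline_rmse = beats_mean_baseline(
    train_prices, holdout_actual, holdout_predicted
)
print(better, round(model_rmse, 2), round(baseline_rmse, 2))
```

The check is only meaningful on records the model was not fitted to, so some of the training data would need to be held back for it.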
  • 47.
    # -*- coding: utf-8 -*-
    """
    Created on Tue Jan 16 22:58:46 2018
    @author: Paul Lee
    """
    # Using Linear Regression to predict
    # family home sale prices in Ames, Iowa

    # Packages
    import pandas as pd
    import numpy as np
    import seaborn as sns  # needed for the sns.* plotting calls later in the script
    import statsmodels.api as sm
    import statsmodels.formula.api as smf
    import matplotlib.pyplot as plt
    from scipy import stats
    from sklearn import linear_model, metrics

    # Set some options for the output
    pd.set_option('display.notebook_repr_html', False)
    pd.set_option('display.max_columns', 40)
    pd.set_option('display.max_rows', 10)
    pd.set_option('display.width', 120)

    # Read in the data
    train = pd.read_csv('C:/Users/Jahee Koo/Desktop/AMES_TRAIN.csv')
    test = pd.read_csv('C:/Users/Jahee Koo/Desktop/AMES_TEST_SFAM.csv')

    # Convert all variable names to lower case
    train.columns = [col.lower() for col in train.columns]
    test.columns = [col.lower() for col in test.columns]
  • 48.
    # EDA
    print('\n----- Summary of Train Data -----\n')
    print('Object type: ', type(train))
    print('Number of observations & variables: ', train.shape)

    # Variable names and information
    train.info()  # info() prints its own report and returns None
    print(train.dtypes.value_counts())

    # Descriptive statistics
    print(train.describe())

    # Show a portion of the beginning of the DataFrame
    print(train.head(10))
    print(train.shape)

    # Count missing values and zeros per column (printed so they show in a script)
    print(train.loc[:, train.isnull().any()].isnull().sum().sort_values(ascending=False))
    print(train[train == 0].count().sort_values(ascending=False))

    t_null = train.isnull().sum()
    t_zero = train[train == 0].count()
    t_good = train.shape[0] - (t_null + t_zero)

    # Stacked bars per column: red = null, yellow = zero, green = other values
    xx = range(train.shape[1])
    plt.figure(figsize=(8,8))
    plt.bar(xx, t_good, color='g', width=1, bottom=t_null+t_zero)
    plt.bar(xx, t_zero, color='y', width=1, bottom=t_null)
    plt.bar(xx, t_null, color='r', width=1)
    plt.show()

    print(t_null[t_null > 1000].sort_values(ascending=False))
    print(t_zero[t_zero > 1900].sort_values(ascending=False))
  • 49.
    # Drop columns that are mostly null or mostly zero
    drop_cols = (t_null > 1000) | (t_zero > 1900)
    train = train.loc[:, ~drop_cols]  # ~ negates a boolean mask; unary minus fails on booleans

    # Some quick plots of the data
    train.hist(figsize=(18,14))
    train.plot(
        kind='box', subplots=True, layout=(5,9),
        sharex=False, sharey=False, figsize=(18,14)
    )
    train.plot.scatter(x='grlivarea', y='saleprice')
    train.boxplot(column='saleprice', by='yrsold')
    train.plot.scatter(x='subclass', y='saleprice')
    train.boxplot(column='saleprice', by='overallqual')
    train.boxplot(column='saleprice', by='overallcond')
    train.plot.scatter(x='overallcond', y='saleprice')
    train.plot.scatter(x='lotarea', y='saleprice')

    # Replace NaN values with medians in train data (numeric columns),
    # then fill the remaining columns with their most frequent value
    train = train.fillna(train.median(numeric_only=True))
    train = train.apply(lambda med: med.fillna(med.value_counts().index[0]))
    print(train.head())

    t_null = train.isnull().sum()
    t_zero = train[train == 0].count()
    t_good = train.shape[0] - (t_null + t_zero)
    xx = range(train.shape[1])
    plt.figure(figsize=(14,14))
    plt.bar(xx, t_good, color='g', width=.8, bottom=t_null+t_zero)
    plt.bar(xx, t_zero, color='y', width=.8,
  • 50.
            bottom=t_null)
    plt.bar(xx, t_null, color='r', width=.8)
    plt.show()

    print(train.bldgtype.unique())
    print(train.housestyle.unique())

    # Goal is typical family home
    # Drop observations too far from typical: anything above Q3 + 1.5 * IQR,
    # applied to one variable after another
    for col in ['saleprice', 'grlivarea', 'lotarea', 'totalbsmtsf']:
        q1, q3 = np.percentile(train[col], [25, 75])
        drop_rows = train[col] > q3 + 1.5 * (q3 - q1)
        train = train.loc[~drop_rows, :]  # ~ negates the boolean mask

    # Replace 0 values of living area with the median of the non-zero values
    m = np.median(train.grlivarea[train.grlivarea > 0])
    train = train.replace({'grlivarea': {0: m}})
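The outlier rule on this slide drops any observation above the third quartile plus 1.5 times the interquartile range. A small standard-library sketch of that upper fence on made-up numbers may make the rule easier to check by hand. The helper names and the data are hypothetical; the linear-interpolation percentile is written to mirror the default behavior of `np.percentile`.

```python
def percentile(values, q):
    """Percentile with linear interpolation between the two nearest ranks."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lower = int(pos)
    upper = min(lower + 1, len(ordered) - 1)
    frac = pos - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * frac


def upper_fence(values):
    """Q3 + 1.5 * IQR, the cutoff above which a value is treated as atypical."""
    q1 = percentile(values, 25)
    q3 = percentile(values, 75)
    return q3 + 1.5 * (q3 - q1)


# Made-up values, for illustration only
prices = [100, 110, 120, 130, 140, 150, 160, 170, 500]
fence = upper_fence(prices)
kept = [p for p in prices if p <= fence]
print(fence, kept)
```

Because the script applies the rule to one variable after another, each fence is computed on the rows that survived the previous filter, so the order of the variables can change which rows are dropped.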
  • 51.
    # Discrete variables
    plt.figure()
    g = sns.PairGrid(train,
                     x_vars=["bldgtype", "exterqual", "centralair",
                             "kitchenqual", "salecondition"],
                     y_vars=["saleprice"],
                     aspect=.75, height=3.5)  # 'size' was renamed 'height' in seaborn 0.9
    g.map(sns.violinplot, palette="pastel")

    # Print correlations (numeric columns only; numeric_only needs pandas >= 1.5)
    corr_matrix = train.corr(numeric_only=True)
    print(corr_matrix["saleprice"].sort_values(ascending=False).head(10))
    print(corr_matrix["saleprice"].sort_values(ascending=True).head(10))

    ## Pick 10 variables to focus on (plus the response, saleprice)
    pick_10 = [
        'saleprice',
        'grlivarea',
        'overallqual',
        'garagecars',
        'yearbuilt',
        'totalbsmtsf',
        'salecondition',
        'bldgtype',
        'kitchenqual',
        'exterqual',
        'centralair'
    ]
    corr = train[pick_10].corr(numeric_only=True)
  • 52.
    # Mask the upper triangle so each correlation is shown once
    blank = np.zeros_like(corr, dtype=bool)  # np.bool was removed from NumPy
    blank[np.triu_indices_from(blank)] = True
    fig, ax = plt.subplots(figsize=(10, 10))
    corr_map = sns.diverging_palette(255, 133, l=60, n=7,
                                     center="dark", as_cmap=True)
    sns.heatmap(corr, mask=blank, cmap=corr_map, square=True,
                vmax=.3, linewidths=0.25, cbar_kws={"shrink": .5})

    # Quick plots
    for variable in pick_10[1:]:
        if train[variable].dtype.name == 'object':
            # Categorical predictor: strip plot and box plot
            plt.figure()
            sns.stripplot(y="saleprice", x=variable, data=train, jitter=True)
            plt.show()
            # catplot replaces the removed factorplot and creates its own figure
            sns.catplot(y="saleprice", x=variable, data=train, kind="box")
            plt.show()
        else:
            # Numeric predictor: scatter plot against sale price
            fig, ax = plt.subplots()
            ax.set_ylabel('Sale Price')
            ax.set_xlabel(variable)
            scatter_plot = ax.scatter(
                y=train['saleprice'],
                x=train[variable],
                facecolors='none',
                edgecolors='blue'
            )
            plt.show()

    sns.catplot(x="bldgtype", y="saleprice", col="exterqual", row="kitchenqual",
                hue="overallqual", data=train, kind="swarm")
  • 53.
    plt.figure()
    sns.countplot(y="overallqual", hue="exterqual", data=train, palette="Greens_d")

    # Run simple models, adding predictors one step at a time
    model1 = smf.ols(formula='saleprice ~ grlivarea', data=train).fit()
    model2 = smf.ols(formula='saleprice ~ grlivarea + overallqual', data=train).fit()
    model3 = smf.ols(formula='saleprice ~ grlivarea + overallqual + garagecars',
                     data=train).fit()
    model4 = smf.ols(formula='saleprice ~ grlivarea + overallqual + garagecars + yearbuilt',
                     data=train).fit()
    model5 = smf.ols(formula=('saleprice ~ grlivarea + overallqual + garagecars + yearbuilt'
                              ' + totalbsmtsf + kitchenqual + exterqual + centralair'),
                     data=train).fit()

    print('\n\nmodel 1----------\n', model1.summary())
    print('\n\nmodel 2----------\n', model2.summary())
    print('\n\nmodel 3----------\n', model3.summary())
    print('\n\nmodel 4----------\n', model4.summary())
    print('\n\nmodel 5----------\n', model5.summary())

    # Collect fit statistics for each model in one table
    out = [model1, model2, model3, model4, model5]
    out_df = pd.DataFrame()
    out_df['labels'] = ['rsquared', 'rsquared_adj', 'fstatistic', 'aic']
    i = 0
    for model in out:
        train['pred'] = model.fittedvalues
        plt.figure()
  • 54.
        train.plot.scatter(x='saleprice', y='pred', title='model' + str(i+1))
        plt.show()
        out_df['model' + str(i+1)] = [
            model.rsquared.round(3),
            model.rsquared_adj.round(3),
            model.fvalue.round(3),
            model.aic.round(3)
        ]
        i += 1
    print(out_df)  # the comparison table was built but never displayed

    train['predictions'] = model5.fittedvalues
    print(train['predictions'])

    # Clean test data
    test.info()
    # Fill numeric NaNs with column medians. The original sliced rows with
    # test[3:], which left the first three rows unfilled.
    test = test.fillna(test.median(numeric_only=True))
    test["kitchenqual"] = test["kitchenqual"].fillna(test["kitchenqual"].value_counts().index[0])
    test["exterqual"] = test["exterqual"].fillna(test["exterqual"].value_counts().index[0])
    m = np.median(test.grlivarea[test.grlivarea > 0])
    test = test.replace({'grlivarea': {0: m}})
    print(test)

    # Score the test data; replace any negative prediction with the
    # smallest sale price seen in the training data
    test_predictions = model5.predict(test)
    test_predictions[test_predictions < 0] = train['saleprice'].min()
    print(test_predictions)
  • 55.
    # Convert the predictions to a data frame, then merge with the index for the test data
    dat = {'p_saleprice': test_predictions}
    df1 = test[['index']]
    df2 = pd.DataFrame(data=dat)
    # join_axes was removed from pd.concat; reindex the result instead
    submission = pd.concat([df1, df2], axis=1).reindex(df1.index)
    print(submission)
    # index=False keeps the file to the two required columns
    submission.to_csv('C:/Users/Jahee Koo/Desktop/hw01_predictions.csv', index=False)

    NAME: AmesHousing.txt
    TYPE: Population
    SIZE: 2930 observations, 82 variables
    ARTICLE TITLE: Ames Iowa: Alternative to the Boston Housing Data Set
    DESCRIPTIVE ABSTRACT: Data set contains information from the Ames Assessor’s Office used in computing assessed values for individual residential properties sold in Ames, IA from 2006 to 2010.
    SOURCES: Ames, Iowa Assessor’s Office
  • 56. VARIABLE DESCRIPTIONS: Tab characters are used to separate variables in the data file. The data has 82 columns which include 23 nominal, 23 ordinal, 14 discrete, and 20 continuous variables (and 2 additional observation identifiers). Order (Discrete): Observation number PID (Nominal): Parcel identification number - can be used with city web site for parcel review. MS SubClass (Nominal): Identifies the type of dwelling involved in the sale. 020 1-STORY 1946 & NEWER ALL STYLES 030 1-STORY 1945 & OLDER 040 1-STORY W/FINISHED ATTIC ALL AGES 045 1-1/2 STORY - UNFINISHED ALL AGES 050 1-1/2 STORY FINISHED ALL AGES 060 2-STORY 1946 & NEWER 070 2-STORY 1945 & OLDER
  • 57. 075 2-1/2 STORY ALL AGES 080 SPLIT OR MULTI-LEVEL 085 SPLIT FOYER 090 DUPLEX - ALL STYLES AND AGES 120 1-STORY PUD (Planned Unit Development) - 1946 & NEWER 150 1-1/2 STORY PUD - ALL AGES 160 2-STORY PUD - 1946 & NEWER 180 PUD - MULTILEVEL - INCL SPLIT LEV/FOYER 190 2 FAMILY CONVERSION - ALL STYLES AND AGES MS Zoning (Nominal): Identifies the general zoning classification of the sale. A Agriculture C Commercial FV Floating Village Residential I Industrial RH Residential High Density
  • 58. RL Residential Low Density RP Residential Low Density Park RM Residential Medium Density Lot Frontage (Continuous): Linear feet of street connected to property Lot Area (Continuous): Lot size in square feet Street (Nominal): Type of road access to property Grvl Gravel Pave Paved Alley (Nominal): Type of alley access to property Grvl Gravel Pave Paved NA No alley access Lot Shape (Ordinal): General shape of property Reg Regular IR1 Slightly irregular IR2 Moderately Irregular
  • 59. IR3 Irregular Land Contour (Nominal): Flatness of the property Lvl Near Flat/Level Bnk Banked - Quick and significant rise from street grade to building HLS Hillside - Significant slope from side to side Low Depression Utilities (Ordinal): Type of utilities available AllPub All public Utilities (E,G,W,& S) NoSewr Electricity, Gas, and Water (Septic Tank) NoSeWa Electricity and Gas Only ELO Electricity only Lot Config (Nominal): Lot configuration Inside Inside lot Corner Corner lot CulDSac Cul-de-sac
  • 60. FR2 Frontage on 2 sides of property FR3 Frontage on 3 sides of property Land Slope (Ordinal): Slope of property Gtl Gentle slope Mod Moderate Slope Sev Severe Slope Neighborhood (Nominal): Physical locations within Ames city limits (map available) Blmngtn Bloomington Heights Blueste Bluestem BrDale Briardale BrkSide Brookside ClearCr Clear Creek CollgCr College Creek Crawfor Crawford Edwards Edwards
  • 61. Gilbert Gilbert Greens Greens GrnHill Green Hills IDOTRR Iowa DOT and Rail Road Landmrk Landmark MeadowV Meadow Village Mitchel Mitchell Names North Ames NoRidge Northridge NPkVill Northpark Villa NridgHt Northridge Heights NWAmes Northwest Ames OldTown Old Town SWISU South & West of Iowa State University Sawyer Sawyer SawyerW Sawyer West Somerst Somerset StoneBr Stone Brook
  • 62. Timber Timberland Veenker Veenker Condition 1 (Nominal): Proximity to various conditions Artery Adjacent to arterial street Feedr Adjacent to feeder street Norm Normal RRNn Within 200' of North-South Railroad RRAn Adjacent to North-South Railroad PosN Near positive off-site feature--park, greenbelt, etc. PosA Adjacent to positive off-site feature RRNe Within 200' of East-West Railroad RRAe Adjacent to East-West Railroad Condition 2 (Nominal): Proximity to various conditions (if more than one is present) Artery Adjacent to arterial street Feedr Adjacent to feeder street Norm Normal
  • 63. RRNn Within 200' of North-South Railroad RRAn Adjacent to North-South Railroad PosN Near positive off-site feature--park, greenbelt, etc. PosA Adjacent to positive off-site feature RRNe Within 200' of East-West Railroad RRAe Adjacent to East-West Railroad Bldg Type (Nominal): Type of dwelling 1Fam Single-family Detached 2FmCon Two-family Conversion; originally built as one-family dwelling Duplx Duplex TwnhsE Townhouse End Unit TwnhsI Townhouse Inside Unit House Style (Nominal): Style of dwelling 1Story One story 1.5Fin One and one-half story: 2nd level finished 1.5Unf One and one-half story: 2nd level unfinished
  • 64. 2Story Two story 2.5Fin Two and one-half story: 2nd level finished 2.5Unf Two and one-half story: 2nd level unfinished SFoyer Split Foyer SLvl Split Level Overall Qual (Ordinal): Rates the overall material and finish of the house 10 Very Excellent 9 Excellent 8 Very Good 7 Good 6 Above Average 5 Average 4 Below Average 3 Fair 2 Poor 1 Very Poor
  • 65. Overall Cond (Ordinal): Rates the overall condition of the house 10 Very Excellent 9 Excellent 8 Very Good 7 Good 6 Above Average 5 Average 4 Below Average 3 Fair 2 Poor 1 Very Poor Year Built (Discrete): Original construction date Year Remod/Add (Discrete): Remodel date (same as construction date if no remodeling or additions) Roof Style (Nominal): Type of roof Flat Flat
  • 66. Gable Gable Gambrel Gambrel (Barn) Hip Hip Mansard Mansard Shed Shed Roof Matl (Nominal): Roof material ClyTile Clay or Tile CompShg Standard (Composite) Shingle Membran Membrane Metal Metal Roll Roll Tar&Grv Gravel & Tar WdShake Wood Shakes WdShngl Wood Shingles Exterior 1 (Nominal): Exterior covering on house
  • 67. AsbShng Asbestos Shingles AsphShn Asphalt Shingles BrkComm Brick Common BrkFace Brick Face CBlock Cinder Block CemntBd Cement Board HdBoard Hard Board ImStucc Imitation Stucco MetalSd Metal Siding Other Other Plywood Plywood PreCast PreCast Stone Stone Stucco Stucco VinylSd Vinyl Siding Wd Sdng Wood Siding WdShing Wood Shingles Exterior 2 (Nominal): Exterior covering on house (if more than
  • 68. one material) AsbShng Asbestos Shingles AsphShn Asphalt Shingles BrkComm Brick Common BrkFace Brick Face CBlock Cinder Block CemntBd Cement Board HdBoard Hard Board ImStucc Imitation Stucco MetalSd Metal Siding Other Other Plywood Plywood PreCast PreCast Stone Stone Stucco Stucco VinylSd Vinyl Siding Wd Sdng Wood Siding
  • 69. WdShing Wood Shingles Mas Vnr Type (Nominal): Masonry veneer type BrkCmn Brick Common BrkFace Brick Face CBlock Cinder Block None None Stone Stone Mas Vnr Area (Continuous): Masonry veneer area in square feet Exter Qual (Ordinal): Evaluates the quality of the material on the exterior Ex Excellent Gd Good TA Average/Typical Fa Fair Po Poor Exter Cond (Ordinal): Evaluates the present condition of the material on the
  • 70. exterior Ex Excellent Gd Good TA Average/Typical Fa Fair Po Poor Foundation (Nominal): Type of foundation BrkTil Brick & Tile CBlock Cinder Block PConc Poured Concrete Slab Slab Stone Stone Wood Wood Bsmt Qual (Ordinal): Evaluates the height of the basement Ex Excellent (100+ inches) Gd Good (90-99 inches)
  • 71. TA Typical (80-89 inches) Fa Fair (70-79 inches) Po Poor (<70 inches) NA No Basement Bsmt Cond (Ordinal): Evaluates the general condition of the basement Ex Excellent Gd Good TA Typical - slight dampness allowed Fa Fair - dampness or some cracking or settling Po Poor - Severe cracking, settling, or wetness NA No Basement Bsmt Exposure (Ordinal): Refers to walkout or garden level walls Gd Good Exposure Av Average Exposure (split levels or foyers typically score
  • 72. average or above) Mn Minimum Exposure No No Exposure NA No Basement BsmtFin Type 1 (Ordinal): Rating of basement finished area GLQ Good Living Quarters ALQ Average Living Quarters BLQ Below Average Living Quarters Rec Average Rec Room LwQ Low Quality Unf Unfinished NA No Basement BsmtFin SF 1 (Continuous): Type 1 finished square feet BsmtFinType 2 (Ordinal): Rating of basement finished area (if multiple types) GLQ Good Living Quarters
  • 73. ALQ Average Living Quarters BLQ Below Average Living Quarters Rec Average Rec Room LwQ Low Quality Unf Unfinished NA No Basement BsmtFin SF 2 (Continuous): Type 2 finished square feet Bsmt Unf SF (Continuous): Unfinished square feet of basement area Total Bsmt SF (Continuous): Total square feet of basement area Heating (Nominal): Type of heating Floor Floor Furnace GasA Gas forced warm air furnace GasW Gas hot water or steam heat Grav Gravity furnace
  • 74. OthW Hot water or steam heat other than gas Wall Wall furnace HeatingQC (Ordinal): Heating quality and condition Ex Excellent Gd Good TA Average/Typical Fa Fair Po Poor Central Air (Nominal): Central air conditioning N No Y Yes Electrical (Ordinal): Electrical system SBrkr Standard Circuit Breakers & Romex FuseA Fuse Box over 60 AMP and all Romex wiring (Average) FuseF 60 AMP Fuse Box and mostly Romex wiring (Fair) FuseP 60 AMP Fuse Box and mostly knob & tube wiring (poor)
  • 75. Mix Mixed 1st Flr SF (Continuous): First Floor square feet 2nd Flr SF (Continuous) : Second floor square feet Low Qual Fin SF (Continuous): Low quality finished square feet (all floors) Gr Liv Area (Continuous): Above grade (ground) living area square feet Bsmt Full Bath (Discrete): Basement full bathrooms Bsmt Half Bath (Discrete): Basement half bathrooms Full Bath (Discrete): Full bathrooms above grade Half Bath (Discrete): Half baths above grade Bedroom (Discrete): Bedrooms above grade (does NOT include basement bedrooms)
  • 76. Kitchen (Discrete): Kitchens above grade KitchenQual (Ordinal): Kitchen quality Ex Excellent Gd Good TA Typical/Average Fa Fair Po Poor TotRmsAbvGrd (Discrete): Total rooms above grade (does not include bathrooms) Functional (Ordinal): Home functionality (Assume typical unless deductions are warranted) Typ Typical Functionality Min1 Minor Deductions 1 Min2 Minor Deductions 2 Mod Moderate Deductions
  • 77. Maj1 Major Deductions 1 Maj2 Major Deductions 2 Sev Severely Damaged Sal Salvage only Fireplaces (Discrete): Number of fireplaces FireplaceQu (Ordinal): Fireplace quality Ex Excellent - Exceptional Masonry Fireplace Gd Good - Masonry Fireplace in main level TA Average - Prefabricated Fireplace in main living area or Masonry Fireplace in basement Fa Fair - Prefabricated Fireplace in basement Po Poor - Ben Franklin Stove NA No Fireplace Garage Type (Nominal): Garage location 2Types More than one type of garage
  • 78. Attchd Attached to home Basment Basement Garage BuiltIn Built-In (Garage part of house - typically has room above garage) CarPort Car Port Detchd Detached from home NA No Garage Garage Yr Blt (Discrete): Year garage was built Garage Finish (Ordinal): Interior finish of the garage Fin Finished RFn Rough Finished Unf Unfinished NA No Garage Garage Cars (Discrete): Size of garage in car capacity Garage Area (Continuous): Size of garage in square feet
  • 79. Garage Qual (Ordinal): Garage quality Ex Excellent Gd Good TA Typical/Average Fa Fair Po Poor NA No Garage Garage Cond (Ordinal): Garage condition Ex Excellent Gd Good TA Typical/Average Fa Fair Po Poor NA No Garage Paved Drive (Ordinal): Paved driveway Y Paved P Partial Pavement
  • 80. N Dirt/Gravel Wood Deck SF (Continuous): Wood deck area in square feet Open Porch SF (Continuous): Open porch area in square feet Enclosed Porch (Continuous): Enclosed porch area in square feet 3-Ssn Porch (Continuous): Three season porch area in square feet Screen Porch (Continuous): Screen porch area in square feet Pool Area (Continuous): Pool area in square feet Pool QC (Ordinal): Pool quality Ex Excellent Gd Good TA Average/Typical Fa Fair
  • 81. NA No Pool Fence (Ordinal): Fence quality GdPrv Good Privacy MnPrv Minimum Privacy GdWo Good Wood MnWw Minimum Wood/Wire NA No Fence Misc Feature (Nominal): Miscellaneous feature not covered in other categories Elev Elevator Gar2 2nd Garage (if not described in garage section) Othr Other Shed Shed (over 100 SF) TenC Tennis Court NA None Misc Val (Continuous): $Value of miscellaneous feature Mo Sold (Discrete): Month Sold (MM)
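Several of the ordinal variables listed above share the Ex/Gd/TA/Fa/Po quality scale, and in a number of them the code NA means the feature is absent (no garage, no pool, no fence) rather than that the value is unknown. The sketch below shows one way such codes might be turned into ordered integers. The numeric scores, the choice to score NA as 0, the treatment of a Python `None` as NA, and the function name are all assumptions for illustration; none of them comes from the data documentation. Note also that `pandas.read_csv` treats the string NA as missing by default, so these codes arrive as NaN unless `keep_default_na=False` is passed.

```python
# Assumed scoring: higher means better quality, 0 means the feature is absent
QUALITY_SCORES = {"NA": 0, "Po": 1, "Fa": 2, "TA": 3, "Gd": 4, "Ex": 5}


def encode_quality(codes, scores=QUALITY_SCORES):
    """Map quality codes to integers, treating None as the NA (absent) code."""
    encoded = []
    for code in codes:
        key = "NA" if code is None else str(code).strip()
        if key not in scores:
            raise ValueError(f"unexpected quality code: {code!r}")
        encoded.append(scores[key])
    return encoded


# Hypothetical Garage Qual values, for illustration only
garage_qual = ["TA", "Gd", "NA", "Fa", None, "Ex"]
encoded = encode_quality(garage_qual)
print(encoded)
```

Whether an integer score or a set of dummy variables is the better choice depends on whether the steps between levels can reasonably be treated as equal, which is a modeling decision to discuss in the report.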
  • 82. Yr Sold (Discrete): Year Sold (YYYY) Sale Type (Nominal): Type of sale WD Warranty Deed - Conventional CWD Warranty Deed - Cash VWD Warranty Deed - VA Loan New Home just constructed and sold COD Court Officer Deed/Estate Con Contract 15% Down payment regular terms ConLw Contract Low Down payment and low interest ConLI Contract Low Interest ConLD Contract Low Down Oth Other Sale Condition (Nominal): Condition of sale Normal Normal Sale Abnorml Abnormal Sale - trade, foreclosure, short sale
  • 83. AdjLand Adjoining Land Purchase Alloca Allocation - two linked properties with separate deeds, typically condo with a garage unit Family Sale between family members Partial Home was not completed when last assessed (associated with New Homes) SalePrice (Continuous): Sale price $$ I have to complete EDA assignment using python before JAN. 20th. I have an approximate python code and a report template example. I would like to ask you to complete the report based on these. I will attach the necessary files for analysis and report generation. After completion, please send me a doc. and .py file.