Lab 1: Data cleaning, exploration, removal of outliers, Correlation of Continuous Variables and Log Error (Target Variable), scatterplot analysis, adding new data features, Categorical and Continuous Feature Importance

1/12/2018 lab1
http://localhost:8888/notebooks/Documents/GitHub/data_mining_group_project-master/notebooks/Lab%201/lab1.ipynb 1/63
Zillow Dataset Analysis and Visualization
MSDS 7331 Data Mining - Section 403 - Lab 1
Team: Ivelin Angelov, Yao Yao, Kaitlin Kirasich, Albert Asuncion
Business Understanding
10 points
Description:
Describe the purpose of the data set you selected (i.e., why was this data collected in the first
place?). Describe how you would define and measure the outcomes from the dataset. That is, why
is this data important and how do you know if you have mined useful knowledge from the dataset?
How would you measure the effectiveness of a good prediction algorithm? Be specific.
Answer:
Origin and purpose of dataset
This is a dataset from a Kaggle competition: "Zillow Prize: Zillow’s Home Value Prediction
(Zestimate)". To download all accompanied dataset refer to this link:
https://www.kaggle.com/c/zillow-prize-1/data (https://www.kaggle.com/c/zillow-prize-1/data)
Note: The dataset has 2985217 rows and 58 columns and it requires at least 2GB of free RAM to
load.
Zillow, a leading real estate and rental marketplace platform, developed a model to estimate the
property price based on property features, which they call the "Zestimate". As with every real world
model, the Zestimate has some error associated with it. Zestimates are estimated home values
based on 7.5 million statistical and machine learning models that analyze hundreds of data points
on each property.
The purpose of this dataset and Kaggle competition is to minimize the error between the Zestimate
(what we will predict) and the actual sale price, given certain features of a home.
Description of dataset
We are provided with a full dataset of real estate properties in three counties in California: Los
Angeles, Orange, and Ventura in 2016. The dataset contains:
ID for the listing
57 variables describing the property features such as the number of bedrooms and various
measurements in square feet
Two resulting variables: logerror and transactiondate
The dataset has two parts:

1/12/2018 lab1
Training data (90275 rows), which contains logerror and transactiondate and has all the
transactions before October 15, 2016, plus some of the transactions after October 15,
2016.
Testing data (2895067 rows), which contains the rest of the transactions between October
15 and December 31, 2016.
A successful measure of how well we predict log error will be how well we can clean and train our
data measured by our placement in the Kaggle competition. Kaggle measures the effectiveness of a
good prediction algorithm by taking the log error of the Zestimate and the actual sales price. The log
error is defined as:
where logerror < 0 represent Zestimates lower than the actual sell price and logerror > 0 represent
Zestimates higher than the actual sell price.
Our notebook
This notebook is an exploratory analysis for the dataset described above. Our study is organized as
follows:
Data Meaning
Data Quality (EDA)
Review of variables
Identification of missing values and outliers
Data cleansing
Visualizations
Simple Statistics
Visualize Attributes
Explore Joint Attributes
Explore Attributes and Classes
New Features
Exceptional Work
References/Citations
Conclusion
From the correlation table, random forest, and linear regression feature importance, we found out
that regionidzip, calculatedfinishedsquarefeet, bedroomcount, censustractandblock,
regionidneighborhood, and taxdelinquencyyear are the most important variables towards building
our prediction model.
Future work
In the future lab notebooks, we will predict the logerror from a regression model. To measure the
effectiveness of a good prediction algorithm, we will first apply cross-validation by splitting the
training dataset to training, validation, and testing to model our prediction error. A final prediction
error will be given by Kaggle when we submit our predictions to the competition.
logerror =log(Zestimate)−log(SalePrice)

1/12/2018 lab1
In [1]:
Data Meaning
10 points
Description:
Describe the meaning and type of data (scale, values, etc.) for each attribute in the data file.
Below is a table of all of the variables in the dataset. We list the variable name, type of data, scale,
and a description.
/usr/local/lib/python2.7/site-packages/matplotlib/__init__.py:878: UserWarning:
axes.color_cycle is deprecated and replaced with axes.prop_cycle; please use th
e latter.
warnings.warn(self.msg_depr % (key, alt_key))
Out[1]: 'The dataset has 2985342 rows and 60 columns'
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
# load datasets here:
train_data = pd.read_csv('../input/train_2016_v2.csv')
data = pd.read_csv('../input/properties_2016.csv', low_memory=False)
data = pd.merge(data, train_data, how='left', on='parcelid')
'The dataset has %d rows and %d columns' % data.shape

1/12/2018 lab1
In [2]: from IPython.display import display, HTML
variables_description = [
['airconditioningtypeid', 'nominal', 'TBD', 'Type of cooling system present in the
,['architecturalstyletypeid', 'nominal', 'TBD', 'Architectural style of the home
,['assessmentyear', 'interval', 'TBD', 'The year of the property tax assessment']
,['basementsqft', 'ratio', 'TBD', 'Finished living area below or partially below g
,['bathroomcnt', 'ordinal', 'TBD', 'Number of bathrooms in home including fractio
,['bedroomcnt', 'ordinal', 'TBD', 'Number of bedrooms in home']
,['buildingclasstypeid', 'nominal', 'TBD', 'The building framing type (steel frame
,['buildingqualitytypeid', 'ordinal', 'TBD', 'Overall assessment of condition of t
,['calculatedbathnbr', 'ordinal', 'TBD', 'Number of bathrooms in home including f
,['calculatedfinishedsquarefeet', 'ratio', 'TBD', 'Calculated total finished livi
,['censustractandblock', 'nominal', 'TBD', 'Census tract and block ID combined -
,['decktypeid', 'nominal', 'TBD', 'Type of deck (if any) present on parcel']
,['finishedfloor1squarefeet', 'ratio', 'TBD', 'Size of the finished living area o
,['finishedsquarefeet12', 'ratio', 'TBD', 'Finished living area']
,['finishedsquarefeet13', 'ratio', 'TBD', 'Perimeter living area']
,['finishedsquarefeet15', 'ratio', 'TBD', 'Total area']
,['finishedsquarefeet50', 'ratio', 'TBD', 'Size of the finished living area on the
,['finishedsquarefeet6', 'ratio', 'TBD', 'Base unfinished and finished area']
,['fips', 'nominal', 'TBD', 'Federal Information Processing Standard code - see ht
,['fireplacecnt', 'ordinal', 'TBD', 'Number of fireplaces in a home (if any)']
,['fireplaceflag', 'ordinal', 'TBD', 'Is a fireplace present in this home']
,['fullbathcnt', 'ordinal', 'TBD', 'Number of full bathrooms (sink, shower + batht
,['garagecarcnt', 'ordinal', 'TBD', 'Total number of garages on the lot including
,['garagetotalsqft', 'ratio', 'TBD', 'Total number of square feet of all garages o
,['hashottuborspa', 'ordinal', 'TBD', 'Does the home have a hot tub or spa']
,['heatingorsystemtypeid', 'nominal', 'TBD', 'Type of home heating system']
,['landtaxvaluedollarcnt', 'ratio', 'TBD', 'The assessed value of the land area of
,['latitude', 'interval', 'TBD', 'Latitude of the middle of the parcel multiplied
,['logerror', 'interval', 'TBD', 'Error or the Zillow model response variable']
,['longitude', 'interval', 'TBD', 'Longitude of the middle of the parcel multiplie
,['lotsizesquarefeet', 'ratio', 'TBD', 'Area of the lot in square feet']
,['numberofstories', 'ordinal', 'TBD', 'Number of stories or levels the home has'
,['parcelid', 'nominal', 'TBD', 'Unique identifier for parcels (lots)']
,['poolcnt', 'ordinal', 'TBD', 'Number of pools on the lot (if any)']
,['poolsizesum', 'ratio', 'TBD', 'Total square footage of all pools on property']
,['pooltypeid10', 'nominal', 'TBD', 'Spa or Hot Tub']
,['pooltypeid2', 'nominal', 'TBD', 'Pool with Spa/Hot Tub']
,['pooltypeid7', 'nominal', 'TBD', 'Pool without hot tub']
,['propertycountylandusecode', 'nominal', 'TBD', 'County land use code i.e. it's
,['propertylandusetypeid', 'nominal', 'TBD', 'Type of land use the property is zo
,['propertyzoningdesc', 'nominal', 'TBD', 'Description of the allowed land uses (
,['rawcensustractandblock', 'nominal', 'TBD', 'Census tract and block ID combined
,['regionidcity', 'nominal', 'TBD', 'City in which the property is located (if any
,['regionidcounty', 'nominal', 'TBD', 'County in which the property is located']
,['regionidneighborhood', 'nominal', 'TBD', 'Neighborhood in which the property i
,['regionidzip', 'nominal', 'TBD', 'Zip code in which the property is located']
,['roomcnt', 'ordinal', 'TBD', 'Total number of rooms in the principal residence'
,['storytypeid', 'nominal', 'TBD', 'Type of floors in a multi-story house (i.e. b
,['structuretaxvaluedollarcnt', 'ratio', 'TBD', 'The assessed value of the built
,['taxamount', 'ratio', 'TBD', 'The total property tax assessed for that assessme
,['taxdelinquencyflag', 'nominal', 'TBD', 'Property taxes for this parcel are past
,['taxdelinquencyyear', 'interval', 'TBD', 'Year']
,['taxvaluedollarcnt', 'ratio', 'TBD', 'The total tax assessed value of the parce

1/12/2018 lab1
Out[2]: Variable Type Scale Description
airconditioningtypeid nominal
[nan, 1.0, 13.0, 5.0,
11.0, 9.0, 12.0, 3.0]
Type of cooling system present in the h
any)
architecturalstyletypeid nominal
[nan, 7.0, 21.0, 8.0,
2.0, 3.0, 5.0, 10.0,
27.0]
Architectural style of the home (i.e. ran
colonial, split-level, etc…)
assessmentyear interval (2000, 2016) The year of the property tax assessme
basementsqft ratio (20, 8516)
Finished living area below or partially b
ground level
bathroomcnt ordinal
[0.0, 2.0, 4.0, 3.0,
1.0, ... (38 More)]
Number of bathrooms in home includin
fractional bathrooms
bedroomcnt ordinal
[0.0, 4.0, 5.0, 2.0,
3.0, ... (22 More)]
Number of bedrooms in home
buildingclasstypeid nominal
[nan, 3.0, 4.0, 5.0,
2.0, 1.0]
The building framing type (steel frame,
frame, concrete/brick)
,['threequarterbathnbr', 'ordinal', 'TBD', 'Number of 3/4 bathrooms in house (show
,['transactiondate', 'nominal', 'TBD', 'Date of the transaction response variable
,['typeconstructiontypeid', 'nominal', 'TBD', 'What type of construction material
,['unitcnt', 'ordinal', 'TBD', 'Number of units the structure is built into (i.e.
,['yardbuildingsqft17', 'interval', 'TBD', 'Patio in yard']
,['yardbuildingsqft26', 'interval', 'TBD', 'Storage shed/building in yard']
,['yearbuilt', 'interval', 'TBD', 'The Year the principal residence was built']
]
variables = pd.DataFrame(variables_description, columns=['name', 'type', 'scale',
variables = variables.set_index('name')
variables = variables.loc[data.columns]
def output_variables_table(variables):
variables = variables.sort_index()
rows = ['<tr><th>Variable</th><th>Type</th><th>Scale</th><th>Description</th>
for vname, atts in variables.iterrows():
atts = atts.to_dict()
# add scale if TBD
if atts['scale'] == 'TBD':
if atts['type'] in ['nominal', 'ordinal']:
uniques = data[vname].unique()
uniques = list(uniques.astype(str))
if len(uniques) < 10:
atts['scale'] = '[%s]' % ', '.join(uniques)
else:
atts['scale'] = '[%s]' % (', '.join(uniques[:5]) + ', ... (%d
if atts['type'] in ['ratio', 'interval']:
atts['scale'] = '(%d, %d)' % (data[vname].min(), data[vname].max(
row = (vname, atts['type'], atts['scale'], atts['description'])
rows.append('<tr><td>%s</td><td>%s</td><td>%s</td><td>%s</td></tr>' % row
return HTML('<table>%s</table>' % ''.join(rows))
output_variables_table(variables)

1/12/2018 lab1
buildingqualitytypeid ordinal
[nan, 7.0, 4.0, 10.0,
1.0, ... (13 More)]
Overall assessment of condition of the
from best (lowest) to worst (highest)
calculatedbathnbr ordinal
[nan, 2.0, 4.0, 3.0,
1.0, ... (35 More)]
Number of bathrooms in home includin
fractional bathroom
calculatedfinishedsquarefeet ratio (1, 952576)
Calculated total finished living area of t
home
censustractandblock nominal
[nan,
6.1110010011e+13,
6.1110009032e+13,
6.1110010024e+13,
6.1110010023e+13,
... (96772 More)]
Census tract and block ID combined -
contains blockgroup assignment by ex
decktypeid nominal [nan, 66.0] Type of deck (if any) present on parcel
finishedfloor1squarefeet ratio (3, 31303)
Size of the finished living area on the f
(entry) floor of the home
finishedsquarefeet12 ratio (1, 290345) Finished living area
finishedsquarefeet13 ratio (120, 2688) Perimeter living area
finishedsquarefeet15 ratio (112, 820242) Total area
finishedsquarefeet50 ratio (3, 31303)
Size of the finished living area on the f
(entry) floor of the home
finishedsquarefeet6 ratio (117, 952576) Base unfinished and finished area
fips nominal
[6037.0, 6059.0,
6111.0, nan]
Federal Information Processing Standa
- see
https://en.wikipedia.org/wiki/FIPS_coun
for more details
fireplacecnt ordinal
[nan, 3.0, 1.0, 2.0,
4.0, ... (10 More)]
Number of fireplaces in a home (if any
fireplaceflag ordinal [nan, True] Is a fireplace present in this home
fullbathcnt ordinal
[nan, 2.0, 4.0, 3.0,
1.0, ... (21 More)]
Number of full bathrooms (sink, showe
bathtub, and toilet) present in home
garagecarcnt ordinal
[nan, 2.0, 4.0, 1.0,
3.0, ... (25 More)]
Total number of garages on the lot incl
attached garage
garagetotalsqft ratio (0, 7749)
Total number of square feet of all garag
lot including an attached garage
hashottuborspa ordinal [nan, True] Does the home have a hot tub or spa
heatingorsystemtypeid nominal
[nan, 2.0, 7.0, 20.0,
6.0, ... (15 More)]
Type of home heating system
landtaxvaluedollarcnt ratio (1, 90246219)
The assessed value of the land area o
parcel

1/12/2018 lab1
latitude interval
(33324388,
34819650)
Latitude of the middle of the parcel mu
by 10e6
logerror interval (-4, 4) Error or the Zillow model response var
longitude interval
(-119475780,
-117554316)
Longitude of the middle of the parcel m
by 10e6
lotsizesquarefeet ratio (100, 328263808) Area of the lot in square feet
numberofstories ordinal
[nan, 1.0, 4.0, 2.0,
3.0, ... (13 More)]
Number of stories or levels the home h
parcelid nominal
[10754147,
10759547,
10843547,
10859147,
10879947, ...
(2985217 More)]
Unique identifier for parcels (lots)
poolcnt ordinal [nan, 1.0] Number of pools on the lot (if any)
poolsizesum ratio (19, 17410) Total square footage of all pools on pro
pooltypeid10 nominal [nan, 1.0] Spa or Hot Tub
pooltypeid2 nominal [nan, 1.0] Pool with Spa/Hot Tub
pooltypeid7 nominal [nan, 1.0] Pool without hot tub
propertycountylandusecode nominal
[010D, 0109, 1200,
1210, 010V, ... (241
More)]
County land use code i.e. it's zoning at
county level
propertylandusetypeid nominal
[269.0, 261.0, 47.0,
31.0, 260.0, ... (16
More)]
Type of land use the property is zoned
propertyzoningdesc nominal
[nan, LCA11*,
LAC2, LAM1,
LAC4, ... (5639
More)]
Description of the allowed land uses (z
for that property
rawcensustractandblock nominal
[60378002.041,
60378001.011,
60377030.012,
60371412.023,
60371232.052, ...
(99394 More)]
Census tract and block ID combined -
contains blockgroup assignment by ex
regionidcity nominal
[37688.0, 51617.0,
12447.0, 396054.0,
47547.0, ... (187
More)]
City in which the property is located (if
regionidcounty nominal
[3101.0, 1286.0,
2061.0, nan]
County in which the property is located

1/12/2018 lab1
Data Quality
15 points
regionidneighborhood nominal
[nan, 27080.0,
46795.0, 274049.0,
31817.0, ... (529
More)]
Neighborhood in which the property is
regionidzip nominal
[96337.0, 96095.0,
96424.0, 96450.0,
96446.0, ... (406
More)]
Zip code in which the property is locate
roomcnt ordinal
[0.0, 8.0, 4.0, 5.0,
7.0, ... (37 More)]
Total number of rooms in the principal
residence
storytypeid nominal [nan, 7.0]
Type of floors in a multi-story house (i.e
basement and main level, split-level, a
etc.). See tab for details.
structuretaxvaluedollarcnt ratio (1, 251486000)
The assessed value of the built structu
the parcel
taxamount ratio (1, 3458861)
The total property tax assessed for tha
assessment year
taxdelinquencyflag nominal [nan, Y]
Property taxes for this parcel are past d
of 2015
taxdelinquencyyear interval (0, 99) Year
taxvaluedollarcnt ratio (1, 282786000) The total tax assessed value of the par
threequarterbathnbr ordinal
[nan, 1.0, 2.0, 4.0,
3.0, 6.0, 5.0, 7.0]
Number of 3/4 bathrooms in house (sh
sink + toilet)
transactiondate nominal
[nan, 2016-01-27,
2016-03-30, 2016-
05-27, 2016-06-07,
... (353 More)]
Date of the transaction response varia
typeconstructiontypeid nominal
[nan, 6.0, 4.0, 10.0,
13.0, 11.0]
What type of construction material was
construct the home
unitcnt ordinal
[nan, 2.0, 1.0, 3.0,
5.0, ... (147 More)]
Number of units the structure is built in
= duplex, 3 = triplex, etc...)
yardbuildingsqft17 interval (10, 7983) Patio in yard
yardbuildingsqft26 interval (10, 6141) Storage shed/building in yard
yearbuilt interval (1801, 2015) The Year the principal residence was b

1/12/2018 lab1
Description:
Verify data quality: Explain any missing values, duplicate data, and outliers. Are those mistakes?
How do you deal with these problems? Give justifications for your methods.
Examining Distribution of Missing Values
From the observations, most of the rows have about 30 missing values. For the observations that
have 57 missing values, it means that most of the features are missing and we choose to remove
those. We will add in values to those missing where appropriate, below.
In [3]:
All observations have a value for parcelid
In [4]:
0.38 percent of the data has only parcelid present and all other variables
Out[4]: 0
plt.rcParams['figure.figsize'] = [10, 7]
number_missing_per_row = data.isnull().sum(axis=1)
sns.distplot(number_missing_per_row, color="#34495e", kde=False);
plt.title('Distribution of Missing Values', fontsize=15)
plt.xlabel('Number of Missing Values', fontsize=15)
plt.ylabel('Number of Rows', fontsize=15);
data['parcelid'].isnull().sum()

1/12/2018 lab1
missing
We choose to remove those observations because they don't present any value
In [5]:
Table of Missing Values
Of the available variables, here is a table that describes the number of missing values as well as the
percent missing.
(0.0, 'percent of the data has no data features outside of parcelid')
print(round(len(number_missing_per_row[number_missing_per_row >= 57]) / len(data)
data = data[number_missing_per_row < 57]

1/12/2018 lab1
In [9]:
Out[9]: Variable Name Number Missing Values Percent Missing
0 parcelid 0 0.000000
1 airconditioningtypeid 2162353 72.710897
2 architecturalstyletypeid 2967843 99.796160
3 basementsqft 2972277 99.945257
4 bathroomcnt 25 0.000841
5 bedroomcnt 13 0.000437
6 buildingclasstypeid 2961276 99.575339
7 buildingqualitytypeid 1035337 34.814058
8 calculatedbathnbr 117481 3.950395
9 decktypeid 2956809 99.425133
10 finishedfloor1squarefeet 2771182 93.183272
11 calculatedfinishedsquarefeet 44131 1.483941
12 finishedsquarefeet12 264610 8.897729
17 fips 0 0.000000
18 fireplacecnt 2661258 89.486988
19 fullbathcnt 117481 3.950395
20 garagecarcnt 2090598 70.298076
21 garagetotalsqft 2090598 70.298076
22 hashottuborspa 2904889 97.679280
23 heatingorsystemtypeid 1167429 39.255760
24 latitude 0 0.000000
25 longitude 0 0.000000
26 lotsizesquarefeet 264676 8.899948
27 poolcnt 2456346 82.596653
28 poolsizesum 2945942 99.059721
missing_values = data.isnull().sum().reset_index()
missing_values.columns = ['Variable Name', 'Number Missing Values']
missing_values['Percent Missing'] = missing_values['Number Missing Values'] / len
missing_values['Percent Missing'] = missing_values['Percent Missing'].replace(np.
missing_values

1/12/2018 lab1
Variable Name Number Missing Values Percent Missing
29 pooltypeid10 2936964 98.757829
30 pooltypeid2 2941830 98.921452
31 pooltypeid7 2488421 83.675201
32 propertycountylandusecode 840 0.028246
33 propertylandusetypeid 0 0.000000
34 propertyzoningdesc 995195 33.464250
35 rawcensustractandblock 0 0.000000
36 regionidcity 51410 1.728704
37 regionidcounty 0 0.000000
38 regionidneighborhood 1817447 61.113149
39 regionidzip 2543 0.085510
40 roomcnt 38 0.001278
41 storytypeid 2972281 99.945392
42 threequarterbathnbr 2662261 89.520714
43 typeconstructiontypeid 2967157 99.773093
44 unitcnt 996333 33.502516
45 yardbuildingsqft17 2893549 97.297963
46 yardbuildingsqft26 2971258 99.910992
47 yearbuilt 48494 1.630651
48 numberofstories 2291806 77.063860
49 fireplaceflag 2968740 99.826323
50 structuretaxvaluedollarcnt 43547 1.464304
51 taxvaluedollarcnt 31113 1.046200
52 assessmentyear 2 0.000067
53 landtaxvaluedollarcnt 56296 1.892999
54 taxamount 19813 0.666228
55 taxdelinquencyflag 2917435 98.101150
56 taxdelinquencyyear 2917433 98.101083
57 censustractandblock 63691 2.141662
58 logerror 2883630 96.964429
59 transactiondate 2883630 96.964429

1/12/2018 lab1
Examining Variables for Missing Values and Outliers
For variables that are nominal, ratio, and interval where appropriate, we wrote a function that
replaces outliers 5 standard deviations from the mean and assigning them as 5 standard deviations
above or below the mean, respectively.
In [10]:
Variable: airconditioningtypeid - Type of cooling system present in the
home (if any)
Has datatype: nominal and 72.710860 percent of values missing
For this variable, missing values indicate the absence of a cooling system. We replace all missing
values with 0 to represent no cooling system. We changed the column datatype to integer.
In [11]:
Variable: architecturalstyletypeid - Architectural style of the home (i.e.
ranch, colonial, split-level, etc…)
Architectural style describes the home design. As such, it is not something we can extrapolate a
value for. With over 99% of values missing, we decided to eliminate this variable.
('Before', array([ nan, 1., 13., 5., 11., 9., 12., 3.]))
('After', array([ 0, 1, 13, 5, 11, 9, 12, 3]))
def fix_outliers(data, column):
mean = data[column].mean()
std = data[column].std()
max_value = mean + std * 5
min_value = mean - std * 5
if data[column].max() < max_value and data[column].min() > min_value:
print('No outliers found')
return
print('Outliers found!')
f, ((ax0, ax1), (ax2, ax3)) = plt.subplots(nrows=2, ncols=2, figsize=[15, 7])
f.subplots_adjust(hspace=.4)
sns.boxplot(data[column].dropna(), ax=ax0, color="#34495e").set_title('Before
sns.distplot(data[column].dropna(), ax=ax2, color="#34495e").set_title('Before
data.loc[data[column] > max_value, column] = max_value
data.loc[data[column] < min_value, column] = min_value
sns.boxplot(data[column].dropna(), ax=ax1, color="#34495e").set_title('After'
sns.distplot(data[column].dropna(), ax=ax3, color="#34495e").set_title('After
print('Before', data['airconditioningtypeid'].unique())
data['airconditioningtypeid'] = data['airconditioningtypeid'].fillna(0).astype(np
print('After', data['airconditioningtypeid'].unique())

1/12/2018 lab1
In [12]:
Variable: assessmentyear - year of the property tax assessment
Has datatype: interval and has 2 values missing
We replaced the missing values with the latest tax year which also happens to be the median tax
year. We changed the column datatype to integer.
In [13]:
Variable: basementsqft - Finished living area below or partially below
ground level
Has datatype: ratio and 99.945255 percent of values missing
Basements are not standard home features. Whenever a basement is not a feature of the home,
the value for area was entered as a missing value. With over 99% of values missing, we decided to
eliminate this variable.
In [14]:
Variable: bathroomcnt - Number of bathrooms in home including
fractional bathrooms
Has datatype: ordinal and 0.000841 percent of values missing
We decided it is potentially possible for the property to not have a bathroom so we decided to
replace missing values with zeros since there are only very few. We changed the column datatype
to a float.
('Before', array([ 2015., 2014., 2003., 2012., 2001., 2011., 2013., 201
6.,
2010., nan, 2004., 2005., 2002., 2000., 2009.]))
('After', array([2015, 2014, 2003, 2012, 2001, 2011, 2013, 2016, 2010, 2004, 20
05,
2002, 2000, 2009]))
del data['architecturalstyletypeid']
print('Before', data['assessmentyear'].unique())
median_value = data['assessmentyear'].median()
data['assessmentyear'] = data['assessmentyear'].fillna(median_value).astype(np.int
print('After', data['assessmentyear'].unique())
del data['basementsqft']

1/12/2018 lab1
In [15]:
Variable: bedroomcnt - Number of bedrooms in home
We decided to replace missing values with zeros since there are only very few to represent a studio
apartment. We changed the column datatype to integer.
In [16]:
Variable: buildingclasstypeid - The building framing type (steel frame,
wood frame, concrete/brick)
With this much missing values and the difficulty of assigning a building framing type, we decided to
remove this variable.
In [17]:
Variable: buildingqualitytypeid - Overall assessment of condition of the
building from best (lowest) to worst (highest)
We chose to replace the missing values with the median of the condition assessment instead of
giving the missing values the best or worst value. We changed the column datatype to integer.
('Before', array([ 0. , 2. , 4. , 3. , 1. , 2.5 , 3.5 , 5.
,
1.5 , 4.5 , 7.5 , 5.5 , 6. , 7. , 10. , 8. ,
9. , 12. , 11. , 8.5 , 6.5 , 13. , 9.5 , 14. ,
20. , 19.5 , 15. , 10.5 , nan, 18. , 16. , 1.75,
17. , 19. , 0.5 , 12.5 , 11.5 , 14.5 ]))
('After', array([ 0. , 2. , 4. , 3. , 1. , 2.5 , 3.5 , 5.
,
1.5 , 4.5 , 7.5 , 5.5 , 6. , 7. , 10. , 8. ,
9. , 12. , 11. , 8.5 , 6.5 , 13. , 9.5 , 14. ,
20. , 19.5 , 15. , 10.5 , 18. , 16. , 1.75, 17. ,
19. , 0.5 , 12.5 , 11.5 , 14.5 ]))
('Before', array([ 0., 4., 5., 2., 3., 1., 6., 7., 8., 12.,
11.,
9., 10., 14., 16., 13., nan, 15., 17., 18., 20., 19.]))
('After', array([ 0, 4, 5, 2, 3, 1, 6, 7, 8, 12, 11, 9, 10, 14, 16, 1
3, 15,
17, 18, 20, 19]))
print('Before', data['bathroomcnt'].unique())
data['bathroomcnt'] = data['bathroomcnt'].fillna(0).astype(np.float32)
print('After', data['bathroomcnt'].unique())
print('Before', data['bedroomcnt'].unique())
data['bedroomcnt'] = data['bedroomcnt'].fillna(0).astype(np.int32)
print('After', data['bedroomcnt'].unique())
del data['buildingclasstypeid']

1/12/2018 lab1
In [18]:
Variable: calculatedbathnbr - Number of bathrooms in home including
fractional bathroom
With a low number of missing values, we decided to assign 0 to all missing values since we decided
above it is possible that a property could have 0 bathrooms. We changed the column datatype to a
float.
In [19]:
Variable: calculatedfinishedsquarefeet - Calculated total finished living
area of the home
These missing values appear to be consistent with 0 or missing values for variables associated with
a building or structure on the property such as bathroomcnt, bedroomcnt, or architecturalstyletypeid.
We can assume that no structures exist on these properties and we decided to impute zeros to
these. We changed the column datatype to integer. We then replaced all outliers with a maximum
and minimum value of (mean ± 5 * std), respectively.
('Before', array([ nan, 7., 4., 10., 1., 12., 8., 3., 6., 9.,
5.,
11., 2.]))
('After', array([ 7, 4, 10, 1, 12, 8, 3, 6, 9, 5, 11, 2]))
('Before', array([ nan, 2. , 4. , 3. , 1. , 2.5, 3.5, 5. , 1.
5,
4.5, 7.5, 5.5, 6. , 7. , 10. , 8. , 9. , 12. ,
11. , 8.5, 6.5, 13. , 9.5, 14. , 20. , 19.5, 15. ,
10.5, 18. , 16. , 17. , 19. , 12.5, 11.5, 14.5]))
('After', array([ 0. , 2. , 4. , 3. , 1. , 2.5, 3.5, 5. , 1.5,
4.5, 7.5, 5.5, 6. , 7. , 10. , 8. , 9. , 12. ,
11. , 8.5, 6.5, 13. , 9.5, 14. , 20. , 19.5, 15. ,
10.5, 18. , 16. , 17. , 19. , 12.5, 11.5, 14.5]))
print('Before', data['buildingqualitytypeid'].unique())
medianQuality = data['buildingqualitytypeid'].median()
data['buildingqualitytypeid'] = data['buildingqualitytypeid'].fillna(medianQuality
print('After', data['buildingqualitytypeid'].unique())
print('Before', data['calculatedbathnbr'].unique())
data['calculatedbathnbr'] = data['calculatedbathnbr'].fillna(0).astype(np.float32
print('After', data['calculatedbathnbr'].unique())

1/12/2018 lab1
In [20]:
Variable: censustractandblock - census tract and census block ID
With such a small amount of missing values, we decided to replace them with the median. A better
approach in the future could be taking into account zip code and then median for the missing
values. We changed the column datatype to a float.
In [21]:
Variable: decktypeid - Type of deck (if any) present on parcel
Outliers found!
('Before', [nan, 10925.92657277406, 5068.0, 1776.0, 2400.0, 3611.0, 3754.0, 247
0.0, '...'])
('After', [0, 10925, 5068, 1776, 2400, 3611, 3754, 2470, '...'])
('Before', [nan, 61110010011023.0, 61110009032019.0, 61110010024015.0, 61110010
023002.0, 61110010024021.0, 61110010021029.0, 61110010022038.0, '...'])
('After', [60375714234368.0, 61110011035648.0, 61110006841344.0, 6111000264704
0.0, 61110015229952.0, 61110019424256.0, 61110023618560.0, 61110027812864.0,
'...'])
fix_outliers(data, 'calculatedfinishedsquarefeet')
print('Before', data['calculatedfinishedsquarefeet'].unique()[:8].tolist() + ['..
data['calculatedfinishedsquarefeet'] = data['calculatedfinishedsquarefeet'].fillna
print('After', data['calculatedfinishedsquarefeet'].unique()[:8].tolist() + ['...
print('Before', data['censustractandblock'].unique()[:8].tolist() + ['...'])
median_value = data['censustractandblock'].median()
data['censustractandblock'] = data['censustractandblock'].fillna(median_value)
data['censustractandblock'] = data['censustractandblock'].astype(np.float32)
print('After', data['censustractandblock'].unique()[:8].tolist() + ['...'])

1/12/2018 lab1
Missing values is most likely an indication of an absence of this feature in the property. With 99%
missing values, we will remove this column.
In [22]:
Variable: finishedfloor1squarefeet - Size of the finished living area on
the first (entry) floor of the home
Having this much missing values and the availability of an alternate variable -
calculatedfinishedsquarefeet - with very few missing values, we decided to eliminate this variable.
In [23]:
Variable: finishedsquarefeet12 - Finished living area
The finishedsquarefeet fields add up to the calculatedfinishedsquarefeet. Missing values are
therefore zeros. We changed the column datatype to integer. We then replaced all outliers with a
maximum and minimum value of (mean ± 5 * std), respectively.
In [24]:
Outliers found!
('Before', array([ nan, 4000., 3633., ..., 317., 268., 161.]))
('After', array([ 0, 4000, 3633, ..., 317, 268, 161]))
del data['decktypeid']
del data['finishedfloor1squarefeet']
fix_outliers(data, 'finishedsquarefeet12')
print('Before', data['finishedsquarefeet12'].unique())
data['finishedsquarefeet12'] = data['finishedsquarefeet12'].fillna(0).astype(np.i
print('After', data['finishedsquarefeet12'].unique())

1/12/2018 lab1
Variable: finishedsquarefeet13 - Finished living area
The finishedsquarefeet fields add up to the calculatedfinishedsquarefeet. Since there are 99%
missing values we will remove this from the dataset.
In [25]:
Variable: finishedsquarefeet15 - Total area
missing values we will remove this from the dataset.
In [26]:
Variable: finishedsquarefeet50 - Size of the finished living area on the
first (entry) floor of the home
missing values we will replace the missing values with 0. We changed the column datatype to float.
In [27]:
Variable: finishedsquarefeet6 - Base unfinished and finished area
With 99% missing values, we decided to delete this variable.
In [28]:
Variable: fips - Federal Information Processing Standard code - see
https://en.wikipedia.org/wiki/FIPS_county_code
(https://en.wikipedia.org/wiki/FIPS_county_code) for more details
Has datatype: nominal with values [6037.0, 6059.0, 6111.0] and no missing values
We changed the column datatype to integer.
In [29]:
Variable: fireplacecnt - Number of fireplaces in a home (if any)
del data['finishedsquarefeet13']
data['finishedsquarefeet50'] = data['finishedsquarefeet50'].fillna(0).astype(np.f
data['fips'] = data['fips'].astype(np.int32)

1/12/2018 lab1
In this dataset, missing value represents 0 fireplaces. We replaced all missing values with zero and
change the column datatype to integer. We changed the column datatype to integer.
In [30]:
Variable: fireplaceflag - does the home have a fireplace
With 99% missing values, we decided to delete the variable.
In [31]:
Variable: fullbathcnt - Number of full bathrooms (sink, shower +
bathtub, and toilet) present in home
We first replaced its missing values with the values of bathroomcnt which is a similar measure. After
that, we have 25 observations missing and we replace them with 0. We changed the column
datatype to a float.
In [32]:
Variable: garagecarcnt - Total number of garages on the lot including an
attached garage
We assume that missing values will represent no garage and replace all missing values with zero.
We changed the column datatype to integer.
('Before', array([ nan, 3., 1., 2., 4., 9., 5., 7., 6., 8.]))
('After', array([0, 3, 1, 2, 4, 9, 5, 7, 6, 8]))
('Before', array([ nan, 2., 4., 3., 1., 5., 7., 6., 10., 8.,
9.,
12., 11., 13., 14., 20., 19., 15., 18., 16., 17.]))
('After', array([ 0. , 2. , 4. , 3. , 1. , 5. , 7. , 6.
,
10. , 8. , 9. , 12. , 11. , 7.5 , 2.5 , 4.5 ,
1.5 , 13. , 14. , 20. , 3.5 , 19. , 5.5 , 15. ,
18. , 16. , 1.75, 6.5 , 17. , 0.5 , 8.5 ]))
print('Before', data['fireplacecnt'].unique())
data['fireplacecnt'] = data['fireplacecnt'].fillna(0).astype(np.int32)
print('After', data['fireplacecnt'].unique())
del data['fireplaceflag']
print('Before', data['fullbathcnt'].unique())
missing_fullbathcnt = data['fullbathcnt'].isnull()
data.loc[missing_fullbathcnt, 'fullbathcnt'] = data['bathroomcnt'][missing_fullbat
data['fullbathcnt'] = data['fullbathcnt'].astype(np.float32)
print('After', data['fullbathcnt'].unique())

1/12/2018 lab1
In [33]:
Variable: garagetotalsqft - Total number of garages on the lot including
an attached garage
We first replaced missing values where garagecarcnt is 0 with 0 garagetotalsqft. We changed the
column datatype to a float. We then replaced all outliers with a maximum and minimum value of
(mean ± 5 * std), respectively.
In [34]:
Variable: hashottuborspa - Does the home have a hot tub or spa
In this dataset missing values represent doesn't have a hot tub or spa. we replaced all missing
values with 0 and all True values with 1. We changed the column datatype to integer.
[ 0 2 4 1 3 5 7 6 8 9 12 11 10 13 14 15 25 21 18 17 24 19 16 20]
Outliers found!
data['garagecarcnt'] = data['garagecarcnt'].fillna(0).astype(np.int32)
print(data['garagecarcnt'].unique())
fix_outliers(data, 'garagetotalsqft')
data.loc[data['garagecarcnt'] == 0, 'garagetotalsqft'] = 0
data['garagecarcnt'] = data['garagecarcnt'].astype(np.float32)
assert data['garagetotalsqft'].isnull().sum() == 0

1/12/2018 lab1
In [35]:
Variable: heatingorsystemtypeid - Type of home heating system
We replaced all missing values with 0 which will represent a missing heating system type id. We
changed the column datatype to integer.
In [36]:
Variable: landtaxvaluedollarcnt - the assessed value of the land
We replaced all missing values with the median assessed land values. We changed the column
datatype to integer. We then replaced all outliers with a maximum and minimum value of (mean ± 5
* std), respectively.
('Before', array([nan, True], dtype=object))
('After', array([0, 1]))
('Before', array([ nan, 2., 7., 20., 6., 13., 18., 24., 12., 10.,
1.,
14., 21., 11., 19.]))
('After', array([ 0, 2, 7, 20, 6, 13, 18, 24, 12, 10, 1, 14, 21, 11, 19]))
print('Before', data['hashottuborspa'].unique())
data['hashottuborspa'] = data['hashottuborspa'].fillna(0).replace('True', 1).asty
print('After', data['hashottuborspa'].unique())
print('Before', data['heatingorsystemtypeid'].unique())
data['heatingorsystemtypeid'] = data['heatingorsystemtypeid'].fillna(0).astype(np
print('After', data['heatingorsystemtypeid'].unique())

1/12/2018 lab1
In [37]:
Variables: latitude and longitude
Has datatype: interval and no missing values. We changed the column datatype to float.
In [38]:
Variable: logerror - Error or the Zillow model response variable
Has datatype: interval and 96.964429 percent of values missing
We will not fill any missing values because they represent the test part of the dataset. We changed
the column datatype to float.
In [39]:
Variable: lotsizesquarefeet - Area of the lot in square feet
Outliers found!
('Before', array([ 9.00000000e+00, 2.75160000e+04, 7.62631000e+05, ...,
1.28007500e+06, 3.61063000e+05, 9.54574000e+05]))
('After', array([ 9, 27516, 762631, ..., 1280075, 361063, 954574]))
fix_outliers(data, 'landtaxvaluedollarcnt')
print('Before', data['landtaxvaluedollarcnt'].unique())
median_value = data['landtaxvaluedollarcnt'].median()
data['landtaxvaluedollarcnt'] = data['landtaxvaluedollarcnt'].fillna(median_value
print('After', data['landtaxvaluedollarcnt'].unique())
data[['latitude', 'longitude']] = data[['latitude', 'longitude']].astype(np.float
data['logerror'] = data['logerror'].astype(np.float32)

1/12/2018 lab1
We replace all missing values with 0 which will represent no lot. We changed the column datatype
to integer. We then replaced all outliers with a maximum and minimum value of (mean ± 5 * std),
respectively.
In [40]:
Variable: numberofstories - number of stories or levels the home
has
We replace all missing values with 1 after removing all outliers to represent a single story home. We
changed the column datatype to integer. We then replaced all outliers with a maximum and
minimum value of (mean ± 5 * std), respectively.
Outliers found!
fix_outliers(data, 'lotsizesquarefeet')
data['lotsizesquarefeet'] = data['lotsizesquarefeet'].fillna(0).astype(np.float32

1/12/2018 lab1
In [41]:
Variable: parcelid - Unique identifier for parcels (lots)
Has datatype: nominal and no values missing. We changed the column datatype to integer.
In [42]:
Variable: poolcnt - Number of pools on the lot (if any)
We replaced all missing values with 0 which will represent no pools. We changed the column
datatype to integer.
In [43]:
Variable: poolsizesum - Total square footage of all pools on
Outliers found!
('Before', array([ nan, 1. , 4. , 2. , 3.
,
4.09684575]))
('After', array([1, 4, 2, 3]))
('Before', array([ nan, 1.]))
fix_outliers(data, 'numberofstories')
print('Before', data['numberofstories'].unique())
data['numberofstories'] = data['numberofstories'].fillna(1).astype(np.int32)
print('After', data['numberofstories'].unique())
data['parcelid'] = data['parcelid'].astype(np.int32)
print('Before', data['poolcnt'].unique())
data['poolcnt'] = data['poolcnt'].fillna(0).astype(np.int32)
print('After', data['poolcnt'].unique())

1/12/2018 lab1
property
Has datatype: ratio and 99 percent of values missing
We replaced all missing values with 0 if number of pools is 0 or with the average poolsizesum
otherwise. We changed the column datatype to a float. We then replaced all outliers with a
In [44]:
Variable: pooltypeid10 - Spa or Hot Tub
We replaced all missing values with 0 which will represent no Spa or Hot Tub. We changed the
column datatype to integer.
In [45]:
Variable: pooltypeid2 - Pool with Spa/Hot Tub
Outliers found!
fix_outliers(data, 'poolsizesum')
data.loc[data['poolsizesum'].isnull(), 'poolsizesum'] = int(data['poolsizesum'].me
data.loc[data['poolcnt'] == 0, 'poolsizesum'] = 0
data['poolcnt'] = data['poolcnt'].astype(np.float32)
print('Before', data['pooltypeid10'].unique())
data['pooltypeid10'] = data['pooltypeid10'].fillna(0).astype(np.int32)
print('After', data['pooltypeid10'].unique())

1/12/2018 lab1
We replaced all missing values with 0 which will represent no Pool with Spa/Hot Tub. We changed
the column datatype to integer.
In [46]:
Variable: pooltypeid7 - Pool without hot tub
We replaced all missing values with 0 which will represent no pool without hot tub. We changed the
column datatype to integer.
In [47]:
Variable: propertycountylandusecode - County land use code i.e. it's
zoning at the county level
We replaced all missing values with 0 which will represent no county land use code. We changed
the column datatype to string.
In [48]:
Variable: propertylandusetypeid - Type of land use the property is zoned
for
Has datatype: nominal and 0 percent of values missing.
We are just changing the datatype to integer
In [49]:
('Before', ['010D', '0109', '1200', '1210', '010V', '300V', '0100', '0200',
'...'])
('After', ['010D', '0109', '1200', '1210', '010V', '300V', '0100', '0200',
'...'])
print('Before', data['propertycountylandusecode'].unique()[:8].tolist() + ['...']
data['propertycountylandusecode'] = data['propertycountylandusecode'].fillna(0).a
print('After', data['propertycountylandusecode'].unique()[:8].tolist() + ['...'])
data['propertylandusetypeid'] = data['propertylandusetypeid'].astype(np.int32)

1/12/2018 lab1
Variable: propertyzoningdesc - Description of the allowed land uses
(zoning) for that property
We replaced all missing values with 0 which will represent no description of the allowed land uses.
We changed the column datatype to string.
In [50]:
Variable: rawcensustractandblock - Census tract and block ID combined
- also contains blockgroup assignment by extension
Has datatype: nominal and 0 percent of values missing
We are just changing the datatype to integer
In [51]:
Variable: regionidcity - City in which the property is located (if
any)
we will replace any missing values with 0 to represent no city ID. We are just changing the datatype
to integer
In [52]:
Variable: regionidcounty - County in which the property is located
('Before', array([nan, 'LCA11*', 'LAC2', ..., 'WCR1400000', 'EMPYYY', 'RMM2*'],
dtype=object))
('After', array(['0', 'LCA11*', 'LAC2', ..., 'WCR1400000', 'EMPYYY', 'RMM2*'],
dtype=object))
('Before', array([ 60378002.041 , 60378001.011002, 60377030.012017, ...,
60590878.032022, 60590626.211013, 60379012.091563]))
('After', array([60378002, 60378001, 60377030, ..., 61110057, 60375324, 6037599
1]))
('Before', [37688.0, 51617.0, 12447.0, 396054.0, 47547.0, nan, 54311.0, 40227.
0, '...'])
('After', [37688, 51617, 12447, 396054, 47547, 0, 54311, 40227, '...'])
print('Before', data['propertyzoningdesc'].unique())
data['propertyzoningdesc'] = data['propertyzoningdesc'].fillna(0).astype(np.str)
print('After', data['propertyzoningdesc'].unique())
print('Before', data['rawcensustractandblock'].unique())
data['rawcensustractandblock'] = data['rawcensustractandblock'].fillna(0).astype(
print('After', data['rawcensustractandblock'].unique())
print('Before', data['regionidcity'].unique()[:8].tolist() + ['...'])
data['regionidcity'] = data['regionidcity'].fillna(0).astype(np.int32)
print('After', data['regionidcity'].unique()[:8].tolist() + ['...'])

1/12/2018 lab1
Has datatype: nominal and 0 percent of values missing. We changed the column datatype to
integer.
In [53]:
Variable: regionidneighborhood - Neighborhood in which the property is
located
We replaced all missing values with 0 which will represent no region ID neighborhood. We changed
the column datatype to integer.
In [54]:
Variable: regionidzip - Zip code in which the property is located
We replaced all missing values with 0 which will represent no zip code. We changed the column
datatype to integer.
In [55]:
Variable: roomcnt - Total number of rooms in the principal
residence
We replaced all missing values with 1 which will represent no Total number of rooms in the principal
residence reported. We changed the column datatype to integer. We then replaced all outliers with a
('Before', array([ 3101., 1286., 2061.]))
('After', array([3101, 1286, 2061]))
('Before', [nan, 27080.0, 46795.0, 274049.0, 31817.0, 37739.0, 115729.0, 7877.
0, '...'])
('After', [0, 27080, 46795, 274049, 31817, 37739, 115729, 7877, '...'])
('Before', [96337.0, 96095.0, 96424.0, 96450.0, 96446.0, 96049.0, 96434.0, 9643
6.0, '...'])
('After', [96337, 96095, 96424, 96450, 96446, 96049, 96434, 96436, '...'])
print('Before', data['regionidcounty'].unique())
data['regionidcounty'] = data['regionidcounty'].astype(np.int32)
print('After', data['regionidcounty'].unique())
print('Before', data['regionidneighborhood'].unique()[:8].tolist() + ['...'])
data['regionidneighborhood'] = data['regionidneighborhood'].fillna(0).astype(np.i
print('After', data['regionidneighborhood'].unique()[:8].tolist() + ['...'])
print('Before', data['regionidzip'].unique()[:8].tolist() + ['...'])
data['regionidzip'] = data['regionidzip'].fillna(0).astype(np.int32)
print('After', data['regionidzip'].unique()[:8].tolist() + ['...'])

1/12/2018 lab1
In [56]:
Variable: storytypeid - Type of floors in a multi-story house (i.e.
basement and main level, split-level, attic, etc.). See tab for
details.
With 99% missing values, we decided to remove this variable.
In [57]:
Variable: structuretaxvaluedollarcnt - the assessed value of the
building
We replaced all missing values with the median assessed building tax. We changed the column
datatype to integer. We then replaced all outliers with a maximum and minimum value of (mean ± 5
* std), respectively.
Outliers found!
('Before', array([ 0. , 8. , 4. , 5. ,
7. , 6. , 11. , 3. ,
10. , 9. , 2. , 12. ,
15.67699991, 13. , 15. , 14. ,
1. , nan]))
('After', array([ 0, 8, 4, 5, 7, 6, 11, 3, 10, 9, 2, 12, 15, 13, 14,
1]))
fix_outliers(data, 'roomcnt')
print('Before', data['roomcnt'].unique())
data['roomcnt'] = data['roomcnt'].fillna(1).astype(np.int32)
print('After', data['roomcnt'].unique())
del data['storytypeid']

1/12/2018 lab1
In [58]:
Variable: taxamount - property tax for the assessment year
We replaced all missing values with the median property taxes for the assessment year. We
Outliers found!
('Before', array([ nan, 650756., 571346., ..., 409940., 463704., 43776
5.]))
('After', array([122590, 650756, 571346, ..., 409940, 463704, 437765]))
fix_outliers(data, 'structuretaxvaluedollarcnt')
print('Before', data['structuretaxvaluedollarcnt'].unique())
medTax = np.nanmedian(data['structuretaxvaluedollarcnt'])
data['structuretaxvaluedollarcnt'] = data['structuretaxvaluedollarcnt'].fillna(med
print('After', data['structuretaxvaluedollarcnt'].unique())

1/12/2018 lab1
In [59]:
Variable: taxdelinquencyflag - property taxes from 2015 that are past
due
We replaced all missing values with 0 representing no past due property taxes and all Y values with
1 representing that there are past due property taxes. We changed the column datatype to integer.
In [60]:
Variable: taxdelinquencyyear - years of delinquency
We replaced all missing values with 0 representing no years of property tax delinquencies. We
Outliers found!
('Before', array([ nan, 20800.37, 14557.57, ..., 33604.04, 12627.18,
15546.14]))
('After', array([ 3991.7800293 , 20800.36914062, 14557.5703125 , ...,
33604.0390625 , 12627.1796875 , 15546.13964844]))
('Before', array([nan, 'Y'], dtype=object))
fix_outliers(data, 'taxamount')
print('Before', data['taxamount'].unique())
median_value = data['taxamount'].median()
data['taxamount'] = data['taxamount'].fillna(median_value).astype(np.float32)
print('After', data['taxamount'].unique())
print('Before', data['taxdelinquencyflag'].unique())
data['taxdelinquencyflag'] = data['taxdelinquencyflag'].fillna(0).replace('Y', 1)
print('After', data['taxdelinquencyflag'].unique())

1/12/2018 lab1
In [61]:
Variable: taxvaluedollarcnt - total tax
We replaced all missing values with the median total tax amount. We changed the column datatype
to integer. We then replaced all outliers with a maximum and minimum value of (mean ± 5 * std),
respectively.
Outliers found!
('Before', array([ nan, 13. , 15. , 11. ,
14. , 9. , 10. , 8. ,
12. , 7. , 6. , 2. ,
26.79676804, 5. , 3. , 4. ,
0.98797484, 1. ]))
('After', array([ 0, 13, 15, 11, 14, 9, 10, 8, 12, 7, 6, 2, 26, 5, 3,
4, 1]))
fix_outliers(data, 'taxdelinquencyyear')
print('Before', data['taxdelinquencyyear'].unique())
data['taxdelinquencyyear'] = data['taxdelinquencyyear'].fillna(0).astype(np.int32
print('After', data['taxdelinquencyyear'].unique())

1/12/2018 lab1
In [62]:
Variable: threequarterbathnbr - Number of 3/4 bathrooms in house
(shower + sink + toilet)
We replaced all missing values with 0 which will represent no Number of 3/4 bathrooms in the
property. We changed the column datatype to integer.
In [63]:
Variable: transactiondate - Date of the transaction response
variable
Will not fill any missing values because they represent the test part of the dataset
Outliers found!
('Before', array([ 9.00000000e+00, 2.75160000e+04, 1.41338700e+06, ...,
4.70248000e+05, 6.43794000e+05, 5.30550000e+05]))
('After', array([ 9, 27516, 1413387, ..., 470248, 643794, 530550]))
('Before', array([ nan, 1., 2., 4., 3., 6., 5., 7.]))
('After', array([0, 1, 2, 4, 3, 6, 5, 7]))
fix_outliers(data, 'taxvaluedollarcnt')
print('Before', data['taxvaluedollarcnt'].unique())
median_value = data['taxvaluedollarcnt'].median()
data['taxvaluedollarcnt'] = data['taxvaluedollarcnt'].fillna(median_value).astype
print('After', data['taxvaluedollarcnt'].unique())
print('Before', data['threequarterbathnbr'].unique())
data['threequarterbathnbr'] = data['threequarterbathnbr'].fillna(0).astype(np.int
print('After', data['threequarterbathnbr'].unique())

1/12/2018 lab1
In [64]:
Variable: typeconstructiontypeid - What type of construction material
was used to construct the home
With 99% missing values, we decided to remove this variable.
In [65]:
Variable: unitcnt - number of units in the building
We replaced all missing values with 1 to represent a single family home for any with no values. We
In [66]:
Variable: yardbuildingsqft17 - sq feet of patio in yard
Outliers found!
('Before', [nan, 2.0, 1.0, 3.0, 5.0, 4.0, 9.0, 13.420418204007635, '...'])
('After', array([ 1, 2, 3, 5, 4, 9, 13, 12, 6, 7, 8, 10, 11]))
data['transactiondate'] = pd.to_datetime(data['transactiondate'])
del data['typeconstructiontypeid']
fix_outliers(data, 'unitcnt')
print('Before', data['unitcnt'].unique()[:8].tolist() + ['...'])
data['unitcnt'] = data['unitcnt'].fillna(1).astype(np.int32)
print('After', data['unitcnt'].unique())

1/12/2018 lab1
We replaced all missing values with 0 representing no patio. We changed the column datatype to
integer. We then replaced all outliers with a maximum and minimum value of (mean ± 5 * std),
respectively.
In [67]:
Variable: yardbuildingsqft26 - storage shed/building in yard
We replaced all missing values with 0 which will represent no (square ft) storage shed or building in
the yard. We changed the column datatype to integer. We then replaced all outliers with a maximum
and minimum value of (mean ± 5 * std), respectively.
Outliers found!
('Before', array([ nan, 450., 94., ..., 969., 1359., 1079.]))
('After', array([ 0, 450, 94, ..., 969, 1359, 1079]))
fix_outliers(data, 'yardbuildingsqft17')
print('Before', data['yardbuildingsqft17'].unique())
data['yardbuildingsqft17'] = data['yardbuildingsqft17'].fillna(0).astype(np.int32
print('After', data['yardbuildingsqft17'].unique())

1/12/2018 lab1
In [68]:
Variable: yearbuilt - The Year the residence was built
We replaced all missing values with the median year built of 1963 until we have a better method to
impute. We changed the column datatype to integer.
In [69]:
End of data cleaning
We went through every variable and next cell will confirm that the dataset has no missing values.
In [70]:
Simple Statistics
Outliers found!
('Before', [nan, 1948.0, 1947.0, 1943.0, 1946.0, 1978.0, 1958.0, 1949.0,
'...'])
('After', [1963, 1948, 1947, 1943, 1946, 1978, 1958, 1949, '...'])
fix_outliers(data, 'yardbuildingsqft26')
data['yardbuildingsqft26'] = data['yardbuildingsqft26'].fillna(0).astype(np.float
#there's too many values to print, before and after data redacted
print('Before', data['yearbuilt'].unique()[:8].tolist() + ['...'])
medYear = data['yearbuilt'].median()
data['yearbuilt'] = data['yearbuilt'].fillna(medYear).astype(np.int32)
print('After', data['yearbuilt'].unique()[:8].tolist() + ['...'])
# 'logerror' and 'transactiondate' are future variables and only exist in the tran
explanatory_vars = data.columns[~data.columns.isin(['logerror', 'transactiondate'
assert np.all(~data[explanatory_vars].isnull())

1/12/2018 lab1
10 points
Description:
Visualize appropriate statistics (e.g., range, mode, mean, median, variance, counts) for a subset of
attributes. Describe anything meaningful you found from this or if you found something potentially
interesting. Note: You can also use data from other sources for comparison. Explain why the
statistics run are meaningful.
Table of Binary Variables (0 or 1)
We standardized all Yes/No and True/False variables to 1 or 0, respectively. The table below shows
that all binary flags in this dataset represent rare features such a pool, hot tub, tax delinquency flag,
and three quarter bathroom.
In [71]:
Summary Statistics of All Continuous Variables
To make the table more readable, we converted all simple statistics of continuous variables to
integers. We lose some precision but we get a better overview. For each variable, we have already
accounted for outliers and standardized missing values. We can immediately see that 0 is the most
common value for many of the variables. To explore further, we chose to visualize each variable that
had non-zero 25% to 75% values in the form of a boxplot and histogram.
Out[71]: Percent with value equal to 1
hashottuborspa 2.320720
poolcnt 17.403347
pooltypeid2 1.078548
taxdelinquencyflag 1.898850
threequarterbathnbr 10.584165
bin_vars = ['hashottuborspa', 'poolcnt', 'pooltypeid2', 'pooltypeid7', 'pooltypeid
bin_data = data[bin_vars]
result_table = bin_data.mean() * 100
pd.DataFrame(result_table, columns=['Percent with value equal to 1'])

1/12/2018 lab1
In [72]:
Calculated Finished Square Feet
For calculated square feet, most values were 0 with a range from 0 to 10898 sqft. Note that we
removed outliers earlier while cleaning the data. The median of 1561 was a little smaller than the
mean of 1784 so we expect to see a slight right skew, which we do below. What is interesting here
is the peak at 0 of values and then another peak around 1600 to 1800. We continue to have few
properties with very large (higher than the 75th percentile of 2124) which is fairly normal for any
area to have the middle class homes with few larger homes mixed in.
Out[72]: count mean std min 25% 50% 75% max
calculatedfinishedsquarefeet 2973905 1784 984 0 1199 1561 2124 1092
finishedsquarefeet12 2973905 1596 958 0 1092 1466 1996 6615
finishedsquarefeet50 2973905 94 390 0 0 0 0 3130
garagetotalsqft 2973905 113 217 0 0 0 0 1610
lotsizesquarefeet 2973905 19810 73796 0 5200 6700 9243 1710
poolsizesum 2973905 90 196 0 0 0 0 1476
yardbuildingsqft17 2973905 8 61 0 0 0 0 1485
yardbuildingsqft26 2973905 0 12 0 0 0 0 2126
yearbuilt 2973905 1964 23 1801 1950 1963 1981 2015
structuretaxvaluedollarcnt 2973905 166367 179850 1 75440 122590 195143 2181
taxvaluedollarcnt 2973905 407695 429374 1 181179 306086 485000 4052
assessmentyear 2973905 2014 0 2000 2015 2015 2015 2016
landtaxvaluedollarcnt 2973905 242391 287722 1 76724 167043 303002 2477
taxamount 2973905 5229 5284 1 2471 3991 6178 5129
taxdelinquencyyear 2973905 0 1 0 0 0 0 26
logerror 90275 0 0 -4 0 0 0 4
train_data = data[~data['logerror'].isnull()]
continous_vars = variables[variables['type'].isin(['ratio', 'interval'])].index
continous_vars = continous_vars[continous_vars.isin(data.columns)]
continous_vars = continous_vars[~continous_vars.isin(['longitude', 'latitude'])]
output_table = data[continous_vars].describe().T
mode_range = data[continous_vars].mode().T
mode_range.columns = ['mode']
mode_range['range'] = data[continous_vars].max() - data[continous_vars].min()
output_table = output_table.join(mode_range)
output_table.astype(int)

1/12/2018 lab1
In [73]:
Finished Living Area
Similar to calculated finished square feet, finished living area had outliers which we already fixed
above. The range for finished living area is 0 to 6871 with 0 being the mode of the data. The mean
(1596) is about 100sqft larger than the median (1466) so they are relatively the same since the
variance is 962.
This variable is bimodal in with a large spike at 0 and another peak with a fairly normal distribution
and long right tail at around 1400.
We also see a slight spike at the very end of the tail of the dataset. This means there were a lot of
outliers that were set to the maximum (mean + 6 * std).
f, (ax0,ax2) = plt.subplots(nrows=2, ncols=1, figsize=[15, 7])
sns.boxplot(data['calculatedfinishedsquarefeet'], ax=ax0, color="#34495e").set_tit
sns.distplot(data['calculatedfinishedsquarefeet'], ax=ax2, color="#34495e");

1/12/2018 lab1
In [74]:
Lot Size Square Feet
Lot size square feet has the largest range from 0 to 1,710,750 even after removing all outliers
(mean + std * 5). The mode for this variable is 0 so we see below a spike at 0 and a very long right
tail.
What is interesting with this variable is the large variance of 73796. The 25th to 75th percentile
values are 5200 and 9243 respectively so we will skipped over the box plot and plotted the
histogram below.
In the histogram, we see a right skewed distribution which makes sense considering the mean is
19810 and the median is 6700 - again, with such a large variance it is difficult for the eye to see the
difference. The main takeaway here is the large number of 0s.
f, (ax0,ax2) = plt.subplots(nrows=2, ncols=1, figsize=[15, 7])
sns.boxplot(data['finishedsquarefeet12'], ax=ax0, color="#34495e").set_title('Fin
sns.distplot(data['finishedsquarefeet12'], ax=ax2, color="#34495e");

1/12/2018 lab1
In [75]:
Year Built
The year the properties were built ranges from 1801 to 2015. The mode and median of 1963 is only
a year difference from the mean of 1964. The distribution seems to be fairly normal with the peak in
the early 1960s and dropping off on both sides. We see a number of homes that were built before
1905 (the low whisker of the boxplot) which gives us a long left tail.
We see a few other spikes in homes built which could correlate to a number of other factors such as
healthy economic growth, political backing on mortgages, or rises in population. The baby boomers
born early 1960 shows many houses being built and around the time they turned 18 more houses
seem to have been built. We see an apparent fall right before 2000 which could be the dot com
burst and another drop in the housing burst of 2007. Because our data was collected in 2016, we
expect to see fewer homes built the previous year.
What will be interesting with this variable is how old a home has to be to begin to "fall apart" or need
major renovations to the piping or foundation. Will a home built in a certain year have many homes
made from a faulty material that causes damages later on? Will the Zestimate take into account the
disclosures of a home that each sale price typically does?
f, (ax0) = plt.subplots(nrows=1, ncols=1)
sns.distplot(data['lotsizesquarefeet'], ax=ax0, color="#34495e").set_title('Lot s

1/12/2018 lab1
In [76]:
Total Tax Value
The total tax value of the property ranges from 1 to 4,052,186. The median of 306,086 is the same
as the mode and a little smaller than the mean of 407,695 which is evident in the right skewed
distribution below. These values have already been adjusted for outliers which is why we see a
slight spike at the maximum value for larger developments and unique mansions.
The distribution is fairly similar to square feet above because the tax is calculated by value
assessed * square feet. What is interesting to note here is the missing values for tax were replaced
by the median (hence the median and mode being the same) where the square footage missing
values were replaced with 0s (hence the 0 as the mode and second peak in the distribution).
f, (ax0,ax2) = plt.subplots(nrows=2, ncols=1)
sns.boxplot(data['yearbuilt'].dropna(), ax=ax0, color="#34495e").set_title('Year
sns.distplot(data['yearbuilt'].dropna(), ax=ax2, color="#34495e");

1/12/2018 lab1
In [77]:
Building and Land Tax
The building or structure tax has a similar right skewed distribution as total tax. The values range
from 1 to 2,165,929, already adjusted for outliers and cleaned up with missing values set to median.
That being said, the median and mode are the same at 122,590 which is lower than the mean of
166,344.
The land tax values range from 1 to 2,477,536, also adjusted for outliers and cleaned up with
missing values set to median. Because of this, the median and mode are the same at 167,043
which is lower than the mean of 242,391.
Land tax seems to have a larger range of values from the 25th to 75th percentile than the building
tax. This means that the land is valued at a greater variance (287k) than the buildings in certain
areas (variance of 179k). We think this could be due to location itself as better neighborhoods, safer
areas, or better schools could result in a higher assessment than other locations, thus widening the
variance.
f, (ax0,ax2) = plt.subplots(nrows=2, ncols=1)
sns.boxplot(data['taxvaluedollarcnt'], ax=ax0, color="#34495e").set_title('Total t
sns.distplot(data['taxvaluedollarcnt'], ax=ax2, color="#34495e");

1/12/2018 lab1
In [78]:
Assessment Year
Assessment year is the year that the property was assessed. The 25th through 75th percentile of
values are all from the year 2015 so reading a box plot is not very helpful. Instead we list out the
unique values for assessment year along with our histogram.
In the state of California, the base year value is set when you originally purchase the property,
based on the sales price listed on the deed. However, there are exceptions which is why we see a
few assessment years from 2000 to 2016 thrown in.
In order for assessment year to be useful for our predictions, we should find out what each
exception is and what the cause of it not to be assessed at the point of sale. This could affect the
predicted log error.
f, (ax0,ax2,ax3,ax4) = plt.subplots(nrows=4, ncols=1, figsize=[15, 14])
sns.boxplot(data['structuretaxvaluedollarcnt'], ax=ax0, color="#34495e").set_title
sns.distplot(data['structuretaxvaluedollarcnt'], ax=ax2, color="#34495e");
sns.boxplot(data['landtaxvaluedollarcnt'], ax=ax3, color="#34495e").set_title('La
sns.distplot(data['landtaxvaluedollarcnt'], ax=ax4, color="#34495e");

1/12/2018 lab1
In [79]:
Visualize Attributes
15 points
Description:
Visualize the most interesting attributes (at least 5 attributes, your opinion on what is interesting).
Important: Interpret the implications for each visualization. Explain for each attribute why the chosen
visualization is appropriate.
Distribution of Target Variable: Logerror
In the training dataset, logerror is the response variable so we are interested in seeing the
distribution of log error that we are training on. We visualize this using a boxplot and histogram to
get a general picture of the overall distribution. It is symmetric around zero which implies that the
model generating the logerror has no bias and is very accurate in most instances.
('Unique years:', array([2015, 2014, 2003, 2012, 2001, 2011, 2013, 2016, 2010,
2004, 2005,
2002, 2000, 2009]))
print('Unique years:', data['assessmentyear'].unique())
f, (ax2) = plt.subplots(nrows=1, ncols=1, figsize=[15, 4])
sns.distplot(data['assessmentyear'], ax=ax2, color="#34495e")
plt.title('Assessment year distribution');

1/12/2018 lab1
In [80]:
Count of Bathrooms
We think that the number of bathrooms in a home could be interesting because our data was
collected in California where rent is very high. It is common to buy a rental property and have
random tenants. Tenants that do not know each other may want their own bathroom. In our case,
most homes have 2 bathrooms. Notably, there are outliers with no bathrooms or suspiciously high
x = train_data['logerror']
f, (ax_box, ax_hist) = plt.subplots(2, sharex=True,
gridspec_kw={"height_ratios": (.15, .85)}, figsize=(10, 10))
sns.boxplot(train_data['logerror'][train_data['logerror'].abs()<1], ax=ax_box, co
sns.distplot(
train_data['logerror'][train_data['logerror'].abs()<1],
ax=ax_hist, bins=400, kde=False, color="#34495e");

1/12/2018 lab1
counts. We see records in the dataset with no bathroom which we justified above as being possible.
Because we are looking at frequency, we chose to visualize the sum of each number of bathrooms
(as a category) in a bar chart.
In [81]:
Count of Bedrooms
For the same reasons we were interested in the number of bathrooms, we are also interested in the
number of bedrooms. In our dataset, most properties have 3 bedrooms and we see fewer instances
as we go up or down one bedroom in the data. Here we still see records without any bedrooms
which we justified as studios above. We chose the same visualization (using number of bedrooms
as a category and counting the frequency of each category) displayed in a bar chart below.
sns.countplot(data['bathroomcnt'], color="#34495e")
plt.ylabel('Count', fontsize=12)
plt.xlabel('Bathrooms', fontsize=12)
plt.title("Frequency of Bathroom count", fontsize=15);

1/12/2018 lab1
In [82]:
Bed to Bath Ratio
After visualizing the distribution of bathroom and bedroom counts, we also thought it would be
interesting to try to see if the number of bathrooms were dependent on the number of bedrooms.
We chose to stick with a bar chart, only this time using the ratio of bedrooms to bathrooms as the
category to find the sum counts of. What we found is most homes have about a ratio of 1.5
bedrooms per 1 bathroom in a property.
plt.ylabel('Count', fontsize=12)
plt.xlabel('Bedrooms', fontsize=12)
plt.title("Frequency of Bedrooms count", fontsize=15)
sns.countplot(data['bedroomcnt'], color="#34495e");

1/12/2018 lab1
In [83]:
Average Tax Per Square Feet
For our last attribute, we calculated the tax per square foot to see if we could find any trends. We
again chose to use a bar chart to plot the ratio and the sum counts. What we found is that plotting
this exposes extreme outliers for possible elimination. Most properties are under a few dollars per
square foot but as the visualization reveals, there are suspicious records. However, because this is
southern California and land space is limited for continuous growth, there could be a reason that
some places have high tax per square feet due to better real estate areas.
non_zero_mask = data['bathroomcnt'] > 0
bedroom = data[non_zero_mask]['bedroomcnt']
bathroom = data[non_zero_mask]['bathroomcnt']
bedroom_to_bath_ratio = bedroom / bathroom
bedroom_to_bath_ratio = bedroom_to_bath_ratio[bedroom_to_bath_ratio<6]
sns.distplot(bedroom_to_bath_ratio, color="#34495e", kde=False)
plt.title('Bed to Bath ratio', fontsize=15)
plt.xlabel('Ratio', fontsize=15)
plt.ylabel('Count', fontsize=15);

1/12/2018 lab1
In [84]:
Explore Joint Attributes
15 points
Description:
Visualize relationships between attributes: Look at the attributes via scatter plots, correlation, cross-
tabulation, group-wise averages, etc. as appropriate. Explain any interesting relationships.
Absolute Log Error and Number of Occurrences Per
Month
We compared amount of absolute error based monthly average and found out that the error could
be cyclical for the year, where the error dips during the Spring and Summer months and rises during
the Winter months.
non_zero_mask = data['calculatedfinishedsquarefeet'] > 0
tax = data[non_zero_mask]['taxamount']
sqft = data[non_zero_mask]['calculatedfinishedsquarefeet']
tax_per_sqft = tax / sqft
tax_per_sqft = tax_per_sqft[tax_per_sqft<10]
sns.distplot(tax_per_sqft, color="#34495e", kde=False)
plt.title('Tax Per Square Feet', fontsize=15)
plt.xlabel('Ratio', fontsize=15)

1/12/2018 lab1
We compared amount of transactions based monthly average and the transactions are the highest
during the Spring, Summer, and Fall seasons possibly due to an optimal time to sell property. The
transactions are at its lowest during the Winter season.
For a cross comparison, we have high number of transactions during the Spring and Summer
seasons while the log error is relatively low and we have low number of transactions during the
Winter season while the log error is relatively high.
In [85]:
Number of Transactions and Mean Absolute Log Error Per
Day of the Week
Saturdays and Sundays are non-work days, hence why there is a dip in absolute log error and
number of transactions
For the workdays, Friday has the most transactions while Monday has the least.
months = train_data['transactiondate'].dt.month
month_names = ['January','February','March','April','May','June','July','August',
train_data['abs_logerror'] = train_data['logerror'].abs()
f, (ax0, ax1) = plt.subplots(nrows=1, ncols=2, figsize=[17, 7])
per_month = train_data.groupby(months)["abs_logerror"].mean()
per_month.index = month_names
ax0.set_title('Average Log Error Across Month Of 2016')
ax0.set_xlabel('Month Of The Year', fontsize=15)
ax0.set_ylabel('Log Error', fontsize=15)
sns.pointplot(x=per_month.index, y=per_month, color="#34495e", ax=ax0)
per_month = train_data.groupby(months)["logerror"].count()
per_month.index = month_names
ax1.set_title('No Of Occurunces per month In 2016')
ax1.set_xlabel('Month Of The Year', fontsize=15)
ax1.set_ylabel('Nimber of Occurences', fontsize=15)
sns.barplot(x=per_month.index, y=per_month, color="#34495e", ax=ax1);

1/12/2018 lab1
For the workdays, Monday has relatively the most log errors while Friday has relatively the least log
errors.
For cross analysis, Monday has the least transactions with the most error while Friday has the most
transactions with the least errors. Sunday and Saturday are special cases and does not have
substantial evidence to provide any trends.
In [86]:
Continuous Variable Correlation Heatmap
Heatmap of correlations are represented where the warmer colors are highly correlated, white is
non-correlated, and colder colors are negatively correlated. We see that calculated finished square
feet is correlated with finished square feet, due to collinearity. Tax amounts and year built are also
highly correlated to finished square feet as well as with one another.
Latitude and longitude are negatively correlated with each other possibly because the beachfront
properties are more expensive.
weekday = train_data['transactiondate'].dt.weekday
weekdays = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'S
abs_logerror = train_data['logerror'].abs()
f, (ax0, ax1) = plt.subplots(nrows=1, ncols=2, figsize=[17, 7])
to_plot = abs_logerror.groupby(weekday).count()
to_plot.index = weekdays
to_plot.plot(color="#34495e", linewidth=4, ax=ax0)
ax0.set_title('Number of Transactions Per Day')
ax0.set_ylabel('Number of Transactions', fontsize=15)
ax0.set_xlabel('Day', fontsize=15)
to_plot = abs_logerror.groupby(weekday).mean()
to_plot.index = weekdays
to_plot.plot(color="#34495e", linewidth=4, ax=ax1)
ax1.set_title('Mean Absolute Log Error Per Day')
ax1.set_ylabel('Mean Absolute Log Error', fontsize=15)
ax1.set_xlabel('Day', fontsize=15);

1/12/2018 lab1
In [87]:
Longitude and Latitude Data Points
continuous_vars = variables[variables['type'].isin(['ratio', 'interval'])].index
continuous_vars = continuous_vars[continuous_vars.isin(data.columns)]
continuous_vars = continuous_vars.sort_values()
corrs = train_data[continuous_vars].corr()
fig, ax = plt.subplots(figsize=(10, 10))
sns.heatmap(corrs, ax=ax)
plt.title("Variables correlation map", fontsize=20)
plt.xlabel('Continuous Variables', fontsize=15)
plt.ylabel('Continuous Variables', fontsize=15);

1/12/2018 lab1
From a simple graph, we can see the shoreline of California as well as possible areas of
obstruction, such as mountains that prevent property growth in those areas. The majority of
properties are in the center to upper left of the graph.
In [88]:
Number of Stories vs Year Built
As architectural feats improved, we started to see more properties with 2 or more stories by 1950.
The number of one story properties also increased during that time. The baby boomers, the end of
WWII and readily available steel, and mortgage incentives may be the cause of the increase of
more properties being built as well as more stories per property. Note: because we filled in missing
values as the median value, the 1965 spike in the data is artificial until we use other methods to
impute year built.
<matplotlib.figure.Figure at 0x114670f10>
plt.figure(figsize=(12,12));
sns.jointplot(x=data.latitude.values, y=data.longitude.values, size=10, color="#34
plt.ylabel('Longitude', fontsize=15)
plt.xlabel('Latitude', fontsize=15)
plt.title('Longitude and Latitude Data Points', fontsize=15);

1/12/2018 lab1
In [89]:
Explore Attributes and Class
10 points
Description:
Identify and explain interesting relationships between features and the class you are trying to
predict (i.e., relationships with variables and the target classification).
Correlation of Continuous Variables and Log
Error (Target Variable)
We see that calculatedfinishedsquarefeet has the highest correlation with log error (0.04) while price
per square feet has the highest negative correlation with log error (-0.02). taxvaluedollarcnt has
relatively low correlation with log error. We choose to further explore finishedsquarefeet12 and its
relationship with log error.
fig,ax1= plt.subplots()
fig.set_size_inches(20,10)
yearMerged = data.groupby(['yearbuilt', 'numberofstories'])["parcelid"].count().u
yearMerged = yearMerged.loc[1900:]
yearMerged.index.name = 'Year Built'
plt.title('Number of Stories Per Year Built', fontsize=15)
yearMerged.plot(ax=ax1, linewidth=4);

1/12/2018 lab1
In [90]: train_data = data[~data['logerror'].isnull()]
continuous_vars = continuous_vars[~continuous_vars.isin(['logerror', 'transactiond
labels = []
values = []
for column in continuous_vars:
labels.append(column)
values.append(train_data[column].corr(train_data['logerror']))
corr = pd.DataFrame({'labels':labels, 'values':values}).fillna(0.)
corr = corr.sort_values(by='values')
labels = corr['labels'].values
values = corr['values'].values
fig, ax = plt.subplots(figsize=(10,10))
plt.barh(range(len(labels)), values, color="#34495e")
plt.title("Correlation of Continuous Variables", fontsize=15);
plt.xlabel('Correlation', fontsize=15)
plt.ylabel('Continuous Variable', fontsize=15)
ax.set_yticks(range(len(labels)))
ax.set_yticklabels(labels, rotation='horizontal');

1/12/2018 lab1
Scatterplot of Log Error and Calculated Finished Square
Feet
We are plotting our best correlated variable calculatedfinishedsquarefeet against the logerror. We
don't see any linear relationship from the scatter plot below even though it is evenly distributed.
In [91]:
New Features
5 points
Description:
Are there other features that could be added to the data or created from existing features? Which
ones?
column = "calculatedfinishedsquarefeet"
sns.jointplot(train_data[column], train_data['logerror'], size=10, color="#34495e
plt.ylabel('Log Error', fontsize=12)
plt.xlabel('Calculated Finished Square Feet', fontsize=15)
plt.title("Calculated Finished Square Feet Vs Log Error", fontsize=15);

1/12/2018 lab1
Tax Per Square Feet
We created a tax per square feet feature. It is negatively correlated with log error and we hope that
it will add value to a predictive model.
In [92]:
City zip code details
The Zillow dataset has a variable: 'regionidcity' which is a numerical ID, representing the city in
which the property is located (if any). We don't have a string variable showing the city name.
We found a government dataset publicly available on the internet, containing all zip codes as well
as other information associated with each zip code. We have downloaded the dataset from here:
http://federalgovernmentzipcodes.us (http://federalgovernmentzipcodes.us) and joined it with our
dataset with the cell below.
This will give us the actual name of the cities, zip code type and location type.
New Variables Joined:
zipcode_type Standard, PO BOX Only, Unique, Military (implies APO or FPO) - Zip code
type may provide useful insight towards prediction
city USPS official city name(s) - this will distinguish one county from another that was
lacking in the original dataset
location_type Primary, Acceptable, Not Acceptable - because these are all valid location
properties, they will most likely be acceptable.
In [93]:
Out[92]: ('Correlation with log error:', -0.014065552662672554)
The zips dataset has 81831 rows and 4 columns
The merged dataset has 3857451 rows and 53 columns
non_zero_mask = data['calculatedfinishedsquarefeet'] > 0
tax = data[non_zero_mask]['taxamount']
sqft = data[non_zero_mask]['calculatedfinishedsquarefeet']
data['price_per_sqft'] = tax / sqft
'Correlation with log error:', data['price_per_sqft'].corr(data['logerror'])
# data from http://federalgovernmentzipcodes.us
zips = pd.read_csv('../input/free-zipcode-database.csv', low_memory=False)
zips = zips[['Zipcode','ZipCodeType','City','LocationType']]
zips.columns = ['zipcode', 'zipcode_type', 'city', 'location_type']
assert np.all(~zips.isnull())
zips = zips.rename(columns={'zipcode':'regionidzip'})
data = pd.merge(data, zips, how='left', on='regionidzip')
print('The zips dataset has %d rows and %d columns' % zips.shape)
print('The merged dataset has %d rows and %d columns' % data.shape)

1/12/2018 lab1
Table of New Variables
Just focusing on the new features added to the dataset, here are the value types and descriptions.
In [94]:
Other Ideas For New Features
Other features that we thought about that we could include in the future are last remodel date of
kitchen or bathroom, key words in the description of overpriced Zestimates or underpriced
Zestimates, and how close a home is to a grocery store, Starbucks, a mall, or a place of interest.
A recently remodeled home could raise the actual sale price much higher than the Zestimate.
Certain words in the listing description could be associated with lower sale prices or people who bid
a higher sale price. Lastly the walkability, or how close a home is to a grocery store, Starbucks, a
mall, or place of interest could increase the final sale price as well.
Exceptional Work
10 points
Description:
You have free reign to provide additional analyses. One idea: implement dimensionality reduction,
then visualize and interpret the results.
Categorical Feature Importance
Out[94]: Variable Type Scale Description
city nominal
[APO, WHISKEYTOWN, nan,
REDDING, FPO, ... (239 More)]
USPS offical city name(s)
location_type nominal
[PRIMARY, nan, ACCEPTABLE,
NOT ACCEPTABLE]
Primary, Acceptable, Not
Acceptable
price_per_sqft ratio (0, 11911) Tax per SQFT
zipcode_type nominal
[MILITARY, PO BOX, nan,
STANDARD, UNIQUE]
Standard, PO BOX Only, Unique,
Military(implies APO or FPO)
variables_description = [
['price_per_sqft', 'ratio', 'TBD', 'Tax per SQFT']
,['zipcode_type', 'nominal', 'TBD', 'Standard, PO BOX Only, Unique, Military(impl
,['city', 'nominal', 'TBD', 'USPS offical city name(s)']
,['location_type', 'nominal', 'TBD', 'Primary, Acceptable, Not Acceptable']
]
new_variables = pd.DataFrame(variables_description, columns=['name', 'type', 'sca
new_variables = new_variables.set_index('name')
new_variables = new_variables.loc[new_variables.index.isin(data.columns)]
variables = variables.append(new_variables)
output_variables_table(new_variables)

1/12/2018 lab1
g p
According to a random forest model with seed 0, region id zip, bedroom count, census tract and
block, and region id neighborhood explain the most variance for log error. Even though the
importance of the other variables are relatively lower, they could provide greater importance if we
add interaction terms or use a different nonlinear model.
In [95]:
Continuous Feature Importance
from sklearn import ensemble
categorical_vars = variables[variables['type'].isin(['ordinal', 'nominal'])].index
categorical_vars = categorical_vars[categorical_vars.isin(data.columns)]
categorical_vars = categorical_vars[~categorical_vars.isin(['parcelid', 'logerror
X = train_data[categorical_vars]
# remove string types
categorical_vars = categorical_vars[X.dtypes != object]
X = X[categorical_vars]
y = train_data['logerror']
model = ensemble.ExtraTreesRegressor(random_state=0)
model.fit(X.fillna(0), y)
index = pd.Index(categorical_vars, name='Variable Name')
importance = pd.Series(model.feature_importances_, index=index)
importance.sort()
importance.plot(kind='barh', color="#34495e")
plt.title('Categorical Feature Importance')
plt.xlabel('Importance', fontsize=15);

1/12/2018 lab1
According to the linear regression model, the feature tax delinquency year explains the most
variance of log error. Even though the importance of the other variables are relatively lower, they
could provide greater importance if we add interaction terms or a higher order polynomials.
In [96]:
Exporting the cleaned datasets
from sklearn.linear_model import LinearRegression
continuous_vars = continuous_vars[~continuous_vars.isin(['parcelid', 'logerror',
X = train_data[continuous_vars]
y = train_data['logerror']
model = LinearRegression()
model.fit(X.fillna(0), y)
index = pd.Index(continuous_vars, name='Variable Name')
importance = pd.Series(np.abs(model.coef_), index=index)
importance.sort()
importance.plot(kind='barh', color="#34495e")
plt.title('Continuous Feature Importance')
plt.xlabel('Importance', fontsize=15);

1/12/2018 lab1
In [97]:
References
Kernels from Kaggle competition: https://www.kaggle.com/c/zillow-prize-1/kernels
(https://www.kaggle.com/c/zillow-prize-1/kernels)
Pandas cookbook: https://pandas.pydata.org/pandas-docs/stable/cookbook.html
(https://pandas.pydata.org/pandas-docs/stable/cookbook.html)
Stackoverflow pandas questions: https://stackoverflow.com/questions/tagged/pandas
(https://stackoverflow.com/questions/tagged/pandas)
test_mask = data['logerror'].isnull()
train_data = data[~test_mask]
test_data = data[test_mask]
train_data.to_csv('../datasets/train.csv', index=False)
test_data.to_csv('../datasets/test.csv', index=False)
variables.index.name = 'name'
variables.to_csv('../datasets/variables.csv', index=True)

Lab 1: Data cleaning, exploration, removal of outliers, Correlation of Continuous Variables and Log Error (Target Variable), scatterplot analysis, adding new data features, Categorical and Continuous Feature Importance

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Similar to Lab 1: Data cleaning, exploration, removal of outliers, Correlation of Continuous Variables and Log Error (Target Variable), scatterplot analysis, adding new data features, Categorical and Continuous Feature Importance

Similar to Lab 1: Data cleaning, exploration, removal of outliers, Correlation of Continuous Variables and Log Error (Target Variable), scatterplot analysis, adding new data features, Categorical and Continuous Feature Importance (20)

More from Yao Yao

More from Yao Yao (18)

Recently uploaded

Recently uploaded (20)

Lab 1: Data cleaning, exploration, removal of outliers, Correlation of Continuous Variables and Log Error (Target Variable), scatterplot analysis, adding new data features, Categorical and Continuous Feature Importance