SlideShare a Scribd company logo
1/12/2018 lab1
http://localhost:8888/notebooks/Documents/GitHub/data_mining_group_project-master/notebooks/Lab%201/lab1.ipynb 1/63
Zillow Dataset Analysis and Visualization
MSDS 7331 Data Mining - Section 403 - Lab 1
Team: Ivelin Angelov, Yao Yao, Kaitlin Kirasich, Albert Asuncion
Business Understanding
10 points
Description:
Describe the purpose of the data set you selected (i.e., why was this data collected in the first
place?). Describe how you would define and measure the outcomes from the dataset. That is, why
is this data important and how do you know if you have mined useful knowledge from the dataset?
How would you measure the effectiveness of a good prediction algorithm? Be specific.
Answer:
Origin and purpose of dataset
This is a dataset from a Kaggle competition: "Zillow Prize: Zillow’s Home Value Prediction
(Zestimate)". To download all accompanied dataset refer to this link:
https://www.kaggle.com/c/zillow-prize-1/data (https://www.kaggle.com/c/zillow-prize-1/data)
Note: The dataset has 2985217 rows and 58 columns and it requires at least 2GB of free RAM to
load.
Zillow, a leading real estate and rental marketplace platform, developed a model to estimate the
property price based on property features, which they call the "Zestimate". As with every real world
model, the Zestimate has some error associated with it. Zestimates are estimated home values
based on 7.5 million statistical and machine learning models that analyze hundreds of data points
on each property.
The purpose of this dataset and Kaggle competition is to minimize the error between the Zestimate
(what we will predict) and the actual sale price, given certain features of a home.
Description of dataset
We are provided with a full dataset of real estate properties in three counties in California: Los
Angeles, Orange, and Ventura in 2016. The dataset contains:
ID for the listing
57 variables describing the property features such as the number of bedrooms and various
measurements in square feet
Two resulting variables: logerror and transactiondate
The dataset has two parts:
1/12/2018 lab1
http://localhost:8888/notebooks/Documents/GitHub/data_mining_group_project-master/notebooks/Lab%201/lab1.ipynb 2/63
Training data (90275 rows), which contains logerror and transactiondate and has all the
transactions before October 15, 2016, plus some of the transactions after October 15,
2016.
Testing data (2895067 rows), which contains the rest of the transactions between October
15 and December 31, 2016.
A successful measure of how well we predict log error will be how well we can clean and train our
data measured by our placement in the Kaggle competition. Kaggle measures the effectiveness of a
good prediction algorithm by taking the log error of the Zestimate and the actual sales price. The log
error is defined as:
where logerror < 0 represent Zestimates lower than the actual sell price and logerror > 0 represent
Zestimates higher than the actual sell price.
Our notebook
This notebook is an exploratory analysis for the dataset described above. Our study is organized as
follows:
Data Meaning
Data Quality (EDA)
Review of variables
Identification of missing values and outliers
Data cleansing
Visualizations
Simple Statistics
Visualize Attributes
Explore Joint Attributes
Explore Attributes and Classes
New Features
Exceptional Work
References/Citations
Conclusion
From the correlation table, random forest, and linear regression feature importance, we found out
that regionidzip, calculatedfinishedsquarefeet, bedroomcount, censustractandblock,
regionidneighborhood, and taxdelinquencyyear are the most important variables towards building
our prediction model.
Future work
In the future lab notebooks, we will predict the logerror from a regression model. To measure the
effectiveness of a good prediction algorithm, we will first apply cross-validation by splitting the
training dataset to training, validation, and testing to model our prediction error. A final prediction
error will be given by Kaggle when we submit our predictions to the competition.
logerror =log(Zestimate)−log(SalePrice)
1/12/2018 lab1
http://localhost:8888/notebooks/Documents/GitHub/data_mining_group_project-master/notebooks/Lab%201/lab1.ipynb 3/63
In [1]:
Data Meaning
10 points
Description:
Describe the meaning and type of data (scale, values, etc.) for each attribute in the data file.
Below is a table of all of the variables in the dataset. We list the variable name, type of data, scale,
and a description.
/usr/local/lib/python2.7/site-packages/matplotlib/__init__.py:878: UserWarning:
axes.color_cycle is deprecated and replaced with axes.prop_cycle; please use th
e latter.
warnings.warn(self.msg_depr % (key, alt_key))
Out[1]: 'The dataset has 2985342 rows and 60 columns'
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
# load datasets here:
train_data = pd.read_csv('../input/train_2016_v2.csv')
data = pd.read_csv('../input/properties_2016.csv', low_memory=False)
data = pd.merge(data, train_data, how='left', on='parcelid')
'The dataset has %d rows and %d columns' % data.shape
1/12/2018 lab1
http://localhost:8888/notebooks/Documents/GitHub/data_mining_group_project-master/notebooks/Lab%201/lab1.ipynb 4/63
In [2]: from IPython.display import display, HTML
variables_description = [
['airconditioningtypeid', 'nominal', 'TBD', 'Type of cooling system present in the
,['architecturalstyletypeid', 'nominal', 'TBD', 'Architectural style of the home
,['assessmentyear', 'interval', 'TBD', 'The year of the property tax assessment']
,['basementsqft', 'ratio', 'TBD', 'Finished living area below or partially below g
,['bathroomcnt', 'ordinal', 'TBD', 'Number of bathrooms in home including fractio
,['bedroomcnt', 'ordinal', 'TBD', 'Number of bedrooms in home']
,['buildingclasstypeid', 'nominal', 'TBD', 'The building framing type (steel frame
,['buildingqualitytypeid', 'ordinal', 'TBD', 'Overall assessment of condition of t
,['calculatedbathnbr', 'ordinal', 'TBD', 'Number of bathrooms in home including f
,['calculatedfinishedsquarefeet', 'ratio', 'TBD', 'Calculated total finished livi
,['censustractandblock', 'nominal', 'TBD', 'Census tract and block ID combined -
,['decktypeid', 'nominal', 'TBD', 'Type of deck (if any) present on parcel']
,['finishedfloor1squarefeet', 'ratio', 'TBD', 'Size of the finished living area o
,['finishedsquarefeet12', 'ratio', 'TBD', 'Finished living area']
,['finishedsquarefeet13', 'ratio', 'TBD', 'Perimeter living area']
,['finishedsquarefeet15', 'ratio', 'TBD', 'Total area']
,['finishedsquarefeet50', 'ratio', 'TBD', 'Size of the finished living area on the
,['finishedsquarefeet6', 'ratio', 'TBD', 'Base unfinished and finished area']
,['fips', 'nominal', 'TBD', 'Federal Information Processing Standard code - see ht
,['fireplacecnt', 'ordinal', 'TBD', 'Number of fireplaces in a home (if any)']
,['fireplaceflag', 'ordinal', 'TBD', 'Is a fireplace present in this home']
,['fullbathcnt', 'ordinal', 'TBD', 'Number of full bathrooms (sink, shower + batht
,['garagecarcnt', 'ordinal', 'TBD', 'Total number of garages on the lot including
,['garagetotalsqft', 'ratio', 'TBD', 'Total number of square feet of all garages o
,['hashottuborspa', 'ordinal', 'TBD', 'Does the home have a hot tub or spa']
,['heatingorsystemtypeid', 'nominal', 'TBD', 'Type of home heating system']
,['landtaxvaluedollarcnt', 'ratio', 'TBD', 'The assessed value of the land area of
,['latitude', 'interval', 'TBD', 'Latitude of the middle of the parcel multiplied
,['logerror', 'interval', 'TBD', 'Error or the Zillow model response variable']
,['longitude', 'interval', 'TBD', 'Longitude of the middle of the parcel multiplie
,['lotsizesquarefeet', 'ratio', 'TBD', 'Area of the lot in square feet']
,['numberofstories', 'ordinal', 'TBD', 'Number of stories or levels the home has'
,['parcelid', 'nominal', 'TBD', 'Unique identifier for parcels (lots)']
,['poolcnt', 'ordinal', 'TBD', 'Number of pools on the lot (if any)']
,['poolsizesum', 'ratio', 'TBD', 'Total square footage of all pools on property']
,['pooltypeid10', 'nominal', 'TBD', 'Spa or Hot Tub']
,['pooltypeid2', 'nominal', 'TBD', 'Pool with Spa/Hot Tub']
,['pooltypeid7', 'nominal', 'TBD', 'Pool without hot tub']
,['propertycountylandusecode', 'nominal', 'TBD', 'County land use code i.e. it's
,['propertylandusetypeid', 'nominal', 'TBD', 'Type of land use the property is zo
,['propertyzoningdesc', 'nominal', 'TBD', 'Description of the allowed land uses (
,['rawcensustractandblock', 'nominal', 'TBD', 'Census tract and block ID combined
,['regionidcity', 'nominal', 'TBD', 'City in which the property is located (if any
,['regionidcounty', 'nominal', 'TBD', 'County in which the property is located']
,['regionidneighborhood', 'nominal', 'TBD', 'Neighborhood in which the property i
,['regionidzip', 'nominal', 'TBD', 'Zip code in which the property is located']
,['roomcnt', 'ordinal', 'TBD', 'Total number of rooms in the principal residence'
,['storytypeid', 'nominal', 'TBD', 'Type of floors in a multi-story house (i.e. b
,['structuretaxvaluedollarcnt', 'ratio', 'TBD', 'The assessed value of the built
,['taxamount', 'ratio', 'TBD', 'The total property tax assessed for that assessme
,['taxdelinquencyflag', 'nominal', 'TBD', 'Property taxes for this parcel are past
,['taxdelinquencyyear', 'interval', 'TBD', 'Year']
,['taxvaluedollarcnt', 'ratio', 'TBD', 'The total tax assessed value of the parce
1/12/2018 lab1
http://localhost:8888/notebooks/Documents/GitHub/data_mining_group_project-master/notebooks/Lab%201/lab1.ipynb 5/63
Out[2]: Variable Type Scale Description
airconditioningtypeid nominal
[nan, 1.0, 13.0, 5.0,
11.0, 9.0, 12.0, 3.0]
Type of cooling system present in the h
any)
architecturalstyletypeid nominal
[nan, 7.0, 21.0, 8.0,
2.0, 3.0, 5.0, 10.0,
27.0]
Architectural style of the home (i.e. ran
colonial, split-level, etc…)
assessmentyear interval (2000, 2016) The year of the property tax assessme
basementsqft ratio (20, 8516)
Finished living area below or partially b
ground level
bathroomcnt ordinal
[0.0, 2.0, 4.0, 3.0,
1.0, ... (38 More)]
Number of bathrooms in home includin
fractional bathrooms
bedroomcnt ordinal
[0.0, 4.0, 5.0, 2.0,
3.0, ... (22 More)]
Number of bedrooms in home
buildingclasstypeid nominal
[nan, 3.0, 4.0, 5.0,
2.0, 1.0]
The building framing type (steel frame,
frame, concrete/brick)
,['threequarterbathnbr', 'ordinal', 'TBD', 'Number of 3/4 bathrooms in house (show
,['transactiondate', 'nominal', 'TBD', 'Date of the transaction response variable
,['typeconstructiontypeid', 'nominal', 'TBD', 'What type of construction material
,['unitcnt', 'ordinal', 'TBD', 'Number of units the structure is built into (i.e.
,['yardbuildingsqft17', 'interval', 'TBD', 'Patio in yard']
,['yardbuildingsqft26', 'interval', 'TBD', 'Storage shed/building in yard']
,['yearbuilt', 'interval', 'TBD', 'The Year the principal residence was built']
]
variables = pd.DataFrame(variables_description, columns=['name', 'type', 'scale',
variables = variables.set_index('name')
variables = variables.loc[data.columns]
def output_variables_table(variables):
variables = variables.sort_index()
rows = ['<tr><th>Variable</th><th>Type</th><th>Scale</th><th>Description</th>
for vname, atts in variables.iterrows():
atts = atts.to_dict()
# add scale if TBD
if atts['scale'] == 'TBD':
if atts['type'] in ['nominal', 'ordinal']:
uniques = data[vname].unique()
uniques = list(uniques.astype(str))
if len(uniques) < 10:
atts['scale'] = '[%s]' % ', '.join(uniques)
else:
atts['scale'] = '[%s]' % (', '.join(uniques[:5]) + ', ... (%d
if atts['type'] in ['ratio', 'interval']:
atts['scale'] = '(%d, %d)' % (data[vname].min(), data[vname].max(
row = (vname, atts['type'], atts['scale'], atts['description'])
rows.append('<tr><td>%s</td><td>%s</td><td>%s</td><td>%s</td></tr>' % row
return HTML('<table>%s</table>' % ''.join(rows))
output_variables_table(variables)
1/12/2018 lab1
http://localhost:8888/notebooks/Documents/GitHub/data_mining_group_project-master/notebooks/Lab%201/lab1.ipynb 6/63
buildingqualitytypeid ordinal
[nan, 7.0, 4.0, 10.0,
1.0, ... (13 More)]
Overall assessment of condition of the
from best (lowest) to worst (highest)
calculatedbathnbr ordinal
[nan, 2.0, 4.0, 3.0,
1.0, ... (35 More)]
Number of bathrooms in home includin
fractional bathroom
calculatedfinishedsquarefeet ratio (1, 952576)
Calculated total finished living area of t
home
censustractandblock nominal
[nan,
6.1110010011e+13,
6.1110009032e+13,
6.1110010024e+13,
6.1110010023e+13,
... (96772 More)]
Census tract and block ID combined -
contains blockgroup assignment by ex
decktypeid nominal [nan, 66.0] Type of deck (if any) present on parcel
finishedfloor1squarefeet ratio (3, 31303)
Size of the finished living area on the f
(entry) floor of the home
finishedsquarefeet12 ratio (1, 290345) Finished living area
finishedsquarefeet13 ratio (120, 2688) Perimeter living area
finishedsquarefeet15 ratio (112, 820242) Total area
finishedsquarefeet50 ratio (3, 31303)
Size of the finished living area on the f
(entry) floor of the home
finishedsquarefeet6 ratio (117, 952576) Base unfinished and finished area
fips nominal
[6037.0, 6059.0,
6111.0, nan]
Federal Information Processing Standa
- see
https://en.wikipedia.org/wiki/FIPS_coun
for more details
fireplacecnt ordinal
[nan, 3.0, 1.0, 2.0,
4.0, ... (10 More)]
Number of fireplaces in a home (if any
fireplaceflag ordinal [nan, True] Is a fireplace present in this home
fullbathcnt ordinal
[nan, 2.0, 4.0, 3.0,
1.0, ... (21 More)]
Number of full bathrooms (sink, showe
bathtub, and toilet) present in home
garagecarcnt ordinal
[nan, 2.0, 4.0, 1.0,
3.0, ... (25 More)]
Total number of garages on the lot incl
attached garage
garagetotalsqft ratio (0, 7749)
Total number of square feet of all garag
lot including an attached garage
hashottuborspa ordinal [nan, True] Does the home have a hot tub or spa
heatingorsystemtypeid nominal
[nan, 2.0, 7.0, 20.0,
6.0, ... (15 More)]
Type of home heating system
landtaxvaluedollarcnt ratio (1, 90246219)
The assessed value of the land area o
parcel
1/12/2018 lab1
http://localhost:8888/notebooks/Documents/GitHub/data_mining_group_project-master/notebooks/Lab%201/lab1.ipynb 7/63
latitude interval
(33324388,
34819650)
Latitude of the middle of the parcel mu
by 10e6
logerror interval (-4, 4) Error or the Zillow model response var
longitude interval
(-119475780,
-117554316)
Longitude of the middle of the parcel m
by 10e6
lotsizesquarefeet ratio (100, 328263808) Area of the lot in square feet
numberofstories ordinal
[nan, 1.0, 4.0, 2.0,
3.0, ... (13 More)]
Number of stories or levels the home h
parcelid nominal
[10754147,
10759547,
10843547,
10859147,
10879947, ...
(2985217 More)]
Unique identifier for parcels (lots)
poolcnt ordinal [nan, 1.0] Number of pools on the lot (if any)
poolsizesum ratio (19, 17410) Total square footage of all pools on pro
pooltypeid10 nominal [nan, 1.0] Spa or Hot Tub
pooltypeid2 nominal [nan, 1.0] Pool with Spa/Hot Tub
pooltypeid7 nominal [nan, 1.0] Pool without hot tub
propertycountylandusecode nominal
[010D, 0109, 1200,
1210, 010V, ... (241
More)]
County land use code i.e. it's zoning at
county level
propertylandusetypeid nominal
[269.0, 261.0, 47.0,
31.0, 260.0, ... (16
More)]
Type of land use the property is zoned
propertyzoningdesc nominal
[nan, LCA11*,
LAC2, LAM1,
LAC4, ... (5639
More)]
Description of the allowed land uses (z
for that property
rawcensustractandblock nominal
[60378002.041,
60378001.011,
60377030.012,
60371412.023,
60371232.052, ...
(99394 More)]
Census tract and block ID combined -
contains blockgroup assignment by ex
regionidcity nominal
[37688.0, 51617.0,
12447.0, 396054.0,
47547.0, ... (187
More)]
City in which the property is located (if
regionidcounty nominal
[3101.0, 1286.0,
2061.0, nan]
County in which the property is located
1/12/2018 lab1
http://localhost:8888/notebooks/Documents/GitHub/data_mining_group_project-master/notebooks/Lab%201/lab1.ipynb 8/63
Data Quality
15 points
regionidneighborhood nominal
[nan, 27080.0,
46795.0, 274049.0,
31817.0, ... (529
More)]
Neighborhood in which the property is
regionidzip nominal
[96337.0, 96095.0,
96424.0, 96450.0,
96446.0, ... (406
More)]
Zip code in which the property is locate
roomcnt ordinal
[0.0, 8.0, 4.0, 5.0,
7.0, ... (37 More)]
Total number of rooms in the principal
residence
storytypeid nominal [nan, 7.0]
Type of floors in a multi-story house (i.e
basement and main level, split-level, a
etc.). See tab for details.
structuretaxvaluedollarcnt ratio (1, 251486000)
The assessed value of the built structu
the parcel
taxamount ratio (1, 3458861)
The total property tax assessed for tha
assessment year
taxdelinquencyflag nominal [nan, Y]
Property taxes for this parcel are past d
of 2015
taxdelinquencyyear interval (0, 99) Year
taxvaluedollarcnt ratio (1, 282786000) The total tax assessed value of the par
threequarterbathnbr ordinal
[nan, 1.0, 2.0, 4.0,
3.0, 6.0, 5.0, 7.0]
Number of 3/4 bathrooms in house (sh
sink + toilet)
transactiondate nominal
[nan, 2016-01-27,
2016-03-30, 2016-
05-27, 2016-06-07,
... (353 More)]
Date of the transaction response varia
typeconstructiontypeid nominal
[nan, 6.0, 4.0, 10.0,
13.0, 11.0]
What type of construction material was
construct the home
unitcnt ordinal
[nan, 2.0, 1.0, 3.0,
5.0, ... (147 More)]
Number of units the structure is built in
= duplex, 3 = triplex, etc...)
yardbuildingsqft17 interval (10, 7983) Patio in yard
yardbuildingsqft26 interval (10, 6141) Storage shed/building in yard
yearbuilt interval (1801, 2015) The Year the principal residence was b
1/12/2018 lab1
http://localhost:8888/notebooks/Documents/GitHub/data_mining_group_project-master/notebooks/Lab%201/lab1.ipynb 9/63
Description:
Verify data quality: Explain any missing values, duplicate data, and outliers. Are those mistakes?
How do you deal with these problems? Give justifications for your methods.
Examining Distribution of Missing Values
From the observations, most of the rows have about 30 missing values. For the observations that
have 57 missing values, it means that most of the features are missing and we choose to remove
those. We will add in values to those missing where appropriate, below.
In [3]:
All observations have a value for parcelid
In [4]:
0.38 percent of the data has only parcelid present and all other variables
Out[4]: 0
plt.rcParams['figure.figsize'] = [10, 7]
number_missing_per_row = data.isnull().sum(axis=1)
sns.distplot(number_missing_per_row, color="#34495e", kde=False);
plt.title('Distribution of Missing Values', fontsize=15)
plt.xlabel('Number of Missing Values', fontsize=15)
plt.ylabel('Number of Rows', fontsize=15);
data['parcelid'].isnull().sum()
1/12/2018 lab1
http://localhost:8888/notebooks/Documents/GitHub/data_mining_group_project-master/notebooks/Lab%201/lab1.ipynb 10/63
missing
We choose to remove those observations because they don't present any value
In [5]:
Table of Missing Values
Of the available variables, here is a table that describes the number of missing values as well as the
percent missing.
(0.0, 'percent of the data has no data features outside of parcelid')
print(round(len(number_missing_per_row[number_missing_per_row >= 57]) / len(data)
data = data[number_missing_per_row < 57]
1/12/2018 lab1
http://localhost:8888/notebooks/Documents/GitHub/data_mining_group_project-master/notebooks/Lab%201/lab1.ipynb 11/63
In [9]:
Out[9]: Variable Name Number Missing Values Percent Missing
0 parcelid 0 0.000000
1 airconditioningtypeid 2162353 72.710897
2 architecturalstyletypeid 2967843 99.796160
3 basementsqft 2972277 99.945257
4 bathroomcnt 25 0.000841
5 bedroomcnt 13 0.000437
6 buildingclasstypeid 2961276 99.575339
7 buildingqualitytypeid 1035337 34.814058
8 calculatedbathnbr 117481 3.950395
9 decktypeid 2956809 99.425133
10 finishedfloor1squarefeet 2771182 93.183272
11 calculatedfinishedsquarefeet 44131 1.483941
12 finishedsquarefeet12 264610 8.897729
13 finishedsquarefeet13 2966233 99.742023
14 finishedsquarefeet15 2783098 93.583958
15 finishedsquarefeet50 2771182 93.183272
16 finishedsquarefeet6 2951902 99.260131
17 fips 0 0.000000
18 fireplacecnt 2661258 89.486988
19 fullbathcnt 117481 3.950395
20 garagecarcnt 2090598 70.298076
21 garagetotalsqft 2090598 70.298076
22 hashottuborspa 2904889 97.679280
23 heatingorsystemtypeid 1167429 39.255760
24 latitude 0 0.000000
25 longitude 0 0.000000
26 lotsizesquarefeet 264676 8.899948
27 poolcnt 2456346 82.596653
28 poolsizesum 2945942 99.059721
missing_values = data.isnull().sum().reset_index()
missing_values.columns = ['Variable Name', 'Number Missing Values']
missing_values['Percent Missing'] = missing_values['Number Missing Values'] / len
missing_values['Percent Missing'] = missing_values['Percent Missing'].replace(np.
missing_values
1/12/2018 lab1
http://localhost:8888/notebooks/Documents/GitHub/data_mining_group_project-master/notebooks/Lab%201/lab1.ipynb 12/63
Variable Name Number Missing Values Percent Missing
29 pooltypeid10 2936964 98.757829
30 pooltypeid2 2941830 98.921452
31 pooltypeid7 2488421 83.675201
32 propertycountylandusecode 840 0.028246
33 propertylandusetypeid 0 0.000000
34 propertyzoningdesc 995195 33.464250
35 rawcensustractandblock 0 0.000000
36 regionidcity 51410 1.728704
37 regionidcounty 0 0.000000
38 regionidneighborhood 1817447 61.113149
39 regionidzip 2543 0.085510
40 roomcnt 38 0.001278
41 storytypeid 2972281 99.945392
42 threequarterbathnbr 2662261 89.520714
43 typeconstructiontypeid 2967157 99.773093
44 unitcnt 996333 33.502516
45 yardbuildingsqft17 2893549 97.297963
46 yardbuildingsqft26 2971258 99.910992
47 yearbuilt 48494 1.630651
48 numberofstories 2291806 77.063860
49 fireplaceflag 2968740 99.826323
50 structuretaxvaluedollarcnt 43547 1.464304
51 taxvaluedollarcnt 31113 1.046200
52 assessmentyear 2 0.000067
53 landtaxvaluedollarcnt 56296 1.892999
54 taxamount 19813 0.666228
55 taxdelinquencyflag 2917435 98.101150
56 taxdelinquencyyear 2917433 98.101083
57 censustractandblock 63691 2.141662
58 logerror 2883630 96.964429
59 transactiondate 2883630 96.964429
1/12/2018 lab1
http://localhost:8888/notebooks/Documents/GitHub/data_mining_group_project-master/notebooks/Lab%201/lab1.ipynb 13/63
Examining Variables for Missing Values and Outliers
For variables that are nominal, ratio, and interval where appropriate, we wrote a function that
replaces outliers 5 standard deviations from the mean and assigning them as 5 standard deviations
above or below the mean, respectively.
In [10]:
Variable: airconditioningtypeid - Type of cooling system present in the
home (if any)
Has datatype: nominal and 72.710860 percent of values missing
For this variable, missing values indicate the absence of a cooling system. We replace all missing
values with 0 to represent no cooling system. We changed the column datatype to integer.
In [11]:
Variable: architecturalstyletypeid - Architectural style of the home (i.e.
ranch, colonial, split-level, etc…)
Has datatype: nominal and 99.796185 percent of values missing
Architectural style describes the home design. As such, it is not something we can extrapolate a
value for. With over 99% of values missing, we decided to eliminate this variable.
('Before', array([ nan, 1., 13., 5., 11., 9., 12., 3.]))
('After', array([ 0, 1, 13, 5, 11, 9, 12, 3]))
def fix_outliers(data, column):
mean = data[column].mean()
std = data[column].std()
max_value = mean + std * 5
min_value = mean - std * 5
if data[column].max() < max_value and data[column].min() > min_value:
print('No outliers found')
return
print('Outliers found!')
f, ((ax0, ax1), (ax2, ax3)) = plt.subplots(nrows=2, ncols=2, figsize=[15, 7])
f.subplots_adjust(hspace=.4)
sns.boxplot(data[column].dropna(), ax=ax0, color="#34495e").set_title('Before
sns.distplot(data[column].dropna(), ax=ax2, color="#34495e").set_title('Before
data.loc[data[column] > max_value, column] = max_value
data.loc[data[column] < min_value, column] = min_value
sns.boxplot(data[column].dropna(), ax=ax1, color="#34495e").set_title('After'
sns.distplot(data[column].dropna(), ax=ax3, color="#34495e").set_title('After
print('Before', data['airconditioningtypeid'].unique())
data['airconditioningtypeid'] = data['airconditioningtypeid'].fillna(0).astype(np
print('After', data['airconditioningtypeid'].unique())
1/12/2018 lab1
http://localhost:8888/notebooks/Documents/GitHub/data_mining_group_project-master/notebooks/Lab%201/lab1.ipynb 14/63
In [12]:
Variable: assessmentyear - year of the property tax assessment
Has datatype: interval and has 2 values missing
We replaced the missing values with the latest tax year which also happens to be the median tax
year. We changed the column datatype to integer.
In [13]:
Variable: basementsqft - Finished living area below or partially below
ground level
Has datatype: ratio and 99.945255 percent of values missing
Basements are not standard home features. Whenever a basement is not a feature of the home,
the value for area was entered as a missing value. With over 99% of values missing, we decided to
eliminate this variable.
In [14]:
Variable: bathroomcnt - Number of bathrooms in home including
fractional bathrooms
Has datatype: ordinal and 0.000841 percent of values missing
We decided it is potentially possible for the property to not have a bathroom so we decided to
replace missing values with zeros since there are only very few. We changed the column datatype
to a float.
('Before', array([ 2015., 2014., 2003., 2012., 2001., 2011., 2013., 201
6.,
2010., nan, 2004., 2005., 2002., 2000., 2009.]))
('After', array([2015, 2014, 2003, 2012, 2001, 2011, 2013, 2016, 2010, 2004, 20
05,
2002, 2000, 2009]))
del data['architecturalstyletypeid']
print('Before', data['assessmentyear'].unique())
median_value = data['assessmentyear'].median()
data['assessmentyear'] = data['assessmentyear'].fillna(median_value).astype(np.int
print('After', data['assessmentyear'].unique())
del data['basementsqft']
1/12/2018 lab1
http://localhost:8888/notebooks/Documents/GitHub/data_mining_group_project-master/notebooks/Lab%201/lab1.ipynb 15/63
In [15]:
Variable: bedroomcnt - Number of bedrooms in home
Has datatype: ordinal and 0.000437 percent of values missing
We decided to replace missing values with zeros since there are only very few to represent a studio
apartment. We changed the column datatype to integer.
In [16]:
Variable: buildingclasstypeid - The building framing type (steel frame,
wood frame, concrete/brick)
Has datatype: nominal and 99.576949 percent of values missing
With this much missing values and the difficulty of assigning a building framing type, we decided to
remove this variable.
In [17]:
Variable: buildingqualitytypeid - Overall assessment of condition of the
building from best (lowest) to worst (highest)
Has datatype: ordinal and 34.81 percent of values missing
We chose to replace the missing values with the median of the condition assessment instead of
giving the missing values the best or worst value. We changed the column datatype to integer.
('Before', array([ 0. , 2. , 4. , 3. , 1. , 2.5 , 3.5 , 5.
,
1.5 , 4.5 , 7.5 , 5.5 , 6. , 7. , 10. , 8. ,
9. , 12. , 11. , 8.5 , 6.5 , 13. , 9.5 , 14. ,
20. , 19.5 , 15. , 10.5 , nan, 18. , 16. , 1.75,
17. , 19. , 0.5 , 12.5 , 11.5 , 14.5 ]))
('After', array([ 0. , 2. , 4. , 3. , 1. , 2.5 , 3.5 , 5.
,
1.5 , 4.5 , 7.5 , 5.5 , 6. , 7. , 10. , 8. ,
9. , 12. , 11. , 8.5 , 6.5 , 13. , 9.5 , 14. ,
20. , 19.5 , 15. , 10.5 , 18. , 16. , 1.75, 17. ,
19. , 0.5 , 12.5 , 11.5 , 14.5 ]))
('Before', array([ 0., 4., 5., 2., 3., 1., 6., 7., 8., 12.,
11.,
9., 10., 14., 16., 13., nan, 15., 17., 18., 20., 19.]))
('After', array([ 0, 4, 5, 2, 3, 1, 6, 7, 8, 12, 11, 9, 10, 14, 16, 1
3, 15,
17, 18, 20, 19]))
print('Before', data['bathroomcnt'].unique())
data['bathroomcnt'] = data['bathroomcnt'].fillna(0).astype(np.float32)
print('After', data['bathroomcnt'].unique())
print('Before', data['bedroomcnt'].unique())
data['bedroomcnt'] = data['bedroomcnt'].fillna(0).astype(np.int32)
print('After', data['bedroomcnt'].unique())
del data['buildingclasstypeid']
1/12/2018 lab1
http://localhost:8888/notebooks/Documents/GitHub/data_mining_group_project-master/notebooks/Lab%201/lab1.ipynb 16/63
In [18]:
Variable: calculatedbathnbr - Number of bathrooms in home including
fractional bathroom
Has datatype: ordinal and 3.95 percent of values missing
With a low number of missing values, we decided to assign 0 to all missing values since we decided
above it is possible that a property could have 0 bathrooms. We changed the column datatype to a
float.
In [19]:
Variable: calculatedfinishedsquarefeet - Calculated total finished living
area of the home
Has datatype: ratio and 1.48 percent of values missing
These missing values appear to be consistent with 0 or missing values for variables associated with
a building or structure on the property such as bathroomcnt, bedroomcnt, or architecturalstyletypeid.
We can assume that no structures exist on these properties and we decided to impute zeros to
these. We changed the column datatype to integer. We then replaced all outliers with a maximum
and minimum value of (mean ± 5 * std), respectively.
('Before', array([ nan, 7., 4., 10., 1., 12., 8., 3., 6., 9.,
5.,
11., 2.]))
('After', array([ 7, 4, 10, 1, 12, 8, 3, 6, 9, 5, 11, 2]))
('Before', array([ nan, 2. , 4. , 3. , 1. , 2.5, 3.5, 5. , 1.
5,
4.5, 7.5, 5.5, 6. , 7. , 10. , 8. , 9. , 12. ,
11. , 8.5, 6.5, 13. , 9.5, 14. , 20. , 19.5, 15. ,
10.5, 18. , 16. , 17. , 19. , 12.5, 11.5, 14.5]))
('After', array([ 0. , 2. , 4. , 3. , 1. , 2.5, 3.5, 5. , 1.5,
4.5, 7.5, 5.5, 6. , 7. , 10. , 8. , 9. , 12. ,
11. , 8.5, 6.5, 13. , 9.5, 14. , 20. , 19.5, 15. ,
10.5, 18. , 16. , 17. , 19. , 12.5, 11.5, 14.5]))
print('Before', data['buildingqualitytypeid'].unique())
medianQuality = data['buildingqualitytypeid'].median()
data['buildingqualitytypeid'] = data['buildingqualitytypeid'].fillna(medianQuality
print('After', data['buildingqualitytypeid'].unique())
print('Before', data['calculatedbathnbr'].unique())
data['calculatedbathnbr'] = data['calculatedbathnbr'].fillna(0).astype(np.float32
print('After', data['calculatedbathnbr'].unique())
1/12/2018 lab1
http://localhost:8888/notebooks/Documents/GitHub/data_mining_group_project-master/notebooks/Lab%201/lab1.ipynb 17/63
In [20]:
Variable: censustractandblock - census tract and census block ID
Has datatype: nominal and 2.14 percent of values missing
With such a small amount of missing values, we decided to replace them with the median. A better
approach in the future could be taking into account zip code and then median for the missing
values. We changed the column datatype to a float.
In [21]:
Variable: decktypeid - Type of deck (if any) present on parcel
Has datatype: nominal and 99.427311 percent of values missing
Outliers found!
('Before', [nan, 10925.92657277406, 5068.0, 1776.0, 2400.0, 3611.0, 3754.0, 247
0.0, '...'])
('After', [0, 10925, 5068, 1776, 2400, 3611, 3754, 2470, '...'])
('Before', [nan, 61110010011023.0, 61110009032019.0, 61110010024015.0, 61110010
023002.0, 61110010024021.0, 61110010021029.0, 61110010022038.0, '...'])
('After', [60375714234368.0, 61110011035648.0, 61110006841344.0, 6111000264704
0.0, 61110015229952.0, 61110019424256.0, 61110023618560.0, 61110027812864.0,
'...'])
fix_outliers(data, 'calculatedfinishedsquarefeet')
print('Before', data['calculatedfinishedsquarefeet'].unique()[:8].tolist() + ['..
data['calculatedfinishedsquarefeet'] = data['calculatedfinishedsquarefeet'].fillna
print('After', data['calculatedfinishedsquarefeet'].unique()[:8].tolist() + ['...
print('Before', data['censustractandblock'].unique()[:8].tolist() + ['...'])
median_value = data['censustractandblock'].median()
data['censustractandblock'] = data['censustractandblock'].fillna(median_value)
data['censustractandblock'] = data['censustractandblock'].astype(np.float32)
print('After', data['censustractandblock'].unique()[:8].tolist() + ['...'])
1/12/2018 lab1
http://localhost:8888/notebooks/Documents/GitHub/data_mining_group_project-master/notebooks/Lab%201/lab1.ipynb 18/63
Missing values is most likely an indication of an absence of this feature in the property. With 99%
missing values, we will remove this column.
In [22]:
Variable: finishedfloor1squarefeet - Size of the finished living area on
the first (entry) floor of the home
Has datatype: ratio and 93.18 percent of values missing
Having this much missing values and the availability of an alternate variable -
calculatedfinishedsquarefeet - with very few missing values, we decided to eliminate this variable.
In [23]:
Variable: finishedsquarefeet12 - Finished living area
Has datatype: ratio and 8.89 percent of values missing
The finishedsquarefeet fields add up to the calculatedfinishedsquarefeet. Missing values are
therefore zeros. We changed the column datatype to integer. We then replaced all outliers with a
maximum and minimum value of (mean ± 5 * std), respectively.
In [24]:
Outliers found!
('Before', array([ nan, 4000., 3633., ..., 317., 268., 161.]))
('After', array([ 0, 4000, 3633, ..., 317, 268, 161]))
del data['decktypeid']
del data['finishedfloor1squarefeet']
fix_outliers(data, 'finishedsquarefeet12')
print('Before', data['finishedsquarefeet12'].unique())
data['finishedsquarefeet12'] = data['finishedsquarefeet12'].fillna(0).astype(np.i
print('After', data['finishedsquarefeet12'].unique())
1/12/2018 lab1
http://localhost:8888/notebooks/Documents/GitHub/data_mining_group_project-master/notebooks/Lab%201/lab1.ipynb 19/63
Variable: finishedsquarefeet13 - Finished living area
Has datatype: ratio and 99.743000 percent of values missing
The finishedsquarefeet fields add up to the calculatedfinishedsquarefeet. Since there are 99%
missing values we will remove this from the dataset.
In [25]:
Variable: finishedsquarefeet15 - Total area
Has datatype: ratio and 93.58 percent of values missing
The finishedsquarefeet fields add up to the calculatedfinishedsquarefeet. Since there are 93%
missing values we will remove this from the dataset.
In [26]:
Variable: finishedsquarefeet50 - Size of the finished living area on the
first (entry) floor of the home
Has datatype: ratio and 93.18 percent of values missing
The finishedsquarefeet fields add up to the calculatedfinishedsquarefeet. Since there are 93%
missing values we will replace the missing values with 0. We changed the column datatype to float.
In [27]:
Variable: finishedsquarefeet6 - Base unfinished and finished area
Has datatype: ratio and 99.26 percent of values missing
With 99% missing values, we decided to delete this variable.
In [28]:
Variable: fips - Federal Information Processing Standard code - see
https://en.wikipedia.org/wiki/FIPS_county_code
(https://en.wikipedia.org/wiki/FIPS_county_code) for more details
Has datatype: nominal with values [6037.0, 6059.0, 6111.0] and no missing values
We changed the column datatype to integer.
In [29]:
Variable: fireplacecnt - Number of fireplaces in a home (if any)
del data['finishedsquarefeet13']
del data['finishedsquarefeet15']
data['finishedsquarefeet50'] = data['finishedsquarefeet50'].fillna(0).astype(np.f
del data['finishedsquarefeet6']
data['fips'] = data['fips'].astype(np.int32)
1/12/2018 lab1
http://localhost:8888/notebooks/Documents/GitHub/data_mining_group_project-master/notebooks/Lab%201/lab1.ipynb 20/63
Has datatype: ordinal and 89.486882 percent of values missing
In this dataset, missing value represents 0 fireplaces. We replaced all missing values with zero and
change the column datatype to integer. We changed the column datatype to integer.
In [30]:
Variable: fireplaceflag - does the home have a fireplace
Has datatype: ordinal and 99.82 percent of values missing
With 99% missing values, we decided to delete the variable.
In [31]:
Variable: fullbathcnt - Number of full bathrooms (sink, shower +
bathtub, and toilet) present in home
Has datatype: ordinal and 3.95 percent of values missing
We first replaced its missing values with the values of bathroomcnt which is a similar measure. After
that, we have 25 observations missing and we replace them with 0. We changed the column
datatype to a float.
In [32]:
Variable: garagecarcnt - Total number of garages on the lot including an
attached garage
Has datatype: ordinal and 70.298173 percent of values missing
We assume that missing values will represent no garage and replace all missing values with zero.
We changed the column datatype to integer.
('Before', array([ nan, 3., 1., 2., 4., 9., 5., 7., 6., 8.]))
('After', array([0, 3, 1, 2, 4, 9, 5, 7, 6, 8]))
('Before', array([ nan, 2., 4., 3., 1., 5., 7., 6., 10., 8.,
9.,
12., 11., 13., 14., 20., 19., 15., 18., 16., 17.]))
('After', array([ 0. , 2. , 4. , 3. , 1. , 5. , 7. , 6.
,
10. , 8. , 9. , 12. , 11. , 7.5 , 2.5 , 4.5 ,
1.5 , 13. , 14. , 20. , 3.5 , 19. , 5.5 , 15. ,
18. , 16. , 1.75, 6.5 , 17. , 0.5 , 8.5 ]))
print('Before', data['fireplacecnt'].unique())
data['fireplacecnt'] = data['fireplacecnt'].fillna(0).astype(np.int32)
print('After', data['fireplacecnt'].unique())
del data['fireplaceflag']
print('Before', data['fullbathcnt'].unique())
missing_fullbathcnt = data['fullbathcnt'].isnull()
data.loc[missing_fullbathcnt, 'fullbathcnt'] = data['bathroomcnt'][missing_fullbat
data['fullbathcnt'] = data['fullbathcnt'].astype(np.float32)
print('After', data['fullbathcnt'].unique())
1/12/2018 lab1
http://localhost:8888/notebooks/Documents/GitHub/data_mining_group_project-master/notebooks/Lab%201/lab1.ipynb 21/63
In [33]:
Variable: garagetotalsqft - Total number of garages on the lot including
an attached garage
Has datatype: ratio and 70.298173 percent of values missing
We first replaced missing values where garagecarcnt is 0 with 0 garagetotalsqft. We changed the
column datatype to a float. We then replaced all outliers with a maximum and minimum value of
(mean ± 5 * std), respectively.
In [34]:
Variable: hashottuborspa - Does the home have a hot tub or spa
Has datatype: ordinal and 97.679250 percent of values missing
In this dataset missing values represent doesn't have a hot tub or spa. we replaced all missing
values with 0 and all True values with 1. We changed the column datatype to integer.
[ 0 2 4 1 3 5 7 6 8 9 12 11 10 13 14 15 25 21 18 17 24 19 16 20]
Outliers found!
data['garagecarcnt'] = data['garagecarcnt'].fillna(0).astype(np.int32)
print(data['garagecarcnt'].unique())
fix_outliers(data, 'garagetotalsqft')
data.loc[data['garagecarcnt'] == 0, 'garagetotalsqft'] = 0
data['garagecarcnt'] = data['garagecarcnt'].astype(np.float32)
assert data['garagetotalsqft'].isnull().sum() == 0
1/12/2018 lab1
http://localhost:8888/notebooks/Documents/GitHub/data_mining_group_project-master/notebooks/Lab%201/lab1.ipynb 22/63
In [35]:
Variable: heatingorsystemtypeid - Type of home heating system
Has datatype: nominal and 39.255728 percent of values missing
We replaced all missing values with 0 which will represent a missing heating system type id. We
changed the column datatype to integer.
In [36]:
Variable: landtaxvaluedollarcnt - the assessed value of the land
Has datatype: ratio and 1.89 percent of values missing
We replaced all missing values with the median assessed land values. We changed the column
datatype to integer. We then replaced all outliers with a maximum and minimum value of (mean ± 5
* std), respectively.
('Before', array([nan, True], dtype=object))
('After', array([0, 1]))
('Before', array([ nan, 2., 7., 20., 6., 13., 18., 24., 12., 10.,
1.,
14., 21., 11., 19.]))
('After', array([ 0, 2, 7, 20, 6, 13, 18, 24, 12, 10, 1, 14, 21, 11, 19]))
print('Before', data['hashottuborspa'].unique())
data['hashottuborspa'] = data['hashottuborspa'].fillna(0).replace('True', 1).asty
print('After', data['hashottuborspa'].unique())
print('Before', data['heatingorsystemtypeid'].unique())
data['heatingorsystemtypeid'] = data['heatingorsystemtypeid'].fillna(0).astype(np
print('After', data['heatingorsystemtypeid'].unique())
1/12/2018 lab1
http://localhost:8888/notebooks/Documents/GitHub/data_mining_group_project-master/notebooks/Lab%201/lab1.ipynb 23/63
In [37]:
Variables: latitude and longitude
Has datatype: interval and no missing values. We changed the column datatype to float.
In [38]:
Variable: logerror - Error or the Zillow model response variable
Has datatype: interval and 96.964429 percent of values missing
We will not fill any missing values because they represent the test part of the dataset. We changed
the column datatype to float.
In [39]:
Variable: lotsizesquarefeet - Area of the lot in square feet
Has datatype: ratio and 8.9 percent of values missing
Outliers found!
('Before', array([ 9.00000000e+00, 2.75160000e+04, 7.62631000e+05, ...,
1.28007500e+06, 3.61063000e+05, 9.54574000e+05]))
('After', array([ 9, 27516, 762631, ..., 1280075, 361063, 954574]))
fix_outliers(data, 'landtaxvaluedollarcnt')
print('Before', data['landtaxvaluedollarcnt'].unique())
median_value = data['landtaxvaluedollarcnt'].median()
data['landtaxvaluedollarcnt'] = data['landtaxvaluedollarcnt'].fillna(median_value
print('After', data['landtaxvaluedollarcnt'].unique())
data[['latitude', 'longitude']] = data[['latitude', 'longitude']].astype(np.float
data['logerror'] = data['logerror'].astype(np.float32)
1/12/2018 lab1
http://localhost:8888/notebooks/Documents/GitHub/data_mining_group_project-master/notebooks/Lab%201/lab1.ipynb 24/63
We replace all missing values with 0 which will represent no lot. We changed the column datatype
to integer. We then replaced all outliers with a maximum and minimum value of (mean ± 5 * std),
respectively.
In [40]:
Variable: numberofstories - number of stories or levels the home
has
Has datatype: ordinal and 77.06 percent of values missing
We replace all missing values with 1 after removing all outliers to represent a single story home. We
changed the column datatype to integer. We then replaced all outliers with a maximum and
minimum value of (mean ± 5 * std), respectively.
Outliers found!
fix_outliers(data, 'lotsizesquarefeet')
data['lotsizesquarefeet'] = data['lotsizesquarefeet'].fillna(0).astype(np.float32
1/12/2018 lab1
http://localhost:8888/notebooks/Documents/GitHub/data_mining_group_project-master/notebooks/Lab%201/lab1.ipynb 25/63
In [41]:
Variable: parcelid - Unique identifier for parcels (lots)
Has datatype: nominal and no values missing. We changed the column datatype to integer.
In [42]:
Variable: poolcnt - Number of pools on the lot (if any)
Has datatype: ordinal and 82.6 percent of values missing
We replaced all missing values with 0 which will represent no pools. We changed the column
datatype to integer.
In [43]:
Variable: poolsizesum - Total square footage of all pools on
Outliers found!
('Before', array([ nan, 1. , 4. , 2. , 3.
,
4.09684575]))
('After', array([1, 4, 2, 3]))
('Before', array([ nan, 1.]))
('After', array([0, 1]))
fix_outliers(data, 'numberofstories')
print('Before', data['numberofstories'].unique())
data['numberofstories'] = data['numberofstories'].fillna(1).astype(np.int32)
print('After', data['numberofstories'].unique())
data['parcelid'] = data['parcelid'].astype(np.int32)
print('Before', data['poolcnt'].unique())
data['poolcnt'] = data['poolcnt'].fillna(0).astype(np.int32)
print('After', data['poolcnt'].unique())
1/12/2018 lab1
http://localhost:8888/notebooks/Documents/GitHub/data_mining_group_project-master/notebooks/Lab%201/lab1.ipynb 26/63
property
Has datatype: ratio and 99 percent of values missing
We replaced all missing values with 0 if number of pools is 0 or with the average poolsizesum
otherwise. We changed the column datatype to a float. We then replaced all outliers with a
maximum and minimum value of (mean ± 5 * std), respectively.
In [44]:
Variable: pooltypeid10 - Spa or Hot Tub
Has datatype: nominal and 98.8 percent of values missing
We replaced all missing values with 0 which will represent no Spa or Hot Tub. We changed the
column datatype to integer.
In [45]:
Variable: pooltypeid2 - Pool with Spa/Hot Tub
Has datatype: nominal and 98.9 percent of values missing
Outliers found!
('Before', array([ nan, 1.]))
('After', array([0, 1]))
fix_outliers(data, 'poolsizesum')
data.loc[data['poolsizesum'].isnull(), 'poolsizesum'] = int(data['poolsizesum'].me
data.loc[data['poolcnt'] == 0, 'poolsizesum'] = 0
data['poolcnt'] = data['poolcnt'].astype(np.float32)
print('Before', data['pooltypeid10'].unique())
data['pooltypeid10'] = data['pooltypeid10'].fillna(0).astype(np.int32)
print('After', data['pooltypeid10'].unique())
1/12/2018 lab1
http://localhost:8888/notebooks/Documents/GitHub/data_mining_group_project-master/notebooks/Lab%201/lab1.ipynb 27/63
We replaced all missing values with 0 which will represent no Pool with Spa/Hot Tub. We changed
the column datatype to integer.
In [46]:
Variable: pooltypeid7 - Pool without hot tub
Has datatype: nominal and 83.6 percent of values missing
We replaced all missing values with 0 which will represent no pool without hot tub. We changed the
column datatype to integer.
In [47]:
Variable: propertycountylandusecode - County land use code i.e. it's
zoning at the county level
Has datatype: nominal and 0.02 percent of values missing
We replaced all missing values with 0 which will represent no county land use code. We changed
the column datatype to string.
In [48]:
Variable: propertylandusetypeid - Type of land use the property is zoned
for
Has datatype: nominal and 0 percent of values missing.
We are just changing the datatype to integer
In [49]:
('Before', array([ nan, 1.]))
('After', array([0, 1]))
('Before', array([ nan, 1.]))
('After', array([0, 1]))
('Before', ['010D', '0109', '1200', '1210', '010V', '300V', '0100', '0200',
'...'])
('After', ['010D', '0109', '1200', '1210', '010V', '300V', '0100', '0200',
'...'])
print('Before', data['pooltypeid2'].unique())
data['pooltypeid2'] = data['pooltypeid2'].fillna(0).astype(np.int32)
print('After', data['pooltypeid2'].unique())
print('Before', data['pooltypeid7'].unique())
data['pooltypeid7'] = data['pooltypeid7'].fillna(0).astype(np.int32)
print('After', data['pooltypeid7'].unique())
print('Before', data['propertycountylandusecode'].unique()[:8].tolist() + ['...']
data['propertycountylandusecode'] = data['propertycountylandusecode'].fillna(0).a
print('After', data['propertycountylandusecode'].unique()[:8].tolist() + ['...'])
data['propertylandusetypeid'] = data['propertylandusetypeid'].astype(np.int32)
1/12/2018 lab1
http://localhost:8888/notebooks/Documents/GitHub/data_mining_group_project-master/notebooks/Lab%201/lab1.ipynb 28/63
Variable: propertyzoningdesc - Description of the allowed land uses
(zoning) for that property
Has datatype: nominal and 33.4 percent of values missing
We replaced all missing values with 0 which will represent no description of the allowed land uses.
We changed the column datatype to string.
In [50]:
Variable: rawcensustractandblock - Census tract and block ID combined
- also contains blockgroup assignment by extension
Has datatype: nominal and 0 percent of values missing
We are just changing the datatype to integer
In [51]:
Variable: regionidcity - City in which the property is located (if
any)
Has datatype: nominal and 1.72 percent of values missing
we will replace any missing values with 0 to represent no city ID. We are just changing the datatype
to integer
In [52]:
Variable: regionidcounty - County in which the property is located
('Before', array([nan, 'LCA11*', 'LAC2', ..., 'WCR1400000', 'EMPYYY', 'RMM2*'],
dtype=object))
('After', array(['0', 'LCA11*', 'LAC2', ..., 'WCR1400000', 'EMPYYY', 'RMM2*'],
dtype=object))
('Before', array([ 60378002.041 , 60378001.011002, 60377030.012017, ...,
60590878.032022, 60590626.211013, 60379012.091563]))
('After', array([60378002, 60378001, 60377030, ..., 61110057, 60375324, 6037599
1]))
('Before', [37688.0, 51617.0, 12447.0, 396054.0, 47547.0, nan, 54311.0, 40227.
0, '...'])
('After', [37688, 51617, 12447, 396054, 47547, 0, 54311, 40227, '...'])
print('Before', data['propertyzoningdesc'].unique())
data['propertyzoningdesc'] = data['propertyzoningdesc'].fillna(0).astype(np.str)
print('After', data['propertyzoningdesc'].unique())
print('Before', data['rawcensustractandblock'].unique())
data['rawcensustractandblock'] = data['rawcensustractandblock'].fillna(0).astype(
print('After', data['rawcensustractandblock'].unique())
print('Before', data['regionidcity'].unique()[:8].tolist() + ['...'])
data['regionidcity'] = data['regionidcity'].fillna(0).astype(np.int32)
print('After', data['regionidcity'].unique()[:8].tolist() + ['...'])
1/12/2018 lab1
http://localhost:8888/notebooks/Documents/GitHub/data_mining_group_project-master/notebooks/Lab%201/lab1.ipynb 29/63
Has datatype: nominal and 0 percent of values missing. We changed the column datatype to
integer.
In [53]:
Variable: regionidneighborhood - Neighborhood in which the property is
located
Has datatype: nominal and 61.1 percent of values missing
We replaced all missing values with 0 which will represent no region ID neighborhood. We changed
the column datatype to integer.
In [54]:
Variable: regionidzip - Zip code in which the property is located
Has datatype: nominal and 0.08 percent of values missing
We replaced all missing values with 0 which will represent no zip code. We changed the column
datatype to integer.
In [55]:
Variable: roomcnt - Total number of rooms in the principal
residence
Has datatype: nominal and 0.001 percent of values missing
We replaced all missing values with 1 which will represent no Total number of rooms in the principal
residence reported. We changed the column datatype to integer. We then replaced all outliers with a
maximum and minimum value of (mean ± 5 * std), respectively.
('Before', array([ 3101., 1286., 2061.]))
('After', array([3101, 1286, 2061]))
('Before', [nan, 27080.0, 46795.0, 274049.0, 31817.0, 37739.0, 115729.0, 7877.
0, '...'])
('After', [0, 27080, 46795, 274049, 31817, 37739, 115729, 7877, '...'])
('Before', [96337.0, 96095.0, 96424.0, 96450.0, 96446.0, 96049.0, 96434.0, 9643
6.0, '...'])
('After', [96337, 96095, 96424, 96450, 96446, 96049, 96434, 96436, '...'])
print('Before', data['regionidcounty'].unique())
data['regionidcounty'] = data['regionidcounty'].astype(np.int32)
print('After', data['regionidcounty'].unique())
print('Before', data['regionidneighborhood'].unique()[:8].tolist() + ['...'])
data['regionidneighborhood'] = data['regionidneighborhood'].fillna(0).astype(np.i
print('After', data['regionidneighborhood'].unique()[:8].tolist() + ['...'])
print('Before', data['regionidzip'].unique()[:8].tolist() + ['...'])
data['regionidzip'] = data['regionidzip'].fillna(0).astype(np.int32)
print('After', data['regionidzip'].unique()[:8].tolist() + ['...'])
1/12/2018 lab1
http://localhost:8888/notebooks/Documents/GitHub/data_mining_group_project-master/notebooks/Lab%201/lab1.ipynb 30/63
In [56]:
Variable: storytypeid - Type of floors in a multi-story house (i.e.
basement and main level, split-level, attic, etc.). See tab for
details.
Has datatype: nominal and 99.9 percent of values missing
With 99% missing values, we decided to remove this variable.
In [57]:
Variable: structuretaxvaluedollarcnt - the assessed value of the
building
Has datatype: ratio and 1.46 percent of values missing
We replaced all missing values with the median assessed building tax. We changed the column
datatype to integer. We then replaced all outliers with a maximum and minimum value of (mean ± 5
* std), respectively.
Outliers found!
('Before', array([ 0. , 8. , 4. , 5. ,
7. , 6. , 11. , 3. ,
10. , 9. , 2. , 12. ,
15.67699991, 13. , 15. , 14. ,
1. , nan]))
('After', array([ 0, 8, 4, 5, 7, 6, 11, 3, 10, 9, 2, 12, 15, 13, 14,
1]))
fix_outliers(data, 'roomcnt')
print('Before', data['roomcnt'].unique())
data['roomcnt'] = data['roomcnt'].fillna(1).astype(np.int32)
print('After', data['roomcnt'].unique())
del data['storytypeid']
1/12/2018 lab1
http://localhost:8888/notebooks/Documents/GitHub/data_mining_group_project-master/notebooks/Lab%201/lab1.ipynb 31/63
In [58]:
Variable: taxamount - property tax for the assessment year
Has datatype: ratio and 0.66 percent of values missing
We replaced all missing values with the median property taxes for the assessment year. We
changed the column datatype to integer. We then replaced all outliers with a maximum and
minimum value of (mean ± 5 * std), respectively.
Outliers found!
('Before', array([ nan, 650756., 571346., ..., 409940., 463704., 43776
5.]))
('After', array([122590, 650756, 571346, ..., 409940, 463704, 437765]))
fix_outliers(data, 'structuretaxvaluedollarcnt')
print('Before', data['structuretaxvaluedollarcnt'].unique())
medTax = np.nanmedian(data['structuretaxvaluedollarcnt'])
data['structuretaxvaluedollarcnt'] = data['structuretaxvaluedollarcnt'].fillna(med
print('After', data['structuretaxvaluedollarcnt'].unique())
1/12/2018 lab1
http://localhost:8888/notebooks/Documents/GitHub/data_mining_group_project-master/notebooks/Lab%201/lab1.ipynb 32/63
In [59]:
Variable: taxdelinquencyflag - property taxes from 2015 that are past
due
Has datatype: nominal and 98.10 percent of values missing
We replaced all missing values with 0 representing no past due property taxes and all Y values with
1 representing that there are past due property taxes. We changed the column datatype to integer.
In [60]:
Variable: taxdelinquencyyear - years of delinquency
Has datatype: interval and 98.10 percent of values missing
We replaced all missing values with 0 representing no years of property tax delinquencies. We
changed the column datatype to integer. We then replaced all outliers with a maximum and
minimum value of (mean ± 5 * std), respectively.
Outliers found!
('Before', array([ nan, 20800.37, 14557.57, ..., 33604.04, 12627.18,
15546.14]))
('After', array([ 3991.7800293 , 20800.36914062, 14557.5703125 , ...,
33604.0390625 , 12627.1796875 , 15546.13964844]))
('Before', array([nan, 'Y'], dtype=object))
('After', array([0, 1]))
fix_outliers(data, 'taxamount')
print('Before', data['taxamount'].unique())
median_value = data['taxamount'].median()
data['taxamount'] = data['taxamount'].fillna(median_value).astype(np.float32)
print('After', data['taxamount'].unique())
print('Before', data['taxdelinquencyflag'].unique())
data['taxdelinquencyflag'] = data['taxdelinquencyflag'].fillna(0).replace('Y', 1)
print('After', data['taxdelinquencyflag'].unique())
1/12/2018 lab1
http://localhost:8888/notebooks/Documents/GitHub/data_mining_group_project-master/notebooks/Lab%201/lab1.ipynb 33/63
In [61]:
Variable: taxvaluedollarcnt - total tax
Has datatype: ratio and 1.04 percent of values missing
We replaced all missing values with the median total tax amount. We changed the column datatype
to integer. We then replaced all outliers with a maximum and minimum value of (mean ± 5 * std),
respectively.
Outliers found!
('Before', array([ nan, 13. , 15. , 11. ,
14. , 9. , 10. , 8. ,
12. , 7. , 6. , 2. ,
26.79676804, 5. , 3. , 4. ,
0.98797484, 1. ]))
('After', array([ 0, 13, 15, 11, 14, 9, 10, 8, 12, 7, 6, 2, 26, 5, 3,
4, 1]))
fix_outliers(data, 'taxdelinquencyyear')
print('Before', data['taxdelinquencyyear'].unique())
data['taxdelinquencyyear'] = data['taxdelinquencyyear'].fillna(0).astype(np.int32
print('After', data['taxdelinquencyyear'].unique())
1/12/2018 lab1
http://localhost:8888/notebooks/Documents/GitHub/data_mining_group_project-master/notebooks/Lab%201/lab1.ipynb 34/63
In [62]:
Variable: threequarterbathnbr - Number of 3/4 bathrooms in house
(shower + sink + toilet)
Has datatype: ordinal and 89.5 percent of values missing
We replaced all missing values with 0 which will represent no Number of 3/4 bathrooms in the
property. We changed the column datatype to integer.
In [63]:
Variable: transactiondate - Date of the transaction response
variable
Has datatype: interval and 96.964429 percent of values missing
Will not fill any missing values because they represent the test part of the dataset
Outliers found!
('Before', array([ 9.00000000e+00, 2.75160000e+04, 1.41338700e+06, ...,
4.70248000e+05, 6.43794000e+05, 5.30550000e+05]))
('After', array([ 9, 27516, 1413387, ..., 470248, 643794, 530550]))
('Before', array([ nan, 1., 2., 4., 3., 6., 5., 7.]))
('After', array([0, 1, 2, 4, 3, 6, 5, 7]))
fix_outliers(data, 'taxvaluedollarcnt')
print('Before', data['taxvaluedollarcnt'].unique())
median_value = data['taxvaluedollarcnt'].median()
data['taxvaluedollarcnt'] = data['taxvaluedollarcnt'].fillna(median_value).astype
print('After', data['taxvaluedollarcnt'].unique())
print('Before', data['threequarterbathnbr'].unique())
data['threequarterbathnbr'] = data['threequarterbathnbr'].fillna(0).astype(np.int
print('After', data['threequarterbathnbr'].unique())
1/12/2018 lab1
http://localhost:8888/notebooks/Documents/GitHub/data_mining_group_project-master/notebooks/Lab%201/lab1.ipynb 35/63
In [64]:
Variable: typeconstructiontypeid - What type of construction material
was used to construct the home
Has datatype: nominal and 99.7 percent of values missing
With 99% missing values, we decided to remove this variable.
In [65]:
Variable: unitcnt - number of units in the building
Has datatype: ordinal and 33.5 percent of values missing
We replaced all missing values with 1 to represent a single family home for any with no values. We
changed the column datatype to integer. We then replaced all outliers with a maximum and
minimum value of (mean ± 5 * std), respectively.
In [66]:
Variable: yardbuildingsqft17 - sq feet of patio in yard
Has datatype: interval and 97.29 percent of values missing
Outliers found!
('Before', [nan, 2.0, 1.0, 3.0, 5.0, 4.0, 9.0, 13.420418204007635, '...'])
('After', array([ 1, 2, 3, 5, 4, 9, 13, 12, 6, 7, 8, 10, 11]))
data['transactiondate'] = pd.to_datetime(data['transactiondate'])
del data['typeconstructiontypeid']
fix_outliers(data, 'unitcnt')
print('Before', data['unitcnt'].unique()[:8].tolist() + ['...'])
data['unitcnt'] = data['unitcnt'].fillna(1).astype(np.int32)
print('After', data['unitcnt'].unique())
1/12/2018 lab1
http://localhost:8888/notebooks/Documents/GitHub/data_mining_group_project-master/notebooks/Lab%201/lab1.ipynb 36/63
We replaced all missing values with 0 representing no patio. We changed the column datatype to
integer. We then replaced all outliers with a maximum and minimum value of (mean ± 5 * std),
respectively.
In [67]:
Variable: yardbuildingsqft26 - storage shed/building in yard
Has datatype: interval and 99.91 percent of values missing
We replaced all missing values with 0 which will represent no (square ft) storage shed or building in
the yard. We changed the column datatype to integer. We then replaced all outliers with a maximum
and minimum value of (mean ± 5 * std), respectively.
Outliers found!
('Before', array([ nan, 450., 94., ..., 969., 1359., 1079.]))
('After', array([ 0, 450, 94, ..., 969, 1359, 1079]))
fix_outliers(data, 'yardbuildingsqft17')
print('Before', data['yardbuildingsqft17'].unique())
data['yardbuildingsqft17'] = data['yardbuildingsqft17'].fillna(0).astype(np.int32
print('After', data['yardbuildingsqft17'].unique())
1/12/2018 lab1
http://localhost:8888/notebooks/Documents/GitHub/data_mining_group_project-master/notebooks/Lab%201/lab1.ipynb 37/63
In [68]:
Variable: yearbuilt - The Year the residence was built
Has datatype: interval and 1.63 percent of values missing
We replaced all missing values with the median year built of 1963 until we have a better method to
impute. We changed the column datatype to integer.
In [69]:
End of data cleaning
We went through every variable and next cell will confirm that the dataset has no missing values.
In [70]:
Simple Statistics
Outliers found!
('Before', [nan, 1948.0, 1947.0, 1943.0, 1946.0, 1978.0, 1958.0, 1949.0,
'...'])
('After', [1963, 1948, 1947, 1943, 1946, 1978, 1958, 1949, '...'])
fix_outliers(data, 'yardbuildingsqft26')
data['yardbuildingsqft26'] = data['yardbuildingsqft26'].fillna(0).astype(np.float
#there's too many values to print, before and after data redacted
print('Before', data['yearbuilt'].unique()[:8].tolist() + ['...'])
medYear = data['yearbuilt'].median()
data['yearbuilt'] = data['yearbuilt'].fillna(medYear).astype(np.int32)
print('After', data['yearbuilt'].unique()[:8].tolist() + ['...'])
# 'logerror' and 'transactiondate' are future variables and only exist in the tran
explanatory_vars = data.columns[~data.columns.isin(['logerror', 'transactiondate'
assert np.all(~data[explanatory_vars].isnull())
1/12/2018 lab1
http://localhost:8888/notebooks/Documents/GitHub/data_mining_group_project-master/notebooks/Lab%201/lab1.ipynb 38/63
10 points
Description:
Visualize appropriate statistics (e.g., range, mode, mean, median, variance, counts) for a subset of
attributes. Describe anything meaningful you found from this or if you found something potentially
interesting. Note: You can also use data from other sources for comparison. Explain why the
statistics run are meaningful.
Table of Binary Variables (0 or 1)
We standardized all Yes/No and True/False variables to 1 or 0, respectively. The table below shows
that all binary flags in this dataset represent rare features such a pool, hot tub, tax delinquency flag,
and three quarter bathroom.
In [71]:
Summary Statistics of All Continuous Variables
To make the table more readable, we converted all simple statistics of continuous variables to
integers. We lose some precision but we get a better overview. For each variable, we have already
accounted for outliers and standardized missing values. We can immediately see that 0 is the most
common value for many of the variables. To explore further, we chose to visualize each variable that
had non-zero 25% to 75% values in the form of a boxplot and histogram.
Out[71]: Percent with value equal to 1
hashottuborspa 2.320720
poolcnt 17.403347
pooltypeid2 1.078548
pooltypeid7 16.324799
pooltypeid10 1.242172
taxdelinquencyflag 1.898850
threequarterbathnbr 10.584165
bin_vars = ['hashottuborspa', 'poolcnt', 'pooltypeid2', 'pooltypeid7', 'pooltypeid
bin_data = data[bin_vars]
result_table = bin_data.mean() * 100
pd.DataFrame(result_table, columns=['Percent with value equal to 1'])
1/12/2018 lab1
http://localhost:8888/notebooks/Documents/GitHub/data_mining_group_project-master/notebooks/Lab%201/lab1.ipynb 39/63
In [72]:
Calculated Finished Square Feet
For calculated square feet, most values were 0 with a range from 0 to 10898 sqft. Note that we
removed outliers earlier while cleaning the data. The median of 1561 was a little smaller than the
mean of 1784 so we expect to see a slight right skew, which we do below. What is interesting here
is the peak at 0 of values and then another peak around 1600 to 1800. We continue to have few
properties with very large (higher than the 75th percentile of 2124) which is fairly normal for any
area to have the middle class homes with few larger homes mixed in.
Out[72]: count mean std min 25% 50% 75% max
calculatedfinishedsquarefeet 2973905 1784 984 0 1199 1561 2124 1092
finishedsquarefeet12 2973905 1596 958 0 1092 1466 1996 6615
finishedsquarefeet50 2973905 94 390 0 0 0 0 3130
garagetotalsqft 2973905 113 217 0 0 0 0 1610
lotsizesquarefeet 2973905 19810 73796 0 5200 6700 9243 1710
poolsizesum 2973905 90 196 0 0 0 0 1476
yardbuildingsqft17 2973905 8 61 0 0 0 0 1485
yardbuildingsqft26 2973905 0 12 0 0 0 0 2126
yearbuilt 2973905 1964 23 1801 1950 1963 1981 2015
structuretaxvaluedollarcnt 2973905 166367 179850 1 75440 122590 195143 2181
taxvaluedollarcnt 2973905 407695 429374 1 181179 306086 485000 4052
assessmentyear 2973905 2014 0 2000 2015 2015 2015 2016
landtaxvaluedollarcnt 2973905 242391 287722 1 76724 167043 303002 2477
taxamount 2973905 5229 5284 1 2471 3991 6178 5129
taxdelinquencyyear 2973905 0 1 0 0 0 0 26
logerror 90275 0 0 -4 0 0 0 4
train_data = data[~data['logerror'].isnull()]
continous_vars = variables[variables['type'].isin(['ratio', 'interval'])].index
continous_vars = continous_vars[continous_vars.isin(data.columns)]
continous_vars = continous_vars[~continous_vars.isin(['longitude', 'latitude'])]
output_table = data[continous_vars].describe().T
mode_range = data[continous_vars].mode().T
mode_range.columns = ['mode']
mode_range['range'] = data[continous_vars].max() - data[continous_vars].min()
output_table = output_table.join(mode_range)
output_table.astype(int)
1/12/2018 lab1
http://localhost:8888/notebooks/Documents/GitHub/data_mining_group_project-master/notebooks/Lab%201/lab1.ipynb 40/63
In [73]:
Finished Living Area
Similar to calculated finished square feet, finished living area had outliers which we already fixed
above. The range for finished living area is 0 to 6871 with 0 being the mode of the data. The mean
(1596) is about 100sqft larger than the median (1466) so they are relatively the same since the
variance is 962.
This variable is bimodal in with a large spike at 0 and another peak with a fairly normal distribution
and long right tail at around 1400.
We also see a slight spike at the very end of the tail of the dataset. This means there were a lot of
outliers that were set to the maximum (mean + 6 * std).
f, (ax0,ax2) = plt.subplots(nrows=2, ncols=1, figsize=[15, 7])
sns.boxplot(data['calculatedfinishedsquarefeet'], ax=ax0, color="#34495e").set_tit
sns.distplot(data['calculatedfinishedsquarefeet'], ax=ax2, color="#34495e");
1/12/2018 lab1
http://localhost:8888/notebooks/Documents/GitHub/data_mining_group_project-master/notebooks/Lab%201/lab1.ipynb 41/63
In [74]:
Lot Size Square Feet
Lot size square feet has the largest range from 0 to 1,710,750 even after removing all outliers
(mean + std * 5). The mode for this variable is 0 so we see below a spike at 0 and a very long right
tail.
What is interesting with this variable is the large variance of 73796. The 25th to 75th percentile
values are 5200 and 9243 respectively so we will skipped over the box plot and plotted the
histogram below.
In the histogram, we see a right skewed distribution which makes sense considering the mean is
19810 and the median is 6700 - again, with such a large variance it is difficult for the eye to see the
difference. The main takeaway here is the large number of 0s.
f, (ax0,ax2) = plt.subplots(nrows=2, ncols=1, figsize=[15, 7])
sns.boxplot(data['finishedsquarefeet12'], ax=ax0, color="#34495e").set_title('Fin
sns.distplot(data['finishedsquarefeet12'], ax=ax2, color="#34495e");
1/12/2018 lab1
http://localhost:8888/notebooks/Documents/GitHub/data_mining_group_project-master/notebooks/Lab%201/lab1.ipynb 42/63
In [75]:
Year Built
The year the properties were built ranges from 1801 to 2015. The mode and median of 1963 is only
a year difference from the mean of 1964. The distribution seems to be fairly normal with the peak in
the early 1960s and dropping off on both sides. We see a number of homes that were built before
1905 (the low whisker of the boxplot) which gives us a long left tail.
We see a few other spikes in homes built which could correlate to a number of other factors such as
healthy economic growth, political backing on mortgages, or rises in population. The baby boomers
born early 1960 shows many houses being built and around the time they turned 18 more houses
seem to have been built. We see an apparent fall right before 2000 which could be the dot com
burst and another drop in the housing burst of 2007. Because our data was collected in 2016, we
expect to see fewer homes built the previous year.
What will be interesting with this variable is how old a home has to be to begin to "fall apart" or need
major renovations to the piping or foundation. Will a home built in a certain year have many homes
made from a faulty material that causes damages later on? Will the Zestimate take into account the
disclosures of a home that each sale price typically does?
f, (ax0) = plt.subplots(nrows=1, ncols=1)
sns.distplot(data['lotsizesquarefeet'], ax=ax0, color="#34495e").set_title('Lot s
1/12/2018 lab1
http://localhost:8888/notebooks/Documents/GitHub/data_mining_group_project-master/notebooks/Lab%201/lab1.ipynb 43/63
In [76]:
Total Tax Value
The total tax value of the property ranges from 1 to 4,052,186. The median of 306,086 is the same
as the mode and a little smaller than the mean of 407,695 which is evident in the right skewed
distribution below. These values have already been adjusted for outliers which is why we see a
slight spike at the maximum value for larger developments and unique mansions.
The distribution is fairly similar to square feet above because the tax is calculated by value
assessed * square feet. What is interesting to note here is the missing values for tax were replaced
by the median (hence the median and mode being the same) where the square footage missing
values were replaced with 0s (hence the 0 as the mode and second peak in the distribution).
f, (ax0,ax2) = plt.subplots(nrows=2, ncols=1)
sns.boxplot(data['yearbuilt'].dropna(), ax=ax0, color="#34495e").set_title('Year
sns.distplot(data['yearbuilt'].dropna(), ax=ax2, color="#34495e");
1/12/2018 lab1
http://localhost:8888/notebooks/Documents/GitHub/data_mining_group_project-master/notebooks/Lab%201/lab1.ipynb 44/63
In [77]:
Building and Land Tax
The building or structure tax has a similar right skewed distribution as total tax. The values range
from 1 to 2,165,929, already adjusted for outliers and cleaned up with missing values set to median.
That being said, the median and mode are the same at 122,590 which is lower than the mean of
166,344.
The land tax values range from 1 to 2,477,536, also adjusted for outliers and cleaned up with
missing values set to median. Because of this, the median and mode are the same at 167,043
which is lower than the mean of 242,391.
Land tax seems to have a larger range of values from the 25th to 75th percentile than the building
tax. This means that the land is valued at a greater variance (287k) than the buildings in certain
areas (variance of 179k). We think this could be due to location itself as better neighborhoods, safer
areas, or better schools could result in a higher assessment than other locations, thus widening the
variance.
f, (ax0,ax2) = plt.subplots(nrows=2, ncols=1)
sns.boxplot(data['taxvaluedollarcnt'], ax=ax0, color="#34495e").set_title('Total t
sns.distplot(data['taxvaluedollarcnt'], ax=ax2, color="#34495e");
1/12/2018 lab1
http://localhost:8888/notebooks/Documents/GitHub/data_mining_group_project-master/notebooks/Lab%201/lab1.ipynb 45/63
In [78]:
Assessment Year
Assessment year is the year that the property was assessed. The 25th through 75th percentile of
values are all from the year 2015 so reading a box plot is not very helpful. Instead we list out the
unique values for assessment year along with our histogram.
In the state of California, the base year value is set when you originally purchase the property,
based on the sales price listed on the deed. However, there are exceptions which is why we see a
few assessment years from 2000 to 2016 thrown in.
In order for assessment year to be useful for our predictions, we should find out what each
exception is and what the cause of it not to be assessed at the point of sale. This could affect the
predicted log error.
f, (ax0,ax2,ax3,ax4) = plt.subplots(nrows=4, ncols=1, figsize=[15, 14])
sns.boxplot(data['structuretaxvaluedollarcnt'], ax=ax0, color="#34495e").set_title
sns.distplot(data['structuretaxvaluedollarcnt'], ax=ax2, color="#34495e");
sns.boxplot(data['landtaxvaluedollarcnt'], ax=ax3, color="#34495e").set_title('La
sns.distplot(data['landtaxvaluedollarcnt'], ax=ax4, color="#34495e");
1/12/2018 lab1
http://localhost:8888/notebooks/Documents/GitHub/data_mining_group_project-master/notebooks/Lab%201/lab1.ipynb 46/63
In [79]:
Visualize Attributes
15 points
Description:
Visualize the most interesting attributes (at least 5 attributes, your opinion on what is interesting).
Important: Interpret the implications for each visualization. Explain for each attribute why the chosen
visualization is appropriate.
Distribution of Target Variable: Logerror
In the training dataset, logerror is the response variable so we are interested in seeing the
distribution of log error that we are training on. We visualize this using a boxplot and histogram to
get a general picture of the overall distribution. It is symmetric around zero which implies that the
model generating the logerror has no bias and is very accurate in most instances.
('Unique years:', array([2015, 2014, 2003, 2012, 2001, 2011, 2013, 2016, 2010,
2004, 2005,
2002, 2000, 2009]))
print('Unique years:', data['assessmentyear'].unique())
f, (ax2) = plt.subplots(nrows=1, ncols=1, figsize=[15, 4])
sns.distplot(data['assessmentyear'], ax=ax2, color="#34495e")
plt.title('Assessment year distribution');
1/12/2018 lab1
http://localhost:8888/notebooks/Documents/GitHub/data_mining_group_project-master/notebooks/Lab%201/lab1.ipynb 47/63
In [80]:
Count of Bathrooms
We think that the number of bathrooms in a home could be interesting because our data was
collected in California where rent is very high. It is common to buy a rental property and have
random tenants. Tenants that do not know each other may want their own bathroom. In our case,
most homes have 2 bathrooms. Notably, there are outliers with no bathrooms or suspiciously high
train_data = data[~data['logerror'].isnull()]
x = train_data['logerror']
f, (ax_box, ax_hist) = plt.subplots(2, sharex=True,
gridspec_kw={"height_ratios": (.15, .85)}, figsize=(10, 10))
sns.boxplot(train_data['logerror'][train_data['logerror'].abs()<1], ax=ax_box, co
sns.distplot(
train_data['logerror'][train_data['logerror'].abs()<1],
ax=ax_hist, bins=400, kde=False, color="#34495e");
1/12/2018 lab1
http://localhost:8888/notebooks/Documents/GitHub/data_mining_group_project-master/notebooks/Lab%201/lab1.ipynb 48/63
counts. We see records in the dataset with no bathroom which we justified above as being possible.
Because we are looking at frequency, we chose to visualize the sum of each number of bathrooms
(as a category) in a bar chart.
In [81]:
Count of Bedrooms
For the same reasons we were interested in the number of bathrooms, we are also interested in the
number of bedrooms. In our dataset, most properties have 3 bedrooms and we see fewer instances
as we go up or down one bedroom in the data. Here we still see records without any bedrooms
which we justified as studios above. We chose the same visualization (using number of bedrooms
as a category and counting the frequency of each category) displayed in a bar chart below.
sns.countplot(data['bathroomcnt'], color="#34495e")
plt.ylabel('Count', fontsize=12)
plt.xlabel('Bathrooms', fontsize=12)
plt.title("Frequency of Bathroom count", fontsize=15);
1/12/2018 lab1
http://localhost:8888/notebooks/Documents/GitHub/data_mining_group_project-master/notebooks/Lab%201/lab1.ipynb 49/63
In [82]:
Bed to Bath Ratio
After visualizing the distribution of bathroom and bedroom counts, we also thought it would be
interesting to try to see if the number of bathrooms were dependent on the number of bedrooms.
We chose to stick with a bar chart, only this time using the ratio of bedrooms to bathrooms as the
category to find the sum counts of. What we found is most homes have about a ratio of 1.5
bedrooms per 1 bathroom in a property.
plt.ylabel('Count', fontsize=12)
plt.xlabel('Bedrooms', fontsize=12)
plt.title("Frequency of Bedrooms count", fontsize=15)
sns.countplot(data['bedroomcnt'], color="#34495e");
1/12/2018 lab1
http://localhost:8888/notebooks/Documents/GitHub/data_mining_group_project-master/notebooks/Lab%201/lab1.ipynb 50/63
In [83]:
Average Tax Per Square Feet
For our last attribute, we calculated the tax per square foot to see if we could find any trends. We
again chose to use a bar chart to plot the ratio and the sum counts. What we found is that plotting
this exposes extreme outliers for possible elimination. Most properties are under a few dollars per
square foot but as the visualization reveals, there are suspicious records. However, because this is
southern California and land space is limited for continuous growth, there could be a reason that
some places have high tax per square feet due to better real estate areas.
non_zero_mask = data['bathroomcnt'] > 0
bedroom = data[non_zero_mask]['bedroomcnt']
bathroom = data[non_zero_mask]['bathroomcnt']
bedroom_to_bath_ratio = bedroom / bathroom
bedroom_to_bath_ratio = bedroom_to_bath_ratio[bedroom_to_bath_ratio<6]
sns.distplot(bedroom_to_bath_ratio, color="#34495e", kde=False)
plt.title('Bed to Bath ratio', fontsize=15)
plt.xlabel('Ratio', fontsize=15)
plt.ylabel('Count', fontsize=15);
1/12/2018 lab1
http://localhost:8888/notebooks/Documents/GitHub/data_mining_group_project-master/notebooks/Lab%201/lab1.ipynb 51/63
In [84]:
Explore Joint Attributes
15 points
Description:
Visualize relationships between attributes: Look at the attributes via scatter plots, correlation, cross-
tabulation, group-wise averages, etc. as appropriate. Explain any interesting relationships.
Absolute Log Error and Number of Occurrences Per
Month
We compared amount of absolute error based monthly average and found out that the error could
be cyclical for the year, where the error dips during the Spring and Summer months and rises during
the Winter months.
non_zero_mask = data['calculatedfinishedsquarefeet'] > 0
tax = data[non_zero_mask]['taxamount']
sqft = data[non_zero_mask]['calculatedfinishedsquarefeet']
tax_per_sqft = tax / sqft
tax_per_sqft = tax_per_sqft[tax_per_sqft<10]
sns.distplot(tax_per_sqft, color="#34495e", kde=False)
plt.title('Tax Per Square Feet', fontsize=15)
plt.xlabel('Ratio', fontsize=15)
plt.ylabel('Count', fontsize=15);
1/12/2018 lab1
http://localhost:8888/notebooks/Documents/GitHub/data_mining_group_project-master/notebooks/Lab%201/lab1.ipynb 52/63
We compared amount of transactions based monthly average and the transactions are the highest
during the Spring, Summer, and Fall seasons possibly due to an optimal time to sell property. The
transactions are at its lowest during the Winter season.
For a cross comparison, we have high number of transactions during the Spring and Summer
seasons while the log error is relatively low and we have low number of transactions during the
Winter season while the log error is relatively high.
In [85]:
Number of Transactions and Mean Absolute Log Error Per
Day of the Week
Saturdays and Sundays are non-work days, hence why there is a dip in absolute log error and
number of transactions
For the workdays, Friday has the most transactions while Monday has the least.
months = train_data['transactiondate'].dt.month
month_names = ['January','February','March','April','May','June','July','August',
train_data['abs_logerror'] = train_data['logerror'].abs()
f, (ax0, ax1) = plt.subplots(nrows=1, ncols=2, figsize=[17, 7])
per_month = train_data.groupby(months)["abs_logerror"].mean()
per_month.index = month_names
ax0.set_title('Average Log Error Across Month Of 2016')
ax0.set_xlabel('Month Of The Year', fontsize=15)
ax0.set_ylabel('Log Error', fontsize=15)
sns.pointplot(x=per_month.index, y=per_month, color="#34495e", ax=ax0)
per_month = train_data.groupby(months)["logerror"].count()
per_month.index = month_names
ax1.set_title('No Of Occurunces per month In 2016')
ax1.set_xlabel('Month Of The Year', fontsize=15)
ax1.set_ylabel('Nimber of Occurences', fontsize=15)
sns.barplot(x=per_month.index, y=per_month, color="#34495e", ax=ax1);
1/12/2018 lab1
http://localhost:8888/notebooks/Documents/GitHub/data_mining_group_project-master/notebooks/Lab%201/lab1.ipynb 53/63
For the workdays, Monday has relatively the most log errors while Friday has relatively the least log
errors.
For cross analysis, Monday has the least transactions with the most error while Friday has the most
transactions with the least errors. Sunday and Saturday are special cases and does not have
substantial evidence to provide any trends.
In [86]:
Continuous Variable Correlation Heatmap
Heatmap of correlations are represented where the warmer colors are highly correlated, white is
non-correlated, and colder colors are negatively correlated. We see that calculated finished square
feet is correlated with finished square feet, due to collinearity. Tax amounts and year built are also
highly correlated to finished square feet as well as with one another.
Latitude and longitude are negatively correlated with each other possibly because the beachfront
properties are more expensive.
weekday = train_data['transactiondate'].dt.weekday
weekdays = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'S
abs_logerror = train_data['logerror'].abs()
f, (ax0, ax1) = plt.subplots(nrows=1, ncols=2, figsize=[17, 7])
to_plot = abs_logerror.groupby(weekday).count()
to_plot.index = weekdays
to_plot.plot(color="#34495e", linewidth=4, ax=ax0)
ax0.set_title('Number of Transactions Per Day')
ax0.set_ylabel('Number of Transactions', fontsize=15)
ax0.set_xlabel('Day', fontsize=15)
to_plot = abs_logerror.groupby(weekday).mean()
to_plot.index = weekdays
to_plot.plot(color="#34495e", linewidth=4, ax=ax1)
ax1.set_title('Mean Absolute Log Error Per Day')
ax1.set_ylabel('Mean Absolute Log Error', fontsize=15)
ax1.set_xlabel('Day', fontsize=15);
1/12/2018 lab1
http://localhost:8888/notebooks/Documents/GitHub/data_mining_group_project-master/notebooks/Lab%201/lab1.ipynb 54/63
In [87]:
Longitude and Latitude Data Points
train_data = data[~data['logerror'].isnull()]
continuous_vars = variables[variables['type'].isin(['ratio', 'interval'])].index
continuous_vars = continuous_vars[continuous_vars.isin(data.columns)]
continuous_vars = continuous_vars.sort_values()
corrs = train_data[continuous_vars].corr()
fig, ax = plt.subplots(figsize=(10, 10))
sns.heatmap(corrs, ax=ax)
plt.title("Variables correlation map", fontsize=20)
plt.xlabel('Continuous Variables', fontsize=15)
plt.ylabel('Continuous Variables', fontsize=15);
1/12/2018 lab1
http://localhost:8888/notebooks/Documents/GitHub/data_mining_group_project-master/notebooks/Lab%201/lab1.ipynb 55/63
From a simple graph, we can see the shoreline of California as well as possible areas of
obstruction, such as mountains that prevent property growth in those areas. The majority of
properties are in the center to upper left of the graph.
In [88]:
Number of Stories vs Year Built
As architectural feats improved, we started to see more properties with 2 or more stories by 1950.
The number of one story properties also increased during that time. The baby boomers, the end of
WWII and readily available steel, and mortgage incentives may be the cause of the increase of
more properties being built as well as more stories per property. Note: because we filled in missing
values as the median value, the 1965 spike in the data is artificial until we use other methods to
impute year built.
<matplotlib.figure.Figure at 0x114670f10>
plt.figure(figsize=(12,12));
sns.jointplot(x=data.latitude.values, y=data.longitude.values, size=10, color="#34
plt.ylabel('Longitude', fontsize=15)
plt.xlabel('Latitude', fontsize=15)
plt.title('Longitude and Latitude Data Points', fontsize=15);
1/12/2018 lab1
http://localhost:8888/notebooks/Documents/GitHub/data_mining_group_project-master/notebooks/Lab%201/lab1.ipynb 56/63
In [89]:
Explore Attributes and Class
10 points
Description:
Identify and explain interesting relationships between features and the class you are trying to
predict (i.e., relationships with variables and the target classification).
Correlation of Continuous Variables and Log
Error (Target Variable)
We see that calculatedfinishedsquarefeet has the highest correlation with log error (0.04) while price
per square feet has the highest negative correlation with log error (-0.02). taxvaluedollarcnt has
relatively low correlation with log error. We choose to further explore finishedsquarefeet12 and its
relationship with log error.
fig,ax1= plt.subplots()
fig.set_size_inches(20,10)
yearMerged = data.groupby(['yearbuilt', 'numberofstories'])["parcelid"].count().u
yearMerged = yearMerged.loc[1900:]
yearMerged.index.name = 'Year Built'
plt.title('Number of Stories Per Year Built', fontsize=15)
plt.ylabel('Count', fontsize=15);
yearMerged.plot(ax=ax1, linewidth=4);
1/12/2018 lab1
http://localhost:8888/notebooks/Documents/GitHub/data_mining_group_project-master/notebooks/Lab%201/lab1.ipynb 57/63
In [90]: train_data = data[~data['logerror'].isnull()]
continuous_vars = variables[variables['type'].isin(['ratio', 'interval'])].index
continuous_vars = continuous_vars[continuous_vars.isin(data.columns)]
continuous_vars = continuous_vars[~continuous_vars.isin(['logerror', 'transactiond
labels = []
values = []
for column in continuous_vars:
labels.append(column)
values.append(train_data[column].corr(train_data['logerror']))
corr = pd.DataFrame({'labels':labels, 'values':values}).fillna(0.)
corr = corr.sort_values(by='values')
labels = corr['labels'].values
values = corr['values'].values
fig, ax = plt.subplots(figsize=(10,10))
plt.barh(range(len(labels)), values, color="#34495e")
plt.title("Correlation of Continuous Variables", fontsize=15);
plt.xlabel('Correlation', fontsize=15)
plt.ylabel('Continuous Variable', fontsize=15)
ax.set_yticks(range(len(labels)))
ax.set_yticklabels(labels, rotation='horizontal');
1/12/2018 lab1
http://localhost:8888/notebooks/Documents/GitHub/data_mining_group_project-master/notebooks/Lab%201/lab1.ipynb 58/63
Scatterplot of Log Error and Calculated Finished Square
Feet
We are plotting our best correlated variable calculatedfinishedsquarefeet against the logerror. We
don't see any linear relationship from the scatter plot below even though it is evenly distributed.
In [91]:
New Features
5 points
Description:
Are there other features that could be added to the data or created from existing features? Which
ones?
column = "calculatedfinishedsquarefeet"
train_data = data[~data['logerror'].isnull()]
sns.jointplot(train_data[column], train_data['logerror'], size=10, color="#34495e
plt.ylabel('Log Error', fontsize=12)
plt.xlabel('Calculated Finished Square Feet', fontsize=15)
plt.title("Calculated Finished Square Feet Vs Log Error", fontsize=15);
1/12/2018 lab1
http://localhost:8888/notebooks/Documents/GitHub/data_mining_group_project-master/notebooks/Lab%201/lab1.ipynb 59/63
Tax Per Square Feet
We created a tax per square feet feature. It is negatively correlated with log error and we hope that
it will add value to a predictive model.
In [92]:
City zip code details
The Zillow dataset has a variable: 'regionidcity' which is a numerical ID, representing the city in
which the property is located (if any). We don't have a string variable showing the city name.
We found a government dataset publicly available on the internet, containing all zip codes as well
as other information associated with each zip code. We have downloaded the dataset from here:
http://federalgovernmentzipcodes.us (http://federalgovernmentzipcodes.us) and joined it with our
dataset with the cell below.
This will give us the actual name of the cities, zip code type and location type.
New Variables Joined:
zipcode_type Standard, PO BOX Only, Unique, Military (implies APO or FPO) - Zip code
type may provide useful insight towards prediction
city USPS official city name(s) - this will distinguish one county from another that was
lacking in the original dataset
location_type Primary, Acceptable, Not Acceptable - because these are all valid location
properties, they will most likely be acceptable.
In [93]:
Out[92]: ('Correlation with log error:', -0.014065552662672554)
The zips dataset has 81831 rows and 4 columns
The merged dataset has 3857451 rows and 53 columns
non_zero_mask = data['calculatedfinishedsquarefeet'] > 0
tax = data[non_zero_mask]['taxamount']
sqft = data[non_zero_mask]['calculatedfinishedsquarefeet']
data['price_per_sqft'] = tax / sqft
'Correlation with log error:', data['price_per_sqft'].corr(data['logerror'])
# data from http://federalgovernmentzipcodes.us
zips = pd.read_csv('../input/free-zipcode-database.csv', low_memory=False)
zips = zips[['Zipcode','ZipCodeType','City','LocationType']]
zips.columns = ['zipcode', 'zipcode_type', 'city', 'location_type']
assert np.all(~zips.isnull())
zips = zips.rename(columns={'zipcode':'regionidzip'})
data = pd.merge(data, zips, how='left', on='regionidzip')
print('The zips dataset has %d rows and %d columns' % zips.shape)
print('The merged dataset has %d rows and %d columns' % data.shape)
1/12/2018 lab1
http://localhost:8888/notebooks/Documents/GitHub/data_mining_group_project-master/notebooks/Lab%201/lab1.ipynb 60/63
Table of New Variables
Just focusing on the new features added to the dataset, here are the value types and descriptions.
In [94]:
Other Ideas For New Features
Other features that we thought about that we could include in the future are last remodel date of
kitchen or bathroom, key words in the description of overpriced Zestimates or underpriced
Zestimates, and how close a home is to a grocery store, Starbucks, a mall, or a place of interest.
A recently remodeled home could raise the actual sale price much higher than the Zestimate.
Certain words in the listing description could be associated with lower sale prices or people who bid
a higher sale price. Lastly the walkability, or how close a home is to a grocery store, Starbucks, a
mall, or place of interest could increase the final sale price as well.
Exceptional Work
10 points
Description:
You have free reign to provide additional analyses. One idea: implement dimensionality reduction,
then visualize and interpret the results.
Categorical Feature Importance
Out[94]: Variable Type Scale Description
city nominal
[APO, WHISKEYTOWN, nan,
REDDING, FPO, ... (239 More)]
USPS offical city name(s)
location_type nominal
[PRIMARY, nan, ACCEPTABLE,
NOT ACCEPTABLE]
Primary, Acceptable, Not
Acceptable
price_per_sqft ratio (0, 11911) Tax per SQFT
zipcode_type nominal
[MILITARY, PO BOX, nan,
STANDARD, UNIQUE]
Standard, PO BOX Only, Unique,
Military(implies APO or FPO)
variables_description = [
['price_per_sqft', 'ratio', 'TBD', 'Tax per SQFT']
,['zipcode_type', 'nominal', 'TBD', 'Standard, PO BOX Only, Unique, Military(impl
,['city', 'nominal', 'TBD', 'USPS offical city name(s)']
,['location_type', 'nominal', 'TBD', 'Primary, Acceptable, Not Acceptable']
]
new_variables = pd.DataFrame(variables_description, columns=['name', 'type', 'sca
new_variables = new_variables.set_index('name')
new_variables = new_variables.loc[new_variables.index.isin(data.columns)]
variables = variables.append(new_variables)
output_variables_table(new_variables)
1/12/2018 lab1
http://localhost:8888/notebooks/Documents/GitHub/data_mining_group_project-master/notebooks/Lab%201/lab1.ipynb 61/63
g p
According to a random forest model with seed 0, region id zip, bedroom count, census tract and
block, and region id neighborhood explain the most variance for log error. Even though the
importance of the other variables are relatively lower, they could provide greater importance if we
add interaction terms or use a different nonlinear model.
In [95]:
Continuous Feature Importance
from sklearn import ensemble
train_data = data[~data['logerror'].isnull()]
categorical_vars = variables[variables['type'].isin(['ordinal', 'nominal'])].index
categorical_vars = categorical_vars[categorical_vars.isin(data.columns)]
categorical_vars = categorical_vars[~categorical_vars.isin(['parcelid', 'logerror
X = train_data[categorical_vars]
# remove string types
categorical_vars = categorical_vars[X.dtypes != object]
X = X[categorical_vars]
y = train_data['logerror']
model = ensemble.ExtraTreesRegressor(random_state=0)
model.fit(X.fillna(0), y)
index = pd.Index(categorical_vars, name='Variable Name')
importance = pd.Series(model.feature_importances_, index=index)
importance.sort()
importance.plot(kind='barh', color="#34495e")
plt.title('Categorical Feature Importance')
plt.xlabel('Importance', fontsize=15);
1/12/2018 lab1
http://localhost:8888/notebooks/Documents/GitHub/data_mining_group_project-master/notebooks/Lab%201/lab1.ipynb 62/63
According to the linear regression model, the feature tax delinquency year explains the most
variance of log error. Even though the importance of the other variables are relatively lower, they
could provide greater importance if we add interaction terms or a higher order polynomials.
In [96]:
Exporting the cleaned datasets
from sklearn.linear_model import LinearRegression
train_data = data[~data['logerror'].isnull()]
continuous_vars = variables[variables['type'].isin(['ratio', 'interval'])].index
continuous_vars = continuous_vars[continuous_vars.isin(data.columns)]
continuous_vars = continuous_vars[~continuous_vars.isin(['parcelid', 'logerror',
X = train_data[continuous_vars]
y = train_data['logerror']
model = LinearRegression()
model.fit(X.fillna(0), y)
index = pd.Index(continuous_vars, name='Variable Name')
importance = pd.Series(np.abs(model.coef_), index=index)
importance.sort()
importance.plot(kind='barh', color="#34495e")
plt.title('Continuous Feature Importance')
plt.xlabel('Importance', fontsize=15);
1/12/2018 lab1
http://localhost:8888/notebooks/Documents/GitHub/data_mining_group_project-master/notebooks/Lab%201/lab1.ipynb 63/63
In [97]:
References
Kernels from Kaggle competition: https://www.kaggle.com/c/zillow-prize-1/kernels
(https://www.kaggle.com/c/zillow-prize-1/kernels)
Pandas cookbook: https://pandas.pydata.org/pandas-docs/stable/cookbook.html
(https://pandas.pydata.org/pandas-docs/stable/cookbook.html)
Stackoverflow pandas questions: https://stackoverflow.com/questions/tagged/pandas
(https://stackoverflow.com/questions/tagged/pandas)
test_mask = data['logerror'].isnull()
train_data = data[~test_mask]
test_data = data[test_mask]
train_data.to_csv('../datasets/train.csv', index=False)
test_data.to_csv('../datasets/test.csv', index=False)
variables.index.name = 'name'
variables.to_csv('../datasets/variables.csv', index=True)

More Related Content

What's hot

MongoDB and Indexes - MUG Denver - 20160329
MongoDB and Indexes - MUG Denver - 20160329MongoDB and Indexes - MUG Denver - 20160329
MongoDB and Indexes - MUG Denver - 20160329
Douglas Duncan
 
Webinar: Index Tuning and Evaluation
Webinar: Index Tuning and EvaluationWebinar: Index Tuning and Evaluation
Webinar: Index Tuning and Evaluation
MongoDB
 
MongoDB World 2016: Deciphering .explain() Output
MongoDB World 2016: Deciphering .explain() OutputMongoDB World 2016: Deciphering .explain() Output
MongoDB World 2016: Deciphering .explain() Output
MongoDB
 
Functions
FunctionsFunctions
Functions
G.C Reddy
 
Apache Spark - Basics of RDD & RDD Operations | Big Data Hadoop Spark Tutoria...
Apache Spark - Basics of RDD & RDD Operations | Big Data Hadoop Spark Tutoria...Apache Spark - Basics of RDD & RDD Operations | Big Data Hadoop Spark Tutoria...
Apache Spark - Basics of RDD & RDD Operations | Big Data Hadoop Spark Tutoria...
CloudxLab
 
Apache Spark - Key Value RDD - Transformations | Big Data Hadoop Spark Tutori...
Apache Spark - Key Value RDD - Transformations | Big Data Hadoop Spark Tutori...Apache Spark - Key Value RDD - Transformations | Big Data Hadoop Spark Tutori...
Apache Spark - Key Value RDD - Transformations | Big Data Hadoop Spark Tutori...
CloudxLab
 
Hadoop - MongoDB Webinar June 2014
Hadoop - MongoDB Webinar June 2014Hadoop - MongoDB Webinar June 2014
Hadoop - MongoDB Webinar June 2014
MongoDB
 
MongoDB
MongoDB MongoDB
Web Integration Patterns in the Era of HTML5
Web Integration Patterns in the Era of HTML5Web Integration Patterns in the Era of HTML5
Web Integration Patterns in the Era of HTML5
johnwilander
 
Time Series Analysis by JavaScript LL matsuri 2013
Time Series Analysis by JavaScript LL matsuri 2013 Time Series Analysis by JavaScript LL matsuri 2013
Time Series Analysis by JavaScript LL matsuri 2013
Daichi Morifuji
 
Webinarserie: Einführung in MongoDB: “Back to Basics” - Teil 3 - Interaktion ...
Webinarserie: Einführung in MongoDB: “Back to Basics” - Teil 3 - Interaktion ...Webinarserie: Einführung in MongoDB: “Back to Basics” - Teil 3 - Interaktion ...
Webinarserie: Einführung in MongoDB: “Back to Basics” - Teil 3 - Interaktion ...
MongoDB
 
Mythbusting: Understanding How We Measure Performance at MongoDB
Mythbusting: Understanding How We Measure Performance at MongoDBMythbusting: Understanding How We Measure Performance at MongoDB
Mythbusting: Understanding How We Measure Performance at MongoDB
MongoDB
 
Data in Motion: Streaming Static Data Efficiently 2
Data in Motion: Streaming Static Data Efficiently 2Data in Motion: Streaming Static Data Efficiently 2
Data in Motion: Streaming Static Data Efficiently 2
Martin Zapletal
 
Qtp test
Qtp testQtp test
Qtp test
G.C Reddy
 
Hack reduce mr-intro
Hack reduce mr-introHack reduce mr-intro
Hack reduce mr-intro
montrealouvert
 
MongoDB Performance Tuning
MongoDB Performance TuningMongoDB Performance Tuning
MongoDB Performance Tuning
MongoDB
 
Indexing Strategies to Help You Scale
Indexing Strategies to Help You ScaleIndexing Strategies to Help You Scale
Indexing Strategies to Help You Scale
MongoDB
 
Getting Started with MongoDB and NodeJS
Getting Started with MongoDB and NodeJSGetting Started with MongoDB and NodeJS
Getting Started with MongoDB and NodeJS
MongoDB
 
Geospatial Indexing and Querying with MongoDB
Geospatial Indexing and Querying with MongoDBGeospatial Indexing and Querying with MongoDB
Geospatial Indexing and Querying with MongoDB
Grant Goodale
 
MongoDB.local DC 2018: Tips and Tricks for Avoiding Common Query Pitfalls
MongoDB.local DC 2018: Tips and Tricks for Avoiding Common Query PitfallsMongoDB.local DC 2018: Tips and Tricks for Avoiding Common Query Pitfalls
MongoDB.local DC 2018: Tips and Tricks for Avoiding Common Query Pitfalls
MongoDB
 

What's hot (20)

MongoDB and Indexes - MUG Denver - 20160329
MongoDB and Indexes - MUG Denver - 20160329MongoDB and Indexes - MUG Denver - 20160329
MongoDB and Indexes - MUG Denver - 20160329
 
Webinar: Index Tuning and Evaluation
Webinar: Index Tuning and EvaluationWebinar: Index Tuning and Evaluation
Webinar: Index Tuning and Evaluation
 
MongoDB World 2016: Deciphering .explain() Output
MongoDB World 2016: Deciphering .explain() OutputMongoDB World 2016: Deciphering .explain() Output
MongoDB World 2016: Deciphering .explain() Output
 
Functions
FunctionsFunctions
Functions
 
Apache Spark - Basics of RDD & RDD Operations | Big Data Hadoop Spark Tutoria...
Apache Spark - Basics of RDD & RDD Operations | Big Data Hadoop Spark Tutoria...Apache Spark - Basics of RDD & RDD Operations | Big Data Hadoop Spark Tutoria...
Apache Spark - Basics of RDD & RDD Operations | Big Data Hadoop Spark Tutoria...
 
Apache Spark - Key Value RDD - Transformations | Big Data Hadoop Spark Tutori...
Apache Spark - Key Value RDD - Transformations | Big Data Hadoop Spark Tutori...Apache Spark - Key Value RDD - Transformations | Big Data Hadoop Spark Tutori...
Apache Spark - Key Value RDD - Transformations | Big Data Hadoop Spark Tutori...
 
Hadoop - MongoDB Webinar June 2014
Hadoop - MongoDB Webinar June 2014Hadoop - MongoDB Webinar June 2014
Hadoop - MongoDB Webinar June 2014
 
MongoDB
MongoDB MongoDB
MongoDB
 
Web Integration Patterns in the Era of HTML5
Web Integration Patterns in the Era of HTML5Web Integration Patterns in the Era of HTML5
Web Integration Patterns in the Era of HTML5
 
Time Series Analysis by JavaScript LL matsuri 2013
Time Series Analysis by JavaScript LL matsuri 2013 Time Series Analysis by JavaScript LL matsuri 2013
Time Series Analysis by JavaScript LL matsuri 2013
 
Webinarserie: Einführung in MongoDB: “Back to Basics” - Teil 3 - Interaktion ...
Webinarserie: Einführung in MongoDB: “Back to Basics” - Teil 3 - Interaktion ...Webinarserie: Einführung in MongoDB: “Back to Basics” - Teil 3 - Interaktion ...
Webinarserie: Einführung in MongoDB: “Back to Basics” - Teil 3 - Interaktion ...
 
Mythbusting: Understanding How We Measure Performance at MongoDB
Mythbusting: Understanding How We Measure Performance at MongoDBMythbusting: Understanding How We Measure Performance at MongoDB
Mythbusting: Understanding How We Measure Performance at MongoDB
 
Data in Motion: Streaming Static Data Efficiently 2
Data in Motion: Streaming Static Data Efficiently 2Data in Motion: Streaming Static Data Efficiently 2
Data in Motion: Streaming Static Data Efficiently 2
 
Qtp test
Qtp testQtp test
Qtp test
 
Hack reduce mr-intro
Hack reduce mr-introHack reduce mr-intro
Hack reduce mr-intro
 
MongoDB Performance Tuning
MongoDB Performance TuningMongoDB Performance Tuning
MongoDB Performance Tuning
 
Indexing Strategies to Help You Scale
Indexing Strategies to Help You ScaleIndexing Strategies to Help You Scale
Indexing Strategies to Help You Scale
 
Getting Started with MongoDB and NodeJS
Getting Started with MongoDB and NodeJSGetting Started with MongoDB and NodeJS
Getting Started with MongoDB and NodeJS
 
Geospatial Indexing and Querying with MongoDB
Geospatial Indexing and Querying with MongoDBGeospatial Indexing and Querying with MongoDB
Geospatial Indexing and Querying with MongoDB
 
MongoDB.local DC 2018: Tips and Tricks for Avoiding Common Query Pitfalls
MongoDB.local DC 2018: Tips and Tricks for Avoiding Common Query PitfallsMongoDB.local DC 2018: Tips and Tricks for Avoiding Common Query Pitfalls
MongoDB.local DC 2018: Tips and Tricks for Avoiding Common Query Pitfalls
 

Similar to Lab 1: Data cleaning, exploration, removal of outliers, Correlation of Continuous Variables and Log Error (Target Variable), scatterplot analysis, adding new data features, Categorical and Continuous Feature Importance

(DAT401) Amazon DynamoDB Deep Dive
(DAT401) Amazon DynamoDB Deep Dive(DAT401) Amazon DynamoDB Deep Dive
(DAT401) Amazon DynamoDB Deep Dive
Amazon Web Services
 
Андрей Козлов (Altoros): Оптимизация производительности Cassandra
Андрей Козлов (Altoros): Оптимизация производительности CassandraАндрей Козлов (Altoros): Оптимизация производительности Cassandra
Андрей Козлов (Altoros): Оптимизация производительности Cassandra
Olga Lavrentieva
 
Linear Model Selection and Regularization (Article 6 - Practical exercises)
Linear Model Selection and Regularization (Article 6 - Practical exercises)Linear Model Selection and Regularization (Article 6 - Practical exercises)
Linear Model Selection and Regularization (Article 6 - Practical exercises)
Theodore Grammatikopoulos
 
Test data article
Test data articleTest data article
Test data article
David Harrison
 
OpenGL 4.6 Reference Guide
OpenGL 4.6 Reference GuideOpenGL 4.6 Reference Guide
OpenGL 4.6 Reference Guide
The Khronos Group Inc.
 
Chapter3_Visualizations2.pdf
Chapter3_Visualizations2.pdfChapter3_Visualizations2.pdf
Chapter3_Visualizations2.pdf
MekiyaShigute1
 
Income Qualification ppt.pptx
Income Qualification ppt.pptxIncome Qualification ppt.pptx
Income Qualification ppt.pptx
ShilpaSweety2
 
OpenGL 4.4 Reference Card
OpenGL 4.4 Reference CardOpenGL 4.4 Reference Card
OpenGL 4.4 Reference Card
The Khronos Group Inc.
 
Opengl4 quick reference card
Opengl4 quick reference cardOpengl4 quick reference card
Opengl4 quick reference card
Adrien Wattez
 
Deep Dive: Amazon DynamoDB
Deep Dive: Amazon DynamoDBDeep Dive: Amazon DynamoDB
Deep Dive: Amazon DynamoDB
Amazon Web Services
 
Eventsourcing with PHP and MongoDB
Eventsourcing with PHP and MongoDBEventsourcing with PHP and MongoDB
Eventsourcing with PHP and MongoDB
Jacopo Nardiello
 
How to leverage what's new in MongoDB 3.6
How to leverage what's new in MongoDB 3.6How to leverage what's new in MongoDB 3.6
How to leverage what's new in MongoDB 3.6
Maxime Beugnet
 
Mini-lab 1: Stochastic Gradient Descent classifier, Optimizing Logistic Regre...
Mini-lab 1: Stochastic Gradient Descent classifier, Optimizing Logistic Regre...Mini-lab 1: Stochastic Gradient Descent classifier, Optimizing Logistic Regre...
Mini-lab 1: Stochastic Gradient Descent classifier, Optimizing Logistic Regre...
Yao Yao
 
AWS December 2015 Webinar Series - Design Patterns using Amazon DynamoDB
AWS December 2015 Webinar Series - Design Patterns using Amazon DynamoDBAWS December 2015 Webinar Series - Design Patterns using Amazon DynamoDB
AWS December 2015 Webinar Series - Design Patterns using Amazon DynamoDB
Amazon Web Services
 
February 2016 Webinar Series - Introduction to DynamoDB
February 2016 Webinar Series - Introduction to DynamoDBFebruary 2016 Webinar Series - Introduction to DynamoDB
February 2016 Webinar Series - Introduction to DynamoDB
Amazon Web Services
 
Advanced data modeling with apache cassandra
Advanced data modeling with apache cassandraAdvanced data modeling with apache cassandra
Advanced data modeling with apache cassandra
Patrick McFadin
 
Sorry - How Bieber broke Google Cloud at Spotify
Sorry - How Bieber broke Google Cloud at SpotifySorry - How Bieber broke Google Cloud at Spotify
Sorry - How Bieber broke Google Cloud at Spotify
Neville Li
 
The Ring programming language version 1.9 book - Part 58 of 210
The Ring programming language version 1.9 book - Part 58 of 210The Ring programming language version 1.9 book - Part 58 of 210
The Ring programming language version 1.9 book - Part 58 of 210
Mahmoud Samir Fayed
 
High-Performance Advanced Analytics with Spark-Alchemy
High-Performance Advanced Analytics with Spark-AlchemyHigh-Performance Advanced Analytics with Spark-Alchemy
High-Performance Advanced Analytics with Spark-Alchemy
Databricks
 
ELAVARASAN.pdf
ELAVARASAN.pdfELAVARASAN.pdf
ELAVARASAN.pdf
dharmendra kumar jaiswal
 

Similar to Lab 1: Data cleaning, exploration, removal of outliers, Correlation of Continuous Variables and Log Error (Target Variable), scatterplot analysis, adding new data features, Categorical and Continuous Feature Importance (20)

(DAT401) Amazon DynamoDB Deep Dive
(DAT401) Amazon DynamoDB Deep Dive(DAT401) Amazon DynamoDB Deep Dive
(DAT401) Amazon DynamoDB Deep Dive
 
Андрей Козлов (Altoros): Оптимизация производительности Cassandra
Андрей Козлов (Altoros): Оптимизация производительности CassandraАндрей Козлов (Altoros): Оптимизация производительности Cassandra
Андрей Козлов (Altoros): Оптимизация производительности Cassandra
 
Linear Model Selection and Regularization (Article 6 - Practical exercises)
Linear Model Selection and Regularization (Article 6 - Practical exercises)Linear Model Selection and Regularization (Article 6 - Practical exercises)
Linear Model Selection and Regularization (Article 6 - Practical exercises)
 
Test data article
Test data articleTest data article
Test data article
 
OpenGL 4.6 Reference Guide
OpenGL 4.6 Reference GuideOpenGL 4.6 Reference Guide
OpenGL 4.6 Reference Guide
 
Chapter3_Visualizations2.pdf
Chapter3_Visualizations2.pdfChapter3_Visualizations2.pdf
Chapter3_Visualizations2.pdf
 
Income Qualification ppt.pptx
Income Qualification ppt.pptxIncome Qualification ppt.pptx
Income Qualification ppt.pptx
 
OpenGL 4.4 Reference Card
OpenGL 4.4 Reference CardOpenGL 4.4 Reference Card
OpenGL 4.4 Reference Card
 
Opengl4 quick reference card
Opengl4 quick reference cardOpengl4 quick reference card
Opengl4 quick reference card
 
Deep Dive: Amazon DynamoDB
Deep Dive: Amazon DynamoDBDeep Dive: Amazon DynamoDB
Deep Dive: Amazon DynamoDB
 
Eventsourcing with PHP and MongoDB
Eventsourcing with PHP and MongoDBEventsourcing with PHP and MongoDB
Eventsourcing with PHP and MongoDB
 
How to leverage what's new in MongoDB 3.6
How to leverage what's new in MongoDB 3.6How to leverage what's new in MongoDB 3.6
How to leverage what's new in MongoDB 3.6
 
Mini-lab 1: Stochastic Gradient Descent classifier, Optimizing Logistic Regre...
Mini-lab 1: Stochastic Gradient Descent classifier, Optimizing Logistic Regre...Mini-lab 1: Stochastic Gradient Descent classifier, Optimizing Logistic Regre...
Mini-lab 1: Stochastic Gradient Descent classifier, Optimizing Logistic Regre...
 
AWS December 2015 Webinar Series - Design Patterns using Amazon DynamoDB
AWS December 2015 Webinar Series - Design Patterns using Amazon DynamoDBAWS December 2015 Webinar Series - Design Patterns using Amazon DynamoDB
AWS December 2015 Webinar Series - Design Patterns using Amazon DynamoDB
 
February 2016 Webinar Series - Introduction to DynamoDB
February 2016 Webinar Series - Introduction to DynamoDBFebruary 2016 Webinar Series - Introduction to DynamoDB
February 2016 Webinar Series - Introduction to DynamoDB
 
Advanced data modeling with apache cassandra
Advanced data modeling with apache cassandraAdvanced data modeling with apache cassandra
Advanced data modeling with apache cassandra
 
Sorry - How Bieber broke Google Cloud at Spotify
Sorry - How Bieber broke Google Cloud at SpotifySorry - How Bieber broke Google Cloud at Spotify
Sorry - How Bieber broke Google Cloud at Spotify
 
The Ring programming language version 1.9 book - Part 58 of 210
The Ring programming language version 1.9 book - Part 58 of 210The Ring programming language version 1.9 book - Part 58 of 210
The Ring programming language version 1.9 book - Part 58 of 210
 
High-Performance Advanced Analytics with Spark-Alchemy
High-Performance Advanced Analytics with Spark-AlchemyHigh-Performance Advanced Analytics with Spark-Alchemy
High-Performance Advanced Analytics with Spark-Alchemy
 
ELAVARASAN.pdf
ELAVARASAN.pdfELAVARASAN.pdf
ELAVARASAN.pdf
 

More from Yao Yao

Lessons after working as a data scientist for 1 year
Lessons after working as a data scientist for 1 yearLessons after working as a data scientist for 1 year
Lessons after working as a data scientist for 1 year
Yao Yao
 
Yao Yao MSDS Alum The Job Search Interview Offer Letter Experience for Data S...
Yao Yao MSDS Alum The Job Search Interview Offer Letter Experience for Data S...Yao Yao MSDS Alum The Job Search Interview Offer Letter Experience for Data S...
Yao Yao MSDS Alum The Job Search Interview Offer Letter Experience for Data S...
Yao Yao
 
Yelp's Review Filtering Algorithm Paper
Yelp's Review Filtering Algorithm PaperYelp's Review Filtering Algorithm Paper
Yelp's Review Filtering Algorithm Paper
Yao Yao
 
Yelp's Review Filtering Algorithm Poster
Yelp's Review Filtering Algorithm PosterYelp's Review Filtering Algorithm Poster
Yelp's Review Filtering Algorithm Poster
Yao Yao
 
Yelp's Review Filtering Algorithm Powerpoint
Yelp's Review Filtering Algorithm PowerpointYelp's Review Filtering Algorithm Powerpoint
Yelp's Review Filtering Algorithm Powerpoint
Yao Yao
 
Audio Separation Comparison: Clustering Repeating Period and Hidden Markov Model
Audio Separation Comparison: Clustering Repeating Period and Hidden Markov ModelAudio Separation Comparison: Clustering Repeating Period and Hidden Markov Model
Audio Separation Comparison: Clustering Repeating Period and Hidden Markov Model
Yao Yao
 
Audio Separation Comparison: Clustering Repeating Period and Hidden Markov Model
Audio Separation Comparison: Clustering Repeating Period and Hidden Markov ModelAudio Separation Comparison: Clustering Repeating Period and Hidden Markov Model
Audio Separation Comparison: Clustering Repeating Period and Hidden Markov Model
Yao Yao
 
Estimating the initial mean number of views for videos to be on youtube's tre...
Estimating the initial mean number of views for videos to be on youtube's tre...Estimating the initial mean number of views for videos to be on youtube's tre...
Estimating the initial mean number of views for videos to be on youtube's tre...
Yao Yao
 
Estimating the initial mean number of views for videos to be on youtube's tre...
Estimating the initial mean number of views for videos to be on youtube's tre...Estimating the initial mean number of views for videos to be on youtube's tre...
Estimating the initial mean number of views for videos to be on youtube's tre...
Yao Yao
 
Lab 2: Classification and Regression Prediction Models, training and testing ...
Lab 2: Classification and Regression Prediction Models, training and testing ...Lab 2: Classification and Regression Prediction Models, training and testing ...
Lab 2: Classification and Regression Prediction Models, training and testing ...
Yao Yao
 
Prediction of Future Employee Turnover via Logistic Regression
Prediction of Future Employee Turnover via Logistic RegressionPrediction of Future Employee Turnover via Logistic Regression
Prediction of Future Employee Turnover via Logistic Regression
Yao Yao
 
Data Reduction and Classification for Lumosity Data
Data Reduction and Classification for Lumosity DataData Reduction and Classification for Lumosity Data
Data Reduction and Classification for Lumosity Data
Yao Yao
 
Predicting Sales Price of Homes Using Multiple Linear Regression
Predicting Sales Price of Homes Using Multiple Linear RegressionPredicting Sales Price of Homes Using Multiple Linear Regression
Predicting Sales Price of Homes Using Multiple Linear Regression
Yao Yao
 
Teaching Apache Spark: Demonstrations on the Databricks Cloud Platform
Teaching Apache Spark: Demonstrations on the Databricks Cloud PlatformTeaching Apache Spark: Demonstrations on the Databricks Cloud Platform
Teaching Apache Spark: Demonstrations on the Databricks Cloud Platform
Yao Yao
 
Blockchain Security and Demonstration
Blockchain Security and DemonstrationBlockchain Security and Demonstration
Blockchain Security and Demonstration
Yao Yao
 
API Python Chess: Distribution of Chess Wins based on random moves
API Python Chess: Distribution of Chess Wins based on random movesAPI Python Chess: Distribution of Chess Wins based on random moves
API Python Chess: Distribution of Chess Wins based on random moves
Yao Yao
 
Teaching Apache Spark: Demonstrations on the Databricks Cloud Platform
Teaching Apache Spark: Demonstrations on the Databricks Cloud PlatformTeaching Apache Spark: Demonstrations on the Databricks Cloud Platform
Teaching Apache Spark: Demonstrations on the Databricks Cloud Platform
Yao Yao
 
Blockchain Security and Demonstration
Blockchain Security and DemonstrationBlockchain Security and Demonstration
Blockchain Security and Demonstration
Yao Yao
 

More from Yao Yao (18)

Lessons after working as a data scientist for 1 year
Lessons after working as a data scientist for 1 yearLessons after working as a data scientist for 1 year
Lessons after working as a data scientist for 1 year
 
Yao Yao MSDS Alum The Job Search Interview Offer Letter Experience for Data S...
Yao Yao MSDS Alum The Job Search Interview Offer Letter Experience for Data S...Yao Yao MSDS Alum The Job Search Interview Offer Letter Experience for Data S...
Yao Yao MSDS Alum The Job Search Interview Offer Letter Experience for Data S...
 
Yelp's Review Filtering Algorithm Paper
Yelp's Review Filtering Algorithm PaperYelp's Review Filtering Algorithm Paper
Yelp's Review Filtering Algorithm Paper
 
Yelp's Review Filtering Algorithm Poster
Yelp's Review Filtering Algorithm PosterYelp's Review Filtering Algorithm Poster
Yelp's Review Filtering Algorithm Poster
 
Yelp's Review Filtering Algorithm Powerpoint
Yelp's Review Filtering Algorithm PowerpointYelp's Review Filtering Algorithm Powerpoint
Yelp's Review Filtering Algorithm Powerpoint
 
Audio Separation Comparison: Clustering Repeating Period and Hidden Markov Model
Audio Separation Comparison: Clustering Repeating Period and Hidden Markov ModelAudio Separation Comparison: Clustering Repeating Period and Hidden Markov Model
Audio Separation Comparison: Clustering Repeating Period and Hidden Markov Model
 
Audio Separation Comparison: Clustering Repeating Period and Hidden Markov Model
Audio Separation Comparison: Clustering Repeating Period and Hidden Markov ModelAudio Separation Comparison: Clustering Repeating Period and Hidden Markov Model
Audio Separation Comparison: Clustering Repeating Period and Hidden Markov Model
 
Estimating the initial mean number of views for videos to be on youtube's tre...
Estimating the initial mean number of views for videos to be on youtube's tre...Estimating the initial mean number of views for videos to be on youtube's tre...
Estimating the initial mean number of views for videos to be on youtube's tre...
 
Estimating the initial mean number of views for videos to be on youtube's tre...
Estimating the initial mean number of views for videos to be on youtube's tre...Estimating the initial mean number of views for videos to be on youtube's tre...
Estimating the initial mean number of views for videos to be on youtube's tre...
 
Lab 2: Classification and Regression Prediction Models, training and testing ...
Lab 2: Classification and Regression Prediction Models, training and testing ...Lab 2: Classification and Regression Prediction Models, training and testing ...
Lab 2: Classification and Regression Prediction Models, training and testing ...
 
Prediction of Future Employee Turnover via Logistic Regression
Prediction of Future Employee Turnover via Logistic RegressionPrediction of Future Employee Turnover via Logistic Regression
Prediction of Future Employee Turnover via Logistic Regression
 
Data Reduction and Classification for Lumosity Data
Data Reduction and Classification for Lumosity DataData Reduction and Classification for Lumosity Data
Data Reduction and Classification for Lumosity Data
 
Predicting Sales Price of Homes Using Multiple Linear Regression
Predicting Sales Price of Homes Using Multiple Linear RegressionPredicting Sales Price of Homes Using Multiple Linear Regression
Predicting Sales Price of Homes Using Multiple Linear Regression
 
Teaching Apache Spark: Demonstrations on the Databricks Cloud Platform
Teaching Apache Spark: Demonstrations on the Databricks Cloud PlatformTeaching Apache Spark: Demonstrations on the Databricks Cloud Platform
Teaching Apache Spark: Demonstrations on the Databricks Cloud Platform
 
Blockchain Security and Demonstration
Blockchain Security and DemonstrationBlockchain Security and Demonstration
Blockchain Security and Demonstration
 
API Python Chess: Distribution of Chess Wins based on random moves
API Python Chess: Distribution of Chess Wins based on random movesAPI Python Chess: Distribution of Chess Wins based on random moves
API Python Chess: Distribution of Chess Wins based on random moves
 
Teaching Apache Spark: Demonstrations on the Databricks Cloud Platform
Teaching Apache Spark: Demonstrations on the Databricks Cloud PlatformTeaching Apache Spark: Demonstrations on the Databricks Cloud Platform
Teaching Apache Spark: Demonstrations on the Databricks Cloud Platform
 
Blockchain Security and Demonstration
Blockchain Security and DemonstrationBlockchain Security and Demonstration
Blockchain Security and Demonstration
 

Recently uploaded

Population Growth in Bataan: The effects of population growth around rural pl...
Population Growth in Bataan: The effects of population growth around rural pl...Population Growth in Bataan: The effects of population growth around rural pl...
Population Growth in Bataan: The effects of population growth around rural pl...
Bill641377
 
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
slg6lamcq
 
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
Timothy Spann
 
一比一原版(UO毕业证)渥太华大学毕业证如何办理
一比一原版(UO毕业证)渥太华大学毕业证如何办理一比一原版(UO毕业证)渥太华大学毕业证如何办理
一比一原版(UO毕业证)渥太华大学毕业证如何办理
aqzctr7x
 
一比一原版(Dalhousie毕业证书)达尔豪斯大学毕业证如何办理
一比一原版(Dalhousie毕业证书)达尔豪斯大学毕业证如何办理一比一原版(Dalhousie毕业证书)达尔豪斯大学毕业证如何办理
一比一原版(Dalhousie毕业证书)达尔豪斯大学毕业证如何办理
mzpolocfi
 
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...
Social Samosa
 
The Building Blocks of QuestDB, a Time Series Database
The Building Blocks of QuestDB, a Time Series DatabaseThe Building Blocks of QuestDB, a Time Series Database
The Building Blocks of QuestDB, a Time Series Database
javier ramirez
 
Natural Language Processing (NLP), RAG and its applications .pptx
Natural Language Processing (NLP), RAG and its applications .pptxNatural Language Processing (NLP), RAG and its applications .pptx
Natural Language Processing (NLP), RAG and its applications .pptx
fkyes25
 
一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理
一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理
一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理
74nqk8xf
 
End-to-end pipeline agility - Berlin Buzzwords 2024
End-to-end pipeline agility - Berlin Buzzwords 2024End-to-end pipeline agility - Berlin Buzzwords 2024
End-to-end pipeline agility - Berlin Buzzwords 2024
Lars Albertsson
 
Learn SQL from basic queries to Advance queries
Learn SQL from basic queries to Advance queriesLearn SQL from basic queries to Advance queries
Learn SQL from basic queries to Advance queries
manishkhaire30
 
My burning issue is homelessness K.C.M.O.
My burning issue is homelessness K.C.M.O.My burning issue is homelessness K.C.M.O.
My burning issue is homelessness K.C.M.O.
rwarrenll
 
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
ahzuo
 
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
v7oacc3l
 
University of New South Wales degree offer diploma Transcript
University of New South Wales degree offer diploma TranscriptUniversity of New South Wales degree offer diploma Transcript
University of New South Wales degree offer diploma Transcript
soxrziqu
 
原版制作(swinburne毕业证书)斯威本科技大学毕业证毕业完成信一模一样
原版制作(swinburne毕业证书)斯威本科技大学毕业证毕业完成信一模一样原版制作(swinburne毕业证书)斯威本科技大学毕业证毕业完成信一模一样
原版制作(swinburne毕业证书)斯威本科技大学毕业证毕业完成信一模一样
u86oixdj
 
Intelligence supported media monitoring in veterinary medicine
Intelligence supported media monitoring in veterinary medicineIntelligence supported media monitoring in veterinary medicine
Intelligence supported media monitoring in veterinary medicine
AndrzejJarynowski
 
一比一原版(Chester毕业证书)切斯特大学毕业证如何办理
一比一原版(Chester毕业证书)切斯特大学毕业证如何办理一比一原版(Chester毕业证书)切斯特大学毕业证如何办理
一比一原版(Chester毕业证书)切斯特大学毕业证如何办理
74nqk8xf
 
State of Artificial intelligence Report 2023
State of Artificial intelligence Report 2023State of Artificial intelligence Report 2023
State of Artificial intelligence Report 2023
kuntobimo2016
 
Everything you wanted to know about LIHTC
Everything you wanted to know about LIHTCEverything you wanted to know about LIHTC
Everything you wanted to know about LIHTC
Roger Valdez
 

Recently uploaded (20)

Population Growth in Bataan: The effects of population growth around rural pl...
Population Growth in Bataan: The effects of population growth around rural pl...Population Growth in Bataan: The effects of population growth around rural pl...
Population Growth in Bataan: The effects of population growth around rural pl...
 
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
 
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
 
一比一原版(UO毕业证)渥太华大学毕业证如何办理
一比一原版(UO毕业证)渥太华大学毕业证如何办理一比一原版(UO毕业证)渥太华大学毕业证如何办理
一比一原版(UO毕业证)渥太华大学毕业证如何办理
 
一比一原版(Dalhousie毕业证书)达尔豪斯大学毕业证如何办理
一比一原版(Dalhousie毕业证书)达尔豪斯大学毕业证如何办理一比一原版(Dalhousie毕业证书)达尔豪斯大学毕业证如何办理
一比一原版(Dalhousie毕业证书)达尔豪斯大学毕业证如何办理
 
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...
 
The Building Blocks of QuestDB, a Time Series Database
The Building Blocks of QuestDB, a Time Series DatabaseThe Building Blocks of QuestDB, a Time Series Database
The Building Blocks of QuestDB, a Time Series Database
 
Natural Language Processing (NLP), RAG and its applications .pptx
Natural Language Processing (NLP), RAG and its applications .pptxNatural Language Processing (NLP), RAG and its applications .pptx
Natural Language Processing (NLP), RAG and its applications .pptx
 
一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理
一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理
一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理
 
End-to-end pipeline agility - Berlin Buzzwords 2024
End-to-end pipeline agility - Berlin Buzzwords 2024End-to-end pipeline agility - Berlin Buzzwords 2024
End-to-end pipeline agility - Berlin Buzzwords 2024
 
Learn SQL from basic queries to Advance queries
Learn SQL from basic queries to Advance queriesLearn SQL from basic queries to Advance queries
Learn SQL from basic queries to Advance queries
 
My burning issue is homelessness K.C.M.O.
My burning issue is homelessness K.C.M.O.My burning issue is homelessness K.C.M.O.
My burning issue is homelessness K.C.M.O.
 
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
 
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
 
University of New South Wales degree offer diploma Transcript
University of New South Wales degree offer diploma TranscriptUniversity of New South Wales degree offer diploma Transcript
University of New South Wales degree offer diploma Transcript
 
原版制作(swinburne毕业证书)斯威本科技大学毕业证毕业完成信一模一样
原版制作(swinburne毕业证书)斯威本科技大学毕业证毕业完成信一模一样原版制作(swinburne毕业证书)斯威本科技大学毕业证毕业完成信一模一样
原版制作(swinburne毕业证书)斯威本科技大学毕业证毕业完成信一模一样
 
Intelligence supported media monitoring in veterinary medicine
Intelligence supported media monitoring in veterinary medicineIntelligence supported media monitoring in veterinary medicine
Intelligence supported media monitoring in veterinary medicine
 
一比一原版(Chester毕业证书)切斯特大学毕业证如何办理
一比一原版(Chester毕业证书)切斯特大学毕业证如何办理一比一原版(Chester毕业证书)切斯特大学毕业证如何办理
一比一原版(Chester毕业证书)切斯特大学毕业证如何办理
 
State of Artificial intelligence Report 2023
State of Artificial intelligence Report 2023State of Artificial intelligence Report 2023
State of Artificial intelligence Report 2023
 
Everything you wanted to know about LIHTC
Everything you wanted to know about LIHTCEverything you wanted to know about LIHTC
Everything you wanted to know about LIHTC
 

Lab 1: Data cleaning, exploration, removal of outliers, Correlation of Continuous Variables and Log Error (Target Variable), scatterplot analysis, adding new data features, Categorical and Continuous Feature Importance

  • 1. 1/12/2018 lab1 http://localhost:8888/notebooks/Documents/GitHub/data_mining_group_project-master/notebooks/Lab%201/lab1.ipynb 1/63 Zillow Dataset Analysis and Visualization MSDS 7331 Data Mining - Section 403 - Lab 1 Team: Ivelin Angelov, Yao Yao, Kaitlin Kirasich, Albert Asuncion Business Understanding 10 points Description: Describe the purpose of the data set you selected (i.e., why was this data collected in the first place?). Describe how you would define and measure the outcomes from the dataset. That is, why is this data important and how do you know if you have mined useful knowledge from the dataset? How would you measure the effectiveness of a good prediction algorithm? Be specific. Answer: Origin and purpose of dataset This is a dataset from a Kaggle competition: "Zillow Prize: Zillow’s Home Value Prediction (Zestimate)". To download all accompanied dataset refer to this link: https://www.kaggle.com/c/zillow-prize-1/data (https://www.kaggle.com/c/zillow-prize-1/data) Note: The dataset has 2985217 rows and 58 columns and it requires at least 2GB of free RAM to load. Zillow, a leading real estate and rental marketplace platform, developed a model to estimate the property price based on property features, which they call the "Zestimate". As with every real world model, the Zestimate has some error associated with it. Zestimates are estimated home values based on 7.5 million statistical and machine learning models that analyze hundreds of data points on each property. The purpose of this dataset and Kaggle competition is to minimize the error between the Zestimate (what we will predict) and the actual sale price, given certain features of a home. Description of dataset We are provided with a full dataset of real estate properties in three counties in California: Los Angeles, Orange, and Ventura in 2016. The dataset contains: ID for the listing 57 variables describing the property features such as the number of bedrooms and various measurements in square feet Two resulting variables: logerror and transactiondate The dataset has two parts:
  • 2. 1/12/2018 lab1 http://localhost:8888/notebooks/Documents/GitHub/data_mining_group_project-master/notebooks/Lab%201/lab1.ipynb 2/63 Training data (90275 rows), which contains logerror and transactiondate and has all the transactions before October 15, 2016, plus some of the transactions after October 15, 2016. Testing data (2895067 rows), which contains the rest of the transactions between October 15 and December 31, 2016. A successful measure of how well we predict log error will be how well we can clean and train our data measured by our placement in the Kaggle competition. Kaggle measures the effectiveness of a good prediction algorithm by taking the log error of the Zestimate and the actual sales price. The log error is defined as: where logerror < 0 represent Zestimates lower than the actual sell price and logerror > 0 represent Zestimates higher than the actual sell price. Our notebook This notebook is an exploratory analysis for the dataset described above. Our study is organized as follows: Data Meaning Data Quality (EDA) Review of variables Identification of missing values and outliers Data cleansing Visualizations Simple Statistics Visualize Attributes Explore Joint Attributes Explore Attributes and Classes New Features Exceptional Work References/Citations Conclusion From the correlation table, random forest, and linear regression feature importance, we found out that regionidzip, calculatedfinishedsquarefeet, bedroomcount, censustractandblock, regionidneighborhood, and taxdelinquencyyear are the most important variables towards building our prediction model. Future work In the future lab notebooks, we will predict the logerror from a regression model. To measure the effectiveness of a good prediction algorithm, we will first apply cross-validation by splitting the training dataset to training, validation, and testing to model our prediction error. A final prediction error will be given by Kaggle when we submit our predictions to the competition. logerror =log(Zestimate)−log(SalePrice)
  • 3. 1/12/2018 lab1 http://localhost:8888/notebooks/Documents/GitHub/data_mining_group_project-master/notebooks/Lab%201/lab1.ipynb 3/63 In [1]: Data Meaning 10 points Description: Describe the meaning and type of data (scale, values, etc.) for each attribute in the data file. Below is a table of all of the variables in the dataset. We list the variable name, type of data, scale, and a description. /usr/local/lib/python2.7/site-packages/matplotlib/__init__.py:878: UserWarning: axes.color_cycle is deprecated and replaced with axes.prop_cycle; please use th e latter. warnings.warn(self.msg_depr % (key, alt_key)) Out[1]: 'The dataset has 2985342 rows and 60 columns' %matplotlib inline import numpy as np import pandas as pd import matplotlib.pyplot as plt import seaborn as sns import warnings warnings.filterwarnings('ignore') # load datasets here: train_data = pd.read_csv('../input/train_2016_v2.csv') data = pd.read_csv('../input/properties_2016.csv', low_memory=False) data = pd.merge(data, train_data, how='left', on='parcelid') 'The dataset has %d rows and %d columns' % data.shape
  • 4. 1/12/2018 lab1 http://localhost:8888/notebooks/Documents/GitHub/data_mining_group_project-master/notebooks/Lab%201/lab1.ipynb 4/63 In [2]: from IPython.display import display, HTML variables_description = [ ['airconditioningtypeid', 'nominal', 'TBD', 'Type of cooling system present in the ,['architecturalstyletypeid', 'nominal', 'TBD', 'Architectural style of the home ,['assessmentyear', 'interval', 'TBD', 'The year of the property tax assessment'] ,['basementsqft', 'ratio', 'TBD', 'Finished living area below or partially below g ,['bathroomcnt', 'ordinal', 'TBD', 'Number of bathrooms in home including fractio ,['bedroomcnt', 'ordinal', 'TBD', 'Number of bedrooms in home'] ,['buildingclasstypeid', 'nominal', 'TBD', 'The building framing type (steel frame ,['buildingqualitytypeid', 'ordinal', 'TBD', 'Overall assessment of condition of t ,['calculatedbathnbr', 'ordinal', 'TBD', 'Number of bathrooms in home including f ,['calculatedfinishedsquarefeet', 'ratio', 'TBD', 'Calculated total finished livi ,['censustractandblock', 'nominal', 'TBD', 'Census tract and block ID combined - ,['decktypeid', 'nominal', 'TBD', 'Type of deck (if any) present on parcel'] ,['finishedfloor1squarefeet', 'ratio', 'TBD', 'Size of the finished living area o ,['finishedsquarefeet12', 'ratio', 'TBD', 'Finished living area'] ,['finishedsquarefeet13', 'ratio', 'TBD', 'Perimeter living area'] ,['finishedsquarefeet15', 'ratio', 'TBD', 'Total area'] ,['finishedsquarefeet50', 'ratio', 'TBD', 'Size of the finished living area on the ,['finishedsquarefeet6', 'ratio', 'TBD', 'Base unfinished and finished area'] ,['fips', 'nominal', 'TBD', 'Federal Information Processing Standard code - see ht ,['fireplacecnt', 'ordinal', 'TBD', 'Number of fireplaces in a home (if any)'] ,['fireplaceflag', 'ordinal', 'TBD', 'Is a fireplace present in this home'] ,['fullbathcnt', 'ordinal', 'TBD', 'Number of full bathrooms (sink, shower + batht ,['garagecarcnt', 'ordinal', 'TBD', 'Total number of garages on the lot including ,['garagetotalsqft', 'ratio', 'TBD', 'Total number of square feet of all garages o ,['hashottuborspa', 'ordinal', 'TBD', 'Does the home have a hot tub or spa'] ,['heatingorsystemtypeid', 'nominal', 'TBD', 'Type of home heating system'] ,['landtaxvaluedollarcnt', 'ratio', 'TBD', 'The assessed value of the land area of ,['latitude', 'interval', 'TBD', 'Latitude of the middle of the parcel multiplied ,['logerror', 'interval', 'TBD', 'Error or the Zillow model response variable'] ,['longitude', 'interval', 'TBD', 'Longitude of the middle of the parcel multiplie ,['lotsizesquarefeet', 'ratio', 'TBD', 'Area of the lot in square feet'] ,['numberofstories', 'ordinal', 'TBD', 'Number of stories or levels the home has' ,['parcelid', 'nominal', 'TBD', 'Unique identifier for parcels (lots)'] ,['poolcnt', 'ordinal', 'TBD', 'Number of pools on the lot (if any)'] ,['poolsizesum', 'ratio', 'TBD', 'Total square footage of all pools on property'] ,['pooltypeid10', 'nominal', 'TBD', 'Spa or Hot Tub'] ,['pooltypeid2', 'nominal', 'TBD', 'Pool with Spa/Hot Tub'] ,['pooltypeid7', 'nominal', 'TBD', 'Pool without hot tub'] ,['propertycountylandusecode', 'nominal', 'TBD', 'County land use code i.e. it's ,['propertylandusetypeid', 'nominal', 'TBD', 'Type of land use the property is zo ,['propertyzoningdesc', 'nominal', 'TBD', 'Description of the allowed land uses ( ,['rawcensustractandblock', 'nominal', 'TBD', 'Census tract and block ID combined ,['regionidcity', 'nominal', 'TBD', 'City in which the property is located (if any ,['regionidcounty', 'nominal', 'TBD', 'County in which the property is located'] ,['regionidneighborhood', 'nominal', 'TBD', 'Neighborhood in which the property i ,['regionidzip', 'nominal', 'TBD', 'Zip code in which the property is located'] ,['roomcnt', 'ordinal', 'TBD', 'Total number of rooms in the principal residence' ,['storytypeid', 'nominal', 'TBD', 'Type of floors in a multi-story house (i.e. b ,['structuretaxvaluedollarcnt', 'ratio', 'TBD', 'The assessed value of the built ,['taxamount', 'ratio', 'TBD', 'The total property tax assessed for that assessme ,['taxdelinquencyflag', 'nominal', 'TBD', 'Property taxes for this parcel are past ,['taxdelinquencyyear', 'interval', 'TBD', 'Year'] ,['taxvaluedollarcnt', 'ratio', 'TBD', 'The total tax assessed value of the parce
  • 5. 1/12/2018 lab1 http://localhost:8888/notebooks/Documents/GitHub/data_mining_group_project-master/notebooks/Lab%201/lab1.ipynb 5/63 Out[2]: Variable Type Scale Description airconditioningtypeid nominal [nan, 1.0, 13.0, 5.0, 11.0, 9.0, 12.0, 3.0] Type of cooling system present in the h any) architecturalstyletypeid nominal [nan, 7.0, 21.0, 8.0, 2.0, 3.0, 5.0, 10.0, 27.0] Architectural style of the home (i.e. ran colonial, split-level, etc…) assessmentyear interval (2000, 2016) The year of the property tax assessme basementsqft ratio (20, 8516) Finished living area below or partially b ground level bathroomcnt ordinal [0.0, 2.0, 4.0, 3.0, 1.0, ... (38 More)] Number of bathrooms in home includin fractional bathrooms bedroomcnt ordinal [0.0, 4.0, 5.0, 2.0, 3.0, ... (22 More)] Number of bedrooms in home buildingclasstypeid nominal [nan, 3.0, 4.0, 5.0, 2.0, 1.0] The building framing type (steel frame, frame, concrete/brick) ,['threequarterbathnbr', 'ordinal', 'TBD', 'Number of 3/4 bathrooms in house (show ,['transactiondate', 'nominal', 'TBD', 'Date of the transaction response variable ,['typeconstructiontypeid', 'nominal', 'TBD', 'What type of construction material ,['unitcnt', 'ordinal', 'TBD', 'Number of units the structure is built into (i.e. ,['yardbuildingsqft17', 'interval', 'TBD', 'Patio in yard'] ,['yardbuildingsqft26', 'interval', 'TBD', 'Storage shed/building in yard'] ,['yearbuilt', 'interval', 'TBD', 'The Year the principal residence was built'] ] variables = pd.DataFrame(variables_description, columns=['name', 'type', 'scale', variables = variables.set_index('name') variables = variables.loc[data.columns] def output_variables_table(variables): variables = variables.sort_index() rows = ['<tr><th>Variable</th><th>Type</th><th>Scale</th><th>Description</th> for vname, atts in variables.iterrows(): atts = atts.to_dict() # add scale if TBD if atts['scale'] == 'TBD': if atts['type'] in ['nominal', 'ordinal']: uniques = data[vname].unique() uniques = list(uniques.astype(str)) if len(uniques) < 10: atts['scale'] = '[%s]' % ', '.join(uniques) else: atts['scale'] = '[%s]' % (', '.join(uniques[:5]) + ', ... (%d if atts['type'] in ['ratio', 'interval']: atts['scale'] = '(%d, %d)' % (data[vname].min(), data[vname].max( row = (vname, atts['type'], atts['scale'], atts['description']) rows.append('<tr><td>%s</td><td>%s</td><td>%s</td><td>%s</td></tr>' % row return HTML('<table>%s</table>' % ''.join(rows)) output_variables_table(variables)
  • 6. 1/12/2018 lab1 http://localhost:8888/notebooks/Documents/GitHub/data_mining_group_project-master/notebooks/Lab%201/lab1.ipynb 6/63 buildingqualitytypeid ordinal [nan, 7.0, 4.0, 10.0, 1.0, ... (13 More)] Overall assessment of condition of the from best (lowest) to worst (highest) calculatedbathnbr ordinal [nan, 2.0, 4.0, 3.0, 1.0, ... (35 More)] Number of bathrooms in home includin fractional bathroom calculatedfinishedsquarefeet ratio (1, 952576) Calculated total finished living area of t home censustractandblock nominal [nan, 6.1110010011e+13, 6.1110009032e+13, 6.1110010024e+13, 6.1110010023e+13, ... (96772 More)] Census tract and block ID combined - contains blockgroup assignment by ex decktypeid nominal [nan, 66.0] Type of deck (if any) present on parcel finishedfloor1squarefeet ratio (3, 31303) Size of the finished living area on the f (entry) floor of the home finishedsquarefeet12 ratio (1, 290345) Finished living area finishedsquarefeet13 ratio (120, 2688) Perimeter living area finishedsquarefeet15 ratio (112, 820242) Total area finishedsquarefeet50 ratio (3, 31303) Size of the finished living area on the f (entry) floor of the home finishedsquarefeet6 ratio (117, 952576) Base unfinished and finished area fips nominal [6037.0, 6059.0, 6111.0, nan] Federal Information Processing Standa - see https://en.wikipedia.org/wiki/FIPS_coun for more details fireplacecnt ordinal [nan, 3.0, 1.0, 2.0, 4.0, ... (10 More)] Number of fireplaces in a home (if any fireplaceflag ordinal [nan, True] Is a fireplace present in this home fullbathcnt ordinal [nan, 2.0, 4.0, 3.0, 1.0, ... (21 More)] Number of full bathrooms (sink, showe bathtub, and toilet) present in home garagecarcnt ordinal [nan, 2.0, 4.0, 1.0, 3.0, ... (25 More)] Total number of garages on the lot incl attached garage garagetotalsqft ratio (0, 7749) Total number of square feet of all garag lot including an attached garage hashottuborspa ordinal [nan, True] Does the home have a hot tub or spa heatingorsystemtypeid nominal [nan, 2.0, 7.0, 20.0, 6.0, ... (15 More)] Type of home heating system landtaxvaluedollarcnt ratio (1, 90246219) The assessed value of the land area o parcel
  • 7. 1/12/2018 lab1 http://localhost:8888/notebooks/Documents/GitHub/data_mining_group_project-master/notebooks/Lab%201/lab1.ipynb 7/63 latitude interval (33324388, 34819650) Latitude of the middle of the parcel mu by 10e6 logerror interval (-4, 4) Error or the Zillow model response var longitude interval (-119475780, -117554316) Longitude of the middle of the parcel m by 10e6 lotsizesquarefeet ratio (100, 328263808) Area of the lot in square feet numberofstories ordinal [nan, 1.0, 4.0, 2.0, 3.0, ... (13 More)] Number of stories or levels the home h parcelid nominal [10754147, 10759547, 10843547, 10859147, 10879947, ... (2985217 More)] Unique identifier for parcels (lots) poolcnt ordinal [nan, 1.0] Number of pools on the lot (if any) poolsizesum ratio (19, 17410) Total square footage of all pools on pro pooltypeid10 nominal [nan, 1.0] Spa or Hot Tub pooltypeid2 nominal [nan, 1.0] Pool with Spa/Hot Tub pooltypeid7 nominal [nan, 1.0] Pool without hot tub propertycountylandusecode nominal [010D, 0109, 1200, 1210, 010V, ... (241 More)] County land use code i.e. it's zoning at county level propertylandusetypeid nominal [269.0, 261.0, 47.0, 31.0, 260.0, ... (16 More)] Type of land use the property is zoned propertyzoningdesc nominal [nan, LCA11*, LAC2, LAM1, LAC4, ... (5639 More)] Description of the allowed land uses (z for that property rawcensustractandblock nominal [60378002.041, 60378001.011, 60377030.012, 60371412.023, 60371232.052, ... (99394 More)] Census tract and block ID combined - contains blockgroup assignment by ex regionidcity nominal [37688.0, 51617.0, 12447.0, 396054.0, 47547.0, ... (187 More)] City in which the property is located (if regionidcounty nominal [3101.0, 1286.0, 2061.0, nan] County in which the property is located
  • 8. 1/12/2018 lab1 http://localhost:8888/notebooks/Documents/GitHub/data_mining_group_project-master/notebooks/Lab%201/lab1.ipynb 8/63 Data Quality 15 points regionidneighborhood nominal [nan, 27080.0, 46795.0, 274049.0, 31817.0, ... (529 More)] Neighborhood in which the property is regionidzip nominal [96337.0, 96095.0, 96424.0, 96450.0, 96446.0, ... (406 More)] Zip code in which the property is locate roomcnt ordinal [0.0, 8.0, 4.0, 5.0, 7.0, ... (37 More)] Total number of rooms in the principal residence storytypeid nominal [nan, 7.0] Type of floors in a multi-story house (i.e basement and main level, split-level, a etc.). See tab for details. structuretaxvaluedollarcnt ratio (1, 251486000) The assessed value of the built structu the parcel taxamount ratio (1, 3458861) The total property tax assessed for tha assessment year taxdelinquencyflag nominal [nan, Y] Property taxes for this parcel are past d of 2015 taxdelinquencyyear interval (0, 99) Year taxvaluedollarcnt ratio (1, 282786000) The total tax assessed value of the par threequarterbathnbr ordinal [nan, 1.0, 2.0, 4.0, 3.0, 6.0, 5.0, 7.0] Number of 3/4 bathrooms in house (sh sink + toilet) transactiondate nominal [nan, 2016-01-27, 2016-03-30, 2016- 05-27, 2016-06-07, ... (353 More)] Date of the transaction response varia typeconstructiontypeid nominal [nan, 6.0, 4.0, 10.0, 13.0, 11.0] What type of construction material was construct the home unitcnt ordinal [nan, 2.0, 1.0, 3.0, 5.0, ... (147 More)] Number of units the structure is built in = duplex, 3 = triplex, etc...) yardbuildingsqft17 interval (10, 7983) Patio in yard yardbuildingsqft26 interval (10, 6141) Storage shed/building in yard yearbuilt interval (1801, 2015) The Year the principal residence was b
  • 9. 1/12/2018 lab1 http://localhost:8888/notebooks/Documents/GitHub/data_mining_group_project-master/notebooks/Lab%201/lab1.ipynb 9/63 Description: Verify data quality: Explain any missing values, duplicate data, and outliers. Are those mistakes? How do you deal with these problems? Give justifications for your methods. Examining Distribution of Missing Values From the observations, most of the rows have about 30 missing values. For the observations that have 57 missing values, it means that most of the features are missing and we choose to remove those. We will add in values to those missing where appropriate, below. In [3]: All observations have a value for parcelid In [4]: 0.38 percent of the data has only parcelid present and all other variables Out[4]: 0 plt.rcParams['figure.figsize'] = [10, 7] number_missing_per_row = data.isnull().sum(axis=1) sns.distplot(number_missing_per_row, color="#34495e", kde=False); plt.title('Distribution of Missing Values', fontsize=15) plt.xlabel('Number of Missing Values', fontsize=15) plt.ylabel('Number of Rows', fontsize=15); data['parcelid'].isnull().sum()
  • 10. 1/12/2018 lab1 http://localhost:8888/notebooks/Documents/GitHub/data_mining_group_project-master/notebooks/Lab%201/lab1.ipynb 10/63 missing We choose to remove those observations because they don't present any value In [5]: Table of Missing Values Of the available variables, here is a table that describes the number of missing values as well as the percent missing. (0.0, 'percent of the data has no data features outside of parcelid') print(round(len(number_missing_per_row[number_missing_per_row >= 57]) / len(data) data = data[number_missing_per_row < 57]
  • 11. 1/12/2018 lab1 http://localhost:8888/notebooks/Documents/GitHub/data_mining_group_project-master/notebooks/Lab%201/lab1.ipynb 11/63 In [9]: Out[9]: Variable Name Number Missing Values Percent Missing 0 parcelid 0 0.000000 1 airconditioningtypeid 2162353 72.710897 2 architecturalstyletypeid 2967843 99.796160 3 basementsqft 2972277 99.945257 4 bathroomcnt 25 0.000841 5 bedroomcnt 13 0.000437 6 buildingclasstypeid 2961276 99.575339 7 buildingqualitytypeid 1035337 34.814058 8 calculatedbathnbr 117481 3.950395 9 decktypeid 2956809 99.425133 10 finishedfloor1squarefeet 2771182 93.183272 11 calculatedfinishedsquarefeet 44131 1.483941 12 finishedsquarefeet12 264610 8.897729 13 finishedsquarefeet13 2966233 99.742023 14 finishedsquarefeet15 2783098 93.583958 15 finishedsquarefeet50 2771182 93.183272 16 finishedsquarefeet6 2951902 99.260131 17 fips 0 0.000000 18 fireplacecnt 2661258 89.486988 19 fullbathcnt 117481 3.950395 20 garagecarcnt 2090598 70.298076 21 garagetotalsqft 2090598 70.298076 22 hashottuborspa 2904889 97.679280 23 heatingorsystemtypeid 1167429 39.255760 24 latitude 0 0.000000 25 longitude 0 0.000000 26 lotsizesquarefeet 264676 8.899948 27 poolcnt 2456346 82.596653 28 poolsizesum 2945942 99.059721 missing_values = data.isnull().sum().reset_index() missing_values.columns = ['Variable Name', 'Number Missing Values'] missing_values['Percent Missing'] = missing_values['Number Missing Values'] / len missing_values['Percent Missing'] = missing_values['Percent Missing'].replace(np. missing_values
  • 12. 1/12/2018 lab1 http://localhost:8888/notebooks/Documents/GitHub/data_mining_group_project-master/notebooks/Lab%201/lab1.ipynb 12/63 Variable Name Number Missing Values Percent Missing 29 pooltypeid10 2936964 98.757829 30 pooltypeid2 2941830 98.921452 31 pooltypeid7 2488421 83.675201 32 propertycountylandusecode 840 0.028246 33 propertylandusetypeid 0 0.000000 34 propertyzoningdesc 995195 33.464250 35 rawcensustractandblock 0 0.000000 36 regionidcity 51410 1.728704 37 regionidcounty 0 0.000000 38 regionidneighborhood 1817447 61.113149 39 regionidzip 2543 0.085510 40 roomcnt 38 0.001278 41 storytypeid 2972281 99.945392 42 threequarterbathnbr 2662261 89.520714 43 typeconstructiontypeid 2967157 99.773093 44 unitcnt 996333 33.502516 45 yardbuildingsqft17 2893549 97.297963 46 yardbuildingsqft26 2971258 99.910992 47 yearbuilt 48494 1.630651 48 numberofstories 2291806 77.063860 49 fireplaceflag 2968740 99.826323 50 structuretaxvaluedollarcnt 43547 1.464304 51 taxvaluedollarcnt 31113 1.046200 52 assessmentyear 2 0.000067 53 landtaxvaluedollarcnt 56296 1.892999 54 taxamount 19813 0.666228 55 taxdelinquencyflag 2917435 98.101150 56 taxdelinquencyyear 2917433 98.101083 57 censustractandblock 63691 2.141662 58 logerror 2883630 96.964429 59 transactiondate 2883630 96.964429
  • 13. 1/12/2018 lab1 http://localhost:8888/notebooks/Documents/GitHub/data_mining_group_project-master/notebooks/Lab%201/lab1.ipynb 13/63 Examining Variables for Missing Values and Outliers For variables that are nominal, ratio, and interval where appropriate, we wrote a function that replaces outliers 5 standard deviations from the mean and assigning them as 5 standard deviations above or below the mean, respectively. In [10]: Variable: airconditioningtypeid - Type of cooling system present in the home (if any) Has datatype: nominal and 72.710860 percent of values missing For this variable, missing values indicate the absence of a cooling system. We replace all missing values with 0 to represent no cooling system. We changed the column datatype to integer. In [11]: Variable: architecturalstyletypeid - Architectural style of the home (i.e. ranch, colonial, split-level, etc…) Has datatype: nominal and 99.796185 percent of values missing Architectural style describes the home design. As such, it is not something we can extrapolate a value for. With over 99% of values missing, we decided to eliminate this variable. ('Before', array([ nan, 1., 13., 5., 11., 9., 12., 3.])) ('After', array([ 0, 1, 13, 5, 11, 9, 12, 3])) def fix_outliers(data, column): mean = data[column].mean() std = data[column].std() max_value = mean + std * 5 min_value = mean - std * 5 if data[column].max() < max_value and data[column].min() > min_value: print('No outliers found') return print('Outliers found!') f, ((ax0, ax1), (ax2, ax3)) = plt.subplots(nrows=2, ncols=2, figsize=[15, 7]) f.subplots_adjust(hspace=.4) sns.boxplot(data[column].dropna(), ax=ax0, color="#34495e").set_title('Before sns.distplot(data[column].dropna(), ax=ax2, color="#34495e").set_title('Before data.loc[data[column] > max_value, column] = max_value data.loc[data[column] < min_value, column] = min_value sns.boxplot(data[column].dropna(), ax=ax1, color="#34495e").set_title('After' sns.distplot(data[column].dropna(), ax=ax3, color="#34495e").set_title('After print('Before', data['airconditioningtypeid'].unique()) data['airconditioningtypeid'] = data['airconditioningtypeid'].fillna(0).astype(np print('After', data['airconditioningtypeid'].unique())
  • 14. 1/12/2018 lab1 http://localhost:8888/notebooks/Documents/GitHub/data_mining_group_project-master/notebooks/Lab%201/lab1.ipynb 14/63 In [12]: Variable: assessmentyear - year of the property tax assessment Has datatype: interval and has 2 values missing We replaced the missing values with the latest tax year which also happens to be the median tax year. We changed the column datatype to integer. In [13]: Variable: basementsqft - Finished living area below or partially below ground level Has datatype: ratio and 99.945255 percent of values missing Basements are not standard home features. Whenever a basement is not a feature of the home, the value for area was entered as a missing value. With over 99% of values missing, we decided to eliminate this variable. In [14]: Variable: bathroomcnt - Number of bathrooms in home including fractional bathrooms Has datatype: ordinal and 0.000841 percent of values missing We decided it is potentially possible for the property to not have a bathroom so we decided to replace missing values with zeros since there are only very few. We changed the column datatype to a float. ('Before', array([ 2015., 2014., 2003., 2012., 2001., 2011., 2013., 201 6., 2010., nan, 2004., 2005., 2002., 2000., 2009.])) ('After', array([2015, 2014, 2003, 2012, 2001, 2011, 2013, 2016, 2010, 2004, 20 05, 2002, 2000, 2009])) del data['architecturalstyletypeid'] print('Before', data['assessmentyear'].unique()) median_value = data['assessmentyear'].median() data['assessmentyear'] = data['assessmentyear'].fillna(median_value).astype(np.int print('After', data['assessmentyear'].unique()) del data['basementsqft']
  • 15. 1/12/2018 lab1 http://localhost:8888/notebooks/Documents/GitHub/data_mining_group_project-master/notebooks/Lab%201/lab1.ipynb 15/63 In [15]: Variable: bedroomcnt - Number of bedrooms in home Has datatype: ordinal and 0.000437 percent of values missing We decided to replace missing values with zeros since there are only very few to represent a studio apartment. We changed the column datatype to integer. In [16]: Variable: buildingclasstypeid - The building framing type (steel frame, wood frame, concrete/brick) Has datatype: nominal and 99.576949 percent of values missing With this much missing values and the difficulty of assigning a building framing type, we decided to remove this variable. In [17]: Variable: buildingqualitytypeid - Overall assessment of condition of the building from best (lowest) to worst (highest) Has datatype: ordinal and 34.81 percent of values missing We chose to replace the missing values with the median of the condition assessment instead of giving the missing values the best or worst value. We changed the column datatype to integer. ('Before', array([ 0. , 2. , 4. , 3. , 1. , 2.5 , 3.5 , 5. , 1.5 , 4.5 , 7.5 , 5.5 , 6. , 7. , 10. , 8. , 9. , 12. , 11. , 8.5 , 6.5 , 13. , 9.5 , 14. , 20. , 19.5 , 15. , 10.5 , nan, 18. , 16. , 1.75, 17. , 19. , 0.5 , 12.5 , 11.5 , 14.5 ])) ('After', array([ 0. , 2. , 4. , 3. , 1. , 2.5 , 3.5 , 5. , 1.5 , 4.5 , 7.5 , 5.5 , 6. , 7. , 10. , 8. , 9. , 12. , 11. , 8.5 , 6.5 , 13. , 9.5 , 14. , 20. , 19.5 , 15. , 10.5 , 18. , 16. , 1.75, 17. , 19. , 0.5 , 12.5 , 11.5 , 14.5 ])) ('Before', array([ 0., 4., 5., 2., 3., 1., 6., 7., 8., 12., 11., 9., 10., 14., 16., 13., nan, 15., 17., 18., 20., 19.])) ('After', array([ 0, 4, 5, 2, 3, 1, 6, 7, 8, 12, 11, 9, 10, 14, 16, 1 3, 15, 17, 18, 20, 19])) print('Before', data['bathroomcnt'].unique()) data['bathroomcnt'] = data['bathroomcnt'].fillna(0).astype(np.float32) print('After', data['bathroomcnt'].unique()) print('Before', data['bedroomcnt'].unique()) data['bedroomcnt'] = data['bedroomcnt'].fillna(0).astype(np.int32) print('After', data['bedroomcnt'].unique()) del data['buildingclasstypeid']
  • 16. 1/12/2018 lab1 http://localhost:8888/notebooks/Documents/GitHub/data_mining_group_project-master/notebooks/Lab%201/lab1.ipynb 16/63 In [18]: Variable: calculatedbathnbr - Number of bathrooms in home including fractional bathroom Has datatype: ordinal and 3.95 percent of values missing With a low number of missing values, we decided to assign 0 to all missing values since we decided above it is possible that a property could have 0 bathrooms. We changed the column datatype to a float. In [19]: Variable: calculatedfinishedsquarefeet - Calculated total finished living area of the home Has datatype: ratio and 1.48 percent of values missing These missing values appear to be consistent with 0 or missing values for variables associated with a building or structure on the property such as bathroomcnt, bedroomcnt, or architecturalstyletypeid. We can assume that no structures exist on these properties and we decided to impute zeros to these. We changed the column datatype to integer. We then replaced all outliers with a maximum and minimum value of (mean ± 5 * std), respectively. ('Before', array([ nan, 7., 4., 10., 1., 12., 8., 3., 6., 9., 5., 11., 2.])) ('After', array([ 7, 4, 10, 1, 12, 8, 3, 6, 9, 5, 11, 2])) ('Before', array([ nan, 2. , 4. , 3. , 1. , 2.5, 3.5, 5. , 1. 5, 4.5, 7.5, 5.5, 6. , 7. , 10. , 8. , 9. , 12. , 11. , 8.5, 6.5, 13. , 9.5, 14. , 20. , 19.5, 15. , 10.5, 18. , 16. , 17. , 19. , 12.5, 11.5, 14.5])) ('After', array([ 0. , 2. , 4. , 3. , 1. , 2.5, 3.5, 5. , 1.5, 4.5, 7.5, 5.5, 6. , 7. , 10. , 8. , 9. , 12. , 11. , 8.5, 6.5, 13. , 9.5, 14. , 20. , 19.5, 15. , 10.5, 18. , 16. , 17. , 19. , 12.5, 11.5, 14.5])) print('Before', data['buildingqualitytypeid'].unique()) medianQuality = data['buildingqualitytypeid'].median() data['buildingqualitytypeid'] = data['buildingqualitytypeid'].fillna(medianQuality print('After', data['buildingqualitytypeid'].unique()) print('Before', data['calculatedbathnbr'].unique()) data['calculatedbathnbr'] = data['calculatedbathnbr'].fillna(0).astype(np.float32 print('After', data['calculatedbathnbr'].unique())
  • 17. 1/12/2018 lab1 http://localhost:8888/notebooks/Documents/GitHub/data_mining_group_project-master/notebooks/Lab%201/lab1.ipynb 17/63 In [20]: Variable: censustractandblock - census tract and census block ID Has datatype: nominal and 2.14 percent of values missing With such a small amount of missing values, we decided to replace them with the median. A better approach in the future could be taking into account zip code and then median for the missing values. We changed the column datatype to a float. In [21]: Variable: decktypeid - Type of deck (if any) present on parcel Has datatype: nominal and 99.427311 percent of values missing Outliers found! ('Before', [nan, 10925.92657277406, 5068.0, 1776.0, 2400.0, 3611.0, 3754.0, 247 0.0, '...']) ('After', [0, 10925, 5068, 1776, 2400, 3611, 3754, 2470, '...']) ('Before', [nan, 61110010011023.0, 61110009032019.0, 61110010024015.0, 61110010 023002.0, 61110010024021.0, 61110010021029.0, 61110010022038.0, '...']) ('After', [60375714234368.0, 61110011035648.0, 61110006841344.0, 6111000264704 0.0, 61110015229952.0, 61110019424256.0, 61110023618560.0, 61110027812864.0, '...']) fix_outliers(data, 'calculatedfinishedsquarefeet') print('Before', data['calculatedfinishedsquarefeet'].unique()[:8].tolist() + ['.. data['calculatedfinishedsquarefeet'] = data['calculatedfinishedsquarefeet'].fillna print('After', data['calculatedfinishedsquarefeet'].unique()[:8].tolist() + ['... print('Before', data['censustractandblock'].unique()[:8].tolist() + ['...']) median_value = data['censustractandblock'].median() data['censustractandblock'] = data['censustractandblock'].fillna(median_value) data['censustractandblock'] = data['censustractandblock'].astype(np.float32) print('After', data['censustractandblock'].unique()[:8].tolist() + ['...'])
  • 18. 1/12/2018 lab1 http://localhost:8888/notebooks/Documents/GitHub/data_mining_group_project-master/notebooks/Lab%201/lab1.ipynb 18/63 Missing values is most likely an indication of an absence of this feature in the property. With 99% missing values, we will remove this column. In [22]: Variable: finishedfloor1squarefeet - Size of the finished living area on the first (entry) floor of the home Has datatype: ratio and 93.18 percent of values missing Having this much missing values and the availability of an alternate variable - calculatedfinishedsquarefeet - with very few missing values, we decided to eliminate this variable. In [23]: Variable: finishedsquarefeet12 - Finished living area Has datatype: ratio and 8.89 percent of values missing The finishedsquarefeet fields add up to the calculatedfinishedsquarefeet. Missing values are therefore zeros. We changed the column datatype to integer. We then replaced all outliers with a maximum and minimum value of (mean ± 5 * std), respectively. In [24]: Outliers found! ('Before', array([ nan, 4000., 3633., ..., 317., 268., 161.])) ('After', array([ 0, 4000, 3633, ..., 317, 268, 161])) del data['decktypeid'] del data['finishedfloor1squarefeet'] fix_outliers(data, 'finishedsquarefeet12') print('Before', data['finishedsquarefeet12'].unique()) data['finishedsquarefeet12'] = data['finishedsquarefeet12'].fillna(0).astype(np.i print('After', data['finishedsquarefeet12'].unique())
  • 19. 1/12/2018 lab1 http://localhost:8888/notebooks/Documents/GitHub/data_mining_group_project-master/notebooks/Lab%201/lab1.ipynb 19/63 Variable: finishedsquarefeet13 - Finished living area Has datatype: ratio and 99.743000 percent of values missing The finishedsquarefeet fields add up to the calculatedfinishedsquarefeet. Since there are 99% missing values we will remove this from the dataset. In [25]: Variable: finishedsquarefeet15 - Total area Has datatype: ratio and 93.58 percent of values missing The finishedsquarefeet fields add up to the calculatedfinishedsquarefeet. Since there are 93% missing values we will remove this from the dataset. In [26]: Variable: finishedsquarefeet50 - Size of the finished living area on the first (entry) floor of the home Has datatype: ratio and 93.18 percent of values missing The finishedsquarefeet fields add up to the calculatedfinishedsquarefeet. Since there are 93% missing values we will replace the missing values with 0. We changed the column datatype to float. In [27]: Variable: finishedsquarefeet6 - Base unfinished and finished area Has datatype: ratio and 99.26 percent of values missing With 99% missing values, we decided to delete this variable. In [28]: Variable: fips - Federal Information Processing Standard code - see https://en.wikipedia.org/wiki/FIPS_county_code (https://en.wikipedia.org/wiki/FIPS_county_code) for more details Has datatype: nominal with values [6037.0, 6059.0, 6111.0] and no missing values We changed the column datatype to integer. In [29]: Variable: fireplacecnt - Number of fireplaces in a home (if any) del data['finishedsquarefeet13'] del data['finishedsquarefeet15'] data['finishedsquarefeet50'] = data['finishedsquarefeet50'].fillna(0).astype(np.f del data['finishedsquarefeet6'] data['fips'] = data['fips'].astype(np.int32)
  • 20. 1/12/2018 lab1 http://localhost:8888/notebooks/Documents/GitHub/data_mining_group_project-master/notebooks/Lab%201/lab1.ipynb 20/63 Has datatype: ordinal and 89.486882 percent of values missing In this dataset, missing value represents 0 fireplaces. We replaced all missing values with zero and change the column datatype to integer. We changed the column datatype to integer. In [30]: Variable: fireplaceflag - does the home have a fireplace Has datatype: ordinal and 99.82 percent of values missing With 99% missing values, we decided to delete the variable. In [31]: Variable: fullbathcnt - Number of full bathrooms (sink, shower + bathtub, and toilet) present in home Has datatype: ordinal and 3.95 percent of values missing We first replaced its missing values with the values of bathroomcnt which is a similar measure. After that, we have 25 observations missing and we replace them with 0. We changed the column datatype to a float. In [32]: Variable: garagecarcnt - Total number of garages on the lot including an attached garage Has datatype: ordinal and 70.298173 percent of values missing We assume that missing values will represent no garage and replace all missing values with zero. We changed the column datatype to integer. ('Before', array([ nan, 3., 1., 2., 4., 9., 5., 7., 6., 8.])) ('After', array([0, 3, 1, 2, 4, 9, 5, 7, 6, 8])) ('Before', array([ nan, 2., 4., 3., 1., 5., 7., 6., 10., 8., 9., 12., 11., 13., 14., 20., 19., 15., 18., 16., 17.])) ('After', array([ 0. , 2. , 4. , 3. , 1. , 5. , 7. , 6. , 10. , 8. , 9. , 12. , 11. , 7.5 , 2.5 , 4.5 , 1.5 , 13. , 14. , 20. , 3.5 , 19. , 5.5 , 15. , 18. , 16. , 1.75, 6.5 , 17. , 0.5 , 8.5 ])) print('Before', data['fireplacecnt'].unique()) data['fireplacecnt'] = data['fireplacecnt'].fillna(0).astype(np.int32) print('After', data['fireplacecnt'].unique()) del data['fireplaceflag'] print('Before', data['fullbathcnt'].unique()) missing_fullbathcnt = data['fullbathcnt'].isnull() data.loc[missing_fullbathcnt, 'fullbathcnt'] = data['bathroomcnt'][missing_fullbat data['fullbathcnt'] = data['fullbathcnt'].astype(np.float32) print('After', data['fullbathcnt'].unique())
  • 21. 1/12/2018 lab1 http://localhost:8888/notebooks/Documents/GitHub/data_mining_group_project-master/notebooks/Lab%201/lab1.ipynb 21/63 In [33]: Variable: garagetotalsqft - Total number of garages on the lot including an attached garage Has datatype: ratio and 70.298173 percent of values missing We first replaced missing values where garagecarcnt is 0 with 0 garagetotalsqft. We changed the column datatype to a float. We then replaced all outliers with a maximum and minimum value of (mean ± 5 * std), respectively. In [34]: Variable: hashottuborspa - Does the home have a hot tub or spa Has datatype: ordinal and 97.679250 percent of values missing In this dataset missing values represent doesn't have a hot tub or spa. we replaced all missing values with 0 and all True values with 1. We changed the column datatype to integer. [ 0 2 4 1 3 5 7 6 8 9 12 11 10 13 14 15 25 21 18 17 24 19 16 20] Outliers found! data['garagecarcnt'] = data['garagecarcnt'].fillna(0).astype(np.int32) print(data['garagecarcnt'].unique()) fix_outliers(data, 'garagetotalsqft') data.loc[data['garagecarcnt'] == 0, 'garagetotalsqft'] = 0 data['garagecarcnt'] = data['garagecarcnt'].astype(np.float32) assert data['garagetotalsqft'].isnull().sum() == 0
  • 22. 1/12/2018 lab1 http://localhost:8888/notebooks/Documents/GitHub/data_mining_group_project-master/notebooks/Lab%201/lab1.ipynb 22/63 In [35]: Variable: heatingorsystemtypeid - Type of home heating system Has datatype: nominal and 39.255728 percent of values missing We replaced all missing values with 0 which will represent a missing heating system type id. We changed the column datatype to integer. In [36]: Variable: landtaxvaluedollarcnt - the assessed value of the land Has datatype: ratio and 1.89 percent of values missing We replaced all missing values with the median assessed land values. We changed the column datatype to integer. We then replaced all outliers with a maximum and minimum value of (mean ± 5 * std), respectively. ('Before', array([nan, True], dtype=object)) ('After', array([0, 1])) ('Before', array([ nan, 2., 7., 20., 6., 13., 18., 24., 12., 10., 1., 14., 21., 11., 19.])) ('After', array([ 0, 2, 7, 20, 6, 13, 18, 24, 12, 10, 1, 14, 21, 11, 19])) print('Before', data['hashottuborspa'].unique()) data['hashottuborspa'] = data['hashottuborspa'].fillna(0).replace('True', 1).asty print('After', data['hashottuborspa'].unique()) print('Before', data['heatingorsystemtypeid'].unique()) data['heatingorsystemtypeid'] = data['heatingorsystemtypeid'].fillna(0).astype(np print('After', data['heatingorsystemtypeid'].unique())
  • 23. 1/12/2018 lab1 http://localhost:8888/notebooks/Documents/GitHub/data_mining_group_project-master/notebooks/Lab%201/lab1.ipynb 23/63 In [37]: Variables: latitude and longitude Has datatype: interval and no missing values. We changed the column datatype to float. In [38]: Variable: logerror - Error or the Zillow model response variable Has datatype: interval and 96.964429 percent of values missing We will not fill any missing values because they represent the test part of the dataset. We changed the column datatype to float. In [39]: Variable: lotsizesquarefeet - Area of the lot in square feet Has datatype: ratio and 8.9 percent of values missing Outliers found! ('Before', array([ 9.00000000e+00, 2.75160000e+04, 7.62631000e+05, ..., 1.28007500e+06, 3.61063000e+05, 9.54574000e+05])) ('After', array([ 9, 27516, 762631, ..., 1280075, 361063, 954574])) fix_outliers(data, 'landtaxvaluedollarcnt') print('Before', data['landtaxvaluedollarcnt'].unique()) median_value = data['landtaxvaluedollarcnt'].median() data['landtaxvaluedollarcnt'] = data['landtaxvaluedollarcnt'].fillna(median_value print('After', data['landtaxvaluedollarcnt'].unique()) data[['latitude', 'longitude']] = data[['latitude', 'longitude']].astype(np.float data['logerror'] = data['logerror'].astype(np.float32)
  • 24. 1/12/2018 lab1 http://localhost:8888/notebooks/Documents/GitHub/data_mining_group_project-master/notebooks/Lab%201/lab1.ipynb 24/63 We replace all missing values with 0 which will represent no lot. We changed the column datatype to integer. We then replaced all outliers with a maximum and minimum value of (mean ± 5 * std), respectively. In [40]: Variable: numberofstories - number of stories or levels the home has Has datatype: ordinal and 77.06 percent of values missing We replace all missing values with 1 after removing all outliers to represent a single story home. We changed the column datatype to integer. We then replaced all outliers with a maximum and minimum value of (mean ± 5 * std), respectively. Outliers found! fix_outliers(data, 'lotsizesquarefeet') data['lotsizesquarefeet'] = data['lotsizesquarefeet'].fillna(0).astype(np.float32
  • 25. 1/12/2018 lab1 http://localhost:8888/notebooks/Documents/GitHub/data_mining_group_project-master/notebooks/Lab%201/lab1.ipynb 25/63 In [41]: Variable: parcelid - Unique identifier for parcels (lots) Has datatype: nominal and no values missing. We changed the column datatype to integer. In [42]: Variable: poolcnt - Number of pools on the lot (if any) Has datatype: ordinal and 82.6 percent of values missing We replaced all missing values with 0 which will represent no pools. We changed the column datatype to integer. In [43]: Variable: poolsizesum - Total square footage of all pools on Outliers found! ('Before', array([ nan, 1. , 4. , 2. , 3. , 4.09684575])) ('After', array([1, 4, 2, 3])) ('Before', array([ nan, 1.])) ('After', array([0, 1])) fix_outliers(data, 'numberofstories') print('Before', data['numberofstories'].unique()) data['numberofstories'] = data['numberofstories'].fillna(1).astype(np.int32) print('After', data['numberofstories'].unique()) data['parcelid'] = data['parcelid'].astype(np.int32) print('Before', data['poolcnt'].unique()) data['poolcnt'] = data['poolcnt'].fillna(0).astype(np.int32) print('After', data['poolcnt'].unique())
  • 26. 1/12/2018 lab1 http://localhost:8888/notebooks/Documents/GitHub/data_mining_group_project-master/notebooks/Lab%201/lab1.ipynb 26/63 property Has datatype: ratio and 99 percent of values missing We replaced all missing values with 0 if number of pools is 0 or with the average poolsizesum otherwise. We changed the column datatype to a float. We then replaced all outliers with a maximum and minimum value of (mean ± 5 * std), respectively. In [44]: Variable: pooltypeid10 - Spa or Hot Tub Has datatype: nominal and 98.8 percent of values missing We replaced all missing values with 0 which will represent no Spa or Hot Tub. We changed the column datatype to integer. In [45]: Variable: pooltypeid2 - Pool with Spa/Hot Tub Has datatype: nominal and 98.9 percent of values missing Outliers found! ('Before', array([ nan, 1.])) ('After', array([0, 1])) fix_outliers(data, 'poolsizesum') data.loc[data['poolsizesum'].isnull(), 'poolsizesum'] = int(data['poolsizesum'].me data.loc[data['poolcnt'] == 0, 'poolsizesum'] = 0 data['poolcnt'] = data['poolcnt'].astype(np.float32) print('Before', data['pooltypeid10'].unique()) data['pooltypeid10'] = data['pooltypeid10'].fillna(0).astype(np.int32) print('After', data['pooltypeid10'].unique())
  • 27. 1/12/2018 lab1 http://localhost:8888/notebooks/Documents/GitHub/data_mining_group_project-master/notebooks/Lab%201/lab1.ipynb 27/63 We replaced all missing values with 0 which will represent no Pool with Spa/Hot Tub. We changed the column datatype to integer. In [46]: Variable: pooltypeid7 - Pool without hot tub Has datatype: nominal and 83.6 percent of values missing We replaced all missing values with 0 which will represent no pool without hot tub. We changed the column datatype to integer. In [47]: Variable: propertycountylandusecode - County land use code i.e. it's zoning at the county level Has datatype: nominal and 0.02 percent of values missing We replaced all missing values with 0 which will represent no county land use code. We changed the column datatype to string. In [48]: Variable: propertylandusetypeid - Type of land use the property is zoned for Has datatype: nominal and 0 percent of values missing. We are just changing the datatype to integer In [49]: ('Before', array([ nan, 1.])) ('After', array([0, 1])) ('Before', array([ nan, 1.])) ('After', array([0, 1])) ('Before', ['010D', '0109', '1200', '1210', '010V', '300V', '0100', '0200', '...']) ('After', ['010D', '0109', '1200', '1210', '010V', '300V', '0100', '0200', '...']) print('Before', data['pooltypeid2'].unique()) data['pooltypeid2'] = data['pooltypeid2'].fillna(0).astype(np.int32) print('After', data['pooltypeid2'].unique()) print('Before', data['pooltypeid7'].unique()) data['pooltypeid7'] = data['pooltypeid7'].fillna(0).astype(np.int32) print('After', data['pooltypeid7'].unique()) print('Before', data['propertycountylandusecode'].unique()[:8].tolist() + ['...'] data['propertycountylandusecode'] = data['propertycountylandusecode'].fillna(0).a print('After', data['propertycountylandusecode'].unique()[:8].tolist() + ['...']) data['propertylandusetypeid'] = data['propertylandusetypeid'].astype(np.int32)
  • 28. 1/12/2018 lab1 http://localhost:8888/notebooks/Documents/GitHub/data_mining_group_project-master/notebooks/Lab%201/lab1.ipynb 28/63 Variable: propertyzoningdesc - Description of the allowed land uses (zoning) for that property Has datatype: nominal and 33.4 percent of values missing We replaced all missing values with 0 which will represent no description of the allowed land uses. We changed the column datatype to string. In [50]: Variable: rawcensustractandblock - Census tract and block ID combined - also contains blockgroup assignment by extension Has datatype: nominal and 0 percent of values missing We are just changing the datatype to integer In [51]: Variable: regionidcity - City in which the property is located (if any) Has datatype: nominal and 1.72 percent of values missing we will replace any missing values with 0 to represent no city ID. We are just changing the datatype to integer In [52]: Variable: regionidcounty - County in which the property is located ('Before', array([nan, 'LCA11*', 'LAC2', ..., 'WCR1400000', 'EMPYYY', 'RMM2*'], dtype=object)) ('After', array(['0', 'LCA11*', 'LAC2', ..., 'WCR1400000', 'EMPYYY', 'RMM2*'], dtype=object)) ('Before', array([ 60378002.041 , 60378001.011002, 60377030.012017, ..., 60590878.032022, 60590626.211013, 60379012.091563])) ('After', array([60378002, 60378001, 60377030, ..., 61110057, 60375324, 6037599 1])) ('Before', [37688.0, 51617.0, 12447.0, 396054.0, 47547.0, nan, 54311.0, 40227. 0, '...']) ('After', [37688, 51617, 12447, 396054, 47547, 0, 54311, 40227, '...']) print('Before', data['propertyzoningdesc'].unique()) data['propertyzoningdesc'] = data['propertyzoningdesc'].fillna(0).astype(np.str) print('After', data['propertyzoningdesc'].unique()) print('Before', data['rawcensustractandblock'].unique()) data['rawcensustractandblock'] = data['rawcensustractandblock'].fillna(0).astype( print('After', data['rawcensustractandblock'].unique()) print('Before', data['regionidcity'].unique()[:8].tolist() + ['...']) data['regionidcity'] = data['regionidcity'].fillna(0).astype(np.int32) print('After', data['regionidcity'].unique()[:8].tolist() + ['...'])
  • 29. 1/12/2018 lab1 http://localhost:8888/notebooks/Documents/GitHub/data_mining_group_project-master/notebooks/Lab%201/lab1.ipynb 29/63 Has datatype: nominal and 0 percent of values missing. We changed the column datatype to integer. In [53]: Variable: regionidneighborhood - Neighborhood in which the property is located Has datatype: nominal and 61.1 percent of values missing We replaced all missing values with 0 which will represent no region ID neighborhood. We changed the column datatype to integer. In [54]: Variable: regionidzip - Zip code in which the property is located Has datatype: nominal and 0.08 percent of values missing We replaced all missing values with 0 which will represent no zip code. We changed the column datatype to integer. In [55]: Variable: roomcnt - Total number of rooms in the principal residence Has datatype: nominal and 0.001 percent of values missing We replaced all missing values with 1 which will represent no Total number of rooms in the principal residence reported. We changed the column datatype to integer. We then replaced all outliers with a maximum and minimum value of (mean ± 5 * std), respectively. ('Before', array([ 3101., 1286., 2061.])) ('After', array([3101, 1286, 2061])) ('Before', [nan, 27080.0, 46795.0, 274049.0, 31817.0, 37739.0, 115729.0, 7877. 0, '...']) ('After', [0, 27080, 46795, 274049, 31817, 37739, 115729, 7877, '...']) ('Before', [96337.0, 96095.0, 96424.0, 96450.0, 96446.0, 96049.0, 96434.0, 9643 6.0, '...']) ('After', [96337, 96095, 96424, 96450, 96446, 96049, 96434, 96436, '...']) print('Before', data['regionidcounty'].unique()) data['regionidcounty'] = data['regionidcounty'].astype(np.int32) print('After', data['regionidcounty'].unique()) print('Before', data['regionidneighborhood'].unique()[:8].tolist() + ['...']) data['regionidneighborhood'] = data['regionidneighborhood'].fillna(0).astype(np.i print('After', data['regionidneighborhood'].unique()[:8].tolist() + ['...']) print('Before', data['regionidzip'].unique()[:8].tolist() + ['...']) data['regionidzip'] = data['regionidzip'].fillna(0).astype(np.int32) print('After', data['regionidzip'].unique()[:8].tolist() + ['...'])
  • 30. 1/12/2018 lab1 http://localhost:8888/notebooks/Documents/GitHub/data_mining_group_project-master/notebooks/Lab%201/lab1.ipynb 30/63 In [56]: Variable: storytypeid - Type of floors in a multi-story house (i.e. basement and main level, split-level, attic, etc.). See tab for details. Has datatype: nominal and 99.9 percent of values missing With 99% missing values, we decided to remove this variable. In [57]: Variable: structuretaxvaluedollarcnt - the assessed value of the building Has datatype: ratio and 1.46 percent of values missing We replaced all missing values with the median assessed building tax. We changed the column datatype to integer. We then replaced all outliers with a maximum and minimum value of (mean ± 5 * std), respectively. Outliers found! ('Before', array([ 0. , 8. , 4. , 5. , 7. , 6. , 11. , 3. , 10. , 9. , 2. , 12. , 15.67699991, 13. , 15. , 14. , 1. , nan])) ('After', array([ 0, 8, 4, 5, 7, 6, 11, 3, 10, 9, 2, 12, 15, 13, 14, 1])) fix_outliers(data, 'roomcnt') print('Before', data['roomcnt'].unique()) data['roomcnt'] = data['roomcnt'].fillna(1).astype(np.int32) print('After', data['roomcnt'].unique()) del data['storytypeid']
  • 31. 1/12/2018 lab1 http://localhost:8888/notebooks/Documents/GitHub/data_mining_group_project-master/notebooks/Lab%201/lab1.ipynb 31/63 In [58]: Variable: taxamount - property tax for the assessment year Has datatype: ratio and 0.66 percent of values missing We replaced all missing values with the median property taxes for the assessment year. We changed the column datatype to integer. We then replaced all outliers with a maximum and minimum value of (mean ± 5 * std), respectively. Outliers found! ('Before', array([ nan, 650756., 571346., ..., 409940., 463704., 43776 5.])) ('After', array([122590, 650756, 571346, ..., 409940, 463704, 437765])) fix_outliers(data, 'structuretaxvaluedollarcnt') print('Before', data['structuretaxvaluedollarcnt'].unique()) medTax = np.nanmedian(data['structuretaxvaluedollarcnt']) data['structuretaxvaluedollarcnt'] = data['structuretaxvaluedollarcnt'].fillna(med print('After', data['structuretaxvaluedollarcnt'].unique())
  • 32. 1/12/2018 lab1 http://localhost:8888/notebooks/Documents/GitHub/data_mining_group_project-master/notebooks/Lab%201/lab1.ipynb 32/63 In [59]: Variable: taxdelinquencyflag - property taxes from 2015 that are past due Has datatype: nominal and 98.10 percent of values missing We replaced all missing values with 0 representing no past due property taxes and all Y values with 1 representing that there are past due property taxes. We changed the column datatype to integer. In [60]: Variable: taxdelinquencyyear - years of delinquency Has datatype: interval and 98.10 percent of values missing We replaced all missing values with 0 representing no years of property tax delinquencies. We changed the column datatype to integer. We then replaced all outliers with a maximum and minimum value of (mean ± 5 * std), respectively. Outliers found! ('Before', array([ nan, 20800.37, 14557.57, ..., 33604.04, 12627.18, 15546.14])) ('After', array([ 3991.7800293 , 20800.36914062, 14557.5703125 , ..., 33604.0390625 , 12627.1796875 , 15546.13964844])) ('Before', array([nan, 'Y'], dtype=object)) ('After', array([0, 1])) fix_outliers(data, 'taxamount') print('Before', data['taxamount'].unique()) median_value = data['taxamount'].median() data['taxamount'] = data['taxamount'].fillna(median_value).astype(np.float32) print('After', data['taxamount'].unique()) print('Before', data['taxdelinquencyflag'].unique()) data['taxdelinquencyflag'] = data['taxdelinquencyflag'].fillna(0).replace('Y', 1) print('After', data['taxdelinquencyflag'].unique())
  • 33. 1/12/2018 lab1 http://localhost:8888/notebooks/Documents/GitHub/data_mining_group_project-master/notebooks/Lab%201/lab1.ipynb 33/63 In [61]: Variable: taxvaluedollarcnt - total tax Has datatype: ratio and 1.04 percent of values missing We replaced all missing values with the median total tax amount. We changed the column datatype to integer. We then replaced all outliers with a maximum and minimum value of (mean ± 5 * std), respectively. Outliers found! ('Before', array([ nan, 13. , 15. , 11. , 14. , 9. , 10. , 8. , 12. , 7. , 6. , 2. , 26.79676804, 5. , 3. , 4. , 0.98797484, 1. ])) ('After', array([ 0, 13, 15, 11, 14, 9, 10, 8, 12, 7, 6, 2, 26, 5, 3, 4, 1])) fix_outliers(data, 'taxdelinquencyyear') print('Before', data['taxdelinquencyyear'].unique()) data['taxdelinquencyyear'] = data['taxdelinquencyyear'].fillna(0).astype(np.int32 print('After', data['taxdelinquencyyear'].unique())
  • 34. 1/12/2018 lab1 http://localhost:8888/notebooks/Documents/GitHub/data_mining_group_project-master/notebooks/Lab%201/lab1.ipynb 34/63 In [62]: Variable: threequarterbathnbr - Number of 3/4 bathrooms in house (shower + sink + toilet) Has datatype: ordinal and 89.5 percent of values missing We replaced all missing values with 0 which will represent no Number of 3/4 bathrooms in the property. We changed the column datatype to integer. In [63]: Variable: transactiondate - Date of the transaction response variable Has datatype: interval and 96.964429 percent of values missing Will not fill any missing values because they represent the test part of the dataset Outliers found! ('Before', array([ 9.00000000e+00, 2.75160000e+04, 1.41338700e+06, ..., 4.70248000e+05, 6.43794000e+05, 5.30550000e+05])) ('After', array([ 9, 27516, 1413387, ..., 470248, 643794, 530550])) ('Before', array([ nan, 1., 2., 4., 3., 6., 5., 7.])) ('After', array([0, 1, 2, 4, 3, 6, 5, 7])) fix_outliers(data, 'taxvaluedollarcnt') print('Before', data['taxvaluedollarcnt'].unique()) median_value = data['taxvaluedollarcnt'].median() data['taxvaluedollarcnt'] = data['taxvaluedollarcnt'].fillna(median_value).astype print('After', data['taxvaluedollarcnt'].unique()) print('Before', data['threequarterbathnbr'].unique()) data['threequarterbathnbr'] = data['threequarterbathnbr'].fillna(0).astype(np.int print('After', data['threequarterbathnbr'].unique())
  • 35. 1/12/2018 lab1 http://localhost:8888/notebooks/Documents/GitHub/data_mining_group_project-master/notebooks/Lab%201/lab1.ipynb 35/63 In [64]: Variable: typeconstructiontypeid - What type of construction material was used to construct the home Has datatype: nominal and 99.7 percent of values missing With 99% missing values, we decided to remove this variable. In [65]: Variable: unitcnt - number of units in the building Has datatype: ordinal and 33.5 percent of values missing We replaced all missing values with 1 to represent a single family home for any with no values. We changed the column datatype to integer. We then replaced all outliers with a maximum and minimum value of (mean ± 5 * std), respectively. In [66]: Variable: yardbuildingsqft17 - sq feet of patio in yard Has datatype: interval and 97.29 percent of values missing Outliers found! ('Before', [nan, 2.0, 1.0, 3.0, 5.0, 4.0, 9.0, 13.420418204007635, '...']) ('After', array([ 1, 2, 3, 5, 4, 9, 13, 12, 6, 7, 8, 10, 11])) data['transactiondate'] = pd.to_datetime(data['transactiondate']) del data['typeconstructiontypeid'] fix_outliers(data, 'unitcnt') print('Before', data['unitcnt'].unique()[:8].tolist() + ['...']) data['unitcnt'] = data['unitcnt'].fillna(1).astype(np.int32) print('After', data['unitcnt'].unique())
  • 36. 1/12/2018 lab1 http://localhost:8888/notebooks/Documents/GitHub/data_mining_group_project-master/notebooks/Lab%201/lab1.ipynb 36/63 We replaced all missing values with 0 representing no patio. We changed the column datatype to integer. We then replaced all outliers with a maximum and minimum value of (mean ± 5 * std), respectively. In [67]: Variable: yardbuildingsqft26 - storage shed/building in yard Has datatype: interval and 99.91 percent of values missing We replaced all missing values with 0 which will represent no (square ft) storage shed or building in the yard. We changed the column datatype to integer. We then replaced all outliers with a maximum and minimum value of (mean ± 5 * std), respectively. Outliers found! ('Before', array([ nan, 450., 94., ..., 969., 1359., 1079.])) ('After', array([ 0, 450, 94, ..., 969, 1359, 1079])) fix_outliers(data, 'yardbuildingsqft17') print('Before', data['yardbuildingsqft17'].unique()) data['yardbuildingsqft17'] = data['yardbuildingsqft17'].fillna(0).astype(np.int32 print('After', data['yardbuildingsqft17'].unique())
  • 37. 1/12/2018 lab1 http://localhost:8888/notebooks/Documents/GitHub/data_mining_group_project-master/notebooks/Lab%201/lab1.ipynb 37/63 In [68]: Variable: yearbuilt - The Year the residence was built Has datatype: interval and 1.63 percent of values missing We replaced all missing values with the median year built of 1963 until we have a better method to impute. We changed the column datatype to integer. In [69]: End of data cleaning We went through every variable and next cell will confirm that the dataset has no missing values. In [70]: Simple Statistics Outliers found! ('Before', [nan, 1948.0, 1947.0, 1943.0, 1946.0, 1978.0, 1958.0, 1949.0, '...']) ('After', [1963, 1948, 1947, 1943, 1946, 1978, 1958, 1949, '...']) fix_outliers(data, 'yardbuildingsqft26') data['yardbuildingsqft26'] = data['yardbuildingsqft26'].fillna(0).astype(np.float #there's too many values to print, before and after data redacted print('Before', data['yearbuilt'].unique()[:8].tolist() + ['...']) medYear = data['yearbuilt'].median() data['yearbuilt'] = data['yearbuilt'].fillna(medYear).astype(np.int32) print('After', data['yearbuilt'].unique()[:8].tolist() + ['...']) # 'logerror' and 'transactiondate' are future variables and only exist in the tran explanatory_vars = data.columns[~data.columns.isin(['logerror', 'transactiondate' assert np.all(~data[explanatory_vars].isnull())
  • 38. 1/12/2018 lab1 http://localhost:8888/notebooks/Documents/GitHub/data_mining_group_project-master/notebooks/Lab%201/lab1.ipynb 38/63 10 points Description: Visualize appropriate statistics (e.g., range, mode, mean, median, variance, counts) for a subset of attributes. Describe anything meaningful you found from this or if you found something potentially interesting. Note: You can also use data from other sources for comparison. Explain why the statistics run are meaningful. Table of Binary Variables (0 or 1) We standardized all Yes/No and True/False variables to 1 or 0, respectively. The table below shows that all binary flags in this dataset represent rare features such a pool, hot tub, tax delinquency flag, and three quarter bathroom. In [71]: Summary Statistics of All Continuous Variables To make the table more readable, we converted all simple statistics of continuous variables to integers. We lose some precision but we get a better overview. For each variable, we have already accounted for outliers and standardized missing values. We can immediately see that 0 is the most common value for many of the variables. To explore further, we chose to visualize each variable that had non-zero 25% to 75% values in the form of a boxplot and histogram. Out[71]: Percent with value equal to 1 hashottuborspa 2.320720 poolcnt 17.403347 pooltypeid2 1.078548 pooltypeid7 16.324799 pooltypeid10 1.242172 taxdelinquencyflag 1.898850 threequarterbathnbr 10.584165 bin_vars = ['hashottuborspa', 'poolcnt', 'pooltypeid2', 'pooltypeid7', 'pooltypeid bin_data = data[bin_vars] result_table = bin_data.mean() * 100 pd.DataFrame(result_table, columns=['Percent with value equal to 1'])
  • 39. 1/12/2018 lab1 http://localhost:8888/notebooks/Documents/GitHub/data_mining_group_project-master/notebooks/Lab%201/lab1.ipynb 39/63 In [72]: Calculated Finished Square Feet For calculated square feet, most values were 0 with a range from 0 to 10898 sqft. Note that we removed outliers earlier while cleaning the data. The median of 1561 was a little smaller than the mean of 1784 so we expect to see a slight right skew, which we do below. What is interesting here is the peak at 0 of values and then another peak around 1600 to 1800. We continue to have few properties with very large (higher than the 75th percentile of 2124) which is fairly normal for any area to have the middle class homes with few larger homes mixed in. Out[72]: count mean std min 25% 50% 75% max calculatedfinishedsquarefeet 2973905 1784 984 0 1199 1561 2124 1092 finishedsquarefeet12 2973905 1596 958 0 1092 1466 1996 6615 finishedsquarefeet50 2973905 94 390 0 0 0 0 3130 garagetotalsqft 2973905 113 217 0 0 0 0 1610 lotsizesquarefeet 2973905 19810 73796 0 5200 6700 9243 1710 poolsizesum 2973905 90 196 0 0 0 0 1476 yardbuildingsqft17 2973905 8 61 0 0 0 0 1485 yardbuildingsqft26 2973905 0 12 0 0 0 0 2126 yearbuilt 2973905 1964 23 1801 1950 1963 1981 2015 structuretaxvaluedollarcnt 2973905 166367 179850 1 75440 122590 195143 2181 taxvaluedollarcnt 2973905 407695 429374 1 181179 306086 485000 4052 assessmentyear 2973905 2014 0 2000 2015 2015 2015 2016 landtaxvaluedollarcnt 2973905 242391 287722 1 76724 167043 303002 2477 taxamount 2973905 5229 5284 1 2471 3991 6178 5129 taxdelinquencyyear 2973905 0 1 0 0 0 0 26 logerror 90275 0 0 -4 0 0 0 4 train_data = data[~data['logerror'].isnull()] continous_vars = variables[variables['type'].isin(['ratio', 'interval'])].index continous_vars = continous_vars[continous_vars.isin(data.columns)] continous_vars = continous_vars[~continous_vars.isin(['longitude', 'latitude'])] output_table = data[continous_vars].describe().T mode_range = data[continous_vars].mode().T mode_range.columns = ['mode'] mode_range['range'] = data[continous_vars].max() - data[continous_vars].min() output_table = output_table.join(mode_range) output_table.astype(int)
  • 40. 1/12/2018 lab1 http://localhost:8888/notebooks/Documents/GitHub/data_mining_group_project-master/notebooks/Lab%201/lab1.ipynb 40/63 In [73]: Finished Living Area Similar to calculated finished square feet, finished living area had outliers which we already fixed above. The range for finished living area is 0 to 6871 with 0 being the mode of the data. The mean (1596) is about 100sqft larger than the median (1466) so they are relatively the same since the variance is 962. This variable is bimodal in with a large spike at 0 and another peak with a fairly normal distribution and long right tail at around 1400. We also see a slight spike at the very end of the tail of the dataset. This means there were a lot of outliers that were set to the maximum (mean + 6 * std). f, (ax0,ax2) = plt.subplots(nrows=2, ncols=1, figsize=[15, 7]) sns.boxplot(data['calculatedfinishedsquarefeet'], ax=ax0, color="#34495e").set_tit sns.distplot(data['calculatedfinishedsquarefeet'], ax=ax2, color="#34495e");
  • 41. 1/12/2018 lab1 http://localhost:8888/notebooks/Documents/GitHub/data_mining_group_project-master/notebooks/Lab%201/lab1.ipynb 41/63 In [74]: Lot Size Square Feet Lot size square feet has the largest range from 0 to 1,710,750 even after removing all outliers (mean + std * 5). The mode for this variable is 0 so we see below a spike at 0 and a very long right tail. What is interesting with this variable is the large variance of 73796. The 25th to 75th percentile values are 5200 and 9243 respectively so we will skipped over the box plot and plotted the histogram below. In the histogram, we see a right skewed distribution which makes sense considering the mean is 19810 and the median is 6700 - again, with such a large variance it is difficult for the eye to see the difference. The main takeaway here is the large number of 0s. f, (ax0,ax2) = plt.subplots(nrows=2, ncols=1, figsize=[15, 7]) sns.boxplot(data['finishedsquarefeet12'], ax=ax0, color="#34495e").set_title('Fin sns.distplot(data['finishedsquarefeet12'], ax=ax2, color="#34495e");
  • 42. 1/12/2018 lab1 http://localhost:8888/notebooks/Documents/GitHub/data_mining_group_project-master/notebooks/Lab%201/lab1.ipynb 42/63 In [75]: Year Built The year the properties were built ranges from 1801 to 2015. The mode and median of 1963 is only a year difference from the mean of 1964. The distribution seems to be fairly normal with the peak in the early 1960s and dropping off on both sides. We see a number of homes that were built before 1905 (the low whisker of the boxplot) which gives us a long left tail. We see a few other spikes in homes built which could correlate to a number of other factors such as healthy economic growth, political backing on mortgages, or rises in population. The baby boomers born early 1960 shows many houses being built and around the time they turned 18 more houses seem to have been built. We see an apparent fall right before 2000 which could be the dot com burst and another drop in the housing burst of 2007. Because our data was collected in 2016, we expect to see fewer homes built the previous year. What will be interesting with this variable is how old a home has to be to begin to "fall apart" or need major renovations to the piping or foundation. Will a home built in a certain year have many homes made from a faulty material that causes damages later on? Will the Zestimate take into account the disclosures of a home that each sale price typically does? f, (ax0) = plt.subplots(nrows=1, ncols=1) sns.distplot(data['lotsizesquarefeet'], ax=ax0, color="#34495e").set_title('Lot s
  • 43. 1/12/2018 lab1 http://localhost:8888/notebooks/Documents/GitHub/data_mining_group_project-master/notebooks/Lab%201/lab1.ipynb 43/63 In [76]: Total Tax Value The total tax value of the property ranges from 1 to 4,052,186. The median of 306,086 is the same as the mode and a little smaller than the mean of 407,695 which is evident in the right skewed distribution below. These values have already been adjusted for outliers which is why we see a slight spike at the maximum value for larger developments and unique mansions. The distribution is fairly similar to square feet above because the tax is calculated by value assessed * square feet. What is interesting to note here is the missing values for tax were replaced by the median (hence the median and mode being the same) where the square footage missing values were replaced with 0s (hence the 0 as the mode and second peak in the distribution). f, (ax0,ax2) = plt.subplots(nrows=2, ncols=1) sns.boxplot(data['yearbuilt'].dropna(), ax=ax0, color="#34495e").set_title('Year sns.distplot(data['yearbuilt'].dropna(), ax=ax2, color="#34495e");
  • 44. 1/12/2018 lab1 http://localhost:8888/notebooks/Documents/GitHub/data_mining_group_project-master/notebooks/Lab%201/lab1.ipynb 44/63 In [77]: Building and Land Tax The building or structure tax has a similar right skewed distribution as total tax. The values range from 1 to 2,165,929, already adjusted for outliers and cleaned up with missing values set to median. That being said, the median and mode are the same at 122,590 which is lower than the mean of 166,344. The land tax values range from 1 to 2,477,536, also adjusted for outliers and cleaned up with missing values set to median. Because of this, the median and mode are the same at 167,043 which is lower than the mean of 242,391. Land tax seems to have a larger range of values from the 25th to 75th percentile than the building tax. This means that the land is valued at a greater variance (287k) than the buildings in certain areas (variance of 179k). We think this could be due to location itself as better neighborhoods, safer areas, or better schools could result in a higher assessment than other locations, thus widening the variance. f, (ax0,ax2) = plt.subplots(nrows=2, ncols=1) sns.boxplot(data['taxvaluedollarcnt'], ax=ax0, color="#34495e").set_title('Total t sns.distplot(data['taxvaluedollarcnt'], ax=ax2, color="#34495e");
  • 45. 1/12/2018 lab1 http://localhost:8888/notebooks/Documents/GitHub/data_mining_group_project-master/notebooks/Lab%201/lab1.ipynb 45/63 In [78]: Assessment Year Assessment year is the year that the property was assessed. The 25th through 75th percentile of values are all from the year 2015 so reading a box plot is not very helpful. Instead we list out the unique values for assessment year along with our histogram. In the state of California, the base year value is set when you originally purchase the property, based on the sales price listed on the deed. However, there are exceptions which is why we see a few assessment years from 2000 to 2016 thrown in. In order for assessment year to be useful for our predictions, we should find out what each exception is and what the cause of it not to be assessed at the point of sale. This could affect the predicted log error. f, (ax0,ax2,ax3,ax4) = plt.subplots(nrows=4, ncols=1, figsize=[15, 14]) sns.boxplot(data['structuretaxvaluedollarcnt'], ax=ax0, color="#34495e").set_title sns.distplot(data['structuretaxvaluedollarcnt'], ax=ax2, color="#34495e"); sns.boxplot(data['landtaxvaluedollarcnt'], ax=ax3, color="#34495e").set_title('La sns.distplot(data['landtaxvaluedollarcnt'], ax=ax4, color="#34495e");
  • 46. 1/12/2018 lab1 http://localhost:8888/notebooks/Documents/GitHub/data_mining_group_project-master/notebooks/Lab%201/lab1.ipynb 46/63 In [79]: Visualize Attributes 15 points Description: Visualize the most interesting attributes (at least 5 attributes, your opinion on what is interesting). Important: Interpret the implications for each visualization. Explain for each attribute why the chosen visualization is appropriate. Distribution of Target Variable: Logerror In the training dataset, logerror is the response variable so we are interested in seeing the distribution of log error that we are training on. We visualize this using a boxplot and histogram to get a general picture of the overall distribution. It is symmetric around zero which implies that the model generating the logerror has no bias and is very accurate in most instances. ('Unique years:', array([2015, 2014, 2003, 2012, 2001, 2011, 2013, 2016, 2010, 2004, 2005, 2002, 2000, 2009])) print('Unique years:', data['assessmentyear'].unique()) f, (ax2) = plt.subplots(nrows=1, ncols=1, figsize=[15, 4]) sns.distplot(data['assessmentyear'], ax=ax2, color="#34495e") plt.title('Assessment year distribution');
  • 47. 1/12/2018 lab1 http://localhost:8888/notebooks/Documents/GitHub/data_mining_group_project-master/notebooks/Lab%201/lab1.ipynb 47/63 In [80]: Count of Bathrooms We think that the number of bathrooms in a home could be interesting because our data was collected in California where rent is very high. It is common to buy a rental property and have random tenants. Tenants that do not know each other may want their own bathroom. In our case, most homes have 2 bathrooms. Notably, there are outliers with no bathrooms or suspiciously high train_data = data[~data['logerror'].isnull()] x = train_data['logerror'] f, (ax_box, ax_hist) = plt.subplots(2, sharex=True, gridspec_kw={"height_ratios": (.15, .85)}, figsize=(10, 10)) sns.boxplot(train_data['logerror'][train_data['logerror'].abs()<1], ax=ax_box, co sns.distplot( train_data['logerror'][train_data['logerror'].abs()<1], ax=ax_hist, bins=400, kde=False, color="#34495e");
  • 48. 1/12/2018 lab1 http://localhost:8888/notebooks/Documents/GitHub/data_mining_group_project-master/notebooks/Lab%201/lab1.ipynb 48/63 counts. We see records in the dataset with no bathroom which we justified above as being possible. Because we are looking at frequency, we chose to visualize the sum of each number of bathrooms (as a category) in a bar chart. In [81]: Count of Bedrooms For the same reasons we were interested in the number of bathrooms, we are also interested in the number of bedrooms. In our dataset, most properties have 3 bedrooms and we see fewer instances as we go up or down one bedroom in the data. Here we still see records without any bedrooms which we justified as studios above. We chose the same visualization (using number of bedrooms as a category and counting the frequency of each category) displayed in a bar chart below. sns.countplot(data['bathroomcnt'], color="#34495e") plt.ylabel('Count', fontsize=12) plt.xlabel('Bathrooms', fontsize=12) plt.title("Frequency of Bathroom count", fontsize=15);
  • 49. 1/12/2018 lab1 http://localhost:8888/notebooks/Documents/GitHub/data_mining_group_project-master/notebooks/Lab%201/lab1.ipynb 49/63 In [82]: Bed to Bath Ratio After visualizing the distribution of bathroom and bedroom counts, we also thought it would be interesting to try to see if the number of bathrooms were dependent on the number of bedrooms. We chose to stick with a bar chart, only this time using the ratio of bedrooms to bathrooms as the category to find the sum counts of. What we found is most homes have about a ratio of 1.5 bedrooms per 1 bathroom in a property. plt.ylabel('Count', fontsize=12) plt.xlabel('Bedrooms', fontsize=12) plt.title("Frequency of Bedrooms count", fontsize=15) sns.countplot(data['bedroomcnt'], color="#34495e");
  • 50. 1/12/2018 lab1 http://localhost:8888/notebooks/Documents/GitHub/data_mining_group_project-master/notebooks/Lab%201/lab1.ipynb 50/63 In [83]: Average Tax Per Square Feet For our last attribute, we calculated the tax per square foot to see if we could find any trends. We again chose to use a bar chart to plot the ratio and the sum counts. What we found is that plotting this exposes extreme outliers for possible elimination. Most properties are under a few dollars per square foot but as the visualization reveals, there are suspicious records. However, because this is southern California and land space is limited for continuous growth, there could be a reason that some places have high tax per square feet due to better real estate areas. non_zero_mask = data['bathroomcnt'] > 0 bedroom = data[non_zero_mask]['bedroomcnt'] bathroom = data[non_zero_mask]['bathroomcnt'] bedroom_to_bath_ratio = bedroom / bathroom bedroom_to_bath_ratio = bedroom_to_bath_ratio[bedroom_to_bath_ratio<6] sns.distplot(bedroom_to_bath_ratio, color="#34495e", kde=False) plt.title('Bed to Bath ratio', fontsize=15) plt.xlabel('Ratio', fontsize=15) plt.ylabel('Count', fontsize=15);
  • 51. 1/12/2018 lab1 http://localhost:8888/notebooks/Documents/GitHub/data_mining_group_project-master/notebooks/Lab%201/lab1.ipynb 51/63 In [84]: Explore Joint Attributes 15 points Description: Visualize relationships between attributes: Look at the attributes via scatter plots, correlation, cross- tabulation, group-wise averages, etc. as appropriate. Explain any interesting relationships. Absolute Log Error and Number of Occurrences Per Month We compared amount of absolute error based monthly average and found out that the error could be cyclical for the year, where the error dips during the Spring and Summer months and rises during the Winter months. non_zero_mask = data['calculatedfinishedsquarefeet'] > 0 tax = data[non_zero_mask]['taxamount'] sqft = data[non_zero_mask]['calculatedfinishedsquarefeet'] tax_per_sqft = tax / sqft tax_per_sqft = tax_per_sqft[tax_per_sqft<10] sns.distplot(tax_per_sqft, color="#34495e", kde=False) plt.title('Tax Per Square Feet', fontsize=15) plt.xlabel('Ratio', fontsize=15) plt.ylabel('Count', fontsize=15);
  • 52. 1/12/2018 lab1 http://localhost:8888/notebooks/Documents/GitHub/data_mining_group_project-master/notebooks/Lab%201/lab1.ipynb 52/63 We compared amount of transactions based monthly average and the transactions are the highest during the Spring, Summer, and Fall seasons possibly due to an optimal time to sell property. The transactions are at its lowest during the Winter season. For a cross comparison, we have high number of transactions during the Spring and Summer seasons while the log error is relatively low and we have low number of transactions during the Winter season while the log error is relatively high. In [85]: Number of Transactions and Mean Absolute Log Error Per Day of the Week Saturdays and Sundays are non-work days, hence why there is a dip in absolute log error and number of transactions For the workdays, Friday has the most transactions while Monday has the least. months = train_data['transactiondate'].dt.month month_names = ['January','February','March','April','May','June','July','August', train_data['abs_logerror'] = train_data['logerror'].abs() f, (ax0, ax1) = plt.subplots(nrows=1, ncols=2, figsize=[17, 7]) per_month = train_data.groupby(months)["abs_logerror"].mean() per_month.index = month_names ax0.set_title('Average Log Error Across Month Of 2016') ax0.set_xlabel('Month Of The Year', fontsize=15) ax0.set_ylabel('Log Error', fontsize=15) sns.pointplot(x=per_month.index, y=per_month, color="#34495e", ax=ax0) per_month = train_data.groupby(months)["logerror"].count() per_month.index = month_names ax1.set_title('No Of Occurunces per month In 2016') ax1.set_xlabel('Month Of The Year', fontsize=15) ax1.set_ylabel('Nimber of Occurences', fontsize=15) sns.barplot(x=per_month.index, y=per_month, color="#34495e", ax=ax1);
  • 53. 1/12/2018 lab1 http://localhost:8888/notebooks/Documents/GitHub/data_mining_group_project-master/notebooks/Lab%201/lab1.ipynb 53/63 For the workdays, Monday has relatively the most log errors while Friday has relatively the least log errors. For cross analysis, Monday has the least transactions with the most error while Friday has the most transactions with the least errors. Sunday and Saturday are special cases and does not have substantial evidence to provide any trends. In [86]: Continuous Variable Correlation Heatmap Heatmap of correlations are represented where the warmer colors are highly correlated, white is non-correlated, and colder colors are negatively correlated. We see that calculated finished square feet is correlated with finished square feet, due to collinearity. Tax amounts and year built are also highly correlated to finished square feet as well as with one another. Latitude and longitude are negatively correlated with each other possibly because the beachfront properties are more expensive. weekday = train_data['transactiondate'].dt.weekday weekdays = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'S abs_logerror = train_data['logerror'].abs() f, (ax0, ax1) = plt.subplots(nrows=1, ncols=2, figsize=[17, 7]) to_plot = abs_logerror.groupby(weekday).count() to_plot.index = weekdays to_plot.plot(color="#34495e", linewidth=4, ax=ax0) ax0.set_title('Number of Transactions Per Day') ax0.set_ylabel('Number of Transactions', fontsize=15) ax0.set_xlabel('Day', fontsize=15) to_plot = abs_logerror.groupby(weekday).mean() to_plot.index = weekdays to_plot.plot(color="#34495e", linewidth=4, ax=ax1) ax1.set_title('Mean Absolute Log Error Per Day') ax1.set_ylabel('Mean Absolute Log Error', fontsize=15) ax1.set_xlabel('Day', fontsize=15);
  • 54. 1/12/2018 lab1 http://localhost:8888/notebooks/Documents/GitHub/data_mining_group_project-master/notebooks/Lab%201/lab1.ipynb 54/63 In [87]: Longitude and Latitude Data Points train_data = data[~data['logerror'].isnull()] continuous_vars = variables[variables['type'].isin(['ratio', 'interval'])].index continuous_vars = continuous_vars[continuous_vars.isin(data.columns)] continuous_vars = continuous_vars.sort_values() corrs = train_data[continuous_vars].corr() fig, ax = plt.subplots(figsize=(10, 10)) sns.heatmap(corrs, ax=ax) plt.title("Variables correlation map", fontsize=20) plt.xlabel('Continuous Variables', fontsize=15) plt.ylabel('Continuous Variables', fontsize=15);
  • 55. 1/12/2018 lab1 http://localhost:8888/notebooks/Documents/GitHub/data_mining_group_project-master/notebooks/Lab%201/lab1.ipynb 55/63 From a simple graph, we can see the shoreline of California as well as possible areas of obstruction, such as mountains that prevent property growth in those areas. The majority of properties are in the center to upper left of the graph. In [88]: Number of Stories vs Year Built As architectural feats improved, we started to see more properties with 2 or more stories by 1950. The number of one story properties also increased during that time. The baby boomers, the end of WWII and readily available steel, and mortgage incentives may be the cause of the increase of more properties being built as well as more stories per property. Note: because we filled in missing values as the median value, the 1965 spike in the data is artificial until we use other methods to impute year built. <matplotlib.figure.Figure at 0x114670f10> plt.figure(figsize=(12,12)); sns.jointplot(x=data.latitude.values, y=data.longitude.values, size=10, color="#34 plt.ylabel('Longitude', fontsize=15) plt.xlabel('Latitude', fontsize=15) plt.title('Longitude and Latitude Data Points', fontsize=15);
  • 56. 1/12/2018 lab1 http://localhost:8888/notebooks/Documents/GitHub/data_mining_group_project-master/notebooks/Lab%201/lab1.ipynb 56/63 In [89]: Explore Attributes and Class 10 points Description: Identify and explain interesting relationships between features and the class you are trying to predict (i.e., relationships with variables and the target classification). Correlation of Continuous Variables and Log Error (Target Variable) We see that calculatedfinishedsquarefeet has the highest correlation with log error (0.04) while price per square feet has the highest negative correlation with log error (-0.02). taxvaluedollarcnt has relatively low correlation with log error. We choose to further explore finishedsquarefeet12 and its relationship with log error. fig,ax1= plt.subplots() fig.set_size_inches(20,10) yearMerged = data.groupby(['yearbuilt', 'numberofstories'])["parcelid"].count().u yearMerged = yearMerged.loc[1900:] yearMerged.index.name = 'Year Built' plt.title('Number of Stories Per Year Built', fontsize=15) plt.ylabel('Count', fontsize=15); yearMerged.plot(ax=ax1, linewidth=4);
  • 57. 1/12/2018 lab1 http://localhost:8888/notebooks/Documents/GitHub/data_mining_group_project-master/notebooks/Lab%201/lab1.ipynb 57/63 In [90]: train_data = data[~data['logerror'].isnull()] continuous_vars = variables[variables['type'].isin(['ratio', 'interval'])].index continuous_vars = continuous_vars[continuous_vars.isin(data.columns)] continuous_vars = continuous_vars[~continuous_vars.isin(['logerror', 'transactiond labels = [] values = [] for column in continuous_vars: labels.append(column) values.append(train_data[column].corr(train_data['logerror'])) corr = pd.DataFrame({'labels':labels, 'values':values}).fillna(0.) corr = corr.sort_values(by='values') labels = corr['labels'].values values = corr['values'].values fig, ax = plt.subplots(figsize=(10,10)) plt.barh(range(len(labels)), values, color="#34495e") plt.title("Correlation of Continuous Variables", fontsize=15); plt.xlabel('Correlation', fontsize=15) plt.ylabel('Continuous Variable', fontsize=15) ax.set_yticks(range(len(labels))) ax.set_yticklabels(labels, rotation='horizontal');
  • 58. 1/12/2018 lab1 http://localhost:8888/notebooks/Documents/GitHub/data_mining_group_project-master/notebooks/Lab%201/lab1.ipynb 58/63 Scatterplot of Log Error and Calculated Finished Square Feet We are plotting our best correlated variable calculatedfinishedsquarefeet against the logerror. We don't see any linear relationship from the scatter plot below even though it is evenly distributed. In [91]: New Features 5 points Description: Are there other features that could be added to the data or created from existing features? Which ones? column = "calculatedfinishedsquarefeet" train_data = data[~data['logerror'].isnull()] sns.jointplot(train_data[column], train_data['logerror'], size=10, color="#34495e plt.ylabel('Log Error', fontsize=12) plt.xlabel('Calculated Finished Square Feet', fontsize=15) plt.title("Calculated Finished Square Feet Vs Log Error", fontsize=15);
  • 59. 1/12/2018 lab1 http://localhost:8888/notebooks/Documents/GitHub/data_mining_group_project-master/notebooks/Lab%201/lab1.ipynb 59/63 Tax Per Square Feet We created a tax per square feet feature. It is negatively correlated with log error and we hope that it will add value to a predictive model. In [92]: City zip code details The Zillow dataset has a variable: 'regionidcity' which is a numerical ID, representing the city in which the property is located (if any). We don't have a string variable showing the city name. We found a government dataset publicly available on the internet, containing all zip codes as well as other information associated with each zip code. We have downloaded the dataset from here: http://federalgovernmentzipcodes.us (http://federalgovernmentzipcodes.us) and joined it with our dataset with the cell below. This will give us the actual name of the cities, zip code type and location type. New Variables Joined: zipcode_type Standard, PO BOX Only, Unique, Military (implies APO or FPO) - Zip code type may provide useful insight towards prediction city USPS official city name(s) - this will distinguish one county from another that was lacking in the original dataset location_type Primary, Acceptable, Not Acceptable - because these are all valid location properties, they will most likely be acceptable. In [93]: Out[92]: ('Correlation with log error:', -0.014065552662672554) The zips dataset has 81831 rows and 4 columns The merged dataset has 3857451 rows and 53 columns non_zero_mask = data['calculatedfinishedsquarefeet'] > 0 tax = data[non_zero_mask]['taxamount'] sqft = data[non_zero_mask]['calculatedfinishedsquarefeet'] data['price_per_sqft'] = tax / sqft 'Correlation with log error:', data['price_per_sqft'].corr(data['logerror']) # data from http://federalgovernmentzipcodes.us zips = pd.read_csv('../input/free-zipcode-database.csv', low_memory=False) zips = zips[['Zipcode','ZipCodeType','City','LocationType']] zips.columns = ['zipcode', 'zipcode_type', 'city', 'location_type'] assert np.all(~zips.isnull()) zips = zips.rename(columns={'zipcode':'regionidzip'}) data = pd.merge(data, zips, how='left', on='regionidzip') print('The zips dataset has %d rows and %d columns' % zips.shape) print('The merged dataset has %d rows and %d columns' % data.shape)
  • 60. 1/12/2018 lab1 http://localhost:8888/notebooks/Documents/GitHub/data_mining_group_project-master/notebooks/Lab%201/lab1.ipynb 60/63 Table of New Variables Just focusing on the new features added to the dataset, here are the value types and descriptions. In [94]: Other Ideas For New Features Other features that we thought about that we could include in the future are last remodel date of kitchen or bathroom, key words in the description of overpriced Zestimates or underpriced Zestimates, and how close a home is to a grocery store, Starbucks, a mall, or a place of interest. A recently remodeled home could raise the actual sale price much higher than the Zestimate. Certain words in the listing description could be associated with lower sale prices or people who bid a higher sale price. Lastly the walkability, or how close a home is to a grocery store, Starbucks, a mall, or place of interest could increase the final sale price as well. Exceptional Work 10 points Description: You have free reign to provide additional analyses. One idea: implement dimensionality reduction, then visualize and interpret the results. Categorical Feature Importance Out[94]: Variable Type Scale Description city nominal [APO, WHISKEYTOWN, nan, REDDING, FPO, ... (239 More)] USPS offical city name(s) location_type nominal [PRIMARY, nan, ACCEPTABLE, NOT ACCEPTABLE] Primary, Acceptable, Not Acceptable price_per_sqft ratio (0, 11911) Tax per SQFT zipcode_type nominal [MILITARY, PO BOX, nan, STANDARD, UNIQUE] Standard, PO BOX Only, Unique, Military(implies APO or FPO) variables_description = [ ['price_per_sqft', 'ratio', 'TBD', 'Tax per SQFT'] ,['zipcode_type', 'nominal', 'TBD', 'Standard, PO BOX Only, Unique, Military(impl ,['city', 'nominal', 'TBD', 'USPS offical city name(s)'] ,['location_type', 'nominal', 'TBD', 'Primary, Acceptable, Not Acceptable'] ] new_variables = pd.DataFrame(variables_description, columns=['name', 'type', 'sca new_variables = new_variables.set_index('name') new_variables = new_variables.loc[new_variables.index.isin(data.columns)] variables = variables.append(new_variables) output_variables_table(new_variables)
  • 61. 1/12/2018 lab1 http://localhost:8888/notebooks/Documents/GitHub/data_mining_group_project-master/notebooks/Lab%201/lab1.ipynb 61/63 g p According to a random forest model with seed 0, region id zip, bedroom count, census tract and block, and region id neighborhood explain the most variance for log error. Even though the importance of the other variables are relatively lower, they could provide greater importance if we add interaction terms or use a different nonlinear model. In [95]: Continuous Feature Importance from sklearn import ensemble train_data = data[~data['logerror'].isnull()] categorical_vars = variables[variables['type'].isin(['ordinal', 'nominal'])].index categorical_vars = categorical_vars[categorical_vars.isin(data.columns)] categorical_vars = categorical_vars[~categorical_vars.isin(['parcelid', 'logerror X = train_data[categorical_vars] # remove string types categorical_vars = categorical_vars[X.dtypes != object] X = X[categorical_vars] y = train_data['logerror'] model = ensemble.ExtraTreesRegressor(random_state=0) model.fit(X.fillna(0), y) index = pd.Index(categorical_vars, name='Variable Name') importance = pd.Series(model.feature_importances_, index=index) importance.sort() importance.plot(kind='barh', color="#34495e") plt.title('Categorical Feature Importance') plt.xlabel('Importance', fontsize=15);
  • 62. 1/12/2018 lab1 http://localhost:8888/notebooks/Documents/GitHub/data_mining_group_project-master/notebooks/Lab%201/lab1.ipynb 62/63 According to the linear regression model, the feature tax delinquency year explains the most variance of log error. Even though the importance of the other variables are relatively lower, they could provide greater importance if we add interaction terms or a higher order polynomials. In [96]: Exporting the cleaned datasets from sklearn.linear_model import LinearRegression train_data = data[~data['logerror'].isnull()] continuous_vars = variables[variables['type'].isin(['ratio', 'interval'])].index continuous_vars = continuous_vars[continuous_vars.isin(data.columns)] continuous_vars = continuous_vars[~continuous_vars.isin(['parcelid', 'logerror', X = train_data[continuous_vars] y = train_data['logerror'] model = LinearRegression() model.fit(X.fillna(0), y) index = pd.Index(continuous_vars, name='Variable Name') importance = pd.Series(np.abs(model.coef_), index=index) importance.sort() importance.plot(kind='barh', color="#34495e") plt.title('Continuous Feature Importance') plt.xlabel('Importance', fontsize=15);
  • 63. 1/12/2018 lab1 http://localhost:8888/notebooks/Documents/GitHub/data_mining_group_project-master/notebooks/Lab%201/lab1.ipynb 63/63 In [97]: References Kernels from Kaggle competition: https://www.kaggle.com/c/zillow-prize-1/kernels (https://www.kaggle.com/c/zillow-prize-1/kernels) Pandas cookbook: https://pandas.pydata.org/pandas-docs/stable/cookbook.html (https://pandas.pydata.org/pandas-docs/stable/cookbook.html) Stackoverflow pandas questions: https://stackoverflow.com/questions/tagged/pandas (https://stackoverflow.com/questions/tagged/pandas) test_mask = data['logerror'].isnull() train_data = data[~test_mask] test_data = data[test_mask] train_data.to_csv('../datasets/train.csv', index=False) test_data.to_csv('../datasets/test.csv', index=False) variables.index.name = 'name' variables.to_csv('../datasets/variables.csv', index=True)