Data pre-processing and Exploration on 2016 Melbourne housing market by using R
Predicting Property Price Melbourne
Shuai Gao (s3596156)
4 April 2018
Introduction
The purpose of this assignment is to build classifiers that predict whether a property can be sold for more than 2000 AUD per square meter in a year, using the dataset “2016 Melbourne housing market”. The dataset was sourced from kaggle (https://www.kaggle.com/). This report is organized as follows. Section 2 describes the dataset and its attributes. Section 3 covers the data pre-processing. Section 4 explores each attribute and the inter-relationships between attributes. The last section summarizes the findings.
Data Set
This dataset is provided by kaggle (https://www.kaggle.com/anthonypino/melbourne-housing-market) and includes 34857 observations and 21 variables.
Target Feature
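The response feature is square_price2000, which is TRUE when the price per square meter (Price divided by the sum of Landsize and BuildingArea) is 2000 or more, and FALSE otherwise.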
Descriptive Features
The variable description is provided by Tony Pino:
Suburb: Suburb
Address: Address
Rooms: Number of rooms
Price: Price in dollars
Method: S - property sold; SP - property sold prior; PI - property passed in; PN - sold prior not disclosed; SN - sold not disclosed; NB - no bid; VB - vendor bid; W - withdrawn prior to auction; SA - sold after auction; SS - sold after auction price not disclosed; N/A - price or highest bid not available.
Type: br - bedroom(s); h - house, cottage, villa, semi, terrace; u - unit, duplex; t - townhouse; dev site - development site; o res - other residential.
SellerG: Real Estate Agent
Date: Date sold
Distance: Distance from CBD
Regionname: General Region (West, North West, North, North east …etc)
Propertycount: Number of properties that exist in the suburb.
Bedroom2 : Scraped # of Bedrooms (from different source)
Bathroom: Number of Bathrooms
Car: Number of carspots
Landsize: Land Size
BuildingArea: Building Size
YearBuilt: Year the house was built
CouncilArea: Governing council for the area
Lattitude: Self explanatory
Longtitude: Self explanatory
Since the purpose of this assignment is to evaluate the price of a property from the existing data, we only use variables that are linked to our topic: the number of Rooms, the Type of the property, the selling Method, the Distance from the CBD, Bedroom2 (scraped number of bedrooms from another source), the number of bathrooms, the number of carspots, the year the house was built, Regionname and Propertycount in the same suburb. Since Bedroom2 is sourced from a different dataset, we leave it aside for now and check later whether it has a similar effect to Rooms. For more details, see Domain (https://www.domain.com.au/).
Data Pre-processing
Preliminaries
In this project, we used the following R packages.
library(tidyverse)
library(knitr)
library(mlr)
library(cowplot)
First, we read the data into R for processing. The file already provides a header row, so we do not need to supply column names ourselves.
price <- read.csv('Melbourne.csv', stringsAsFactors = FALSE, header = TRUE)
Data Cleaning and Transformation
After applying the str and summarizeColumns functions, we found that a few variables are not directly usable for our topic. For example, raw prices cannot be compared fairly across properties, because land size and building size differ from one property to the next. To estimate the price of a property, we combine land size and building size to construct a new column that holds the price per square meter.
str(price)
## 'data.frame': 34857 obs. of 21 variables:
## $ Suburb : chr "Abbotsford" "Abbotsford" "Abbotsford" "Abbotsford" ...
## $ Address : chr "68 Studley St" "85 Turner St" "25 Bloomburg St" "18/659 Victoria St" ...
## $ Rooms : int 2 2 2 3 3 3 4 4 2 2 ...
## $ Type : chr "h" "h" "h" "u" ...
## $ Price : int NA 1480000 1035000 NA 1465000 850000 1600000 NA NA NA ...
## $ Method : chr "SS" "S" "S" "VB" ...
## $ SellerG : chr "Jellis" "Biggin" "Biggin" "Rounds" ...
## $ Date : chr "3/09/2016" "3/12/2016" "4/02/2016" "4/02/2016" ...
## $ Distance : chr "2.5" "2.5" "2.5" "2.5" ...
## $ Postcode : chr "3067" "3067" "3067" "3067" ...
## $ Bedroom2 : int 2 2 2 3 3 3 3 3 4 3 ...
## $ Bathroom : int 1 1 1 2 2 2 1 2 1 2 ...
## $ Car : int 1 1 0 1 0 1 2 2 2 1 ...
## $ Landsize : int 126 202 156 0 134 94 120 400 201 202 ...
## $ BuildingArea : num NA NA 79 NA 150 NA 142 220 NA NA ...
## $ YearBuilt : int NA NA 1900 NA 1900 NA 2014 2006 1900 1900 ...
## $ CouncilArea : chr "Yarra City Council" "Yarra City Council" "Yarra City Council" "Yarra City Council" ...
## $ Lattitude : num -37.8 -37.8 -37.8 -37.8 -37.8 ...
## $ Longtitude : num 145 145 145 145 145 ...
## $ Regionname : chr "Northern Metropolitan" "Northern Metropolitan" "Northern Metropolitan" "Northern Metropolitan" ...
## $ Propertycount: chr "4019" "4019" "4019" "4019" ...
summarizeColumns(price) %>% knitr::kable(caption = 'Feature Summary before Data Preprocessing')
Feature Summary before Data Preprocessing
name           type       na     mean           disp          median       mad           min
Suburb         character  0      NA             9.757868e-01  NA           NA
Address        character  0      NA             9.998279e-01  NA           NA
Rooms          integer    0      3.031012e+00   9.699329e-01  3.0000       1.482600e+00
Type           character  0      NA             3.120464e-01  NA           NA            3580.00000
Price          integer    7610   1.050173e+06   6.414671e+05  870000.0000  4.299540e+05  85000.00000
Method         character  0      NA             4.335714e-01  NA           NA
SellerG        character  0      NA             9.036349e-01  NA           NA
Date           character  0      NA             9.678974e-01  NA           NA
Distance       character  0      NA             9.592621e-01  NA           NA
Postcode       character  0      NA             9.757868e-01  NA           NA
Bedroom2       integer    8217   3.084647e+00   9.806897e-01  3.0000       1.482600e+00
Bathroom       integer    8226   1.624798e+00   7.242120e-01  2.0000       1.482600e+00
Car            integer    8728   1.728845e+00   1.010771e+00  2.0000       1.482600e+00
Landsize       integer    11810  5.935990e+02   3.398842e+03  521.0000     3.113460e+02
BuildingArea   numeric    21115  1.602564e+02   4.012671e+02  136.0000     6.078660e+01
YearBuilt      integer    19306  1.965290e+03   3.732818e+01  1970.0000    4.447800e+01  1196.00000
CouncilArea    character  0      NA             8.945692e-01  NA           NA
Lattitude      numeric    7976   -3.781063e+01  9.027890e-02  -37.8076     8.077200e-02
Longtitude     numeric    7976   1.450019e+02   1.201688e-01  145.0078     1.012912e-01  144.42379
Regionname     character  0      NA             6.604412e-01  NA           NA
Propertycount  character  0      NA             9.757868e-01  NA           NA
We removed excess white space from all character features.
# The columns were read with stringsAsFactors = FALSE, so they are character, not factor
char_cols <- sapply(price, is.character)
price[char_cols] <- lapply(price[char_cols], trimws)
We estimate the price of a property based on its price per square meter, to avoid the side effect of differences in land size and building size. We assume that Landsize represents the area of land with no building constructed on it, and that BuildingArea represents the size of the building. We assume that a property has information on land size, on building area, or on both. If a property has neither, we treat the entry as invalid (its missing sizes are set to 0).
Under these assumptions, a price per square meter that is zero, infinite or missing means that the property has no data on land size or building area, has no price, or both.
price$Landsize[is.na(price$Landsize)] <- 0
price$BuildingArea[is.na(price$BuildingArea)] <- 0
price <- data.frame(price, square_price = price$Price / (price$Landsize + price$BuildingArea))
price <- price %>% filter(square_price > 0 & square_price != Inf)
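The effect of this step can be traced on a small example. The sketch below uses base R and a made-up data frame (the values are illustrative, not taken from the dataset); it assumes the same rule as above, namely that a missing size counts as 0.

```r
# Toy data frame; column names follow the report, values are made up.
toy <- data.frame(
  Price        = c(1000000, 600000, NA, 500000),
  Landsize     = c(200, NA, 150, NA),
  BuildingArea = c(NA, 100, 80, NA)
)

# Treat a missing size as 0, as the report does.
toy$Landsize[is.na(toy$Landsize)] <- 0
toy$BuildingArea[is.na(toy$BuildingArea)] <- 0

# Price per square meter: a total area of 0 yields Inf, a missing price yields NA.
toy$square_price <- toy$Price / (toy$Landsize + toy$BuildingArea)

# Keep only rows with a finite, positive price per square meter.
keep  <- !is.na(toy$square_price) & is.finite(toy$square_price) & toy$square_price > 0
valid <- toy[keep, ]
print(valid)
```

Only the first two rows survive: the third has no price and the fourth has no size information at all.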
In general, the age of a property has a strong impact on its price. We use the sale date and the year built to compute the age of each property. (Negative ages are possible, since some properties may be pre-ordered and sold before they are built. Missing values are also possible, since some properties are unsold or have no record of the year built.)
price$Date <- sapply(price$Date, function(x) strsplit(x, "/")[[1]][3])
price$YearBuilt <- as.integer(price$YearBuilt)
price$Date <- as.integer(price$Date)
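As a rough illustration of the year extraction, the sketch below applies the same split on "/" to a few made-up dates and compares it with parsing the string as a date. The day/month/year format is an assumption based on the sample values shown by str; the variable names are hypothetical.

```r
# Made-up sale dates in day/month/year form, as in the Date column.
sale_date <- c("3/09/2016", "4/02/2016", "12/11/2017")

# Approach used in the report: take the third "/"-separated field.
year_split <- as.integer(sapply(strsplit(sale_date, "/"), `[`, 3))

# Alternative: parse the string as a date, then format the year.
year_parsed <- as.integer(format(as.Date(sale_date, format = "%d/%m/%Y"), "%Y"))

# Age is the sale year minus the build year; a missing build year stays NA
# and a property sold before its build year gets a negative age.
year_built <- c(1900L, NA, 2018L)
age <- year_split - year_built
print(age)
```

Both approaches give the same years here. The date-parsing variant would additionally turn malformed strings into NA instead of passing them through.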
We want to build classifiers that distinguish properties by their price per square meter. If the price per square meter is 2000 or more, we label the observation TRUE; otherwise, FALSE.
price <- data.frame(price, "square_price2000" = price$square_price >= 2000, year = price$Date - price$YearBuilt)
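The comparison in the code uses >=, so a property priced at exactly 2000 per square meter falls into the TRUE class. A minimal sketch with made-up values shows the boundary behaviour:

```r
# Made-up prices per square meter around the threshold.
square_price <- c(1500, 2000, 2500.5)

# Same comparison as in the report: the threshold itself counts as TRUE.
square_price2000 <- square_price >= 2000
print(square_price2000)
```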
We keep only the variables that are relevant to the purpose of this analysis, as listed in the previous section.
price <- subset(price, select = c("Rooms", "Type", "Method", "Distance", "Bedroom2", "Bathroom", "Car", "year", "Regionname", "Propertycount", "square_price2000"))
The age of the property is widely spread, so it is hard to analyze on a numeric scale. To make it easier to analyze, we group the age into intervals. We apply the same kind of grouping to Propertycount, Distance, Bathroom, Rooms, Car and Bedroom2.
The breaks were chosen so that each level contains a roughly equal number of observations.
breaks <- c(-5, 10, 30, 50, 100, 900)
price$year <- cut(price$year, breaks = breaks)
breaks1 <- c(0, 5000, 10000, 15000, 25000)
price$Propertycount <- as.numeric(price$Propertycount)
price$Propertycount <- cut(price$Propertycount, breaks = breaks1)  # cut() already returns a factor
breaks2 <- c(0, 5, 10, 15, 20, 50)
price$Distance <- as.numeric(price$Distance)
price$Distance <- cut(price$Distance, breaks = breaks2)
price$Rooms <- ifelse(price$Rooms > 5, "6-12", price$Rooms)
price$Car <- ifelse(price$Car > 4, "5-10", price$Car)
price$Bathroom <- ifelse(price$Bathroom > 4, "5-9", price$Bathroom)
price$Bedroom2 <- ifelse(price$Bedroom2 > 5, "6-12", price$Bedroom2)
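To see how cut and ifelse group values, and how the counts per level can be checked, consider the sketch below. The ages and room counts are made up. The quantile-based breaks are a hypothetical alternative to hand-picked breaks and are not what the report used.

```r
# Made-up property ages.
age <- c(-2, 3, 8, 15, 25, 40, 45, 60, 90, 120, 150, 400)

# Fixed breaks as in the report: intervals are open on the left, closed on the right.
fixed <- cut(age, breaks = c(-5, 10, 30, 50, 100, 900))
print(table(fixed))

# Hypothetical alternative: quantile-based breaks give near-equal counts per level.
q_breaks   <- quantile(age, probs = seq(0, 1, by = 0.25))
equal_freq <- cut(age, breaks = q_breaks, include.lowest = TRUE)
print(table(equal_freq))

# Capping a count variable: values above 5 are merged into one "6-12" level.
rooms     <- c(1, 3, 7)
rooms_grp <- factor(ifelse(rooms > 5, "6-12", rooms))
print(levels(rooms_grp))
```

Because the intervals are open on the left, a value equal to the lowest break (for example a Distance of exactly 0) becomes NA unless include.lowest = TRUE is set.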
After pre-processing, we can count the observations in each level. If the counts are equal or similar across levels, the variable meets the analysis requirement, and this appears to be the case for every variable. If not, we go back to the previous step and adjust the breaks until the counts in each level are equal or similar.
price[, sapply(price, is.character)] <- lapply(price[, sapply(price, is.character)], factor)
price$square_price2000 <- as.factor(price$square_price2000)
summarizeColumns(price) %>% kable(caption = 'Feature Summary after Data Preprocessing')
Feature Summary after Data Preprocessing
name              type    na    mean  disp       median  mad  min   max    nlevs
Rooms             factor  0     NA    0.5361869  NA      NA   132   8517   6
Type              factor  0     NA    0.2121658  NA      NA   1395  14467  3
Method            factor  0     NA    0.3458585  NA      NA   133   12012  5
Distance          factor  5     NA    NA         NA      NA   1816  6098   5
Bedroom2          factor  6     NA    NA         NA      NA   13    8529   7
Bathroom          factor  9     NA    NA         NA      NA   16    9078   6
Car               factor  320   NA    NA         NA      NA   240   8491   6
year              factor  6807  NA    NA         NA      NA   1452  3714   5
Regionname        factor  0     NA    0.7066383  NA      NA   80    5387   8
Propertycount     factor  0     NA    0.5810597  NA      NA   1030  7693   4
square_price2000  factor  0     NA    0.4737788  NA      NA   8700  9663   2
str( price )
## 'data.frame': 18363 obs. of 11 variables:
## $ Rooms : Factor w/ 6 levels "1","2","3","4",..: 2 2 3 3 4 2 3 2 2 3 ...
## $ Type : Factor w/ 3 levels "h","t","u": 1 1 1 1 1 1 1 1 1 1 ...
## $ Method : Factor w/ 5 levels "PI","S","SA",..: 2 2 4 1 5 2 2 2 2 5 ...
## $ Distance : Factor w/ 5 levels "(0,5]","(5,10]",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Bedroom2 : Factor w/ 7 levels "0","1","2","3",..: 3 3 4 4 4 3 5 3 4 4 ...
## $ Bathroom : Factor w/ 6 levels "0","1","2","3",..: 2 2 3 3 2 2 3 2 2 3 ...
## $ Car : Factor w/ 6 levels "0","1","2","3",..: 2 1 1 2 3 1 1 3 3 3 ...
## $ year : Factor w/ 5 levels "(-5,10]","(10,30]",..: NA 5 5 NA 1 NA 5 5 5 2 ...
## $ Regionname : Factor w/ 8 levels "Eastern Metropolitan",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ Propertycount : Factor w/ 4 levels "(0,5e+03]","(5e+03,1e+04]",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ square_price2000: Factor w/ 2 levels "FALSE","TRUE": 2 2 2 2 2 2 2 2 2 2 ...
sapply( price[ sapply(price,is.factor)], table)
##
## (0,5e+03] (5e+03,1e+04] (1e+04,1.5e+04] (1.5e+04,2.5e+04]
## 5903 7693 3737 1030
##
## $square_price2000
##
## FALSE TRUE
## 8700 9663
Data Exploration
Categorical Features
Rooms
According to the bar chart of Rooms, the distribution of room counts is roughly bell-shaped. The second chart, which shows the proportion of properties priced over 2000 AUD per square meter at each room count, is skewed to the right. In this dataset, properties with 2 or fewer rooms are the most likely to sell for more than 2000 AUD per square meter, and the more rooms a property has, the lower the chance that its price exceeds 2000 AUD per square meter. Therefore, the number of rooms would be a predictive feature.
There are two possible explanations for this pattern. One is that the willingness to pay per square meter is higher for smaller properties. The other is that, since the total price of a smaller property is lower than that of a larger one, buyers with limited purchasing power concentrate on smaller properties.
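The two quantities behind such a pair of charts, the count per level and the share of TRUE labels per level, can be computed as in the following base R sketch. The data frame is made up and only mimics the structure of the processed data.

```r
# Made-up data with the same column names as the processed data set.
toy <- data.frame(
  Rooms            = factor(c("1", "1", "2", "2", "2", "3", "3", "4")),
  square_price2000 = c(TRUE, TRUE, TRUE, TRUE, FALSE, TRUE, FALSE, FALSE)
)

# Number of properties at each room count (the bar chart of the feature).
counts <- table(toy$Rooms)

# Share of properties at each room count with square_price2000 == TRUE
# (the mean of a logical vector is the proportion of TRUE values).
share_true <- tapply(toy$square_price2000, toy$Rooms, mean)

print(counts)
print(share_true)
```

Passing counts and share_true to barplot() would give a rough equivalent of the two charts. The same pattern applies to the other categorical features in this section.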
Type
Properties categorized as unit/duplex or townhouse tend to have a greater chance of selling for more than 2000 AUD per square meter. For these two types, the share of properties sold above 2000 AUD per square meter is dominant.
However, we also have to take the sample size into consideration when predicting. Based on the bar chart of Type, the proportion of townhouse and unit properties is quite small. For now, we set this aside for further consideration.
Method
Based on the charts, although fewer properties were sold through vendor bid than through most of the other selling methods, the results for vendor bid stand out: it has the highest chance of all of achieving a price per square meter greater than 2000 AUD.
On the other hand, the smallest proportion of properties was sold through auction, and those properties had the lowest chance of selling for over 2000 AUD per square meter.
Distance
Based on the two charts, the price per square meter is clearly negatively correlated with the distance from the CBD: the greater the distance, the lower the price. Therefore, the Distance variable can be a predictive feature.
Car Park Number
The trend for the number of car parks in relation to the price per square meter is very similar to that of the distance from the CBD. Based on the two charts, the price per square meter is clearly negatively correlated with the number of car parks: the more car parks, the lower the price. Therefore, the Car variable can be a predictive feature.
Bathroom
There is no clear relationship between the price and the number of bathrooms. Therefore, this would not be a predictive feature.
Age
We can see that a large proportion of properties aged between 100 and 900 years sold for over 2000 AUD per square meter. We assume this is because properties of such age are often associated with historical sites.
Regionname
In this graph, we can see that properties in the Southern Metropolitan area are in high demand in the Melbourne housing market.
Propertycount
For property counts from 0 to 15000, the proportion of properties priced over 2000 AUD per square meter goes up; in the highest level, (15000, 25000], the proportion drops sharply. We assume that buyers have specific preferences regarding living density.
Multivariate Visualisation
Rooms Num vs BedRoom2
Since the Bedroom2 data was drawn from a different source, we chose not to plot it on its own. Instead, we compare the Bedroom2 data with the Rooms data. According to the graph, there is little difference between these two variables, and the small differences may result from different counting standards. Therefore, we removed Bedroom2.
price$Bedroom2 <- NULL
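A numeric check of how closely the two columns agree could complement the visual comparison made before the column is dropped. The sketch below uses made-up vectors; the names rooms and bedroom2 are stand-ins for the two columns.

```r
# Made-up values standing in for the Rooms and Bedroom2 columns.
rooms    <- c(2, 2, 3, 3, 4, 4, 3, 2)
bedroom2 <- c(2, 2, 3, 3, 3, 4, 3, NA)

# Cross-tabulation: agreement shows up as counts on the diagonal.
cross <- table(Rooms = rooms, Bedroom2 = bedroom2, useNA = "ifany")
print(cross)

# Share of rows where both sources report the same number (missing values ignored).
agreement <- mean(rooms == bedroom2, na.rm = TRUE)
print(agreement)
```

A high agreement share suggests that Bedroom2 carries little information beyond Rooms, which is consistent with removing it.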
Distance, Car park number and Propertycount
According to the graph, properties at a shorter distance from the CBD (less than 10) sold at the highest price per square meter. The price per square meter increased with the number of properties in the suburb, but then dropped once that number grew beyond a certain point. The price also went down as the number of car parks increased. We can assume that a property at a distance of 0-10 from the CBD, with around 15000 properties in its suburb and a small number of car parks, is more likely to be sold for over 2000 AUD per square meter.
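The combined view can also be summarized numerically. The following sketch computes the share of TRUE labels for each combination of two grouped features on made-up data. Adding Propertycount to the right-hand side of the formula would extend it to three features.

```r
# Made-up data; the level labels mimic those produced by cut() in the pre-processing.
toy <- data.frame(
  Distance         = c("(0,5]", "(0,5]", "(0,5]", "(10,15]", "(10,15]", "(10,15]"),
  Car              = c("0", "0", "2", "0", "2", "2"),
  square_price2000 = c(TRUE, TRUE, FALSE, TRUE, FALSE, FALSE)
)

# Share of properties over the threshold for every Distance x Car combination.
share <- aggregate(square_price2000 ~ Distance + Car, data = toy, FUN = mean)
print(share)
```

Each row of share holds one combination of levels and the proportion of properties in it that are labelled TRUE.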
Summary
In this assignment, we computed the price per square meter to avoid the effect of differing land and building sizes, and we used the sale date and the year built to compute the age of each property. To serve the purpose of the analysis, we removed observations without a price, or with neither land size nor building area. For categorical features, we chose breaks so that each level contains a similar number of observations, based on each variable's level table. In the data exploration, we plotted every variable's relation with the price per square meter. We found that rooms, method, distance, car, year, region name and property count are potentially useful features for estimating the price classes.