Predicting Property Price Melbourne
Shuai Gao (s3596156)
4 April 2018
Introduction
The purpose of this assignment is to build classifiers that predict whether a property can be sold
for more than 2000 AUD per square meter in a year, using the dataset “2016 Melbourne housing
market”. The dataset was sourced from kaggle (https://www.kaggle.com/). This report is organized
as follows. In section 2, we discuss the dataset and its attributes. In section 3, we discuss the
data pre-processing. In section 4, we explore each attribute and the inter-relationships between
attributes. After these analyses, we summarize the findings in the last section.
Data Set
This dataset is provided by kaggle (https://www.kaggle.com/anthonypino/melbourne-housing-market).
It includes 34857 observations and 21 variables.
Target Feature
The response feature is square_price2000, which is given as:
square_price2000 = TRUE if Price / (Landsize + BuildingArea) >= 2000, FALSE otherwise.
Descriptive Features
The variable description is provided by Tony Pino:
Suburb: Suburb
Address: Address
Rooms: Number of rooms
Price: Price in dollars
Method: S - property sold; SP - property sold prior; PI - property passed in; PN - sold prior not
disclosed; SN - sold not disclosed; NB - no bid; VB - vendor bid; W - withdrawn prior to auction; SA
- sold after auction; SS - sold after auction price not disclosed; N/A - price or highest bid not
available.
Type: br - bedroom(s); h - house, cottage, villa, semi, terrace; u - unit, duplex; t - townhouse;
dev site - development site; o res - other residential.
SellerG: Real Estate Agent
Date: Date sold
Distance: Distance from CBD
Regionname: General Region (West, North West, North, North east …etc)
Propertycount: Number of properties that exist in the suburb.
Bedroom2 : Scraped # of Bedrooms (from different source)
Bathroom: Number of Bathrooms
Car: Number of carspots
Landsize: Land Size
BuildingArea: Building Size
YearBuilt: Year the house was built
CouncilArea: Governing council for the area
Lattitude: Self explanatory
Longtitude: Self explanatory
Since the purpose of this assignment is to evaluate the price of a property from the existing data,
we will only use variables that are linked to our topic: the number of rooms (Rooms), the type of
property (Type), the selling method (Method), the distance from the CBD (Distance), Bedroom2
(scraped number of bedrooms from another source), the number of bathrooms (Bathroom), the number
of car spots (Car), the year the house was built (YearBuilt), Regionname and the property count in
the same suburb (Propertycount). Since Bedroom2 comes from a different source, we set it aside for
now and later check whether it has a similar effect to Rooms. For more details, see Domain
(https://www.domain.com.au/).
Data Pre-processing
Preliminaries
In this project, we used the following R packages.
library(tidyverse)
library(knitr)
library(mlr)
library(cowplot)
First, we read the data into RStudio for processing. The file already provides a header row, so we
do not have to supply column names ourselves.
price <- read.csv('Melbourne.csv', stringsAsFactors = FALSE, header = TRUE)
Data Cleaning and Transformation
After applying the str and summarizeColumns functions, we found that a few variables are not
directly usable for our topic. For example, the raw price of a property cannot be compared across
properties, because land size and building size differ. In order to estimate the price of a
property, we combine land size and building size to construct a new column that holds the price
per square meter.
str(price)
## 'data.frame': 34857 obs. of 21 variables:
## $ Suburb : chr "Abbotsford" "Abbotsford" "Abbotsford" "Abbotsford" ...
## $ Address : chr "68 Studley St" "85 Turner St" "25 Bloomburg St" "18/659 Victoria St" ...
## $ Rooms : int 2 2 2 3 3 3 4 4 2 2 ...
## $ Type : chr "h" "h" "h" "u" ...
## $ Price : int NA 1480000 1035000 NA 1465000 850000 1600000 NA NA NA ...
## $ Method : chr "SS" "S" "S" "VB" ...
## $ SellerG : chr "Jellis" "Biggin" "Biggin" "Rounds" ...
## $ Date : chr "3/09/2016" "3/12/2016" "4/02/2016" "4/02/2016" ...
## $ Distance : chr "2.5" "2.5" "2.5" "2.5" ...
## $ Postcode : chr "3067" "3067" "3067" "3067" ...
## $ Bedroom2 : int 2 2 2 3 3 3 3 3 4 3 ...
## $ Bathroom : int 1 1 1 2 2 2 1 2 1 2 ...
## $ Car : int 1 1 0 1 0 1 2 2 2 1 ...
## $ Landsize : int 126 202 156 0 134 94 120 400 201 202 ...
## $ BuildingArea : num NA NA 79 NA 150 NA 142 220 NA NA ...
## $ YearBuilt : int NA NA 1900 NA 1900 NA 2014 2006 1900 1900 ...
## $ CouncilArea : chr "Yarra City Council" "Yarra City Council" "Yarra City Council" "Yarra City Council" ...
## $ Lattitude : num -37.8 -37.8 -37.8 -37.8 -37.8 ...
## $ Longtitude : num 145 145 145 145 145 ...
## $ Regionname : chr "Northern Metropolitan" "Northern Metropolitan" "Northern Metropolitan" "Northern Metropolitan" ...
## $ Propertycount: chr "4019" "4019" "4019" "4019" ...
summarizeColumns(price) %>% knitr::kable(caption = 'Feature Summary before Data Preprocessing')
Feature Summary before Data Preprocessing
name           type       na     mean           disp          median       mad           min
Suburb         character  0      NA             9.757868e-01  NA           NA
Address        character  0      NA             9.998279e-01  NA           NA
Rooms          integer    0      3.031012e+00   9.699329e-01  3.0000       1.482600e+00
Type           character  0      NA             3.120464e-01  NA           NA            3580.00000
Price          integer    7610   1.050173e+06   6.414671e+05  870000.0000  4.299540e+05  85000.00000
Method         character  0      NA             4.335714e-01  NA           NA
SellerG        character  0      NA             9.036349e-01  NA           NA
Date           character  0      NA             9.678974e-01  NA           NA
Distance       character  0      NA             9.592621e-01  NA           NA
Postcode       character  0      NA             9.757868e-01  NA           NA
Bedroom2       integer    8217   3.084647e+00   9.806897e-01  3.0000       1.482600e+00
Bathroom       integer    8226   1.624798e+00   7.242120e-01  2.0000       1.482600e+00
Car            integer    8728   1.728845e+00   1.010771e+00  2.0000       1.482600e+00
Landsize       integer    11810  5.935990e+02   3.398842e+03  521.0000     3.113460e+02
BuildingArea   numeric    21115  1.602564e+02   4.012671e+02  136.0000     6.078660e+01
YearBuilt      integer    19306  1.965290e+03   3.732818e+01  1970.0000    4.447800e+01  1196.00000
CouncilArea    character  0      NA             8.945692e-01  NA           NA
Lattitude      numeric    7976   -3.781063e+01  9.027890e-02  -37.8076     8.077200e-02
Longtitude     numeric    7976   1.450019e+02   1.201688e-01  145.0078     1.012912e-01  144.42379
Regionname     character  0      NA             6.604412e-01  NA           NA
Propertycount  character  0      NA             9.757868e-01  NA           NA
We removed the excess white space from all character features.
# The columns were read as character (stringsAsFactors = FALSE), so select them with is.character.
chr_cols <- sapply(price, is.character)
price[chr_cols] <- lapply(price[chr_cols], trimws)
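As a minimal sketch of the trimming step, the example below applies trimws to the character columns of a made-up two-row data frame. The data frame toy and its values are hypothetical and only serve to show that numeric columns are left untouched.

```r
# Toy data frame (hypothetical values) with stray spaces in a character column.
toy <- data.frame(
  Suburb = c(" Abbotsford", "Richmond  "),
  Rooms  = c(2L, 3L),
  stringsAsFactors = FALSE
)

# Logical index of the character columns.
is_chr <- sapply(toy, is.character)

# Trim leading and trailing white space in those columns only.
toy[is_chr] <- lapply(toy[is_chr], trimws)

print(toy)
```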
We estimate the value of a property from its price per square meter, to avoid the side effect of
differences in land size and building size. We assume that Landsize is the area of land with no
building constructed on it, and that BuildingArea is the size of the building. We also assume that
each property has a land size, a building area, or both. If a property has neither, we treat the
data entry as invalid (recorded as 0).
Based on these assumptions, a zero or missing price per square meter means that the property has
no land size or building area on record, has no price, or both.
price$Landsize[is.na(price$Landsize)] <- 0
price$BuildingArea[is.na(price$BuildingArea)] <- 0
price <- data.frame(price, square_price = price$Price / (price$Landsize + price$BuildingArea))
price <- price %>% filter(square_price > 0 & square_price != Inf)
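To make the effect of this step concrete, here is a small worked sketch in base R on a made-up data frame. The values are illustrative and not taken from the dataset, and the explicit is.na and is.finite checks are one way to mirror the filtering described above.

```r
# Toy data (hypothetical values) to illustrate the price-per-square-meter step.
toy <- data.frame(
  Price        = c(1000000, NA, 800000, 500000),
  Landsize     = c(200, 150, NA, 0),
  BuildingArea = c(NA, 100, 100, NA)
)

# Missing sizes are treated as 0, as in the report.
toy$Landsize[is.na(toy$Landsize)] <- 0
toy$BuildingArea[is.na(toy$BuildingArea)] <- 0

# Price divided by the combined land and building area.
toy$square_price <- toy$Price / (toy$Landsize + toy$BuildingArea)

# Keep rows with a finite, positive value. This drops the row with no price
# (NA) and the row with no size information (division by zero gives Inf).
toy_clean <- toy[!is.na(toy$square_price) &
                 is.finite(toy$square_price) &
                 toy$square_price > 0, ]

print(toy_clean)
```

Only the first and third rows survive: the second has no price and the fourth has no size information.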
In general, the age of a property has a strong influence on its price. We use the date sold and the
year built to compute the age of the property. (Negative ages are possible, since some properties
may be pre-ordered and sold before they are built. Missing values are also possible, since some
properties are unsold or have no record of the year built.)
# Keep only the year, which is the third field of the d/m/yyyy date string.
price$Date <- sapply(price$Date, function(x) strsplit(x, "/")[[1]][3])
price$YearBuilt <- as.integer(price$YearBuilt)
price$Date <- as.integer(price$Date)
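An alternative way to obtain the sale year is to parse the strings as dates. The sketch below assumes the day/month/year layout seen in the str output; the vectors sold and built_year are made up for illustration.

```r
# Hypothetical dates in the day/month/year layout shown in the str() output.
sold <- c("3/09/2016", "4/02/2016", "12/11/2017")

# Parse as dates, then pull out the four-digit year as an integer.
sold_year <- as.integer(format(as.Date(sold, format = "%d/%m/%Y"), "%Y"))

# Age of the property at sale (built years are made up for illustration).
built_year <- c(1900L, 2014L, NA)
age <- sold_year - built_year

print(age)
```

A missing build year propagates to a missing age, which matches the behaviour described above.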
We build classifiers to distinguish properties by their price per square meter. If the price per
square meter is 2000 or more, we label the observation TRUE; otherwise, FALSE.
price <- data.frame(price, "square_price2000" = price$square_price >= 2000, year = price$Date - price$YearBuilt)
We only include the variables that are relevant to the purpose of this analysis. The relevant
variables are as stated in the previous section.
price <- subset(price, select = c("Rooms", "Type", "Method", "Distance", "Bedroom2", "Bathroom", "Car", "year", "Regionname", "Propertycount", "square_price2000"))
The age of the property is widely spread, so it is hard to analyze at the numeric level. To make it
easier to analyze, we group the age into classes (levels). We build the same kind of classes for
Propertycount, Distance, Bathroom, Rooms, Car and Bedroom2.
The breaks were chosen so that each level contains a similar number of observations.
# Age of the property, in years.
breaks <- c(-5, 10, 30, 50, 100, 900)
price$year <- cut(price$year, breaks = breaks)
# Number of properties in the suburb.
breaks1 <- c(0, 5000, 10000, 15000, 25000)
price$Propertycount <- as.numeric(price$Propertycount)
price$Propertycount <- cut(price$Propertycount, breaks = breaks1)
price$Propertycount <- as.factor(price$Propertycount)
# Distance from the CBD.
breaks2 <- c(0, 5, 10, 15, 20, 50)
price$Distance <- as.numeric(price$Distance)
price$Distance <- cut(price$Distance, breaks = breaks2)
price$Distance <- as.factor(price$Distance)
# Merge the sparse high counts into a single top level.
price$Rooms <- ifelse(price$Rooms > 5, "6-12", price$Rooms)
price$Car <- ifelse(price$Car > 4, "5-10", price$Car)
price$Bathroom <- ifelse(price$Bathroom > 4, "5-9", price$Bathroom)
price$Bedroom2 <- ifelse(price$Bedroom2 > 5, "6-12", price$Bedroom2)
After pre-processing, we check the number of observations in each level. If the counts are equal
or similar across levels, the variable meets the analysis requirement, which appears to be the case
for every variable here. If not, we go back to the previous step and adjust the breaks until the
counts in each level are equal or similar.
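One way to shorten this adjust-and-check loop is to derive the breaks from quantiles, which yields roughly equal counts per level by construction. The sketch below is not the method used in this report (the breaks here were set by hand); it runs on simulated ages, and the variable names are hypothetical.

```r
# Simulated property ages (hypothetical data, fixed seed for repeatability).
set.seed(1)
age <- round(rexp(1000, rate = 1 / 40))

# Quantile-based break points give bins with roughly equal counts.
breaks <- unique(quantile(age, probs = seq(0, 1, by = 0.2)))

# include.lowest = TRUE keeps the minimum value in the first bin.
age_bin <- cut(age, breaks = breaks, include.lowest = TRUE)

# Inspect how many observations fall in each bin.
print(table(age_bin))
```

Hand-picked breaks such as 10, 30, 50 and 100 are easier to read in a chart, so quantile breaks are best seen as a starting point that can then be rounded.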
price[, sapply(price, is.character)] <- lapply(price[, sapply(price, is.character)], factor)
price$square_price2000 <- as.factor(price$square_price2000)
summarizeColumns(price) %>% kable(caption = 'Feature Summary after Data Preprocessing')
Feature Summary after Data Preprocessing
name type na mean disp median mad min max nlevs
Rooms factor 0 NA 0.5361869 NA NA 132 8517 6
Type factor 0 NA 0.2121658 NA NA 1395 14467 3
Method factor 0 NA 0.3458585 NA NA 133 12012 5
Distance factor 5 NA NA NA NA 1816 6098 5
Bedroom2 factor 6 NA NA NA NA 13 8529 7
Bathroom factor 9 NA NA NA NA 16 9078 6
Car factor 320 NA NA NA NA 240 8491 6
year factor 6807 NA NA NA NA 1452 3714 5
Regionname factor 0 NA 0.7066383 NA NA 80 5387 8
Propertycount factor 0 NA 0.5810597 NA NA 1030 7693 4
square_price2000 factor 0 NA 0.4737788 NA NA 8700 9663 2
str( price )
## 'data.frame': 18363 obs. of 11 variables:
## $ Rooms : Factor w/ 6 levels "1","2","3","4",..: 2 2 3 3 4 2 3 2 2 3 ...
## $ Type : Factor w/ 3 levels "h","t","u": 1 1 1 1 1 1 1 1 1 1 ...
## $ Method : Factor w/ 5 levels "PI","S","SA",..: 2 2 4 1 5 2 2 2 2 5 ...
## $ Distance : Factor w/ 5 levels "(0,5]","(5,10]",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Bedroom2 : Factor w/ 7 levels "0","1","2","3",..: 3 3 4 4 4 3 5 3 4 4 ...
## $ Bathroom : Factor w/ 6 levels "0","1","2","3",..: 2 2 3 3 2 2 3 2 2 3 ...
## $ Car : Factor w/ 6 levels "0","1","2","3",..: 2 1 1 2 3 1 1 3 3 3 ...
## $ year : Factor w/ 5 levels "(-5,10]","(10,30]",..: NA 5 5 NA 1 NA 5 5 5 2 ...
## $ Regionname : Factor w/ 8 levels "Eastern Metropolitan",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ Propertycount : Factor w/ 4 levels "(0,5e+03]","(5e+03,1e+04]",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ square_price2000: Factor w/ 2 levels "FALSE","TRUE": 2 2 2 2 2 2 2 2 2 2 ...
sapply( price[ sapply(price,is.factor)], table)
## $Rooms
##
## 1 2 3 4 5 6-12
## 526 3732 8517 4499 957 132
##
## $Type
##
## h t u
## 14467 1395 2501
##
## $Method
##
## PI S SA SP VB
## 2134 12012 133 2372 1712
##
## $Distance
##
## (0,5] (5,10] (10,15] (15,20] (20,50]
## 2362 6098 6089 1993 1816
##
## $Bedroom2
##
## 0 1 2 3 4 5 6-12
## 13 540 3823 8529 4406 916 130
##
## $Bathroom
##
## 0 1 2 3 4 5-9
## 16 9078 7620 1416 165 59
##
## $Car
##
## 0 1 2 3 4 5-10
## 1137 6212 8491 1134 829 240
##
## $year
##
## (-5,10] (10,30] (30,50] (50,100] (100,900]
## 1872 2044 2474 3714 1452
##
## $Regionname
##
## Eastern Metropolitan Eastern Victoria
## 2282 122
## Northern Metropolitan Northern Victoria
## 5295 129
## South-Eastern Metropolitan Southern Metropolitan
## 907 5387
## Western Metropolitan Western Victoria
## 4161 80
##
## $Propertycount
##
## (0,5e+03] (5e+03,1e+04] (1e+04,1.5e+04] (1.5e+04,2.5e+04]
## 5903 7693 3737 1030
##
## $square_price2000
##
## FALSE TRUE
## 8700 9663
Data Exploration
Categorical Features
Rooms
According to the bar chart of rooms below, the counts are roughly bell-shaped. The other chart,
which shows the proportion of properties priced over 2000 AUD per square meter, is skewed to the
right. Based on our analysis, properties with 2 or fewer bedrooms tend to have a higher selling
price per square meter: in this dataset, they are more likely to be priced over 2000 AUD per square
meter. Another trend is that the more bedrooms a property has, the lower the chance that its
selling price exceeds 2000 AUD per square meter. Therefore, the number of bedrooms would be a
predictive feature.
Further to this, since the selling price per square meter of 1 or 2 bedroom properties is the
highest in the market, and since a larger number of bedrooms lowers the chance of selling above
2000 AUD per square meter, we can infer either that the willingness to pay per square meter is
higher for smaller properties, or that buyers' purchasing power is limited and the lower total
price of smaller properties makes them more attainable.
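The proportion chart described above can be backed by a small table. The sketch below uses base R on made-up data to compute, for each level of Rooms, the share of properties at or above the 2000 threshold. The data frame toy and its values are hypothetical.

```r
# Hypothetical sample: number of rooms and whether price per m2 is >= 2000.
toy <- data.frame(
  Rooms            = factor(c("1", "1", "2", "2", "2", "3", "3", "3", "3", "4")),
  square_price2000 = c(TRUE, TRUE, TRUE, TRUE, FALSE,
                       TRUE, FALSE, FALSE, FALSE, FALSE)
)

# Counts per level (what the bar chart shows).
counts <- table(toy$Rooms)

# Row-wise proportions: within each Rooms level, the share of FALSE and TRUE.
props <- prop.table(table(toy$Rooms, toy$square_price2000), margin = 1)

print(round(props, 2))
```

The same pattern applies to any of the other categorical features by swapping the first argument of table.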
Type
Properties categorized as unit/duplex or townhouse tend to have a greater chance of being sold for
more than 2000 AUD per square meter. For these 2 types, sales above 2000 AUD per square meter are
dominant.
But we have to take the sample size into consideration when predicting as well. Based on the bar
chart of type, the proportion of townhouse and unit properties is quite small. For now, we put this
aside for further consideration.
Method
Based on the charts below, although fewer properties were sold through vendor bid than through
most of the other selling methods, the results for vendor bid stand out: it has the highest chance
of all of reaching a price per square meter greater than 2000 AUD.
On the other hand, the smallest proportion of properties was sold after auction (SA), and those
properties had the lowest chance of being sold for over 2000 AUD per square meter.
Distance
Based on the two charts below, it is clear that the price per square meter is negatively
correlated with the distance from the CBD: the greater the distance, the lower the price.
Therefore, the distance variable can be a predictive feature.
Car Park Number
The trend in the car park number in relation to the price per square meter is very similar to that
of the distance from the CBD. Based on the two charts below, it is clear that the price per square
meter is negatively correlated with the car park number: the greater the number of car parks, the
lower the price. Therefore, the car park number variable can be a predictive feature.
Bathroom
There are no clear indications of a relationship between the price and the number of bathrooms.
Therefore, this would not be a predictive feature.
Age
We can see that properties aged between 100 and 900 years make up a large proportion of the
properties sold for over 2000 AUD per square meter. We assume this is because properties of such
age are often associated with historical sites.
Regionname
In this graph, we can see that properties in the Southern Metropolitan area are in high demand in
the Melbourne housing market.
Propertycount
For property counts from 0 to 15000, the proportion of properties priced over 2000 goes up; when
the property count is over 20000, the proportion of such properties sold for over 2000 drops
sharply. We assume that people have very specific requirements about living density.
Multivariate Visualisation
Rooms Num vs BedRoom2
Since the Bedroom2 data was drawn from a different source, we chose not to plot it on its own.
Instead, we compare the Bedroom2 data with the Rooms data. According to the graph below, we can
clearly see that there is little difference between these two variables; the small differences may
result from different counting standards.
Therefore, we removed it.
price$Bedroom2 <- NULL
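The comparison between the two room counts can also be checked numerically before a column is dropped. The sketch below is a minimal illustration on made-up vectors (rooms and bedroom2 are hypothetical values, not the dataset's): it cross-tabulates the two sources and reports the share of records on which they agree.

```r
# Hypothetical values for the two room counts.
rooms    <- c(2, 3, 3, 4, 2, 3, 5, 1)
bedroom2 <- c(2, 3, 3, 3, 2, 3, 5, 1)

# Cross-tabulation: off-diagonal cells are records where the sources disagree.
print(table(Rooms = rooms, Bedroom2 = bedroom2))

# Share of records on which both sources agree.
agreement <- mean(rooms == bedroom2)
print(agreement)
```

A high agreement rate supports treating the second variable as redundant.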
Distance, Car park number and Propertycount
According to the graph below, we can summarize that properties at a shorter distance, as in less
than 10 from the CBD, were sold at the highest price per square meter. The price per square meter
increased along with the number of properties around, but then dropped when the number of
properties around increased beyond a certain point. The price also goes down as the car park number
increases. We can assume that a property within 0-10 of the CBD, with around 15000 properties
around it and a small number of car parks, is more likely to be sold for over 2000 per square
meter.
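A tabular counterpart to this multivariate view is the share of properties over the threshold for each combination of levels. The sketch below uses base R's aggregate on made-up binned data. The data frame toy and the column name over2000 are hypothetical, and only two of the three features are combined to keep the example short.

```r
# Hypothetical binned data in the shape produced by the pre-processing step.
toy <- data.frame(
  Distance = c("(0,5]", "(0,5]", "(0,5]", "(5,10]", "(5,10]", "(5,10]",
               "(10,15]", "(10,15]"),
  Car      = c("0", "1", "1", "1", "2", "2", "2", "2"),
  over2000 = c(TRUE, TRUE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE)
)

# The mean of a logical vector is the proportion of TRUE values, so this
# gives the share of properties over 2000 for each observed combination.
share <- aggregate(over2000 ~ Distance + Car, data = toy, FUN = mean)

print(share)
```

Adding Propertycount to the right-hand side of the formula extends the table to all three features.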
Summary
In this assignment, we computed the price per square meter in order to avoid the effect of
properties having different land and building sizes. We used the date sold and the year built to
compute the age of the property. To serve the purpose of the analysis, we removed the records
without a price, or without both land and building size. For the categorical features, we created
breaks so that each level contains a similar number of observations, based on each variable's
level table. In the data exploration, we plotted every variable's relation with the price per
square meter. We found that rooms, method, distance, car, year, region name and property count are
potentially useful features in estimating the price classes.

VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
 
Customer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxCustomer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptx
 
Log Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxLog Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptx
 

Data pre-processing and Exploration on 2016 Melbourne housing market by using R

Predicting Property Price Melbourne
Shuai Gao (s3596156)
4 April 2018

Introduction
The purpose of this assignment is to build classifiers that predict whether a property can be sold for more than 2000 per square meter in a year, using the dataset “2016 Melbourne housing market”. The dataset was sourced from kaggle (https://www.kaggle.com/). The report is organized as follows. Section 2 discusses the dataset and its attributes. Section 3 discusses the data pre-processing. Section 4 explores each attribute and the inter-relationships between attributes. The last section summarizes the findings.

Data Set
This dataset is provided by kaggle (https://www.kaggle.com/anthonypino/melbourne-housing-market). It includes 34857 observations and 21 variables.

Target Feature
The response feature is square_price2000 which is given as:

Descriptive Features
The variable description is provided by Tony Pino:
Suburb: Suburb
Address: Address
Rooms: Number of rooms
Price: Price in dollars
Method: S - property sold; SP - property sold prior; PI - property passed in; PN - sold prior not disclosed; SN - sold not disclosed; NB - no bid; VB - vendor bid; W - withdrawn prior to auction; SA - sold after auction; SS - sold after auction price not disclosed. N/A - price or highest bid not available.
Type: br - bedroom(s); h - house, cottage, villa, semi, terrace; u - unit, duplex; t - townhouse; dev site - development site; o res - other residential.
SellerG: Real Estate Agent
Date: Date sold
Distance: Distance from CBD
Regionname: General Region (West, North West, North, North east …etc)
Propertycount: Number of properties that exist in the suburb.
Bedroom2: Scraped # of Bedrooms (from different source)
Bathroom: Number of Bathrooms
Car: Number of carspots
Landsize: Land Size
BuildingArea: Building Size
YearBuilt: Year the house was built
CouncilArea: Governing council for the area
Lattitude: Self explanatory
Longtitude: Self explanatory

Since the purpose of this assignment is to evaluate the price of a property from the existing data, we only use variables that are linked to our topic: the number of rooms, the type of the property, the selling method, the distance from the CBD, Bedroom2 (the number of bedrooms scraped from another source), the number of bathrooms, the number of carspots, the year the house was built, Regionname and the property count in the same suburb. Since Bedroom2 comes from a different source, we leave it aside for now and later check whether it behaves like Rooms. For more details, see Domain (https://www.domain.com.au/).

Data Pre-processing
Preliminaries
In this project, we used the following R packages.
library(tidyverse)
library(knitr)
library(mlr)
library(cowplot)
First, we read the data into RStudio for processing. The file already includes a header row, so we do not need to supply column names.
price <- read.csv('Melbourne.csv', stringsAsFactors = FALSE, header = TRUE)

Data Cleaning and Transformation
After applying the str and summarizeColumns functions, we found that a few variables are not directly usable for our topic. For example, raw prices cannot be compared fairly because properties differ in land size and building size. To estimate the price of a property, we combine land size and building size to construct a new column holding the price per square meter.
str(price)
## 'data.frame': 34857 obs. of 21 variables:
## $ Suburb       : chr "Abbotsford" "Abbotsford" "Abbotsford" "Abbotsford" ...
## $ Address      : chr "68 Studley St" "85 Turner St" "25 Bloomburg St" "18/659 Victoria St" ...
## $ Rooms        : int 2 2 2 3 3 3 4 4 2 2 ...
## $ Type         : chr "h" "h" "h" "u" ...
## $ Price        : int NA 1480000 1035000 NA 1465000 850000 1600000 NA NA NA ...
## $ Method       : chr "SS" "S" "S" "VB" ...
## $ SellerG      : chr "Jellis" "Biggin" "Biggin" "Rounds" ...
## $ Date         : chr "3/09/2016" "3/12/2016" "4/02/2016" "4/02/2016" ...
## $ Distance     : chr "2.5" "2.5" "2.5" "2.5" ...
## $ Postcode     : chr "3067" "3067" "3067" "3067" ...
## $ Bedroom2     : int 2 2 2 3 3 3 3 3 4 3 ...
## $ Bathroom     : int 1 1 1 2 2 2 1 2 1 2 ...
## $ Car          : int 1 1 0 1 0 1 2 2 2 1 ...
## $ Landsize     : int 126 202 156 0 134 94 120 400 201 202 ...
## $ BuildingArea : num NA NA 79 NA 150 NA 142 220 NA NA ...
## $ YearBuilt    : int NA NA 1900 NA 1900 NA 2014 2006 1900 1900 ...
## $ CouncilArea  : chr "Yarra City Council" "Yarra City Council" "Yarra City Council" "Yarra City Council" ...
## $ Lattitude    : num -37.8 -37.8 -37.8 -37.8 -37.8 ...
## $ Longtitude   : num 145 145 145 145 145 ...
## $ Regionname   : chr "Northern Metropolitan" "Northern Metropolitan" "Northern Metropolitan" "Northern Metropolitan" ...
## $ Propertycount: chr "4019" "4019" "4019" "4019" ...
summarizeColumns(price) %>% knitr::kable(caption = 'Feature Summary before Data Preprocessing')

Feature Summary before Data Preprocessing
name           type       na     mean           disp          median       mad           min
Suburb         character  0      NA             9.757868e-01  NA           NA
Address        character  0      NA             9.998279e-01  NA           NA
Rooms          integer    0      3.031012e+00   9.699329e-01  3.0000       1.482600e+00
Type           character  0      NA             3.120464e-01  NA           NA            3580.00000
Price          integer    7610   1.050173e+06   6.414671e+05  870000.0000  4.299540e+05  85000.00000
Method         character  0      NA             4.335714e-01  NA           NA
SellerG        character  0      NA             9.036349e-01  NA           NA
Date           character  0      NA             9.678974e-01  NA           NA
Distance       character  0      NA             9.592621e-01  NA           NA
Postcode       character  0      NA             9.757868e-01  NA           NA
Bedroom2       integer    8217   3.084647e+00   9.806897e-01  3.0000       1.482600e+00
Bathroom       integer    8226   1.624798e+00   7.242120e-01  2.0000       1.482600e+00
Car            integer    8728   1.728845e+00   1.010771e+00  2.0000       1.482600e+00
Landsize       integer    11810  5.935990e+02   3.398842e+03  521.0000     3.113460e+02
BuildingArea   numeric    21115  1.602564e+02   4.012671e+02  136.0000     6.078660e+01
YearBuilt      integer    19306  1.965290e+03   3.732818e+01  1970.0000    4.447800e+01  1196.00000
CouncilArea    character  0      NA             8.945692e-01  NA           NA
Lattitude      numeric    7976   -3.781063e+01  9.027890e-02  -37.8076     8.077200e-02
Longtitude     numeric    7976   1.450019e+02   1.201688e-01  145.0078     1.012912e-01  144.42379
Regionname     character  0      NA             6.604412e-01  NA           NA
Propertycount  character  0      NA             9.757868e-01  NA           NA

We removed the excess white space from all character features.
char_cols <- sapply(price, is.character)
price[, char_cols] <- lapply(price[, char_cols], trimws)
We estimate the price of a property as a price per square meter, to avoid the side effect of differences in land size and building size. We assume that Landsize is the area of land with no building on it and that BuildingArea is the size of the building, so the two can be added. We also assume that each property has a land size, a building area, or both; a missing value is treated as 0. Under these assumptions, an observation is invalid and is removed if its total area is 0 (neither land size nor building area is recorded), if it has no price, or both.
price$Landsize[is.na(price$Landsize)] <- 0
price$BuildingArea[is.na(price$BuildingArea)] <- 0
price <- data.frame(price, square_price = price$Price / (price$Landsize + price$BuildingArea))
price <- price %>% filter(square_price > 0 & square_price != Inf)
In general, the age of a property has a strong impact on its price.
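As a hedged illustration of the price-per-square-meter step above, the following base R sketch applies the same rules to a small made-up data frame. The values in `toy` are assumptions chosen for illustration and are not rows from the dataset; `is.finite` is used here in place of the report's `!= Inf` test.

```r
# Toy data; values are illustrative only, not taken from the dataset
toy <- data.frame(
  Price        = c(1480000, 1035000, NA, 850000),
  Landsize     = c(202, 156, 120, NA),
  BuildingArea = c(NA, 79, 142, NA)
)

# Treat missing areas as 0, as the report does
toy$Landsize[is.na(toy$Landsize)] <- 0
toy$BuildingArea[is.na(toy$BuildingArea)] <- 0

# Price per square meter over the combined area
toy$square_price <- toy$Price / (toy$Landsize + toy$BuildingArea)

# Keep only finite, positive values: this drops rows with a missing
# price (NA) and rows with zero total area (division by zero gives Inf)
valid <- !is.na(toy$square_price) &
  is.finite(toy$square_price) &
  toy$square_price > 0
toy <- toy[valid, ]
print(toy)
```

Only the first two rows survive: the third has no price and the fourth has no recorded area.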
We use the sale date and the year built to compute the age of the property. Negative ages are possible because some properties may be sold before they are built. Missing values are also possible because some properties are unsold or have no recorded build year.
price$Date <- sapply(price$Date, function(x) { strsplit(x, "/")[[1]][3] })
price$YearBuilt <- as.integer(price$YearBuilt)
price$Date <- as.integer(price$Date)
We want to build classifiers that distinguish properties by price per square meter. If the price per square meter is 2000 or more, we label the observation TRUE; otherwise, FALSE.
price <- data.frame(price, "square_price2000" = price$square_price >= 2000, year = price$Date - price$YearBuilt)
We keep only the variables that are relevant to the purpose of this analysis, as stated in the previous section.
price <- subset(price, select = c("Rooms", "Type", "Method", "Distance", "Bedroom2", "Bathroom", "Car", "year", "Regionname", "Propertycount", "square_price2000"))
The age of the property is widely spread, which makes it hard to analyze as a numeric variable. To make it easier to analyze, we bin the age into a small number of levels. We bin Propertycount and Distance in the same way, and we collapse the sparse upper values of Rooms, Car, Bathroom and Bedroom2 into a single level each. The breaks were chosen so that each level contains a similar number of observations.
breaks = c(-5, 10, 30, 50, 100, 900)
price$year <- cut(price$year, breaks = breaks)
breaks1 = c(0, 5000, 10000, 15000, 25000)
price$Propertycount <- as.numeric(price$Propertycount)
price$Propertycount <- cut(price$Propertycount, breaks = breaks1)
breaks2 = c(0, 5, 10, 15, 20, 50)
price$Distance <- as.numeric(price$Distance)
price$Distance <- cut(price$Distance, breaks = breaks2)
price$Rooms <- ifelse(price$Rooms > 5, "6-12", price$Rooms)
price$Car <- ifelse(price$Car > 4, "5-10", price$Car)
price$Bathroom <- ifelse(price$Bathroom > 4, "5-9", price$Bathroom)
price$Bedroom2 <- ifelse(price$Bedroom2 > 5, "6-12", price$Bedroom2)
After pre-processing, we count the observations in each break. The analysis requirement is that the counts are equal or similar across breaks, and every variable appears to meet it.
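The age and binning steps above can be sketched on toy data as follows. This is a minimal base R sketch under stated assumptions: the `toy` rows are invented for illustration, and only the `breaks` vector is taken from the report.

```r
# Toy data; dates and build years are illustrative only
toy <- data.frame(
  Date      = c("3/12/2016", "4/02/2016", "4/03/2017", "10/06/2017"),
  YearBuilt = c(1900, 2014, 1970, NA),
  stringsAsFactors = FALSE
)

# The sale year is the third "/"-separated field of the date string
sold_year <- as.integer(sapply(strsplit(toy$Date, "/"), `[`, 3))

# Age at sale; stays NA when the build year is missing
toy$year <- sold_year - toy$YearBuilt

# Bin the age with the breaks used in the report
breaks <- c(-5, 10, 30, 50, 100, 900)
toy$year <- cut(toy$year, breaks = breaks)

# Count observations per level, including missing ages
counts <- table(toy$year, useNA = "ifany")
print(counts)
```

Inspecting `counts` is the check described in the text: strongly unequal counts suggest that the breaks should be adjusted.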
If the counts were not similar, we would go back to the previous step and adjust the breaks until each break holds an equal or similar number of observations.
price[, sapply(price, is.character)] <- lapply(price[, sapply(price, is.character)], factor)
price$square_price2000 <- as.factor(price$square_price2000)
summarizeColumns(price) %>% kable(caption = 'Feature Summary after Data Preprocessing')

Feature Summary after Data Preprocessing
name              type    na    mean  disp       median  mad  min   max    nlevs
Rooms             factor  0     NA    0.5361869  NA      NA   132   8517   6
Type              factor  0     NA    0.2121658  NA      NA   1395  14467  3
Method            factor  0     NA    0.3458585  NA      NA   133   12012  5
Distance          factor  5     NA    NA         NA      NA   1816  6098   5
Bedroom2          factor  6     NA    NA         NA      NA   13    8529   7
Bathroom          factor  9     NA    NA         NA      NA   16    9078   6
Car               factor  320   NA    NA         NA      NA   240   8491   6
year              factor  6807  NA    NA         NA      NA   1452  3714   5
Regionname        factor  0     NA    0.7066383  NA      NA   80    5387   8
Propertycount     factor  0     NA    0.5810597  NA      NA   1030  7693   4
square_price2000  factor  0     NA    0.4737788  NA      NA   8700  9663   2

str(price)
## 'data.frame': 18363 obs. of 11 variables:
## $ Rooms           : Factor w/ 6 levels "1","2","3","4",..: 2 2 3 3 4 2 3 2 2 3 ...
## $ Type            : Factor w/ 3 levels "h","t","u": 1 1 1 1 1 1 1 1 1 1 ...
## $ Method          : Factor w/ 5 levels "PI","S","SA",..: 2 2 4 1 5 2 2 2 2 5 ...
## $ Distance        : Factor w/ 5 levels "(0,5]","(5,10]",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Bedroom2        : Factor w/ 7 levels "0","1","2","3",..: 3 3 4 4 4 3 5 3 4 4 ...
## $ Bathroom        : Factor w/ 6 levels "0","1","2","3",..: 2 2 3 3 2 2 3 2 2 3 ...
## $ Car             : Factor w/ 6 levels "0","1","2","3",..: 2 1 1 2 3 1 1 3 3 3 ...
## $ year            : Factor w/ 5 levels "(-5,10]","(10,30]",..: NA 5 5 NA 1 NA 5 5 5 2 ...
## $ Regionname      : Factor w/ 8 levels "Eastern Metropolitan",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ Propertycount   : Factor w/ 4 levels "(0,5e+03]","(5e+03,1e+04]",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ square_price2000: Factor w/ 2 levels "FALSE","TRUE": 2 2 2 2 2 2 2 2 2 2 ...
sapply(price[sapply(price, is.factor)], table)
## $Rooms
##
##    1    2    3    4    5 6-12
##  526 3732 8517 4499  957  132
##
## $Type
##
##     h     t     u
## 14467  1395  2501
##
## $Method
##
##    PI     S    SA    SP    VB
##  2134 12012   133  2372  1712
##
## $Distance
##
##   (0,5]  (5,10] (10,15] (15,20] (20,50]
##    2362    6098    6089    1993    1816
##
## $Bedroom2
##
##    0    1    2    3    4    5 6-12
##   13  540 3823 8529 4406  916  130
##
## $Bathroom
##
##    0    1    2    3    4  5-9
##   16 9078 7620 1416  165   59
##
## $Car
##
##    0    1    2    3    4 5-10
## 1137 6212 8491 1134  829  240
##
## $year
##
##   (-5,10]   (10,30]   (30,50]  (50,100] (100,900]
##      1872      2044      2474      3714      1452
##
## $Regionname
##
##       Eastern Metropolitan           Eastern Victoria
##                       2282                        122
##      Northern Metropolitan          Northern Victoria
##                       5295                        129
## South-Eastern Metropolitan      Southern Metropolitan
##                        907                       5387
##       Western Metropolitan           Western Victoria
##                       4161                         80
##
## $Propertycount
##
##         (0,5e+03]     (5e+03,1e+04]   (1e+04,1.5e+04] (1.5e+04,2.5e+04]
##              5903              7693              3737              1030
##
## $square_price2000
##
## FALSE  TRUE
##  8700  9663

Data Exploration
Categorical Features
Rooms
According to the bar chart of rooms below, the counts are roughly bell-shaped, peaking at 3 rooms. In the other chart, which shows the proportion of properties sold for over 2000 per square meter, the distribution is skewed to the right. Properties with 2 or fewer rooms are more likely to sell for over 2000 AUD per square meter, and the more rooms a property has, the less likely it is to do so. Therefore, the number of rooms would be a predictive feature. Two explanations are possible: buyers may be willing to pay more per square meter for smaller properties, or, since the total price of a smaller property is lower than that of a larger one, buyers' purchasing power may be limited.
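The "proportion over 2000" view used throughout this section can be computed per level with base R. The sketch below uses an invented `toy` data frame (an assumption, not the report's data) and a logical target; in the report `square_price2000` is a factor, so it would first need converting, for example with `price$square_price2000 == "TRUE"`.

```r
# Toy data; levels and outcomes are illustrative only
toy <- data.frame(
  Rooms = factor(c("1", "2", "2", "3", "3", "3", "4", "4")),
  square_price2000 = c(TRUE, TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE)
)

# The mean of a logical vector is the share of TRUE values, so this
# gives the proportion sold at 2000 or more within each Rooms level
prop_over <- sapply(split(toy$square_price2000, toy$Rooms), mean)
print(round(prop_over, 2))

# Level counts, to judge how much data supports each proportion
print(table(toy$Rooms))
```

Reading the proportions next to the level counts helps avoid over-interpreting levels with few observations.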
Type
Properties categorized as unit/duplex or townhouse tend to have a greater chance of being sold for more than 2000 AUD per square meter. For these two types, sales above 2000 AUD per square meter are dominant. However, we have to take the sample size into consideration as well. Based on the bar chart of type, the proportion of townhouse and unit properties is quite small. For now, we put this aside for further consideration.
Method
Based on the charts below, fewer properties were sold through vendor bid than through most of the other selling methods, yet vendor bid stands out with the highest chance of a price per square meter greater than 2000 AUD. On the other hand, the smallest proportion of properties was sold after auction (SA), and those properties had the lowest chance of being sold for over 2000 AUD per square meter.
Distance
Based on the two charts below, the price per square meter is clearly negatively correlated with the distance from the CBD: the greater the distance, the lower the price. Therefore, the distance variable can be a predictive feature.
Car Park Number
The trend for the number of car parks in relation to the price per square meter is very similar to that of the distance from the CBD. Based on the two charts below, the price per square meter is clearly negatively correlated with the number of car parks: the more car parks, the lower the price. Therefore, the car park number variable can be a predictive feature.
Bathroom
There are no clear indications of a relationship between the price and the number of bathrooms. Therefore this wouldn’t be a predictive feature.
Age
Properties aged between 100 and 900 years account for a large proportion of the properties sold for over 2000 AUD per square meter. We assume this is because properties of such age are often associated with historical sites.
Regionname
In this graph, we can see that properties in the Southern Metropolitan area are in high demand in the Melbourne housing market.
Propertycount
For property counts from 0 to 15000, the proportion of properties priced over 2000 goes up; when the property count is over 20000, the proportion sold for over 2000 drops sharply. We assume that people have very specific requirements about living density.

Multivariate Visualisation
Rooms Num vs BedRoom2
Since the Bedroom2 data was drawn from a different source, we chose not to plot it on its own. Instead, we compare the Bedroom2 data with the Rooms data. According to the graph below, there is little difference between these two variables, and the small difference may result from different counting standards.
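One hedged way to quantify how closely two count variables agree, alongside a graph, is an exact-match rate plus a cross-tabulation. The sketch below is not the report's method; the `toy` values are invented for illustration.

```r
# Toy data; values are illustrative only
toy <- data.frame(
  Rooms    = c(2, 2, 3, 3, 4, 4),
  Bedroom2 = c(2, 2, 3, 3, 3, NA)
)

# Compare only rows where both values are present
both <- complete.cases(toy)

# Share of rows where the two sources report the same number
agreement <- mean(toy$Rooms[both] == toy$Bedroom2[both])
print(agreement)

# Cross-tabulation: off-diagonal cells show where the sources disagree
print(table(Rooms = toy$Rooms, Bedroom2 = toy$Bedroom2, useNA = "ifany"))
```

A high agreement rate would support treating one of the two variables as redundant.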
Therefore, we removed it.
price$Bedroom2 <- NULL

Distance, Car park number and Propertycount
According to the graph below, properties within a short distance of the CBD (less than 10) were sold at the highest price per square meter. The price per square meter increased with the number of properties in the suburb, but then dropped once that number rose beyond a certain point. The price also goes down as the number of car parks increases. We can assume that a property 0-10 from the CBD, with around 15000 properties in the suburb and a small number of car parks, is more likely to be sold for over 2000 per square meter.
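The multivariate comparison above can also be tabulated. The following base R sketch computes the proportion over 2000 for each combination of two binned features; the `toy` data and the column name `prop_over_2000` are assumptions for illustration, and a third factor such as Propertycount could be added to the formula in the same way.

```r
# Toy data; levels and outcomes are illustrative only
toy <- data.frame(
  Distance = factor(c("(0,5]", "(0,5]", "(0,5]",
                      "(10,15]", "(10,15]", "(10,15]")),
  Car      = factor(c("0", "0", "2", "0", "2", "2")),
  square_price2000 = c(TRUE, TRUE, FALSE, TRUE, FALSE, FALSE)
)

# Mean of the logical target within each Distance x Car group,
# i.e. the share of properties sold at 2000 or more
prop_tbl <- aggregate(square_price2000 ~ Distance + Car,
                      data = toy, FUN = mean)
names(prop_tbl)[3] <- "prop_over_2000"
print(prop_tbl)
```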
Summary
In this assignment, we computed the price per square meter in order to avoid the effect of properties having different land and building sizes. We used the sale date and the year built to compute the age of each property. To serve the purpose of the analysis, we removed observations without a price, or without both land size and building size. For categorical features, we chose breaks so that each break holds a similar number of observations, based on each variable's level table. In the data exploration, we plotted every variable's relation with the price per square meter. We found that rooms, method, distance, car, year, region name and property count are potentially useful features for estimating the price classes.