Data pre-processing and Exploration on 2016 Melbourne housing market by using R
Predicting Property Price Melbourne
Shuai Gao (s3596156)
4 April 2018
Introduction
The purpose of this assignment is to build classifiers in order to predict whether a property can be sold for more than 2000 AUD per square meter within a year, using the dataset “2016 Melbourne housing market”. The dataset was sourced from Kaggle (https://www.kaggle.com/). The report is organized as follows. Section 2 discusses the dataset and its attributes. Section 3 covers the data pre-processing. Section 4 explores each attribute and the inter-relationships between attributes. After these analyses, we summarize the findings in the last section.
Data Set
This dataset is provided by Kaggle (https://www.kaggle.com/anthonypino/melbourne-housing-market). It includes 34,857 observations and 21 variables.
Target Feature
The response feature is square_price2000, a binary label that is TRUE when the price per square meter (Price divided by the sum of Landsize and BuildingArea) is at least 2000 AUD, and FALSE otherwise.
Descriptive Features
The variable description is provided by Tony Pino:
Suburb: Suburb
Address: Address
Rooms: Number of rooms
Price: Price in dollars
Method: S - property sold; SP - property sold prior; PI - property passed in; PN - sold prior not disclosed; SN - sold not disclosed; NB - no bid; VB - vendor bid; W - withdrawn prior to auction; SA - sold after auction; SS - sold after auction price not disclosed; N/A - price or highest bid not available.
Type: br - bedroom(s); h - house, cottage, villa, semi, terrace; u - unit, duplex; t - townhouse; dev site - development site; o res - other residential.
SellerG: Real Estate Agent
Date: Date sold
Distance: Distance from CBD
Regionname: General Region (West, North West, North, North east …etc)
Propertycount: Number of properties that exist in the suburb.
Bedroom2: Scraped # of Bedrooms (from a different source)
Bathroom: Number of Bathrooms
Car: Number of carspots
Landsize: Land Size
BuildingArea: Building Size
YearBuilt: Year the house was built
CouncilArea: Governing council for the area
Lattitude: Latitude (self explanatory)
Longtitude: Longitude (self explanatory)
Since the purpose of this assignment is to evaluate the price of a property based on the existing data, we will only use the variables that are linked to our topic: the number of Rooms, Type of the property, property selling Method, Distance from CBD, Bedroom2 (the number of bedrooms scraped from another source), number of Bathrooms, number of Car spots, the year the house was built, Regionname, and Propertycount in the same suburb. Since Bedroom2 is sourced from a different dataset, we will leave it aside for now and check whether it has a similar effect to Rooms. For more details, see Domain (https://www.domain.com.au/).
Data Pre-processing
Preliminaries
In this project, we used the following R packages.
library(tidyverse)
library(knitr)
library(mlr)
library(cowplot)
Firstly, we need to read the data into RStudio in order to process it. The data already provides a header row, so we don’t have to supply one.
price <- read.csv('Melbourne.csv', stringsAsFactors = FALSE, header = TRUE)
Data Cleaning and Transformation
After applying the str and summarizeColumns functions, we found that a few variables are not directly usable for our topic. For example, the price of a property cannot be compared across properties solely because of differences in land size and building size. In order to estimate the price of a property, we have to consolidate both land size and building size to construct a new column that gives the price per square meter.
str(price)
## 'data.frame': 34857 obs. of 21 variables:
## $ Suburb : chr "Abbotsford" "Abbotsford" "Abbotsford" "Abbotsford" ...
## $ Address : chr "68 Studley St" "85 Turner St" "25 Bloomburg St" "18/659 Victoria St" ...
## $ Rooms : int 2 2 2 3 3 3 4 4 2 2 ...
## $ Type : chr "h" "h" "h" "u" ...
## $ Price : int NA 1480000 1035000 NA 1465000 850000 1600000 NA NA NA ...
## $ Method : chr "SS" "S" "S" "VB" ...
## $ SellerG : chr "Jellis" "Biggin" "Biggin" "Rounds" ...
## $ Date : chr "3/09/2016" "3/12/2016" "4/02/2016" "4/02/2016" ...
## $ Distance : chr "2.5" "2.5" "2.5" "2.5" ...
## $ Postcode : chr "3067" "3067" "3067" "3067" ...
## $ Bedroom2 : int 2 2 2 3 3 3 3 3 4 3 ...
## $ Bathroom : int 1 1 1 2 2 2 1 2 1 2 ...
## $ Car : int 1 1 0 1 0 1 2 2 2 1 ...
## $ Landsize : int 126 202 156 0 134 94 120 400 201 202 ...
## $ BuildingArea : num NA NA 79 NA 150 NA 142 220 NA NA ...
## $ YearBuilt : int NA NA 1900 NA 1900 NA 2014 2006 1900 1900 ...
## $ CouncilArea : chr "Yarra City Council" "Yarra City Council" "Yarra City Council" "Yarra City Council" ...
## $ Lattitude : num -37.8 -37.8 -37.8 -37.8 -37.8 ...
## $ Longtitude : num 145 145 145 145 145 ...
## $ Regionname : chr "Northern Metropolitan" "Northern Metropolitan" "Northern Metropolitan" "Northern Metropolitan" ...
## $ Propertycount: chr "4019" "4019" "4019" "4019" ...
summarizeColumns(price) %>% knitr::kable(caption = 'Feature Summary before Data Preprocessing')
Feature Summary before Data Preprocessing

|name          |type      |    na|          mean|         disp|      median|          mad|
|--------------|----------|------|--------------|-------------|------------|-------------|
|Suburb        |character |     0|            NA| 9.757868e-01|          NA|           NA|
|Address       |character |     0|            NA| 9.998279e-01|          NA|           NA|
|Rooms         |integer   |     0|  3.031012e+00| 9.699329e-01|      3.0000| 1.482600e+00|
|Type          |character |     0|            NA| 3.120464e-01|          NA|           NA|
|Price         |integer   |  7610|  1.050173e+06| 6.414671e+05| 870000.0000| 4.299540e+05|
|Method        |character |     0|            NA| 4.335714e-01|          NA|           NA|
|SellerG       |character |     0|            NA| 9.036349e-01|          NA|           NA|
|Date          |character |     0|            NA| 9.678974e-01|          NA|           NA|
|Distance      |character |     0|            NA| 9.592621e-01|          NA|           NA|
|Postcode      |character |     0|            NA| 9.757868e-01|          NA|           NA|
|Bedroom2      |integer   |  8217|  3.084647e+00| 9.806897e-01|      3.0000| 1.482600e+00|
|Bathroom      |integer   |  8226|  1.624798e+00| 7.242120e-01|      2.0000| 1.482600e+00|
|Car           |integer   |  8728|  1.728845e+00| 1.010771e+00|      2.0000| 1.482600e+00|
|Landsize      |integer   | 11810|  5.935990e+02| 3.398842e+03|    521.0000| 3.113460e+02|
|BuildingArea  |numeric   | 21115|  1.602564e+02| 4.012671e+02|    136.0000| 6.078660e+01|
|YearBuilt     |integer   | 19306|  1.965290e+03| 3.732818e+01|   1970.0000| 4.447800e+01|
|CouncilArea   |character |     0|            NA| 8.945692e-01|          NA|           NA|
|Lattitude     |numeric   |  7976| -3.781063e+01| 9.027890e-02|    -37.8076| 8.077200e-02|
|Longtitude    |numeric   |  7976|  1.450019e+02| 1.201688e-01|    145.0078| 1.012912e-01|
|Regionname    |character |     0|            NA| 6.604412e-01|          NA|           NA|
|Propertycount |character |     0|            NA| 9.757868e-01|          NA|           NA|
We removed the excessive white spaces from all character features. (Note that read.csv was called with stringsAsFactors = FALSE, so the text columns are character vectors, not factors; we therefore select them with is.character.)
price[, sapply(price, is.character)] <- sapply(price[, sapply(price, is.character)], trimws)
We will estimate the price of the property based on the price per square meter, to avoid the side effect of the differences in land size and building size. We assume that Landsize represents the size of the land with no building constructed on it, and that BuildingArea represents the size of the building. We assume that one particular property will have the information of either land size or building area, or both. If a property has neither, we will treat the data entry as invalid (treated as 0).
Based on these assumptions, an observation with a 0 or missing value means that the particular property either has no data regarding land size or building area, or doesn’t have a price, or both.
price$Landsize[is.na(price$Landsize)] <- 0
price$BuildingArea[is.na(price$BuildingArea)] <- 0
price <- data.frame(price, square_price = price$Price/(price$Landsize + price$BuildingArea))
price <- price %>% filter(square_price > 0 & square_price != Inf)
In general, the age of a property has a strong impact on its price. We will use the property sold date and built date to compute the age of the property. (Negative results are possible since there might be some pre-sold, off-the-plan properties. Missing values are also possible since some properties are unsold or have no record of the built date.)
price$Date <- sapply(price$Date, function(x){strsplit(x, "/")[[1]][3]})
price$YearBuilt <- as.integer(price$YearBuilt)
price$Date <- as.integer(price$Date)
We build a classification target from the price per square meter: if the price per square meter is at least 2000, we classify the observation as TRUE; otherwise, FALSE.
price <- data.frame(price, square_price2000 = price$square_price >= 2000, year = price$Date - price$YearBuilt)
We only include the variables that are relevant to the purpose of this analysis, as stated in the previous section.
price <- subset(price, select = c("Rooms", "Type", "Method", "Distance", "Bedroom2", "Bathroom", "Car", "year", "Regionname", "Propertycount", "square_price2000"))
The age of the property is widely spread, so it is hard to analyze at the numeric level. To make the analysis easier, we bin the age of the property into levels. We build the same kind of bins for the variables Propertycount, Distance, Bathroom, Rooms, Car and Bedroom2.
The breaks of the bins were set so that each level contains a roughly equal amount of data.
breaks = c(-5, 10, 30, 50, 100, 900)
price$year <- cut(price$year, breaks = breaks)
breaks1 = c(0, 5000, 10000, 15000, 25000)
price$Propertycount <- as.numeric(price$Propertycount)
price$Propertycount <- cut(price$Propertycount, breaks = breaks1)
breaks2 = c(0, 5, 10, 15, 20, 50)
price$Distance <- as.numeric(price$Distance)
price$Distance <- cut(price$Distance, breaks = breaks2)
price$Rooms <- ifelse(price$Rooms > 5, "6-12", price$Rooms)
price$Car <- ifelse(price$Car > 4, "5-10", price$Car)
price$Bathroom <- ifelse(price$Bathroom > 4, "5-9", price$Bathroom)
price$Bedroom2 <- ifelse(price$Bedroom2 > 5, "6-12", price$Bedroom2)
After pre-processing, we can check the number of observations in each break. If the counts are roughly equal across breaks, the variable meets the analysis requirement; if not, we go back to the previous step and adjust the breaks until the counts are approximately equal. (A quantile-based alternative is sketched below.)
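As a hypothetical alternative to hand-tuned breaks (this helper is our own illustration, not part of the original analysis), near-equal-frequency breaks can be derived directly from quantiles:

# Hypothetical helper, not in the original analysis: derive near-equal-
# frequency breaks for a numeric vector from its quantiles.
equal_freq_cut <- function(x, n = 5) {
  brks <- unique(quantile(x, probs = seq(0, 1, length.out = n + 1), na.rm = TRUE))
  cut(x, breaks = brks, include.lowest = TRUE)
}
# Example: table(equal_freq_cut(as.numeric(raw_distance), 5)) would show the
# counts per level, where raw_distance is the numeric Distance before binning
# (a hypothetical name for illustration).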
price[, sapply(price, is.character)] <- lapply(price[, sapply(price, is.character)], factor)
price$square_price2000 <- as.factor(price$square_price2000)
summarizeColumns(price) %>% kable(caption = 'Feature Summary after Data Preprocessing')
Feature Summary after Data Preprocessing

|name             |type   |   na| mean|      disp| median| mad|  min|   max| nlevs|
|-----------------|-------|-----|-----|----------|-------|----|-----|------|------|
|Rooms            |factor |    0|   NA| 0.5361869|     NA|  NA|  132|  8517|     6|
|Type             |factor |    0|   NA| 0.2121658|     NA|  NA| 1395| 14467|     3|
|Method           |factor |    0|   NA| 0.3458585|     NA|  NA|  133| 12012|     5|
|Distance         |factor |    5|   NA|        NA|     NA|  NA| 1816|  6098|     5|
|Bedroom2         |factor |    6|   NA|        NA|     NA|  NA|   13|  8529|     7|
|Bathroom         |factor |    9|   NA|        NA|     NA|  NA|   16|  9078|     6|
|Car              |factor |  320|   NA|        NA|     NA|  NA|  240|  8491|     6|
|year             |factor | 6807|   NA|        NA|     NA|  NA| 1452|  3714|     5|
|Regionname       |factor |    0|   NA| 0.7066383|     NA|  NA|   80|  5387|     8|
|Propertycount    |factor |    0|   NA| 0.5810597|     NA|  NA| 1030|  7693|     4|
|square_price2000 |factor |    0|   NA| 0.4737788|     NA|  NA| 8700|  9663|     2|
str( price )
## 'data.frame': 18363 obs. of 11 variables:
## $ Rooms : Factor w/ 6 levels "1","2","3","4",..: 2 2 3 3 4 2 3 2 2 3 ...
## $ Type : Factor w/ 3 levels "h","t","u": 1 1 1 1 1 1 1 1 1 1 ...
## $ Method : Factor w/ 5 levels "PI","S","SA",..: 2 2 4 1 5 2 2 2 2 5 ...
## $ Distance : Factor w/ 5 levels "(0,5]","(5,10]",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Bedroom2 : Factor w/ 7 levels "0","1","2","3",..: 3 3 4 4 4 3 5 3 4 4 ...
## $ Bathroom : Factor w/ 6 levels "0","1","2","3",..: 2 2 3 3 2 2 3 2 2 3 ...
## $ Car : Factor w/ 6 levels "0","1","2","3",..: 2 1 1 2 3 1 1 3 3 3 ...
## $ year : Factor w/ 5 levels "(-5,10]","(10,30]",..: NA 5 5 NA 1 NA 5 5 5 2 ...
## $ Regionname : Factor w/ 8 levels "Eastern Metropolitan",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ Propertycount : Factor w/ 4 levels "(0,5e+03]","(5e+03,1e+04]",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ square_price2000: Factor w/ 2 levels "FALSE","TRUE": 2 2 2 2 2 2 2 2 2 2 ...
sapply(price[sapply(price, is.factor)], table)
##
## (0,5e+03] (5e+03,1e+04] (1e+04,1.5e+04] (1.5e+04,2.5e+04]
## 5903 7693 3737 1030
##
## $square_price2000
##
## FALSE TRUE
## 8700 9663
Data Exploration
Categorical Features
Rooms
According to the bar chart of rooms below, the distribution is roughly normal. In the other chart, showing the proportion of properties priced over 2000 per square meter, the distribution is skewed to the right. Based on our analysis, we found that properties with 2 or fewer bedrooms tend to have a higher selling price per square meter. In this particular dataset, properties with 2 or fewer bedrooms are more likely to have a price over 2000 AUD per square meter. Another trend we found is that the more bedrooms a property has, the less chance its selling price goes over 2000 AUD per square meter. Therefore, the number of bedrooms would be a predictive feature.
Further to our analysis, since the selling price per square meter of 1- or 2-bedroom properties is the highest in the market, combined with the fact that a greater number of bedrooms means a lower chance of being sold at a price per square meter greater than 2000 AUD, we can infer that the willingness to pay per square meter for smaller properties is higher; or, since the total price of smaller properties is less than that of larger properties, consumers’ buying power is still limited.
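The charts in this section were included as slide images. Below is a minimal sketch, under our own assumptions about the styling, of how such a pair of charts could be produced with ggplot2 and cowplot (both loaded in the Preliminaries); it is an illustration, not the report's original plotting code.

# Assumed sketch, not the original plotting code: a count bar chart and a
# proportion chart for Rooms, placed side by side with cowplot.
p_count <- ggplot(price, aes(x = Rooms)) +
  geom_bar() +
  labs(title = "Properties by number of rooms", y = "Count")
p_prop <- ggplot(price, aes(x = Rooms, fill = square_price2000)) +
  geom_bar(position = "fill") +
  labs(title = "Proportion over 2000 AUD per sqm", y = "Proportion")
plot_grid(p_count, p_prop, ncol = 2)

The same pattern applies to the other categorical features discussed below.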
Type
Properties categorized as unit, duplex or townhouse tend to have a greater chance of being sold for more than 2000 AUD per square meter. The percentage of these two types sold at a price higher than 2000 AUD per square meter is dominant.
But we have to take the sample size into consideration in the process of prediction as well. Based on the bar chart of type, the proportion of townhouse and unit properties is quite small. For now, we need to put this aside for further consideration.
Method
Based on the charts below, although the number of properties sold through vendor bid was not as large as for most of the other selling methods, the results from vendor bids stand out: they have the highest chance of all of reaching a price per square meter greater than 2000 AUD.
On the other hand, the smallest proportion of properties was sold through auction, and those properties had the lowest chance of being sold for over 2000 AUD per square meter.
Distance
Based on the two charts below, it is clear that the price per square meter is negatively correlated with the distance from the CBD: the greater the distance, the lower the price. Therefore, the distance variable can be a predictive feature.
Car Park Number
The trend for the car park number in relation to the price per square meter is very similar to that of the distance from the CBD. Based on the two charts below, it is clear that the price per square meter is negatively correlated with the car park number: the greater the number of car parks, the lower the price. Therefore, the car park number can be a predictive feature.
Bathroom
There are no clear trends relating the price per square meter to the number of bathrooms. Therefore, this would not be a predictive feature.
Age
We can see that properties aged between 100 and 900 years take a great proportion of the properties sold over 2000 AUD per square meter. We would then assume the cause to be that properties of such age are often associated with historical sites.
Regionname
In this graph, we can see that properties in the Southern Metropolitan area are in high demand in the Melbourne housing market.
Propertycount
From a property count of 0 to 15000, the proportion of prices over 2000 goes up; when the property count exceeds 15000, the proportion of such properties sold over 2000 drops sharply. We would assume that people have very specific requirements about living density.
Multivariate Visualisation
Rooms Num vs BedRoom2
Since the Bedroom2 data was drawn from a different source, we chose not to plot it on its own. Instead, we compare the Bedroom2 data with the Rooms data. According to the graph below, there is little difference between these two variables; what difference there is may result from different counting standards. (A hypothetical cross-tabulation check is sketched below.)
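A minimal, hypothetical way to quantify this agreement (our own illustration, not part of the original analysis) is to cross-tabulate the two variables and compute the share of matching values:

# Hypothetical check, not in the original analysis: how often do Rooms and
# Bedroom2 agree?
table(Rooms = price$Rooms, Bedroom2 = price$Bedroom2)
mean(as.character(price$Rooms) == as.character(price$Bedroom2), na.rm = TRUE)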
Therefore, we removed it.
price$Bedroom2 <- NULL
Distance, Property Count and Car Park Number
According to the graph below, we can conclude that properties closer to the CBD (within 10 km) sold at the highest price per square meter. The price per square meter increased along with the increase in the number of surrounding properties, but then dropped when the number of surrounding properties increased beyond a certain point. The price also goes down as the car park number increases. We can assume that a property within 0-10 km of the CBD, with around 15000 properties nearby and a small number of car parks, is more likely to be sold for over 2000 per square meter.
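The multivariate chart itself was a slide image; below is a minimal sketch, under our own assumptions about the encoding (proportion bars by Propertycount band, faceted by Distance and Car), of how such a chart could be drawn with ggplot2:

# Assumed sketch, not the original plotting code: proportion of properties
# sold over 2000 AUD per sqm by Propertycount band, faceted by Distance and Car.
sub <- na.omit(price[, c("Distance", "Propertycount", "Car", "square_price2000")])
ggplot(sub, aes(x = Propertycount, fill = square_price2000)) +
  geom_bar(position = "fill") +
  facet_grid(Car ~ Distance) +
  labs(y = "Proportion") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))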
Summary
In this assignment, we computed the price per square meter in order to avoid the effect of properties having different land and building sizes, and used the sold date and built date to compute the age of the property. In order to achieve the purpose of the analysis, we removed the data without a price, or without both land size and building size. For the categorical features, we created breaks so that there is a similar number of data items in each break, based on each variable's level table. In the data exploration, we plotted each variable's relation with the price per square meter. We found that Rooms, Method, Distance, Car, year, Regionname and Propertycount are potentially useful features for estimating the price classes.