Our aim was to develop algorithms which use a broad spectrum of features to predict realty prices. The algorithms rely on a rich dataset that includes housing data and macroeconomic patterns. An accurate forecasting model will allow Sberbank to provide more certainty to its customers in an uncertain economy.
2. TABLE OF CONTENTS
• Introduction
• Scope and Objective
• Data Overview
• Data Cleaning
• Data Analysis
• EDA
• Model Building: 1. Linear Regression
• Model Building: 2. Random Forest
• Variable Importance Plot
• Conclusion
3. Introduction
Housing costs demand a significant investment from both consumers and
developers. And when it comes to planning a budget, whether personal
or corporate, the last thing anyone needs is uncertainty about one of their
biggest expenses. Sberbank, Russia’s oldest and largest bank, helps its
customers by making predictions about realty prices so that renters,
developers, and lenders are more confident when they sign a lease or
purchase a building.
4. Problem Statement
Our aim was to develop algorithms which use a broad spectrum of features
to predict realty prices. The algorithms rely on a rich dataset that
includes housing data and macroeconomic patterns. An accurate forecasting
model will allow Sberbank to provide more certainty to its customers in an
uncertain economy.
5. Data Overview
Number of Observations
• Training Data – 30,471
• Testing Data – 7,662
Number of Features – 296
Macro Data
• Number of Observations – 2,484
• Number of Features – 100
Train Data – August 2011 to June 2015
Test Data – July 2015 to May 2016
7. Variables Description
Data Dictionary
price_doc: sale price (this is the target variable)
full_sq: total area in square meters, including loggias, balconies and other non-residential areas
life_sq: living area in square meters, excluding loggias, balconies and other non-residential areas
num_room: number of living rooms
kitch_sq: kitchen area
max_floor: number of floors in the building
state: apartment condition
build_year: year built
product_type: owner-occupier purchase or investment
floor: for apartments, the floor of the building
school_km: distance to high school
cemetery_km: distance to the cemetery
metro_km_avto: distance to subway by car, km
big_road2_km: the distance to next distant major road
nuclear_reactor_km: distance to nuclear reactor
additional_education_km: distance to additional education
stadium_km: distance to stadium
museum_km: distance to museums
cafe_sum_2000_min_price_avg: cafes and restaurant min average bill in 2000 meters zone
prom_part_1500: the share of industrial zones in 1500 meters zone
big_market_km: distance to grocery / wholesale markets
8. Missing Data
Out of 292 columns, 51 have missing values.
The percentage of missing values ranges from 0.1% in metro_min_walk
to 47.4% in hospital_beds_raion.
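The per-column missing percentages quoted above could be computed along the following lines. This is a minimal sketch using only the Python standard library; it assumes missing values appear in the CSV as empty strings or the token "NA" (an assumption, not stated in the slides), and the sample rows are made up for illustration.

```python
import csv
import io


def missing_share(rows, missing_tokens=("", "NA")):
    """Return {column: percent of rows with a missing value}, only for columns with gaps."""
    counts = {}
    total = 0
    for row in rows:
        total += 1
        for col, value in row.items():
            # DictReader yields None when a row has fewer fields than the header.
            if value is None or value.strip() in missing_tokens:
                counts[col] = counts.get(col, 0) + 1
    return {col: 100.0 * n / total for col, n in counts.items()}


# Made-up sample standing in for the training file.
sample = io.StringIO(
    "id,full_sq,hospital_beds_raion\n"
    "1,43,NA\n"
    "2,34,240\n"
    "3,,NA\n"
    "4,89,1183\n"
)
shares = missing_share(csv.DictReader(sample))
for col, pct in sorted(shares.items(), key=lambda kv: kv[1]):
    print(f"{col}: {pct:.1f}%")
```

Sorting by percentage makes it easy to see which columns are nearly complete and which may need imputation or removal.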
11. Price Trend over the Time Span
• The chart shows the average price over time.
• Average prices fluctuated between 2011 and 2015, with an overall
increase over time.
• However, there is a drop between June 2012 and December 2012.
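An average-price-over-time series like the one described above can be built by grouping transactions by calendar month. The sketch below assumes each record carries a transaction date and a price_doc value; the name of the date field is not given in the slides, and the sample records are invented.

```python
from collections import defaultdict
from datetime import date
from statistics import mean


def monthly_average(records):
    """records: iterable of (transaction_date, price). Returns {(year, month): mean price}."""
    buckets = defaultdict(list)
    for day, price in records:
        buckets[(day.year, day.month)].append(price)
    # Sort keys so the result reads chronologically.
    return {key: mean(prices) for key, prices in sorted(buckets.items())}


# Invented records for illustration only.
toy = [
    (date(2012, 6, 11), 5_500_000),
    (date(2012, 5, 3), 6_000_000),
    (date(2012, 5, 20), 7_000_000),
]
trend = monthly_average(toy)
for (year, month), avg in trend.items():
    print(f"{year}-{month:02d}: {avg:,.0f}")
```

A dip such as the one in the second half of 2012 would show up as a run of consecutive months whose averages fall below the preceding ones.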
12. Price by apartment size
Median price rises with apartment size, with one exception:
“Large” apartments have a median price slightly lower than
apartments of “Medium” size.
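The comparison above depends on how apartments are binned by size. The slides do not state the cut points, so the thresholds and the labels other than "Medium" and "Large" in this sketch are hypothetical; it only illustrates binning by full_sq and taking the median price per bin.

```python
from bisect import bisect_right
from collections import defaultdict
from statistics import median

# Hypothetical cut points in square meters; the real ones are not given in the slides.
EDGES = [40, 70, 100]
LABELS = ["Small", "Medium", "Large", "Extra large"]


def size_label(full_sq):
    """Map a total area to a size bin; a value equal to an edge falls in the upper bin."""
    return LABELS[bisect_right(EDGES, full_sq)]


def median_price_by_size(rows):
    """rows: iterable of (full_sq, price_doc). Returns {label: median price} in bin order."""
    groups = defaultdict(list)
    for full_sq, price in rows:
        groups[size_label(full_sq)].append(price)
    return {label: median(groups[label]) for label in LABELS if label in groups}


# Invented rows for illustration only.
toy_rows = [(30, 4_000_000), (35, 6_000_000), (80, 10_000_000)]
print(median_price_by_size(toy_rows))
```

The same pattern would apply to grouping buildings by max_floor, with the bin edges chosen to match the building-size categories.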
13. Price by building size
Apartment price as a function of building size shows
similar median prices for low-rise, medium, and high-
rise buildings. Larger buildings labelled “Sky” with 40
floors or more show a slightly higher median
apartment price.
14. Breakdown of the transactions for the top sub-areas
(by frequency) based on the product type.
15. Product Type by Build Year
Older buildings generally are involved in investment
transactions, possibly due to better deals, while newer
constructions are predominantly owner occupied.
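One way to check this pattern is to compute the share of investment transactions per construction decade. The sketch below assumes product_type takes the values "Investment" and "OwnerOccupier" (the exact encoding is an assumption), and the sample rows are invented.

```python
from collections import Counter


def investment_share_by_decade(rows):
    """rows: iterable of (build_year, product_type). Returns {decade: share of investment sales}."""
    counts = {}
    for build_year, product_type in rows:
        decade = (build_year // 10) * 10
        counts.setdefault(decade, Counter())[product_type] += 1
    return {
        decade: c["Investment"] / sum(c.values())
        for decade, c in sorted(counts.items())
    }


# Invented rows for illustration only.
toy_rows = [
    (1965, "Investment"), (1968, "Investment"), (1969, "OwnerOccupier"),
    (2014, "OwnerOccupier"), (2015, "OwnerOccupier"), (2015, "Investment"),
]
shares = investment_share_by_decade(toy_rows)
for decade, share in shares.items():
    print(f"{decade}s: {share:.0%} investment")
```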
18. Findings
• Simple exploration and visualization of the data highlight important considerations for
training our machine learning model.
• Our model must consider the factors behind the price dips in 2012.
• The model must also understand that low prices attract investors, as can be seen with
older buildings. However, if investment leads to an increase in price for a particular
building, the model must be able to adjust future predictions for that building accordingly.
Similarly, the model must be able to predict future prices for a given sub-area that has
seen an increase in average price over time as a result of investment.
35. Conclusion
• Comparing the accuracy of the two models, linear regression and random
forest, we see no large difference between them, but random forest is
somewhat more accurate than the linear model.
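The slides do not say which accuracy metric was used for the comparison. As an illustration only, the sketch below scores two sets of predictions with root mean squared logarithmic error (RMSLE), a common choice for price targets; the metric, the model names, and all numbers are assumptions.

```python
from math import log1p, sqrt


def rmsle(actual, predicted):
    """Root mean squared logarithmic error; lower is better."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and the same length")
    squared = [(log1p(p) - log1p(a)) ** 2 for a, p in zip(actual, predicted)]
    return sqrt(sum(squared) / len(squared))


# Invented hold-out prices and predictions from two hypothetical models.
actual = [5_000_000, 7_500_000, 12_000_000]
pred_a = [5_900_000, 6_600_000, 13_500_000]
pred_b = [5_300_000, 7_200_000, 12_400_000]

scores = {"model_a": rmsle(actual, pred_a), "model_b": rmsle(actual, pred_b)}
best = min(scores, key=scores.get)
for name, score in scores.items():
    print(f"{name}: RMSLE = {score:.4f}")
print("lower error:", best)
```

Scoring both models on the same held-out rows is what makes the comparison meaningful; a small gap between the two scores would match the "not much difference" reading above.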
37. Year of Build
• The distribution appears bimodal, with one peak
in the early 1970s and another in the most recent
years.
44. Material
• The data dictionary gives no further information
about the meaning of the individual values, but sale
price still varies across them.
45. Floor and Max Floor
Price seems to rise with the floor, although the effect is
pretty small.
There is a small positive correlation. This effect, however, is likely
confounded by the fact that the urban core has both more
expensive real estate and taller buildings, so the height of the
building alone is likely not what is determining price here.
46. Max Floor and Floor
The observations below the grey identity line have a floor
greater than the number of floors in the building.
There are 1,493 observations where this is the case.
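Rows of this kind can be flagged with a simple consistency check between floor and max_floor. This is a sketch on invented records; it skips rows where either value is missing rather than guessing.

```python
def floor_above_building(rows):
    """Return the rows whose floor exceeds max_floor; rows with a missing value are skipped."""
    flagged = []
    for row in rows:
        floor, max_floor = row.get("floor"), row.get("max_floor")
        if floor is None or max_floor is None:
            continue
        if floor > max_floor:
            flagged.append(row)
    return flagged


# Invented records for illustration only.
toy = [
    {"id": 1, "floor": 4, "max_floor": 9},
    {"id": 2, "floor": 12, "max_floor": 9},    # inconsistent
    {"id": 3, "floor": 3, "max_floor": None},  # missing, skipped
    {"id": 4, "floor": 5, "max_floor": 5},     # top floor, consistent
]
bad = floor_above_building(toy)
print(len(bad), "inconsistent rows:", [r["id"] for r in bad])
```

Whether such rows should be dropped, corrected, or kept with max_floor treated as missing is a modelling decision the slides leave open.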
47. Raion population density
These density numbers seem to make sense given that
the population density of Moscow as a whole is
8,537/sq km. There are a few raions that seem to have a
density of near zero, which seems odd. Home price
does seem to increase with population density.
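Population density here is population divided by area. The sketch below assumes each raion has a population count and an area in square meters (the field names and units are not given in the slides) and flags raions whose density falls below a hypothetical threshold.

```python
def density_per_sq_km(population, area_m2):
    """People per square kilometer, given an area in square meters."""
    if area_m2 <= 0:
        raise ValueError("area must be positive")
    return population / (area_m2 / 1_000_000)


def near_zero_density(raions, threshold=100.0):
    """raions: iterable of (name, population, area_m2). The threshold is a hypothetical cutoff."""
    return [name for name, pop, area in raions if density_per_sq_km(pop, area) < threshold]


# Invented raions for illustration only.
toy_raions = [
    ("A", 150_000, 10_000_000),  # 15,000 people per sq km
    ("B", 200, 50_000_000),      # 4 people per sq km
]
print(near_zero_density(toy_raions))
```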
48. Demographic & Geographic Characteristics
Price is correlated with most of these, but the
associations are fairly weak.
School Characteristics
No correlation between price and the school variables.
The school variables however are highly correlated with
each other.
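Statements like these rest on pairwise correlation coefficients. The slides do not say which coefficient was used; the sketch below computes the Pearson correlation from its definition, on invented numbers.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (spread_x * spread_y)


# Invented values: two school-related counts that move together, and a price column.
schools = [2, 4, 6, 8, 10]
preschools = [1, 2, 3, 4, 5]
prices = [7.1, 6.8, 7.3, 6.9, 7.0]
print("school vs preschool:", round(pearson(schools, preschools), 3))
print("school vs price:", round(pearson(schools, prices), 3))
```

Predictors that are highly correlated with each other carry overlapping information, which matters more for interpreting a linear regression than for a random forest.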
49. University_top_20_raion
Homes in a raion with three top-20 universities have the
highest median home price; however, the medians are fairly close
among raions with 0, 1, and 2. There are very few homes with three
top-20 universities in their raion, because only one district has
three such universities.
50. Cultural/Recreational Characteristics
• There are weak correlations between price and many
of these variables.
• There is a small positive correlation between price and
the number of ‘sports objects’ in a raion as well as
between price and the number of shopping centers.
• There is also a negative correlation between price and
the cultural and recreational amenities.
51. There is a positive correlation. There is a negative correlation.
52. Infrastructure Features
• There are weak correlations between price and many
of these variables.
• There is also a negative correlation between price and
the Kremlin_km and recreational amenities.
56. Conclusion
From the EDA, we can observe that:
• There is a clear upward trend in price across recent years.
• Price is positively correlated with most of the internal housing characteristics.
• A vast majority of the apartments have three rooms or fewer.
• There is a clear relation between total area and sale price: the bigger the
apartment, the higher the sale price.
• Property prices are higher for investment sales than for owner-occupier sales.
• The data dictionary gives no further information about the meaning of the
individual material values, but sale price still varies across them.
• Price seems to rise with the floor, although the effect is pretty small.
57. The variables concerning the size of the home, such as area and number of
rooms, have the highest importance.
The number of stories in the building seems to be quite important as well.
Build year ranks high.
The distance to the nearest sports court ranks high as well.
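The slides do not state how the importance ranking was computed. As one illustrative alternative to a random forest's built-in measure, the sketch below implements permutation importance: shuffle one feature at a time and record how much the prediction error grows. The stand-in model, the feature layout, and the data are all hypothetical.

```python
import random
from math import sqrt


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    return sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def permutation_importance(predict, X, y, n_repeats=5, seed=0):
    """Average increase in RMSE when each feature column is shuffled in turn."""
    rng = random.Random(seed)
    baseline = rmse(y, [predict(row) for row in X])
    importances = []
    for j in range(len(X[0])):
        increases = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)
            # Rebuild each row with only feature j replaced by its shuffled value.
            shuffled = [row[:j] + [value] + row[j + 1:] for row, value in zip(X, column)]
            increases.append(rmse(y, [predict(row) for row in shuffled]) - baseline)
        importances.append(sum(increases) / n_repeats)
    return importances


# Hypothetical stand-in model: price depends on area only and ignores the floor.
def toy_model(row):
    full_sq, floor = row
    return 100_000 * full_sq


X = [[30, 1], [50, 5], [70, 2], [90, 9]]  # columns: full_sq, floor (invented values)
y = [toy_model(row) for row in X]
scores = permutation_importance(toy_model, X, y)
for name, score in zip(["full_sq", "floor"], scores):
    print(f"{name}: {score:,.0f}")
```

A feature the model ignores gets an importance of zero, while shuffling a feature the model relies on raises the error. Applied to a fitted model and held-out data, the ranking could be compared with the one reported above.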