7. IT
Dev / Ops / Data
• Smartphone App
• Agile Development
• Device
• Cloud
• Help Desk
• Data Science
• Analytics
*http://www.itmedia.co.jp/enterprise/articles/1802/26/news007.html
*http://diamond.jp/articles/-/150122
27. Data : introduction
● Public land price data released by the Ministry of Land, Infrastructure, Transport and Tourism, as of 2018/01/01
● Area : Tokyo, Nagoya, Kanagawa, Chiba and Saitama (excluding the Tokyo islands)
● Land type : residential
28. Model 1 : Geographically Weighted Regression (GWR)
“Everything is related to everything else, but near things are more related than distant things.”
(Tobler’s first law of geography)
Normal Regression
Every point is treated the same for prediction
GWR
Closer points are treated as more important : the closer the point, the bigger the weight
[Figure: the same five observations (10, 30, 30, 10, 20) under both methods. Normal regression prediction: 20. GWR prediction: 12]
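The contrast between the two predictions can be sketched with a plain average versus a distance-weighted average. This is a deliberately simplified illustration (no explanatory variables, only a local mean), and the Gaussian weight function, the distances and the bandwidth are all assumptions made up for the example; only the five observed values come from the slide, so the weighted result is not expected to reproduce the slide's figure.

```python
import math

def equal_weight_prediction(values):
    """Plain average: every observation counts the same."""
    return sum(values) / len(values)

def distance_weighted_prediction(values, distances, bandwidth):
    """Weighted average whose (assumed Gaussian) weights decay with distance."""
    weights = [math.exp(-0.5 * (d / bandwidth) ** 2) for d in distances]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)

values = [10, 30, 30, 10, 20]            # observations from the slide
distances = [1.0, 6.0, 7.0, 1.5, 4.0]    # hypothetical distances to the prediction point

print(equal_weight_prediction(values))   # 20.0
print(distance_weighted_prediction(values, distances, bandwidth=2.0))
```

With these made-up distances the two points valued 10 are the closest, so the weighted prediction is pulled well below the plain average.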
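29. Model 1 : Geographically Weighted Regression (GWR)
● Model definition:
$$y_i = \sum_k \beta_k(u_i, v_i)\, x_{ik} + \varepsilon_i$$
with y_i the land price, x_ik the value of the variable k, (u_i, v_i) the coordinates, β_k the regression parameter for the variable k and ε_i the error at location i (an intercept corresponds to a constant variable x_i0 = 1)
● Parameter estimation (regression) in matrix notation:
$$\hat{\beta}(u_i, v_i) = \left( X^T W(u_i, v_i)\, X \right)^{-1} X^T W(u_i, v_i)\, y$$
with W(u_i, v_i) the diagonal matrix of the geographical weights of each observed data point for regression point i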
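As a rough illustration of local estimation, the sketch below fits a weighted least-squares line (intercept plus one variable) separately at each regression point. For a single variable the matrix estimate reduces to the closed form used here. The Gaussian kernel, the fixed bandwidth, the function names and the toy data are assumptions for the example, not the setup used in the slides.

```python
import math

def gaussian_weight(distance, bandwidth):
    """Assumed kernel: weight decays smoothly with distance."""
    return math.exp(-0.5 * (distance / bandwidth) ** 2)

def local_wls(point, coords, x, y, bandwidth):
    """Return (beta0, beta1) estimated at `point` by weighted least squares."""
    w = [gaussian_weight(math.dist(point, c), bandwidth) for c in coords]
    total = sum(w)
    x_bar = sum(wi * xi for wi, xi in zip(w, x)) / total
    y_bar = sum(wi * yi for wi, yi in zip(w, y)) / total
    sxy = sum(wi * (xi - x_bar) * (yi - y_bar) for wi, xi, yi in zip(w, x, y))
    sxx = sum(wi * (xi - x_bar) ** 2 for wi, xi in zip(w, x))
    beta1 = sxy / sxx
    return y_bar - beta1 * x_bar, beta1

# Toy data: the slope is 2 in the western cluster and 5 in the eastern one.
coords = [(0, 0), (1, 0), (0, 1), (1, 1), (10, 0), (11, 0), (10, 1), (11, 1)]
x = [1, 2, 3, 4, 1, 2, 3, 4]
y = [3, 5, 7, 9, 6, 11, 16, 21]

west = local_wls((0.5, 0.5), coords, x, y, bandwidth=1.0)
east = local_wls((10.5, 0.5), coords, x, y, bandwidth=1.0)
print(west, east)
```

A single global regression would return one compromise slope; the local fits recover a different slope in each cluster because distant points receive almost no weight.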
30. Model 1 : Geographically Weighted Regression (GWR)
● Weight at regression point i of data point j : a function of dij and b that decreases with distance, where dij is the distance between i and j, and b is the bandwidth or scale
● Bandwidth (scale) selection with a golden-section search algorithm
● Bandwidth is adaptive rather than fixed :
At each regression point the bandwidth is set to include the k nearest neighbors, rather than being a fixed distance
This lets regression points with few close neighbors still draw on enough data points
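The two ingredients above can be sketched as follows. The slides do not state which kernel or which selection criterion was used, so the bisquare kernel and the toy objective function below are assumptions, and both function names are hypothetical. With this kernel the k-th neighbor sits exactly on the bandwidth and receives zero weight.

```python
import math

def adaptive_bisquare_weights(point, coords, k):
    """Weights for one regression point; the bandwidth is the distance
    to its k-th nearest data point (adaptive bandwidth)."""
    dists = [math.dist(point, c) for c in coords]
    b = sorted(dists)[k - 1]
    return [(1 - (d / b) ** 2) ** 2 if d < b else 0.0 for d in dists]

def golden_section_min(f, lo, hi, tol=1e-6):
    """Minimise a unimodal function f on [lo, hi] by golden-section search."""
    inv_phi = (math.sqrt(5) - 1) / 2
    a, b = lo, hi
    while abs(b - a) > tol:
        c = b - inv_phi * (b - a)   # lower interior probe
        d = a + inv_phi * (b - a)   # upper interior probe
        if f(c) < f(d):
            b = d                   # minimum lies in [a, d]
        else:
            a = c                   # minimum lies in [c, b]
    return (a + b) / 2

coords = [(0, 0), (1, 0), (2, 0), (10, 0)]
print(adaptive_bisquare_weights((0, 0), coords, k=3))

# Stand-in for a bandwidth score (for example a cross-validation error).
best = golden_section_min(lambda b: (b - 2.0) ** 2, 0.0, 5.0)
print(round(best, 3))
```

In a real bandwidth search the objective would be a model-fit score evaluated for each candidate k, and the result would be rounded to an integer number of neighbors.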
31. Model 2 : Hedonic
● Model definition : $$\log Y_i = \beta_0 + \sum_k \beta_k x_{ik} + \varepsilon_i$$
● Regression method: OLS
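A hedonic regression fit by OLS can be sketched in a few lines of standard-library Python. The parcels, the coefficient values and the helper names below are invented for illustration; only the idea (log price regressed on log-transformed and flag variables) follows the slide.

```python
import math

def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [vr - factor * vc for vr, vc in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def ols(rows, targets):
    """Ordinary least squares via the normal equations (X^T X) beta = X^T y."""
    X = [[1.0] + list(r) for r in rows]   # prepend the intercept column
    p = len(X[0])
    xtx = [[sum(x[i] * x[j] for x in X) for j in range(p)] for i in range(p)]
    xty = [sum(x[i] * t for x, t in zip(X, targets)) for i in range(p)]
    return solve(xtx, xty)

# Hypothetical parcels: (distance to closest station in m, gas flag).
parcels = [(200, 1), (400, 1), (800, 0), (1600, 1), (3200, 0), (500, 0)]
true_beta = (14.0, -0.3, 0.1)   # made-up intercept, distance and gas effects
log_price = [true_beta[0] + true_beta[1] * math.log(d) + true_beta[2] * g
             for d, g in parcels]

features = [(math.log(d), g) for d, g in parcels]
beta = ols(features, log_price)
print([round(b, 3) for b in beta])
```

Because the toy prices are generated without noise, the fit recovers the made-up coefficients; with real data the coefficient on a log-transformed variable reads as an elasticity of price with respect to that variable.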
Yi : land price per m² (log)

Variables (xk)                | Units       | Explanations
Distance to closest station   | m (log)     |
Distance to big railway hubs  | m (log)     | (...)
Floor-area ratio              | %           |
Road width                    | m           |
Gas                           | Flag        | connected to gas infrastructure or not
Land area                     | m² (log)    |
Building material             | Categorical | concrete, wood, ...
Land usage                    | Categorical | low-rise residential type-1, semi-residential, ...
Neighborhood weighted prices  |             | distance-weighted average of the 9 nearest neighbors
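The last variable in the table can be sketched as below. The slide only says "distance-weighted average of the 9 nearest neighbors"; the inverse-distance weighting, the exclusion of the point itself and the function name are assumptions.

```python
import math

def neighborhood_weighted_price(index, coords, prices, k=9):
    """Inverse-distance-weighted mean price of the k nearest other points."""
    here = coords[index]
    neighbors = sorted(
        (math.dist(here, c), p)
        for j, (c, p) in enumerate(zip(coords, prices))
        if j != index                      # do not use the point's own price
    )[:k]
    # Guard against zero distance for co-located points.
    weights = [1.0 / max(d, 1e-9) for d, _ in neighbors]
    return sum(w * p for w, (_, p) in zip(weights, neighbors)) / sum(weights)

coords = [(0, 0), (1, 0), (2, 0)]
prices = [100.0, 200.0, 400.0]
print(neighborhood_weighted_price(0, coords, prices, k=2))
```

When a model is evaluated on held-out data, a feature like this would need to be computed from training prices only, otherwise the target leaks into the predictors.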
32. Model 3 : Boosted Trees (XGBoost)
● A set of weak learners (decision trees) is combined into a strong learner
● Trees are grown sequentially : each tree is grown using information from the previously grown trees
● Boosted trees are implemented using the XGBoost library
● Same variables as Model 2 (Hedonic) : distance to closest station, distance to big railway hubs, floor-area ratio, road width, gas, land area, building material, land usage, neighborhood weighted prices
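The sequential idea can be illustrated with a toy gradient-boosting loop in which each new one-split tree (a stump) is fit to the residuals left by the trees before it. This is not XGBoost itself, which adds regularisation and second-order gradient information; the data, the function names and the hyperparameters are made up for the example.

```python
def fit_stump(x, residuals):
    """Best single-split regression tree on one feature (least squares)."""
    best = None
    xs = sorted(set(x))
    for lo, hi in zip(xs, xs[1:]):
        threshold = (lo + hi) / 2
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1:]

def boost(x, y, n_trees=50, learning_rate=0.3):
    """Grow stumps one after another, each on the current residuals."""
    base = sum(y) / len(y)
    pred = [base] * len(y)
    stumps = []
    for _ in range(n_trees):
        residuals = [yi - pi for yi, pi in zip(y, pred)]
        t, lm, rm = fit_stump(x, residuals)
        stumps.append((t, lm, rm))
        pred = [p + learning_rate * (lm if xi <= t else rm)
                for xi, p in zip(x, pred)]
    return base, learning_rate, stumps

def predict(model, xi):
    base, learning_rate, stumps = model
    return base + learning_rate * sum(lm if xi <= t else rm
                                      for t, lm, rm in stumps)

def mse(actual, predicted):
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [10, 12, 11, 30, 32, 31, 50, 52]
model = boost(x, y)
baseline_mse = mse(y, [sum(y) / len(y)] * len(y))
boosted_mse = mse(y, [predict(model, xi) for xi in x])
print(baseline_mse, boosted_mse)
```

Each stump alone is a weak predictor, but the sum of many small corrections tracks the step-like pattern in the toy data far better than the mean.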
[Figure: the dataset is split into train / test sets, a model is fit for each split, and train and test errors are collected]
34. Results : prediction error distribution
GWR 11.9 % 8.4 %
Hedonic 11.9 % 8.7 %
XGBoost 11.3 % 8.3 %
● Error estimation via cross-validation over 100 random splits
(train / evaluation share of the data : 75% / 25%)
[Figures: mean prediction error by area; error distribution for each model]
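An evaluation loop of this kind can be sketched as repeated random 75% / 25% splits. The error metric used in the slides is not spelled out, so mean absolute percentage error is an assumption here, as are the function names, the toy data and the trivial "predict the training mean" model.

```python
import random

def mean_abs_pct_error(actual, predicted):
    """Mean absolute percentage error, in percent."""
    return 100 * sum(abs(p - a) / a for a, p in zip(actual, predicted)) / len(actual)

def repeated_holdout(data, fit, predict, rounds=100, train_share=0.75, seed=0):
    """Average train and test error over `rounds` random train/test splits."""
    rng = random.Random(seed)
    train_errors, test_errors = [], []
    for _ in range(rounds):
        shuffled = data[:]
        rng.shuffle(shuffled)
        cut = int(len(shuffled) * train_share)
        train, test = shuffled[:cut], shuffled[cut:]
        model = fit(train)
        for subset, errors in ((train, train_errors), (test, test_errors)):
            actual = [y for _, y in subset]
            predicted = [predict(model, x) for x, _ in subset]
            errors.append(mean_abs_pct_error(actual, predicted))
    return sum(train_errors) / rounds, sum(test_errors) / rounds

def fit_mean(train):
    """Toy model: remember the mean target of the training set."""
    return sum(y for _, y in train) / len(train)

def predict_mean(model, x):
    return model

data = [(i, 100.0 + i) for i in range(40)]   # made-up (feature, price) pairs
train_err, test_err = repeated_holdout(data, fit_mean, predict_mean)
print(train_err, test_err)
```

Any of the three models could be plugged in through `fit` and `predict`; fixing the seed makes the splits reproducible so that models are compared on identical partitions.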
36. Results : discussion
XGBoost was the best-performing model
● Model limitations and possible improvements
○ GWR
The current version is single-scale: it assumes that all variables experience local effects on the same scale.
Multiscale GWR would drop that assumption and potentially improve the accuracy
○ Hedonic / XGBoost :
Spatial correlation effects are treated only empirically, through the neighborhood weighted price variable
● Data :
○ Data points : some areas have few data points, dragging the overall accuracy down (especially Chiba)
○ Distance : Euclidean distance is used in every case
Other distance metrics (Manhattan distance, commute time, ...) might be better suited
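A small sketch of why the choice of metric matters: the nearest station can differ between Euclidean and Manhattan distance. The coordinates and station names are hypothetical and assumed to be in a planar (projected) coordinate system.

```python
import math

def euclidean(p, q):
    """Straight-line distance."""
    return math.hypot(p[0] - q[0], p[1] - q[1])

def manhattan(p, q):
    """Distance along a rectangular street grid."""
    return abs(p[0] - q[0]) + abs(p[1] - q[1])

parcel = (0, 0)
stations = {"A": (3, 3), "B": (0, 5)}   # hypothetical station locations

nearest_euclidean = min(stations, key=lambda s: euclidean(parcel, stations[s]))
nearest_manhattan = min(stations, key=lambda s: manhattan(parcel, stations[s]))
print(nearest_euclidean, nearest_manhattan)  # A B
```

Variables such as distance to closest station and the nearest-neighbor price average would change accordingly, which is why the metric is listed as a possible improvement.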