Machine Learning and Data Science Initiatives at Open House

IT
IT
Dev
OpsData
• Smartphone App
• Agile Development
• Device
• Cloud
• Help Desk
• Data Science
• Analytics
*http://www.itmedia.co.jp/enterprise/articles/1802/26/news007.html
*http://diamond.jp/articles/-/150122

2014 2015 2016 2017
Legacy
SendMail
H/W
Analog
Feature Phone
W/F
In-House
DevOps
Digital
Phase 1
Cloud
Phase 0
On-Premises
(1/2)

2017 2018
CRM/SFA
MA
D-Marketing
IoT
Collab.
Phase 3
Insighti
Real Estate SPA
DD-Biz.
ML/DL
Big Data
PoC
Phase 2
Agility
(2/2)

IT
Dev
OpsData
Data

[Diagram: data sources (Web, SFA, RDB, GIS .shp files) and tools (embulk, BigQuery (BQ), BQ GIS)]

BigQuery



Stacking (Acc:88%)
10
(Acc:86%)
(Acc:82%)

Comparison of urban land price prediction approaches

Data : introduction
● Public Land Price data released by the Ministry of Land, Infrastructure, Transport and Tourism, as of 2018/01/01
● Area: Tokyo, Nagoya, Kanagawa, Chiba and Saitama (excluding the Tokyo Islands)
● Type of land: Residential

Model 1 : Geographically Weighted Regression (GWR)
“Everything is related to everything else. But near things are more related than distant things”
Tobler’s first law of Geography
Normal Regression
Every point is treated the same for prediction.
Example values: 10, 30, 30, 10, 20 → Prediction: 20
GWR
Closer points are treated as more important: the closer the point, the bigger its weight.
Same values: 10, 30, 30, 10, 20 → Prediction: 12
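To make the contrast concrete, here is a minimal Python sketch. The five values are the ones shown on the slide; the distances, the Gaussian kernel and the bandwidth of 1.0 are assumptions made purely for illustration, so the weighted result is not meant to reproduce the slide's prediction of 12.

```python
import math

# Values of the five neighboring points, as shown on the slide.
values = [10, 30, 30, 10, 20]
# Hypothetical distances from the prediction point to each neighbor (assumption).
distances = [0.5, 3.0, 3.5, 1.0, 2.0]
bandwidth = 1.0  # assumed kernel scale


def plain_mean(vals):
    """Every point counts the same."""
    return sum(vals) / len(vals)


def distance_weighted_mean(vals, dists, b):
    """Closer points get a larger Gaussian weight exp(-0.5 * (d / b)^2)."""
    weights = [math.exp(-0.5 * (d / b) ** 2) for d in dists]
    return sum(w * v for w, v in zip(weights, vals)) / sum(weights)


unweighted = plain_mean(values)
weighted = distance_weighted_mean(values, distances, bandwidth)
print(unweighted)
print(weighted)
```

Because the two closest points both have the value 10, the weighted estimate is pulled below the plain mean.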

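● Model definition (standard GWR form):
$$ y_i = \beta_0(u_i, v_i) + \sum_k \beta_k(u_i, v_i)\, x_{ik} + \varepsilon_i $$
with y_i the land price, x_{ik} the value of variable k, (u_i, v_i) the coordinates, \beta_k(u_i, v_i) the regression parameter for variable k, and \varepsilon_i the error at location i
● Parameter estimation (regression) in matrix notation:
$$ \hat{\beta}(u_i, v_i) = \left( X^T W(u_i, v_i)\, X \right)^{-1} X^T W(u_i, v_i)\, y $$
with W(u_i, v_i) the diagonal matrix of geographical weights of each observed data point for regression point i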
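The estimation step can be sketched with NumPy, assuming the usual weighted least squares estimate, beta = (XᵀWX)⁻¹ XᵀWy, computed once per regression point. The function name `local_coefficients`, the synthetic data and the weight values are all hypothetical.

```python
import numpy as np


def local_coefficients(X, y, weights):
    """Solve (X^T W X) beta = X^T W y for one regression point.

    X: (n, p) design matrix whose first column is ones (intercept).
    y: (n,) observed values.
    weights: (n,) diagonal of the weight matrix W(u_i, v_i).
    """
    XtW = X.T * weights  # equivalent to X.T @ np.diag(weights)
    return np.linalg.solve(XtW @ X, XtW @ y)


# Synthetic data (assumption): y = 2 + 3 * x with no noise.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
X = np.column_stack([np.ones_like(x), x])
y = 2.0 + 3.0 * x
w = np.array([1.0, 0.8, 0.5, 0.2, 0.1])  # hypothetical geographical weights

beta = local_coefficients(X, y, w)
print(beta)
```

In a full GWR run this solve is repeated for every regression point, each with its own weight vector.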

● Weight at regression point i of data point j: w_ij decreases with d_ij / b, where d_ij is the distance between i and j, and b is the bandwidth (scale)
● Bandwidth (scale) selection with a golden-section search algorithm
● Bandwidth is adaptive rather than fixed:
At each regression point the bandwidth is set so that it includes the k nearest neighbors, rather than using a fixed distance.
This allows data points with few close neighbors to be incorporated better in the regression.
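A minimal sketch of the two ingredients above: adaptive kernel weights and a golden-section search. The Gaussian kernel is an assumption (the slide's weight formula is not reproduced here), the bandwidth is taken as the distance to the k-th nearest neighbor, and the search is shown on a toy quadratic because the actual selection criterion (for example a cross-validation score) is not stated. All function names are hypothetical.

```python
import math
import numpy as np


def adaptive_weights(coords, i, k):
    """Gaussian weights around point i, with bandwidth b set to the
    distance from i to its k-th nearest neighbor (adaptive bandwidth)."""
    d = np.linalg.norm(coords - coords[i], axis=1)
    b = np.sort(d)[k]  # index 0 is the point itself (distance 0)
    return np.exp(-0.5 * (d / b) ** 2)


def golden_section_minimize(f, lo, hi, tol=1e-5):
    """Minimize a unimodal function f on the interval [lo, hi]."""
    inv_phi = (math.sqrt(5) - 1) / 2
    c = hi - inv_phi * (hi - lo)
    d = lo + inv_phi * (hi - lo)
    while abs(hi - lo) > tol:
        if f(c) < f(d):
            hi = d  # minimum lies in [lo, d]
        else:
            lo = c  # minimum lies in [c, hi]
        c = hi - inv_phi * (hi - lo)
        d = lo + inv_phi * (hi - lo)
    return (lo + hi) / 2


# Three clustered points and one isolated point (synthetic coordinates).
coords = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [10.0, 0.0]])
w_dense = adaptive_weights(coords, 0, 2)     # bandwidth = 2
w_isolated = adaptive_weights(coords, 3, 2)  # bandwidth = 9
best = golden_section_minimize(lambda b: (b - 2.0) ** 2, 0.0, 5.0)
print(w_dense, w_isolated, best)
```

The isolated point ends up with a much larger bandwidth, so its distant neighbors still receive non-negligible weight.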

Model 2 : Hedonic
● Model definition:
$$ \log Y_i = \beta_0 + \sum_k \beta_k x_{ik} + \varepsilon_i $$
● Regression method: OLS
Y_i: land price per m² (log)

Variable (x_k)                 Units         Explanation
Distance to closest station    m (log)
Distance to big railway hubs   m (log)
Floor-area ratio               %
Road width                     m
Gas                            Flag          connected to gas infrastructure or not
Land area                      m² (log)
Building material              Categorical   concrete, wood, ...
Land usage                     Categorical   low-rise residential type-1, semi-residential, ...
Neighborhood weighted prices                 distance-weighted average of the 9 nearest neighbors
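The following sketch illustrates two pieces of the hedonic setup on synthetic data: a neighborhood weighted price feature and a log-linear OLS fit. Inverse-distance weighting is an assumption (the slide only says "distance-weighted"), and the function names, coordinates and price-generating formula are all hypothetical.

```python
import numpy as np


def neighborhood_weighted_price(coords, prices, k=9):
    """Inverse-distance-weighted mean price of the k nearest other points."""
    out = np.empty(len(prices))
    for i in range(len(prices)):
        d = np.linalg.norm(coords - coords[i], axis=1)
        d[i] = np.inf  # exclude the point itself
        nearest = np.argsort(d)[:k]
        w = 1.0 / np.maximum(d[nearest], 1e-9)
        out[i] = np.sum(w * prices[nearest]) / np.sum(w)
    return out


def fit_ols(features, target):
    """Ordinary least squares with an intercept; returns [b0, b1, ...]."""
    design = np.column_stack([np.ones(len(target)), features])
    beta, *_ = np.linalg.lstsq(design, target, rcond=None)
    return beta


# Synthetic data (assumption): log price falls with log distance to station.
rng = np.random.default_rng(0)
n = 200
coords = rng.uniform(0, 10_000, size=(n, 2))
station_dist = rng.uniform(100, 3000, size=n)
prices = np.exp(14.0 - 0.3 * np.log(station_dist) + rng.normal(0, 0.05, size=n))

nbr_price = neighborhood_weighted_price(coords, prices, k=9)
features = np.column_stack([np.log(station_dist), np.log(nbr_price)])
beta = fit_ols(features, np.log(prices))
print(beta)
```

The coefficient on the log distance to the station should come out close to the -0.3 used to generate the data.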

Model 3 : Boosted Trees (XGBoost)
● A set of weak learners (decision trees) is combined into a strong learner
● Trees are grown sequentially: each tree is grown using information from the previously grown trees
● Boosted trees are implemented with the XGBoost library
● Same variables as Model 2 (Hedonic): distance to closest station, distance to big railway hubs, floor-area ratio, road width, gas, land area, building material, land usage, neighborhood weighted prices
[Diagram labels: DATASET, TRAIN / TEST, MODEL, ERRORS]
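The sequential idea can be illustrated without the XGBoost library itself. The sketch below is a deliberately simplified squared-error boosting loop with one-split trees (stumps) on a single feature; it is not XGBoost's algorithm, which adds regularization and second-order information, and the data, function names and hyperparameters are hypothetical.

```python
import numpy as np


def fit_stump(x, residual):
    """Best single-split regression stump on a 1-D feature (least squares)."""
    best = None
    for threshold in np.unique(x)[:-1]:  # keep the right side non-empty
        left = residual[x <= threshold]
        right = residual[x > threshold]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, threshold, left.mean(), right.mean())
    return best[1:]


def predict_stump(stump, x):
    threshold, left_value, right_value = stump
    return np.where(x <= threshold, left_value, right_value)


def boost(x, y, n_trees=50, learning_rate=0.3):
    """Each new stump is fitted to what the previous stumps still get wrong."""
    prediction = np.full_like(y, y.mean(), dtype=float)
    stumps = []
    for _ in range(n_trees):
        residual = y - prediction
        stump = fit_stump(x, residual)
        prediction += learning_rate * predict_stump(stump, x)
        stumps.append(stump)
    return stumps, prediction


# Synthetic 1-D regression problem (assumption).
x = np.linspace(0.0, 6.0, 80)
y = np.sin(x)
stumps, fitted = boost(x, y)
mse_start = np.mean((y - y.mean()) ** 2)
mse_end = np.mean((y - fitted) ** 2)
print(mse_start, mse_end)
```

Each individual stump is a very weak model, but the training error shrinks as stumps accumulate.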

Results : Ratios of low-error predictions
Model    Error   Global  Tokyo  Saitama  Kanagawa  Chiba  Nagoya
GWR      < 5%    32%     39%    27%      34%       22%    34%
GWR      < 10%   57%     63%    57%      62%       37%    59%
GWR      < 20%   83%     86%    85%      89%       67%    87%
Hedonic  < 5%    34%     38%    32%      33%       21%    34%
Hedonic  < 10%   56%     63%    64%      57%       41%    59%
Hedonic  < 20%   83%     85%    89%      85%       64%    87%
XGBoost  < 5%    35%     41%    34%      35%       21%    35%
XGBoost  < 10%   61%     66%    66%      63%       42%    61%
XGBoost  < 20%   85%     92%    88%      88%       69%    88%
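A ratio of low-error predictions can be computed as below. This sketch assumes that "error" means the absolute relative error |predicted - actual| / actual, which the slides do not define explicitly; the function name and the sample prices are hypothetical.

```python
def low_error_ratios(actual, predicted, thresholds=(0.05, 0.10, 0.20)):
    """Share of predictions whose relative error is below each threshold."""
    rel_errors = [abs(p - a) / a for a, p in zip(actual, predicted)]
    return {t: sum(e < t for e in rel_errors) / len(rel_errors) for t in thresholds}


# Hypothetical prices: relative errors of 3%, 8%, 15% and 30%.
actual = [100.0, 200.0, 300.0, 400.0]
predicted = [103.0, 184.0, 345.0, 520.0]
ratios = low_error_ratios(actual, predicted)
print(ratios)
```

The ratios are cumulative, so each row of the table above is at least as large as the row for the stricter threshold.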

Results : prediction error distribution
GWR 11.9 % 8.4 %
Hedonic 11.9 % 8.7 %
XGBoost 11.3 % 8.3 %
● Error estimation via 100-fold cross-validation (train / eval split of the data: 75% / 25%)
[Figure: Mean prediction error by area]
[Figure: Error distribution for each model]
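One way to read the evaluation protocol is as 100 repeated random 75% / 25% splits; that interpretation is an assumption, since a strict 100-fold scheme would hold out 1% of the data per fold. The sketch below follows that reading with a toy "predict the training mean" model, and returns the mean and median relative error as two possible summaries. All names and data are hypothetical.

```python
import random
import statistics


def repeated_holdout(data, fit, predict, n_repeats=100, train_fraction=0.75, seed=0):
    """Collect relative errors over repeated random train / eval splits."""
    rng = random.Random(seed)
    n_train = int(len(data) * train_fraction)
    errors = []
    for _ in range(n_repeats):
        shuffled = data[:]
        rng.shuffle(shuffled)
        train, test = shuffled[:n_train], shuffled[n_train:]
        model = fit(train)
        errors.extend(abs(predict(model, x) - y) / y for x, y in test)
    return errors


def fit_mean(train):
    """Toy model (assumption): remember the mean training price."""
    return statistics.mean(y for _, y in train)


def predict_mean(model, x):
    return model


data = [(i, 100.0 + i) for i in range(40)]  # synthetic (feature, price) pairs
errors = repeated_holdout(data, fit_mean, predict_mean)
mean_error = statistics.mean(errors)
median_error = statistics.median(errors)
print(mean_error, median_error)
```

Any model with a fit / predict pair can be plugged in, which keeps the comparison between GWR, Hedonic and XGBoost on identical splits when the same seed is used.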

Results : maps
[Figure: Predicted prices map around Tokyo]
[Figure: Prediction error around Tokyo]

Results : discussion
XGBoost was the best-performing model overall (85% of predictions within 20% error, versus 83% for both GWR and Hedonic)
● Model limitations and possible improvements
○ GWR:
The current version is single-scale: it assumes that all variables experience local effects on the same scale.
Multiscale GWR would drop that assumption and potentially improve the accuracy.
○ Hedonic/XGBoost:
Spatial correlation effects are treated empirically.
● Data:
○ Data points: some areas have few data points, which drags the overall accuracy down (especially Chiba)
○ Euclidean distance is used in every case.
Other distance metrics (Manhattan distance, commute time, ...) might be better suited.
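For reference, the two geometric metrics mentioned above differ as follows. This is a minimal sketch that assumes points are given in projected planar coordinates (for example meters); commute time would need an external routing source and is not shown.

```python
import math


def euclidean(p, q):
    """Straight-line distance between two planar points."""
    return math.hypot(p[0] - q[0], p[1] - q[1])


def manhattan(p, q):
    """Distance along axis-aligned segments, closer to a street grid."""
    return abs(p[0] - q[0]) + abs(p[1] - q[1])


a, b = (0.0, 0.0), (3.0, 4.0)
print(euclidean(a, b))  # 5.0
print(manhattan(a, b))  # 7.0
```

Swapping the metric changes both the GWR weights and the nearest-neighbor sets used for the neighborhood weighted prices.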
