Large-Scale Geographically Weighted Regression on Spark

Hung Tien Tran, Hiep Tuan Nguyen, Viet-Trung Tran
Hanoi University of Science and Technology

Introduction
 What is Geographically Weighted Regression?
 What is our work?
Source: http://desktop.arcgis.com
GWR + Spark =
- Large-scale spatial data
- Improve performance
- Distributed

Outline
 Background
 Problem
 Scalable GWR on Spark
 Experiments
 Discussion
 Conclusion

Background
 First Law of Geography - Waldo Tobler:
“Everything is related to everything else, but near
things are more related than distant things”.
 GWR model
$y_i(u) = \beta_{0i}(u) + \beta_{1i}(u)\,x_{1i} + \beta_{2i}(u)\,x_{2i} + \dots + \beta_{mi}(u)\,x_{mi}$
 The weighted least squares estimator at location u takes the form
$\hat{\beta}(u) = (X^T W(u) X)^{-1} X^T W(u) Y$
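As a small worked illustration of the estimator $\hat{\beta}(u) = (X^T W(u) X)^{-1} X^T W(u) Y$, the sketch below evaluates it for a single location using plain Python. The toy data, the weight values, and the helper names `solve` and `local_estimate` are assumptions made for illustration; they are not taken from the authors' implementation.

```python
def solve(a, b):
    """Solve a @ x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def local_estimate(X, y, w):
    """Return (X^T W X)^{-1} X^T W y, where W is diagonal with entries w."""
    k = len(X[0])
    xtwx = [[sum(wi * row[p] * row[q] for wi, row in zip(w, X)) for q in range(k)]
            for p in range(k)]
    xtwy = [sum(wi * row[p] * yi for wi, row, yi in zip(w, X, y)) for p in range(k)]
    return solve(xtwx, xtwy)


# Toy data: an intercept column plus one feature, with y = 1 + 2 * x exactly.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [1.0, 3.0, 5.0, 7.0]
w = [1.0, 0.8, 0.4, 0.1]  # hypothetical diagonal of W(u) for one location u

beta = local_estimate(X, y, w)
```

Because the toy labels are exactly linear in the feature, `beta` comes out as approximately `[1.0, 2.0]` for any positive weights; with real data the estimate changes from location to location as W(u) changes.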

Background
 Kernel function
 Gaussian function
 Bandwidth
[Figure: fixed bandwidth vs. adaptive bandwidth]
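The kernel and bandwidth ideas can be sketched in a few lines. The Gaussian form exp(-0.5 (d/h)^2) used here is one common convention and is an assumption, since the slide does not spell out the formula. The adaptive bandwidth is taken as the distance to the k-th nearest data point, which is also an assumed definition, and the function names are hypothetical.

```python
import math


def gaussian_weight(distance, bandwidth):
    """Gaussian kernel weight (assumed form): exp(-0.5 * (d / h)^2)."""
    return math.exp(-0.5 * (distance / bandwidth) ** 2)


def adaptive_bandwidth(u, points, k):
    """Assumed adaptive rule: distance from u to its k-th nearest data point."""
    dists = sorted(math.dist(u, p) for p in points)
    return dists[k - 1]


points = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (3.0, 4.0)]
u = (0.0, 0.0)  # regression point

# Fixed bandwidth: the same h is used at every regression point.
fixed_h = 2.0
fixed_weights = [gaussian_weight(math.dist(u, p), fixed_h) for p in points]

# Adaptive bandwidth: h depends on how dense the data is around u.
adaptive_h = adaptive_bandwidth(u, points, k=3)
adaptive_weights = [gaussian_weight(math.dist(u, p), adaptive_h) for p in points]
```

With a fixed bandwidth, sparse regions may contribute very few effective neighbors; the adaptive rule widens the kernel there so that each local model sees a comparable number of points.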

Problem
 Estimating a local model
 Bandwidth selection
 Model evaluation
Choose kernel function
$\hat{\beta}(u) = (X^T W(u) X)^{-1} X^T W(u) Y$
Source: http://rose.bris.ac.uk
$O(n^3)$
Which bandwidth is good?
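One common way to answer "which bandwidth is good?" is leave-one-out cross-validation: for each candidate bandwidth, every point is predicted from a local model fitted without it, and the bandwidth with the smallest squared error is kept. The slides do not say which criterion the authors use, so the sketch below is an assumption, as are the toy data, the Gaussian kernel form, and the names `fit_line` and `cv_score`.

```python
import math


def fit_line(xs, ys, ws):
    """Closed-form weighted least squares for y = b0 + b1 * x."""
    sw = sum(ws)
    mx = sum(w * x for w, x in zip(ws, xs)) / sw
    my = sum(w * y for w, y in zip(ws, ys)) / sw
    sxx = sum(w * (x - mx) ** 2 for w, x in zip(ws, xs))
    sxy = sum(w * (x - mx) * (y - my) for w, x, y in zip(ws, xs, ys))
    b1 = sxy / sxx
    return my - b1 * mx, b1


def cv_score(coords, xs, ys, h):
    """Leave-one-out sum of squared prediction errors for bandwidth h."""
    total = 0.0
    for i in range(len(ys)):
        idx = [j for j in range(len(ys)) if j != i]  # hold out point i
        ws = [math.exp(-0.5 * (math.dist(coords[i], coords[j]) / h) ** 2)
              for j in idx]
        b0, b1 = fit_line([xs[j] for j in idx], [ys[j] for j in idx], ws)
        total += (ys[i] - (b0 + b1 * xs[i])) ** 2
    return total


# Hypothetical data on a line: the slope of y against x drifts with location.
coords = [(float(i), 0.0) for i in range(8)]
xs = [1.0, 2.0, 0.5, 3.0, 1.5, 2.5, 0.0, 3.5]
ys = [(1.0 + 0.5 * i) * x for i, x in enumerate(xs)]

candidates = [1.0, 2.0, 5.0]
scores = {h: cv_score(coords, xs, ys, h) for h in candidates}
best_h = min(scores, key=scores.get)
```

Each score requires one local fit per data point, which is part of why bandwidth selection becomes expensive on large datasets.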

Problem
 How can the model be applied to large-scale data?
 Data points
 Features
 Regression points

Large-Scale GWR on Spark
 Why Spark?
 In-memory cluster-computing platform
 Support parallel programming
 Develop applications by high-level APIs
 Provides resilient distributed datasets and parallel
operations
 Integration with other components on Spark

Large-Scale GWR on Spark
 We propose three approaches to scaling GWR
 Scaling Weighted Linear Regression
 Parallel Multiple WLR models
 Parallel Geographically Weighted Regression (combines
the first two approaches)

Scalable GWR on Spark
 Naïve approach – Scaling Weighted Linear Regression
Foreach regPoint:
  Compute weight (in parallel)
  Fit Weighted Linear Regression (WLR model computed in parallel)
Summary model
 Naïve approach

[Figure: Regression dataset and Training dataset → Compute weight → Compute parallel multiple WLR models → Summary]
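Computing multiple WLR models in parallel can be pictured as mapping a "fit one local model" function over the regression points, with every worker holding the full training data. This is a minimal sketch under that reading of the figure: the thread pool stands in for Spark executors, and the data, bandwidth, kernel form, and the names `fit_wlr` and `fit_at` are hypothetical.

```python
import math
from concurrent.futures import ThreadPoolExecutor


def fit_wlr(xs, ys, ws):
    """Closed-form weighted least squares for y = b0 + b1 * x."""
    sw = sum(ws)
    mx = sum(w * x for w, x in zip(ws, xs)) / sw
    my = sum(w * y for w, y in zip(ws, ys)) / sw
    sxx = sum(w * (x - mx) ** 2 for w, x in zip(ws, xs))
    sxy = sum(w * (x - mx) * (y - my) for w, x, y in zip(ws, xs, ys))
    b1 = sxy / sxx
    return (my - b1 * mx, b1)


# Hypothetical rows: (coordinate, feature, label) with label = 2 + 3 * feature.
train = [((float(i), 0.0), float(i % 4), 2.0 + 3.0 * (i % 4)) for i in range(12)]
reg_points = [(1.0, 0.0), (5.0, 0.0), (9.0, 0.0)]
bandwidth = 3.0  # assumed fixed bandwidth


def fit_at(reg_point):
    """Compute weights for one regression point and fit its WLR model."""
    ws = [math.exp(-0.5 * (math.dist(reg_point, coord) / bandwidth) ** 2)
          for coord, _, _ in train]
    return fit_wlr([x for _, x, _ in train], [y for _, _, y in train], ws)


# Each regression point is an independent task, so the models fit in parallel.
with ThreadPoolExecutor(max_workers=4) as pool:
    models = list(pool.map(fit_at, reg_points))

summary = dict(zip(reg_points, models))
```

This layout parallelizes well over many regression points, but it assumes the training data fits on every worker.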

 Parallel Geographically Weighted Regression
[Figure: Regression dataset (R) and Training dataset (T) → Combine dataset of (R, T) pairs → Distributed GWR Computation]
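One way to read the combined (R, T) dataset is as a Cartesian product followed by a per-R aggregation. This works because $X^T W(u) X = \sum_j w_j(u)\, x_j x_j^T$ and $X^T W(u) Y = \sum_j w_j(u)\, x_j y_j$ are plain sums over training points, so they can be accumulated pair by pair in any order. The sketch below illustrates that decomposition for one feature plus an intercept; it is an assumption about how such a computation could be organized, not the authors' algorithm, and all names and data are hypothetical.

```python
import math
from collections import defaultdict
from itertools import product

# Hypothetical rows: (coordinate, feature, label) with label = 2 + 3 * feature.
train = [((float(i), 0.0), float(i % 4), 2.0 + 3.0 * (i % 4)) for i in range(12)]
reg_points = [(1.0, 0.0), (5.0, 0.0), (9.0, 0.0)]
bandwidth = 3.0  # assumed fixed bandwidth

# "Combine dataset": every (R, T) pair.
pairs = product(reg_points, train)

# Per regression point, accumulate the sums w, w*x, w*y, w*x*x, w*x*y.
# These are the entries of X^T W X and X^T W y for an intercept plus one feature.
stats = defaultdict(lambda: [0.0] * 5)
for r, (coord, x, y) in pairs:
    w = math.exp(-0.5 * (math.dist(r, coord) / bandwidth) ** 2)
    for k, term in enumerate((w, w * x, w * y, w * x * x, w * x * y)):
        stats[r][k] += term


def solve_2x2(sw, sx, sy, sxx, sxy):
    """Solve the 2x2 normal equations for (intercept, slope)."""
    det = sw * sxx - sx * sx
    b1 = (sw * sxy - sx * sy) / det
    b0 = (sy - b1 * sx) / sw
    return (b0, b1)


models = {r: solve_2x2(*s) for r, s in stats.items()}
```

On a cluster the accumulation loop would correspond to a keyed reduction, so neither dataset has to fit on a single machine; the price is that the number of pairs grows with the product of the two dataset sizes.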

 Parallel Geographically Weighted Regression

Experiments
 Environment
 Cluster: 8 nodes on Amazon Web Services
 4 cores, Intel Xeon E5-2670 v2 2.5 GHz
 16 GB RAM, 2x40 GB SSD
 Hadoop 2.7.2 and Spark 1.6.1
 Dataset
 |-- x: double (nullable = false)
 |-- y: double (nullable = false)
 |-- label: double (nullable = false)
 |-- features: vector (nullable = false)

Experiments
 Testing large training dataset
[Figure: running time in seconds of Algorithms 1 to 4 for 10000, 100000, 1000000, 2000000, and 5000000 training points]

Experiments
 Testing large regression dataset
[Figure: running time in seconds of Algorithms 1 to 4 for 1000, 5000, 10000, 20000, and 50000 regression points]

Experiments
 Testing large dataset with increasing number of
features
[Figure: running time in seconds of Algorithms 1 to 4 for 10, 20, 50, 100, and 200 features]

Experiments
 Cluster
[Figure: running time in seconds of Algorithms 1 to 4 on 2-node, 4-node, and 8-node clusters]

Discussion
 Related work
 Many GWR libraries run on a single local machine
 spgwr (multiR on GRID)
 Using GPU
 Our work
 First study of distributed GWR on Spark
 Easy deployment, with the advantages of Spark
 Scalable and works well on a cluster

Conclusion
 We have
 Proposed three approaches
 Implemented four algorithms based on Spark
 Evaluated our implementation
 Future work
 Improve performance by using Pipeline and Partitions
 Release as an open-source library
