Hung Tien Tran, Hiep Tuan Nguyen, Viet-Trung Tran
Hanoi University of Science and Technology
Introduction
 What is Geographically Weighted Regression?
 What is our work?
Source: http://desktop.arcgis.com
GWR + Spark =
- Large-scale spatial data
- Improved performance
- Distributed
Outline
 Background
 Problem
 Scalable GWR on Spark
 Experiments
 Discussion
 Conclusion
Background
 First Law of Geography - Waldo Tobler:
“Everything is related to everything else, but near things are more related than distant things.”
 Model GWR
$y_i(u) = \beta_{0i}(u) + \beta_{1i}(u)\,x_{1i} + \beta_{2i}(u)\,x_{2i} + \dots + \beta_{mi}(u)\,x_{mi}$
 The weighted least squares estimator takes the form
$\hat{\beta}(u) = (X^T W(u) X)^{-1} X^T W(u) Y$
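The estimator can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: the helper names `solve` and `local_coefficients` are hypothetical, W(u) is assumed to be diagonal and is passed as a list `w`, and every row of `X` is assumed to start with 1 for the intercept.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("X^T W X is singular")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def local_coefficients(X, y, w):
    """Return beta_hat(u) = (X^T W X)^{-1} X^T W y for diagonal weights w."""
    n, p = len(X), len(X[0])
    xtwx = [[sum(w[i] * X[i][a] * X[i][b] for i in range(n)) for b in range(p)]
            for a in range(p)]
    xtwy = [sum(w[i] * X[i][a] * y[i] for i in range(n)) for a in range(p)]
    # Solving the normal equations avoids forming the inverse explicitly.
    return solve(xtwx, xtwy)
```

When the observations lie exactly on a line, any set of positive weights recovers that line, which makes a convenient sanity check.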
Background
 Kernel function
 Gaussian function
 Bandwidth
[Figure: fixed bandwidth versus adaptive bandwidth]
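A Gaussian kernel with the two bandwidth styles might look like the sketch below. The slides do not state the exact kernel formula, so the common form exp(-0.5 (d/h)^2) is an assumption here, as are the function names and the choice of "distance to the k-th nearest point" as the adaptive bandwidth.

```python
import math


def gaussian_weight(distance, bandwidth):
    """Assumed Gaussian kernel: the weight decays smoothly with distance."""
    if bandwidth <= 0:
        raise ValueError("bandwidth must be positive")
    return math.exp(-0.5 * (distance / bandwidth) ** 2)


def fixed_weights(reg_point, points, bandwidth):
    """Fixed bandwidth: the same bandwidth is used at every regression point."""
    return [gaussian_weight(math.dist(reg_point, p), bandwidth) for p in points]


def adaptive_weights(reg_point, points, k):
    """Adaptive bandwidth: use the distance to the k-th nearest data point."""
    dists = [math.dist(reg_point, p) for p in points]
    bandwidth = sorted(dists)[k - 1]
    return [gaussian_weight(d, bandwidth) for d in dists]
```

With an adaptive bandwidth the kernel widens in sparse regions and narrows in dense ones, so each local model sees a comparable number of influential points.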
Problem
 Estimating a local model
 Bandwidth selection
 Model evaluation
Choose kernel function
$\hat{\beta}(u) = (X^T W(u) X)^{-1} X^T W(u) Y$
Source: http://rose.bris.ac.uk
$O(n^3)$
Which bandwidth is good?
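One common way to answer the bandwidth question is leave-one-out cross-validation; the slides do not say which criterion the authors use, so this is only an illustrative sketch. It assumes a single feature plus intercept, a Gaussian kernel, and rows of the hypothetical form `(u, v, x, y)` where `(u, v)` is the location.

```python
import math


def loo_prediction(i, rows, h):
    """Predict y for row i from a local model fitted without row i."""
    ui, vi, xi, _ = rows[i]
    sw = swx = swy = swxx = swxy = 0.0
    for j, (u, v, x, y) in enumerate(rows):
        if j == i:
            continue  # leave the target observation out
        w = math.exp(-0.5 * (math.dist((ui, vi), (u, v)) / h) ** 2)
        sw += w
        swx += w * x
        swy += w * y
        swxx += w * x * x
        swxy += w * x * y
    # Closed-form weighted least squares for intercept + one slope.
    slope = (sw * swxy - swx * swy) / (sw * swxx - swx ** 2)
    intercept = (swy - slope * swx) / sw
    return intercept + slope * xi


def cv_score(rows, h):
    """Sum of squared leave-one-out prediction errors for bandwidth h."""
    return sum((rows[i][3] - loo_prediction(i, rows, h)) ** 2
               for i in range(len(rows)))


def select_bandwidth(rows, candidates):
    """Pick the candidate bandwidth with the lowest cross-validation score."""
    return min(candidates, key=lambda h: cv_score(rows, h))
```

Each score evaluation fits one local model per observation, which is why bandwidth selection dominates the cost on large datasets.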
Problem
 How to apply the model for large-scale data?
 Data points
 Features
 Regression points
Large-Scale GWR on Spark
 Why Spark?
 In-memory cluster-computing platform
 Supports parallel programming
 Applications are developed with high-level APIs
 Provides resilient distributed datasets (RDDs) and parallel operations
 Integrates with other Spark components
Large-Scale GWR on Spark
 We propose three approaches to scaling GWR
 Scaling Weighted Linear Regression (WLR)
 Parallel Multiple WLR models
 Parallel Geographically Weighted Regression (combines the first two approaches)
Scalable GWR on Spark
 Naïve approach – Scaling Weighted Linear Regression
[Diagram: for each regression point (regPoint), compute the weights in parallel, fit a Weighted Linear Regression (WLR) model in parallel, then summarize the model]
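The naïve approach can be emulated without Spark to show why a single WLR fit parallelizes: the weighted sums behind the normal equations are additive across data partitions. The sketch below is an assumption-laden illustration, not the authors' Spark code. It uses one feature plus intercept, a Gaussian kernel, rows of the hypothetical form `(u, v, x, y)`, and plain lists standing in for partitions.

```python
import math
from functools import reduce


def partial_stats(partition, reg_point, h):
    """Map step: weighted sufficient statistics for one partition."""
    s = [0.0] * 5  # sums of w, w*x, w*y, w*x*x, w*x*y
    for u, v, x, y in partition:
        w = math.exp(-0.5 * (math.dist(reg_point, (u, v)) / h) ** 2)
        for i, term in enumerate((w, w * x, w * y, w * x * x, w * x * y)):
            s[i] += term
    return s


def merge(a, b):
    """Reduce step: statistics from two partitions simply add up."""
    return [p + q for p, q in zip(a, b)]


def solve_wlr(stats):
    """Closed-form weighted least squares for intercept + one slope."""
    sw, swx, swy, swxx, swxy = stats
    slope = (sw * swxy - swx * swy) / (sw * swxx - swx ** 2)
    return (swy - slope * swx) / sw, slope


def naive_gwr(partitions, reg_points, h):
    """Loop over regression points sequentially; each fit is data-parallel."""
    models = {}
    for rp in reg_points:
        stats = reduce(merge, (partial_stats(p, rp, h) for p in partitions))
        models[rp] = solve_wlr(stats)
    return models
```

The outer loop stays sequential, so the number of distributed jobs grows with the number of regression points.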
Scalable GWR on Spark
 Parallel Multiple WLR models
[Diagram: weights are computed between the regression dataset and the training dataset, multiple WLR models are fitted in parallel, and the results are summarized]
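A rough stand-in for this second approach distributes the regression points instead of the training rows: every worker holds the full training set and fits complete local models on its own. The sketch assumes the training data fits in each worker's memory, uses a thread pool in place of a Spark cluster, and keeps the single-feature, Gaussian-kernel simplification; all names are hypothetical.

```python
import math
from concurrent.futures import ThreadPoolExecutor


def fit_local(reg_point, training, h):
    """Fit one complete WLR model at reg_point from rows (u, v, x, y)."""
    sw = swx = swy = swxx = swxy = 0.0
    for u, v, x, y in training:
        w = math.exp(-0.5 * (math.dist(reg_point, (u, v)) / h) ** 2)
        sw += w
        swx += w * x
        swy += w * y
        swxx += w * x * x
        swxy += w * x * y
    slope = (sw * swxy - swx * swy) / (sw * swxx - swx ** 2)
    intercept = (swy - slope * swx) / sw
    return reg_point, (intercept, slope)


def parallel_models(reg_points, training, h, workers=4):
    """Fit the local models concurrently, one task per regression point."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(lambda rp: fit_local(rp, training, h), reg_points))
```

This trades memory for fewer jobs: it works while the training set is small enough to replicate, which is the limitation the combined approach is meant to lift.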
Scalable GWR on Spark
 Parallel Geographically Weighted Regression
[Diagram: regression points (R) and training points (T) from the regression dataset and the training dataset are paired into a combined dataset, which feeds the distributed GWR computation]
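One possible reading of the combined dataset is a Cartesian pairing of regression points with training rows, followed by a reduce keyed on the regression point. The sketch below emulates that idea in plain Python; it is an interpretation of the diagram rather than the authors' algorithm, and it keeps the single-feature, Gaussian-kernel assumptions with hypothetical names throughout.

```python
import math
from collections import defaultdict
from itertools import product


def pair_stats(reg_point, row, h):
    """Contribution of one (R, T) pair to the weighted sufficient statistics."""
    u, v, x, y = row
    w = math.exp(-0.5 * (math.dist(reg_point, (u, v)) / h) ** 2)
    return (w, w * x, w * y, w * x * x, w * x * y)


def distributed_gwr(reg_points, training, h):
    """Pair every R with every T, then reduce by regression point."""
    totals = defaultdict(lambda: [0.0] * 5)
    # Each pair is independent, so a cluster could process pairs anywhere.
    for rp, row in product(reg_points, training):
        for i, term in enumerate(pair_stats(rp, row, h)):
            totals[rp][i] += term  # reduce-by-key on the regression point
    models = {}
    for rp, (sw, swx, swy, swxx, swxy) in totals.items():
        slope = (sw * swxy - swx * swy) / (sw * swxx - swx ** 2)
        models[rp] = ((swy - slope * swx) / sw, slope)
    return models
```

Because neither dataset has to fit on a single worker, both the number of training points and the number of regression points can grow, at the cost of materializing many pairs.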
Experiments
 Environment
 Cluster: 8 nodes on Amazon Web Services
 4 cores, Intel Xeon E5-2670 v2, 2.5 GHz
 16 GB RAM, 2x40 GB SSD
 Hadoop 2.7.2 and Spark 1.6.1
 Dataset
 |-- x: double (nullable = false)
 |-- y: double (nullable = false)
 |-- label: double (nullable = false)
 |-- features: vector (nullable = false)
Experiments
 Testing large training dataset
[Figure: running time (sec) versus number of training points (10,000; 100,000; 1,000,000; 2,000,000; 5,000,000) for Algorithms 1 to 4]
Experiments
 Testing large regression dataset
[Figure: running time (sec) versus number of regression points (1,000; 5,000; 10,000; 20,000; 50,000) for Algorithms 1 to 4]
Experiments
 Testing large dataset with increasing number of
features
[Figure: running time (sec) for Algorithms 1 to 4 at 10, 20, 50, 100 and 200 features; the x-axis on the slide is labeled "Number of regression points"]
Experiments
 Cluster
[Figure: running time (sec) versus number of nodes (2, 4, 8) for Algorithms 1 to 4]
Discussion
 Related work
 Many GWR libraries run on a single machine
 Spgwr (multiR on GRID)
 Using GPU
 Our work
 First study of distributed GWR on Spark
 Easy deployment, with the advantages of Spark
 Scalable and works well on a cluster
Conclusion
 We have
 Proposed three approaches
 Implemented four algorithms based on Spark
 Evaluated our implementation
 Future work
 Improve performance by using Pipeline and Partitions
 Release as open-source library
Large-Scale Geographically Weighted Regression on Spark


Editor's Notes

  • #9 Scalability, performance, user-friendly APIs