Quang-Huy Tran
Network Science Lab
Dept. of Artificial Intelligence
The Catholic University of Korea
E-mail: huytran1126@gmail.com
2024-04-22
Taming Local Effects in Graph-based
Spatiotemporal Forecasting
Andrea Cini et al.
NeurIPS 2023: Thirty-seventh Conference on Neural Information Processing Systems
2
OUTLINE
• MOTIVATION
• INTRODUCTION
• PROBLEM FORMULATION
• METHODOLOGY
• EXPERIMENT & RESULT
• CONCLUSION
3
MOTIVATION
• Graph-based methods are effective in forecasting collections of time series.
• Deep learning methods combine sequence-processing operators with message
passing.
o A single (global) inductive model is trained to predict any time series associated with a node.
o It is common practice to use node-specific (local) parameters to aid node identification.
 Improves modeling of local effects and accuracy.
 Compromises the model's inductive capabilities.
[Figure: overview graph]
4
INTRODUCTION
• Understanding the interplay between globality and locality in graph-based models.
• A methodological framework for designing node-specific components effectively.
• Methods to ease transferability of the global components to new nodes/graphs.
5
PROBLEM FORMULATION
• We consider a set of 𝑁 correlated time series, where the 𝑖-th time series is associated with:
o an observation vector $x_t^i \in \mathbb{R}^{d_x}$ at each time step $t$;
o a vector of exogenous variables $u_t^i \in \mathbb{R}^{d_u}$ at each time step $t$.
[Figure: collections of time series]
6
PROBLEM FORMULATION
Relational Information
• Assume the existence of functional dependencies between time
series:
o i.e. forecasts for one time series can be improved by
accounting for the past values of other time series.
• Pairwise relationships existing at time step 𝑡 are modeled with an
adjacency matrix $A \in \mathbb{R}^{N \times N}$.
o $A$ can be asymmetric (directed graph).
• The dimension spanning the time series collection is called the
spatial dimension.
7
PROBLEM FORMULATION
Time Series Forecasting
• Multi-step time series forecasting (a windowing sketch follows below):
o Given a window of the past 𝑊 observations, the goal is to predict the next 𝐻 observations:
$$\hat{X}_{t:t+H} = f(X_{t-W:t})$$
• The deep learning approach to forecasting consists of training:
o a global neural network (NN), and/or
o local, node-specific NNs.
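A minimal sketch of this sliding-window setup (shapes and names are illustrative, not from the paper's codebase): build pairs of input windows $X_{t-W:t}$ and target horizons $X_{t:t+H}$ from a raw series.

```python
import torch

# Illustrative sizes: T time steps, N series (nodes), d_x channels each.
T, N, d_x = 100, 20, 1
W, H = 12, 3                          # input window W and horizon H
series = torch.randn(T, N, d_x)       # stand-in for the observed collection

windows, targets = [], []
for t in range(W, T - H + 1):
    windows.append(series[t - W:t])   # X_{t-W:t}, shape (W, N, d_x)
    targets.append(series[t:t + H])   # X_{t:t+H}, shape (H, N, d_x)
inputs = torch.stack(windows)         # (num_samples, W, N, d_x)
labels = torch.stack(targets)         # (num_samples, H, N, d_x)
```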
8
METHODOLOGY
Global and Local Forecasting
• Global forecasting model: its parameters are fitted to a whole group of time series.
o All learnable parameters are shared.
o More data are available for training.
o Can be used in inductive settings.
• Local forecasting model: specific to a single time series.
o Better captures series-specific dynamics.
o Often needs only a shorter input window
o or reduced model capacity.
9
METHODOLOGY
Relational inductive biases
• Both approaches share a drawback: dependencies across the time series are discarded.
Message passing for spatiotemporal GNNs: Spatiotemporal
Message Passing (STMP)
• Embed relational information into the
processing as an architectural bias.
• Graph Neural Networks (GNNs) provide
the appropriate neural operators.
10
METHODOLOGY
Architecture
• A sequence of three operators (sketched below):
o Encoding layer: an MLP.
o A stack of 𝐿 STMP layers.
o Readout layer: an MLP.
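A minimal sketch of this three-stage template in PyTorch, assuming a time-then-space design (a shared GRU per node, then isotropic propagation with a dense, row-normalized adjacency). Names and layer choices are illustrative, not the authors' implementation.

```python
import torch
import torch.nn as nn

class STGNN(nn.Module):
    """Encoder MLP -> temporal encoding -> L message-passing steps -> readout MLP."""
    def __init__(self, d_in, d_hidden, horizon, n_layers=2):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(d_in, d_hidden), nn.ReLU())
        self.temporal = nn.GRU(d_hidden, d_hidden, batch_first=True)
        self.mp = nn.ModuleList([nn.Linear(d_hidden, d_hidden) for _ in range(n_layers)])
        self.readout = nn.Sequential(nn.Linear(d_hidden, d_hidden), nn.ReLU(),
                                     nn.Linear(d_hidden, horizon))

    def forward(self, x, adj):
        # x: (batch, W, N, d_in); adj: (N, N) row-normalized adjacency.
        b, w, n, _ = x.shape
        h = self.encoder(x)                               # (b, W, N, d)
        h = h.permute(0, 2, 1, 3).reshape(b * n, w, -1)   # one sequence per node
        _, h = self.temporal(h)                           # keep last hidden state
        h = h.squeeze(0).reshape(b, n, -1)                # (b, N, d)
        for layer in self.mp:                             # isotropic MP: adj @ h,
            h = torch.relu(layer(adj @ h)) + h            # with a residual connection
        return self.readout(h)                            # (b, N, horizon)
```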
11
METHODOLOGY
Spatiotemporal message passing - Globality
• The cornerstone operator in STGNNs is the STMP layer:
$$h_t^{i,\,l+1} = \rho^l\Big(h_{\le t}^{i,\,l},\ \mathrm{Aggr}_{j \in \mathcal{N}(i)}\,\gamma^l\big(h_{\le t}^{i,\,l},\ h_{\le t}^{j,\,l}\big)\Big)$$
where $\rho^l$ is the update function and $\gamma^l$ the message function of layer $l$.
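To make $\rho$ and $\gamma$ concrete, here is a dense, single-time-step message-passing layer; it sketches the generic operator, not the paper's exact STMP, and the convention that `adj[i, j]` weighs the message from node j to node i is an assumption.

```python
import torch
import torch.nn as nn

class MessagePassingLayer(nn.Module):
    def __init__(self, d):
        super().__init__()
        self.gamma = nn.Sequential(nn.Linear(2 * d, d), nn.ReLU())  # message fn
        self.rho = nn.Sequential(nn.Linear(2 * d, d), nn.ReLU())    # update fn

    def forward(self, h, adj):
        # h: (N, d) node states at one time step; adj: (N, N) edge weights,
        # adj[i, j] weighing the message from j to i (assumed convention).
        n = h.size(0)
        h_i = h.unsqueeze(1).expand(-1, n, -1)             # receiver state, (N, N, d)
        h_j = h.unsqueeze(0).expand(n, -1, -1)             # sender state,   (N, N, d)
        msgs = self.gamma(torch.cat([h_i, h_j], dim=-1))   # gamma(h_i, h_j)
        aggr = (adj.unsqueeze(-1) * msgs).sum(dim=1)       # weighted sum over j
        return self.rho(torch.cat([h, aggr], dim=-1))      # rho(h_i, aggregated msgs)
```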
12
METHODOLOGY
Globality and Locality in STGNNs
• Limitations of the fully global model:
o Struggles to model local effects.
o Requires large model capacity or impractically long windows.
• Hybrid global-local STGNNs with specialized local components:
o Capture node-level effects more efficiently than fully global models.
o Usually achieve higher forecasting accuracy empirically.
13
METHODOLOGY
Limits of global-local STGNNs
• Disadvantages of local components in a global STGNN:
o The model's inductive capabilities are compromised (hard to handle unseen time series).
o The number of learnable parameters can be much larger than in a fully global model.
14
METHODOLOGY
Learnable node embeddings
• Mitigate these drawbacks by using node embeddings, a table of learnable parameters
with one vector associated with each node (sketch below):
o Fed into the global STGNN and learned end-to-end.
• Amortize the cost of specializing the model to each time series:
o A single vector per node is added to the model's parameters.
o The same vector can be reused in multiple components of the architecture.
• Transfer the learned model to a different set of 𝑁′ time series more easily:
o Only $N' d_v$ parameters need to be tuned, while the shared components stay fixed.
o The embedding space can be regularized to better fit the embeddings of new nodes.
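A minimal sketch of the embedding table and the transfer recipe; `model`, `N_new`, and all sizes are hypothetical stand-ins.

```python
import torch
import torch.nn as nn

N, d_v, d_f = 207, 32, 64                 # nodes, embedding size, feature size
emb = nn.Embedding(N, d_v)                # one learnable vector per node

# The same table can condition several components; e.g., concatenate the
# embeddings with node features before the encoder:
x = torch.randn(8, N, d_f)                             # (batch, N, d_f) features
v = emb.weight.unsqueeze(0).expand(x.size(0), -1, -1)  # (batch, N, d_v)
h = torch.cat([x, v], dim=-1)                          # (batch, N, d_f + d_v)

# Transfer to N' new series: freeze the shared weights and fit only the
# N' * d_v new embedding parameters (model / N_new are hypothetical names).
# for p in model.parameters():
#     p.requires_grad_(False)
# new_emb = nn.Embedding(N_new, d_v)      # the only parameters left to tune
```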
15
METHODOLOGY
Structuring the Embedding space
• Two strategies for regularizing the embedding space:
o Variational: a smoother embedding space that enables interpolation.
 Model each node embedding as a sample from a multivariate Gaussian and
regularize it toward a prior:
$$v_i \sim \mathcal{N}(\mu_i, \sigma_i^2 I), \qquad \mathcal{L}_{\mathrm{reg}} = \beta \sum_i D_{\mathrm{KL}}\big(\mathcal{N}(\mu_i, \sigma_i^2 I) \,\|\, P\big)$$
where $(\mu_i, \sigma_i)$ are learnable (local) parameters, $P = \mathcal{N}(0, I)$ is the prior,
$D_{\mathrm{KL}}$ is the Kullback–Leibler divergence, and $\beta$ controls the regularization strength.
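A minimal sketch of the variational strategy under the formulation above: embeddings are sampled with the reparametrization trick, and a closed-form KL term against the standard-normal prior is added to the training loss.

```python
import torch
import torch.nn as nn

class VariationalEmbedding(nn.Module):
    """Node embeddings sampled from N(mu_i, diag(sigma_i^2))."""
    def __init__(self, n_nodes, d_v):
        super().__init__()
        self.mu = nn.Parameter(torch.zeros(n_nodes, d_v))         # local params
        self.log_sigma = nn.Parameter(torch.zeros(n_nodes, d_v))  # local params

    def forward(self):
        eps = torch.randn_like(self.mu)               # reparametrization trick
        return self.mu + self.log_sigma.exp() * eps   # sampled embeddings, (N, d_v)

    def kl_divergence(self):
        # Closed-form KL( N(mu, sigma^2 I) || N(0, I) ), summed over all nodes.
        s2 = (2 * self.log_sigma).exp()
        return 0.5 * (s2 + self.mu ** 2 - 1.0 - 2 * self.log_sigma).sum()

# Training objective (beta controls the regularization strength):
# loss = forecasting_loss + beta * embeddings.kl_divergence()
```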
16
METHODOLOGY
Structuring the Embedding space
• Two strategies for regularizing the embedding space:
o Clustering: form clusters in the latent space to improve interpretability.
 Add a matrix $C \in \mathbb{R}^{K \times d_v}$ of $K \ll N$ learnable centroids and a cluster
assignment matrix $S \in \mathbb{R}^{N \times K}$ containing node-cluster pair scores,
where $\tau$ is a hyperparameter (see the sketch below).
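A sketch of one plausible instantiation of the clustering strategy: soft assignment scores from a temperature-scaled softmax over distances to the centroids, plus a loss that pulls embeddings toward their assigned centroids. The exact scoring and loss here are assumptions, not necessarily the paper's formulation.

```python
import torch
import torch.nn.functional as F

N, K, d_v, tau = 207, 10, 32, 0.5
emb = torch.nn.Parameter(torch.randn(N, d_v))        # node embeddings
centroids = torch.nn.Parameter(torch.randn(K, d_v))  # C: K << N learnable centroids

def clustering_reg(emb, centroids, tau):
    d2 = torch.cdist(emb, centroids) ** 2   # (N, K) squared distances
    S = F.softmax(-d2 / tau, dim=-1)        # soft node-cluster assignment scores
    return (S * d2).sum(dim=-1).mean()      # pull nodes toward their centroids
```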
17
EXPERIMENT AND RESULT
EXPERIMENT
• Measurement:
o Mean Absolute Error (MAE).
• Datasets:
o GPVAR(-L): synthetic data generated by a graph polynomial VAR process over a
network of 20 communities; the -L variant adds local, node-specific effects.
o Traffic flow forecasting: METR-LA and PEMS-BAY. For transfer learning, the PEMS03,
PEMS04, PEMS07, and PEMS08 datasets are used.
o Electric load forecasting: the CER-E
dataset, a collection of energy
consumption time series.
o Air quality monitoring: the AQI dataset,
hourly measurements of the
pollutant PM2.5 in China.
18
• Baseline:
o RNN: global univariate RNN sharing the same parameters across the time series.
o FC-RNN: a multivariate RNN that takes the whole collection as input, as if it were a single multivariate time series.
o LocalRNNs: local univariate RNNs with different sets of parameters for each time series.
o DCRNN[1]: recurrent T&S model with the Diffusion Convolutional operator.
o AGCRN[2]: T&S global-local Adaptive Graph Convolutional Recurrent Network.
o GraphWaveNet[3]: deep T&S spatiotemporal convolutional network.
EXPERIMENT AND RESULT
EXPERIMENT
[1] Li, Y., Yu, R., Shahabi, C., & Liu, Y. (2017). Diffusion convolutional recurrent neural network: Data-driven traffic forecasting. arXiv preprint arXiv:1707.01926.
[2] Bai, L., Yao, L., Li, C., Wang, X., & Wang, C. (2020). Adaptive graph convolutional recurrent network for traffic forecasting. Advances in neural information processing systems, 33, 17804-17815.
[3] Wu, Z., Pan, S., Long, G., Jiang, J., & Zhang, C. (2019). Graph wavenet for deep spatial-temporal graph modeling. arXiv preprint arXiv:1906.00121.
19
EXPERIMENT AND RESULT
RESULT
20
EXPERIMENT AND RESULT
RESULT – Visualization Analysis
21
EXPERIMENT AND RESULT
RESULT – Additional Experimental Results
• Analysis of local components:
o performance of the reference architectures with and without local components.
22
EXPERIMENT AND RESULT
RESULT – Additional Experimental Results
• Analysis of transfer learning.
23
EXPERIMENT AND RESULT
RESULT – Additional Experimental Results
• Analysis of transfer learning.
24
CONCLUSION
• Investigate the impact of locality and globality in graph-based spatiotemporal
forecasting architectures.
• Propose a framework to explain the empirical results associated with the use of trainable
node embeddings:
o discuss different architectures and regularization techniques to account for local effects.
• The proposed methodologies are thoroughly validated empirically:
o they are effective also in a transfer learning context.
• Future work can build on the results presented here and study alternative, even more
transferable, methods to account for local effects.
Editor's Notes

  • #20 TTS-IMP: isotropic MP operator (Eq. 5). TTS-AMP: anisotropic MP operator (Eq. 6–7).
  • #21 1a: results of the analysis, reporting the median load profile for each cluster; shaded areas correspond to quantiles in 10% increments.
  • #22 TTS-IMP and TTS-AMP, together with FC-RNN (a multivariate RNN) and LocalRNNs (local univariate RNNs with a different set of parameters for each time series). STGNNs: a global variant (without any local component) and global-local alternatives that insert node-specific components into the architecture.
  • #23 The table shows results for the reference architectures w.r.t. different training-set sizes (from 1 day to 2 weeks) and considers settings where embeddings are fed to both encoder and decoder, or to the decoder only.
  • #24 Same layout as #23: results for the reference architectures w.r.t. different training-set sizes (from 1 day to 2 weeks), with embeddings fed to both encoder and decoder, or to the decoder only.