Multi-Step-Ahead Simultaneous Forecasting for Multiple Time-Series, Using Truncated Singular Value Decomposition (SVD)
Florian Cartuta
Bucharest, Romania
E-mail: floriancartuta2@yahoo.com
Abstract
Purpose: Time series forecasting remains a challenging task across many application fields despite extensive work done in this domain [6-7]. The purpose of this paper is to propose a scalable and efficient method which simplifies the simultaneous multi-step-ahead forecasting of a large number of time-series, improving both the efficiency and the accuracy of forecasting over medium/long-term horizons performed in one go. The method is also exemplified on a store-item forecasting application in the retail domain.
The proposed method uses Truncated Singular Value Decomposition at its core to extract the dominant correlations of multiple time-series stored in a matrix. It is shown that a very small number of extracted components (sometimes even as low as one or two right singular vectors) might be sufficient to simultaneously forecast hundreds or more time-series through their dominant correlations. After the main components are extracted, the forecast is made only on the truncated right singular vectors matrix, which encodes the time-bound evolution of the underlying structure of the data, using a standard stochastic time-series forecasting method such as Holt-Winters Triple Exponential Smoothing. In a subsequent step, the original matrix is recomposed; the recomposed matrix contains both the reconstructed history approximation and the predicted values for each original time-series. As such, by modeling only a few dominant correlations of the entire set, forecasts can be generated simultaneously for a very large number of time-series.
Benefits: The method is scalable, accurate, more time-efficient than individual time-series forecasting, and can be used to forecast a very large number of time-series simultaneously.
Keywords: Singular Value Decomposition, Multiple Time-Series, Simultaneous Forecasting, Multi-Step-Ahead Forecasting
1. Introduction
The method proposed in this paper seeks to improve the efficiency and accuracy of multi-step-ahead forecasting over medium/long-term forecast horizons, performed simultaneously in one go for a large number of time-series. The example given is a store-item retail forecasting application, but the method is fairly broad and can potentially be applied to numerous other real-data applications where multi-step-ahead simultaneous forecasting of a large number of time-series is needed.
Benefits and challenges: The method introduced here can be used to forecast a very large number of time-series simultaneously. SVD has linear scalability with the number of rows and cubic scalability with the number of attributes when a full decomposition is computed; a low-rank decomposition is typically linear in both the number of rows and the number of columns, so SVD has a reasonable computing cost [8]. There are a number of benefits and challenges to forecasting multiple time-series in one go, especially for large sets in the order of thousands or tens of thousands, as is the case for store-item demand forecasting in the retail industry. Among the benefits is simplicity: avoiding having to prepare, train and maintain a separate model for each time-series. A challenge of simultaneous forecasting methods and models is the level of accuracy: it is difficult to achieve the same or superior accuracy when many time-series with possibly different behaviors are predicted with the same model, compared to predicting each time-series with its own model. This is usually the situation when the task is demand prediction at the store-item level of granularity. At such a low level of granularity, the product demand, which is estimated from daily sales, is prone to be influenced by perturbing factors such as out-of-stock items; such factors usually negatively influence demand forecast accuracy in the retail domain.
Singular Value Decomposition (SVD) [1] is one of the most important matrix factorization techniques. SVD is used to obtain a low-rank approximation of matrices. It is often the case that complex systems generate data that is naturally arranged in large matrices. For example, multiple time-series of store-item sales may be arranged in a matrix with each row containing the daily sales of one store-item and each column containing the sales of all items at a given date. Remarkably, the data are typically low rank, meaning that a few dominant patterns explain the high-dimensional data. The SVD is a numerically robust and efficient method of extracting these patterns from data [1].
Definition of the SVD [1]
We are interested in analyzing a large data set X ∈ R^(n×m). (eq. 1)
For example, in this paper X will consist of time-series data. The columns are often called snapshots.
The SVD is a unique matrix decomposition that exists for every complex-valued matrix X ∈ C^(n×m):
X = U Σ V^T (eq. 2)
where U ∈ R^(n×n) and V ∈ R^(m×m) are unitary matrices with orthonormal columns, and Σ ∈ R^(n×m) is a matrix with real, non-negative entries on the diagonal and zeros off the diagonal. As is the case in demand forecasting, we will only use real numbers.
Matrix Approximation
SVD provides a low-rank approximation to the matrix X. According to the Eckart-Young theorem [9], the optimal rank-r approximation to X, in a least-squares sense, is given by the rank-r SVD truncation:
X̃ = Ũ Σ̃ Ṽ^T (eq. 3)
Here, Ũ and Ṽ denote the first r leading columns of U and V, and Σ̃ contains the leading r × r sub-block of Σ.
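As an illustration of the theorem, the optimality property can be checked numerically; the snippet below is a minimal sketch using NumPy, where the matrix and the rank r are arbitrary example choices:

```python
import numpy as np

# Small example: verify that the rank-r SVD truncation attains the
# least-squares (Frobenius) error predicted by the Eckart-Young theorem.
rng = np.random.default_rng(0)
X = rng.standard_normal((8, 6))

U, s, Vt = np.linalg.svd(X, full_matrices=False)

r = 2
X_r = U[:, :r] @ np.diag(s[:r]) @ Vt[:r, :]   # rank-r truncation (eq. 3)

# The Frobenius error of the optimal rank-r approximation equals the
# root-sum-square of the discarded singular values.
err = np.linalg.norm(X - X_r, "fro")
expected = np.sqrt(np.sum(s[r:] ** 2))
assert np.isclose(err, expected)
```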
2. Methodology
In this section I describe the method for multi-step-ahead simultaneous forecasting of a large number of time-series arranged in a matrix: first the low-rank approximation matrix is extracted, then the multi-step-ahead forecast is generated for just a limited number (even as low as one or two) of main components of the right singular vectors matrix instead of all time-series. Finally, the original matrix is reconstructed; the reconstructed matrix will also contain the forecast values for the entire time-series set.
2.1 Data transformation:
First we will need to arrange and transform the data in a
format suitable for SVD.
2.1.1 Data arrangement in matrix
The dataset X, containing time-series values retrieved at the same points in time t0, t1, …, tm (t0 being the oldest value), is transformed into a matrix with the time-series arranged in rows and the columns representing the values at t0, …, tm respectively. This format is chosen in order to comply with the format required by the SVD decomposition: X ∈ R^(n×m).
2.1.2 Data normalization: In this step the dataset is scaled to prepare it for the SVD transformation, first by applying a power transformation (e.g., the natural log) to stabilize the variance and to obtain a more Gaussian distribution. SVD makes the assumption that the underlying data is Gaussian distributed and can be well described in terms of means and covariances [9]. After the power transformation step, the data is scaled to standard normal by rows (which represent the time-series). Outlier treatment may be beneficial, since SVD can be sensitive to outliers [9]; this is an important data pre-processing step.
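The pre-processing described above can be sketched as follows; the function name is illustrative, and log1p (rather than a plain log) is an assumed guard against zero sales values, not part of the original description:

```python
import numpy as np

# Sketch of the normalization step: power transform, then row-wise scaling.
def normalize(X):
    # Power transformation (natural log); log1p handles zero sales safely
    X_log = np.log1p(X)
    # Standardize each row (each time-series) to zero mean, unit variance
    mu = X_log.mean(axis=1, keepdims=True)
    sigma = X_log.std(axis=1, keepdims=True)
    # mu and sigma are kept so the scaling can be inverted after forecasting
    return (X_log - mu) / sigma, mu, sigma

# Toy data: 5 "time-series" of 100 non-negative daily values
X = np.abs(np.random.default_rng(1).normal(50, 10, size=(5, 100)))
X_scaled, mu, sigma = normalize(X)
assert np.allclose(X_scaled.mean(axis=1), 0)
assert np.allclose(X_scaled.std(axis=1), 1)
```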
2.2 Singular Value Decomposition (SVD)
Next, the scaled X data is decomposed using SVD and the number of modes to be retained is computed. Three matrices are obtained through the decomposition: U (left singular vectors matrix), Σ (singular values sorted in order of importance) and V (right singular vectors matrix). There are different software libraries which will perform this step [2].
After singular value decomposition, the original matrix X is expressed through the singular values and singular vectors: X = U Σ V^T. U and V are unitary matrices and they essentially induce a rotation of the input data; Σ, the singular values matrix, is a diagonal matrix inducing scaling.
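With the library cited in [2] this step is direct; the 500 × 365 shape below is an illustrative assumption (500 time-series of 365 values):

```python
import numpy as np

# Decompose a scaled matrix; full_matrices=False returns the "economy"
# SVD, which is sufficient when a truncation follows.
X_scaled = np.random.default_rng(2).standard_normal((500, 365))
U, s, Vt = np.linalg.svd(X_scaled, full_matrices=False)

assert U.shape == (500, 365)
assert s.shape == (365,)
assert Vt.shape == (365, 365)
# Singular values come back sorted in decreasing order of importance
assert np.all(np.diff(s) <= 0)
# Sanity check: U @ diag(s) @ Vt recovers the input matrix
assert np.allclose(U @ np.diag(s) @ Vt, X_scaled)
```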
2.2.1 Optimal low-rank X input matrix approximation
A very important point in truncated SVD decomposition is selecting the rank of the truncation in order to obtain the optimal low-rank approximation of X. Instead of taking all the singular values and their corresponding left and right singular vectors, we take only the k largest singular values and their corresponding singular vectors.
As we will see later in this paper, while choosing a higher k would give a closer approximation to X, choosing a smaller k saves effort overall, because fewer components need to be forecast. Neglecting all but the first k components is justified since the first k components supposedly capture the underlying structure, or signal, of the data [3]. An example is shown in figure 1.
Figure 1. Truncated SVD with k-reduced singular decomposition of X
If the original input matrix X had dimension n × m, the k-truncated SVD matrices will have the following dimensions: U (n × k), Σ (k × k), V^T (k × m).
If, for example, we start with X containing 500 time-series with 365 values each and k = 2 components are used for the truncated decomposition, the resulting dimensions of the low-rank matrices are: U (500 × 2), Σ (2 × 2), V^T (2 × 365).
There are several methods to compute the optimal k value. In this paper I used an empirical 'elbow'-like method: plotting the semi-log of the singular values and choosing the cut-off at the inflection point, correlated with the inspection of the auto-correlation function (ACF) for the main components of V^T.
In figure 1, which represents the semi-log plot of the singular values obtained through decomposition (the diagonal of matrix Σ), we can see that for the example presented in this paper the slope tends to stabilize after k = 1, so we can use this value for the matrix approximation.
Fig. 1 Optimal k selection by the method of the semi-log plot of singular values
Depending on the time-series characteristics, a larger k might be needed; in other datasets I have investigated, the optimal rank k was in the range 7-9. Because this does not change the approach, in this paper I refer to the dataset used for exemplification, a Kaggle challenge dataset for store-item forecasting [5].
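A simple numerical complement to the visual elbow inspection is a cumulative-energy criterion: how much of the total variance the first k singular values capture. The singular values below are illustrative placeholders with one dominant mode, not the actual values from the dataset:

```python
import numpy as np

# Illustrative singular values: one dominant mode, then a slow decay
s = np.array([120.0, 15.0, 9.0, 7.5, 6.8, 6.2, 5.9])

# Fraction of total "energy" (sum of squared singular values) captured
# by the first k modes
energy = np.cumsum(s**2) / np.sum(s**2)

# Smallest k whose cumulative energy exceeds a chosen threshold
k = int(np.searchsorted(energy, 0.95)) + 1
assert energy[0] > 0.95   # the first mode alone dominates
assert k == 1
```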
3. Forecasting the main components of the right singular vectors matrix V^T
The V matrix encodes the time-series dynamics. The k vectors from the truncated right singular vectors matrix V^T represent the time-bound evolution of the underlying structure of the data, and we will forecast only them. Therefore, instead of forecasting all n time-series, we forecast only k time-series, for a small k << n. As we will see below, k can have a value as small as one or two, so we can compute the forecast for n time-series by forecasting only a few (k << n) main V^T components. To recompose the original input matrix and generate the forecast for all time-series, we use the following truncated matrices: U_truncated and Σ_truncated obtained in 2.2.1, and a new V^T matrix, V^T_forecast, obtained from the horizontal concatenation of the truncated V^T (shape k × m, from the truncated singular value decomposition in 2.2.1) and the k forecasts, each of shape 1 × forecast_horizon.
Therefore the V^T_forecast matrix will have the shape k × m', where m' = m + forecast_horizon. Finally, X_forecast is computed as the dot product of U_truncated, Σ_truncated and V^T_forecast, with X_forecast of shape n × m', as per formula (3).
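The recomposition step can be sketched as follows. The variable names and the random placeholder matrices are assumptions for illustration; in practice U_trunc, S_trunc and Vt_trunc come from the truncated SVD, and Vt_pred holds the k component forecasts:

```python
import numpy as np

# Shapes from the store-item example: 500 series, 1734 history points,
# rank k = 1, 92-day forecast horizon
n, m, k, horizon = 500, 1734, 1, 92

U_trunc = np.random.default_rng(3).standard_normal((n, k))
S_trunc = np.diag(np.array([100.0])[:k])                     # k x k
Vt_trunc = np.random.default_rng(4).standard_normal((k, m))
Vt_pred = np.random.default_rng(5).standard_normal((k, horizon))

# Horizontal concatenation: history modes followed by their forecasts
Vt_forecast = np.hstack((Vt_trunc, Vt_pred))
assert Vt_forecast.shape == (k, m + horizon)

# X_forecast = U~ S~ V~^T per formula (3): reconstructed history + forecast
X_forecast = U_trunc @ S_trunc @ Vt_forecast
assert X_forecast.shape == (n, m + horizon)
```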
3.1 Decision about the forecasting model
Because the final forecast is influenced only by the limited number of (k) forecasts made on the main components of the V matrix, it is very important that these k forecasts be as accurate as possible. We may need to take into account both long-term and short-term cycles and use an appropriate time-series forecasting model. One challenge is that the model should be able to accommodate multiple cycles (e.g., weekly and yearly).
For this study, the model used was a Triple Exponential Smoothing (Holt-Winters) [4] stochastic model, because it can easily accommodate a long-term cycle. As seen in figure 3, the data presents yearly and weekly seasonality. Other models could also be tested for this purpose, such as Auto-Regressive Integrated Moving Average (ARIMA), but this remains future work.
The solution used in this study was Holt-Winters Triple Exponential Smoothing with yearly seasonality for the prediction of the first component (choosing k = 1).
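For reference, a minimal additive Holt-Winters (triple exponential smoothing) sketch is shown below. In practice a library implementation with fitted smoothing parameters would be used; the alpha/beta/gamma values and the toy series here are illustrative assumptions:

```python
import numpy as np

def holt_winters_additive(y, season_len, horizon, alpha=0.3, beta=0.05, gamma=0.1):
    # Initialize level, trend and seasonal terms from the first two seasons
    level = np.mean(y[:season_len])
    trend = (np.mean(y[season_len:2 * season_len]) - level) / season_len
    seasonals = np.array(y[:season_len], dtype=float) - level
    for t in range(len(y)):
        s = t % season_len
        last_level = level
        level = alpha * (y[t] - seasonals[s]) + (1 - alpha) * (level + trend)
        trend = beta * (level - last_level) + (1 - beta) * trend
        seasonals[s] = gamma * (y[t] - level) + (1 - gamma) * seasonals[s]
    # h-step-ahead forecast: level + h*trend + the matching seasonal term
    return np.array([level + (h + 1) * trend + seasonals[(len(y) + h) % season_len]
                     for h in range(horizon)])

# Toy series: linear trend plus a period-7 ("weekly") cycle
t = np.arange(140)
y = 0.05 * t + np.sin(2 * np.pi * t / 7.0)
forecast = holt_winters_additive(y, season_len=7, horizon=14)
assert forecast.shape == (14,)
assert np.all(np.isfinite(forecast))
```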
4. Example: Application to a store-item simultaneous forecasting task
To exemplify the forecasting method, I used a dataset with 500 time-series from a Kaggle challenge [5]. The dataset contains 5 years of daily store-item sales data for 50 different items at 10 different stores (500 store-item time-series in total). The prediction was made for 3 months of sales: 92 values representing daily sales for each time-series.
Figure 2 shows one store-item time-series; it can be observed that it exhibits yearly seasonality. Through the analysis of the auto-correlation (ACF) and partial auto-correlation (PACF) graphs we will see that the main components of the low-rank approximation matrix also exhibit both yearly and weekly seasonality.
Figure 2. Daily sales of one store-item
4.1 Data transformation: First the data was split into train and validation dataframes with shapes 500 × 1734 and 500 × 92 respectively; I reserved the last three months of data (92 days) for results validation.
As a data preprocessing step, the train data was log transformed and then standardized, as described in paragraph 2.1 Data Transformation. According to figure 1, the rank k can be set to 1 and consequently there will be only one component (mode 0 of the right singular vectors matrix) to be forecast. In the next step, singular value decomposition (SVD) was applied on the scaled train data and the low-rank matrices U (500 × 1), Σ (1 × 1), V^T (1 × 1734) were computed.
4.2 Data Analysis and Modeling
Figure 3 shows the graph of V^T mode 0.
Figure 3. V^T mode 0
As in the original time-series, the yearly and weekly cycles are also captured in the time-series corresponding to mode 0. It also exhibits a trend.
The auto-correlation (ACF) graph for 50 lags of mode 0 (fig. 4) displays the weekly seasonality through the lag-7 spikes of the differenced mode 0 time-series.
Figure 4. Auto-correlation (ACF) of the differenced mode 0 time-series (50 lags)
The auto-correlation (ACF) graph of the second mode, mode 1 (fig. 5), displays no significant correlation at any lag, which is one more reason to limit our rank k to the first mode (mode 0, k = 1).
Figure 5. Auto-correlation (ACF) of the mode 1 time-series (50 lags)
According to the ACF graph, I used a Holt-Winters model with yearly seasonality for forecasting mode 0 (figure 6).
Figure 6. Forecast of the first right singular vector (mode 0)
4.3 Generate forecasting for all 500 time-series:
The forecast is produced by using the approximation formula (where @ denotes the matrix product):
scaled_forecast_sales = U_truncated @ Σ_truncated @ np.hstack((V_truncated, V_forecast))
Note that the right singular vectors matrix used is a horizontal concatenation of V_truncated and V_forecast, with shape 1 × 1826, and scaled_forecast_sales will have the shape 500 × 1826. The resulting dataset contains both the reconstructed history (1734 values per store-item time-series) and the forecast (92 values per store-item time-series).
Figure 7 below displays the final forecast result for one time-series; the forecast horizon is 92 days.
Figure 7. Example of a store-item time-series forecast (forecast horizon is 92 days)
4.4 Method evaluation: Comparative results with the baseline forecasting method: Triple Exponential Smoothing (Holt-Winters) [4]
For the evaluation of forecasting accuracy I used two metrics which are widely used for assessing prediction performance: Mean Absolute Percentage Error (MAPE) and Root Mean Squared Error (RMSE).
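The two metrics, as commonly defined, can be computed as follows (the sample values are illustrative only):

```python
import numpy as np

# Mean Absolute Percentage Error, in percent
def mape(actual, pred):
    actual, pred = np.asarray(actual, float), np.asarray(pred, float)
    return 100.0 * np.mean(np.abs((actual - pred) / actual))

# Root Mean Squared Error
def rmse(actual, pred):
    actual, pred = np.asarray(actual, float), np.asarray(pred, float)
    return np.sqrt(np.mean((actual - pred) ** 2))

actual = np.array([100.0, 200.0, 50.0])
pred = np.array([110.0, 190.0, 55.0])
assert np.isclose(mape(actual, pred), 100 * (0.1 + 0.05 + 0.1) / 3)
assert np.isclose(rmse(actual, pred), np.sqrt((100 + 100 + 25) / 3))
```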
Table 1 presents the comparison of the averages of these two metrics over all 500 time-series, computed for the method described in this paper and for a well-known time-series prediction method: the Holt-Winters Triple Exponential Smoothing.
It can be seen that the proposed method had better results than the Holt-Winters method: on average, MAPE improved by 22.7% and RMSE by ~19%.
Table 1: Forecasting Accuracy: results comparison
5. Concluding remarks
In this work I proposed a novel method for multi-step-ahead simultaneous forecasting of multiple time-series, using matrix factorization, with Truncated Singular Value Decomposition at its core. As the central algorithm, the method uses the Truncated Singular Value Decomposition of a dataset (named X) containing many time-series to be predicted (shape n × m). After the low-rank approximation of the dataset, the multi-step-ahead prediction is made only on the main components of the truncated right singular vectors matrix V_truncated. Therefore, instead of forecasting n time-series, it is enough to forecast k, with k << n. In this example it was enough to set the rank k to 1, meaning only one mode was used.
The horizontal concatenation of V_truncated with its forecast yields a new V_truncated* matrix, which is used to compose the forecast of the original X using formula (3). The X_forecast matrix contains both the approximated values of the original X and the multi-step-ahead forecast for all time-series.
I have shown that this method has several important advantages:
a. It is very scalable.
b. It can simultaneously predict a large number of n time-series through the prediction of only k main components (modes), with k << n (k can be as low as 1).
c. The processing time necessary for multi-step-ahead simultaneous forecasting of multiple time-series is a fraction of the processing time needed when individual forecasts are performed for all n time-series.
d. The forecasting accuracy improves on average over a well-known stochastic time-series forecasting method, namely Holt-Winters Triple Exponential Smoothing.
The proposed method is fairly broad and can potentially be applied to numerous other real-data applications where multi-step-ahead simultaneous forecasting of a large number of time-series is needed.
Forecasting Accuracy - Results Comparison

Method                                      | Average MAPE [%] | Average RMSE | Std. dev. MAPE | Std. dev. RMSE
Prediction on main modes using SVD          | 18.15            | 11.03        | 3.47           | 4.04
Triple Exponential Smoothing (Holt-Winters) | 23.48            | 13.61        | 5.54           | 5.12
6. References
[1] Brunton S., Kutz J.N., February 2019. Singular Value Decomposition (SVD). researchgate.net/publication/331230334_Singular_Value_Decomposition_SVD
[2] For the SVD decomposition in this work I used NumPy: https://numpy.org/doc/stable/reference/generated/numpy.linalg.svd.html
[3] Frank M., Buhmann J.M., June 2011. Selecting the Rank of Truncated SVD by Maximum Approximation Capacity.
[4] https://en.wikipedia.org/wiki/Exponential_smoothing#Triple_exponential_smoothing_(Holt_Winters)
[5] Dataset: https://www.kaggle.com/c/demand-forecasting-kernels-only
[6] Nielsen A., 2019. Practical Time Series Analysis: Prediction with Statistics & Machine Learning.
[7] Mills T.C., 2019. Applied Time Series Analysis: A Practical Guide to Modeling and Forecasting. Academic Press.
[8] Oracle Database Online Documentation 12c, Release 1 (12.1), Data Warehousing and Business Intelligence.
[9] Low-rank approximation