Abstract: Motivated by the recovery and prediction of electricity consumption time series, we extend Nonnegative Matrix Factorization (NMF) to take into account external features as side information. We consider general linear measurement settings and propose a framework which models non-linear relationships between external features and the response variable. We extend previous theoretical results to obtain a sufficient condition on the identifiability of NMF with side information. Based on the classical Hierarchical Alternating Least Squares (HALS) algorithm, we propose a new algorithm, HALSX (Hierarchical Alternating Least Squares with eXogenous variables), which estimates NMF in this setting. The algorithm is validated on both simulated and real electricity consumption datasets, as well as a recommendation-system dataset, to show its performance in matrix recovery and in prediction for new rows and columns.
1. Nonnegative Matrix Factorization with Side Information for Time Series Recovery and Prediction
Jiali Mei (4), Yohann De Castro (1), Yannig Goude (1,2), Jean-Marc Azaïs (3), Georges Hébrail (2)
(1) LMO, Univ. Paris-Sud, CNRS, Université Paris-Saclay, Orsay
(2) EDF Lab Paris-Saclay, Palaiseau
(3) Institut de Mathématiques, Université Paul Sabatier, Toulouse
(4) Shift Technology, Paris
September 26, 2018
3. Context
Utility companies are interested in electricity consumption data of small regions (village, block, small city) on a fine temporal scale.
This is useful in several ways:
- Useful for utility companies to manage the supply-demand balance locally, in a world with decentralized electricity generation (wind and solar power) and an open electricity market;
- A requirement imposed on transmission system operators (TSOs) by regulators;
- Generally useful information to better understand socio-economic activities on a fine temporal level.
Are there enough data for doing this?
4. Motivating example 1: data from meters
Figure: Electricity meter readings
Figure: Daily electricity consumption
Traditional electricity meters need to be read physically, and are therefore read at a lower frequency than needed for further applications.
The resulting data are asynchronous, since the meter reading dates are not aligned across clients. Such data are difficult to process further.
5. Motivating example 2: data from the electricity network
Figure: Map of the 7th arrondissement of Lyon and low-voltage transformers in this district
Load data can be available at a high temporal frequency on the electricity network, but at a different or coarser spatial scale than what is needed.
6. Motivating example 3: electricity consumption and external factors
[Figure: two panels. Left: "Influence of calendar variables", consumption (kW) over the day (03:00 to 23:00) by day of the week (Monday through Sunday). Right: "Influence of the temperature", consumption (kW) against temperature (0 to 30 °C).]
Figure: Portuguese electricity consumption versus days of the week and the temperature.
It is well established that electricity consumption is influenced by many external factors.
7. A matrix representation
Figure: A matrix representation of the estimation target of the thesis
Variable of interest: $v_{i,j}$ is the electricity consumption at period $i$ for individual $j$, with $n_1$ periods and $n_2$ individuals in total.
We write $V \in \mathbb{R}^{n_1 \times n_2}$ for the whole matrix, and $(v^i)^T$ and $v_j$ for the $i$-th row and $j$-th column.
8. Main questions
Figure: A matrix representation of the estimation target of the thesis
- How can we estimate all entries of the matrix $V$ from temporal aggregates and/or spatial aggregates?
- Can the use of additional information, such as temporal regularity and additional exogenous variables, improve such estimations?
- Is it possible to produce predictions of electricity consumption for new periods and new individuals with such data?
9. Outline
Method: Nonnegative matrix factorization with side information
- NMF with linear measurements
- Time series recovery and prediction with side information
- HALSX algorithm
Experiments
- Time series recovery
- Time series prediction
Conclusions
11. Nonnegative matrix factorization
We propose to solve the estimation problem by nonnegative matrix factorization (NMF, Lee and Seung 1999).
- Based on the hypothesis that the matrix to be recovered is of low rank.
- All entries in the factor matrices are nonnegative.
- A dimension-reduction tool, similar to Singular Value Decomposition (SVD), Principal Component Analysis (PCA), etc.
Remarks on non-negativity
- Why? For the electricity application, nonnegative consumption profiles and weights are much more interpretable.
- Price to pay: fewer convergence guarantees.
13. Trace regression model
We wish to recover a matrix $V^*$ from data $a \in \mathbb{R}^D$, which are linear measurements of the unknown matrix $V^*$:
$$a = \mathcal{A}(V^*),$$
where $\mathcal{A}$ is a known linear operator.
The linear operator $\mathcal{A}$ is identified with $A_1, \dots, A_D$, $D$ matrices or masks of the same dimension as $V^*$. For any matrix $X \in \mathbb{R}^{n_1 \times n_2}$,
$$\mathcal{A}(X) \equiv (\operatorname{Tr}(A_1^T X), \operatorname{Tr}(A_2^T X), \dots, \operatorname{Tr}(A_D^T X))^T.$$
Hence the name trace regression model (Rohde and Tsybakov 2011).
Usual types of measurement operator $\mathcal{A}$:
- complete observations
- matrix completion (Candès and Recht 2009)
- matrix sensing (Recht, Fazel, and Parrilo 2010)
- rank-one matrix projections (Zuk and Wagner 2015)
- temporal aggregates
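To make the measurement model concrete, here is a minimal numpy sketch of the temporal-aggregate case (the function names and toy data are ours, not from the paper): each mask $A_d$ is the indicator of a window of consecutive periods in one column, so $\operatorname{Tr}(A_d^T V)$ is the total consumption of one client over that window.

```python
import numpy as np

def temporal_aggregate_masks(n1, n2, windows):
    """Build masks A_1..A_D for temporal aggregates.

    windows: list of (column j, start period t0, window length h);
    each mask is the indicator of V[t0:t0+h, j], so that
    Tr(A_d^T V) = sum of a window of column j.
    """
    masks = []
    for j, t0, h in windows:
        A = np.zeros((n1, n2))
        A[t0:t0 + h, j] = 1.0        # indicator of the aggregated window
        masks.append(A)
    return masks

def apply_A(masks, V):
    # a_d = Tr(A_d^T V) for each mask, computed as an elementwise product sum
    return np.array([np.sum(A * V) for A in masks])

# toy example: two meter readings on a 6-period, 2-client matrix
V = np.arange(12.0).reshape(6, 2)
masks = temporal_aggregate_masks(6, 2, [(0, 0, 3), (1, 2, 4)])
a = apply_A(masks, V)                # linear measurements a = A(V)
```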
16. Classical NMF algorithms with linear measurements
We minimize the quadratic approximation error, with a linear equality constraint:
$$\min_{V \in \mathbb{R}^{n_1 \times n_2},\ F_r \in \mathbb{R}^{n_1 \times k},\ F_c \in \mathbb{R}^{n_2 \times k}} \|V - F_r F_c^T\|_F^2 \quad \text{s.t. } V \ge 0,\ F_r \ge 0,\ F_c \ge 0,\ \mathcal{A}(V) = a.$$
We solve it by combining
- classical iterative NMF algorithms, such as HALS or NeNMF (Cichocki 2009; Guan et al. 2012),
- with a projection step $V = P_\mathcal{A}(F_r F_c^T)$, where $P_\mathcal{A}$ is the projection operator onto the convex set defined by the two constraints, $V \ge 0$ and $\mathcal{A}(V) = a$.
Pseudo-code:
Data: $P_\mathcal{A}$, $1 \le k \le \min\{n_1, n_2\}$
Result: $V$ satisfying the constraints, $F_r \in \mathbb{R}_+^{n_1 \times k}$, $F_c \in \mathbb{R}_+^{n_2 \times k}$
Initialize $F_r^0, F_c^0 \ge 0$, $V^0 = P_\mathcal{A}(F_r^0 (F_c^0)^T)$, $i = 0$;
while stopping criterion is not satisfied do
  $F_r^{i+1} = \text{Update}(F_r^i, (F_c^i)^T, V^i)$;
  $(F_c^{i+1})^T = \text{Update}(F_r^{i+1}, (F_c^i)^T, V^i)$;
  $V^{i+1} = P_\mathcal{A}(F_r^{i+1} (F_c^{i+1})^T)$;
  $i = i + 1$;
end
Limiting points are stationary points, as for most NMF algorithms.
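A compact numpy sketch of this scheme, assuming a routine `proj` implementing $P_\mathcal{A}$ is supplied (the factor updates below are standard HALS rank-one updates; everything else, including the names and the fixed iteration count, is our own simplification):

```python
import numpy as np

def hals_sweep(V, Fr, Fc, eps=1e-12):
    """One sweep of HALS rank-one updates of Fr, holding Fc fixed."""
    R = V - Fr @ Fc.T
    for i in range(Fr.shape[1]):
        Ri = R + np.outer(Fr[:, i], Fc[:, i])          # residual excluding rank i
        Fr[:, i] = np.maximum(0.0, Ri @ Fc[:, i] / max(Fc[:, i] @ Fc[:, i], eps))
        R = Ri - np.outer(Fr[:, i], Fc[:, i])
    return Fr

def projected_nmf(n1, n2, k, proj, n_iter=200, seed=0):
    """NMF with linear measurements: alternate HALS updates of the factors
    with the projection V = P_A(Fr Fc^T) onto {V >= 0, A(V) = a}."""
    rng = np.random.default_rng(seed)
    Fr, Fc = rng.random((n1, k)), rng.random((n2, k))
    V = proj(Fr @ Fc.T)
    for _ in range(n_iter):
        Fr = hals_sweep(V, Fr, Fc)       # update row factors
        Fc = hals_sweep(V.T, Fc, Fr)     # update column factors symmetrically
        V = proj(Fr @ Fc.T)              # re-impose the measurement constraints
    return V, Fr, Fc
```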
20. Regression models on factors
We introduce regression models into the NMF framework to take into account external factors that influence electricity consumption.
Potential benefits:
- It may improve recovery quality.
- It may help to interpret the estimated profiles.
- The regression models may be used for prediction on new periods and new individuals.
21. Generative low-rank model with exogenous variables
To take exogenous variables into account as side information, we propose a generative low-rank nonnegative model:
- $V^*$ has an NMF: $V^* = F_r F_c^T$.
- The data are still $a = \mathcal{A}(V^*)$.
- Feature matrices $X_r \in \mathbb{R}^{n_1 \times d_1}$ and $X_c \in \mathbb{R}^{n_2 \times d_2}$ are connected to $V^*$ through link functions $f_r : \mathbb{R}^{d_1} \to \mathbb{R}^k$ and $f_c : \mathbb{R}^{d_2} \to \mathbb{R}^k$, so that
$$F_r = (f_r(X_r))_+, \qquad F_c = (f_c(X_c))_+,$$
where the matrices are obtained by stacking the row vectors together.
Given this generative model, the task is to estimate $f_c$, $f_r$, $F_c$, $F_r$, and $V^*$, given $X_r$, $X_c$, $\mathcal{A}$, and $a$.
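A minimal simulation of this generative model in numpy, with linear link functions for simplicity (the framework allows non-linear $f_r$, $f_c$; the dimensions and names below are our own illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
n1, n2, d1, d2, k = 150, 180, 5, 4, 3

# hypothetical feature matrices and linear link functions f_r, f_c
Xr, Xc = rng.random((n1, d1)), rng.random((n2, d2))
Theta_r = rng.standard_normal((d1, k))
Theta_c = rng.standard_normal((d2, k))

Fr = np.maximum(0.0, Xr @ Theta_r)   # F_r = (f_r(X_r))_+
Fc = np.maximum(0.0, Xc @ Theta_c)   # F_c = (f_c(X_c))_+
V_star = Fr @ Fc.T                   # nonnegative, rank <= k

# measurements a = A(V*); complete observations, the simplest operator
a = V_star.ravel()
```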
22. Classification of models
The generative model leads to the following optimization problem:
$$\min_{V,\ f_r \in \mathcal{F}_r^k,\ f_c \in \mathcal{F}_c^k} \|V - (f_r(X_r))_+ ((f_c(X_c))_+)^T\|_F^2 \quad \text{s.t. } \mathcal{A}(V) = a,\ V \ge 0.$$
The generative model is very general and includes many known methods as special cases, by specifying:
- the measurement operator $\mathcal{A}$: complete observations, matrix completion, matrix sensing, rank-one matrix projections, temporal aggregates;
- the functional spaces of $f_r$, $f_c$: reduced-rank linear models (Foygel et al. 2012), non-parametric reduced-rank regression (Mukherjee and Zhu 2011);
- the features $X_r$, $X_c$: multiple kernel learning (Gönen and Alpaydın 2011), collaborative filtering with graph features (Abernethy et al. 2009).
25. Extending HALS
The generative model leads to the following optimization problem:
$$\min_{V,\ f_r \in \mathcal{F}_r^k,\ f_c \in \mathcal{F}_c^k} \|V - (f_r(X_r))_+ ((f_c(X_c))_+)^T\|_F^2 \quad \text{s.t. } \mathcal{A}(V) = a,\ V \ge 0.$$
To solve this problem, we extend the HALS algorithm mentioned before, modifying the update function for each rank at each iteration to use the exogenous variables. We call this algorithm HALSX (HALS with eXogenous variables).
- Under fairly mild conditions, HALSX also verifies that its limiting points are stationary points.
- Sufficient conditions for the uniqueness of such a decomposition can be found in the case where the link functions are linear.
26. HALSX: Pseudo-code
Data: $\mathcal{A}$, $a$, $X_r$, $X_c$, $\mathcal{F}_r$, $\mathcal{F}_c$, $1 \le k \le \min\{n_1, n_2\}$.
Result: $V^t$, $F_r^t \in \mathbb{R}_+^{n_1 \times k}$, $f_{r,1}^t, \dots, f_{r,k}^t \in \mathcal{F}_r$, $F_c^t \in \mathbb{R}_+^{n_2 \times k}$, $f_{c,1}^t, \dots, f_{c,k}^t \in \mathcal{F}_c$.
Initialize $F_r^0, F_c^0 \ge 0$, $t = 0$;
while stopping criterion is not satisfied do
  $V^t = \arg\min_{V : \mathcal{A}(V) = a,\ V \ge 0} \|V - F_r^t (F_c^t)^T\|_F^2$;
  $R^t = V^t - F_r^t (F_c^t)^T$;
  for $i = 1, 2, \dots, k$ do
    $R_i^t = R^t + f_{r,i}^t (f_{c,i}^t)^T$;
    Calculate $f_{r,i}^{t+1} = \arg\min_{f \in \mathcal{F}_r} \|R_i^t - f(X_r)(f_{c,i}^t)^T\|_F^2$;
    $f_{r,i}^{t+1} = \max(0, f_{r,i}^{t+1}(X_r))$;
    $R^t = R_i^t - f_{r,i}^{t+1} (f_{c,i}^t)^T$;
  end
  for $i = 1, 2, \dots, k$ do
    $R_i^t = R^t + f_{r,i}^{t+1} (f_{c,i}^t)^T$;
    Calculate $f_{c,i}^{t+1} = \arg\min_{f \in \mathcal{F}_c} \|R_i^t - f_{r,i}^{t+1} f(X_c)^T\|_F^2$;
    $f_{c,i}^{t+1} = \max(0, f_{c,i}^{t+1}(X_c))$;
    $R^t = R_i^t - f_{r,i}^{t+1} (f_{c,i}^{t+1})^T$;
  end
  $t = t + 1$;
end
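For a linear functional space, the inner update $\arg\min_{f \in \mathcal{F}_r} \|R_i^t - f(X_r)(f_{c,i}^t)^T\|_F^2$ has a closed form. Here is a sketch (our own derivation via the normal equations; HALSX itself accepts any regression routine here, e.g. GAMs or kernel methods):

```python
import numpy as np

def halsx_linear_rank_update(Ri, Xr, y, eps=1e-12):
    """Rank-i HALSX update with a linear link f(X) = X @ theta.

    Minimizing ||Ri - (Xr @ theta) y^T||_F^2 over theta reduces to the
    least-squares problem  Xr @ theta ~ Ri @ y / (y @ y).
    """
    target = Ri @ y / max(y @ y, eps)
    theta, *_ = np.linalg.lstsq(Xr, target, rcond=None)
    fr_col = np.maximum(0.0, Xr @ theta)   # clipping step from the pseudo-code
    return theta, fr_col
```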
27. Local convergence of HALSX
The following property is known for HALS (Kim, He, and Park 2014):
Property. For all $R \in \mathbb{R}^{n_1 \times n_2}$ and $y \in \mathbb{R}_+^{n_2}$, $y$ not identically zero, any vector $x^*$ that verifies
$$x^* \in \arg\min_{x \in \mathbb{R}^{n_1}} \|R - x y^T\|_F^2$$
is also a solution to
$$\min_{x \in \mathbb{R}^{n_1}} \|R - x_+ y^T\|_F^2.$$
In HALSX, we can show the following similar property:
Proposition. Suppose that $R \in \mathbb{R}^{n_1 \times n_2}$ and $f_c \in \mathbb{R}_+^{n_2}$ are not identically equal to zero, and $g : \mathbb{R}^d \to \mathbb{R}^{n_1}$, with $d \ge n_1$, is a convex differentiable function. Suppose
$$\theta^* \in \arg\min_{\theta \in \mathbb{R}^d} \|R - g(\theta)(f_c)^T\|_F^2.$$
If $g_{\theta^*}$, the Jacobian matrix of $g$ at $\theta^*$, is of rank $n_1$, then $\theta^*$ is also a solution to
$$\min_{\theta \in \mathbb{R}^d} \|R - (g(\theta))_+(f_c)^T\|_F^2.$$
Then, by an argument of strict quasi-convexity, we obtain the convergence result.
30. Experimental setting
Three datasets are used in the experiments:
- Synthetic: a rank-20, 150-by-180 nonnegative matrix simulated following the generative model ($n_1 = 150$, $n_2 = 180$), with synthetic row and column variables.
- French: daily consumption of 473 medium-voltage feeders near Lyon from 2010 to 2012 ($n_1 = 1096$, $n_2 = 473$). Row variables: daily temperature, calendar variables. Column variables: the percentage of each type of client (residential, professional, industrial, high-voltage clients).
- Portuguese: daily consumption of 370 Portuguese clients from 2010 to 2014 ($n_1 = 1461$, $n_2 = 369$). Row variables: daily temperature, calendar variables.
We generate measurements by selecting a number of observation periods, either uniformly over the whole matrix (random), or periodically with a randomly chosen offset for each column (periodic).
31. Recovery or prediction
To test the prediction on new individuals and new periods, temporal aggregates are generated for a number of observation periods over the upper-left submatrix.
An error metric (RRMSE, or $\|\hat{X} - X\|_F / \|X\|_F$) is calculated on each of the four submatrices.
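For reference, the metric in a few lines of numpy (the helper name is ours):

```python
import numpy as np

def rrmse(X_hat, X):
    """Relative recovery error ||X_hat - X||_F / ||X||_F."""
    return np.linalg.norm(X_hat - X, 'fro') / np.linalg.norm(X, 'fro')
```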
32. Profiles obtained with the Portuguese dataset
Using external factors, the obtained profiles present visible annual cycles.
34. Results on time series recovery
[Figure: recovery error versus sampling rate (0.1 to 0.5) on the Synthetic, Portuguese and French datasets, for periodic and random sampling. Algorithms compared: empty_model, softImpute, HALS, NeNMF, HALSX_model; HALSX regression models: lm, gam, gaussprRadial, svmLinear.]
Using exogenous variables (HALSX_model), the matrix recovery error is in most cases equivalent to or an improvement on that of NMF methods (NeNMF and HALS).
With random observation dates on the synthetic dataset, which is arguably the least realistic case, HALSX_model does slightly worse than NeNMF and HALS.
36. Results on time series prediction
[Figure: prediction error (row error, column error, row-and-column error) versus sampling rate (0.25 to 1.00) on synthetic data, for periodic and random sampling. Algorithms compared: rrr, trmf, individual_gam, factor_gam, HALSX_model; HALSX regression models: lm, gam, gaussprRadial, svmLinear.]
[Figure: the same prediction errors on the French data.]
- On synthetic data, the prediction error is rather low for the three prediction types (around 10%), which is remarkable since only very partial data was available in the first place.
- On the real-world datasets, the prediction error is higher. However, HALSX still outperforms the other benchmark methods.
- HALSX is not sensitive to the sampling rate.
38. Conclusions
In this talk we
- formalized the temporal aggregate observations of electricity consumption as a trace regression model;
- proposed a generative low-rank matrix model to introduce side information into NMF;
- derived HALSX, an algorithm to solve the new NMF problem;
- tested it on real and synthetic electricity datasets and obtained results that are equivalent to or better than those of reference methods.
The proposed method is implemented in an R package used internally at EDF.
39. Perspectives of the thesis
Industrial applications
- Instead of estimating the whole time series, NMF can be used to directly or indirectly estimate important statistics, such as the peak demand.
Methodological perspectives
- Estimation with both spatial and temporal aggregates
- Use of social network data as column variables
- Causal relationships between the presence of measurements and the data being measured
- Neural networks/deep learning with partial data
Theoretical perspectives
- Is it possible to achieve global convergence of first-order NMF algorithms in special cases?
40. References
- Jacob Abernethy et al. "A New Approach to Collaborative Filtering: Operator Estimation with Spectral Regularization". In: The Journal of Machine Learning Research 10 (2009), pp. 803–826.
- Emmanuel J. Candès and Benjamin Recht. "Exact Matrix Completion via Convex Optimization". In: Foundations of Computational Mathematics 9.6 (2009), pp. 717–772. doi: 10.1007/s10208-009-9045-5.
- Yunmei Chen and Xiaojing Ye. "Projection Onto A Simplex". In: arXiv preprint arXiv:1101.6081 (2011).
- Andrzej Cichocki, ed. Nonnegative Matrix and Tensor Factorizations: Applications to Exploratory Multi-Way Data Analysis and Blind Source Separation. Chichester, U.K.: John Wiley, 2009. 477 pp. isbn: 978-0-470-74666-0.
- Rina Foygel et al. "Nonparametric Reduced Rank Regression". In: Advances in Neural Information Processing Systems. 2012, pp. 1628–1636.
- Mehmet Gönen and Ethem Alpaydın. "Multiple Kernel Learning Algorithms". In: Journal of Machine Learning Research 12 (July 2011), pp. 2211–2268.
- Naiyang Guan et al. "NeNMF: An Optimal Gradient Method for Nonnegative Matrix Factorization". In: IEEE Transactions on Signal Processing 60.6 (2012), pp. 2882–2898. doi: 10.1109/TSP.2012.2190406.
- Jingu Kim, Yunlong He, and Haesun Park. "Algorithms for Nonnegative Matrix and Tensor Factorizations: A Unified View Based on Block Coordinate Descent Framework". In: Journal of Global Optimization 58.2 (2014), pp. 285–319.
- Daniel D. Lee and H. Sebastian Seung. "Learning the Parts of Objects by Non-Negative Matrix Factorization". In: Nature 401.6755 (1999), pp. 788–791.
- Ashin Mukherjee and Ji Zhu. "Reduced Rank Ridge Regression and Its Kernel Extensions". In: Statistical Analysis and Data Mining 4.6 (Dec. 2011), pp. 612–622. doi: 10.1002/sam.10138.
- Benjamin Recht, Maryam Fazel, and Pablo A. Parrilo. "Guaranteed Minimum-Rank Solutions of Linear Matrix Equations via Nuclear Norm Minimization". In: SIAM Review 52.3 (2010), pp. 471–501.
- Angelika Rohde and Alexandre B. Tsybakov. "Estimation of High-Dimensional Low-Rank Matrices". In: The Annals of Statistics 39.2 (2011), pp. 887–930. doi: 10.1214/10-AOS860.
- Or Zuk and Avishai Wagner. "Low-Rank Matrix Recovery from Row-and-Column Affine Measurements". In: Proceedings of the 32nd International Conference on Machine Learning. 2015, pp. 2012–2020.
41. How to calculate $P_\mathcal{A}(X)$
For some forms of masks, there are efficient methods.
- Matrix completion: replace the observed entries.
- Temporal aggregates: simplex projection. With $I_d$ the set of entries aggregated in measurement $d$ (periods $t_0(d)+1$ to $t_0(d)+h(d)$ of column $n_d$), solve
$$\min_{v_{I_d}} \sum_{t=t_0(d)+1}^{t_0(d)+h(d)} \big( v_t - (f_r^t)^T (F_c^T)_{n_d} \big)^2 \quad \text{s.t. } v_{I_d} \ge 0,\ v_{I_d}^T \mathbf{1} = a_d.$$
An efficient simplex projection algorithm (Chen and Ye 2011) is used in this case.
- General case: iterate between
$$V \leftarrow V + \mathcal{A}^\dagger(a - \mathcal{A}(V)) \quad \text{and} \quad v_{i,j} \leftarrow \max(0, v_{i,j}).$$
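A numpy sketch of the general case, stacking the masks into a matrix $M$ with rows $\operatorname{vec}(A_d)$ so that $\mathcal{A}(V) = M \operatorname{vec}(V)$ and $\mathcal{A}^\dagger$ is the pseudo-inverse of $M$ (an alternating-projection heuristic under our own simplifications; it returns a point near the intersection of the two constraint sets rather than the exact projection of the input):

```python
import numpy as np

def project_onto_constraints(a, masks, V0, n_iter=100):
    """Approximate P_A: alternate the affine step V <- V + A^+(a - A(V))
    with the clipping v_ij <- max(0, v_ij).  For matrix completion or
    temporal aggregates, the closed forms above are faster."""
    M = np.stack([A.ravel() for A in masks])   # D x (n1*n2), rows vec(A_d)
    M_pinv = np.linalg.pinv(M)                 # implements the A^+ step
    v = V0.ravel().copy()
    for _ in range(n_iter):
        v = v + M_pinv @ (a - M @ v)           # project onto {A(V) = a}
        v = np.maximum(0.0, v)                 # project onto {V >= 0}
    return v.reshape(V0.shape)
```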
42. Which functional spaces to choose
- HALSX is rather agnostic to the choice of regression models.
- There is a bias-variance trade-off between flexible models with many parameters and simple models with few parameters.
- Overfitting can be addressed by cross-validation at each model update.
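One way the per-update cross-validation could look, sketched with scikit-learn (the candidate models and helper name are our illustration, not the thesis's exact choices; `target` is the least-squares target of one rank update):

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

def pick_link_model(Xr, target):
    """Choose the regression model for one rank update by cross-validation."""
    candidates = [LinearRegression(), KernelRidge(kernel="rbf", alpha=1.0)]
    scores = [
        cross_val_score(m, Xr, target, cv=5,
                        scoring="neg_mean_squared_error").mean()
        for m in candidates
    ]
    # keep the candidate with the best held-out error, refit on all data
    return candidates[int(np.argmax(scores))].fit(Xr, target)
```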