Abstract: Motivated by the recovery and prediction of electricity consumption time series, we extend Nonnegative Matrix Factorization (NMF) to take into account external features as side information. We consider general linear measurement settings and propose a framework which models non-linear relationships between external features and the response variable. We extend previous theoretical results to obtain a sufficient condition on the identifiability of NMF with side information. Based on the classical Hierarchical Alternating Least Squares (HALS) algorithm, we propose a new algorithm, HALSX (Hierarchical Alternating Least Squares with eXogenous variables), which estimates NMF in this setting. The algorithm is validated on both simulated and real electricity consumption datasets, as well as a recommendation-system dataset, to show its performance in matrix recovery and in prediction for new rows and columns.
1. Nonnegative Matrix Factorization with Side Information for Time Series Recovery and Prediction
Jiali Mei (4), Yohann De Castro (1), Yannig Goude (1,2), Jean-Marc Azaïs (3), Georges Hébrail (2)
(1) LMO, Univ. Paris-Sud, CNRS, Université Paris-Saclay, Orsay
(2) EDF Lab Paris-Saclay, Palaiseau
(3) Institut de Mathématiques, Université Paul Sabatier, Toulouse
(4) Shift Technology, Paris
September 26, 2018
3. Context
Utility companies are interested in electricity consumption data of small regions (village, block, small city) on a fine temporal scale.
This is useful in several ways:
- Useful for utility companies to manage the supply-demand balance locally, in a world with decentralized electricity generation (wind and solar power) and an open electricity market;
- A requirement imposed on transmission system operators (TSOs) by regulators;
- Generally useful information to better understand socio-economic activities on a fine temporal level.
Are there enough data for doing this?
4. Motivating example 1: data from meters
Figure: Electricity meter readings
Figure: Daily electricity consumption
Traditional electricity meters need to be read physically, and are therefore read at a lower frequency than needed for further applications.
The resulting data are asynchronous, since the meter reading dates are not aligned across clients. Such data are difficult to process further.
5. Motivating example 2: data from the electricity network
Figure: Map of the 7th arrondissement of Lyon and low-voltage transformers in this district
Load data can be available at a high temporal frequency on the electricity network, but at a different or coarser spatial scale than what is needed.
6. Motivating example 3: electricity consumption and external factors
[Figure: two panels. Left: "Influence of calendar variables", consumption (kW) over the day (03:00 to 23:00) by day of the week (Monday through Sunday). Right: "Influence of the temperature", consumption (kW) against temperature (0 to 30 °C).]
Figure: Portuguese electricity consumption versus days of the week and the temperature.
It is well established that electricity consumption is influenced by many external factors.
7. A matrix representation
Figure: A matrix representation of the estimation target of the thesis
Variable of interest: $v_{i,j}$ is the electricity consumption at period $i$ for individual $j$, with $n_1$ periods and $n_2$ individuals in total.
We write $V \in \mathbb{R}^{n_1 \times n_2}$ for the whole matrix, and $(v^i)^T$ and $v_j$ for the $i$-th row and $j$-th column.
8. Main questions
Figure: A matrix representation of the estimation target of the thesis
- How can we estimate all entries of the matrix $V$ from temporal aggregates and/or spatial aggregates?
- Can the use of additional information, such as temporal regularity and additional exogenous variables, improve such estimations?
- Is it possible to produce predictions of electricity consumption for new periods and new individuals with such data?
9. Outline
Method: Nonnegative matrix factorization with side information
- NMF with linear measurements
- Time series recovery and prediction with side information
- HALSX algorithm
Experiments
- Time series recovery
- Time series prediction
Conclusions
11. Nonnegative matrix factorization
We propose to solve the estimation problem by nonnegative matrix factorization (NMF, Lee and Seung 1999).
- Based on the hypothesis that the matrix to be recovered is of low rank.
- All entries in the factor matrices are nonnegative.
- A dimension-reduction tool, similar to Singular Value Decomposition (SVD), Principal Component Analysis (PCA), etc.
Remarks on non-negativity
- Why? For the electricity application, nonnegative consumption profiles and weights are much more interpretable.
- Price to pay: fewer convergence guarantees.
13. Trace regression model
We wish to recover a matrix $V^*$ from data $a \in \mathbb{R}^D$, which are linear measurements of the unknown matrix $V^*$:
$$a = \mathcal{A}(V^*),$$
where $\mathcal{A}$ is a known linear operator.
The linear operator $\mathcal{A}$ is identified with $A_1, \dots, A_D$, $D$ matrices or masks of the same dimension as $V^*$. For any matrix $X \in \mathbb{R}^{n_1 \times n_2}$,
$$\mathcal{A}(X) \equiv (\operatorname{Tr}(A_1^T X), \operatorname{Tr}(A_2^T X), \dots, \operatorname{Tr}(A_D^T X))^T.$$
Hence the name trace regression model (Rohde and Tsybakov 2011).
Usual types of measurement operator $\mathcal{A}$:
- complete observations
- matrix completion (Candès and Recht 2009)
- matrix sensing (Recht, Fazel, and Parrilo 2010)
- rank-one matrix projections (Zuk and Wagner 2015)
- temporal aggregates
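To make the measurement model concrete, here is a minimal numpy sketch of the temporal-aggregate case (the function names and toy data are ours, not from the paper): each mask $A_d$ is the indicator of a window of consecutive periods in one column, so $\operatorname{Tr}(A_d^T V)$ is the total consumption of one client over that window.

```python
import numpy as np

def temporal_aggregate_masks(n1, n2, windows):
    """Build masks A_1..A_D for temporal aggregates.

    windows: list of (column j, start period t0, window length h);
    each mask is the indicator of V[t0:t0+h, j], so that
    Tr(A_d^T V) = sum of a window of column j.
    """
    masks = []
    for j, t0, h in windows:
        A = np.zeros((n1, n2))
        A[t0:t0 + h, j] = 1.0        # indicator of the aggregated window
        masks.append(A)
    return masks

def apply_A(masks, V):
    # a_d = Tr(A_d^T V) for each mask, computed as an elementwise product sum
    return np.array([np.sum(A * V) for A in masks])

# toy example: two meter readings on a 6-period, 2-client matrix
V = np.arange(12.0).reshape(6, 2)
masks = temporal_aggregate_masks(6, 2, [(0, 0, 3), (1, 2, 4)])
a = apply_A(masks, V)                # linear measurements a = A(V)
```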
16. Classical NMF algorithms with linear measurements
We minimize the quadratic approximation error, with a linear equality constraint:
$$\min_{V \in \mathbb{R}^{n_1 \times n_2},\ F_r \in \mathbb{R}^{n_1 \times k},\ F_c \in \mathbb{R}^{n_2 \times k}} \|V - F_r F_c^T\|_F^2 \quad \text{s.t. } V \ge 0,\ F_r \ge 0,\ F_c \ge 0,\ \mathcal{A}(V) = a.$$
We solve it by combining
- classical iterative NMF algorithms, such as HALS or NeNMF (Cichocki 2009; Guan et al. 2012),
- with a projection step $V = P_\mathcal{A}(F_r F_c^T)$, where $P_\mathcal{A}$ is the projection operator onto the convex set defined by the two constraints, $V \ge 0$ and $\mathcal{A}(V) = a$.
Pseudo-code:
Data: $P_\mathcal{A}$, $1 \le k \le \min\{n_1, n_2\}$
Result: $V$ satisfying the constraints, $F_r \in \mathbb{R}_+^{n_1 \times k}$, $F_c \in \mathbb{R}_+^{n_2 \times k}$
Initialize $F_r^0, F_c^0 \ge 0$, $V^0 = P_\mathcal{A}(F_r^0 (F_c^0)^T)$, $i = 0$;
while stopping criterion is not satisfied do
  $F_r^{i+1} = \text{Update}(F_r^i, (F_c^i)^T, V^i)$;
  $(F_c^{i+1})^T = \text{Update}(F_r^{i+1}, (F_c^i)^T, V^i)$;
  $V^{i+1} = P_\mathcal{A}(F_r^{i+1} (F_c^{i+1})^T)$;
  $i = i + 1$;
end
Limiting points are stationary points, as for most NMF algorithms.
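A compact numpy sketch of this scheme, assuming a routine `proj` implementing $P_\mathcal{A}$ is supplied (the factor updates below are standard HALS rank-one updates; everything else, including the names and the fixed iteration count, is our own simplification):

```python
import numpy as np

def hals_sweep(V, Fr, Fc, eps=1e-12):
    """One sweep of HALS rank-one updates of Fr, holding Fc fixed."""
    R = V - Fr @ Fc.T
    for i in range(Fr.shape[1]):
        Ri = R + np.outer(Fr[:, i], Fc[:, i])          # residual excluding rank i
        Fr[:, i] = np.maximum(0.0, Ri @ Fc[:, i] / max(Fc[:, i] @ Fc[:, i], eps))
        R = Ri - np.outer(Fr[:, i], Fc[:, i])
    return Fr

def projected_nmf(n1, n2, k, proj, n_iter=200, seed=0):
    """NMF with linear measurements: alternate HALS updates of the factors
    with the projection V = P_A(Fr Fc^T) onto {V >= 0, A(V) = a}."""
    rng = np.random.default_rng(seed)
    Fr, Fc = rng.random((n1, k)), rng.random((n2, k))
    V = proj(Fr @ Fc.T)
    for _ in range(n_iter):
        Fr = hals_sweep(V, Fr, Fc)       # update row factors
        Fc = hals_sweep(V.T, Fc, Fr)     # update column factors symmetrically
        V = proj(Fr @ Fc.T)              # re-impose the measurement constraints
    return V, Fr, Fc
```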
20. Regression models on factors
We introduce regression models into the NMF framework to take into account external factors that influence electricity consumption.
Potential benefits:
- It may improve recovery quality.
- It may help to interpret the estimated profiles.
- The regression models may be used for prediction on new periods and new individuals.
21. Generative low-rank model with exogenous variables
To take exogenous variables into account as side information, we propose a generative low-rank nonnegative model:
- $V^*$ has an NMF: $V^* = F_r F_c^T$.
- The data are still $a = \mathcal{A}(V^*)$.
- Feature matrices $X_r \in \mathbb{R}^{n_1 \times d_1}$ and $X_c \in \mathbb{R}^{n_2 \times d_2}$ are connected to $V^*$ through link functions $f_r : \mathbb{R}^{d_1} \to \mathbb{R}^k$ and $f_c : \mathbb{R}^{d_2} \to \mathbb{R}^k$, so that
$$F_r = (f_r(X_r))_+, \qquad F_c = (f_c(X_c))_+,$$
where the matrices are obtained by stacking the row vectors together.
Given this generative model, the task is to estimate $f_c$, $f_r$, $F_c$, $F_r$, and $V^*$, given $X_r$, $X_c$, $\mathcal{A}$, and $a$.
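A minimal simulation of this generative model in numpy, with linear link functions for simplicity (the framework allows non-linear $f_r$, $f_c$; the dimensions and names below are our own illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
n1, n2, d1, d2, k = 150, 180, 5, 4, 3

# hypothetical feature matrices and linear link functions f_r, f_c
Xr, Xc = rng.random((n1, d1)), rng.random((n2, d2))
Theta_r = rng.standard_normal((d1, k))
Theta_c = rng.standard_normal((d2, k))

Fr = np.maximum(0.0, Xr @ Theta_r)   # F_r = (f_r(X_r))_+
Fc = np.maximum(0.0, Xc @ Theta_c)   # F_c = (f_c(X_c))_+
V_star = Fr @ Fc.T                   # nonnegative, rank <= k

# measurements a = A(V*); complete observations, the simplest operator
a = V_star.ravel()
```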
22. Classification of models
The generative model leads to the following optimization problem:
$$\min_{V,\ f_r \in \mathcal{F}_r^k,\ f_c \in \mathcal{F}_c^k} \|V - (f_r(X_r))_+ ((f_c(X_c))_+)^T\|_F^2 \quad \text{s.t. } \mathcal{A}(V) = a,\ V \ge 0.$$
The generative model is very general and includes many known methods as special cases, by specifying:
- the measurement operator $\mathcal{A}$: complete observations, matrix completion, matrix sensing, rank-one matrix projections, temporal aggregates;
- the functional spaces of $f_r$, $f_c$: reduced-rank linear models (Foygel et al. 2012), non-parametric reduced-rank regression (Mukherjee and Zhu 2011);
- the features $X_r$, $X_c$: multiple kernel learning (Gönen and Alpaydın 2011), collaborative filtering with graph features (Abernethy et al. 2009).
25. Extending HALS
The generative model leads to the following optimization problem:
$$\min_{V,\ f_r \in \mathcal{F}_r^k,\ f_c \in \mathcal{F}_c^k} \|V - (f_r(X_r))_+ ((f_c(X_c))_+)^T\|_F^2 \quad \text{s.t. } \mathcal{A}(V) = a,\ V \ge 0.$$
To solve this problem, we extend the HALS algorithm mentioned before, modifying the update function for each rank at each iteration to use the exogenous variables. We call this algorithm HALSX (HALS with eXogenous variables).
- Under fairly mild conditions, HALSX also verifies that its limiting points are stationary points.
- Sufficient conditions for the uniqueness of such a decomposition can be found in the case where the link functions are linear.
26. HALSX: Pseudo-code
Data: $\mathcal{A}$, $a$, $X_r$, $X_c$, $\mathcal{F}_r$, $\mathcal{F}_c$, $1 \le k \le \min\{n_1, n_2\}$.
Result: $V^t$, $F_r^t \in \mathbb{R}_+^{n_1 \times k}$, $f_{r,1}^t, \dots, f_{r,k}^t \in \mathcal{F}_r$, $F_c^t \in \mathbb{R}_+^{n_2 \times k}$, $f_{c,1}^t, \dots, f_{c,k}^t \in \mathcal{F}_c$.
Initialize $F_r^0, F_c^0 \ge 0$, $t = 0$;
while stopping criterion is not satisfied do
  $V^t = \arg\min_{V : \mathcal{A}(V) = a,\ V \ge 0} \|V - F_r^t (F_c^t)^T\|_F^2$;
  $R^t = V^t - F_r^t (F_c^t)^T$;
  for $i = 1, 2, \dots, k$ do
    $R_i^t = R^t + f_{r,i}^t (f_{c,i}^t)^T$;
    Calculate $f_{r,i}^{t+1} = \arg\min_{f \in \mathcal{F}_r} \|R_i^t - f(X_r)(f_{c,i}^t)^T\|_F^2$;
    $f_{r,i}^{t+1} = \max(0, f_{r,i}^{t+1}(X_r))$;
    $R^t = R_i^t - f_{r,i}^{t+1} (f_{c,i}^t)^T$;
  end
  for $i = 1, 2, \dots, k$ do
    $R_i^t = R^t + f_{r,i}^{t+1} (f_{c,i}^t)^T$;
    Calculate $f_{c,i}^{t+1} = \arg\min_{f \in \mathcal{F}_c} \|R_i^t - f_{r,i}^{t+1} f(X_c)^T\|_F^2$;
    $f_{c,i}^{t+1} = \max(0, f_{c,i}^{t+1}(X_c))$;
    $R^t = R_i^t - f_{r,i}^{t+1} (f_{c,i}^{t+1})^T$;
  end
  $t = t + 1$;
end
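For a linear functional space, the inner update $\arg\min_{f \in \mathcal{F}_r} \|R_i^t - f(X_r)(f_{c,i}^t)^T\|_F^2$ has a closed form. Here is a sketch (our own derivation via the normal equations; HALSX itself accepts any regression routine here, e.g. GAMs or kernel methods):

```python
import numpy as np

def halsx_linear_rank_update(Ri, Xr, y, eps=1e-12):
    """Rank-i HALSX update with a linear link f(X) = X @ theta.

    Minimizing ||Ri - (Xr @ theta) y^T||_F^2 over theta reduces to the
    least-squares problem  Xr @ theta ~ Ri @ y / (y @ y).
    """
    target = Ri @ y / max(y @ y, eps)
    theta, *_ = np.linalg.lstsq(Xr, target, rcond=None)
    fr_col = np.maximum(0.0, Xr @ theta)   # clipping step from the pseudo-code
    return theta, fr_col
```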
27. Local convergence of HALSX
The following property is known for HALS (Kim, He, and Park 2014):
Property. For all $R \in \mathbb{R}^{n_1 \times n_2}$ and $y \in \mathbb{R}_+^{n_2}$, $y$ not identically zero, any vector $x^*$ that verifies
$$x^* \in \arg\min_{x \in \mathbb{R}^{n_1}} \|R - x y^T\|_F^2$$
is also a solution to
$$\min_{x \in \mathbb{R}^{n_1}} \|R - x_+ y^T\|_F^2.$$
In HALSX, we can show the following similar property:
Proposition. Suppose that $R \in \mathbb{R}^{n_1 \times n_2}$ and $f_c \in \mathbb{R}_+^{n_2}$ are not identically equal to zero, and $g : \mathbb{R}^d \to \mathbb{R}^{n_1}$, with $d \ge n_1$, is a convex differentiable function. Suppose
$$\theta^* \in \arg\min_{\theta \in \mathbb{R}^d} \|R - g(\theta)(f_c)^T\|_F^2.$$
If $g_{\theta^*}$, the Jacobian matrix of $g$ at $\theta^*$, is of rank $n_1$, then $\theta^*$ is also a solution to
$$\min_{\theta \in \mathbb{R}^d} \|R - (g(\theta))_+(f_c)^T\|_F^2.$$
Then, by an argument of strict quasi-convexity, we obtain the convergence result.
30. Experimental setting
Three datasets are used in the experiments:
- Synthetic: a rank-20, 150-by-180 nonnegative matrix simulated following the generative model ($n_1 = 150$, $n_2 = 180$), with synthetic row and column variables.
- French: daily consumption of 473 medium-voltage feeders near Lyon from 2010 to 2012 ($n_1 = 1096$, $n_2 = 473$). Row variables: daily temperature, calendar variables. Column variables: the percentage of each type of client (residential, professional, industrial, high-voltage clients).
- Portuguese: daily consumption of 370 Portuguese clients from 2010 to 2014 ($n_1 = 1461$, $n_2 = 369$). Row variables: daily temperature, calendar variables.
We generate measurements by selecting a number of observation periods, either uniformly over the whole matrix (random), or periodically with a randomly chosen offset for each column (periodic).
31. Recovery or prediction
To test the prediction on new individuals and new periods, temporal aggregates are generated for a number of observation periods over the upper-left submatrix.
An error metric (RRMSE, or $\|\hat{X} - X\|_F / \|X\|_F$) is calculated on each of the four submatrices.
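For reference, the metric in a few lines of numpy (the helper name is ours):

```python
import numpy as np

def rrmse(X_hat, X):
    """Relative recovery error ||X_hat - X||_F / ||X||_F."""
    return np.linalg.norm(X_hat - X, 'fro') / np.linalg.norm(X, 'fro')
```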
32. Profiles obtained with the Portuguese dataset
Using external factors, the obtained profiles present visible annual cycles.
34. Results on time series recovery
[Figure: recovery error versus sampling rate (0.1 to 0.5) on the Synthetic, Portuguese and French datasets, for periodic and random sampling. Algorithms compared: empty_model, softImpute, HALS, NeNMF, HALSX_model; HALSX regression models: lm, gam, gaussprRadial, svmLinear.]
Using exogenous variables (HALSX_model), the matrix recovery error is in most cases equivalent to or an improvement on that of NMF methods (NeNMF and HALS).
With random observation dates on the synthetic dataset, which is arguably the least realistic case, HALSX_model does slightly worse than NeNMF and HALS.
36. Results on time series prediction
[Figure: prediction error (row error, column error, row-and-column error) versus sampling rate (0.25 to 1.00) on synthetic data, for periodic and random sampling. Algorithms compared: rrr, trmf, individual_gam, factor_gam, HALSX_model; HALSX regression models: lm, gam, gaussprRadial, svmLinear.]
[Figure: the same prediction errors on the French data.]
- On synthetic data, the prediction error is rather low for the three prediction types (around 10%), which is remarkable since only very partial data was available in the first place.
- On the real-world datasets, the prediction error is higher. However, HALSX still outperforms the other benchmark methods.
- HALSX is not sensitive to the sampling rate.
38. Conclusions
In this talk we
- formalized the temporal aggregate observations of electricity consumption as a trace regression model;
- proposed a generative low-rank matrix model to introduce side information into NMF;
- derived HALSX, an algorithm to solve the new NMF problem;
- tested it on real and synthetic electricity datasets and obtained results that are equivalent to or better than those of reference methods.
The proposed method is implemented in an R package used internally at EDF.
39. Perspectives of the thesis
Industrial applications
- Instead of estimating the whole time series, NMF can be used to directly or indirectly estimate important statistics, such as the peak demand.
Methodological perspectives
- Estimation with both spatial and temporal aggregates
- Use of social network data as column variables
- Causal relationships between the presence of measurements and the data being measured
- Neural networks/deep learning with partial data
Theoretical perspectives
- Is it possible to achieve global convergence of first-order NMF algorithms in special cases?
40. References
- Jacob Abernethy et al. "A New Approach to Collaborative Filtering: Operator Estimation with Spectral Regularization". In: The Journal of Machine Learning Research 10 (2009), pp. 803–826.
- Emmanuel J. Candès and Benjamin Recht. "Exact Matrix Completion via Convex Optimization". In: Foundations of Computational Mathematics 9.6 (2009), pp. 717–772. doi: 10.1007/s10208-009-9045-5.
- Yunmei Chen and Xiaojing Ye. "Projection Onto A Simplex". In: arXiv preprint arXiv:1101.6081 (2011).
- Andrzej Cichocki, ed. Nonnegative Matrix and Tensor Factorizations: Applications to Exploratory Multi-Way Data Analysis and Blind Source Separation. Chichester, U.K.: John Wiley, 2009. 477 pp. isbn: 978-0-470-74666-0.
- Rina Foygel et al. "Nonparametric Reduced Rank Regression". In: Advances in Neural Information Processing Systems. 2012, pp. 1628–1636.
- Mehmet Gönen and Ethem Alpaydın. "Multiple Kernel Learning Algorithms". In: Journal of Machine Learning Research 12 (July 2011), pp. 2211–2268.
- Naiyang Guan et al. "NeNMF: An Optimal Gradient Method for Nonnegative Matrix Factorization". In: IEEE Transactions on Signal Processing 60.6 (2012), pp. 2882–2898. doi: 10.1109/TSP.2012.2190406.
- Jingu Kim, Yunlong He, and Haesun Park. "Algorithms for Nonnegative Matrix and Tensor Factorizations: A Unified View Based on Block Coordinate Descent Framework". In: Journal of Global Optimization 58.2 (2014), pp. 285–319.
- Daniel D. Lee and H. Sebastian Seung. "Learning the Parts of Objects by Non-Negative Matrix Factorization". In: Nature 401.6755 (1999), pp. 788–791.
- Ashin Mukherjee and Ji Zhu. "Reduced Rank Ridge Regression and Its Kernel Extensions". In: Statistical Analysis and Data Mining 4.6 (Dec. 2011), pp. 612–622. doi: 10.1002/sam.10138.
- Benjamin Recht, Maryam Fazel, and Pablo A. Parrilo. "Guaranteed Minimum-Rank Solutions of Linear Matrix Equations via Nuclear Norm Minimization". In: SIAM Review 52.3 (2010), pp. 471–501.
- Angelika Rohde and Alexandre B. Tsybakov. "Estimation of High-Dimensional Low-Rank Matrices". In: The Annals of Statistics 39.2 (2011), pp. 887–930. doi: 10.1214/10-AOS860.
- Or Zuk and Avishai Wagner. "Low-Rank Matrix Recovery from Row-and-Column Affine Measurements". In: Proceedings of the 32nd International Conference on Machine Learning. 2015, pp. 2012–2020.
41. How to calculate $P_\mathcal{A}(X)$
For some forms of masks, there are efficient methods.
- Matrix completion: replace the observed entries.
- Temporal aggregates: simplex projection. With $I_d$ the set of entries aggregated in measurement $d$ (periods $t_0(d)+1$ to $t_0(d)+h(d)$ of column $n_d$), solve
$$\min_{v_{I_d}} \sum_{t=t_0(d)+1}^{t_0(d)+h(d)} \big( v_t - (f_r^t)^T (F_c^T)_{n_d} \big)^2 \quad \text{s.t. } v_{I_d} \ge 0,\ v_{I_d}^T \mathbf{1} = a_d.$$
An efficient simplex projection algorithm (Chen and Ye 2011) is used in this case.
- General case: iterate between
$$V \leftarrow V + \mathcal{A}^\dagger(a - \mathcal{A}(V)) \quad \text{and} \quad v_{i,j} \leftarrow \max(0, v_{i,j}).$$
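A numpy sketch of the general case, stacking the masks into a matrix $M$ with rows $\operatorname{vec}(A_d)$ so that $\mathcal{A}(V) = M \operatorname{vec}(V)$ and $\mathcal{A}^\dagger$ is the pseudo-inverse of $M$ (an alternating-projection heuristic under our own simplifications; it returns a point near the intersection of the two constraint sets rather than the exact projection of the input):

```python
import numpy as np

def project_onto_constraints(a, masks, V0, n_iter=100):
    """Approximate P_A: alternate the affine step V <- V + A^+(a - A(V))
    with the clipping v_ij <- max(0, v_ij).  For matrix completion or
    temporal aggregates, the closed forms above are faster."""
    M = np.stack([A.ravel() for A in masks])   # D x (n1*n2), rows vec(A_d)
    M_pinv = np.linalg.pinv(M)                 # implements the A^+ step
    v = V0.ravel().copy()
    for _ in range(n_iter):
        v = v + M_pinv @ (a - M @ v)           # project onto {A(V) = a}
        v = np.maximum(0.0, v)                 # project onto {V >= 0}
    return v.reshape(V0.shape)
```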
42. Which functional spaces to choose
- HALSX is rather agnostic to the choice of regression models.
- There is a bias-variance trade-off between flexible models with many parameters and simple models with few parameters.
- Overfitting can be addressed by cross-validation at each model update.
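One way the per-update cross-validation could look, sketched with scikit-learn (the candidate models and helper name are our illustration, not the thesis's exact choices; `target` is the least-squares target of one rank update):

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

def pick_link_model(Xr, target):
    """Choose the regression model for one rank update by cross-validation."""
    candidates = [LinearRegression(), KernelRidge(kernel="rbf", alpha=1.0)]
    scores = [
        cross_val_score(m, Xr, target, cv=5,
                        scoring="neg_mean_squared_error").mean()
        for m in candidates
    ]
    # keep the candidate with the best held-out error, refit on all data
    return candidates[int(np.argmax(scores))].fit(Xr, target)
```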