M1 - Linear Regression and Time Series Analysis
Luis Moreira-Matias
luis.matias[at]neclab.eu
www.luis-matias.pt.vu
NEC Laboratories Europe,
Heidelberg, Germany
07/09/2015, Porto, Portugal
“Eureka!” - How to Build Accurate Predictors for Real-valued Outputs from Simple Methods
M1 - Linear Regression and Time-series Analysis
Outline
Regression Analysis
Basic concepts: Target, Objective and Learning/Induction Functions
Simple Linear Regression
Numerical Example with Least Squares
Multivariate Linear Regression, Bayesian Statistics and Kernel-Based
Approaches
Time Series Analysis - when the time becomes a feature
Basic concepts: Stationarity, ACF/PACF, Seasonality
AutoRegressive (AR) and Moving Average (MA) models
Box-Jenkins ARIMA forecasting for short-term predictions
Lessons Learned
M1 - Linear Regression and Time-series Analysis | Basics in Regression Analysis
An Introduction to Regression
Numerical prediction problems aim to generalize the behavior
of a target variable y given a predefined explanatory context (i.e.
explanatory variables), such that y = f(x);
Example: Energy consumption of a given family y along the time of
the day x;
Inductive learning method: estimate a behavioral function ˆf(x) given
a set of data samples, i.e. training set;
The range of all explanatory variables: feature space;
Example: [0, 24] hours defines the feature space of the time of the
day x;
M1 - Linear Regression and Time-series Analysis | Basics in Regression Analysis
Basic Concepts in Regression
Given a training set X with N = |X| samples, we want to estimate a
Target Function ˆf(x), x ∈ X such that
ˆf : X → ℝ, such that ˆf(x) = f(x), ∀x ∈ X
n denotes the number of features, each of which ranges in ℝ
ℝⁿ denotes the feature space onto which the training set X is mapped
Basic Concepts
Target Function: ˆf(x) ∼ f(x)
Induction Function/Learner/Method: the function used to construct ˆf(x)
from the input samples/training set
Objective/Loss Function: The function that we aim to minimize by
approximating ˆf(x) to f(x)
M1 - Linear Regression and Time-series Analysis | Basics in Regression Analysis
Overview on the Types of Learning Functions
Target Function vs. Learners
Can take either Linear or Non-Linear form, depending on the type of
relationship between y and X built by the Learner;
Parametric Learning methods assume a functional form for ˆf(x)
a priori
Non-parametric methods do not!
White-Box Learning methods can express ˆf(x) as an equation
Black-box methods cannot!
Examples on Learners
Linear Least Squares [Legendre, 1805]: White-box Parametric
(Linear) Learner;
k-Nearest Neighbors [Cover and Hart, 1967]: Black-Box
Non-Parametric Learner;
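To make the contrast concrete, the sketch below (toy data, assuming scikit-learn is available) fits both learners named above on the same single-feature problem: the linear model exposes its coefficients, while the kNN learner only exposes predictions.

```python
# Hedged sketch: a white-box parametric learner vs. a black-box non-parametric one
# on the same toy single-feature data; assumes scikit-learn is installed.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor

X = np.array([[8.0], [9.0], [10.0], [11.0], [12.0]])   # one feature, n = 1
y = np.array([30.0, 60.0, 75.0, 60.0, 75.0])

linear = LinearRegression().fit(X, y)                   # parametric: assumes ˆf(x) = a·x + b
knn = KNeighborsRegressor(n_neighbors=2).fit(X, y)      # non-parametric: no fixed functional form

print(linear.coef_, linear.intercept_)                  # white-box: the equation is readable
print(knn.predict([[10.5]]))                            # black-box: only predictions come out
```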
M1 - Linear Regression and Time-series Analysis | Basics in Regression Analysis
Objective Function in Regression
Typically, the regression task ends up being the following:
arg min_ˆY l(ˆY, Y), ∀x ∈ X, f(x) = y ∈ Y, ˆf(x) = ˆy ∈ ˆY
l is the so-called loss function to be minimized when defining ˆf(x)
if l(ˆY, Y) ∼ 0, we may be approximating ˆf(x) too closely to f(x). Possible
Overfitting!
Regularization can be performed over the loss function to avoid
overfitting (to be discussed further)
Typical Loss Functions in Regression
Absolute Deviation: Σ_{i=1}^{N} |yi − ˆyi|
Least Squares: Σ_{i=1}^{N} (yi − ˆyi)²
r = y − ˆy is often referred to as the prediction residuals
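A minimal numerical illustration of the two loss functions above (toy predictions, plain NumPy):

```python
# Hedged sketch: the two loss functions above on toy values (plain NumPy).
import numpy as np

y     = np.array([30.0, 60.0, 75.0, 60.0, 75.0])   # true targets
y_hat = np.array([24.0, 42.0, 60.0, 78.0, 96.0])   # hypothetical predictions

r = y - y_hat                                       # prediction residuals r = y − ˆy
absolute_deviation = np.sum(np.abs(r))              # Σ |yi − ˆyi|
least_squares      = np.sum(r ** 2)                 # Σ (yi − ˆyi)²
print(absolute_deviation, least_squares)
```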
M1 - Linear Regression and Time-series Analysis | An Overview on Linear Regression
Simple Linear Regression
Simple Linear Regression is a special case of Linear Regression
which considers only one independent variable x, i.e. n = 1
It is parametric as it assumes that the target function must be a linear
combination of the feature values
It is a white-box method as the form of the target function is known before
it is learned
Target Function: ˆf(x) = a · x + b
To estimate ˆf(x), we need to compute the a, b which minimize
Σ_{i=1}^{N} (yi − ˆyi)² → Least Squares
M1 - Linear Regression and Time-series Analysis | An Overview on Linear Regression
Numerical Example with Least Squares
Mr. Burns loves money!
He lives in the Blue Street Mansion, where it is always very hot!!!
Blue Street has a Bus Stop which is always crowded!!!
He wants to set up an on-street lemonade stand to exploit those poor people!
M1 - Linear Regression and Time-series Analysis | An Overview on Linear Regression
Numerical Example with Least Squares
Problem 1: his freezer cannot hold more than 80 lemonades.
Problem 2: his freezer cannot keep each lemonade for more than 1-2 hours.
Question: he wants to know how many people will be waiting at the stop
throughout the day.
Let's help him :(
M1 - Linear Regression and Time-series Analysis | An Overview on Linear Regression
Numerical Example with Least Squares
We got some direct observations X
x y
8 30
9 60
10 75
11 60
12 75
[Scatter plot: Bus Arrival Time (in hours, independent variable) vs. Number of Passengers Boarded in Blue Street (dependent variable)]
M1 - Linear Regression and Time-series Analysis | An Overview on Linear Regression
Numerical Example with Least Squares
Goal: estimate ¯f(x) ∼ f(x) (dashed line)
x y
8 30
9 60
10 75
11 60
12 75
[Scatter plot of the observations with the target regression line f(x) shown dashed]
M1 - Linear Regression and Time-series Analysis | An Overview on Linear Regression
Numerical Example with Least Squares
Any regression curve must pass through the (¯x, ¯y) mean values
x y
8 30
9 60
10 75
11 60
12 75
Mean Mean
10 60
[Scatter plot of the observations with the sample means marked]
M1 - Linear Regression and Time-series Analysis | An Overview on Linear Regression
Numerical Example with Least Squares
Compute sample differences to ¯x
x y x − ¯x
8 30 -2
9 60 -1
10 75 0
11 60 1
12 75 2
Mean Mean
10 60
[Scatter plot of the observations]
M1 - Linear Regression and Time-series Analysis | An Overview on Linear Regression
Numerical Example with Least Squares
Compute sample differences to ¯y
x y x − ¯x y − ¯y
8 30 -2 -30
9 60 -1 0
10 75 0 15
11 60 1 30
12 75 2 45
Mean Mean
10 60
[Scatter plot of the observations]
M1 - Linear Regression and Time-series Analysis | An Overview on Linear Regression
Numerical Example with Least Squares
¯f(x) = a · x + b; a = Σ_{i=1}^{N} (xi − ¯x)(yi − ¯y) / Σ_{i=1}^{N} (xi − ¯x)² = 180 / 10 = 18, i.e. the slope
x y x − ¯x y − ¯y (x − ¯x)² (x − ¯x)(y − ¯y)
8 30 -2 -30 4 60
9 60 -1 0 1 0
10 75 0 15 0 0
11 60 1 30 1 30
12 75 2 45 4 90
Mean Mean Sum Sum
10 60 10 180
[Scatter plot of the observations]
M1 - Linear Regression and Time-series Analysis | An Overview on Linear Regression
Numerical Example with Least Squares
b = ¯y − ¯x · a = 60 − 10 × 18 = −120, i.e. the intercept
x y x − ¯x y − ¯y (x − ¯x)² (x − ¯x)(y − ¯y)
8 30 -2 -30 4 60
9 60 -1 0 1 0
10 75 0 15 0 0
11 60 1 30 1 30
12 75 2 45 4 90
Mean Mean Sum Sum
10 60 10 180
[Scatter plot of the observations]
M1 - Linear Regression and Time-series Analysis | An Overview on Linear Regression
Numerical Example with Least Squares
Estimated Target Function: ¯f(x) = 18x − 120
[Scatter plot of the observations with the fitted line ¯f(x) = 18x − 120]
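The same arithmetic as a short script (plain NumPy), reproducing the slope and intercept derived above:

```python
# Hedged sketch: the least-squares arithmetic above, reproduced with NumPy.
import numpy as np

x = np.array([8.0, 9.0, 10.0, 11.0, 12.0])    # bus arrival time (in hours)
y = np.array([30.0, 60.0, 75.0, 60.0, 75.0])  # passengers boarded in Blue Street

a = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)  # slope: 180/10 = 18
b = y.mean() - x.mean() * a                                                # intercept: 60 − 10·18 = −120
print(a, b)   # 18.0 -120.0, i.e. ˆf(x) = 18x − 120
```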
M1 - Linear Regression and Time-series Analysis | An Overview on Linear Regression
Multivariate Linear Regression
What if there are multiple features, i.e. n > 1?
X will be a matrix... XN,n = [ x1,1 x1,2 · · · x1,n ; x2,1 x2,2 · · · x2,n ; ... ; xN,1 xN,2 · · · xN,n ]
All the previous operations can be performed through algebraic
operators and transformations (to be seen further on)!
¯f(x) = w1 · x[1] + w2 · x[2] + ... + wn · x[n] + b = wᵀx + b
We do know that f(x) = ¯f(x) + ε. Assuming that ε ∼ N, the
Least Squares solution is equivalent to the Maximum Likelihood Estimator
(MLE)!
Consequently, we can obtain ¯f(x) through the maximum likelihood weights
w = (XᵀX)⁻¹(XᵀY)
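A compact sketch of the closed-form estimate above (synthetic data; in practice np.linalg.lstsq or a QR/SVD-based solver is preferred over explicitly inverting XᵀX):

```python
# Hedged sketch: multivariate OLS via the normal equations w = (XᵀX)⁻¹(XᵀY),
# on synthetic data; np.linalg.lstsq is the numerically safer alternative.
import numpy as np

rng = np.random.default_rng(0)
N, n = 100, 3
X = np.hstack([rng.normal(size=(N, n)), np.ones((N, 1))])   # last column of ones carries the bias b
true_w = np.array([2.0, -1.0, 0.5, 4.0])
y = X @ true_w + rng.normal(scale=0.1, size=N)               # f(x) = ¯f(x) + ε, ε ∼ N(0, σ²)

w = np.linalg.inv(X.T @ X) @ (X.T @ y)                       # normal equations, as on the slide
w_stable, *_ = np.linalg.lstsq(X, y, rcond=None)             # preferred in practice
print(w, w_stable)
```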
M1 - Linear Regression and Time-series Analysis | Other Useful Regression Methods
Result Analysis
Is Mr. Burns happy with our final result?
[Scatter plot with the fitted line extrapolated to later hours of the day]
M1 - Linear Regression and Time-series Analysis | Other Useful Regression Methods
Prediction Exercise
Let’s make an estimate for 14h... we get 130 pax!
However, he knows that the bus has a capacity of 100 pax!!!!
How certain can he be about our prediction?
[Scatter plot with the 14h prediction marked against the bus capacity line (100 pax)]
M1 - Linear Regression and Time-series Analysis | Other Useful Regression Methods
Bayesian Linear Regression
Why doesn’t our LS/MLE output a satisfactory model?
It overfitted the training data - which do not adequately cover the
feature space!!!
One solution: Go Bayesian [Box and Tiao, 2011]! Our prior
knowledge can be incorporated into the loss function!
M1 - Linear Regression and Time-series Analysis | Other Useful Regression Methods
Bayesian Linear Regression
A priori, we do know that the bus capacity is 100 pax.
Empirically, we could state that f(x) ∼ N(60, 10) → Predictive
Distribution
µ = 60 is easily verifiable on the data, while σ² = 10 comes from our
belief;
Result: p(20 < f(x) < 80) = 90%
[Density plot of the N(60, 10) predictive distribution]
M1 - Linear Regression and Time-series Analysis | Other Useful Regression Methods
Bayesian Linear Regression
Result: a line with a gentler slope. Mr. Burns is (almost...) a
happy man!
[Scatter plot comparing the original least-squares line with the flatter Bayesian fit]
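The slides do not spell out the posterior computation; one standard concrete reading of "going Bayesian" here is a Gaussian prior over the weights, whose MAP estimate is ridge (L2-regularized) least squares and produces exactly this kind of flatter line. A minimal sketch under that assumption (the prior strength alpha is a hypothetical choice):

```python
# Hedged sketch: a Gaussian prior on the weights makes the MAP estimate a ridge
# (L2-regularized) least squares, w = (XᵀX + αI)⁻¹ Xᵀy; α = 5 is an assumed value.
import numpy as np

x = np.array([8.0, 9.0, 10.0, 11.0, 12.0])
y = np.array([30.0, 60.0, 75.0, 60.0, 75.0])
X = np.column_stack([x, np.ones_like(x)])          # columns: [x, 1] -> w = (slope, intercept)

alpha = 5.0                                        # prior strength (assumed; both weights shrunk for simplicity)
w_ls  = np.linalg.solve(X.T @ X, X.T @ y)                      # plain least squares: steep slope
w_map = np.linalg.solve(X.T @ X + alpha * np.eye(2), X.T @ y)  # MAP with Gaussian prior: flatter slope
print("LS  (slope, intercept):", w_ls)
print("MAP (slope, intercept):", w_map)
```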
M1 - Linear Regression and Time-series Analysis | Other Useful Regression Methods
Other Regression Methods
What if we have more samples? Do you think that the target function
would still be linear?
Would overfitting be a bigger problem in the presence of outliers?
[Scatter plot with additional samples, including outliers]
M1 - Linear Regression and Time-series Analysis | Other Useful Regression Methods
Other Regression Methods
Kernel Regression [Nadaraya, 1964] can deal with outliers.
It is a white-box non-parametric regression method.
It returns a weighted average of all the samples, where each weight is
computed from a notion of neighborhood given by a bandwidth
parameter (i.e. λ).
¯f(x∗) = Σ_{i=1}^{N} K(x∗, xi, λ) yi / Σ_{i=1}^{N} K(x∗, xi, λ)
the K function is the kernel used to compute such weights
Distinct applications may require distinct kernels!
λ must be tuned before usage (e.g. cross validation)
Two common kernels: Nadaraya-Watson K(x∗, xi, λ) = (x∗ − xi) / λ,
Normal K(x∗, xi, λ) = e^(−(x∗ − xi)² / (2λ²))
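A minimal sketch of the weighted-average estimator above, using the Normal (Gaussian) kernel; the sample values and bandwidths are illustrative:

```python
# Hedged sketch: Nadaraya-Watson-style kernel regression with the Normal (Gaussian)
# kernel from the slide; sample values and bandwidths are illustrative.
import numpy as np

def normal_kernel(x_star, x_i, lam):
    return np.exp(-((x_star - x_i) ** 2) / (2.0 * lam ** 2))

def kernel_regression(x_star, x, y, lam):
    """Weighted average of all targets; weights come from the kernel neighborhood."""
    w = normal_kernel(x_star, x, lam)
    return np.sum(w * y) / np.sum(w)

x = np.array([8.0, 9.0, 10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0])
y = np.array([30.0, 60.0, 75.0, 60.0, 75.0, 90.0, 80.0, 40.0, 20.0, 10.0])

for lam in (1, 3, 5, 10):                 # λ must be tuned, e.g. by cross validation
    print(lam, kernel_regression(14.5, x, y, lam))
```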
M1 - Linear Regression and Time-series Analysis | Other Useful Regression Methods
Nadaraya-Watson Kernel Regression
Example using a Nadaraya-Watson Kernel...
[Four panels: kernel regression fits for bandwidths λ = 1, 3, 5 and 10; Bus Arrival Time (in hours) vs. Number of Passengers Boarded in Blue Street]
M1 - Linear Regression and Time-series Analysis | Other Useful Regression Methods
Other Relevant Methods worth exploring
Conjugate Gradients [Hestenes and Stiefel, 1952]
Weighted Linear Regression [Strutz, 2010]
Regression (Decision) Trees [Breiman et al., 1984]
Local Regression [Cleveland, 1981]
M1 - Linear Regression and Time-series Analysis | Time Series Analysis
What is Time Series analysis?
Time Series Analysis is a subfield of signal processing that deals with
modeling the behavior of a timestamped series of numerical values.
It can be seen as a special case of the simple linear regression problem where
x is defined over time and, even more importantly, the samples' arrival
sequence is relevant for their future values!!!
Finally, the signal is expected to follow some trend, evolving
over seasons.
M1 - Linear Regression and Time-series Analysis | Time Series Analysis
Time Series - An Example
[Time series plot: Number of Passengers Boarded in Blue Street per hour, Monday through Friday]
M1 - Linear Regression and Time-series Analysis | Time Series Analysis
Time Series - Basic Concepts
Stationarity - Are the mean, the variance and the co-variance
constant along t?
Stationarity is a key property when dealing with time series. Even if the
series is not stationary, it may become so by differencing the signal with
some lag d
Trend - Is there any periodicity for which the signal repeats itself (in
some pattern) along t?
Seasonality - Is the signal stationary for subsets of t? Has this signal
a trend for subsets of t? Then it is said to be seasonal!
ACF - Autocorrelation Function. Measures the correlation of the
signal with itself lagged over a pool of possible lag values.
PACF - Partial Autocorrelation Function. Measures the ACF after
removing linear dependences
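In code, the stationarity check and the lag-d differencing described above typically look like the sketch below (assuming statsmodels is installed; the input series file is hypothetical):

```python
# Hedged sketch: stationarity check and differencing; assumes statsmodels is installed
# and that "boardings.txt" (hypothetical) holds the hourly boardings series.
import numpy as np
from statsmodels.tsa.stattools import adfuller

y = np.loadtxt("boardings.txt")

adf_stat, p_value, *_ = adfuller(y)          # Augmented Dickey-Fuller test
print("ADF p-value:", p_value)               # a small p-value supports stationarity

y_diff = np.diff(y, n=1)                     # first-order differencing (the "d" in ARIMA)
y_seasonal = y[24:] - y[:-24]                # differencing with lag 24 (one day), illustrative
print("ADF p-value after differencing:", adfuller(y_diff)[1])
```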
M1 - Linear Regression and Time-series Analysis | Time Series Analysis
AutoRegressive and Moving Average Models
Autoregressive model of order p, i.e. AR(p)
yt = δ + φ1 yt−1 + φ2 yt−2 + ... + φp yt−p + εt
yt is a weighted average of its p previous values
Moving Average model of order q, i.e. MA(q)
yt = δ − θ1 εt−1 − θ2 εt−2 − ... − θq εt−q + εt
yt is a weighted average of its q previous error terms ε∗
Are these equations somehow familiar to you!?
M1 - Linear Regression and Time-series Analysis | Time Series Analysis
ARIMA models
ARIMA stands for AutoRegressive Integrated Moving Average; these
models are used to estimate time series models [Box and Pierce, 1970]
ARIMA models are linear regression models that use lagged values
of the dependent variable and/or a random disturbance term as
explanatory variables
They rely heavily on the autocorrelation patterns of the data, i.e. they
assume that the signal repeats itself somehow
An ARIMA model is defined by three values (p, d, q), where p denotes
its AR component, q the MA component and d the order of differencing
needed to make the series stationary. Any of those values can be 0.
The random disturbances εt are assumed to be Gaussian, i.e.
εt ∼ N(0, σ²)
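For reference, fitting an ARIMA(p, d, q) of this kind usually looks like the sketch below (statsmodels assumed; the series file and the (3, 0, 1) order are illustrative, matching the example that follows):

```python
# Hedged sketch: fitting an ARIMA(p, d, q); assumes statsmodels is installed and that
# "boardings.txt" (hypothetical) holds the hourly boardings series.
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

y = np.loadtxt("boardings.txt")

fitted = ARIMA(y, order=(3, 0, 1)).fit()   # (p, d, q) as identified in the example below
print(fitted.summary())                    # estimated φ∗ (AR) and θ∗ (MA) weights
print(fitted.forecast(steps=1))            # one-step-ahead forecast
```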
M1 - Linear Regression and Time-series Analysis | Time Series Analysis
A Non-Seasonal ARIMA example in a nutshell
Stationarity: Yes (but hard to be sure by visual inspection alone)
Trend: the signal repeats itself every 24h, with two peaks, around 10h
and 24h
How do we estimate the ARIMA model in place, i.e. (p, d, q)?
[Time series plot of the Monday-Friday boardings, shown again for order identification]
M1 - Linear Regression and Time-series Analysis | Time Series Analysis
A Non-Seasonal ARIMA example in a nutshell
Stationarity: if the series is stationary, the ACF should tail off abruptly.
Otherwise, it will tail off smoothly, going nowhere...
Series is Stationary → d = 0
ACF cuts off after p = 3 lags; PACF cuts off after q = 1 lags
Model (p, d, q)=(3, 0, 1)
[ACF and PACF plots of the series over lags 0-40]
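These ACF/PACF diagnostics are typically produced as in the sketch below (statsmodels and matplotlib assumed; the series file is hypothetical); the (p, d, q) choice then follows the cut-off reading described above:

```python
# Hedged sketch: producing the ACF/PACF diagnostics; assumes statsmodels and
# matplotlib are installed, with "boardings.txt" a hypothetical series file.
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

y = np.loadtxt("boardings.txt")

fig, axes = plt.subplots(2, 1)
plot_acf(y, lags=40, ax=axes[0])    # abrupt tail-off supports stationarity (d = 0)
plot_pacf(y, lags=40, ax=axes[1])   # cut-off lags guide the choice of the remaining orders
plt.show()
```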
M1 - Linear Regression and Time-series Analysis | Time Series Analysis
A Non-Seasonal ARIMA example in a nutshell
ARIMA models are very powerful for short-term forecasting
horizons. Looking ahead more than one term means re-using the
predictions as if they were true past values in order to estimate further outputs.
Our exercise is to predict the bus demand for the last day, i.e. Friday.
We did so using a rolling horizon of one hour (we predicted
one step ahead on each iteration, adding the last true output value
to the training series used to estimate the next weight set, i.e. φ∗, θ∗)
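A minimal sketch of that rolling one-step-ahead procedure (statsmodels assumed; the 24-hour test split and the series file are illustrative):

```python
# Hedged sketch: rolling one-step-ahead ARIMA forecasting of the last day, re-estimating
# the weights at every step; assumes statsmodels and a hypothetical "boardings.txt".
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

y = np.loadtxt("boardings.txt")          # hourly series, Monday through Friday
n_test = 24                              # forecast Friday, one hour at a time
history = list(y[:-n_test])              # Monday-Thursday as the initial training series
predictions = []

for t in range(n_test):
    fitted = ARIMA(history, order=(3, 0, 1)).fit()   # re-estimate φ∗, θ∗ on the current history
    predictions.append(fitted.forecast(steps=1)[0])  # one-step-ahead prediction
    history.append(y[len(y) - n_test + t])           # then include the true value and roll on

print(predictions)
```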
M1 - Linear Regression and Time-series Analysis | Time Series Analysis
A Non-Seasonal ARIMA example in a nutshell
The forecasting result is very close to the real series. It finds the
peaks/valleys, but it under/overestimates their true values...
[Time series plot: real vs. forecasted boardings for Friday]
M1 - Linear Regression and Time-series Analysis | Time Series Analysis
Other Relevant Methods worth exploring
Auto-ARIMA Learning Model [Hyndman and Khandakar, 2007]
Seasonal ARIMA [Box et al., 1976]
Holt-Winters Exponential Smoothing [Goodwin et al., 2010]
Inhomogeneous Poisson Models [Lee et al., 1991]
GARCH models [Engle, 1982]
M1 - Linear Regression and Time-series Analysis | Time Series Analysis
Lessons Learned
White-box Regression methods are easier to understand as they
provide a direct relationship between input-output values
Linear Regression methods can be powerful inference tools
Bayesian Statistics can help by injecting prior knowledge about the target
variable into our learning model
The adequate choice of a loss function for a given problem can
enhance the predictive power of a method
Time Series Analysis methods can be used when the samples' order
is relevant and time-dependent
They are easy to understand and especially powerful for predicting over
short-term horizons
A brief study of the problem and the selection of the proper
combination of ML components can ease your daily life.
M1 - Linear Regression and Time-series Analysis | Time Series Analysis
Box, G., Jenkins, G., and Reinsel, G. (1976).
Time series analysis.
Holden-Day, San Francisco.
Box, G. and Pierce, D. (1970).
Distribution of residual autocorrelations in autoregressive-integrated
moving average time series models.
Journal of the American Statistical Association, 65(332):1509–1526.
Box, G. E. and Tiao, G. C. (2011).
Bayesian inference in statistical analysis, volume 40.
John Wiley & Sons.
Breiman, L., Friedman, J., Stone, C., and Olshen, R. (1984).
Classification and regression trees.
CRC press.
Cleveland, W. S. (1981).
Lowess: A program for smoothing scatterplots by robust locally
weighted regression.
M1 - Linear Regression and Time-series Analysis | Time Series Analysis
The American Statistician, 35(1):54–54.
Cover, T. and Hart, P. (1967).
Nearest neighbor pattern classification.
IEEE Transactions on Information Theory, 13(1):21–27.
Engle, R. F. (1982).
Autoregressive conditional heteroscedasticity with estimates of the
variance of United Kingdom inflation.
Econometrica: Journal of the Econometric Society, pages 987–1007.
Goodwin, P. et al. (2010).
The Holt-Winters approach to exponential smoothing: 50 years old and
going strong.
Foresight, 19:30–33.
Hestenes, M. R. and Stiefel, E. (1952).
Methods of conjugate gradients for solving linear systems.
Hyndman, R. and Khandakar, Y. (2007).
Automatic time series forecasting: the forecast package for R.
M1 - Linear Regression and Time-series Analysis | Time Series Analysis
Technical report, Monash University, Department of Econometrics and
Business Statistics.
Lee, S., Wilson, J. R., and Crawford, M. M. (1991).
Modeling and simulation of a nonhomogeneous Poisson process
having cyclic behavior.
Communications in Statistics-Simulation and Computation,
20(2-3):777–809.
Legendre, A. (1805).
Nouvelles méthodes pour la détermination des orbites des comètes.
Number 1. F. Didot.
Nadaraya, E. (1964).
On estimating regression.
Theory of Probability & Its Applications, 9(1):141–142.
Strutz, T. (2010).
Data fitting and uncertainty.
A practical introduction to weighted least squares and beyond.
Vieweg+ Teubner.