Regression	in	SystemML
Alexandre	Evfimievski
1
Linear	Regression
• INPUT:		Records	(x1,	y1),	(x2,	y2),	…,	(xn,	yn)
– Each	xi is	m-dimensional:	xi1,	xi2,	…,	xim
– Each	yi is	1-dimensional
• Want	to	approximate	yi as	a	linear	combination	of	xi-entries
– yi ≈		β1xi1 +	β2xi2 +	…	+	βmxim
– Case	m	=	1:			yi ≈	β1xi1 (	Note:		x	=	0		maps	to		y	=	0	)
• Intercept:		a	“free	parameter”	for	default	value	of	yi
– yi ≈		β1xi1 +	β2xi2 +	…	+	βmxim +	βm+1
– Case	m	=	1:			yi ≈	β1xi1 +	β2
• Matrix	notation:		Y	≈	Xβ,		or		Y	≈	(X |1) β if	with	intercept
– X		is		n	× m,		Y		is		n	× 1,		β is		m	× 1		or		(m+1)	× 1
2
Linear	Regression:	Least	Squares
• How	to	aggregate	errors:		yi – (β1xi1 +	β2xi2 +	…	+	βmxim)		?
– What’s	worse:		many	small	errors,		or	a	few	big	errors?
• Sum	of	squares:		∑i≤n (yi – (β1xi1 +	β2xi2 +	…	+	βmxim))2 →		min
– A	few	big	errors	are	much	worse!		We	square	them!
• Matrix	notation:		(Y	– Xβ)T (Y	– Xβ)		→		min
• Good	news:		easy	to	solve	and	find	the	β’s
• Bad	news:		too	sensitive	to	outliers!
3
Linear	Regression:	Direct	Solve
• (Y	– Xβ)T (Y	– Xβ)		→		min
• YT	Y		– YT	(Xβ)		– (Xβ)T	Y		+		(Xβ)T	(Xβ)		→		min
• ½	βT	(XTX) β – βT	(XTY)		→		min
• Take	the	gradient	and	set	it	to	0:			(XTX) β – (XTY)		=		0
• Linear	equation:		(XTX) β =		XTY;			Solution:		β =		(XTX)–1	(XTY)
A = t(X) %*% X;
b = t(X) %*% y;
. . .
. . .
beta_unscaled = solve (A, b);
4
Computation		of		XTX
• Input	(n	× m)-matrix		X		is	often	huge	and	sparse
– Rows X[i, ] make up n records, often n >> 10^6
– Columns		X[,	j]		are	the	features
• Matrix		XTX		is	(m	× m)	and	dense
– Cells:		(XTX)	[j1,	j2]		=		∑ i≤n X[i,	j1]	*	X[i,	j2]
– Part	of	covariance	between	features		#	j1 and		#	j2 across	all	records
– m		could	be	small	or	large
• If	m	≤	1000,		XTX		is	small	and	“direct	solve”	is	efficient…
– …	as	long	as		XTX		is	computed	the	right	way!
– …	and	as	long	as		XTX		is	invertible	(no	linearly	dependent	features)
5
Computation		of		XTX
• Naïve	computation:
a) Read	X	into	memory
b) Copy	it	and	rearrange	cells	into	the	transpose
c) Multiply	two	huge	matrices,	XT and	X
• There	is	a	better	way:		XTX		=		∑i≤n X[i,	]T X[i,	]				(outer	product)
– For	all		i =	1,	…,	n		in	parallel:
a) Read	one	row		X[i,	]
b) Compute	(m	× m)-matrix:		Mi	[j1,	j2]		=		X[i,	j1]	*	X[i,	j2]
c) Aggregate:		M	=	M	+	Mi
• Extends	to		(XTX) v		and		XT	diag(w) X,		used	in	other	scripts:
– (XTX) v		=		∑i≤n (∑ j≤m X[i,	j]v[j]) *	X[i,	]T
– XT	diag(w)X		=	∑ i≤n wi *	X[i,	]T X[i,	]
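As a rough DML sketch (variable names X, v, w assumed here), both extensions can be written so that the sparse X is multiplied first; the conjugate-gradient snippets later in the deck use exactly this pattern:
XtXv = t(X) %*% (X %*% v);           # (XTX) v  without forming the dense m x m matrix XTX
XtWX = t(X) %*% (diag (w) %*% X);    # XT diag(w) X  for an n x 1 weight vector w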
6
Conjugate	Gradient
• What	if		XTX		is	too	large,	m	>>	1000?
– Dense		XTX		may	take	far	more	memory	than	sparse		X
• Full		XTX		not	needed	to	solve		(XTX) β =		XTY
– Use	iterative	method
– Only	evaluate		(XTX)v		for	certain	vectors		v
• Ex.:	Gradient	Descent	for		f (β)		=		½	βT	(XTX) β – βT	(XTY)	
– Start	with	any		β =	β0
– Take the gradient:  r  =  ∇f(β)  =  (XTX) β – (XTY)        (also, the residual)
– Find	number		a to	minimize		f(β + a ·r):			a =		– (rT	r)	/	(rT	XTX r)
– Update:		βnew		←		β + a·r
• But	gradient	is	too	local
– And	“forgetful”
7
Conjugate	Gradient
• PROBLEM:		Gradient	takes	a	very	similar	direction	many	times
• Enforce	orthogonality	to	prior	directions?
– Take	the	gradient:		r		=		(XTX) β – (XTY)
– Subtract	prior	directions:		p(k) =		r		– λ1p(1) – …	– λk-1p(k-1)
• Pick		λi to	ensure		(p(k) ·	p(i))		=		0			???
– Find	number		a(k) to	minimize		f(β + a(k)	·p(k)),		etc	…
• STILL,	PROBLEMS:
– Value		a(k) does	NOT	minimize		f(a(1)	·p(1)		+	…	+ a(k)	·p(k)		+	…	+ a(m)	·p(m))
– Keep	all	prior	directions		p(1),	p(2),	…	,	p(k)	?		That’s	a	lot!
• SOLUTION:		Enforce	Conjugacy
– Conjugate	vectors:			uT	(XTX)	v		=		0,		instead	of		uT	v		=		0
• Matrix		XTX		acts	as	the	“metric”	in	distorted	space
– This	does	minimize		f(a(1)	·p(1)		+	…	+ a(k)	·p(k)		+	…	+ a(m)	·p(m))
• And,		only	need		p(k-1) and		r(k) to	compute		p(k)
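For reference, the recurrences that the snippet on the next slide implements, written here as a math block in LaTeX (A stands for XTX, or XTX + λI with regularization):

a_k = \frac{r_k^{\top} r_k}{p_k^{\top} A\, p_k}, \qquad
\beta_{k+1} = \beta_k + a_k\, p_k, \qquad
r_{k+1} = r_k + a_k\, A\, p_k, \qquad
p_{k+1} = -\,r_{k+1} + \frac{r_{k+1}^{\top} r_{k+1}}{r_k^{\top} r_k}\; p_k

so that p_{k+1}^T A p_k = 0 holds by construction, and conjugacy to all earlier directions follows.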
8
Conjugate	Gradient
• Algorithm,	step	by	step
i = 0; beta = matrix (0, ...);                # Initially:  β = 0
r = - t(X) %*% y;                             # Residual & gradient:  r = (XTX) β – (XTY)
p = - r;                                      # Direction for β:  negative gradient
norm_r2 = sum (r ^ 2);                        # Norm of the residual error  =  rT r
norm_r2_target = norm_r2 * tolerance ^ 2;     # Desired norm of the residual error
while (i < mi & norm_r2 > norm_r2_target)
{                                             # We have:  p is the next direction for β
    q = t(X) %*% (X %*% p) + lambda * p;      # q  =  (XTX) p
    a = norm_r2 / sum (p * q);                # a  =  rT r / pT (XTX) p   minimizes  f(β + a·p)
    beta = beta + a * p;                      # Update:  βnew ← β + a·p
    r = r + a * q;                            # rnew ← (XTX)(β + a·p) – (XTY)  =  r + a·(XTX) p
    old_norm_r2 = norm_r2;
    norm_r2 = sum (r ^ 2);                    # Update the norm of the residual error  =  rT r
    p = -r + (norm_r2 / old_norm_r2) * p;     # Update direction: (1) take the negative gradient;
                                              # (2) enforce conjugacy with the previous direction
    i = i + 1;                                # Conjugacy to all older directions is automatic!
}
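A minimal setup sketch for trying the loop above on synthetic data (all names and values below are assumed, not taken from the original script; the matrix(0, ...) on the first line would be an ncol(X) x 1 column of zeros):
X = rand (rows = 10000, cols = 100, sparsity = 0.01);    # synthetic sparse feature matrix
y = rand (rows = 10000, cols = 1);                       # synthetic response column
lambda = 0.000001;       # L2-regularization constant (see the Degeneracy slide)
tolerance = 0.000001;    # relative tolerance on the residual norm
mi = ncol (X);           # at most m iterations are needed in exact arithmetic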
9
Degeneracy	and	Regularization
• PROBLEM:		What	if		X		has	linearly	dependent	columns?
– Cause:		recoding	categorical	features,	adding	composite	features
– Then		XTX		is	not	a	“metric”:		exists		ǁpǁ	>	0		such	that		pT	(XTX)	p		=		0
– In the CG step  a = rT r / pT (XTX) p :  Division by Zero!
• In	fact,	then		Least	Squares		has		∞		solutions
– Most	of	them	have		HUGE		β-values
• Regularization:		Penalize		β with	larger	values
– L2-Regularization:			(Y	– Xβ)T (Y	– Xβ)		+		λ·βT	β →		min
– Replace		XTX		with		XTX		+		λI
– Pick		λ <<		diag(XTX),		refine	by	cross-validation
– Do	NOT	regularize	intercept
• CG: q = t(X) %*% (X %*% p) + lambda * p;
10
Shifting	and	Scaling	X
• PROBLEM:		Features	have	vastly	different	range:
– Examples:		[0,	1];		[2010,	2015];		[$0.01,		$1	Billion]
• Each		βi in		Y	≈	Xβ has	different	size	&	accuracy?
– Regularization			λ·βT	β also	range-dependent?
• SOLUTION:		Scale	&	shift	features	to	mean	=	0,	variance	=	1
– Needs	intercept:		Y	≈	(X| 1)β
– Equivalently:		(Xnew |1)		=		(X |1)		%*% SST				“Shift-Scale	Transform”
• BUT:		Sparse		X		becomes		Dense		Xnew …
• SOLUTION:			(Xnew |1)	 %*% M		=		(X |1)	 %*% (SST	 %*% M)
– Extends	to		XTX		and	other	X-products
– Further	optimization:		SST		has	special	shape
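A sketch of that special shape (assuming column means μj and scale factors sj, so that Xnew[, j] = (X[, j] – μj)·sj; the scale_X / shift_X vectors in the snippet on the next slide play roughly the roles of s and –μ∘s), written in LaTeX:

(X \mid 1)\,\mathrm{SST} = (X_{\mathrm{new}} \mid 1),
\qquad
\mathrm{SST} =
\begin{pmatrix} \operatorname{diag}(s) & 0 \\ -(\mu \circ s)^{\top} & 1 \end{pmatrix}

Because SST has only about 2m + 1 nonzero entries, folding it into products such as XTX (as the snippet on the next slide does on both sides of A) is cheap compared to recomputing dense products with Xnew.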
11
Shifting	and	Scaling	X
– Linear Regression Direct Solve code snippet example:
A = t(X) %*% X;
b = t(X) %*% y;
if (intercept_status == 2) {
A = t(diag (scale_X) %*% A + shift_X %*% A [m_ext, ]);
A = diag (scale_X) %*% A + shift_X %*% A [m_ext, ];
b = diag (scale_X) %*% b + shift_X %*% b [m_ext, ];
}
A = A + diag (lambda);
beta_unscaled = solve (A, b);
if (intercept_status == 2) {
beta = scale_X * beta_unscaled;
beta [m_ext, ] = beta [m_ext, ] + t(shift_X) %*% beta_unscaled;
} else {
beta = beta_unscaled;
}
12
Regression	in	Statistics
• Model:		Y	=	Xβ* +	ε where		ε is	a	random	vector
– There	exists	a	“true”	β*
– Each yi is Gaussian with mean μi = Xi β* and variance σ2 (so each εi has mean 0)
• Likelihood	maximization	to	estimate		β*
– Likelihood:		ℓ(Y	|	X,	β,	σ)		=		∏i ≤	n C(σ)·exp(– (yi – Xi	β)2 /	2σ2)
– Log	ℓ(Y	|	X,	β,	σ)		=		n·c(σ)		– ∑i ≤	n (yi – Xi	β)2 /	2σ2
– Maximum	likelihood	over	β =		Least	Squares
• Why	do	we	need	statistical	view?
– Confidence	intervals	for	parameters
– Goodness	of	fit	tests
– Generalizations:	replace	Gaussian	with	another	distribution
13
Maximum	Likelihood	Estimator
• In	each		(xi	,	yi)		let		yi have	distribution		ℓ(yi |	xi	,	β,	φ)
– Records	are	mutually	independent	for		i =	1,	…,	n
• Estimator	for		β is	a	function		f(X,	Y)
– Y	is	random		→			f(X,	Y)	random
– Unbiased	estimator:		for	all	β,	mean		E	f(X,	Y)	=	β
• Maximum	likelihood	estimator
– MLE (X,	Y)		=		argmaxβ ∏i ≤	n ℓ(yi |	xi	,	β,	φ)
– Asymptotically	unbiased:		E	MLE (X,	Y)	→	β as		n	→	∞
• Cramér-Rao	Bound
– For	unbiased	estimators,		Var f(X,	Y)		≥		FI(X,	β,	φ) –1
– Fisher	information:		FI(X,	β,	φ)		=		– EY Hessianβ log	ℓ(Y| X,	β,	φ)
– For	MLE:		Var (MLE (X,	Y)) →		FI(X,	β,	φ)–1 as		n	→	∞
14
Variance	of	M.L.E.
• Cramér-Rao	Bound	is	a	simple	way	to	estimate	variance	of	
predicted	parameters	(for	large	n):
1. Maximize		log	ℓ(Y |X,	β,	φ)		to	estimate		β
2. Compute	the	Hessian	(2nd derivatives)	of		log	ℓ(Y |X,	β,	φ)
3. Compute	“expected”	Hessian:		FI		=		– EY Hessian
4. Invert		FI		as	a	matrix:		get		FI–1
5. Use		FI–1 as	approx.	covariance	matrix	for	the	estimated		β
• For	linear	regression:
– Log	ℓ(Y	|	X,	β,	σ)		=		n·c(σ)		– ∑i ≤	n (yi – Xi	β)2 /	2σ2
– Hessian		=		–(1/σ2)·XTX;				FI		=		(1/σ2)·XTX
– Cov β ≈		σ2 ·(XTX) –1	;				Var βj ≈		σ2 ·diag((XTX) –1) j
15
Variance	of		Y		given		X
• MLE for variance of Y  =  1/n · ∑ i≤n (yi – yavg)2
– To	make	it	unbiased,	replace		1/n		with		1/(n	– 1)
• Variance	of		ε in		Y	=	Xβ* +	ε is	residual	variance
– Estimator	for	Var(ε)		=		1/(n	– m	– 1)	·	∑i ≤	n (yi – Xi	β)2
• Good	regression	must	have:		Var(ε)		<<		Var(Y)
– “Explained”	variance		=		Var(Y)		– Var(ε)
• R-squared:		estimate		1	– Var(ε)	/	Var(Y)		to	test	fitness:
– R2plain  =  1 – (∑ i≤n (yi – Xiβ)2) / (∑ i≤n (yi – yavg)2)
– R2adj.  =  1 – (∑ i≤n (yi – Xiβ)2) / (∑ i≤n (yi – yavg)2) · (n – 1) / (n – m – 1)
• Pearson	residual:		ri =		(yi – Xi	β)	/	Var(ε)1/2
– Should	be	approximately	Gaussian	with	mean	0	and	variance	1
– Can	use	in	another	fitness	test		(more	on	tests	later)
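A small DML sketch of these fit statistics (variable names assumed; uses the plug-in residual-variance estimator from above, without an intercept column for simplicity):
n = nrow (X);  m = ncol (X);
y_hat    = X %*% beta;                          # fitted values
ss_res   = sum ((y - y_hat) ^ 2);               # residual sum of squares
ss_tot   = sum ((y - sum (y) / n) ^ 2);         # total sum of squares around the mean
R2_plain = 1 - ss_res / ss_tot;
R2_adj   = 1 - (ss_res / ss_tot) * (n - 1) / (n - m - 1);
var_eps  = ss_res / (n - m - 1);                # estimator of Var(eps)
r_pearson = (y - y_hat) / sqrt (var_eps);       # should look approximately N(0, 1)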
16
LinReg	Scripts:	Inputs
# INPUT PARAMETERS:
# --------------------------------------------------------------------------------------------
# NAME TYPE DEFAULT MEANING
# --------------------------------------------------------------------------------------------
# X String --- Location (on HDFS) to read the matrix X of feature vectors
# Y String --- Location (on HDFS) to read the 1-column matrix Y of response values
# B String --- Location to store estimated regression parameters (the betas)
# O String " " Location to write the printed statistics; by default is standard output
# Log String " " Location to write per-iteration variables for log/debugging purposes
# icpt Int 0 Intercept presence, shifting and rescaling the columns of X:
# 0 = no intercept, no shifting, no rescaling;
# 1 = add intercept, but neither shift nor rescale X;
# 2 = add intercept, shift & rescale X columns to mean = 0, variance = 1
# reg Double 0.000001 Regularization constant (lambda) for L2-regularization; set to nonzero
# for highly dependent/sparse/numerous features
# tol Double 0.000001 Tolerance (epsilon); conjugate gradient procedure terminates early if
# L2 norm of the beta-residual is less than tolerance * its initial norm
# maxi Int 0 Maximum number of conjugate gradient iterations, 0 = no maximum
# fmt String "text" Matrix output format for B (the betas) only, usually "text" or "csv"
# --------------------------------------------------------------------------------------------
# OUTPUT: Matrix of regression parameters (the betas); its size depends on the icpt input value:
# OUTPUT SIZE: OUTPUT CONTENTS: HOW TO PREDICT Y FROM X AND B:
# icpt=0: ncol(X) x 1 Betas for X only Y ~ X %*% B[1:ncol(X), 1], or just X %*% B
# icpt=1: ncol(X)+1 x 1 Betas for X and intercept Y ~ X %*% B[1:ncol(X), 1] + B[ncol(X)+1, 1]
# icpt=2: ncol(X)+1 x 2 Col.1: betas for X & intercept Y ~ X %*% B[1:ncol(X), 1] + B[ncol(X)+1, 1]
# Col.2: betas for shifted/rescaled X and intercept
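Following the table above, a small DML prediction sketch (B, X, icpt assumed to be already loaded):
if (icpt == 0) {
    y_pred = X %*% B;
} else {
    # column 1 of B holds the betas in the original feature space; the last row is the intercept
    y_pred = X %*% B [1 : ncol (X), 1] + as.scalar (B [ncol (X) + 1, 1]);
}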
17
LinReg	Scripts:	Outputs
# In addition, some regression statistics are provided in CSV format, one comma-separated
# name-value pair per each line, as follows:
#
# NAME MEANING
# -------------------------------------------------------------------------------------
# AVG_TOT_Y Average of the response value Y
# STDEV_TOT_Y Standard Deviation of the response value Y
# AVG_RES_Y Average of the residual Y - pred(Y|X), i.e. residual bias
# STDEV_RES_Y Standard Deviation of the residual Y - pred(Y|X)
# DISPERSION GLM-style dispersion, i.e. residual sum of squares / # deg. fr.
# PLAIN_R2 Plain R^2 of residual with bias included vs. total average
# ADJUSTED_R2 Adjusted R^2 of residual with bias included vs. total average
# PLAIN_R2_NOBIAS Plain R^2 of residual with bias subtracted vs. total average
# ADJUSTED_R2_NOBIAS Adjusted R^2 of residual with bias subtracted vs. total average
# PLAIN_R2_VS_0 * Plain R^2 of residual with bias included vs. zero constant
# ADJUSTED_R2_VS_0 * Adjusted R^2 of residual with bias included vs. zero constant
# -------------------------------------------------------------------------------------
# * The last two statistics are only printed if there is no intercept (icpt=0)
#
# The Log file, when requested, contains the following per-iteration variables in CSV
# format, each line containing triple (NAME, ITERATION, VALUE) with ITERATION = 0 for
# initial values:
#
# NAME MEANING
# -------------------------------------------------------------------------------------
# CG_RESIDUAL_NORM L2-norm of Conj.Grad.residual, which is A %*% beta - t(X) %*% y
# where A = t(X) %*% X + diag (lambda), or a similar quantity
# CG_RESIDUAL_RATIO Ratio of current L2-norm of Conj.Grad.residual over the initial
# -------------------------------------------------------------------------------------
18
Caveats
• Overfitting:		β reflect	individual	records	in		X,	not	distribution
– Typically,	too	few	records	(small	n)	or	too	many	features	(large	m)
– To	detect,	use	cross-validation
– To	mitigate,	select	fewer	features;		regularization	may	help	too
• Outliers:		Some	records	in	X	are	highly	abnormal
– They	badly	violate	distribution,	or	have	very	large	cell-values
– Check MIN and MAX of Y,  X-columns,  Xiβ,  and  ri2 = (yi – Xiβ)2 / Var(ε)
– To	mitigate,	remove	outliers,	or	change	distribution	or	link	function
• Interpolation	vs.	extrapolation
– A	model	trained	on	one	kind	of	data	may	not	carry	over	to	another	
kind	of	data;		the	past	may	not	predict	the	future
– Great	research	topic!
19
Generalized	Linear	Models
• Linear	Regression:		Y = Xβ* +	ε
– Each		yi is	Normal(μi ,	σ2)		where	mean		μi =	Xi	β*
– Variance(yi)		=		σ2 =		constant
• Logistic	Regression:
– Each		yi is	Bernoulli(μi)		where	mean		μi =	1	/	(1	+	exp	(– Xi	β*))
– Prob [yi =	1]		=		μi ,		Prob [yi =	0]		=		1	– μi ,		mean		=		probability	of	1
– Variance(yi)		=		μi (1	– μi)
• Poisson	Regression:
– Each		yi is	Poisson(μi)		where	mean		μi =	exp(Xi	β*)
– Prob [yi =	k]		=		(μi)k	exp(– μi)/ k!			for		k	=	0,	1,	2,	…
– Variance(yi)		=		μi
• Only in Linear Regression do we add an error εi to the mean μi
20
Generalized	Linear	Models
• GLM	Regression:
– Each		yi has	distribution		=		exp{(yi ·θi – b(θi))/a + c(yi ,	a)}
– Canonical	parameter θi represents	the	mean:			μi =		bʹ(θi)
– Link	function connects		μi and		Xi	β*	:			Xi	β* =		g(μi),			μi =		g –1	(Xi	β*)
– Variance(yi)		=		a ·bʺ(θi)		
• Example:		Linear	Regression	as	GLM
– C(σ)·exp(– (yi – Xi	β)2 /	2σ2)		=		exp{(yi ·θi – b(θi))/a + c(yi ,	a)}
– θi =		μi =		Xi	β;				b(θi)		=		(Xi	β)2	/ 2;				a		=		σ2 =		variance
• Link function = identity;   c(yi, a)  =  – yi2 / 2σ2  +  log C(σ)
• Example:		Logistic	Regression	as	GLM
– (μi )y[i] (1	– μi)1	– y[i] =		exp{yi ·	log(μi)		– yi ·	log(1	– μi)		+		log(1	– μi)}
=		exp{(yi ·θi – b(θi))/ a + c(yi ,	a)}
– θi =		log(μi / (1	– μi))		=		Xi	β;				b(θi)		=		– log(1	– μi)		=		log(1	+	exp(θi))
• Link	function		=		log (μ / (1	– μ))	;				Variance		=		μ(1	– μ)	;				a	=	1
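As a quick check of the identities above (a worked step in LaTeX, not on the original slide), differentiating b(θ) = log(1 + e^θ) recovers the stated mean and variance:

b'(\theta) = \frac{e^{\theta}}{1 + e^{\theta}} = \mu,
\qquad
b''(\theta) = \frac{e^{\theta}}{(1 + e^{\theta})^{2}} = \mu\,(1 - \mu) = \operatorname{Var}(y_i) \quad (a = 1).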
21
Generalized	Linear	Models
• GLM	Regression:
– Each		yi has	distribution		=		exp{(yi ·θi		– b(θi))/a + c(yi	,	a)}
– Canonical	parameter θi represents	the	mean:			μi =		bʹ(θi)
– Link	function connects		μi and		Xi	β*	:			Xi	β* =		g(μi),			μi =		g –1	(Xi	β*)
– Variance(yi)		=		a ·bʺ(θi)
• Why	θi	?		What	is	b(θi)?
– θi makes	formulas	simpler,	stands	for		μi (no	big	deal)
– b(θi)		defines	what	distribution	it	is:		linear,		logistic,		Poisson,		etc.
– b(θi)		connects	mean	with	variance:			Var(yi)		=		a·bʺ(θi),			μi =		bʹ(θi)
• What	is	link	function?
– You	choose	it to	link		μi with	your	features		β1xi1 +	β2xi2 +	…	+	βmxim
– Additive	effects:		μi =		Xi	β;				Multiplicative	effects:		μi =		exp(Xi	β)
Bayes	law	effects:		μi =	1	/	(1	+	exp	(– Xi	β));				Inverse:		μi =	1	/	(Xi	β)
– Xi	β has	range	(– ∞,	+∞),		but		μi may	range	in		[0,	1],		[0,	+∞)		etc.
22
GLMs	We	Support
• We	specify	GLM	by:
– Mean	to	variance	connection
– Link	function	(mean	to	feature	sum	connection)
• Mean-to-variance	for	common	distributions:
– Var (yi)		=		a ·(μi)0 =		σ2	:				Linear	/	Gaussian
– Var (yi)		=		a ·μi	(1	– μi):				Logistic	/	Binomial
– Var (yi)		=		a ·(μi)1	:				Poisson
– Var (yi)		=		a ·(μi)2	:				Gamma
– Var (yi)		=		a ·(μi)3	:				Inverse	Gaussian
• We	support	two	types:		Power	and	Binomial
– Var (yi)		=		a ·(μi)α :				Power,	for	any		α
– Var (yi)		=		a ·μi	(1	– μi):				Binomial
23
GLMs	We	Support
• We	specify	GLM	by:
– Mean	to	variance	connection
– Link	function	(mean	to	feature	sum	connection)
Supported	link	functions
• Power:		Xi	β =		(μi)s where		s	=	0		stands	for		Xi	β =		log	(μi)
– Examples:		identity,		inverse,		log,		square	root
• Link	functions	used	in	binomial	/	logistic	regression:
– Logit,		Probit,		Cloglog,		Cauchit
– Link		Xi	β-range		(– ∞,	+∞)		with		μi-range		(0,	1)
– Differ	in	tail	behavior
• Canonical	link	function:
– Makes		Xi	β =		the	canonical	parameter θi	,		i.e.	sets		μi =		bʹ(Xi	β)
– Power	link		Xi	β =		(μi)1	– α if		Var	=	a·(μi)α ;		Logit	link	for	binomial
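The last bullet follows from μ = bʹ(θ): with a = 1 and a power variance V(μ) = μ^α, the canonical parameter satisfies dθ/dμ = 1/V(μ), so (a short derivation in LaTeX, constants dropped):

\frac{d\theta}{d\mu} = \mu^{-\alpha}
\;\Longrightarrow\;
\theta \propto
\begin{cases} \mu^{\,1-\alpha}, & \alpha \neq 1, \\ \log \mu, & \alpha = 1, \end{cases}

i.e. the canonical link is the power link with exponent 1 – α (log in the Poisson case), and the logit for the binomial family.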
24
GLM	Script	Inputs
# NAME TYPE DEFAULT MEANING
# ---------------------------------------------------------------------------------------------
# X String --- Location to read the matrix X of feature vectors
# Y String --- Location to read response matrix Y with either 1 or 2 columns:
# if dfam = 2, Y is 1-column Bernoulli or 2-column Binomial (#pos, #neg)
# B String --- Location to store estimated regression parameters (the betas)
# fmt String "text" The betas matrix output format, such as "text" or "csv"
# O String " " Location to write the printed statistics; by default is standard output
# Log String " " Location to write per-iteration variables for log/debugging purposes
# dfam Int 1 Distribution family code: 1 = Power, 2 = Binomial
# vpow Double 0.0 Power for Variance defined as (mean)^power (ignored if dfam != 1):
# 0.0 = Gaussian, 1.0 = Poisson, 2.0 = Gamma, 3.0 = Inverse Gaussian
# link Int 0 Link function code: 0 = canonical (depends on distribution),
# 1 = Power, 2 = Logit, 3 = Probit, 4 = Cloglog, 5 = Cauchit
# lpow Double 1.0 Power for Link function defined as (mean)^power (ignored if link != 1):
# -2.0 = 1/mu^2, -1.0 = reciprocal, 0.0 = log, 0.5 = sqrt, 1.0 = identity
# yneg Double 0.0 Response value for Bernoulli "No" label, usually 0.0 or -1.0
# icpt Int 0 Intercept presence, X columns shifting and rescaling:
# 0 = no intercept, no shifting, no rescaling;
# 1 = add intercept, but neither shift nor rescale X;
# 2 = add intercept, shift & rescale X columns to mean = 0, variance = 1
# reg Double 0.0 Regularization parameter (lambda) for L2 regularization
# tol Double 0.000001 Tolerance (epsilon)
# disp Double 0.0 (Over-)dispersion value, or 0.0 to estimate it from data
# moi Int 200 Maximum number of outer (Newton / Fisher Scoring) iterations
# mii Int 0 Maximum number of inner (Conjugate Gradient) iterations, 0 = no maximum
# ---------------------------------------------------------------------------------------------
# OUTPUT: Matrix beta, whose size depends on icpt:
# icpt=0: ncol(X) x 1; icpt=1: (ncol(X) + 1) x 1; icpt=2: (ncol(X) + 1) x 2
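For orientation, a few example settings implied by the table above (illustrative combinations only):
# Linear / Gaussian, identity link:     dfam=1  vpow=0.0  link=1  lpow=1.0
# Poisson with log link (canonical):    dfam=1  vpow=1.0  link=1  lpow=0.0
# Gamma with log link:                  dfam=1  vpow=2.0  link=1  lpow=0.0
# Logistic (Binomial with logit link):  dfam=2  link=2    (vpow and lpow are then ignored)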
25
GLM	Script	Outputs
# In addition, some GLM statistics are provided in CSV format, one comma-separated name-value
# pair per each line, as follows:
# -------------------------------------------------------------------------------------------
# TERMINATION_CODE A positive integer indicating success/failure as follows:
# 1 = Converged successfully; 2 = Maximum number of iterations reached;
# 3 = Input (X, Y) out of range; 4 = Distribution/link is not supported
# BETA_MIN Smallest beta value (regression coefficient), excluding the intercept
# BETA_MIN_INDEX Column index for the smallest beta value
# BETA_MAX Largest beta value (regression coefficient), excluding the intercept
# BETA_MAX_INDEX Column index for the largest beta value
# INTERCEPT Intercept value, or NaN if there is no intercept (if icpt=0)
# DISPERSION Dispersion used to scale deviance, provided as "disp" input parameter
# or estimated (same as DISPERSION_EST) if the "disp" parameter is <= 0
# DISPERSION_EST Dispersion estimated from the dataset
# DEVIANCE_UNSCALED Deviance from the saturated model, assuming dispersion == 1.0
# DEVIANCE_SCALED Deviance from the saturated model, scaled by the DISPERSION value
# -------------------------------------------------------------------------------------------
#
# The Log file, when requested, contains the following per-iteration variables in CSV format,
# each line containing triple (NAME, ITERATION, VALUE) with ITERATION = 0 for initial values:
# -------------------------------------------------------------------------------------------
# NUM_CG_ITERS Number of inner (Conj.Gradient) iterations in this outer iteration
# IS_TRUST_REACHED 1 = trust region boundary was reached, 0 = otherwise
# POINT_STEP_NORM L2-norm of iteration step from old point (i.e. "beta") to new point
# OBJECTIVE The loss function we minimize (i.e. negative partial log-likelihood)
# OBJ_DROP_REAL Reduction in the objective during this iteration, actual value
# OBJ_DROP_PRED Reduction in the objective predicted by a quadratic approximation
# OBJ_DROP_RATIO Actual-to-predicted reduction ratio, used to update the trust region
# GRADIENT_NORM L2-norm of the loss function gradient (NOTE: sometimes omitted)
# LINEAR_TERM_MIN The minimum value of X %*% beta, used to check for overflows
# LINEAR_TERM_MAX The maximum value of X %*% beta, used to check for overflows
# IS_POINT_UPDATED 1 = new point accepted; 0 = new point rejected, old point restored
# TRUST_DELTA Updated trust region size, the "delta"
# -------------------------------------------------------------------------------------------
26
GLM	Likelihood	Maximization
• 1	record:		ℓ (yi	| θi	,	a)		=		exp{(yi ·θi		– b(θi))/ a + c(yi	,	a)}
• Log ℓ (Y | Θ, a)  =  1/a · ∑ i≤n (yi · θi – b(θi))  +  const (independent of Θ)
• f(β;	X,	Y)		=		– ∑i	≤	n (yi · θi		– b(θi)) +		λ/2 · βT	β →		min
– Here		θi is	a	function	of	β:			θi =		bʹ–1	(g –1	(Xi	β))
– Add	regularization	with		λ/2		to	agree	with	least	squares
– If		X		has	intercept,	do	NOT	regularize	its	β-value
• Non-quadratic;		how	to	optimize?
– Gradient	descent:		fastest	when	far	from	optimum
– Newton	method:		fastest	when	close	to	optimum
• Trust	Region	Conjugate	Gradient
– Strikes	a	good	balance	between	the	above	two
27
GLM	Likelihood	Maximization
• f(β;	X,	Y)		=		– ∑i	≤	n (yi · θi		– b(θi)) +		λ/2 · βT	β →		min
• Outer	iteration:		From		β to		βnew =		β +	z
– ∆f	(z;	β)		:=		f(β +	z;	X,	Y)		– f(β;	X,	Y)
• Use	“Fisher	Scoring”	to	approximate	Hessian	and		∆f	(z;	β)
– ∆f	(z;	β)		≈		½·zT	A z		+		GT	z,				where:
– A		=		XT	diag(w)X		+		λI and				G		=		– XT	u		+		λ·β
– Vectors		u,	w		depend	on		β via	mean-to-variance	and	link	functions
• Trust	Region:		Area		ǁzǁ2 ≤	δ where	we	trust	the	
approximation		∆f	(z;	β)		≈		½ ·zT	A z		+		GT	z
– ǁzǁ2 ≤	δ too	small		→		Gradient	Descent	step	(1	inner	iteration)
– ǁzǁ2 ≤	δ mid-size		→		Cut-off	Conjugate	Gradient	step	(2	or	more)
– ǁzǁ2 ≤	δ too	wide		→		Full	Conjugate	Gradient	step
( FI = XT diag(w) X is the “expected” Hessian )
28
Trust	Region	Conj.	Gradient
• Code snippet for Logistic Regression
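The snippet relies on several variables that are initialized earlier in the full script; a minimal set of assumed initializations (a sketch, with 0/1 labels Y recoded to ±1, as the probability formula inside the loop implies):
N = nrow (X);  D = ncol (X);
y = 2 * Y - 1;                            # recode 0/1 labels to -1 / +1
lambda = 0.001;  tolerance = 0.000001;    # assumed regularization constant and tolerance
max_i = 100;  max_j = D;                  # assumed outer / inner iteration limits
i = 0;
beta    = matrix (0,    rows = D, cols = 1);
zeros_D = matrix (0,    rows = D, cols = 1);
w       = matrix (0.25, rows = N, cols = 1);   # p * (1 - p) at beta = 0, where p = 0.5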
g = - 0.5 * t(X) %*% y; f_val = - N * log (0.5);
delta = 0.5 * sqrt (D) / max (sqrt (rowSums (X ^ 2)));
exit_g2 = sum (g ^ 2) * tolerance ^ 2;
while (sum (g ^ 2) > exit_g2 & i < max_i)
{
i = i + 1;
r = g;
r2 = sum (r ^ 2); exit_r2 = 0.01 * r2;
d = - r;
z = zeros_D; j = 0; trust_bound_reached = FALSE;
while (r2 > exit_r2 & (! trust_bound_reached) & j < max_j)
{
j = j + 1;
Hd = lambda * d + t(X) %*% diag (w) %*% X %*% d;
c = r2 / sum (d * Hd);
[c, trust_bound_reached] = ensure_quadratic (c, sum(d^2), 2 * sum(z*d), sum(z^2) - delta^2);
z = z + c * d;
r = r + c * Hd;
r2_new = sum (r ^ 2);
d = - r + (r2_new / r2) * d;
r2 = r2_new;
}
p = 1.0 / (1.0 + exp (- y * (X %*% (beta + z))));
f_chg = - sum (log (p)) + 0.5 * lambda * sum ((beta + z) ^ 2) - f_val;
delta = update_trust_region (delta, sqrt(sum(z^2)), f_chg, sum(z*g), 0.5 * sum(z*(r + g)));
if (f_chg < 0)
{
beta = beta + z;
f_val = f_val + f_chg;
w = p * (1 - p);
g = - t(X) %*% ((1 - p) * y) + lambda * beta;
} }
# Checks whether the proposed step length x overshoots the trust-region boundary,
# i.e. whether  a·x^2 + b·x + c > 0  for  a = ǁdǁ2,  b = 2·zTd,  c = ǁzǁ2 – δ2;
# if so, returns the positive root, i.e. the step that lands exactly on the boundary
# (using the numerically stable form of the quadratic formula).
ensure_quadratic =
function (double x, double a, double b, double c)
return (double x_new, boolean test)
{
test = (a * x^2 + b * x + c > 0);
if (test) {
rad = sqrt (b ^ 2 - 4 * a * c);
if (b >= 0) {
x_new = - (2 * c) / (b + rad);
} else {
x_new = - (b - rad) / (2 * a);
}
} else {
x_new = x;
} }
29
Trust	Region	Conj.	Gradient
• Trust region update in the Logistic Regression snippet
update_trust_region =
function (double delta,
double z_distance,
double f_chg_exact,
double f_chg_linear_approx,
double f_chg_quadratic_approx)
return (double delta)
{
sigma1 = 0.25;
sigma2 = 0.5;
sigma3 = 4.0;
if (f_chg_exact <= f_chg_linear_approx) {
alpha = sigma3;
} else {
alpha = max (sigma1, - 0.5 * f_chg_linear_approx / (f_chg_exact - f_chg_linear_approx));
}
rho = f_chg_exact / f_chg_quadratic_approx;
if (rho < 0.0001) {
delta = min (max (alpha, sigma1) * z_distance, sigma2 * delta);
} else { if (rho < 0.25) {
delta = max (sigma1 * delta, min (alpha * z_distance, sigma2 * delta));
} else { if (rho < 0.75) {
delta = max (sigma1 * delta, min (alpha * z_distance, sigma3 * delta));
} else {
delta = max (delta, min (alpha * z_distance, sigma3 * delta));
}}}
}
30
GLM:	Other	Statistics
• REMINDER:
– Each		yi has	distribution		=		exp{(yi ·θi		– b(θi))/a + c(yi	,	a)}
– Variance(yi)		=		a ·bʺ(θi)		=		a·V(μi)
• Variance	of		Y		given		X
– Estimating	the	β gives		V(μi)	=	V (g–1	(Xi	β))
– Constant		“a”		is	called	dispersion,	analogue	of		σ2
– Estimator:		a		≈		1/(n	– m)·∑ i	≤	n	(yi – μi)2	/	V(μi)
• Variance	of	parameters	β
– We	use	MLE,	hence	Cramér-Rao	formula	applies	(for	large	n)
– Fisher	Information:			FI		=		(1/a)·	XT	diag(w)X,			wi		= (V(μi) ·gʹ(μi)2)–1
– Estimator:			Cov	β ≈		a·(XT	diag(w)X)–1,				Var	βj =		(Cov	β)jj
31
GLM:		Deviance
• Let		X		have		m		features,	of	which		k		may	have	no	effect	on		Y
– Will	“no	effect”	result	in		βj ≈	0	?				(Unlikely.)
– Estimate		βj and		Var βj then	test		βj /	(Var βj)1/2 against		N(0,	1)?
• Student’s	t-test	is	better
• Likelihood	Ratio	Test:
• Null	Hypothesis:		Y		given		X		follows	GLM	with		β1 =	…	=	βk =	0
– If NH is	true,		D is	asympt.	distributed	as		χ2 with		k		deg.	of	freedom
– If NH is false,  D → +∞  as  n → +∞
• P-value %  =  Prob[ χ2k > D ] · 100%,  where the likelihood-ratio statistic is
D  =  2 · log [ maxβ LGLM(Y | X, a;  β1, …, βk, βk+1, …, βm)  /  maxβ LGLM(Y | X, a;  0, …, 0, βk+1, …, βm) ]  >  0
32
GLM:		Deviance
• To	test	many	nested	models	(feature	subsets)	we	need	their	
maximum	likelihoods	to	compute		D
– PROBLEM:		Term		“c(yi	,	a)”		in	GLM’s		exp{(yi ·θi		– b(θi))/ a + c(yi	,	a)}
• Instead,	compute	deviance:
• “Saturated	model”	has	no	X,	no	β,	but	picks	the	best		θi for	each	
individual		yi (not	realistic	at	all,	just	convention)
– Term		“c(yi	,	a)”		is	the	same	in	both	models!
– But		“a”		has	to	be	fixed,	e.g.	to	1
• Deviance	itself	is	used	for	goodness	of	fit	tests,	too
• Deviance:   D  =  2 · log [ maxΘ LGLM(Y | Θ, a :  saturated model)  /  maxβ LGLM(Y | X, a;  β1, …, βk, …, βm) ]  >  0
33
Survival Analysis
• Given:
– Survival data from individuals as (time, event)
– Categorical/continuous features for each individual
• Estimate:
– Probability of survival to a future time
– Rate of hazard at a given time
• Example:  [Figure: timeline of patients 1–5 over times 1–9;  † = death from specific cancer,  ? = lost to follow-up]
34
Cox Regression
• Semi-parametric model (“robust”);  most commonly used
• Handles categorical and continuous data
• Handles (right/left/interval) censored data
• [Equation figure:  baseline hazard,  covariates,  coefficients]
35
36
Event	Hazard	Rate
• Symptom	events	E follow	a	Poisson	process:
[Figure: timeline with symptom events E1, E2, E3, E4, the hazard function, and death]
• Hazard function = Poisson rate:
h(t; state)  =  lim Δt→0  Prob[ E ∈ [t, t + Δt) | state ] / Δt
• Given state and hazard, we could compute the probability of the observed event count:
Prob[ K events in t1 ≤ t ≤ t2 ]  =  H^K · e^(–H) / K! ,   where   H  =  ∫ h(t; state(t)) dt  over  t1 ≤ t ≤ t2
37
Cox	Proportional	Hazards
• Assume	that	exactly	1	patient	gets	event	E at	time	t
• The probability that it is Patient #i is the hazard ratio:
Prob[ #i gets E ]  =  h(t; si)  /  ∑ j≤n h(t; sj)
• Cox assumption:   h(t; state)  =  h0(t) · Λ(state)  =  h0(t) · exp(λT s)
• Time confounder cancels out!
[Figure: patients #1, …, #n with their states s1, …, sn (si = statei) at time t]
38
Cox	“Partial”	Likelihood
• Cox	“partial”	likelihood	for	the	dataset	is	a	product	over	all	E:
[Figure: patients #1, …, #n on a shared timeline, with event times marked]
LCox(λ)  =  Prob[ all E ]  =  ∏ t:E  h(t; swho(t)) / ∑ j≤n h(t; sj)  =  ∏ t:E  exp(λT swho(t)) / ∑ j≤n exp(λT sj)
( who(t)  =  the patient who gets the event at time t )
Cox Regression
• Semi-parametric model (“robust”);  most commonly used
• Handles categorical and continuous data
• Handles (right/left/interval) censored data
• Cox regression in DML:
– Fitting parameters via negative partial log-likelihood
– Method: trust region Newton with conjugate gradient
– Inverting the Hessian using block Cholesky for computing std. errors of the betas
– Similar features as coxph() in R, e.g. stratification, frequency weights, offsets, goodness-of-fit testing, recurrent event analysis
• [Equation figure:  baseline hazard,  covariates,  coefficients]
39
BACK-UP
40
Kaplan-Meier Estimator
41
Kaplan-Meier Estimator
42
Confidence	Intervals
• Definition	of	Confidence	Interval;	p-value
• Likelihood	ratio	test
• How	to	use	it	for	confidence	interval
• Degrees	of	freedom
43