Regression	in	SystemML
Alexandre	Evfimievski
1
Linear	Regression
• INPUT:		Records	(x1,	y1),	(x2,	y2),	…,	(xn,	yn)
– Each	xi is	m-dimensional:	xi1,	xi2,	…,	xim
– Each	yi is	1-dimensional
• Want	to	approximate	yi as	a	linear	combination	of	xi-entries
– yi ≈		β1xi1 +	β2xi2 +	…	+	βmxim
– Case	m	=	1:			yi ≈	β1xi1 (	Note:		x	=	0		maps	to		y	=	0	)
• Intercept:		a	“free	parameter”	for	default	value	of	yi
– yi ≈		β1xi1 +	β2xi2 +	…	+	βmxim +	βm+1
– Case	m	=	1:			yi ≈	β1xi1 +	β2
• Matrix	notation:		Y	≈	Xβ,		or		Y	≈	(X |1) β if	with	intercept
– X		is		n	× m,		Y		is		n	× 1,		β is		m	× 1		or		(m+1)	× 1
2
Linear	Regression:	Least	Squares
• How	to	aggregate	errors:		yi – (β1xi1 +	β2xi2 +	…	+	βmxim)		?
– What’s	worse:		many	small	errors,		or	a	few	big	errors?
• Sum	of	squares:		∑i≤n (yi – (β1xi1 +	β2xi2 +	…	+	βmxim))2 →		min
– A	few	big	errors	are	much	worse!		We	square	them!
• Matrix	notation:		(Y	– Xβ)T (Y	– Xβ)		→		min
• Good	news:		easy	to	solve	and	find	the	β’s
• Bad	news:		too	sensitive	to	outliers!
3
Linear	Regression:	Direct	Solve
• (Y	– Xβ)T (Y	– Xβ)		→		min
• YT	Y		– YT	(Xβ)		– (Xβ)T	Y		+		(Xβ)T	(Xβ)		→		min
• ½	βT	(XTX) β – βT	(XTY)		→		min
• Take	the	gradient	and	set	it	to	0:			(XTX) β – (XTY)		=		0
• Linear	equation:		(XTX) β =		XTY;			Solution:		β =		(XTX)–1	(XTY)
A = t(X) %*% X;
b = t(X) %*% y;
. . .
. . .
beta_unscaled = solve (A, b);
4
Computation		of		XTX
• Input	(n	× m)-matrix		X		is	often	huge	and	sparse
– Rows X[i, ] make up n records, often n >> 10^6
– Columns		X[,	j]		are	the	features
• Matrix		XTX		is	(m	× m)	and	dense
– Cells:		(XTX)	[j1,	j2]		=		∑ i≤n X[i,	j1]	*	X[i,	j2]
– Part	of	covariance	between	features		#	j1 and		#	j2 across	all	records
– m		could	be	small	or	large
• If	m	≤	1000,		XTX		is	small	and	“direct	solve”	is	efficient…
– …	as	long	as		XTX		is	computed	the	right	way!
– …	and	as	long	as		XTX		is	invertible	(no	linearly	dependent	features)
5
Computation		of		XTX
• Naïve	computation:
a) Read	X	into	memory
b) Copy	it	and	rearrange	cells	into	the	transpose
c) Multiply	two	huge	matrices,	XT and	X
• There	is	a	better	way:		XTX		=		∑i≤n X[i,	]T X[i,	]				(outer	product)
– For	all		i =	1,	…,	n		in	parallel:
a) Read	one	row		X[i,	]
b) Compute	(m	× m)-matrix:		Mi	[j1,	j2]		=		X[i,	j1]	*	X[i,	j2]
c) Aggregate:		M	=	M	+	Mi
• Extends	to		(XTX) v		and		XT	diag(w) X,		used	in	other	scripts:
– (XTX) v		=		∑i≤n (∑ j≤m X[i,	j]v[j]) *	X[i,	]T
– XT	diag(w)X		=	∑ i≤n wi *	X[i,	]T X[i,	]
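As a rough DML sketch (variable names X, v, w assumed here), both extensions can be written so that the sparse X is multiplied first; the conjugate-gradient snippets later in the deck use exactly this pattern:
XtXv = t(X) %*% (X %*% v);           # (XTX) v  without forming the dense m x m matrix XTX
XtWX = t(X) %*% (diag (w) %*% X);    # XT diag(w) X  for an n x 1 weight vector w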
6
Conjugate	Gradient
• What	if		XTX		is	too	large,	m	>>	1000?
– Dense		XTX		may	take	far	more	memory	than	sparse		X
• Full		XTX		not	needed	to	solve		(XTX) β =		XTY
– Use	iterative	method
– Only	evaluate		(XTX)v		for	certain	vectors		v
• Ex.:	Gradient	Descent	for		f (β)		=		½	βT	(XTX) β – βT	(XTY)	
– Start	with	any		β =	β0
– Take the gradient:  r  =  ∇f(β)  =  (XTX) β – (XTY)        (also, the residual)
– Find	number		a to	minimize		f(β + a ·r):			a =		– (rT	r)	/	(rT	XTX r)
– Update:		βnew		←		β + a·r
• But	gradient	is	too	local
– And	“forgetful”
7
Conjugate	Gradient
• PROBLEM:		Gradient	takes	a	very	similar	direction	many	times
• Enforce	orthogonality	to	prior	directions?
– Take	the	gradient:		r		=		(XTX) β – (XTY)
– Subtract	prior	directions:		p(k) =		r		– λ1p(1) – …	– λk-1p(k-1)
• Pick		λi to	ensure		(p(k) ·	p(i))		=		0			???
– Find	number		a(k) to	minimize		f(β + a(k)	·p(k)),		etc	…
• STILL,	PROBLEMS:
– Value		a(k) does	NOT	minimize		f(a(1)	·p(1)		+	…	+ a(k)	·p(k)		+	…	+ a(m)	·p(m))
– Keep	all	prior	directions		p(1),	p(2),	…	,	p(k)	?		That’s	a	lot!
• SOLUTION:		Enforce	Conjugacy
– Conjugate	vectors:			uT	(XTX)	v		=		0,		instead	of		uT	v		=		0
• Matrix		XTX		acts	as	the	“metric”	in	distorted	space
– This	does	minimize		f(a(1)	·p(1)		+	…	+ a(k)	·p(k)		+	…	+ a(m)	·p(m))
• And,		only	need		p(k-1) and		r(k) to	compute		p(k)
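For reference, the recurrences that the snippet on the next slide implements, written here as a math block in LaTeX (A stands for XTX, or XTX + λI with regularization):

a_k = \frac{r_k^{\top} r_k}{p_k^{\top} A\, p_k}, \qquad
\beta_{k+1} = \beta_k + a_k\, p_k, \qquad
r_{k+1} = r_k + a_k\, A\, p_k, \qquad
p_{k+1} = -\,r_{k+1} + \frac{r_{k+1}^{\top} r_{k+1}}{r_k^{\top} r_k}\; p_k

so that p_{k+1}^T A p_k = 0 holds by construction, and conjugacy to all earlier directions follows.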
8
Conjugate	Gradient
• Algorithm,	step	by	step
i = 0; beta = matrix (0, ...);                # Initially:  β = 0
r = - t(X) %*% y;                             # Residual & gradient:  r = (XTX) β – (XTY)
p = - r;                                      # Direction for β:  negative gradient
norm_r2 = sum (r ^ 2);                        # Norm of the residual error  =  rT r
norm_r2_target = norm_r2 * tolerance ^ 2;     # Desired norm of the residual error
while (i < mi & norm_r2 > norm_r2_target)
{                                             # We have:  p is the next direction for β
    q = t(X) %*% (X %*% p) + lambda * p;      # q  =  (XTX) p
    a = norm_r2 / sum (p * q);                # a  =  rT r / pT (XTX) p   minimizes  f(β + a·p)
    beta = beta + a * p;                      # Update:  βnew ← β + a·p
    r = r + a * q;                            # rnew ← (XTX)(β + a·p) – (XTY)  =  r + a·(XTX) p
    old_norm_r2 = norm_r2;
    norm_r2 = sum (r ^ 2);                    # Update the norm of the residual error  =  rT r
    p = -r + (norm_r2 / old_norm_r2) * p;     # Update direction: (1) take the negative gradient;
                                              # (2) enforce conjugacy with the previous direction
    i = i + 1;                                # Conjugacy to all older directions is automatic!
}
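A minimal setup sketch for trying the loop above on synthetic data (all names and values below are assumed, not taken from the original script; the matrix(0, ...) on the first line would be an ncol(X) x 1 column of zeros):
X = rand (rows = 10000, cols = 100, sparsity = 0.01);    # synthetic sparse feature matrix
y = rand (rows = 10000, cols = 1);                       # synthetic response column
lambda = 0.000001;       # L2-regularization constant (see the Degeneracy slide)
tolerance = 0.000001;    # relative tolerance on the residual norm
mi = ncol (X);           # at most m iterations are needed in exact arithmetic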
9
Degeneracy	and	Regularization
• PROBLEM:		What	if		X		has	linearly	dependent	columns?
– Cause:		recoding	categorical	features,	adding	composite	features
– Then		XTX		is	not	a	“metric”:		exists		ǁpǁ	>	0		such	that		pT	(XTX)	p		=		0
– In the CG step  a = rT r / pT (XTX) p :  Division by Zero!
• In	fact,	then		Least	Squares		has		∞		solutions
– Most	of	them	have		HUGE		β-values
• Regularization:		Penalize		β with	larger	values
– L2-Regularization:			(Y	– Xβ)T (Y	– Xβ)		+		λ·βT	β →		min
– Replace		XTX		with		XTX		+		λI
– Pick		λ <<		diag(XTX),		refine	by	cross-validation
– Do	NOT	regularize	intercept
• CG: q = t(X) %*% (X %*% p) + lambda * p;
10
Shifting	and	Scaling	X
• PROBLEM:		Features	have	vastly	different	range:
– Examples:		[0,	1];		[2010,	2015];		[$0.01,		$1	Billion]
• Each		βi in		Y	≈	Xβ has	different	size	&	accuracy?
– Regularization			λ·βT	β also	range-dependent?
• SOLUTION:		Scale	&	shift	features	to	mean	=	0,	variance	=	1
– Needs	intercept:		Y	≈	(X| 1)β
– Equivalently:		(Xnew |1)		=		(X |1)		%*% SST				“Shift-Scale	Transform”
• BUT:		Sparse		X		becomes		Dense		Xnew …
• SOLUTION:			(Xnew |1)	 %*% M		=		(X |1)	 %*% (SST	 %*% M)
– Extends	to		XTX		and	other	X-products
– Further	optimization:		SST		has	special	shape
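A sketch of that special shape (assuming column means μj and scale factors sj, so that Xnew[, j] = (X[, j] – μj)·sj; the scale_X / shift_X vectors in the snippet on the next slide play roughly the roles of s and –μ∘s), written in LaTeX:

(X \mid 1)\,\mathrm{SST} = (X_{\mathrm{new}} \mid 1),
\qquad
\mathrm{SST} =
\begin{pmatrix} \operatorname{diag}(s) & 0 \\ -(\mu \circ s)^{\top} & 1 \end{pmatrix}

Because SST has only about 2m + 1 nonzero entries, folding it into products such as XTX (as the snippet on the next slide does on both sides of A) is cheap compared to recomputing dense products with Xnew.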
11
Shifting	and	Scaling	X
– Linear Regression Direct Solve code snippet example:
A = t(X) %*% X;
b = t(X) %*% y;
if (intercept_status == 2) {
A = t(diag (scale_X) %*% A + shift_X %*% A [m_ext, ]);
A = diag (scale_X) %*% A + shift_X %*% A [m_ext, ];
b = diag (scale_X) %*% b + shift_X %*% b [m_ext, ];
}
A = A + diag (lambda);
beta_unscaled = solve (A, b);
if (intercept_status == 2) {
beta = scale_X * beta_unscaled;
beta [m_ext, ] = beta [m_ext, ] + t(shift_X) %*% beta_unscaled;
} else {
beta = beta_unscaled;
}
12
Regression	in	Statistics
• Model:		Y	=	Xβ* +	ε where		ε is	a	random	vector
– There	exists	a	“true”	β*
– Each yi is Gaussian with mean μi = Xi β* and variance σ2 (so each εi has mean 0)
• Likelihood	maximization	to	estimate		β*
– Likelihood:		ℓ(Y	|	X,	β,	σ)		=		∏i ≤	n C(σ)·exp(– (yi – Xi	β)2 /	2σ2)
– Log	ℓ(Y	|	X,	β,	σ)		=		n·c(σ)		– ∑i ≤	n (yi – Xi	β)2 /	2σ2
– Maximum	likelihood	over	β =		Least	Squares
• Why	do	we	need	statistical	view?
– Confidence	intervals	for	parameters
– Goodness	of	fit	tests
– Generalizations:	replace	Gaussian	with	another	distribution
13
Maximum	Likelihood	Estimator
• In	each		(xi	,	yi)		let		yi have	distribution		ℓ(yi |	xi	,	β,	φ)
– Records	are	mutually	independent	for		i =	1,	…,	n
• Estimator	for		β is	a	function		f(X,	Y)
– Y	is	random		→			f(X,	Y)	random
– Unbiased	estimator:		for	all	β,	mean		E	f(X,	Y)	=	β
• Maximum	likelihood	estimator
– MLE (X,	Y)		=		argmaxβ ∏i ≤	n ℓ(yi |	xi	,	β,	φ)
– Asymptotically	unbiased:		E	MLE (X,	Y)	→	β as		n	→	∞
• Cramér-Rao	Bound
– For	unbiased	estimators,		Var f(X,	Y)		≥		FI(X,	β,	φ) –1
– Fisher	information:		FI(X,	β,	φ)		=		– EY Hessianβ log	ℓ(Y| X,	β,	φ)
– For	MLE:		Var (MLE (X,	Y)) →		FI(X,	β,	φ)–1 as		n	→	∞
14
Variance	of	M.L.E.
• Cramér-Rao	Bound	is	a	simple	way	to	estimate	variance	of	
predicted	parameters	(for	large	n):
1. Maximize		log	ℓ(Y |X,	β,	φ)		to	estimate		β
2. Compute	the	Hessian	(2nd derivatives)	of		log	ℓ(Y |X,	β,	φ)
3. Compute	“expected”	Hessian:		FI		=		– EY Hessian
4. Invert		FI		as	a	matrix:		get		FI–1
5. Use		FI–1 as	approx.	covariance	matrix	for	the	estimated		β
• For	linear	regression:
– Log	ℓ(Y	|	X,	β,	σ)		=		n·c(σ)		– ∑i ≤	n (yi – Xi	β)2 /	2σ2
– Hessian		=		–(1/σ2)·XTX;				FI		=		(1/σ2)·XTX
– Cov β ≈		σ2 ·(XTX) –1	;				Var βj ≈		σ2 ·diag((XTX) –1) j
15
Variance	of		Y		given		X
• MLE for variance of Y  =  1/n · ∑ i≤n (yi – yavg)2
– To	make	it	unbiased,	replace		1/n		with		1/(n	– 1)
• Variance	of		ε in		Y	=	Xβ* +	ε is	residual	variance
– Estimator	for	Var(ε)		=		1/(n	– m	– 1)	·	∑i ≤	n (yi – Xi	β)2
• Good	regression	must	have:		Var(ε)		<<		Var(Y)
– “Explained”	variance		=		Var(Y)		– Var(ε)
• R-squared:		estimate		1	– Var(ε)	/	Var(Y)		to	test	fitness:
– R2plain  =  1 – (∑ i≤n (yi – Xiβ)2) / (∑ i≤n (yi – yavg)2)
– R2adj.  =  1 – (∑ i≤n (yi – Xiβ)2) / (∑ i≤n (yi – yavg)2) · (n – 1) / (n – m – 1)
• Pearson	residual:		ri =		(yi – Xi	β)	/	Var(ε)1/2
– Should	be	approximately	Gaussian	with	mean	0	and	variance	1
– Can	use	in	another	fitness	test		(more	on	tests	later)
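A small DML sketch of these fit statistics (variable names assumed; uses the plug-in residual-variance estimator from above, without an intercept column for simplicity):
n = nrow (X);  m = ncol (X);
y_hat    = X %*% beta;                          # fitted values
ss_res   = sum ((y - y_hat) ^ 2);               # residual sum of squares
ss_tot   = sum ((y - sum (y) / n) ^ 2);         # total sum of squares around the mean
R2_plain = 1 - ss_res / ss_tot;
R2_adj   = 1 - (ss_res / ss_tot) * (n - 1) / (n - m - 1);
var_eps  = ss_res / (n - m - 1);                # estimator of Var(eps)
r_pearson = (y - y_hat) / sqrt (var_eps);       # should look approximately N(0, 1)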
16
LinReg	Scripts:	Inputs
# INPUT PARAMETERS:
# --------------------------------------------------------------------------------------------
# NAME TYPE DEFAULT MEANING
# --------------------------------------------------------------------------------------------
# X String --- Location (on HDFS) to read the matrix X of feature vectors
# Y String --- Location (on HDFS) to read the 1-column matrix Y of response values
# B String --- Location to store estimated regression parameters (the betas)
# O String " " Location to write the printed statistics; by default is standard output
# Log String " " Location to write per-iteration variables for log/debugging purposes
# icpt Int 0 Intercept presence, shifting and rescaling the columns of X:
# 0 = no intercept, no shifting, no rescaling;
# 1 = add intercept, but neither shift nor rescale X;
# 2 = add intercept, shift & rescale X columns to mean = 0, variance = 1
# reg Double 0.000001 Regularization constant (lambda) for L2-regularization; set to nonzero
# for highly dependent/sparse/numerous features
# tol Double 0.000001 Tolerance (epsilon); conjugate gradient procedure terminates early if
# L2 norm of the beta-residual is less than tolerance * its initial norm
# maxi Int 0 Maximum number of conjugate gradient iterations, 0 = no maximum
# fmt String "text" Matrix output format for B (the betas) only, usually "text" or "csv"
# --------------------------------------------------------------------------------------------
# OUTPUT: Matrix of regression parameters (the betas); its size depends on the icpt input value:
# OUTPUT SIZE: OUTPUT CONTENTS: HOW TO PREDICT Y FROM X AND B:
# icpt=0: ncol(X) x 1 Betas for X only Y ~ X %*% B[1:ncol(X), 1], or just X %*% B
# icpt=1: ncol(X)+1 x 1 Betas for X and intercept Y ~ X %*% B[1:ncol(X), 1] + B[ncol(X)+1, 1]
# icpt=2: ncol(X)+1 x 2 Col.1: betas for X & intercept Y ~ X %*% B[1:ncol(X), 1] + B[ncol(X)+1, 1]
# Col.2: betas for shifted/rescaled X and intercept
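Following the table above, a small DML prediction sketch (B, X, icpt assumed to be already loaded):
if (icpt == 0) {
    y_pred = X %*% B;
} else {
    # column 1 of B holds the betas in the original feature space; the last row is the intercept
    y_pred = X %*% B [1 : ncol (X), 1] + as.scalar (B [ncol (X) + 1, 1]);
}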
17
LinReg	Scripts:	Outputs
# In addition, some regression statistics are provided in CSV format, one comma-separated
# name-value pair per each line, as follows:
#
# NAME MEANING
# -------------------------------------------------------------------------------------
# AVG_TOT_Y Average of the response value Y
# STDEV_TOT_Y Standard Deviation of the response value Y
# AVG_RES_Y Average of the residual Y - pred(Y|X), i.e. residual bias
# STDEV_RES_Y Standard Deviation of the residual Y - pred(Y|X)
# DISPERSION GLM-style dispersion, i.e. residual sum of squares / # deg. fr.
# PLAIN_R2 Plain R^2 of residual with bias included vs. total average
# ADJUSTED_R2 Adjusted R^2 of residual with bias included vs. total average
# PLAIN_R2_NOBIAS Plain R^2 of residual with bias subtracted vs. total average
# ADJUSTED_R2_NOBIAS Adjusted R^2 of residual with bias subtracted vs. total average
# PLAIN_R2_VS_0 * Plain R^2 of residual with bias included vs. zero constant
# ADJUSTED_R2_VS_0 * Adjusted R^2 of residual with bias included vs. zero constant
# -------------------------------------------------------------------------------------
# * The last two statistics are only printed if there is no intercept (icpt=0)
#
# The Log file, when requested, contains the following per-iteration variables in CSV
# format, each line containing triple (NAME, ITERATION, VALUE) with ITERATION = 0 for
# initial values:
#
# NAME MEANING
# -------------------------------------------------------------------------------------
# CG_RESIDUAL_NORM L2-norm of Conj.Grad.residual, which is A %*% beta - t(X) %*% y
# where A = t(X) %*% X + diag (lambda), or a similar quantity
# CG_RESIDUAL_RATIO Ratio of current L2-norm of Conj.Grad.residual over the initial
# -------------------------------------------------------------------------------------
18
Caveats
• Overfitting:		β reflect	individual	records	in		X,	not	distribution
– Typically,	too	few	records	(small	n)	or	too	many	features	(large	m)
– To	detect,	use	cross-validation
– To	mitigate,	select	fewer	features;		regularization	may	help	too
• Outliers:		Some	records	in	X	are	highly	abnormal
– They	badly	violate	distribution,	or	have	very	large	cell-values
– Check MIN and MAX of Y,  X-columns,  Xiβ,  and  ri2 = (yi – Xiβ)2 / Var(ε)
– To	mitigate,	remove	outliers,	or	change	distribution	or	link	function
• Interpolation	vs.	extrapolation
– A	model	trained	on	one	kind	of	data	may	not	carry	over	to	another	
kind	of	data;		the	past	may	not	predict	the	future
– Great	research	topic!
19
Generalized	Linear	Models
• Linear	Regression:		Y = Xβ* +	ε
– Each		yi is	Normal(μi ,	σ2)		where	mean		μi =	Xi	β*
– Variance(yi)		=		σ2 =		constant
• Logistic	Regression:
– Each		yi is	Bernoulli(μi)		where	mean		μi =	1	/	(1	+	exp	(– Xi	β*))
– Prob [yi =	1]		=		μi ,		Prob [yi =	0]		=		1	– μi ,		mean		=		probability	of	1
– Variance(yi)		=		μi (1	– μi)
• Poisson	Regression:
– Each		yi is	Poisson(μi)		where	mean		μi =	exp(Xi	β*)
– Prob [yi =	k]		=		(μi)k	exp(– μi)/ k!			for		k	=	0,	1,	2,	…
– Variance(yi)		=		μi
• Only in Linear Regression do we add an error εi to the mean μi
20
Generalized	Linear	Models
• GLM	Regression:
– Each		yi has	distribution		=		exp{(yi ·θi – b(θi))/a + c(yi ,	a)}
– Canonical	parameter θi represents	the	mean:			μi =		bʹ(θi)
– Link	function connects		μi and		Xi	β*	:			Xi	β* =		g(μi),			μi =		g –1	(Xi	β*)
– Variance(yi)		=		a ·bʺ(θi)		
• Example:		Linear	Regression	as	GLM
– C(σ)·exp(– (yi – Xi	β)2 /	2σ2)		=		exp{(yi ·θi – b(θi))/a + c(yi ,	a)}
– θi =		μi =		Xi	β;				b(θi)		=		(Xi	β)2	/ 2;				a		=		σ2 =		variance
• Link function = identity;   c(yi, a)  =  – yi2 / 2σ2  +  log C(σ)
• Example:		Logistic	Regression	as	GLM
– (μi )y[i] (1	– μi)1	– y[i] =		exp{yi ·	log(μi)		– yi ·	log(1	– μi)		+		log(1	– μi)}
=		exp{(yi ·θi – b(θi))/ a + c(yi ,	a)}
– θi =		log(μi / (1	– μi))		=		Xi	β;				b(θi)		=		– log(1	– μi)		=		log(1	+	exp(θi))
• Link	function		=		log (μ / (1	– μ))	;				Variance		=		μ(1	– μ)	;				a	=	1
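As a quick check of the identities above (a worked step in LaTeX, not on the original slide), differentiating b(θ) = log(1 + e^θ) recovers the stated mean and variance:

b'(\theta) = \frac{e^{\theta}}{1 + e^{\theta}} = \mu,
\qquad
b''(\theta) = \frac{e^{\theta}}{(1 + e^{\theta})^{2}} = \mu\,(1 - \mu) = \operatorname{Var}(y_i) \quad (a = 1).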
21
Generalized	Linear	Models
• GLM	Regression:
– Each		yi has	distribution		=		exp{(yi ·θi		– b(θi))/a + c(yi	,	a)}
– Canonical	parameter θi represents	the	mean:			μi =		bʹ(θi)
– Link	function connects		μi and		Xi	β*	:			Xi	β* =		g(μi),			μi =		g –1	(Xi	β*)
– Variance(yi)		=		a ·bʺ(θi)
• Why	θi	?		What	is	b(θi)?
– θi makes	formulas	simpler,	stands	for		μi (no	big	deal)
– b(θi)		defines	what	distribution	it	is:		linear,		logistic,		Poisson,		etc.
– b(θi)		connects	mean	with	variance:			Var(yi)		=		a·bʺ(θi),			μi =		bʹ(θi)
• What	is	link	function?
– You	choose	it to	link		μi with	your	features		β1xi1 +	β2xi2 +	…	+	βmxim
– Additive	effects:		μi =		Xi	β;				Multiplicative	effects:		μi =		exp(Xi	β)
Bayes	law	effects:		μi =	1	/	(1	+	exp	(– Xi	β));				Inverse:		μi =	1	/	(Xi	β)
– Xi	β has	range	(– ∞,	+∞),		but		μi may	range	in		[0,	1],		[0,	+∞)		etc.
22
GLMs	We	Support
• We	specify	GLM	by:
– Mean	to	variance	connection
– Link	function	(mean	to	feature	sum	connection)
• Mean-to-variance	for	common	distributions:
– Var (yi)		=		a ·(μi)0 =		σ2	:				Linear	/	Gaussian
– Var (yi)		=		a ·μi	(1	– μi):				Logistic	/	Binomial
– Var (yi)		=		a ·(μi)1	:				Poisson
– Var (yi)		=		a ·(μi)2	:				Gamma
– Var (yi)		=		a ·(μi)3	:				Inverse	Gaussian
• We	support	two	types:		Power	and	Binomial
– Var (yi)		=		a ·(μi)α :				Power,	for	any		α
– Var (yi)		=		a ·μi	(1	– μi):				Binomial
23
GLMs	We	Support
• We	specify	GLM	by:
– Mean	to	variance	connection
– Link	function	(mean	to	feature	sum	connection)
Supported	link	functions
• Power:		Xi	β =		(μi)s where		s	=	0		stands	for		Xi	β =		log	(μi)
– Examples:		identity,		inverse,		log,		square	root
• Link	functions	used	in	binomial	/	logistic	regression:
– Logit,		Probit,		Cloglog,		Cauchit
– Link		Xi	β-range		(– ∞,	+∞)		with		μi-range		(0,	1)
– Differ	in	tail	behavior
• Canonical	link	function:
– Makes		Xi	β =		the	canonical	parameter θi	,		i.e.	sets		μi =		bʹ(Xi	β)
– Power	link		Xi	β =		(μi)1	– α if		Var	=	a·(μi)α ;		Logit	link	for	binomial
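The last bullet follows from μ = bʹ(θ): with a = 1 and a power variance V(μ) = μ^α, the canonical parameter satisfies dθ/dμ = 1/V(μ), so (a short derivation in LaTeX, constants dropped):

\frac{d\theta}{d\mu} = \mu^{-\alpha}
\;\Longrightarrow\;
\theta \propto
\begin{cases} \mu^{\,1-\alpha}, & \alpha \neq 1, \\ \log \mu, & \alpha = 1, \end{cases}

i.e. the canonical link is the power link with exponent 1 – α (log in the Poisson case), and the logit for the binomial family.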
24
GLM	Script	Inputs
# NAME TYPE DEFAULT MEANING
# ---------------------------------------------------------------------------------------------
# X String --- Location to read the matrix X of feature vectors
# Y String --- Location to read response matrix Y with either 1 or 2 columns:
# if dfam = 2, Y is 1-column Bernoulli or 2-column Binomial (#pos, #neg)
# B String --- Location to store estimated regression parameters (the betas)
# fmt String "text" The betas matrix output format, such as "text" or "csv"
# O String " " Location to write the printed statistics; by default is standard output
# Log String " " Location to write per-iteration variables for log/debugging purposes
# dfam Int 1 Distribution family code: 1 = Power, 2 = Binomial
# vpow Double 0.0 Power for Variance defined as (mean)^power (ignored if dfam != 1):
# 0.0 = Gaussian, 1.0 = Poisson, 2.0 = Gamma, 3.0 = Inverse Gaussian
# link Int 0 Link function code: 0 = canonical (depends on distribution),
# 1 = Power, 2 = Logit, 3 = Probit, 4 = Cloglog, 5 = Cauchit
# lpow Double 1.0 Power for Link function defined as (mean)^power (ignored if link != 1):
# -2.0 = 1/mu^2, -1.0 = reciprocal, 0.0 = log, 0.5 = sqrt, 1.0 = identity
# yneg Double 0.0 Response value for Bernoulli "No" label, usually 0.0 or -1.0
# icpt Int 0 Intercept presence, X columns shifting and rescaling:
# 0 = no intercept, no shifting, no rescaling;
# 1 = add intercept, but neither shift nor rescale X;
# 2 = add intercept, shift & rescale X columns to mean = 0, variance = 1
# reg Double 0.0 Regularization parameter (lambda) for L2 regularization
# tol Double 0.000001 Tolerance (epsilon)
# disp Double 0.0 (Over-)dispersion value, or 0.0 to estimate it from data
# moi Int 200 Maximum number of outer (Newton / Fisher Scoring) iterations
# mii Int 0 Maximum number of inner (Conjugate Gradient) iterations, 0 = no maximum
# ---------------------------------------------------------------------------------------------
# OUTPUT: Matrix beta, whose size depends on icpt:
# icpt=0: ncol(X) x 1; icpt=1: (ncol(X) + 1) x 1; icpt=2: (ncol(X) + 1) x 2
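For orientation, a few example settings implied by the table above (illustrative combinations only):
# Linear / Gaussian, identity link:     dfam=1  vpow=0.0  link=1  lpow=1.0
# Poisson with log link (canonical):    dfam=1  vpow=1.0  link=1  lpow=0.0
# Gamma with log link:                  dfam=1  vpow=2.0  link=1  lpow=0.0
# Logistic (Binomial with logit link):  dfam=2  link=2    (vpow and lpow are then ignored)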
25
GLM	Script	Outputs
# In addition, some GLM statistics are provided in CSV format, one comma-separated name-value
# pair per each line, as follows:
# -------------------------------------------------------------------------------------------
# TERMINATION_CODE A positive integer indicating success/failure as follows:
# 1 = Converged successfully; 2 = Maximum number of iterations reached;
# 3 = Input (X, Y) out of range; 4 = Distribution/link is not supported
# BETA_MIN Smallest beta value (regression coefficient), excluding the intercept
# BETA_MIN_INDEX Column index for the smallest beta value
# BETA_MAX Largest beta value (regression coefficient), excluding the intercept
# BETA_MAX_INDEX Column index for the largest beta value
# INTERCEPT Intercept value, or NaN if there is no intercept (if icpt=0)
# DISPERSION Dispersion used to scale deviance, provided as "disp" input parameter
# or estimated (same as DISPERSION_EST) if the "disp" parameter is <= 0
# DISPERSION_EST Dispersion estimated from the dataset
# DEVIANCE_UNSCALED Deviance from the saturated model, assuming dispersion == 1.0
# DEVIANCE_SCALED Deviance from the saturated model, scaled by the DISPERSION value
# -------------------------------------------------------------------------------------------
#
# The Log file, when requested, contains the following per-iteration variables in CSV format,
# each line containing triple (NAME, ITERATION, VALUE) with ITERATION = 0 for initial values:
# -------------------------------------------------------------------------------------------
# NUM_CG_ITERS Number of inner (Conj.Gradient) iterations in this outer iteration
# IS_TRUST_REACHED 1 = trust region boundary was reached, 0 = otherwise
# POINT_STEP_NORM L2-norm of iteration step from old point (i.e. "beta") to new point
# OBJECTIVE The loss function we minimize (i.e. negative partial log-likelihood)
# OBJ_DROP_REAL Reduction in the objective during this iteration, actual value
# OBJ_DROP_PRED Reduction in the objective predicted by a quadratic approximation
# OBJ_DROP_RATIO Actual-to-predicted reduction ratio, used to update the trust region
# GRADIENT_NORM L2-norm of the loss function gradient (NOTE: sometimes omitted)
# LINEAR_TERM_MIN The minimum value of X %*% beta, used to check for overflows
# LINEAR_TERM_MAX The maximum value of X %*% beta, used to check for overflows
# IS_POINT_UPDATED 1 = new point accepted; 0 = new point rejected, old point restored
# TRUST_DELTA Updated trust region size, the "delta"
# -------------------------------------------------------------------------------------------
26
GLM	Likelihood	Maximization
• 1	record:		ℓ (yi	| θi	,	a)		=		exp{(yi ·θi		– b(θi))/ a + c(yi	,	a)}
• Log ℓ (Y | Θ, a)  =  1/a · ∑ i≤n (yi · θi – b(θi))  +  const (independent of Θ)
• f(β;	X,	Y)		=		– ∑i	≤	n (yi · θi		– b(θi)) +		λ/2 · βT	β →		min
– Here		θi is	a	function	of	β:			θi =		bʹ–1	(g –1	(Xi	β))
– Add	regularization	with		λ/2		to	agree	with	least	squares
– If		X		has	intercept,	do	NOT	regularize	its	β-value
• Non-quadratic;		how	to	optimize?
– Gradient	descent:		fastest	when	far	from	optimum
– Newton	method:		fastest	when	close	to	optimum
• Trust	Region	Conjugate	Gradient
– Strikes	a	good	balance	between	the	above	two
27
GLM	Likelihood	Maximization
• f(β;	X,	Y)		=		– ∑i	≤	n (yi · θi		– b(θi)) +		λ/2 · βT	β →		min
• Outer	iteration:		From		β to		βnew =		β +	z
– ∆f	(z;	β)		:=		f(β +	z;	X,	Y)		– f(β;	X,	Y)
• Use	“Fisher	Scoring”	to	approximate	Hessian	and		∆f	(z;	β)
– ∆f	(z;	β)		≈		½·zT	A z		+		GT	z,				where:
– A		=		XT	diag(w)X		+		λI and				G		=		– XT	u		+		λ·β
– Vectors		u,	w		depend	on		β via	mean-to-variance	and	link	functions
• Trust	Region:		Area		ǁzǁ2 ≤	δ where	we	trust	the	
approximation		∆f	(z;	β)		≈		½ ·zT	A z		+		GT	z
– ǁzǁ2 ≤	δ too	small		→		Gradient	Descent	step	(1	inner	iteration)
– ǁzǁ2 ≤	δ mid-size		→		Cut-off	Conjugate	Gradient	step	(2	or	more)
– ǁzǁ2 ≤	δ too	wide		→		Full	Conjugate	Gradient	step
( FI = XT diag(w) X is the “expected” Hessian )
28
Trust	Region	Conj.	Gradient
• Code snippet for Logistic Regression
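The snippet relies on several variables that are initialized earlier in the full script; a minimal set of assumed initializations (a sketch, with 0/1 labels Y recoded to ±1, as the probability formula inside the loop implies):
N = nrow (X);  D = ncol (X);
y = 2 * Y - 1;                            # recode 0/1 labels to -1 / +1
lambda = 0.001;  tolerance = 0.000001;    # assumed regularization constant and tolerance
max_i = 100;  max_j = D;                  # assumed outer / inner iteration limits
i = 0;
beta    = matrix (0,    rows = D, cols = 1);
zeros_D = matrix (0,    rows = D, cols = 1);
w       = matrix (0.25, rows = N, cols = 1);   # p * (1 - p) at beta = 0, where p = 0.5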
g = - 0.5 * t(X) %*% y; f_val = - N * log (0.5);
delta = 0.5 * sqrt (D) / max (sqrt (rowSums (X ^ 2)));
exit_g2 = sum (g ^ 2) * tolerance ^ 2;
while (sum (g ^ 2) > exit_g2 & i < max_i)
{
i = i + 1;
r = g;
r2 = sum (r ^ 2); exit_r2 = 0.01 * r2;
d = - r;
z = zeros_D; j = 0; trust_bound_reached = FALSE;
while (r2 > exit_r2 & (! trust_bound_reached) & j < max_j)
{
j = j + 1;
Hd = lambda * d + t(X) %*% diag (w) %*% X %*% d;
c = r2 / sum (d * Hd);
[c, trust_bound_reached] = ensure_quadratic (c, sum(d^2), 2 * sum(z*d), sum(z^2) - delta^2);
z = z + c * d;
r = r + c * Hd;
r2_new = sum (r ^ 2);
d = - r + (r2_new / r2) * d;
r2 = r2_new;
}
p = 1.0 / (1.0 + exp (- y * (X %*% (beta + z))));
f_chg = - sum (log (p)) + 0.5 * lambda * sum ((beta + z) ^ 2) - f_val;
delta = update_trust_region (delta, sqrt(sum(z^2)), f_chg, sum(z*g), 0.5 * sum(z*(r + g)));
if (f_chg < 0)
{
beta = beta + z;
f_val = f_val + f_chg;
w = p * (1 - p);
g = - t(X) %*% ((1 - p) * y) + lambda * beta;
} }
# Checks whether the proposed step length x overshoots the trust-region boundary,
# i.e. whether  a·x^2 + b·x + c > 0  for  a = ǁdǁ2,  b = 2·zTd,  c = ǁzǁ2 – δ2;
# if so, returns the positive root, i.e. the step that lands exactly on the boundary
# (using the numerically stable form of the quadratic formula).
ensure_quadratic =
function (double x, double a, double b, double c)
return (double x_new, boolean test)
{
test = (a * x^2 + b * x + c > 0);
if (test) {
rad = sqrt (b ^ 2 - 4 * a * c);
if (b >= 0) {
x_new = - (2 * c) / (b + rad);
} else {
x_new = - (b - rad) / (2 * a);
}
} else {
x_new = x;
} }
29
Trust	Region	Conj.	Gradient
• Trust region update in the Logistic Regression snippet
update_trust_region =
function (double delta,
double z_distance,
double f_chg_exact,
double f_chg_linear_approx,
double f_chg_quadratic_approx)
return (double delta)
{
sigma1 = 0.25;
sigma2 = 0.5;
sigma3 = 4.0;
if (f_chg_exact <= f_chg_linear_approx) {
alpha = sigma3;
} else {
alpha = max (sigma1, - 0.5 * f_chg_linear_approx / (f_chg_exact - f_chg_linear_approx));
}
rho = f_chg_exact / f_chg_quadratic_approx;
if (rho < 0.0001) {
delta = min (max (alpha, sigma1) * z_distance, sigma2 * delta);
} else { if (rho < 0.25) {
delta = max (sigma1 * delta, min (alpha * z_distance, sigma2 * delta));
} else { if (rho < 0.75) {
delta = max (sigma1 * delta, min (alpha * z_distance, sigma3 * delta));
} else {
delta = max (delta, min (alpha * z_distance, sigma3 * delta));
}}}
}
30
GLM:	Other	Statistics
• REMINDER:
– Each		yi has	distribution		=		exp{(yi ·θi		– b(θi))/a + c(yi	,	a)}
– Variance(yi)		=		a ·bʺ(θi)		=		a·V(μi)
• Variance	of		Y		given		X
– Estimating	the	β gives		V(μi)	=	V (g–1	(Xi	β))
– Constant		“a”		is	called	dispersion,	analogue	of		σ2
– Estimator:		a		≈		1/(n	– m)·∑ i	≤	n	(yi – μi)2	/	V(μi)
• Variance	of	parameters	β
– We	use	MLE,	hence	Cramér-Rao	formula	applies	(for	large	n)
– Fisher	Information:			FI		=		(1/a)·	XT	diag(w)X,			wi		= (V(μi) ·gʹ(μi)2)–1
– Estimator:			Cov	β ≈		a·(XT	diag(w)X)–1,				Var	βj =		(Cov	β)jj
31
GLM:		Deviance
• Let		X		have		m		features,	of	which		k		may	have	no	effect	on		Y
– Will	“no	effect”	result	in		βj ≈	0	?				(Unlikely.)
– Estimate		βj and		Var βj then	test		βj /	(Var βj)1/2 against		N(0,	1)?
• Student’s	t-test	is	better
• Likelihood	Ratio	Test:
• Null	Hypothesis:		Y		given		X		follows	GLM	with		β1 =	…	=	βk =	0
– If NH is	true,		D is	asympt.	distributed	as		χ2 with		k		deg.	of	freedom
– If NH is false,  D → +∞  as  n → +∞
• P-value %  =  Prob[ χ2k > D ] · 100%,  where the likelihood-ratio statistic is
D  =  2 · log [ maxβ LGLM(Y | X, a;  β1, …, βk, βk+1, …, βm)  /  maxβ LGLM(Y | X, a;  0, …, 0, βk+1, …, βm) ]  >  0
32
GLM:		Deviance
• To	test	many	nested	models	(feature	subsets)	we	need	their	
maximum	likelihoods	to	compute		D
– PROBLEM:		Term		“c(yi	,	a)”		in	GLM’s		exp{(yi ·θi		– b(θi))/ a + c(yi	,	a)}
• Instead,	compute	deviance:
• “Saturated	model”	has	no	X,	no	β,	but	picks	the	best		θi for	each	
individual		yi (not	realistic	at	all,	just	convention)
– Term		“c(yi	,	a)”		is	the	same	in	both	models!
– But		“a”		has	to	be	fixed,	e.g.	to	1
• Deviance	itself	is	used	for	goodness	of	fit	tests,	too
• Deviance:   D  =  2 · log [ maxΘ LGLM(Y | Θ, a :  saturated model)  /  maxβ LGLM(Y | X, a;  β1, …, βk, …, βm) ]  >  0
33
Survival Analysis
• Given:
– Survival data from individuals as (time, event)
– Categorical/continuous features for each individual
• Estimate:
– Probability of survival to a future time
– Rate of hazard at a given time
• Example:  [Figure: timeline of patients 1–5 over times 1–9;  † = death from specific cancer,  ? = lost to follow-up]
34
Cox Regression
• Semi-parametric model (“robust”);  most commonly used
• Handles categorical and continuous data
• Handles (right/left/interval) censored data
• [Equation figure:  baseline hazard,  covariates,  coefficients]
35
36
Event	Hazard	Rate
• Symptom	events	E follow	a	Poisson	process:
[Figure: timeline with symptom events E1, E2, E3, E4, the hazard function, and death]
• Hazard function = Poisson rate:
h(t; state)  =  lim Δt→0  Prob[ E ∈ [t, t + Δt) | state ] / Δt
• Given state and hazard, we could compute the probability of the observed event count:
Prob[ K events in t1 ≤ t ≤ t2 ]  =  H^K · e^(–H) / K! ,   where   H  =  ∫ h(t; state(t)) dt  over  t1 ≤ t ≤ t2
37
Cox	Proportional	Hazards
• Assume	that	exactly	1	patient	gets	event	E at	time	t
• The probability that it is Patient #i is the hazard ratio:
Prob[ #i gets E ]  =  h(t; si)  /  ∑ j≤n h(t; sj)
• Cox assumption:   h(t; state)  =  h0(t) · Λ(state)  =  h0(t) · exp(λT s)
• Time confounder cancels out!
[Figure: patients #1, …, #n with their states s1, …, sn (si = statei) at time t]
38
Cox	“Partial”	Likelihood
• Cox	“partial”	likelihood	for	the	dataset	is	a	product	over	all	E:
[Figure: patients #1, …, #n on a shared timeline, with event times marked]
LCox(λ)  =  Prob[ all E ]  =  ∏ t:E  h(t; swho(t)) / ∑ j≤n h(t; sj)  =  ∏ t:E  exp(λT swho(t)) / ∑ j≤n exp(λT sj)
( who(t)  =  the patient who gets the event at time t )
Cox Regression
• Semi-parametric model (“robust”);  most commonly used
• Handles categorical and continuous data
• Handles (right/left/interval) censored data
• Cox regression in DML:
– Fitting parameters via negative partial log-likelihood
– Method: trust region Newton with conjugate gradient
– Inverting the Hessian using block Cholesky for computing std. errors of the betas
– Similar features as coxph() in R, e.g. stratification, frequency weights, offsets, goodness-of-fit testing, recurrent event analysis
• [Equation figure:  baseline hazard,  covariates,  coefficients]
39
BACK-UP
40
Kaplan-Meier Estimator
41
Kaplan-Meier Estimator
42
Confidence	Intervals
• Definition	of	Confidence	Interval;	p-value
• Likelihood	ratio	test
• How	to	use	it	for	confidence	interval
• Degrees	of	freedom
43