Regression	in	SystemML
Alexandre	Evfimievski
1
Linear	Regression
• INPUT:		Records	(x1,	y1),	(x2,	y2),	…,	(xn,	yn)
– Each	xi is	m-dimensional:	xi1,	xi2,	…,	xim
– Each	yi is	1-dimensional
• Want	to	approximate	yi as	a	linear	combination	of	xi-entries
– yi ≈		β1xi1 +	β2xi2 +	…	+	βmxim
– Case	m	=	1:			yi ≈	β1xi1 (	Note:		x	=	0		maps	to		y	=	0	)
• Intercept:		a	“free	parameter”	for	default	value	of	yi
– yi ≈		β1xi1 +	β2xi2 +	…	+	βmxim +	βm+1
– Case	m	=	1:			yi ≈	β1xi1 +	β2
• Matrix	notation:		Y	≈	Xβ,		or		Y	≈	(X |1) β if	with	intercept
– X		is		n	× m,		Y		is		n	× 1,		β is		m	× 1		or		(m+1)	× 1
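• A minimal DML sketch (added; not on the original slide) of adding the intercept by appending a column of ones to X, assuming cbind is available:
n = nrow (X);
X_ext = cbind (X, matrix (1, rows = n, cols = 1));   # Y is then approximated by X_ext %*% beta, with beta of size (m+1) x 1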
2
Linear	Regression:	Least	Squares
• How	to	aggregate	errors:		yi – (β1xi1 +	β2xi2 +	…	+	βmxim)		?
– What’s	worse:		many	small	errors,		or	a	few	big	errors?
• Sum	of	squares:		∑i≤n (yi – (β1xi1 +	β2xi2 +	…	+	βmxim))2 →		min
– A	few	big	errors	are	much	worse!		We	square	them!
• Matrix	notation:		(Y	– Xβ)T (Y	– Xβ)		→		min
• Good	news:		easy	to	solve	and	find	the	β’s
• Bad	news:		too	sensitive	to	outliers!
3
Linear	Regression:	Direct	Solve
• (Y	– Xβ)T (Y	– Xβ)		→		min
• YT	Y		– YT	(Xβ)		– (Xβ)T	Y		+		(Xβ)T	(Xβ)		→		min
• ½	βT	(XTX) β – βT	(XTY)		→		min
• Take	the	gradient	and	set	it	to	0:			(XTX) β – (XTY)		=		0
• Linear	equation:		(XTX) β =		XTY;			Solution:		β =		(XTX)–1	(XTY)
A = t(X) %*% X;
b = t(X) %*% y;
. . .
. . .
beta_unscaled = solve (A, b);
4
Computation		of		XTX
• Input	(n	× m)-matrix		X		is	often	huge	and	sparse
– Rows  X[i, ]  make up  n  records,  often  n >> 10⁶
– Columns		X[,	j]		are	the	features
• Matrix		XTX		is	(m	× m)	and	dense
– Cells:		(XTX)	[j1,	j2]		=		∑ i≤n X[i,	j1]	*	X[i,	j2]
– Part	of	covariance	between	features		#	j1 and		#	j2 across	all	records
– m		could	be	small	or	large
• If	m	≤	1000,		XTX		is	small	and	“direct	solve”	is	efficient…
– …	as	long	as		XTX		is	computed	the	right	way!
– …	and	as	long	as		XTX		is	invertible	(no	linearly	dependent	features)
5
Computation		of		XTX
• Naïve	computation:
a) Read	X	into	memory
b) Copy	it	and	rearrange	cells	into	the	transpose
c) Multiply	two	huge	matrices,	XT and	X
• There	is	a	better	way:		XTX		=		∑i≤n X[i,	]T X[i,	]				(outer	product)
– For	all		i =	1,	…,	n		in	parallel:
a) Read	one	row		X[i,	]
b) Compute	(m	× m)-matrix:		Mi	[j1,	j2]		=		X[i,	j1]	*	X[i,	j2]
c) Aggregate:		M	=	M	+	Mi
• Extends	to		(XTX) v		and		XT	diag(w) X,		used	in	other	scripts:
– (XTX) v		=		∑i≤n (∑ j≤m X[i,	j]v[j]) *	X[i,	]T
– XT	diag(w)X		=	∑ i≤n wi *	X[i,	]T X[i,	]
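• A hedged DML sketch (added) of the same products written at the matrix level, so they can be computed without materializing XTX or diag(w); v and w are assumed to be m × 1 and n × 1 vectors:
XtX_v = t(X) %*% (X %*% v);    # (X^T X) v  via two matrix-vector products
XtWX  = t(X) %*% (X * w);      # X^T diag(w) X:  scale each row X[i, ] by w[i], then multiply by t(X)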
6
Conjugate	Gradient
• What	if		XTX		is	too	large,	m	>>	1000?
– Dense		XTX		may	take	far	more	memory	than	sparse		X
• Full		XTX		not	needed	to	solve		(XTX) β =		XTY
– Use	iterative	method
– Only	evaluate		(XTX)v		for	certain	vectors		v
• Ex.:	Gradient	Descent	for		f (β)		=		½	βT	(XTX) β – βT	(XTY)	
– Start	with	any		β =	β0
– Take	the	gradient:		r		=		df(β)		=		(XTX) β – (XTY)								(also,	residual)
– Find	number		a to	minimize		f(β + a ·r):			a =		– (rT	r)	/	(rT	XTX r)
– Update:		βnew		←		β + a·r
• But	gradient	is	too	local
– And	“forgetful”
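• The gradient-descent step above, as a hedged DML sketch (added; X, y, beta assumed given):
r = t(X) %*% (X %*% beta) - t(X) %*% y;    # gradient = residual  r = (X^T X) beta - X^T y
a = - sum (r ^ 2) / sum ((X %*% r) ^ 2);   # step size  a = - (r^T r) / (r^T (X^T X) r)  minimizes  f(beta + a * r)
beta = beta + a * r;                       # update  beta_new <- beta + a * r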
7
Conjugate	Gradient
• PROBLEM:		Gradient	takes	a	very	similar	direction	many	times
• Enforce	orthogonality	to	prior	directions?
– Take	the	gradient:		r		=		(XTX) β – (XTY)
– Subtract	prior	directions:		p(k) =		r		– λ1p(1) – …	– λk-1p(k-1)
• Pick		λi to	ensure		(p(k) ·	p(i))		=		0			???
– Find	number		a(k) to	minimize		f(β + a(k)	·p(k)),		etc	…
• STILL,	PROBLEMS:
– Value		a(k) does	NOT	minimize		f(a(1)	·p(1)		+	…	+ a(k)	·p(k)		+	…	+ a(m)	·p(m))
– Keep	all	prior	directions		p(1),	p(2),	…	,	p(k)	?		That’s	a	lot!
• SOLUTION:		Enforce	Conjugacy
– Conjugate	vectors:			uT	(XTX)	v		=		0,		instead	of		uT	v		=		0
• Matrix		XTX		acts	as	the	“metric”	in	distorted	space
– This	does	minimize		f(a(1)	·p(1)		+	…	+ a(k)	·p(k)		+	…	+ a(m)	·p(m))
• And,		only	need		p(k-1) and		r(k) to	compute		p(k)
8
Conjugate	Gradient
• Algorithm,	step	by	step
i = 0; beta = matrix (0, ...);              # initially beta = 0
r = - t(X) %*% y;                           # residual & gradient:  r = (X^T X) beta - X^T y
p = - r;                                    # direction for beta: the negative gradient
norm_r2 = sum (r ^ 2);                      # norm of the residual error = r^T r
norm_r2_target = norm_r2 * tolerance ^ 2;   # desired norm of the residual error
while (i < mi & norm_r2 > norm_r2_target)
{   # here p is the next direction for beta
    q = t(X) %*% (X %*% p) + lambda * p;    # q = (X^T X) p  (plus lambda * p for L2 regularization)
    a = norm_r2 / sum (p * q);              # a = r^T r / p^T (X^T X) p  minimizes  f(beta + a * p)
    beta = beta + a * p;                    # update:  beta_new <- beta + a * p
    r = r + a * q;                          # r_new <- (X^T X)(beta + a * p) - X^T y  =  r + a * (X^T X) p
    old_norm_r2 = norm_r2;
    norm_r2 = sum (r ^ 2);                  # update the norm of the residual error = r^T r
    p = -r + (norm_r2 / old_norm_r2) * p;   # update direction: (1) take the negative gradient;
                                            # (2) enforce conjugacy with the previous direction;
    i = i + 1;                              # conjugacy to all older directions is then automatic!
}
9
Degeneracy	and	Regularization
• PROBLEM:		What	if		X		has	linearly	dependent	columns?
– Cause:		recoding	categorical	features,	adding	composite	features
– Then		XTX		is	not	a	“metric”:		exists		ǁpǁ	>	0		such	that		pT	(XTX)	p		=		0
– In CG step  a = rT r / pT (XTX) p :  Division By Zero!
• In	fact,	then		Least	Squares		has		∞		solutions
– Most	of	them	have		HUGE		β-values
• Regularization:		Penalize		β with	larger	values
– L2-Regularization:			(Y	– Xβ)T (Y	– Xβ)		+		λ·βT	β →		min
– Replace		XTX		with		XTX		+		λI
– Pick		λ <<		diag(XTX),		refine	by	cross-validation
– Do	NOT	regularize	intercept
• CG: q = t(X) %*% (X %*% p) + lambda * p;
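• A hedged DML sketch (added; variable names reg, m_ext, intercept_status are assumptions) of building the regularization vector so that the intercept is not regularized:
lambda = matrix (reg, rows = m_ext, cols = 1);   # one lambda per column of (X | 1)
if (intercept_status != 0) {
lambda [m_ext, 1] = 0;                           # last column is the intercept: no penalty
}
# direct solve:        A = t(X) %*% X + diag (lambda);
# conjugate gradient:  q = t(X) %*% (X %*% p) + lambda * p;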
10
Shifting	and	Scaling	X
• PROBLEM:		Features	have	vastly	different	range:
– Examples:		[0,	1];		[2010,	2015];		[$0.01,		$1	Billion]
• Each		βi in		Y	≈	Xβ has	different	size	&	accuracy?
– Regularization			λ·βT	β also	range-dependent?
• SOLUTION:		Scale	&	shift	features	to	mean	=	0,	variance	=	1
– Needs	intercept:		Y	≈	(X| 1)β
– Equivalently:		(Xnew |1)		=		(X |1)		%*% SST				“Shift-Scale	Transform”
• BUT:		Sparse		X		becomes		Dense		Xnew …
• SOLUTION:			(Xnew |1)	 %*% M		=		(X |1)	 %*% (SST	 %*% M)
– Extends	to		XTX		and	other	X-products
– Further	optimization:		SST		has	special	shape
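• A hedged DML sketch (added; not the exact script code) of the shift and scale vectors used when icpt = 2, up to the 1/n vs. 1/(n – 1) variance convention:
col_mean = t(colSums (X)) / n;                   # m x 1 vector of column means
col_var  = t(colSums (X ^ 2)) / n - col_mean ^ 2;
scale_X  = 1.0 / sqrt (col_var);                 # rescale each column to variance 1 ...
shift_X  = - col_mean * scale_X;                 # ... then shift it to mean 0
# so that  X_new [i, j] = (X [i, j] - col_mean [j]) * scale_X [j] = X [i, j] * scale_X [j] + shift_X [j]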
11
Shifting	and	Scaling	X
– Linear Regression Direct Solve code snippet example:
A = t(X) %*% X;
b = t(X) %*% y;
if (intercept_status == 2) {
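# (added comment) with icpt = 2, fold the shift-scale transform into the normal equations:
# A := SST^T %*% A %*% SST  (applied one side at a time below)  and  b := SST^T %*% b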
A = t(diag (scale_X) %*% A + shift_X %*% A [m_ext, ]);
A = diag (scale_X) %*% A + shift_X %*% A [m_ext, ];
b = diag (scale_X) %*% b + shift_X %*% b [m_ext, ];
}
A = A + diag (lambda);
beta_unscaled = solve (A, b);
if (intercept_status == 2) {
beta = scale_X * beta_unscaled;
beta [m_ext, ] = beta [m_ext, ] + t(shift_X) %*% beta_unscaled;
} else {
beta = beta_unscaled;
}
12
Regression	in	Statistics
• Model:		Y	=	Xβ* +	ε where		ε is	a	random	vector
– There	exists	a	“true”	β*
– Each		εi is	Gaussian	with	mean		μi =	Xi	β* and	variance		σ2
• Likelihood	maximization	to	estimate		β*
– Likelihood:		ℓ(Y	|	X,	β,	σ)		=		∏i ≤	n C(σ)·exp(– (yi – Xi	β)2 /	2σ2)
– Log	ℓ(Y	|	X,	β,	σ)		=		n·c(σ)		– ∑i ≤	n (yi – Xi	β)2 /	2σ2
– Maximum	likelihood	over	β =		Least	Squares
• Why	do	we	need	statistical	view?
– Confidence	intervals	for	parameters
– Goodness	of	fit	tests
– Generalizations:	replace	Gaussian	with	another	distribution
13
Maximum	Likelihood	Estimator
• In	each		(xi	,	yi)		let		yi have	distribution		ℓ(yi |	xi	,	β,	φ)
– Records	are	mutually	independent	for		i =	1,	…,	n
• Estimator	for		β is	a	function		f(X,	Y)
– Y	is	random		→			f(X,	Y)	random
– Unbiased	estimator:		for	all	β,	mean		E	f(X,	Y)	=	β
• Maximum	likelihood	estimator
– MLE (X,	Y)		=		argmaxβ ∏i ≤	n ℓ(yi |	xi	,	β,	φ)
– Asymptotically	unbiased:		E	MLE (X,	Y)	→	β as		n	→	∞
• Cramér-Rao	Bound
– For	unbiased	estimators,		Var f(X,	Y)		≥		FI(X,	β,	φ) –1
– Fisher	information:		FI(X,	β,	φ)		=		– EY Hessianβ log	ℓ(Y| X,	β,	φ)
– For	MLE:		Var (MLE (X,	Y)) →		FI(X,	β,	φ)–1 as		n	→	∞
14
Variance	of	M.L.E.
• Cramér-Rao	Bound	is	a	simple	way	to	estimate	variance	of	
predicted	parameters	(for	large	n):
1. Maximize		log	ℓ(Y |X,	β,	φ)		to	estimate		β
2. Compute	the	Hessian	(2nd derivatives)	of		log	ℓ(Y |X,	β,	φ)
3. Compute	“expected”	Hessian:		FI		=		– EY Hessian
4. Invert		FI		as	a	matrix:		get		FI–1
5. Use		FI–1 as	approx.	covariance	matrix	for	the	estimated		β
• For	linear	regression:
– Log	ℓ(Y	|	X,	β,	σ)		=		n·c(σ)		– ∑i ≤	n (yi – Xi	β)2 /	2σ2
– Hessian		=		–(1/σ2)·XTX;				FI		=		(1/σ2)·XTX
– Cov β ≈		σ2 ·(XTX) –1	;				Var βj ≈		σ2 ·diag((XTX) –1) j
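• A hedged DML sketch (added; names sigma2, std_err_beta are assumptions) of the resulting standard errors, feasible when m is small:
sigma2 = sum ((y - X %*% beta) ^ 2) / (n - m - 1);   # estimate of the residual variance
XtX_inv = inv (t(X) %*% X);
std_err_beta = sqrt (sigma2 * diag (XtX_inv));       # approx. std. error of each beta[j]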
15
Variance	of		Y		given		X
• MLE	for	variance	of	Y		=		1/n	·	∑ i ≤	n (yi – y avg)2
– To	make	it	unbiased,	replace		1/n		with		1/(n	– 1)
• Variance	of		ε in		Y	=	Xβ* +	ε is	residual	variance
– Estimator	for	Var(ε)		=		1/(n	– m	– 1)	·	∑i ≤	n (yi – Xi	β)2
• Good	regression	must	have:		Var(ε)		<<		Var(Y)
– “Explained”	variance		=		Var(Y)		– Var(ε)
• R-squared:		estimate		1	– Var(ε)	/	Var(Y)		to	test	fitness:
– R²plain = 1 – (∑i≤n (yi – Xi β)²) / (∑i≤n (yi – yavg)²)
– R²adj.  = 1 – (∑i≤n (yi – Xi β)²) / (∑i≤n (yi – yavg)²) · (n – 1) / (n – m – 1)
• Pearson	residual:		ri =		(yi – Xi	β)	/	Var(ε)1/2
– Should	be	approximately	Gaussian	with	mean	0	and	variance	1
– Can	use	in	another	fitness	test		(more	on	tests	later)
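• A hedged DML sketch (added) of the R-squared statistics above; n, m, beta are assumed given:
ss_res = sum ((y - X %*% beta) ^ 2);
ss_tot = sum ((y - sum (y) / n) ^ 2);
R2_plain = 1 - ss_res / ss_tot;
R2_adj   = 1 - (ss_res / ss_tot) * (n - 1) / (n - m - 1);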
16
LinReg	Scripts:	Inputs
# INPUT PARAMETERS:
# --------------------------------------------------------------------------------------------
# NAME TYPE DEFAULT MEANING
# --------------------------------------------------------------------------------------------
# X String --- Location (on HDFS) to read the matrix X of feature vectors
# Y String --- Location (on HDFS) to read the 1-column matrix Y of response values
# B String --- Location to store estimated regression parameters (the betas)
# O String " " Location to write the printed statistics; by default is standard output
# Log String " " Location to write per-iteration variables for log/debugging purposes
# icpt Int 0 Intercept presence, shifting and rescaling the columns of X:
# 0 = no intercept, no shifting, no rescaling;
# 1 = add intercept, but neither shift nor rescale X;
# 2 = add intercept, shift & rescale X columns to mean = 0, variance = 1
# reg Double 0.000001 Regularization constant (lambda) for L2-regularization; set to nonzero
# for highly dependent/sparse/numerous features
# tol Double 0.000001 Tolerance (epsilon); conjugate gradient procedure terminates early if
# L2 norm of the beta-residual is less than tolerance * its initial norm
# maxi Int 0 Maximum number of conjugate gradient iterations, 0 = no maximum
# fmt String "text" Matrix output format for B (the betas) only, usually "text" or "csv"
# --------------------------------------------------------------------------------------------
# OUTPUT: Matrix of regression parameters (the betas) and its size depend on icpt input value:
# OUTPUT SIZE: OUTPUT CONTENTS: HOW TO PREDICT Y FROM X AND B:
# icpt=0: ncol(X) x 1 Betas for X only Y ~ X %*% B[1:ncol(X), 1], or just X %*% B
# icpt=1: ncol(X)+1 x 1 Betas for X and intercept Y ~ X %*% B[1:ncol(X), 1] + B[ncol(X)+1, 1]
# icpt=2: ncol(X)+1 x 2 Col.1: betas for X & intercept Y ~ X %*% B[1:ncol(X), 1] + B[ncol(X)+1, 1]
# Col.2: betas for shifted/rescaled X and intercept
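# USAGE EXAMPLE (added): a hypothetical invocation of the CG script, assuming the hadoop-based
# runner and the script name LinearRegCG.dml; all paths are placeholders:
#   hadoop jar SystemML.jar -f LinearRegCG.dml -nvargs X=/tmp/X.mtx Y=/tmp/Y.mtx B=/tmp/betas.csv
#                           O=/tmp/stats.csv icpt=2 reg=0.000001 tol=0.000001 maxi=0 fmt=csv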
17
LinReg	Scripts:	Outputs
# In addition, some regression statistics are provided in CSV format, one comma-separated
# name-value pair per each line, as follows:
#
# NAME MEANING
# -------------------------------------------------------------------------------------
# AVG_TOT_Y Average of the response value Y
# STDEV_TOT_Y Standard Deviation of the response value Y
# AVG_RES_Y Average of the residual Y - pred(Y|X), i.e. residual bias
# STDEV_RES_Y Standard Deviation of the residual Y - pred(Y|X)
# DISPERSION GLM-style dispersion, i.e. residual sum of squares / # deg. fr.
# PLAIN_R2 Plain R^2 of residual with bias included vs. total average
# ADJUSTED_R2 Adjusted R^2 of residual with bias included vs. total average
# PLAIN_R2_NOBIAS Plain R^2 of residual with bias subtracted vs. total average
# ADJUSTED_R2_NOBIAS Adjusted R^2 of residual with bias subtracted vs. total average
# PLAIN_R2_VS_0 * Plain R^2 of residual with bias included vs. zero constant
# ADJUSTED_R2_VS_0 * Adjusted R^2 of residual with bias included vs. zero constant
# -------------------------------------------------------------------------------------
# * The last two statistics are only printed if there is no intercept (icpt=0)
#
# The Log file, when requested, contains the following per-iteration variables in CSV
# format, each line containing triple (NAME, ITERATION, VALUE) with ITERATION = 0 for
# initial values:
#
# NAME MEANING
# -------------------------------------------------------------------------------------
# CG_RESIDUAL_NORM L2-norm of Conj.Grad.residual, which is A %*% beta - t(X) %*% y
# where A = t(X) %*% X + diag (lambda), or a similar quantity
# CG_RESIDUAL_RATIO Ratio of current L2-norm of Conj.Grad.residual over the initial
# -------------------------------------------------------------------------------------
18
Caveats
• Overfitting:		β reflect	individual	records	in		X,	not	distribution
– Typically,	too	few	records	(small	n)	or	too	many	features	(large	m)
– To	detect,	use	cross-validation
– To	mitigate,	select	fewer	features;		regularization	may	help	too
• Outliers:		Some	records	in	X	are	highly	abnormal
– They	badly	violate	distribution,	or	have	very	large	cell-values
– Check MIN and MAX of  Y,  X-columns,  Xi β,  ri² = (yi – Xi β)² / Var(ε)
– To	mitigate,	remove	outliers,	or	change	distribution	or	link	function
• Interpolation	vs.	extrapolation
– A	model	trained	on	one	kind	of	data	may	not	carry	over	to	another	
kind	of	data;		the	past	may	not	predict	the	future
– Great	research	topic!
19
Generalized	Linear	Models
• Linear	Regression:		Y = Xβ* +	ε
– Each		yi is	Normal(μi ,	σ2)		where	mean		μi =	Xi	β*
– Variance(yi)		=		σ2 =		constant
• Logistic	Regression:
– Each		yi is	Bernoulli(μi)		where	mean		μi =	1	/	(1	+	exp	(– Xi	β*))
– Prob [yi =	1]		=		μi ,		Prob [yi =	0]		=		1	– μi ,		mean		=		probability	of	1
– Variance(yi)		=		μi (1	– μi)
• Poisson	Regression:
– Each		yi is	Poisson(μi)		where	mean		μi =	exp(Xi	β*)
– Prob [yi =	k]		=		(μi)k	exp(– μi)/ k!			for		k	=	0,	1,	2,	…
– Variance(yi)		=		μi
• Only in Linear Regression do we add an error term  εi  to the mean  μi
20
Generalized	Linear	Models
• GLM	Regression:
– Each		yi has	distribution		=		exp{(yi ·θi – b(θi))/a + c(yi ,	a)}
– Canonical	parameter θi represents	the	mean:			μi =		bʹ(θi)
– Link	function connects		μi and		Xi	β*	:			Xi	β* =		g(μi),			μi =		g –1	(Xi	β*)
– Variance(yi)		=		a ·bʺ(θi)		
• Example:		Linear	Regression	as	GLM
– C(σ)·exp(– (yi – Xi	β)2 /	2σ2)		=		exp{(yi ·θi – b(θi))/a + c(yi ,	a)}
– θi =		μi =		Xi	β;				b(θi)		=		(Xi	β)2	/ 2;				a		=		σ2 =		variance
• Link function = identity;    c(yi , a) = – yi²/2σ² + log C(σ)
• Example:		Logistic	Regression	as	GLM
– (μi )y[i] (1	– μi)1	– y[i] =		exp{yi ·	log(μi)		– yi ·	log(1	– μi)		+		log(1	– μi)}
=		exp{(yi ·θi – b(θi))/ a + c(yi ,	a)}
– θi =		log(μi / (1	– μi))		=		Xi	β;				b(θi)		=		– log(1	– μi)		=		log(1	+	exp(θi))
• Link	function		=		log (μ / (1	– μ))	;				Variance		=		μ(1	– μ)	;				a	=	1
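• Example (added for completeness): Poisson Regression as GLM
– (μi)^y[i] exp(– μi) / y[i]!  =  exp{yi · log(μi) – μi – log(yi!)}  =  exp{(yi ·θi – b(θi))/a + c(yi , a)}
– θi = log(μi) = Xi β;    b(θi) = exp(θi) = μi;    a = 1;    c(yi , a) = – log(yi!)
• Link function = log(μ)  (canonical);    Variance = μi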
21
Generalized	Linear	Models
• GLM	Regression:
– Each		yi has	distribution		=		exp{(yi ·θi		– b(θi))/a + c(yi	,	a)}
– Canonical	parameter θi represents	the	mean:			μi =		bʹ(θi)
– Link	function connects		μi and		Xi	β*	:			Xi	β* =		g(μi),			μi =		g –1	(Xi	β*)
– Variance(yi)		=		a ·bʺ(θi)
• Why	θi	?		What	is	b(θi)?
– θi makes	formulas	simpler,	stands	for		μi (no	big	deal)
– b(θi)		defines	what	distribution	it	is:		linear,		logistic,		Poisson,		etc.
– b(θi)		connects	mean	with	variance:			Var(yi)		=		a·bʺ(θi),			μi =		bʹ(θi)
• What	is	link	function?
– You	choose	it to	link		μi with	your	features		β1xi1 +	β2xi2 +	…	+	βmxim
– Additive	effects:		μi =		Xi	β;				Multiplicative	effects:		μi =		exp(Xi	β)
Bayes	law	effects:		μi =	1	/	(1	+	exp	(– Xi	β));				Inverse:		μi =	1	/	(Xi	β)
– Xi	β has	range	(– ∞,	+∞),		but		μi may	range	in		[0,	1],		[0,	+∞)		etc.
22
GLMs	We	Support
• We	specify	GLM	by:
– Mean	to	variance	connection
– Link	function	(mean	to	feature	sum	connection)
• Mean-to-variance	for	common	distributions:
– Var (yi) = a · (μi)^0 = σ² :   Linear / Gaussian
– Var (yi) = a · μi (1 – μi) :   Logistic / Binomial
– Var (yi) = a · (μi)^1 :   Poisson
– Var (yi) = a · (μi)^2 :   Gamma
– Var (yi) = a · (μi)^3 :   Inverse Gaussian
• We support two types:  Power and Binomial
– Var (yi) = a · (μi)^α :   Power, for any α
– Var (yi) = a · μi (1 – μi) :   Binomial
23
GLMs	We	Support
• We	specify	GLM	by:
– Mean	to	variance	connection
– Link	function	(mean	to	feature	sum	connection)
Supported	link	functions
• Power:  Xi β = (μi)^s,  where  s = 0  stands for  Xi β = log (μi)
– Examples:		identity,		inverse,		log,		square	root
• Link	functions	used	in	binomial	/	logistic	regression:
– Logit,		Probit,		Cloglog,		Cauchit
– Link		Xi	β-range		(– ∞,	+∞)		with		μi-range		(0,	1)
– Differ	in	tail	behavior
• Canonical	link	function:
– Makes		Xi	β =		the	canonical	parameter θi	,		i.e.	sets		μi =		bʹ(Xi	β)
– Power link  Xi β = (μi)^(1 – α)  if  Var = a·(μi)^α ;   Logit link for binomial
24
GLM	Script	Inputs
# NAME TYPE DEFAULT MEANING
# ---------------------------------------------------------------------------------------------
# X String --- Location to read the matrix X of feature vectors
# Y String --- Location to read response matrix Y with either 1 or 2 columns:
# if dfam = 2, Y is 1-column Bernoulli or 2-column Binomial (#pos, #neg)
# B String --- Location to store estimated regression parameters (the betas)
# fmt String "text" The betas matrix output format, such as "text" or "csv"
# O String " " Location to write the printed statistics; by default is standard output
# Log String " " Location to write per-iteration variables for log/debugging purposes
# dfam Int 1 Distribution family code: 1 = Power, 2 = Binomial
# vpow Double 0.0 Power for Variance defined as (mean)^power (ignored if dfam != 1):
# 0.0 = Gaussian, 1.0 = Poisson, 2.0 = Gamma, 3.0 = Inverse Gaussian
# link Int 0 Link function code: 0 = canonical (depends on distribution),
# 1 = Power, 2 = Logit, 3 = Probit, 4 = Cloglog, 5 = Cauchit
# lpow Double 1.0 Power for Link function defined as (mean)^power (ignored if link != 1):
# -2.0 = 1/mu^2, -1.0 = reciprocal, 0.0 = log, 0.5 = sqrt, 1.0 = identity
# yneg Double 0.0 Response value for Bernoulli "No" label, usually 0.0 or -1.0
# icpt Int 0 Intercept presence, X columns shifting and rescaling:
# 0 = no intercept, no shifting, no rescaling;
# 1 = add intercept, but neither shift nor rescale X;
# 2 = add intercept, shift & rescale X columns to mean = 0, variance = 1
# reg Double 0.0 Regularization parameter (lambda) for L2 regularization
# tol Double 0.000001 Tolerance (epsilon)
# disp Double 0.0 (Over-)dispersion value, or 0.0 to estimate it from data
# moi Int 200 Maximum number of outer (Newton / Fisher Scoring) iterations
# mii Int 0 Maximum number of inner (Conjugate Gradient) iterations, 0 = no maximum
# ---------------------------------------------------------------------------------------------
# OUTPUT: Matrix beta, whose size depends on icpt:
# icpt=0: ncol(X) x 1; icpt=1: (ncol(X) + 1) x 1; icpt=2: (ncol(X) + 1) x 2
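# USAGE EXAMPLE (added): a hypothetical invocation for Poisson regression with the log link
# (dfam=1, vpow=1.0, link=1, lpow=0.0), assuming the hadoop-based runner and the script name GLM.dml;
# all paths are placeholders:
#   hadoop jar SystemML.jar -f GLM.dml -nvargs X=/tmp/X.mtx Y=/tmp/Y.mtx B=/tmp/betas.csv
#                           dfam=1 vpow=1.0 link=1 lpow=0.0 icpt=2 reg=0.001 tol=0.000001 moi=200 mii=0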
25
GLM	Script	Outputs
# In addition, some GLM statistics are provided in CSV format, one comma-separated name-value
# pair per each line, as follows:
# -------------------------------------------------------------------------------------------
# TERMINATION_CODE A positive integer indicating success/failure as follows:
# 1 = Converged successfully; 2 = Maximum number of iterations reached;
# 3 = Input (X, Y) out of range; 4 = Distribution/link is not supported
# BETA_MIN Smallest beta value (regression coefficient), excluding the intercept
# BETA_MIN_INDEX Column index for the smallest beta value
# BETA_MAX Largest beta value (regression coefficient), excluding the intercept
# BETA_MAX_INDEX Column index for the largest beta value
# INTERCEPT Intercept value, or NaN if there is no intercept (if icpt=0)
# DISPERSION Dispersion used to scale deviance, provided as "disp" input parameter
# or estimated (same as DISPERSION_EST) if the "disp" parameter is <= 0
# DISPERSION_EST Dispersion estimated from the dataset
# DEVIANCE_UNSCALED Deviance from the saturated model, assuming dispersion == 1.0
# DEVIANCE_SCALED Deviance from the saturated model, scaled by the DISPERSION value
# -------------------------------------------------------------------------------------------
#
# The Log file, when requested, contains the following per-iteration variables in CSV format,
# each line containing triple (NAME, ITERATION, VALUE) with ITERATION = 0 for initial values:
# -------------------------------------------------------------------------------------------
# NUM_CG_ITERS Number of inner (Conj.Gradient) iterations in this outer iteration
# IS_TRUST_REACHED 1 = trust region boundary was reached, 0 = otherwise
# POINT_STEP_NORM L2-norm of iteration step from old point (i.e. "beta") to new point
# OBJECTIVE The loss function we minimize (i.e. negative partial log-likelihood)
# OBJ_DROP_REAL Reduction in the objective during this iteration, actual value
# OBJ_DROP_PRED Reduction in the objective predicted by a quadratic approximation
# OBJ_DROP_RATIO Actual-to-predicted reduction ratio, used to update the trust region
# GRADIENT_NORM L2-norm of the loss function gradient (NOTE: sometimes omitted)
# LINEAR_TERM_MIN The minimum value of X %*% beta, used to check for overflows
# LINEAR_TERM_MAX The maximum value of X %*% beta, used to check for overflows
# IS_POINT_UPDATED 1 = new point accepted; 0 = new point rejected, old point restored
# TRUST_DELTA Updated trust region size, the "delta"
# -------------------------------------------------------------------------------------------
26
GLM	Likelihood	Maximization
• 1	record:		ℓ (yi	| θi	,	a)		=		exp{(yi ·θi		– b(θi))/ a + c(yi	,	a)}
• Log	ℓ (Y |Θ,	a)		=		1/a	·	∑ i	≤	n (yi · θi		– b(θi)) +		const(Θ)
• f(β;	X,	Y)		=		– ∑i	≤	n (yi · θi		– b(θi)) +		λ/2 · βT	β →		min
– Here		θi is	a	function	of	β:			θi =		bʹ–1	(g –1	(Xi	β))
– Add	regularization	with		λ/2		to	agree	with	least	squares
– If		X		has	intercept,	do	NOT	regularize	its	β-value
• Non-quadratic;		how	to	optimize?
– Gradient	descent:		fastest	when	far	from	optimum
– Newton	method:		fastest	when	close	to	optimum
• Trust	Region	Conjugate	Gradient
– Strikes	a	good	balance	between	the	above	two
27
GLM	Likelihood	Maximization
• f(β;	X,	Y)		=		– ∑i	≤	n (yi · θi		– b(θi)) +		λ/2 · βT	β →		min
• Outer	iteration:		From		β to		βnew =		β +	z
– ∆f	(z;	β)		:=		f(β +	z;	X,	Y)		– f(β;	X,	Y)
• Use	“Fisher	Scoring”	to	approximate	Hessian	and		∆f	(z;	β)
– ∆f	(z;	β)		≈		½·zT	A z		+		GT	z,				where:
– A		=		XT	diag(w)X		+		λI and				G		=		– XT	u		+		λ·β
– Vectors		u,	w		depend	on		β via	mean-to-variance	and	link	functions
• Trust	Region:		Area		ǁzǁ2 ≤	δ where	we	trust	the	
approximation		∆f	(z;	β)		≈		½ ·zT	A z		+		GT	z
– ǁzǁ2 ≤	δ too	small		→		Gradient	Descent	step	(1	inner	iteration)
– ǁzǁ2 ≤	δ mid-size		→		Cut-off	Conjugate	Gradient	step	(2	or	more)
– ǁzǁ2 ≤	δ too	wide		→		Full	Conjugate	Gradient	step
(Note:  FI = XT diag(w) X  is the "expected" Hessian)
28
Trust	Region	Conj.	Gradient
• Code snippet for Logistic Regression:
g = - 0.5 * t(X) %*% y;  f_val = - N * log (0.5);       # gradient & objective value at beta = 0
delta = 0.5 * sqrt (D) / max (sqrt (rowSums (X ^ 2)));  # initial trust-region radius
exit_g2 = sum (g ^ 2) * tolerance ^ 2;
while (sum (g ^ 2) > exit_g2 & i < max_i)
{
    i = i + 1;
    r = g;
    r2 = sum (r ^ 2);  exit_r2 = 0.01 * r2;
    d = - r;
    z = zeros_D;  j = 0;  trust_bound_reached = FALSE;
    # inner loop: conjugate gradient on the quadratic approximation, cut off at the trust boundary
    while (r2 > exit_r2 & (! trust_bound_reached) & j < max_j)
    {
        j = j + 1;
        Hd = lambda * d + t(X) %*% diag (w) %*% X %*% d;
        c = r2 / sum (d * Hd);
        [c, trust_bound_reached] = ensure_quadratic (c, sum (d ^ 2), 2 * sum (z * d), sum (z ^ 2) - delta ^ 2);
        z = z + c * d;
        r = r + c * Hd;
        r2_new = sum (r ^ 2);
        d = - r + (r2_new / r2) * d;
        r2 = r2_new;
    }
    # outer step: evaluate the true objective at beta + z and update the trust region
    p = 1.0 / (1.0 + exp (- y * (X %*% (beta + z))));
    f_chg = - sum (log (p)) + 0.5 * lambda * sum ((beta + z) ^ 2) - f_val;
    delta = update_trust_region (delta, sqrt (sum (z ^ 2)), f_chg, sum (z * g), 0.5 * sum (z * (r + g)));
    if (f_chg < 0)
    {   # accept the new point only if the objective actually decreased
        beta = beta + z;
        f_val = f_val + f_chg;
        w = p * (1 - p);
        g = - t(X) %*% ((1 - p) * y) + lambda * beta;
    }
}
ensure_quadratic =
    function (double x, double a, double b, double c)
    return (double x_new, boolean test)
{
    test = (a * x ^ 2 + b * x + c > 0);
    if (test) {
        rad = sqrt (b ^ 2 - 4 * a * c);
        if (b >= 0) {
            x_new = - (2 * c) / (b + rad);
        } else {
            x_new = - (b - rad) / (2 * a);
        }
    } else {
        x_new = x;
    }
}
29
Trust	Region	Conj.	Gradient
• Trust region update in Logistic Regression snippet:
update_trust_region =
    function (double delta,
              double z_distance,
              double f_chg_exact,
              double f_chg_linear_approx,
              double f_chg_quadratic_approx)
    return (double delta)
{
    sigma1 = 0.25;
    sigma2 = 0.5;
    sigma3 = 4.0;
    if (f_chg_exact <= f_chg_linear_approx) {
        alpha = sigma3;
    } else {
        alpha = max (sigma1, - 0.5 * f_chg_linear_approx / (f_chg_exact - f_chg_linear_approx));
    }
    rho = f_chg_exact / f_chg_quadratic_approx;
    if (rho < 0.0001) {
        delta = min (max (alpha, sigma1) * z_distance, sigma2 * delta);
    } else { if (rho < 0.25) {
        delta = max (sigma1 * delta, min (alpha * z_distance, sigma2 * delta));
    } else { if (rho < 0.75) {
        delta = max (sigma1 * delta, min (alpha * z_distance, sigma3 * delta));
    } else {
        delta = max (delta, min (alpha * z_distance, sigma3 * delta));
    }}}
}
30
GLM:	Other	Statistics
• REMINDER:
– Each		yi has	distribution		=		exp{(yi ·θi		– b(θi))/a + c(yi	,	a)}
– Variance(yi)		=		a ·bʺ(θi)		=		a·V(μi)
• Variance	of		Y		given		X
– Estimating	the	β gives		V(μi)	=	V (g–1	(Xi	β))
– Constant		“a”		is	called	dispersion,	analogue	of		σ2
– Estimator:		a		≈		1/(n	– m)·∑ i	≤	n	(yi – μi)2	/	V(μi)
• Variance	of	parameters	β
– We	use	MLE,	hence	Cramér-Rao	formula	applies	(for	large	n)
– Fisher	Information:			FI		=		(1/a)·	XT	diag(w)X,			wi		= (V(μi) ·gʹ(μi)2)–1
– Estimator:			Cov	β ≈		a·(XT	diag(w)X)–1,				Var	βj =		(Cov	β)jj
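• A hedged DML sketch (added), using Poisson with log link as a concrete case; names mu, var_mu, w are assumptions:
mu = exp (X %*% beta);                                  # inverse link:  mu = g^{-1}(X beta) = exp(X beta)
var_mu = mu;                                            # mean-to-variance:  V(mu) = mu  for Poisson
dispersion = sum ((y - mu) ^ 2 / var_mu) / (n - m);     # estimate of the dispersion "a"
w = mu;                                                 # for log link:  1 / (V(mu) * g'(mu)^2) = mu
std_err_beta = sqrt (dispersion * diag (inv (t(X) %*% (X * w))));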
31
GLM:		Deviance
• Let		X		have		m		features,	of	which		k		may	have	no	effect	on		Y
– Will	“no	effect”	result	in		βj ≈	0	?				(Unlikely.)
– Estimate		βj and		Var βj then	test		βj /	(Var βj)1/2 against		N(0,	1)?
• Student’s	t-test	is	better
• Likelihood	Ratio	Test:
• Null	Hypothesis:		Y		given		X		follows	GLM	with		β1 =	…	=	βk =	0
– If NH is	true,		D is	asympt.	distributed	as		χ2 with		k		deg.	of	freedom
– If NH is false,  D → +∞  as  n → +∞
• P-value %  =  Prob[ χ2k > D ] · 100%
• Test statistic:
D  =  2 · log [ maxβ L_GLM(Y | X, a; β1, …, βk, βk+1, …, βm)  /  maxβ L_GLM(Y | X, a; 0, …, 0, βk+1, …, βm) ]  >  0
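• A hedged DML sketch (added) of the p-value, assuming the cdf builtin supports the chi-squared distribution; D and k are given:
p_value = 1.0 - cdf (target = D, dist = "chisq", df = k);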
32
GLM:		Deviance
• To	test	many	nested	models	(feature	subsets)	we	need	their	
maximum	likelihoods	to	compute		D
– PROBLEM:		Term		“c(yi	,	a)”		in	GLM’s		exp{(yi ·θi		– b(θi))/ a + c(yi	,	a)}
• Instead,	compute	deviance:
• “Saturated	model”	has	no	X,	no	β,	but	picks	the	best		θi for	each	
individual		yi (not	realistic	at	all,	just	convention)
– Term		“c(yi	,	a)”		is	the	same	in	both	models!
– But		“a”		has	to	be	fixed,	e.g.	to	1
• Deviance	itself	is	used	for	goodness	of	fit	tests,	too
D  =  2 · log [ maxΘ L_GLM(Y | Θ; a : saturated model)  /  maxβ L_GLM(Y | X, a; β1, …, βk, βk+1, …, βm) ]  >  0
33
Survival Analysis
• Given:
– Survival data from individuals as (time, event)
– Categorical/continuous features for each individual
• Estimate:
– Probability of survival up to a future time
– Rate of hazard at a given time
• Example (figure): timelines for Patients 1–5 along a Time axis, where † marks death from the specific cancer and ? marks a patient lost to follow-up (censored)
34
Cox Regression
• Semi-parametric model (“robust”)
• Most commonly used
• Handles categorical and continuous data
• Handles (right/left/interval) censored data
• Model (figure): h(t; s) = h0(t) · exp(λT s), with baseline hazard h0(t), covariates s, coefficients λ
35
36
Event	Hazard	Rate
• Symptom	events	E follow	a	Poisson	process:
(figure: a timeline with events E1, E2, E3, E4 followed by Death, with the hazard function plotted over time)
• Hazard function = Poisson rate:
h(t; state)  =  lim Δt→0  Prob[ E ∈ [t, t+Δt) | state ] / Δt
• Given state and hazard, we could compute the probability of the observed event count:
Prob[ K events in t1 ≤ t ≤ t2 ]  =  H^K · e^(–H) / K!,   where   H = ∫ from t1 to t2 of  h(t; state(t)) dt
37
Cox	Proportional	Hazards
• Assume	that	exactly	1	patient	gets	event	E at	time	t
• The probability that it is Patient #i is the hazard ratio:
Prob[ #i gets E ]  =  h(t; si) / ∑ j=1..n h(t; sj),    where  si = statei  of Patient #i at time t
(figure: Patients #1, #2, …, #n with states s1, s2, …, sn at time t)
• Cox assumption:   h(t; state)  =  h0(t) · Λ(state)  =  h0(t) · exp(λT s)
• Time confounder h0(t) cancels out!
38
Cox	“Partial”	Likelihood
• Cox	“partial”	likelihood	for	the	dataset	is	a	product	over	all	E:
(figure: Patients #1, #2, …, #n with their event times)
L_Cox(λ)  =  Prob[ all E ]  =  ∏ t: E  h(t; s_who(t)(t)) / ∑ j=1..n h(t; sj(t))
                            =  ∏ t: E  exp(λT s_who(t)(t)) / ∑ j=1..n exp(λT sj(t))
Cox Regression
• Semi-parametric model (“robust”)
• Most commonly used
• Handles categorical and continuous data
• Handles (right/left/interval) censored data
• Cox regression in DML:
– Fitting parameters via negative partial log-likelihood
– Method: trust region Newton with conjugate gradient
– Inverting the Hessian using block Cholesky for computing std. errors of betas
– Similar features as coxph() in R, e.g., stratification, frequency weights, offsets, goodness-of-fit testing, recurrent event analysis
• Model (figure): h(t; s) = h0(t) · exp(λT s), with baseline hazard h0(t), covariates s, coefficients λ
39
BACK-UP
40
Kaplan-Meier Estimator
41
Kaplan-Meier Estimator
42
Confidence	Intervals
• Definition	of	Confidence	Interval;	p-value
• Likelihood	ratio	test
• How	to	use	it	for	confidence	interval
• Degrees	of	freedom
43
 
BÀI TẬP BỔ TRỢ TIẾNG ANH 11 THEO ĐƠN VỊ BÀI HỌC - CẢ NĂM - CÓ FILE NGHE (GLOB...
BÀI TẬP BỔ TRỢ TIẾNG ANH 11 THEO ĐƠN VỊ BÀI HỌC - CẢ NĂM - CÓ FILE NGHE (GLOB...BÀI TẬP BỔ TRỢ TIẾNG ANH 11 THEO ĐƠN VỊ BÀI HỌC - CẢ NĂM - CÓ FILE NGHE (GLOB...
BÀI TẬP BỔ TRỢ TIẾNG ANH 11 THEO ĐƠN VỊ BÀI HỌC - CẢ NĂM - CÓ FILE NGHE (GLOB...
 
Mythology Quiz-4th April 2024, Quiz Club NITW
Mythology Quiz-4th April 2024, Quiz Club NITWMythology Quiz-4th April 2024, Quiz Club NITW
Mythology Quiz-4th April 2024, Quiz Club NITW
 
Employablity presentation and Future Career Plan.pptx
Employablity presentation and Future Career Plan.pptxEmployablity presentation and Future Career Plan.pptx
Employablity presentation and Future Career Plan.pptx
 
Plagiarism,forms,understand about plagiarism,avoid plagiarism,key significanc...
Plagiarism,forms,understand about plagiarism,avoid plagiarism,key significanc...Plagiarism,forms,understand about plagiarism,avoid plagiarism,key significanc...
Plagiarism,forms,understand about plagiarism,avoid plagiarism,key significanc...
 
Spearman's correlation,Formula,Advantages,
Spearman's correlation,Formula,Advantages,Spearman's correlation,Formula,Advantages,
Spearman's correlation,Formula,Advantages,
 
Narcotic and Non Narcotic Analgesic..pdf
Narcotic and Non Narcotic Analgesic..pdfNarcotic and Non Narcotic Analgesic..pdf
Narcotic and Non Narcotic Analgesic..pdf
 
How to Uninstall a Module in Odoo 17 Using Command Line
How to Uninstall a Module in Odoo 17 Using Command LineHow to Uninstall a Module in Odoo 17 Using Command Line
How to Uninstall a Module in Odoo 17 Using Command Line
 
ClimART Action | eTwinning Project
ClimART Action    |    eTwinning ProjectClimART Action    |    eTwinning Project
ClimART Action | eTwinning Project
 
BIOCHEMISTRY-CARBOHYDRATE METABOLISM CHAPTER 2.pptx
BIOCHEMISTRY-CARBOHYDRATE METABOLISM CHAPTER 2.pptxBIOCHEMISTRY-CARBOHYDRATE METABOLISM CHAPTER 2.pptx
BIOCHEMISTRY-CARBOHYDRATE METABOLISM CHAPTER 2.pptx
 
Scientific Writing :Research Discourse
Scientific  Writing :Research  DiscourseScientific  Writing :Research  Discourse
Scientific Writing :Research Discourse
 
Congestive Cardiac Failure..presentation
Congestive Cardiac Failure..presentationCongestive Cardiac Failure..presentation
Congestive Cardiac Failure..presentation
 
Team Lead Succeed – Helping you and your team achieve high-performance teamwo...
Team Lead Succeed – Helping you and your team achieve high-performance teamwo...Team Lead Succeed – Helping you and your team achieve high-performance teamwo...
Team Lead Succeed – Helping you and your team achieve high-performance teamwo...
 
6 ways Samsung’s Interactive Display powered by Android changes the classroom
6 ways Samsung’s Interactive Display powered by Android changes the classroom6 ways Samsung’s Interactive Display powered by Android changes the classroom
6 ways Samsung’s Interactive Display powered by Android changes the classroom
 
Grade Three -ELLNA-REVIEWER-ENGLISH.pptx
Grade Three -ELLNA-REVIEWER-ENGLISH.pptxGrade Three -ELLNA-REVIEWER-ENGLISH.pptx
Grade Three -ELLNA-REVIEWER-ENGLISH.pptx
 
Comparative Literature in India by Amiya dev.pptx
Comparative Literature in India by Amiya dev.pptxComparative Literature in India by Amiya dev.pptx
Comparative Literature in India by Amiya dev.pptx
 
Tree View Decoration Attribute in the Odoo 17
Tree View Decoration Attribute in the Odoo 17Tree View Decoration Attribute in the Odoo 17
Tree View Decoration Attribute in the Odoo 17
 
Healthy Minds, Flourishing Lives: A Philosophical Approach to Mental Health a...
Healthy Minds, Flourishing Lives: A Philosophical Approach to Mental Health a...Healthy Minds, Flourishing Lives: A Philosophical Approach to Mental Health a...
Healthy Minds, Flourishing Lives: A Philosophical Approach to Mental Health a...
 
4.11.24 Poverty and Inequality in America.pptx
4.11.24 Poverty and Inequality in America.pptx4.11.24 Poverty and Inequality in America.pptx
4.11.24 Poverty and Inequality in America.pptx
 

Regression using Apache SystemML by Alexandre V Evfimievski

• SOLUTION: Enforce conjugacy
– Conjugate vectors: uT (XTX) v = 0, instead of uT v = 0
– Matrix XTX acts as the “metric” in a distorted space
• This does minimize f(a(1)·p(1) + … + a(k)·p(k) + … + a(m)·p(m))
– And only p(k-1) and r(k) are needed to compute p(k)
8
Conjugate Gradient
• Algorithm, step by step (DML snippet with comments):

i = 0;
beta = matrix (0, rows = ncol (X), cols = 1);     # initially beta = 0
r = - t(X) %*% y;                                 # residual & gradient:  r = (X^T X) beta - X^T y
p = - r;                                          # direction for beta: the negative gradient
norm_r2 = sum (r ^ 2);                            # squared norm of the residual,  r^T r
norm_r2_target = norm_r2 * tolerance ^ 2;         # desired squared norm of the residual
while (i < mi & norm_r2 > norm_r2_target) {
    # here p is the next direction for beta
    q = t(X) %*% (X %*% p) + lambda * p;          # q = (X^T X + lambda*I) %*% p
    a = norm_r2 / sum (p * q);                    # a = r^T r / p^T q  minimizes f(beta + a*p)
    beta = beta + a * p;                          # update:  beta <- beta + a*p
    r = r + a * q;                                # new residual:  r <- r + a*q
    old_norm_r2 = norm_r2;
    norm_r2 = sum (r ^ 2);                        # update the squared norm of the residual
    p = - r + (norm_r2 / old_norm_r2) * p;        # new direction: negative gradient, made conjugate
                                                  # to the previous direction; conjugacy to all older
                                                  # directions then holds automatically
    i = i + 1;
}
9
Degeneracy and Regularization
• PROBLEM: What if X has linearly dependent columns?
– Cause: recoding categorical features, adding composite features
– Then XTX is not a “metric”: there exists p with ǁpǁ > 0 such that pT (XTX) p = 0
– In the CG step a = rT r / pT (XTX) p: division by zero!
• In fact, Least Squares then has infinitely many solutions
– Most of them have HUGE β-values
• Regularization: penalize large β-values
– L2-regularization: (Y – Xβ)T (Y – Xβ) + λ·βT β → min
– Replace XTX with XTX + λI
– Pick λ << diag(XTX), refine by cross-validation
– Do NOT regularize the intercept (see the sketch below)
• CG: q = t(X) %*% (X %*% p) + lambda * p;
10
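As a minimal, hedged DML sketch of a regularized direct solve that leaves the intercept unpenalized (the names lambda and m_ext and the convention that the intercept is the last column of X are assumptions for illustration, not taken from the shipped script):

m_ext = ncol (X);                              # number of columns, including the intercept
A = t(X) %*% X;
b = t(X) %*% y;
lambda_vec = matrix (lambda, rows = m_ext, cols = 1);
lambda_vec [m_ext, 1] = 0;                     # do NOT regularize the intercept
A = A + diag (lambda_vec);                     # replace X^T X with X^T X + diag(lambda)
beta = solve (A, b);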
Shifting and Scaling X
• PROBLEM: Features have vastly different ranges
– Examples: [0, 1]; [2010, 2015]; [$0.01, $1 Billion]
• Each βi in Y ≈ Xβ then has a different size and accuracy
– The regularization term λ·βT β also becomes range-dependent
• SOLUTION: Shift and scale each feature to mean = 0, variance = 1
– Needs an intercept: Y ≈ (X |1)β
– Equivalently: (Xnew |1) = (X |1) %*% SST, the “Shift-Scale Transform” (see the sketch below)
• BUT: Sparse X becomes dense Xnew …
• SOLUTION: (Xnew |1) %*% M = (X |1) %*% (SST %*% M)
– Extends to XTX and other X-products
– Further optimization: SST has a special shape
11
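The snippet on the next slide uses two vectors, scale_X and shift_X, that encode SST. As a rough, hedged sketch (the exact construction inside the script may differ), they can be built from the column means and variances like this:

n = nrow (X);  m_ext = ncol (X) + 1;           # features plus the intercept column
avg_X = colSums (X) / n;                       # column means (1 x m row vector)
var_X = colSums (X ^ 2) / n - avg_X ^ 2;       # column variances
scale_X = matrix (1, rows = m_ext, cols = 1);  # intercept entry stays 1
scale_X [1 : (m_ext - 1), ] = t (1.0 / sqrt (var_X));
shift_X = matrix (0, rows = m_ext, cols = 1);  # intercept entry stays 0
shift_X [1 : (m_ext - 1), ] = - t (avg_X) * scale_X [1 : (m_ext - 1), ];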
Shifting and Scaling X
• Linear Regression “Direct Solve” code snippet:

A = t(X) %*% X;
b = t(X) %*% y;
if (intercept_status == 2) {
    # apply the shift-scale transform to A = X^T X and b = X^T y without densifying X
    A = t(diag (scale_X) %*% A + shift_X %*% A [m_ext, ]);
    A = diag (scale_X) %*% A + shift_X %*% A [m_ext, ];
    b = diag (scale_X) %*% b + shift_X %*% b [m_ext, ];
}
A = A + diag (lambda);
beta_unscaled = solve (A, b);
if (intercept_status == 2) {
    # map the betas for the shifted/rescaled X back to the original X
    beta = scale_X * beta_unscaled;
    beta [m_ext, ] = beta [m_ext, ] + t(shift_X) %*% beta_unscaled;
} else {
    beta = beta_unscaled;
}
12
Regression in Statistics
• Model: Y = Xβ* + ε, where ε is a random vector
– There exists a “true” β*
– Each εi is Gaussian with mean 0 and variance σ2, so each yi is Gaussian with mean μi = Xi β*
• Likelihood maximization to estimate β*
– Likelihood: ℓ(Y | X, β, σ) = ∏i≤n C(σ)·exp(– (yi – Xi β)2 / 2σ2)
– Log ℓ(Y | X, β, σ) = n·c(σ) – ∑i≤n (yi – Xi β)2 / 2σ2
– Maximum likelihood over β = Least Squares
• Why do we need the statistical view?
– Confidence intervals for parameters
– Goodness-of-fit tests
– Generalizations: replace the Gaussian with another distribution
13
Maximum Likelihood Estimator
• In each record (xi, yi), let yi have distribution ℓ(yi | xi, β, φ)
– Records are mutually independent for i = 1, …, n
• An estimator for β is a function f(X, Y)
– Y is random → f(X, Y) is random
– Unbiased estimator: for all β, the mean E f(X, Y) = β
• Maximum likelihood estimator
– MLE(X, Y) = argmaxβ ∏i≤n ℓ(yi | xi, β, φ)
– Asymptotically unbiased: E MLE(X, Y) → β as n → ∞
• Cramér–Rao bound
– For unbiased estimators, Var f(X, Y) ≥ FI(X, β, φ)–1
– Fisher information: FI(X, β, φ) = – EY Hessianβ log ℓ(Y | X, β, φ)
– For the MLE: Var(MLE(X, Y)) → FI(X, β, φ)–1 as n → ∞
14
Variance of M.L.E.
• The Cramér–Rao bound gives a simple way to estimate the variance of the fitted parameters (for large n):
1. Maximize log ℓ(Y | X, β, φ) to estimate β
2. Compute the Hessian (2nd derivatives) of log ℓ(Y | X, β, φ)
3. Compute the “expected” Hessian: FI = – EY Hessian
4. Invert FI as a matrix to get FI–1
5. Use FI–1 as the approximate covariance matrix for the estimated β
• For linear regression:
– Log ℓ(Y | X, β, σ) = n·c(σ) – ∑i≤n (yi – Xi β)2 / 2σ2
– Hessian = –(1/σ2)·XTX;  FI = (1/σ2)·XTX
– Cov β ≈ σ2·(XTX)–1;  Var βj ≈ σ2·diag((XTX)–1)j  (see the sketch below)
15
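For linear regression the last bullet translates into a few lines of DML. This is a hedged sketch (not the shipped script), assuming X, y, and the fitted beta are already in memory, that X includes any intercept column, and that XTX is invertible:

n = nrow (X);  m = ncol (X);
sigma2 = sum ((y - X %*% beta) ^ 2) / (n - m);   # residual variance estimate
XtX_inv = inv (t(X) %*% X);                      # (X^T X)^-1, assuming it exists
cov_beta = sigma2 * XtX_inv;                     # Cov(beta) ~ sigma^2 * (X^T X)^-1
std_err = sqrt (diag (cov_beta));                # standard errors of the betas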
Variance of Y given X
• MLE for the variance of Y:  1/n · ∑i≤n (yi – yavg)2
– To make it unbiased, replace 1/n with 1/(n – 1)
• The variance of ε in Y = Xβ* + ε is the residual variance
– Estimator: Var(ε) ≈ 1/(n – m – 1) · ∑i≤n (yi – Xi β)2
• A good regression must have: Var(ε) << Var(Y)
– “Explained” variance = Var(Y) – Var(ε)
• R-squared: estimate 1 – Var(ε) / Var(Y) to test fitness:
– R2 plain = 1 – (∑i≤n (yi – Xi β)2) / (∑i≤n (yi – yavg)2)
– R2 adj. = 1 – (∑i≤n (yi – Xi β)2) / (∑i≤n (yi – yavg)2) · (n – 1) / (n – m – 1)  (see the sketch below)
• Pearson residual: ri = (yi – Xi β) / Var(ε)1/2
– Should be approximately Gaussian with mean 0 and variance 1
– Can be used in another fitness test (more on tests later)
16
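A hedged DML sketch of the two R2 formulas above (variable names are illustrative; here m counts the non-intercept features, matching the n – m – 1 in the slide):

n = nrow (X);
ss_res = sum ((y - X %*% beta) ^ 2);             # residual sum of squares
ss_tot = sum ((y - sum (y) / n) ^ 2);            # total sum of squares around the mean
R2_plain = 1 - ss_res / ss_tot;
R2_adj   = 1 - (ss_res / ss_tot) * (n - 1) / (n - m - 1);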
LinReg Scripts: Inputs

# INPUT PARAMETERS:
# --------------------------------------------------------------------------------------------
# NAME   TYPE    DEFAULT   MEANING
# --------------------------------------------------------------------------------------------
# X      String  ---       Location (on HDFS) to read the matrix X of feature vectors
# Y      String  ---       Location (on HDFS) to read the 1-column matrix Y of response values
# B      String  ---       Location to store estimated regression parameters (the betas)
# O      String  " "       Location to write the printed statistics; by default is standard output
# Log    String  " "       Location to write per-iteration variables for log/debugging purposes
# icpt   Int     0         Intercept presence, shifting and rescaling the columns of X:
#                          0 = no intercept, no shifting, no rescaling;
#                          1 = add intercept, but neither shift nor rescale X;
#                          2 = add intercept, shift & rescale X columns to mean = 0, variance = 1
# reg    Double  0.000001  Regularization constant (lambda) for L2-regularization; set to nonzero
#                          for highly dependent/sparse/numerous features
# tol    Double  0.000001  Tolerance (epsilon); the conjugate gradient procedure terminates early
#                          if the L2 norm of the beta-residual is less than tolerance * its initial norm
# maxi   Int     0         Maximum number of conjugate gradient iterations, 0 = no maximum
# fmt    String  "text"    Matrix output format for B (the betas) only, usually "text" or "csv"
# --------------------------------------------------------------------------------------------
# OUTPUT: Matrix of regression parameters (the betas); its size depends on the icpt input value:
#          OUTPUT SIZE:    OUTPUT CONTENTS:                HOW TO PREDICT Y FROM X AND B:
# icpt=0: ncol(X)   x 1    Betas for X only                Y ~ X %*% B[1:ncol(X), 1], or just X %*% B
# icpt=1: ncol(X)+1 x 1    Betas for X and intercept       Y ~ X %*% B[1:ncol(X), 1] + B[ncol(X)+1, 1]
# icpt=2: ncol(X)+1 x 2    Col.1: betas for X & intercept  Y ~ X %*% B[1:ncol(X), 1] + B[ncol(X)+1, 1]
#                          Col.2: betas for shifted/rescaled X and intercept
17
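A small, hedged usage sketch of the “how to predict” column above, in DML (hypothetical variable names; B is the betas matrix produced by the script):

if (icpt == 0) {
    y_pred = X %*% B;                                                       # no intercept
} else {
    y_pred = X %*% B [1 : ncol (X), 1] + as.scalar (B [ncol (X) + 1, 1]);   # add the intercept
}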
LinReg Scripts: Outputs

# In addition, some regression statistics are provided in CSV format, one comma-separated
# name-value pair per each line, as follows:
#
# NAME                MEANING
# -------------------------------------------------------------------------------------
# AVG_TOT_Y           Average of the response value Y
# STDEV_TOT_Y         Standard Deviation of the response value Y
# AVG_RES_Y           Average of the residual Y - pred(Y|X), i.e. residual bias
# STDEV_RES_Y         Standard Deviation of the residual Y - pred(Y|X)
# DISPERSION          GLM-style dispersion, i.e. residual sum of squares / # deg. fr.
# PLAIN_R2            Plain R^2 of residual with bias included vs. total average
# ADJUSTED_R2         Adjusted R^2 of residual with bias included vs. total average
# PLAIN_R2_NOBIAS     Plain R^2 of residual with bias subtracted vs. total average
# ADJUSTED_R2_NOBIAS  Adjusted R^2 of residual with bias subtracted vs. total average
# PLAIN_R2_VS_0 *     Plain R^2 of residual with bias included vs. zero constant
# ADJUSTED_R2_VS_0 *  Adjusted R^2 of residual with bias included vs. zero constant
# -------------------------------------------------------------------------------------
# * The last two statistics are only printed if there is no intercept (icpt=0)
#
# The Log file, when requested, contains the following per-iteration variables in CSV
# format, each line containing triple (NAME, ITERATION, VALUE) with ITERATION = 0 for
# initial values:
#
# NAME                MEANING
# -------------------------------------------------------------------------------------
# CG_RESIDUAL_NORM    L2-norm of Conj.Grad.residual, which is A %*% beta - t(X) %*% y
#                     where A = t(X) %*% X + diag (lambda), or a similar quantity
# CG_RESIDUAL_RATIO   Ratio of current L2-norm of Conj.Grad.residual over the initial
# -------------------------------------------------------------------------------------
18
Caveats
• Overfitting: the fitted β reflects individual records in X, not the underlying distribution
– Typically caused by too few records (small n) or too many features (large m)
– To detect it, use cross-validation
– To mitigate it, select fewer features; regularization may help too
• Outliers: some records in X are highly abnormal
– They badly violate the distribution, or have very large cell values
– Check MIN and MAX of Y, the X-columns, Xi β, and ri2 = (yi – Xi β)2 / Var(ε)  (see the sketch below)
– To mitigate, remove the outliers, or change the distribution or link function
• Interpolation vs. extrapolation
– A model trained on one kind of data may not carry over to another kind of data; the past may not predict the future
– Great research topic!
19
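A hedged DML sketch of the outlier checks suggested above, assuming X, y, the fitted beta, and the residual-variance estimate var_eps are already available (all names are illustrative):

eta = X %*% beta;                                # fitted values X_i beta
r2 = (y - eta) ^ 2 / var_eps;                    # squared Pearson residuals
print ("Y range:      [" + min (y) + ", " + max (y) + "]");
print ("X*beta range: [" + min (eta) + ", " + max (eta) + "]");
print ("largest squared Pearson residual: " + max (r2));
col_min = colMins (X);                           # per-feature minima
col_max = colMaxs (X);                           # per-feature maxima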
Generalized Linear Models
• Linear Regression: Y = Xβ* + ε
– Each yi is Normal(μi, σ2) where the mean μi = Xi β*
– Variance(yi) = σ2 = constant
• Logistic Regression:
– Each yi is Bernoulli(μi) where the mean μi = 1 / (1 + exp(– Xi β*))
– Prob[yi = 1] = μi,  Prob[yi = 0] = 1 – μi,  mean = probability of 1
– Variance(yi) = μi (1 – μi)
• Poisson Regression:
– Each yi is Poisson(μi) where the mean μi = exp(Xi β*)
– Prob[yi = k] = (μi)k exp(– μi) / k!  for k = 0, 1, 2, …
– Variance(yi) = μi
• Only in Linear Regression do we add an error term εi to the mean μi
20
Generalized Linear Models
• GLM Regression:
– Each yi has distribution  exp{(yi·θi – b(θi))/a + c(yi, a)}
– The canonical parameter θi represents the mean: μi = bʹ(θi)
– The link function connects μi and Xi β*:  Xi β* = g(μi),  μi = g–1(Xi β*)
– Variance(yi) = a·bʺ(θi)
• Example: Linear Regression as a GLM
– C(σ)·exp(– (yi – Xi β)2 / 2σ2) = exp{(yi·θi – b(θi))/a + c(yi, a)}
– θi = μi = Xi β;  b(θi) = (Xi β)2 / 2;  a = σ2 = variance
– Link function = identity;  c(yi, a) = – yi2 / 2σ2 + log C(σ)
• Example: Logistic Regression as a GLM
– (μi)y[i] (1 – μi)1 – y[i] = exp{yi·log(μi) – yi·log(1 – μi) + log(1 – μi)} = exp{(yi·θi – b(θi))/a + c(yi, a)}
– θi = log(μi / (1 – μi)) = Xi β;  b(θi) = – log(1 – μi) = log(1 + exp(θi))
– Link function = log(μ / (1 – μ));  Variance = μ(1 – μ);  a = 1
21
Generalized Linear Models
• GLM Regression:
– Each yi has distribution  exp{(yi·θi – b(θi))/a + c(yi, a)}
– The canonical parameter θi represents the mean: μi = bʹ(θi)
– The link function connects μi and Xi β*:  Xi β* = g(μi),  μi = g–1(Xi β*)
– Variance(yi) = a·bʺ(θi)
• Why θi?  What is b(θi)?
– θi makes the formulas simpler; it stands in for μi (no big deal)
– b(θi) defines which distribution it is: linear, logistic, Poisson, etc.
– b(θi) connects the mean with the variance: Var(yi) = a·bʺ(θi),  μi = bʹ(θi)
• What is the link function?
– You choose it to link μi with your features β1xi1 + β2xi2 + … + βmxim
– Additive effects: μi = Xi β;  Multiplicative effects: μi = exp(Xi β);
  Bayes-law effects: μi = 1 / (1 + exp(– Xi β));  Inverse: μi = 1 / (Xi β)  (see the sketch below)
– Xi β has range (–∞, +∞), but μi may range in [0, 1], [0, +∞), etc.
22
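A hedged DML sketch of the four example links above, mapping the linear term to the mean (illustrative names; eta stands for Xβ):

eta = X %*% beta;                       # the linear term X_i beta
mu_identity = eta;                      # additive effects (identity link)
mu_log      = exp (eta);                # multiplicative effects (log link)
mu_logit    = 1 / (1 + exp (- eta));    # "Bayes law" effects (logit link)
mu_inverse  = 1 / eta;                  # inverse link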
GLMs We Support
• We specify a GLM by:
– The mean-to-variance connection
– The link function (mean to feature-sum connection)
• Mean-to-variance for common distributions:
– Var(yi) = a·(μi)0 = σ2:  Linear / Gaussian
– Var(yi) = a·μi (1 – μi):  Logistic / Binomial
– Var(yi) = a·(μi)1:  Poisson
– Var(yi) = a·(μi)2:  Gamma
– Var(yi) = a·(μi)3:  Inverse Gaussian
• We support two types: Power and Binomial
– Var(yi) = a·(μi)α:  Power, for any α
– Var(yi) = a·μi (1 – μi):  Binomial
23
GLMs We Support
• We specify a GLM by:
– The mean-to-variance connection
– The link function (mean to feature-sum connection)
• Supported link functions:
– Power: Xi β = (μi)s, where s = 0 stands for Xi β = log(μi)
– Examples: identity, inverse, log, square root
• Link functions used in binomial / logistic regression:
– Logit, Probit, Cloglog, Cauchit
– They link the Xi β-range (–∞, +∞) with the μi-range (0, 1)
– They differ in tail behavior
• Canonical link function:
– Makes Xi β = the canonical parameter θi, i.e. sets μi = bʹ(Xi β)
– Power link Xi β = (μi)1–α if Var = a·(μi)α;  Logit link for Binomial
24
GLM Script Inputs

# NAME  TYPE    DEFAULT   MEANING
# ---------------------------------------------------------------------------------------------
# X     String  ---       Location to read the matrix X of feature vectors
# Y     String  ---       Location to read response matrix Y with either 1 or 2 columns:
#                         if dfam = 2, Y is 1-column Bernoulli or 2-column Binomial (#pos, #neg)
# B     String  ---       Location to store estimated regression parameters (the betas)
# fmt   String  "text"    The betas matrix output format, such as "text" or "csv"
# O     String  " "       Location to write the printed statistics; by default is standard output
# Log   String  " "       Location to write per-iteration variables for log/debugging purposes
# dfam  Int     1         Distribution family code: 1 = Power, 2 = Binomial
# vpow  Double  0.0       Power for Variance defined as (mean)^power (ignored if dfam != 1):
#                         0.0 = Gaussian, 1.0 = Poisson, 2.0 = Gamma, 3.0 = Inverse Gaussian
# link  Int     0         Link function code: 0 = canonical (depends on distribution),
#                         1 = Power, 2 = Logit, 3 = Probit, 4 = Cloglog, 5 = Cauchit
# lpow  Double  1.0       Power for Link function defined as (mean)^power (ignored if link != 1):
#                         -2.0 = 1/mu^2, -1.0 = reciprocal, 0.0 = log, 0.5 = sqrt, 1.0 = identity
# yneg  Double  0.0       Response value for Bernoulli "No" label, usually 0.0 or -1.0
# icpt  Int     0         Intercept presence, X columns shifting and rescaling:
#                         0 = no intercept, no shifting, no rescaling;
#                         1 = add intercept, but neither shift nor rescale X;
#                         2 = add intercept, shift & rescale X columns to mean = 0, variance = 1
# reg   Double  0.0       Regularization parameter (lambda) for L2 regularization
# tol   Double  0.000001  Tolerance (epsilon)
# disp  Double  0.0       (Over-)dispersion value, or 0.0 to estimate it from data
# moi   Int     200       Maximum number of outer (Newton / Fisher Scoring) iterations
# mii   Int     0         Maximum number of inner (Conjugate Gradient) iterations, 0 = no maximum
# ---------------------------------------------------------------------------------------------
# OUTPUT: Matrix beta, whose size depends on icpt:
#   icpt=0: ncol(X) x 1;   icpt=1: (ncol(X) + 1) x 1;   icpt=2: (ncol(X) + 1) x 2
25
GLM Script Outputs

# In addition, some GLM statistics are provided in CSV format, one comma-separated name-value
# pair per each line, as follows:
# -------------------------------------------------------------------------------------------
# TERMINATION_CODE    A positive integer indicating success/failure as follows:
#                     1 = Converged successfully; 2 = Maximum number of iterations reached;
#                     3 = Input (X, Y) out of range; 4 = Distribution/link is not supported
# BETA_MIN            Smallest beta value (regression coefficient), excluding the intercept
# BETA_MIN_INDEX      Column index for the smallest beta value
# BETA_MAX            Largest beta value (regression coefficient), excluding the intercept
# BETA_MAX_INDEX      Column index for the largest beta value
# INTERCEPT           Intercept value, or NaN if there is no intercept (if icpt=0)
# DISPERSION          Dispersion used to scale deviance, provided as the "disp" input parameter
#                     or estimated (same as DISPERSION_EST) if the "disp" parameter is <= 0
# DISPERSION_EST      Dispersion estimated from the dataset
# DEVIANCE_UNSCALED   Deviance from the saturated model, assuming dispersion == 1.0
# DEVIANCE_SCALED     Deviance from the saturated model, scaled by the DISPERSION value
# -------------------------------------------------------------------------------------------
#
# The Log file, when requested, contains the following per-iteration variables in CSV format,
# each line containing triple (NAME, ITERATION, VALUE) with ITERATION = 0 for initial values:
# -------------------------------------------------------------------------------------------
# NUM_CG_ITERS        Number of inner (Conj.Gradient) iterations in this outer iteration
# IS_TRUST_REACHED    1 = trust region boundary was reached, 0 = otherwise
# POINT_STEP_NORM     L2-norm of iteration step from old point (i.e. "beta") to new point
# OBJECTIVE           The loss function we minimize (i.e. negative partial log-likelihood)
# OBJ_DROP_REAL       Reduction in the objective during this iteration, actual value
# OBJ_DROP_PRED       Reduction in the objective predicted by a quadratic approximation
# OBJ_DROP_RATIO      Actual-to-predicted reduction ratio, used to update the trust region
# GRADIENT_NORM       L2-norm of the loss function gradient (NOTE: sometimes omitted)
# LINEAR_TERM_MIN     The minimum value of X %*% beta, used to check for overflows
# LINEAR_TERM_MAX     The maximum value of X %*% beta, used to check for overflows
# IS_POINT_UPDATED    1 = new point accepted; 0 = new point rejected, old point restored
# TRUST_DELTA         Updated trust region size, the "delta"
# -------------------------------------------------------------------------------------------
26
GLM Likelihood Maximization
• One record:  ℓ(yi | θi, a) = exp{(yi·θi – b(θi))/a + c(yi, a)}
• Log ℓ(Y | Θ, a) = 1/a · ∑i≤n (yi·θi – b(θi)) + const (independent of Θ)
• f(β; X, Y) = – ∑i≤n (yi·θi – b(θi)) + λ/2 · βT β  →  min
– Here θi is a function of β:  θi = bʹ–1(g–1(Xi β))  (a concrete Poisson example is sketched below)
– Regularization is added with λ/2 to agree with least squares
– If X has an intercept, do NOT regularize its β-value
• Non-quadratic; how to optimize?
– Gradient descent: fastest when far from the optimum
– Newton’s method: fastest when close to the optimum
• Trust Region Conjugate Gradient
– Strikes a good balance between the two
27
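For one concrete case, Poisson regression with the canonical log link (θi = Xi β, b(θ) = exp(θ), a = 1), the objective above becomes the following hedged DML sketch (an intercept, if present, would be excluded from the regularization term):

eta = X %*% beta;                                                # theta_i = X_i beta under the canonical link
f = - sum (y * eta - exp (eta)) + 0.5 * lambda * sum (beta ^ 2); # negative log-likelihood plus L2 penalty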
GLM Likelihood Maximization
• f(β; X, Y) = – ∑i≤n (yi·θi – b(θi)) + λ/2 · βT β  →  min
• Outer iteration: from β to βnew = β + z
– ∆f(z; β) := f(β + z; X, Y) – f(β; X, Y)
• Use “Fisher scoring” to approximate the Hessian and ∆f(z; β)
– ∆f(z; β) ≈ ½·zT A z + GT z,  where:
– A = XT diag(w) X + λI  and  G = – XT u + λ·β
– FI = XT diag(w) X is the “expected” Hessian
– The vectors u, w depend on β via the mean-to-variance and link functions  (see the sketch below)
• Trust region: the area ǁzǁ2 ≤ δ where we trust the approximation ∆f(z; β) ≈ ½·zT A z + GT z
– ǁzǁ2 ≤ δ too small  →  Gradient Descent step (1 inner iteration)
– ǁzǁ2 ≤ δ mid-size  →  Cut-off Conjugate Gradient step (2 or more)
– ǁzǁ2 ≤ δ too wide  →  Full Conjugate Gradient step
28
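To make u and w concrete: for a canonical link (and dispersion a = 1) one gets ui = yi – μi and wi = V(μi). A hedged DML sketch for the Poisson/log case (names are illustrative; the shipped script handles general links and dispersions):

mu = exp (X %*% beta);                           # Poisson mean under the log link
w  = mu;                                         # V(mu) = mu for Poisson
u  = y - mu;
G  = - t(X) %*% u + lambda * beta;               # gradient of f at beta
z  = - G;                                        # e.g. the steepest-descent direction
Az = t(X) %*% (w * (X %*% z)) + lambda * z;      # A %*% z without materializing A
df_approx = 0.5 * sum (z * Az) + sum (G * z);    # quadratic model of Delta f(z; beta)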
Trust Region Conj. Gradient
• Code snippet for Logistic Regression (N = number of rows, D = number of columns of X are assumed; y is encoded as +1/-1; beta, w, i, zeros_D are assumed initialized):

g = - 0.5 * t(X) %*% y;
f_val = - N * log (0.5);
delta = 0.5 * sqrt (D) / max (sqrt (rowSums (X ^ 2)));
exit_g2 = sum (g ^ 2) * tolerance ^ 2;
while (sum (g ^ 2) > exit_g2 & i < max_i) {
    i = i + 1;
    r = g;  r2 = sum (r ^ 2);  exit_r2 = 0.01 * r2;
    d = - r;  z = zeros_D;  j = 0;
    trust_bound_reached = FALSE;
    # inner loop: conjugate gradient steps, cut off at the trust-region boundary
    while (r2 > exit_r2 & (! trust_bound_reached) & j < max_j) {
        j = j + 1;
        Hd = lambda * d + t(X) %*% diag (w) %*% X %*% d;
        c = r2 / sum (d * Hd);
        [c, trust_bound_reached] = ensure_quadratic (c, sum (d ^ 2), 2 * sum (z * d), sum (z ^ 2) - delta ^ 2);
        z = z + c * d;
        r = r + c * Hd;
        r2_new = sum (r ^ 2);
        d = - r + (r2_new / r2) * d;
        r2 = r2_new;
    }
    p = 1.0 / (1.0 + exp (- y * (X %*% (beta + z))));
    f_chg = - sum (log (p)) + 0.5 * lambda * sum ((beta + z) ^ 2) - f_val;
    delta = update_trust_region (delta, sqrt (sum (z ^ 2)), f_chg, sum (z * g), 0.5 * sum (z * (r + g)));
    # accept the step only if the objective decreased
    if (f_chg < 0) {
        beta = beta + z;
        f_val = f_val + f_chg;
        w = p * (1 - p);
        g = - t(X) %*% ((1 - p) * y) + lambda * beta;
    }
}

ensure_quadratic = function (double x, double a, double b, double c)
    return (double x_new, boolean test)
{
    # if a*x^2 + b*x + c > 0, truncate x to the positive root (the trust-region boundary)
    test = (a * x ^ 2 + b * x + c > 0);
    if (test) {
        rad = sqrt (b ^ 2 - 4 * a * c);
        if (b >= 0) { x_new = - (2 * c) / (b + rad); }
        else        { x_new = - (b - rad) / (2 * a); }
    } else { x_new = x; }
}
29
Trust Region Conj. Gradient
• Trust region update in the Logistic Regression snippet:

update_trust_region = function (double delta, double z_distance, double f_chg_exact,
                                double f_chg_linear_approx, double f_chg_quadratic_approx)
    return (double delta)
{
    sigma1 = 0.25;  sigma2 = 0.5;  sigma3 = 4.0;
    if (f_chg_exact <= f_chg_linear_approx) {
        alpha = sigma3;
    } else {
        alpha = max (sigma1, - 0.5 * f_chg_linear_approx / (f_chg_exact - f_chg_linear_approx));
    }
    rho = f_chg_exact / f_chg_quadratic_approx;
    if (rho < 0.0001) {
        delta = min (max (alpha, sigma1) * z_distance, sigma2 * delta);
    } else { if (rho < 0.25) {
        delta = max (sigma1 * delta, min (alpha * z_distance, sigma2 * delta));
    } else { if (rho < 0.75) {
        delta = max (sigma1 * delta, min (alpha * z_distance, sigma3 * delta));
    } else {
        delta = max (delta, min (alpha * z_distance, sigma3 * delta));
    }}}
}
30
GLM: Other Statistics
• REMINDER:
– Each yi has distribution  exp{(yi·θi – b(θi))/a + c(yi, a)}
– Variance(yi) = a·bʺ(θi) = a·V(μi)
• Variance of Y given X
– Estimating β gives V(μi) = V(g–1(Xi β))
– The constant “a” is called the dispersion, the analogue of σ2
– Estimator: a ≈ 1/(n – m) · ∑i≤n (yi – μi)2 / V(μi)
• Variance of the parameters β
– We use MLE, hence the Cramér–Rao formula applies (for large n)
– Fisher Information: FI = (1/a)·XT diag(w) X,  wi = (V(μi)·gʹ(μi)2)–1
– Estimator: Cov β ≈ a·(XT diag(w) X)–1,  Var βj = (Cov β)jj  (see the sketch below)
31
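A hedged DML sketch of the two estimators above, assuming mu, V_mu (the values V(μi)) and g_prime (the values gʹ(μi)) have already been computed for the fitted beta, and that the weighted XTX is invertible (all names are illustrative):

n = nrow (X);  m = ncol (X);
a_hat = sum ((y - mu) ^ 2 / V_mu) / (n - m);             # dispersion estimate
w = 1 / (V_mu * g_prime ^ 2);
cov_beta = a_hat * inv (t(X) %*% diag (w) %*% X);        # Cramer-Rao approximation of Cov(beta)
std_err_beta = sqrt (diag (cov_beta));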
GLM: Deviance
• Let X have m features, of which k may have no effect on Y
– Will “no effect” result in βj ≈ 0?  (Unlikely.)
– Estimate βj and Var βj, then test βj / (Var βj)1/2 against N(0, 1)?
• Student’s t-test is better
• Likelihood Ratio Test:
      D  =  2·log [ maxβ LGLM(Y | X, a; β1, …, βk, βk+1, …, βm)  /  maxβ LGLM(Y | X, a; 0, …, 0, βk+1, …, βm) ]  >  0
• Null Hypothesis: Y given X follows the GLM with β1 = … = βk = 0
– If the NH is true, D is asymptotically distributed as χ2 with k degrees of freedom
– If the NH is false, D → +∞ as n → +∞
• P-value % = Prob[χ2k > D] · 100%
32
GLM: Deviance
• To test many nested models (feature subsets) we need their maximum likelihoods to compute D
– PROBLEM: the term “c(yi, a)” in the GLM’s exp{(yi·θi – b(θi))/a + c(yi, a)}
• Instead, compute the deviance:
      D  =  2·log [ maxΘ LGLM(Y | a; Θ: saturated model)  /  maxβ LGLM(Y | X, a; β1, …, βk, βk+1, …, βm) ]  >  0
• The “saturated model” has no X and no β, but picks the best θi for each individual yi (not realistic at all, just a convention)
– The term “c(yi, a)” is the same in both models!
– But “a” has to be fixed, e.g. to 1
• The deviance itself is used for goodness-of-fit tests, too
33
Survival Analysis
• Given: survival data from individuals as (time, event) pairs
– Categorical/continuous features for each individual
• Estimate:
– The probability of survival up to a future time
– The rate of hazard at a given time
• [Figure: follow-up timelines for patients #1–#9 (Patients vs. Time); † = death from the specific cancer, ? = lost to follow-up]
34
Cox Regression
• Semi-parametric model, “robust”
• Most commonly used
• Handles categorical and continuous data
• Handles (right/left/interval) censored data
• Proportional-hazards model:  hi(t) = h0(t) · exp(β1xi1 + … + βmxim)
– h0(t) = baseline hazard;  xi = covariates;  β = coefficients
35
Event Hazard Rate
• Symptom events E follow a Poisson process
– [Figure: a timeline with events E1, E2, E3, E4 followed by Death]
• Hazard function = Poisson rate:
      h(t; state)  =  limΔt→0  Prob[E ∈ [t, t+Δt) | state] / Δt
• Given the state and the hazard, we can compute the probability of the observed event count:
      Prob[K events in t1 ≤ t ≤ t2]  =  HK e–H / K!,   where  H = ∫[t1, t2] h(t; state(t)) dt
36
Cox Proportional Hazards
• Assume that exactly one patient gets event E at time t
– [Figure: patients #1, …, #n with states s1, …, sn; Patient #i has state si]
• The probability that it is Patient #i is the hazard ratio:
      Prob[#i gets E]  =  h(t; si) / ∑j≤n h(t; sj)
• Cox assumption:
      h(t; state)  =  h0(t)·Λ(state)  =  h0(t)·exp(λT s)
• The time confounder h0(t) cancels out!
37
Cox “Partial” Likelihood
• The Cox “partial” likelihood for the dataset is a product over all events E (a DML sketch follows below):
      LCox(λ)  =  Prob[all E; λ]
               =  ∏t: E at t  h(t; swho(t)(t)) / ∑j≤n h(t; sj(t))
               =  ∏t: E at t  exp(λT swho(t)(t)) / ∑j≤n exp(λT sj(t))
• [Figure: patients #1, …, #n and their event times]
38
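A hedged DML sketch of the negative partial log-likelihood for untied event times, assuming column vectors T (observed times), E (1 = event, 0 = censored), and a coefficient vector lambda (the λ of the slide); forming the full n × n risk-set matrix is only practical for small n, so this is illustrative rather than the shipped script:

n = nrow (X);
eta = X %*% lambda;                              # lambda^T s_j for every subject j
ones = matrix (1, rows = n, cols = 1);
R = (T %*% t(ones)) <= (ones %*% t(T));          # R[i, j] = 1 if T[j] >= T[i], i.e. j is at risk at T[i]
log_risk = log (R %*% exp (eta));                # log-sum of exp(eta_j) over each risk set
neg_partial_ll = - sum (E * (eta - log_risk));   # minus the Cox partial log-likelihood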
Cox Regression
• Semi-parametric model, “robust”
• Most commonly used
• Handles categorical and continuous data
• Handles (right/left/interval) censored data
• Cox regression in DML:
– Fits the parameters via the negative partial log-likelihood
– Method: trust region Newton with conjugate gradient
– Inverts the Hessian using block Cholesky to compute the standard errors of the betas
– Similar features as coxph() in R, e.g., stratification, frequency weights, offsets, goodness-of-fit testing, recurrent-event analysis
• Model:  hi(t) = h0(t) · exp(β1xi1 + … + βmxim)  (baseline hazard, covariates, coefficients)
39
Confidence Intervals
• Definition of a confidence interval; p-value
• Likelihood ratio test
– How to use it for a confidence interval
• Degrees of freedom
43