Clustering and Factorization in SystemML (part 1)
Alexandre Evfimievski
K-means Clustering
• INPUT: n records x_1, x_2, …, x_n as the rows of matrix X
  – Each x_i is m-dimensional: x_i = (x_i1, x_i2, …, x_im)
  – Matrix X is (n × m)-dimensional
• INPUT: k, an integer in {1, 2, …, n}
• OUTPUT: Partition the records into k clusters S_1, S_2, …, S_k
  – May use n labels y_1, y_2, …, y_n in {1, 2, …, k}
  – NOTE: The same clusters can be labeled in k! ways, which matters when checking correctness (don’t just compare “predicted” and “true” labels)
• METRIC: Minimize the within-cluster sum of squares (WCSS)
• Cluster “means” are k vectors that capture as much variance in the data as possible
$$\mathrm{WCSS} = \sum_{i=1}^{n} \left\lVert x_i - \mathrm{mean}(S_j : x_i \in S_j) \right\rVert_2^2$$
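A minimal sketch of the WCSS metric and of the label-permutation caveat, in plain Python. The function names `wcss` and `same_partition` and the four-point dataset are illustrative assumptions, not part of the SystemML scripts.

```python
def wcss(points, labels):
    """Within-cluster sum of squares: squared distance of each record to its cluster mean."""
    clusters = {}
    for p, y in zip(points, labels):
        clusters.setdefault(y, []).append(p)
    total = 0.0
    for members in clusters.values():
        dim = len(members[0])
        mean = [sum(p[d] for p in members) / len(members) for d in range(dim)]
        total += sum((p[d] - mean[d]) ** 2 for p in members for d in range(dim))
    return total


def same_partition(labels_a, labels_b):
    """True if both labelings induce the same clusters, whatever the label names are."""
    def partition(labels):
        groups = {}
        for i, y in enumerate(labels):
            groups.setdefault(y, set()).add(i)
        return {frozenset(g) for g in groups.values()}
    return partition(labels_a) == partition(labels_b)


# Hypothetical toy data: two tight groups of two records each.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
labels = [1, 1, 2, 2]
relabeled = [2, 2, 1, 1]   # same clusters, label names swapped

print(wcss(points, labels))               # 4.0
print(labels == relabeled)                # False: naive label comparison fails
print(same_partition(labels, relabeled))  # True: the partitions agree
```

Comparing partitions rather than raw labels is one simple way to avoid the k! relabeling problem when checking correctness.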
K-means Clustering
• K-means is a little similar to linear regression:
  – Linear regression error = ∑_{i≤n} (y_i – x_i·β)^2
  – BUT: Clustering describes the x_i’s themselves, not the y_i’s given the x_i’s
• K-means can work in “linearization space” (like kernel SVM)
• How to pick k?
  – Try k = 1, 2, …, up to some limit; check for overfitting
  – Pick the best k in the context of the whole task
• Caveats for k-means
  – It does NOT estimate a mixture of Gaussians
    • The EM algorithm does this
  – The k clusters tend to be of similar size
    • Do NOT use for imbalanced clusters!
The K-means Algorithm
• Pick k “centroids” c_1, c_2, …, c_k from the records {x_1, x_2, …, x_n}
  – Try to pick centroids far from each other
• Assign each record to the nearest centroid:
  – For each x_i compute d_i = min {dist(x_i, c_j) over all c_j}
  – Cluster S_j ← {x_i : dist(x_i, c_j) = d_i}
• Reset each centroid to its cluster’s mean:
  – Centroid c_j ← mean(S_j) = ∑_{i≤n} 1[x_i ∈ S_j] · x_i / |S_j|
• Repeat the “assign” and “reset” steps until convergence
• Loss decreases: WCSS_old ≥ C-WCSS_new ≥ WCSS_new
  – Converges to a local optimum (often not the global one)
$$\text{C-WCSS} = \sum_{i=1}^{n} \left\lVert x_i - \mathrm{centroid}(S_j : x_i \in S_j) \right\rVert_2^2$$
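The “assign” and “reset” steps can be sketched in plain Python as follows. This is an illustration only, not the DML script: the helper names, the toy data, and the choice to leave an empty cluster's centroid in place are all assumptions of the sketch.

```python
def sq_dist(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def assign(points, centroids):
    """'Assign' step: index of the nearest centroid for every record."""
    return [min(range(len(centroids)), key=lambda j: sq_dist(p, centroids[j]))
            for p in points]


def reset(points, labels, centroids):
    """'Reset' step: move each centroid to the mean of its cluster."""
    new_centroids = []
    for j, old in enumerate(centroids):
        members = [p for p, y in zip(points, labels) if y == j]
        if not members:
            # Assumption of this sketch: an empty cluster keeps its old centroid.
            new_centroids.append(old)
            continue
        new_centroids.append(tuple(sum(p[d] for p in members) / len(members)
                                   for d in range(len(old))))
    return new_centroids


def c_wcss(points, labels, centroids):
    """Centroid-WCSS: squared distance of each record to its assigned centroid."""
    return sum(sq_dist(p, centroids[y]) for p, y in zip(points, labels))


def kmeans(points, centroids, max_iter=100):
    history = []
    for _ in range(max_iter):
        labels = assign(points, centroids)
        history.append(c_wcss(points, labels, centroids))  # loss after "assign"
        new_centroids = reset(points, labels, centroids)
        if new_centroids == centroids:                      # nothing moved: converged
            break
        centroids = new_centroids
    return centroids, labels, history


# Hypothetical toy data with a deliberately poor start: both centroids on the left.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
centroids, labels, history = kmeans(points, [(0.0, 0.0), (0.0, 2.0)])
print(history)     # [200.0, 100.0]
print(centroids)   # [(5.0, 0.0), (5.0, 2.0)]
```

The loss never increases, yet this start converges to a top/bottom split with loss 100, while the left/right split has WCSS 4. That is the local-optimum behavior the slide warns about.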
The K-means Algorithm
• Runaway centroid: closest to no record at the “assign” step
  – Occasionally happens, e.g. with k = 3 centroids and 2 data clusters
  – Options: (a) terminate, (b) reduce k by 1
• Centroids vs. means at early termination:
  – After the “assign” step, cluster centroids ≠ their means
    • Centroids: (a) define the clusters, (b) already computed
    • Means: (a) define the WCSS metric, (b) not yet computed
  – We report centroids and centroid-WCSS (C-WCSS)
• Multiple runs:
  – Required against a bad local optimum
  – Use a “parfor” loop, with random initial centroids
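K-means: DML Implementation
# Load this run's k initial centroids from the stacked matrix All_C
C = All_C [(k * (run - 1) + 1) : (k * run), ];
iter = 0; term_code = 0; wcss = 0;
while (term_code == 0) {
    # Squared distances to all centroids, minus the per-row constant rowSums (X ^ 2)
    D = -2 * (X %*% t(C)) + t(rowSums (C ^ 2));
    minD = rowMins (D); wcss_old = wcss;
    # sumXsq is assumed to hold sum (X ^ 2), computed before the loop
    wcss = sumXsq + sum (minD);
    if (wcss_old - wcss < eps * wcss & iter > 0) {
        term_code = 1;  # Convergence is reached
    } else {
        if (iter >= max_iter) {
            term_code = 2;  # Iteration limit is reached
        } else {
            iter = iter + 1;
            # "Assign" step: P[i, j] > 0 iff centroid j is nearest to record i
            P = ppred (D, minD, "<=");
            P = P / rowSums (P);  # Ties are split evenly
            if (sum (ppred (colSums (P), 0.0, "<=")) > 0) {
                term_code = 3;  # "Runaway" centroid
            } else {
                # "Reset" step: each centroid becomes its cluster's mean
                C = t(P / colSums (P)) %*% X;
            }
        }
    }
}
# Store this run's centroids, loss, and termination code
All_C [(k * (run - 1) + 1) : (k * run), ] = C;
final_wcss [run, 1] = wcss; t_code [run, 1] = term_code;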
Callouts on the slide: “Want smooth assign? Edit here”, “Tensor avoidance maneuver”, “ParFor I/O”
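The DML loop never forms the full squared distances: it drops the term rowSums (X ^ 2), which is constant per record and so does not change which centroid is nearest, and adds sum (X ^ 2) back once when computing the loss. A small plain-Python check of that identity, with hypothetical data and helper names:

```python
def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def sq_dist(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


# Hypothetical toy data: four records, two centroids.
X = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
C = [(0.0, 1.0), (10.0, 1.0)]

# Full squared distances ||x - c||^2 = ||x||^2 - 2 x.c + ||c||^2
full = [[sq_dist(x, c) for c in C] for x in X]

# Shifted distances as in the DML code: D = -2 x.c + ||c||^2
D = [[-2.0 * dot(x, c) + dot(c, c) for c in C] for x in X]

argmin_full = [row.index(min(row)) for row in full]
argmin_D = [row.index(min(row)) for row in D]

sum_x_sq = sum(dot(x, x) for x in X)               # plays the role of sumXsq
wcss_shifted = sum_x_sq + sum(min(row) for row in D)
wcss_full = sum(min(row) for row in full)

print(argmin_full == argmin_D)   # True: same nearest centroid for every record
print(wcss_shifted, wcss_full)   # 4.0 4.0
```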
K-means++ Initialization Heuristic
• Picks centroids from X at random, pushing them far apart
• Gets WCSS down to O(log k) × optimal in expectation
• How to pick centroids:
  – Centroid c_1: Pick uniformly at random from the rows of X
  – Centroid c_2: Prob [c_2 ← x_i] = (1/Σ) · dist(x_i, c_1)^2
  – Centroid c_j: Prob [c_j ← x_i] = (1/Σ) · min{dist(x_i, c_1)^2, …, dist(x_i, c_{j–1})^2}
  – Here Σ is the normalizer, so the probability to pick a row is proportional to its squared min-distance from the earlier centroids
• If X is huge, we use a sample of X, different across runs
  – Otherwise picking k centroids requires k passes over X
David Arthur, Sergei Vassilvitskii, “k-means++: the advantages of careful seeding”, in SODA 2007
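A minimal sketch of the seeding rule above in plain Python, using `random.choices` for the weighted draw. The function name `kmeans_pp_init` and the toy data are illustrative; the sketch assumes the data holds at least k distinct records and does not do the sampling of X mentioned for huge inputs.

```python
import random


def sq_dist(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def kmeans_pp_init(points, k, rng):
    """Pick k initial centroids; assumes at least k distinct records."""
    centroids = [rng.choice(points)]                  # c_1: uniform over the rows
    min_sq = [sq_dist(p, centroids[0]) for p in points]
    while len(centroids) < k:
        # Draw the next centroid with probability proportional to the squared
        # distance to the nearest centroid chosen so far.
        nxt = rng.choices(points, weights=min_sq, k=1)[0]
        centroids.append(nxt)
        min_sq = [min(d, sq_dist(p, nxt)) for p, d in zip(points, min_sq)]
    return centroids


# Hypothetical toy data.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
seeds = kmeans_pp_init(points, k=2, rng=random.Random(0))
print(seeds)
```

A record that is already a centroid has weight zero, so it cannot be drawn again. Keeping `min_sq` updated incrementally means each new centroid costs one pass over the data, which matches the “k passes over X” remark.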
K-means Predict Script
• Predictor and Evaluator in one:
  – Given X (data) and C (centroids), assigns cluster labels prY
  – Compares 2 clusterings, “predicted” prY and “specified” spY
• Computes WCSS, as well as Between-Cluster Sum of Squares (BCSS) and Total Sum of Squares (TSS)
  – Dataset X must be available
  – If centroids C are given, also computes C-WCSS and C-BCSS
• Two ways to compare prY and spY:
  – Same-cluster and different-cluster PAIRS from prY and spY
  – For each prY-cluster find the best-matching spY-cluster, and vice versa
  – All reported as counts and as percentages of the full count
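The pair-based comparison can be sketched as follows. This is a plain-Python illustration of the idea, not the predict script itself; the function name, the dictionary keys, and the label vectors are hypothetical.

```python
from itertools import combinations


def pair_counts(pr_y, sp_y):
    """Count record pairs by whether each clustering puts them in the same cluster."""
    counts = {"same_same": 0, "same_diff": 0, "diff_same": 0, "diff_diff": 0}
    for i, j in combinations(range(len(pr_y)), 2):
        pr = "same" if pr_y[i] == pr_y[j] else "diff"
        sp = "same" if sp_y[i] == sp_y[j] else "diff"
        counts[pr + "_" + sp] += 1   # key order: predicted, then specified
    return counts


# Hypothetical labelings: label names differ, and spY splits off the last record.
pr_y = [1, 1, 2, 2, 2]
sp_y = [2, 2, 1, 1, 3]

counts = pair_counts(pr_y, sp_y)
total = sum(counts.values())
agree = counts["same_same"] + counts["diff_diff"]
print(counts)                # {'same_same': 2, 'same_diff': 2, 'diff_same': 0, 'diff_diff': 6}
print(100.0 * agree / total) # 80.0
```

Pair counts depend only on the partitions, so they are unaffected by how the clusters happen to be numbered.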
Weighted Non-Negative Matrix Factorization (WNMF)
• INPUT: X is a non-negative (n × m)-matrix
  – Example: X_ij = 1 if person #i clicked ad #j, else X_ij = 0
• INPUT (OPTIONAL): W is a penalty (n × m)-matrix
  – Example: W_ij = 1 if person #i saw ad #j, else W_ij = 0
• OUTPUT: (n × k)-matrix U, (m × k)-matrix V such that:
  – k topics: U_ic = affinity(person #i, topic #c), V_jc = affinity(ad #j, topic #c)
  – Approximation: X_ij ≈ U_i1 · V_j1 + U_i2 · V_j2 + … + U_ik · V_jk
  – Predict a “click” if for some #c both U_ic and V_jc are high
$$\min_{U,V} \sum_{i=1}^{n} \sum_{j=1}^{m} W_{ij} \left( X_{ij} - (U V^T)_{ij} \right)^2 \quad \text{s.t. } U \ge 0,\ V \ge 0$$
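A small worked instance of the objective in plain Python, showing how W removes unseen (person, ad) pairs from the loss. The matrices and the function name `wnmf_loss` are made up for illustration.

```python
def wnmf_loss(X, W, U, V):
    """sum_ij W_ij * (X_ij - (U V^T)_ij)^2 for small dense lists of lists."""
    total = 0.0
    for i, row in enumerate(X):
        for j, x in enumerate(row):
            approx = sum(u * v for u, v in zip(U[i], V[j]))   # (U V^T)_ij
            total += W[i][j] * (x - approx) ** 2
    return total


# Hypothetical click data: 2 people, 2 ads, k = 2 topics.
X = [[1.0, 0.0],
     [0.0, 1.0]]
W = [[1.0, 1.0],
     [1.0, 0.0]]          # person #2 never saw ad #2
ones = [[1.0, 1.0],
        [1.0, 1.0]]       # unweighted case: every entry counts
U = [[1.0, 0.0],
     [0.0, 0.5]]
V = [[1.0, 0.0],
     [0.0, 1.0]]

print(wnmf_loss(X, ones, U, V))  # 0.25: the (2, 2) entry is off by 0.5
print(wnmf_loss(X, W, U, V))     # 0.0: that entry is unseen, so it is not penalized
```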
Weighted Non-Negative Matrix Factorization (WNMF)
• NOTE: Non-negativity is critical for this “bipartite clustering” interpretation of U and V
  – Matrix U of size n × k = cluster affinity for people
  – Matrix V of size m × k = cluster affinity for ads
• Negatives would violate the “disjunction of conjunctions” sense:
  – Approximation: X_ij ≈ U_i1 · V_j1 + U_i2 · V_j2 + … + U_ik · V_jk
  – Predict a “click” if for some #c both U_ic and V_jc are high
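WNMF: Multiplicative Update
$$U_{ij} \leftarrow U_{ij} \cdot \frac{\left[ (W \ast X)\, V \right]_{ij}}{\left[ (W \ast U V^T)\, V \right]_{ij} + \varepsilon}$$
$$V_{ij} \leftarrow V_{ij} \cdot \frac{\left[ (W \ast X)^T U \right]_{ij}}{\left[ (W \ast U V^T)^T U \right]_{ij} + \varepsilon}$$
(Here ∗ is the elementwise product.)
• Easy to parallelize using SystemML
• Multiple runs help avoid bad local optima
• Must specify k: Run for k = 1, 2, 3, ... (as in k-means)
Daniel D. Lee, H. Sebastian Seung, “Algorithms for Non-negative Matrix Factorization”, in NIPS 2000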
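One iteration of the two updates can be sketched with small dense lists in plain Python. The helper names and the rank-1 test matrix are assumptions chosen so the result is easy to check by hand; a real run would use sparse matrices and many iterations.

```python
def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def hadamard(A, B):
    return [[a * b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]


def scale(M, num, den, eps):
    """Elementwise M * num / (den + eps)."""
    return [[m * n / (d + eps) for m, n, d in zip(rm, rn, rd)]
            for rm, rn, rd in zip(M, num, den)]


def wnmf_step(X, W, U, V, eps=1e-9):
    WX = hadamard(W, X)
    # U update: U * ((W * X) V) / ((W * (U V^T)) V + eps)
    U = scale(U, matmul(WX, V),
              matmul(hadamard(W, matmul(U, transpose(V))), V), eps)
    # V update uses the new U: V * ((W * X)^T U) / ((W * (U V^T))^T U + eps)
    V = scale(V, matmul(transpose(WX), U),
              matmul(transpose(hadamard(W, matmul(U, transpose(V)))), U), eps)
    return U, V


def weighted_loss(X, W, U, V):
    R = matmul(U, transpose(V))
    return sum(w * (x - r) ** 2
               for rw, rx, rr in zip(W, X, R) for w, x, r in zip(rw, rx, rr))


# Hypothetical rank-1 data with all weights equal to 1, and k = 1.
X = [[1.0, 2.0], [2.0, 4.0]]
W = [[1.0, 1.0], [1.0, 1.0]]
U0, V0 = [[1.0], [1.0]], [[1.0], [1.0]]

loss_before = weighted_loss(X, W, U0, V0)
U1, V1 = wnmf_step(X, W, U0, V0)
loss_after = weighted_loss(X, W, U1, V1)
print(loss_before, loss_after)   # 11.0, then a value very close to 0
```

Each update multiplies a non-negative entry by a non-negative ratio, so U and V stay non-negative without any explicit projection step.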
Inside a Run of (W)NMF
• Assume that W is a sparse matrix

Shared initialization:
U = RND_U [, ((r-1)*k + 1) : (r*k)];
V = RND_V [, ((r-1)*k + 1) : (r*k)];
f_old = 0; i = 0;

Unweighted NMF:
f_new = sum ((X - U %*% t(V)) ^ 2);
while (abs (f_new - f_old) > tol * f_new & i < max_iter) {
    f_old = f_new;
    U = U * (X %*% V) / (U %*% (t(V) %*% V) + eps);
    V = V * t(t(U) %*% X) / (V %*% (t(U) %*% U) + eps);
    f_new = sum ((X - U %*% t(V))^2);
    i = i + 1;
}

Weighted NMF:
f_new = sum (W * (X - U %*% t(V)) ^ 2);
while (abs (f_new - f_old) > tol * f_new & i < max_iter) {
    f_old = f_new;
    U = U * ((W * X) %*% V) / ((W * (U %*% t(V))) %*% V + eps);
    V = V * (t(W * X) %*% U) / (t(W * (U %*% t(V))) %*% U + eps);
    f_new = sum (W * (X - U %*% t(V))^2);
    i = i + 1;
}
Sum-Product Rewrites
• Matrix chain product optimization
  – Example: (U %*% t(V)) %*% V = U %*% (t(V) %*% V)
• Moving operators from big matrices to smaller ones
  – Example: t(X) %*% U = t(t(U) %*% X)
• Opening brackets in expressions (ongoing research)
  – Example: sum ((X - U %*% t(V))^2) = sum (X^2) - 2 * sum (X * (U %*% t(V))) + sum ((U %*% t(V))^2)
  – K-means: D = rowSums (X ^ 2) - 2 * (X %*% t(C)) + t(rowSums (C ^ 2))
• Indexed sum rearrangements:
  – sum ((U %*% t(V))^2) = sum ((t(U) %*% U) * (t(V) %*% V))
  – sum (U %*% t(V)) = sum (colSums(U) * colSums(V))
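The two indexed-sum rearrangements can be checked numerically on a small example. The plain-Python sketch below uses made-up matrices; note that the right-hand sides only touch k × k matrices or length-k vectors and never build the n × m product.

```python
def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def total(A):
    return sum(sum(row) for row in A)


def col_sums(A):
    return [sum(col) for col in zip(*A)]


# Hypothetical factors: U is 3 x 2, V is 2 x 2.
U = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
V = [[1.0, 0.0], [2.0, 1.0]]

UVt = matmul(U, transpose(V))                  # the "big" n x m product

# sum ((U %*% t(V))^2) = sum ((t(U) %*% U) * (t(V) %*% V))
lhs_sq = sum(x ** 2 for row in UVt for x in row)
UtU, VtV = matmul(transpose(U), U), matmul(transpose(V), V)
rhs_sq = sum(a * b for ra, rb in zip(UtU, VtV) for a, b in zip(ra, rb))

# sum (U %*% t(V)) = sum (colSums(U) * colSums(V))
lhs_sum = total(UVt)
rhs_sum = sum(a * b for a, b in zip(col_sums(U), col_sums(V)))

print(lhs_sq, rhs_sq)     # 407.0 407.0
print(lhs_sum, rhs_sum)   # 39.0 39.0
```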
Operator Fusion: W. Sq. Loss
• Weighted Squared Loss: sum (W * (X - U %*% t(V))^2)
  – Common pattern for factorization algorithms
  – W and X usually very sparse (< 0.001)
  – Problem: “Outer” product of U %*% t(V) creates three dense intermediates in the size of X
→ Fused w.sq.loss operator:
  – Key observations: Sparse W * allows selective computation, and the “sum” aggregate significantly reduces memory requirements
[Figure: expression tree of sum (W * (X - U %*% t(V))^2), with nodes sum, *, W, ^2, -, X, U, t(V)]
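The idea behind the fused operator can be sketched in plain Python: visit only the nonzero cells of W, compute the single needed entry of U %*% t(V) on the fly, and accumulate straight into a scalar. The dictionary-based sparse format and the function names are assumptions of this sketch, not SystemML's internal representation.

```python
def fused_wsq_loss(W_nz, X_nz, U, V):
    """Weighted squared loss over nonzero weights only.

    W_nz, X_nz: dicts mapping (i, j) to the nonzero values of W and X.
    No n x m intermediate is ever materialized.
    """
    total = 0.0
    for (i, j), w in W_nz.items():
        pred = sum(u * v for u, v in zip(U[i], V[j]))   # one entry of U %*% t(V)
        total += w * (X_nz.get((i, j), 0.0) - pred) ** 2
    return total


def dense_wsq_loss(W, X, U, V):
    """Reference version that loops over every cell, for comparison."""
    return sum(W[i][j] * (X[i][j] - sum(u * v for u, v in zip(U[i], V[j]))) ** 2
               for i in range(len(X)) for j in range(len(X[0])))


# Hypothetical data: 2 x 2 matrices, k = 2.
U = [[1.0, 0.0], [0.0, 0.5]]
V = [[1.0, 0.0], [0.0, 1.0]]
X = [[1.0, 0.0], [0.0, 1.0]]
W = [[1.0, 0.0], [0.0, 2.0]]
X_nz = {(0, 0): 1.0, (1, 1): 1.0}
W_nz = {(0, 0): 1.0, (1, 1): 2.0}

print(fused_wsq_loss(W_nz, X_nz, U, V))   # 0.5
print(dense_wsq_loss(W, X, U, V))         # 0.5
```

The work is proportional to the number of nonzeros in W times k, rather than to n × m.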
BACK-UP
Computed Fields and api Depends in the Odoo 17Celine George
 
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptxECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptxiammrhaywood
 
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdfssuser54595a
 
Incoming and Outgoing Shipments in 1 STEP Using Odoo 17
Incoming and Outgoing Shipments in 1 STEP Using Odoo 17Incoming and Outgoing Shipments in 1 STEP Using Odoo 17
Incoming and Outgoing Shipments in 1 STEP Using Odoo 17Celine George
 
Interactive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communicationInteractive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communicationnomboosow
 
Painted Grey Ware.pptx, PGW Culture of India
Painted Grey Ware.pptx, PGW Culture of IndiaPainted Grey Ware.pptx, PGW Culture of India
Painted Grey Ware.pptx, PGW Culture of IndiaVirag Sontakke
 
CARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptxCARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptxGaneshChakor2
 
Biting mechanism of poisonous snakes.pdf
Biting mechanism of poisonous snakes.pdfBiting mechanism of poisonous snakes.pdf
Biting mechanism of poisonous snakes.pdfadityarao40181
 
Framing an Appropriate Research Question 6b9b26d93da94caf993c038d9efcdedb.pdf
Framing an Appropriate Research Question 6b9b26d93da94caf993c038d9efcdedb.pdfFraming an Appropriate Research Question 6b9b26d93da94caf993c038d9efcdedb.pdf
Framing an Appropriate Research Question 6b9b26d93da94caf993c038d9efcdedb.pdfUjwalaBharambe
 

Recently uploaded (20)

MARGINALIZATION (Different learners in Marginalized Group
MARGINALIZATION (Different learners in Marginalized GroupMARGINALIZATION (Different learners in Marginalized Group
MARGINALIZATION (Different learners in Marginalized Group
 
Alper Gobel In Media Res Media Component
Alper Gobel In Media Res Media ComponentAlper Gobel In Media Res Media Component
Alper Gobel In Media Res Media Component
 
OS-operating systems- ch04 (Threads) ...
OS-operating systems- ch04 (Threads) ...OS-operating systems- ch04 (Threads) ...
OS-operating systems- ch04 (Threads) ...
 
Hierarchy of management that covers different levels of management
Hierarchy of management that covers different levels of managementHierarchy of management that covers different levels of management
Hierarchy of management that covers different levels of management
 
Final demo Grade 9 for demo Plan dessert.pptx
Final demo Grade 9 for demo Plan dessert.pptxFinal demo Grade 9 for demo Plan dessert.pptx
Final demo Grade 9 for demo Plan dessert.pptx
 
ECONOMIC CONTEXT - LONG FORM TV DRAMA - PPT
ECONOMIC CONTEXT - LONG FORM TV DRAMA - PPTECONOMIC CONTEXT - LONG FORM TV DRAMA - PPT
ECONOMIC CONTEXT - LONG FORM TV DRAMA - PPT
 
CELL CYCLE Division Science 8 quarter IV.pptx
CELL CYCLE Division Science 8 quarter IV.pptxCELL CYCLE Division Science 8 quarter IV.pptx
CELL CYCLE Division Science 8 quarter IV.pptx
 
Capitol Tech U Doctoral Presentation - April 2024.pptx
Capitol Tech U Doctoral Presentation - April 2024.pptxCapitol Tech U Doctoral Presentation - April 2024.pptx
Capitol Tech U Doctoral Presentation - April 2024.pptx
 
Model Call Girl in Bikash Puri Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Bikash Puri  Delhi reach out to us at 🔝9953056974🔝Model Call Girl in Bikash Puri  Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Bikash Puri Delhi reach out to us at 🔝9953056974🔝
 
भारत-रोम व्यापार.pptx, Indo-Roman Trade,
भारत-रोम व्यापार.pptx, Indo-Roman Trade,भारत-रोम व्यापार.pptx, Indo-Roman Trade,
भारत-रोम व्यापार.pptx, Indo-Roman Trade,
 
Crayon Activity Handout For the Crayon A
Crayon Activity Handout For the Crayon ACrayon Activity Handout For the Crayon A
Crayon Activity Handout For the Crayon A
 
Computed Fields and api Depends in the Odoo 17
Computed Fields and api Depends in the Odoo 17Computed Fields and api Depends in the Odoo 17
Computed Fields and api Depends in the Odoo 17
 
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptxECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
 
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf
 
Incoming and Outgoing Shipments in 1 STEP Using Odoo 17
Incoming and Outgoing Shipments in 1 STEP Using Odoo 17Incoming and Outgoing Shipments in 1 STEP Using Odoo 17
Incoming and Outgoing Shipments in 1 STEP Using Odoo 17
 
Interactive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communicationInteractive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communication
 
Painted Grey Ware.pptx, PGW Culture of India
Painted Grey Ware.pptx, PGW Culture of IndiaPainted Grey Ware.pptx, PGW Culture of India
Painted Grey Ware.pptx, PGW Culture of India
 
CARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptxCARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptx
 
Biting mechanism of poisonous snakes.pdf
Biting mechanism of poisonous snakes.pdfBiting mechanism of poisonous snakes.pdf
Biting mechanism of poisonous snakes.pdf
 
Framing an Appropriate Research Question 6b9b26d93da94caf993c038d9efcdedb.pdf
Framing an Appropriate Research Question 6b9b26d93da94caf993c038d9efcdedb.pdfFraming an Appropriate Research Question 6b9b26d93da94caf993c038d9efcdedb.pdf
Framing an Appropriate Research Question 6b9b26d93da94caf993c038d9efcdedb.pdf
 

Clustering and Factorization using Apache SystemML by Alexandre V Evfimievski

  • 2. K-means Clustering
    • INPUT: n records x1, x2, …, xn as the rows of matrix X
      – Each xi is m-dimensional: xi = (xi1, xi2, …, xim)
      – Matrix X is (n × m)-dimensional
    • INPUT: k, an integer in {1, 2, …, n}
    • OUTPUT: A partition of the records into k clusters S1, S2, …, Sk
      – May use n labels y1, y2, …, yn in {1, 2, …, k}
      – NOTE: The same clusters can be labeled in k! ways; this matters when checking correctness (don’t just compare “predicted” and “true” labels)
    • METRIC: Minimize the within-cluster sum of squares (WCSS):
      $$\mathrm{WCSS} = \sum_{i=1}^{n} \left\| x_i - \mathrm{mean}(S_j : x_i \in S_j) \right\|_2^2$$
    • Cluster “means” are k vectors that capture as much variance in the data as possible
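To make the WCSS metric concrete, here is a minimal sketch in plain Python (not the deck's DML). The function name `wcss` and the toy data are illustrative assumptions. It also shows that relabeling the same clusters leaves the metric unchanged, which is why raw label comparison is not a correctness check.

```python
def wcss(X, labels):
    """Within-cluster sum of squares for rows X and one cluster label per row."""
    m = len(X[0])
    sums, counts = {}, {}
    for x, y in zip(X, labels):
        s = sums.setdefault(y, [0.0] * m)
        for d in range(m):
            s[d] += x[d]
        counts[y] = counts.get(y, 0) + 1
    # Mean of each cluster, keyed by its label
    means = {y: [v / counts[y] for v in s] for y, s in sums.items()}
    # Sum of squared distances from each record to its own cluster mean
    return sum(
        (x[d] - means[y][d]) ** 2
        for x, y in zip(X, labels)
        for d in range(m)
    )


X = [[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]]
print(wcss(X, [1, 1, 2, 2]))  # 4.0
print(wcss(X, [2, 2, 1, 1]))  # 4.0 (same clusters, labels swapped)
print(wcss(X, [1, 2, 1, 2]))  # 100.0 (a worse partition)
```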
  • 3. K-means Clustering
    • K-means is a little similar to linear regression:
      – Linear regression error = ∑i≤n (yi - xi·β)^2
      – BUT: Clustering describes the xi’s themselves, not the yi’s given the xi’s
    • K-means can work in “linearization space” (like kernel SVM)
    • How to pick k?
      – Try k = 1, 2, …, up to some limit; check for overfitting
      – Pick the best k in the context of the whole task
    • Caveats for k-means
      – K-means does NOT estimate a mixture of Gaussians
        • The EM algorithm does this
      – The k clusters tend to be of similar size
        • Do NOT use for imbalanced clusters!
    • Metric, as before:
      $$\mathrm{WCSS} = \sum_{i=1}^{n} \left\| x_i - \mathrm{mean}(S_j : x_i \in S_j) \right\|_2^2$$
  • 4. The K-means Algorithm
    • Pick k “centroids” c1, c2, …, ck from the records {x1, x2, …, xn}
      – Try to pick centroids far from each other
    • Assign each record to the nearest centroid:
      – For each xi compute di = min {dist(xi, cj) over all cj}
      – Cluster Sj ← {xi : dist(xi, cj) = di}
    • Reset each centroid to its cluster’s mean:
      – Centroid cj ← mean(Sj) = ∑i≤n (xi in Sj?) · xi / |Sj|
    • Repeat the “assign” and “reset” steps until convergence
    • Loss decreases: WCSSold ≥ C-WCSSnew ≥ WCSSnew
      – Converges to a local optimum (often not the global one)
    • Centroid-based metric:
      $$\text{C-WCSS} = \sum_{i=1}^{n} \left\| x_i - \mathrm{centroid}(S_j : x_i \in S_j) \right\|_2^2$$
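The “assign” and “reset” steps can be sketched in a few lines of plain Python. This is an illustrative sketch, not the SystemML script: the names `kmeans` and `sq_dist` are assumptions, ties go to the lowest-numbered centroid, and the loop stops when the assignments no longer change rather than on a relative WCSS threshold.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


def kmeans(X, C, max_iter=100):
    """Run assign/reset steps from initial centroids C (assumes max_iter >= 1).

    Returns (centroids, labels, c_wcss), with 0-based labels.
    """
    C = [list(c) for c in C]
    labels = None
    for _ in range(max_iter):
        # Assign step: index of the nearest centroid for every record
        new_labels = [min(range(len(C)), key=lambda j: sq_dist(x, C[j])) for x in X]
        if new_labels == labels:
            break  # assignments are stable, so the centroids are already the means
        labels = new_labels
        # Reset step: move each centroid to the mean of its cluster
        for j in range(len(C)):
            members = [x for x, y in zip(X, labels) if y == j]
            if not members:
                raise ValueError("runaway centroid: no record is closest to it")
            C[j] = [sum(col) / len(members) for col in zip(*members)]
    c_wcss = sum(sq_dist(x, C[y]) for x, y in zip(X, labels))
    return C, labels, c_wcss


X = [[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]]
C, labels, c_wcss = kmeans(X, [[0.0, 0.0], [10.0, 2.0]])
print(C)       # [[0.0, 1.0], [10.0, 1.0]]
print(labels)  # [0, 0, 1, 1]
print(c_wcss)  # 4.0
```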
  • 5. The K-means Algorithm
    • Runaway centroid: a centroid that is closest to no record at the “assign” step
      – Occasionally happens, e.g. with k = 3 centroids and 2 data clusters
      – Options: (a) terminate, (b) reduce k by 1
    • Centroids vs. means at early termination:
      – After the “assign” step, cluster centroids ≠ their means
        • Centroids: (a) define the clusters, (b) already computed
        • Means: (a) define the WCSS metric, (b) not yet computed
      – We report centroids and centroid-WCSS (C-WCSS)
    • Multiple runs:
      – Required to guard against a bad local optimum
      – Use a “parfor” loop, with random initial centroids
  • 6. K-means: DML Implementation

        # Centroids for this run (one parfor iteration per run)
        C = All_C [(k * (run - 1) + 1) : (k * run), ];
        iter = 0;  term_code = 0;  wcss = 0;
        while (term_code == 0) {
            # Squared distances to the centroids, without the per-row term rowSums (X ^ 2)
            D = -2 * (X %*% t(C)) + t(rowSums(C ^ 2));
            minD = rowMins (D);  wcss_old = wcss;
            # sumXsq is defined outside this excerpt (presumably sum (X ^ 2))
            wcss = sumXsq + sum (minD);
            if (wcss_old - wcss < eps * wcss & iter > 0) {
                term_code = 1;  # Convergence is reached
            } else {
                if (iter >= max_iter) {
                    term_code = 2;  # Iteration limit is reached
                } else {
                    iter = iter + 1;
                    # Assignment indicators; ties are split evenly across centroids
                    P = ppred (D, minD, "<=");
                    P = P / rowSums(P);
                    if (sum (ppred (colSums (P), 0.0, "<=")) > 0) {
                        term_code = 3;  # "Runaway" centroid
                    } else {
                        # Reset each centroid to its cluster's mean
                        C = t(P / colSums(P)) %*% X;
                    }
                }
            }
        }
        All_C [(k * (run - 1) + 1) : (k * run), ] = C;
        final_wcss [run, 1] = wcss;  t_code [run, 1] = term_code;

    • Slide callouts: “Want smooth assign? Edit here”, “Tensor avoidance maneuver”, “ParFor I/O”
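The listing never forms the full squared-distance matrix: it drops the per-row term ‖xi‖² from D and adds the total back once as `sumXsq`. The following plain-Python sketch checks that shortcut on toy data; the variable names and the data are illustrative assumptions, not part of the script.

```python
def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


X = [[1.0, 2.0], [3.0, 4.0], [5.0, 0.0]]
C = [[1.0, 1.0], [4.0, 2.0]]

# D[i][j] = -2 * <x_i, c_j> + ||c_j||^2, i.e. the squared distance minus ||x_i||^2.
# The dropped term is constant within a row, so the row-wise argmin is unchanged.
D = [[-2 * dot(x, c) + dot(c, c) for c in C] for x in X]
min_d = [min(row) for row in D]
sum_x_sq = sum(dot(x, x) for x in X)
c_wcss_fast = sum_x_sq + sum(min_d)

# Direct computation: squared distance from each record to its nearest centroid
c_wcss_direct = sum(
    min(sum((p - q) ** 2 for p, q in zip(x, c)) for c in C) for x in X
)

print(c_wcss_fast)    # 11.0
print(c_wcss_direct)  # 11.0
```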
  • 7. K-means++ Initialization Heuristic
    • Picks centroids from X at random, pushing them far apart
    • Gets WCSS down to O(log k) × optimal in expectation
    • How to pick centroids:
      – Centroid c1: Pick uniformly at random from the rows of X
      – Centroid c2: Prob [c2 ← xi] = (1/Σ) · dist(xi, c1)^2
      – Centroid cj: Prob [cj ← xi] = (1/Σ) · min{dist(xi, c1)^2, …, dist(xi, cj–1)^2}
      – The probability of picking a row is proportional to its squared min-distance from the earlier centroids (Σ is the normalizing sum over all rows)
    • If X is huge, we use a sample of X, different across runs
      – Otherwise picking k centroids requires k passes over X
    • Reference: David Arthur, Sergei Vassilvitskii, “k-means++: the advantages of careful seeding”, in SODA 2007
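The seeding rule above can be sketched in plain Python as follows. This is an illustrative sketch over the full data set (no sampling of X), and the names `kmeanspp_init` and `rng` are assumptions. It relies on `random.Random.choices`, which draws in proportion to the given weights, and it assumes X has at least k distinct rows.

```python
import random


def sq_dist(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b))


def kmeanspp_init(X, k, rng):
    """Pick k initial centroids from the rows of X with k-means++ weighting."""
    # First centroid: uniformly at random
    centroids = [rng.choice(X)]
    # Squared distance from every row to its nearest chosen centroid so far
    min_d = [sq_dist(x, centroids[0]) for x in X]
    while len(centroids) < k:
        # Draw the next centroid with probability proportional to min_d.
        # Rows that are already centroids have weight 0.
        c = rng.choices(X, weights=min_d, k=1)[0]
        centroids.append(c)
        min_d = [min(d, sq_dist(x, c)) for d, x in zip(min_d, X)]
    return centroids


X = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0), (5.0, 20.0)]
seeds = kmeanspp_init(X, 3, random.Random(0))
print(seeds)
```

Each new centroid costs one pass over X to refresh `min_d`, which is the k-passes cost the slide mentions.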
  • 8. K-means Predict Script
    • Predictor and Evaluator in one:
      – Given X (data) and C (centroids), assigns cluster labels prY
      – Compares 2 clusterings, “predicted” prY and “specified” spY
    • Computes WCSS, as well as Between-Cluster Sum of Squares (BCSS) and Total Sum of Squares (TSS)
      – Dataset X must be available
      – If centroids C are given, also computes C-WCSS and C-BCSS
    • Two ways to compare prY and spY:
      – Same-cluster and different-cluster PAIRS from prY and spY
      – For each prY-cluster find the best-matching spY-cluster, and vice versa
      – All reported as counts and as percentages of the full count
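The pair-based comparison does not depend on how clusters are numbered. Below is a small plain-Python sketch of that idea; the function name `pair_counts` and the dictionary keys are illustrative assumptions and do not reflect the script's actual output format.

```python
from itertools import combinations


def pair_counts(pr_y, sp_y):
    """Count record pairs by whether each labeling puts them in the same cluster."""
    counts = {"same_same": 0, "same_diff": 0, "diff_same": 0, "diff_diff": 0}
    for i, j in combinations(range(len(pr_y)), 2):
        same_pr = pr_y[i] == pr_y[j]
        same_sp = sp_y[i] == sp_y[j]
        key = ("same" if same_pr else "diff") + "_" + ("same" if same_sp else "diff")
        counts[key] += 1
    return counts


# Identical clusterings under a label permutation: no disagreeing pairs
agree = pair_counts([1, 1, 2, 2, 3], [2, 2, 3, 3, 1])
print(agree)  # {'same_same': 2, 'same_diff': 0, 'diff_same': 0, 'diff_diff': 8}

# Crossed clusterings: four of the six pairs disagree
crossed = pair_counts([1, 1, 2, 2], [1, 2, 1, 2])
print(crossed)  # {'same_same': 0, 'same_diff': 2, 'diff_same': 2, 'diff_diff': 2}
```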
  • 9. Weighted Non-Negative Matrix Factorization (WNMF)
    • INPUT: X is a non-negative (n × m)-matrix
      – Example: Xij = 1 if person #i clicked ad #j, else Xij = 0
    • INPUT (OPTIONAL): W is a penalty (n × m)-matrix
      – Example: Wij = 1 if person #i saw ad #j, else Wij = 0
    • OUTPUT: (n × k)-matrix U and (m × k)-matrix V such that:
      $$\min_{U,V} \sum_{i=1}^{n} \sum_{j=1}^{m} W_{ij} \left( X_{ij} - (U V^{T})_{ij} \right)^2 \quad \text{s.t.}\ U \ge 0,\ V \ge 0$$
      – k topics: Uic = affinity (person #i, topic #c), Vjc = affinity (ad #j, topic #c)
      – Approximation: Xij ≈ Ui1 · Vj1 + Ui2 · Vj2 + … + Uik · Vjk
      – Predict a “click” if for some #c both Uic and Vjc are high
  • 10. Weighted Non-Negative Matrix Factorization (WNMF)
    • NOTE: Non-negativity is critical for this “bipartite clustering” interpretation of U and V
      – Matrix U of size n × k = cluster affinity for people
      – Matrix V of size m × k = cluster affinity for ads
    • Negatives would violate the “disjunction of conjunctions” sense:
      – Approximation: Xij ≈ Ui1 · Vj1 + Ui2 · Vj2 + … + Uik · Vjk
      – Predict a “click” if for some #c both Uic and Vjc are high
    • Objective, as before:
      $$\min_{U,V} \sum_{i=1}^{n} \sum_{j=1}^{m} W_{ij} \left( X_{ij} - (U V^{T})_{ij} \right)^2 \quad \text{s.t.}\ U \ge 0,\ V \ge 0$$
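  • 11. WNMF: Multiplicative Update
    • Update rules (∗ is the element-wise product):
      $$U_{ij} \leftarrow U_{ij} \, \frac{\left[ (W \ast X)\, V \right]_{ij}}{\left[ (W \ast U V^{T})\, V \right]_{ij} + \varepsilon}$$
      $$V_{ij} \leftarrow V_{ij} \, \frac{\left[ (W \ast X)^{T} U \right]_{ij}}{\left[ (W \ast U V^{T})^{T} U \right]_{ij} + \varepsilon}$$
    § Easy to parallelize using SystemML
    § Multiple runs help avoid bad local optima
    § Must specify k: Run for k = 1, 2, 3, ... (as in k-means)
    • Reference: Daniel D. Lee, H. Sebastian Seung, “Algorithms for Non-negative Matrix Factorization”, in NIPS 2000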
  • 12. Inside A Run of (W)NMF
    • Assume that W is a sparse matrix
    • NMF (unweighted):

        U = RND_U [, (r-1)*k + 1 : r*k];
        V = RND_V [, (r-1)*k + 1 : r*k];
        f_old = 0;  i = 0;
        f_new = sum ((X - U %*% t(V)) ^ 2);
        while (abs (f_new - f_old) > tol * f_new & i < max_iter) {
            f_old = f_new;
            U = U * (X %*% V) / (U %*% (t(V) %*% V) + eps);
            V = V * t(t(U) %*% X) / (V %*% (t(U) %*% U) + eps);
            f_new = sum ((X - U %*% t(V))^2);
            i = i + 1;
        }

    • WNMF (weighted):

        U = RND_U [, (r-1)*k + 1 : r*k];
        V = RND_V [, (r-1)*k + 1 : r*k];
        f_old = 0;  i = 0;
        f_new = sum (W * (X - U %*% t(V)) ^ 2);
        while (abs (f_new - f_old) > tol * f_new & i < max_iter) {
            f_old = f_new;
            U = U * ((W * X) %*% V) / ((W * (U %*% t(V))) %*% V + eps);
            V = V * (t(W * X) %*% U) / (t(W * (U %*% t(V))) %*% U + eps);
            f_new = sum (W * (X - U %*% t(V))^2);
            i = i + 1;
        }
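For readers without a SystemML installation, the weighted update can be mirrored in plain Python on dense lists. This is a sketch under assumptions: the helper names (`matmul`, `hadamard`, `wloss`, `wnmf_step`), the toy matrices, and the fixed iteration count are illustrative, and nothing here exploits sparsity.

```python
import random


def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def hadamard(A, B):
    """Element-wise product of two equally sized matrices."""
    return [[a * b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]


def scale(M, num, den, eps):
    """Element-wise M * num / (den + eps)."""
    return [[v * p / (q + eps) for v, p, q in zip(rm, rn, rd)]
            for rm, rn, rd in zip(M, num, den)]


def wloss(W, X, U, V):
    """Weighted squared loss: sum of W * (X - U V^T)^2."""
    R = matmul(U, transpose(V))
    return sum(w * (x - r) ** 2
               for rw, rx, rr in zip(W, X, R)
               for w, x, r in zip(rw, rx, rr))


def wnmf_step(W, X, U, V, eps=1e-9):
    """One multiplicative update of U, then of V using the new U."""
    WX = hadamard(W, X)
    U = scale(U, matmul(WX, V),
              matmul(hadamard(W, matmul(U, transpose(V))), V), eps)
    V = scale(V, matmul(transpose(WX), U),
              matmul(transpose(hadamard(W, matmul(U, transpose(V)))), U), eps)
    return U, V


rng = random.Random(0)
X = [[1, 0, 1], [1, 1, 0], [0, 1, 1], [1, 0, 0]]  # clicks (toy data)
W = [[1, 1, 1], [1, 1, 0], [0, 1, 1], [1, 1, 1]]  # 1 = observed
k = 2
U = [[rng.random() for _ in range(k)] for _ in range(len(X))]
V = [[rng.random() for _ in range(k)] for _ in range(len(X[0]))]

losses = [wloss(W, X, U, V)]
for _ in range(30):
    U, V = wnmf_step(W, X, U, V)
    losses.append(wloss(W, X, U, V))
print(losses[0], losses[-1])
```

Because every factor in the update is non-negative, U and V stay non-negative as long as they start that way.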
  • 13. Sum-Product Rewrites
    • Matrix chain product optimization
      – Example: (U %*% t(V)) %*% V = U %*% (t(V) %*% V)
    • Moving operators from big matrices to smaller ones
      – Example: t(X) %*% U = t(t(U) %*% X)
    • Opening brackets in expressions (ongoing research)
      – Example: sum ((X - U %*% t(V))^2) = sum (X^2) - 2 * sum (X * (U %*% t(V))) + sum ((U %*% t(V))^2)
      – K-means: D = rowSums (X ^ 2) - 2 * (X %*% t(C)) + t(rowSums (C ^ 2))
    • Indexed sum rearrangements:
      – sum ((U %*% t(V))^2) = sum ((t(U) %*% U) * (t(V) %*% V))
      – sum (U %*% t(V)) = sum (colSums(U) * colSums(V))
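These identities are easy to sanity-check numerically. The plain-Python sketch below uses small integer matrices so that both sides compare exactly; the helper names and the data are illustrative assumptions.

```python
def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def total(A):
    return sum(sum(row) for row in A)


def elementwise(f, A, B):
    return [[f(a, b) for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]


def col_sums(A):
    return [sum(col) for col in zip(*A)]


U = [[1, 2], [0, 3], [4, 1]]              # n x k
V = [[2, 0], [1, 1], [0, 5], [3, 2]]      # m x k
X = [[1, 0, 2, 0], [0, 3, 0, 1], [2, 2, 0, 0]]  # n x m

UVt = matmul(U, transpose(V))             # the large n x m intermediate

# sum ((U %*% t(V))^2) = sum ((t(U) %*% U) * (t(V) %*% V)): only k x k products
lhs_sq = total(elementwise(lambda a, b: a * b, UVt, UVt))
rhs_sq = total(elementwise(lambda a, b: a * b,
                           matmul(transpose(U), U), matmul(transpose(V), V)))

# sum (U %*% t(V)) = sum (colSums(U) * colSums(V))
lhs_sum = total(UVt)
rhs_sum = sum(a * b for a, b in zip(col_sums(U), col_sums(V)))

# Opening brackets in the squared loss
lhs_loss = total(elementwise(lambda x, r: (x - r) ** 2, X, UVt))
rhs_loss = (total(elementwise(lambda a, b: a * b, X, X))
            - 2 * total(elementwise(lambda a, b: a * b, X, UVt))
            + rhs_sq)

print(lhs_sq == rhs_sq, lhs_sum == rhs_sum, lhs_loss == rhs_loss)  # True True True
```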
  • 14. Operator Fusion: W. Sq. Loss
    • Weighted Squared Loss: sum (W * (X - U %*% t(V))^2)
      – Common pattern for factorization algorithms
      – W and X usually very sparse (< 0.001)
      – Problem: “Outer” product of U %*% t(V) creates three dense intermediates in the size of X
    → Fused w.sq.loss operator:
      – Key observations: Sparse W * allows selective computation, and the “sum” aggregate significantly reduces memory requirements
    [Figure: operator tree for sum (W * (X - U %*% t(V))^2)]
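The idea behind the fused operator can be illustrated in plain Python: visit only the non-zero cells of W and accumulate a single scalar, so the dense U %*% t(V) is never materialized. This is a conceptual sketch, not SystemML's operator; the dictionary-of-cells representation and the names `fused_wsloss` and `dense_wsloss` are assumptions.

```python
def fused_wsloss(W_nz, X_nz, U, V):
    """sum(W * (X - U V^T)^2), touching only the non-zero cells of W.

    W_nz and X_nz map (i, j) -> value for non-zero cells only.
    """
    loss = 0
    for (i, j), w in W_nz.items():
        # One k-length dot product per visited cell; no n x m intermediate
        pred = sum(u * v for u, v in zip(U[i], V[j]))
        loss += w * (X_nz.get((i, j), 0) - pred) ** 2
    return loss


def dense_wsloss(W_nz, X_nz, U, V):
    """Reference computation over every cell of the n x m grid."""
    n, m = len(U), len(V)
    loss = 0
    for i in range(n):
        for j in range(m):
            pred = sum(u * v for u, v in zip(U[i], V[j]))
            loss += W_nz.get((i, j), 0) * (X_nz.get((i, j), 0) - pred) ** 2
    return loss


U = [[1, 0], [0, 2], [1, 1]]   # n = 3, k = 2
V = [[1, 1], [2, 0]]           # m = 2, k = 2
X_nz = {(0, 0): 1, (2, 1): 3}
W_nz = {(0, 0): 1, (1, 1): 1, (2, 1): 2}

fused = fused_wsloss(W_nz, X_nz, U, V)
dense = dense_wsloss(W_nz, X_nz, U, V)
print(fused, dense)  # 2 2
```

The fused version does work proportional to the number of non-zeros in W times k, instead of n × m × k.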