Classification Algorithms in Apache SystemML
Prithviraj Sen
Overview
• Supervised Learning and Classification
• Training Discriminative Classifiers
• Representer Theorem
• Support Vector Machines
• Logistic Regression
• Generative Classifiers: Naïve Bayes
• Deep Learning
• Tree Ensembles
Classification and Supervised Learning
• Supervised learning is a major area of machine learning
• The goal is to learn a function $f$ such that:
  $f: \mathbb{R}^m \to C$
  where $m$ is a fixed integer and $C$ is a fixed domain of labels
• Training: learn $f$ from a labeled dataset
• Testing: apply $f$ to unseen $x \in \mathbb{R}^m$
• Applications:
  • spam detection ($C$ = {spam, no-spam})
  • search advertising (each ad is a label)
  • recognizing hand-written digits ($C$ = {0, 1, …, 9})
Training a Classifier
• Given labeled training data $\{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$:
  $f = \arg\min_f \sum_{i=1}^{n} \ell_{01}(f(x_i), y_i)$
• Multiple issues*:
  • We have not chosen a form for $f$
  • $\ell_{01}$ is not convex:
    $\ell_{01}(u, v) = \begin{cases} 0 & \text{if } \mathrm{sign}(u) = \mathrm{sign}(v) \\ 1 & \text{otherwise} \end{cases}$
* "Algorithms for Direct 0-1 Loss Optimization in Binary Classification" by Nguyen and Sanner in ICML 2013
Training Discriminative Classifiers
  $f = \arg\min_f \sum_{i=1}^{n} \ell(f(x_i), y_i) + g(\|f\|)$
• The second term is "regularization"
• A common form for $f(x)$ is $\mathbf{w}'x$ (linear classifier)
• $\ell(\mathbf{w}'x, y)$ is a "convexified" loss, for example (with $y \in \{\pm 1\}$):

  Classifier                Loss function
  support vector machine    $\max(0, 1 - y\,\mathbf{w}'x)$
  logistic regression       $\log[1 + \exp(-y\,\mathbf{w}'x)]$
  AdaBoost                  $\exp(-y\,\mathbf{w}'x)$
  square loss               $(1 - y\,\mathbf{w}'x)^2$

• Besides discriminative classifiers, generative classifiers also exist
  • e.g., naïve Bayes
[Figure (from Nguyen and Sanner, ICML 2013): loss value as a function of the margin $m$, comparing the 0−1 loss, hinge loss, log loss, and squared loss]
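Each surrogate loss above is a one-line elementwise expression over the margins. A minimal DML sketch (illustrative, not one of the shipped scripts) that evaluates all of them on a grid of margin values like the figure's x-axis:

```
# margins would be m = y * (X %*% w); here just a grid as in the figure
m = seq(-1.5, 2, 0.1)
zeroOne = (m < 0)                  # 0-1 loss: 1 when the signs disagree
hinge   = max(1 - m, 0)           # SVM hinge loss
logloss = log(1 + exp(-m))        # logistic loss
expo    = exp(-m)                 # AdaBoost exponential loss
square  = (1 - m)^2               # squared loss
print("mean hinge loss on the grid: " + mean(hinge))
```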
Representer Theorem*
• If $g$ is real-valued, monotonically increasing and lies in $[0, \infty)$
• And if $\ell$ lies in $\mathbb{R} \cup \{\infty\}$, then
  $f(x) = \sum_{i=1}^{n} \alpha_i \, x'x_i$
• In particular:
  • Neither convexity nor differentiability is necessary☨
  • But both help with the optimization
    • Especially when using gradient-based methods
* "A Generalized Representer Theorem" by Schölkopf, Herbrich and Smola in COLT 2001
☨ "When is there a Representer Theorem?" by Argyriou, Micchelli and Pontil in JMLR 2009
Binary Class Support Vector Machines
  $\min_\mathbf{w} \sum_{i=1}^{n} \max(0, 1 - y_i \mathbf{w}'x_i) + \frac{\lambda}{2} \mathbf{w}'\mathbf{w}$
• Expressed in standard form:
  $\min_\mathbf{w} \sum_i \xi_i + \frac{\lambda}{2} \mathbf{w}'\mathbf{w}$
  s.t. $y_i \mathbf{w}'x_i \geq 1 - \xi_i \;\; \forall i$
       $\xi_i \geq 0 \;\; \forall i$
• Lagrangian ($\alpha_i, \beta_i \geq 0$):
  $\mathcal{L} = \sum_i \xi_i + \frac{\lambda}{2} \mathbf{w}'\mathbf{w} + \sum_i \alpha_i (1 - y_i \mathbf{w}'x_i - \xi_i) - \sum_i \beta_i \xi_i$
  $\frac{\partial \mathcal{L}}{\partial \mathbf{w}}$:  $\mathbf{w} = \frac{1}{\lambda} \sum_i \alpha_i y_i x_i$
  $\frac{\partial \mathcal{L}}{\partial \xi_i}$:  $1 = \alpha_i + \beta_i \;\; \forall i$
Binary SVM: Dual Formulation
  $\max_\alpha \sum_i \alpha_i - \frac{1}{2\lambda} \sum_{i,j} \alpha_i \alpha_j y_i y_j \, x_i'x_j$
  s.t. $0 \leq \alpha_i \leq 1 \;\; \forall i$
• Convex Quadratic Program
• Optimization algorithms such as Platt's SMO* exist
• Also possible to optimize the primal directly (l2-svm.dml, next slide)
• Kernel trick:
  • Redefine the inner product as $K(x_i, x_j)$
  • Projects data into a space $\phi(x)$ where classes may be separable
  • Well-known kernels: radial basis functions, polynomial kernel
* "Sequential Minimal Optimization: A Fast Algorithm for Training Support Vector Machines" by Platt, Tech Report 1998
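Since the dual touches the data only through the inner products $x_i'x_j$, kernelizing amounts to replacing the Gram matrix $XX'$ with a kernel matrix. A minimal DML sketch of the RBF kernel (the toy data and bandwidth below are illustrative assumptions):

```
X = rand(rows=100, cols=10)        # toy data: one example per row
sigma = 1.0                        # RBF bandwidth (illustrative choice)
ones = matrix(1, rows=1, cols=nrow(X))
sq = rowSums(X^2)                  # squared norms ||x_i||^2
# squared distances: ||x_i||^2 + ||x_j||^2 - 2 x_i'x_j
D2 = sq %*% ones + t(sq %*% ones) - 2 * (X %*% t(X))
K = exp(-D2 / (2 * sigma^2))       # K(x_i, x_j) replaces x_i'x_j
```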
Binary SVM in DML
  $\min_\mathbf{w} \lambda \mathbf{w}'\mathbf{w} + \sum_{i=1}^{n} \left[\max(0, 1 - y_i \mathbf{w}'x_i)\right]^2$
• Solve for $\mathbf{w}$ directly using:
  • Non-linear conjugate gradient descent
  • Newton's method to determine step size
• Most complex operation in the script:
  • Matrix-vector product
  • Incremental maintenance using vector-vector operations
[Script excerpt: matrix-vector products, the Fletcher-Reeves formula, and a 1D Newton method to determine the step size]
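A minimal DML sketch of the squared-hinge objective and gradient at the core of this approach, on toy data. The data, step count, and fixed step size are illustrative assumptions; the shipped script instead picks directions by conjugate gradient and step sizes by a 1D Newton method:

```
X = rand(rows=200, cols=5)                  # toy features
y = 2 * (rand(rows=200, cols=1) > 0.5) - 1  # toy labels in {-1, +1}
lambda = 1.0
w = matrix(0, rows=ncol(X), cols=1)
for (it in 1:20) {
  out = 1 - y * (X %*% w)                   # margin shortfall per example
  sv = (out > 0)                            # indicator of margin violators
  grad = 2 * lambda * w - 2 * (t(X) %*% (sv * out * y))  # objective gradient
  w = w - 0.01 * grad                       # plain gradient step (sketch only)
}
out = 1 - y * (X %*% w)
sv = (out > 0)
print("final objective: " + (lambda * sum(w^2) + sum(sv * out^2)))
```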
Multi-Class SVM in DML
• At least 3 different ways to define multi-class SVMs:
  • One-against-the-rest* (OvA)
  • Pairwise (or one-against-one)
  • Crammer-Singer SVM☨
• OvA multi-class SVM:
  • Each binary-class SVM is learnt in parallel (see the sketch below)
  • Inner body uses l2-svm's approach
[Script excerpt: a parallel for loop over classes]
* "In Defense of One-vs-All Classification" by Rifkin and Klautau in JMLR 2004
☨ "On the Algorithmic Implementation of Multiclass Kernel-Based Vector Machines" by Crammer and Singer in JMLR 2002
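A minimal DML sketch of the one-against-the-rest structure, assuming labels in Y are coded 1..k; trainBinarySVM is a hypothetical stand-in for the inner body of l2-svm.dml:

```
trainBinarySVM = function(matrix[double] X, matrix[double] y, double lambda)
    return (matrix[double] w) {
  # placeholder inner body: a few gradient steps on the squared hinge
  w = matrix(0, rows=ncol(X), cols=1)
  for (it in 1:20) {
    out = 1 - y * (X %*% w)
    sv = (out > 0)
    w = w - 0.01 * (2 * lambda * w - 2 * (t(X) %*% (sv * out * y)))
  }
}

X = rand(rows=200, cols=5)                      # toy data
Y = round(rand(rows=200, cols=1, min=1, max=4)) # toy labels in 1..4
k = 4                                           # number of classes (assumed)
W = matrix(0, rows=ncol(X), cols=k)
parfor (c in 1:k) {                             # each class trained in parallel
  yc = 2 * (Y == c) - 1                         # +1 for class c, -1 for rest
  W[, c] = trainBinarySVM(X, yc, 1.0)
}
```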
Logistic Regression
  $\max_\mathbf{w} -\sum_{i=1}^{n} \log\left(1 + e^{-y_i \mathbf{w}'x_i}\right) - \frac{\lambda}{2} \mathbf{w}'\mathbf{w}$
• To derive the dual form, use the following bound*:
  $\log \frac{1}{1 + e^{-y\mathbf{w}'x}} \leq \min_\alpha \; \alpha y \mathbf{w}'x - H(\alpha)$
  where $0 \leq \alpha \leq 1$ and $H(\alpha) = -\alpha \log(\alpha) - (1 - \alpha)\log(1 - \alpha)$
• Substituting:
  $\max_\mathbf{w} \min_\alpha \sum_i \left[\alpha_i y_i \mathbf{w}'x_i - H(\alpha_i)\right] - \frac{\lambda}{2} \mathbf{w}'\mathbf{w}$  s.t. $0 \leq \alpha_i \leq 1 \;\forall i$
  $\frac{\partial \mathcal{L}}{\partial \mathbf{w}}$:  $\mathbf{w} = \frac{1}{\lambda} \sum_i \alpha_i y_i x_i$
• Dual form:
  $\min_\alpha \frac{1}{2\lambda} \sum_{i,j} \alpha_i \alpha_j y_i y_j \, x_i'x_j - \sum_i H(\alpha_i)$  s.t. $0 \leq \alpha_i \leq 1 \;\forall i$
• Apply the kernel trick to obtain kernelized logistic regression
* "Probabilistic Kernel Regression Models" by Jaakkola and Haussler in AISTATS 1999
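On the primal side, the negated objective and its gradient are again a few matrix-vector products. A minimal DML sketch on toy data (names and data illustrative):

```
X = rand(rows=200, cols=5)
y = 2 * (rand(rows=200, cols=1) > 0.5) - 1   # labels in {-1, +1}
lambda = 1.0
w = matrix(0, rows=ncol(X), cols=1)
p = 1 / (1 + exp(-y * (X %*% w)))            # Pr(correct label) per example
obj = -sum(log(p)) + lambda / 2 * sum(w^2)   # negated objective, to minimize
grad = -(t(X) %*% (y * (1 - p))) + lambda * w
print("objective at w = 0: " + obj)
```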
Multiclass Logistic Regression
• Also called softmax regression or multinomial logistic regression
• $W$ is now a matrix of weights; the jth column contains the jth class's weights
  $\Pr(y \mid x) = \frac{e^{W_y'x}}{\sum_c e^{W_c'x}}$
  $\min_W \frac{\lambda}{2} \|W\|^2 + \sum_i \left[\log(Z_i) - x_i'W_{y_i}\right]$  where  $Z_i = \mathbf{1}' e^{W'x_i}$
• The DML script is called MultiLogReg.dml
• Uses a trust-region Newton method to learn the weights*
• Care needs to be taken because softmax is an over-parameterized function (see the sketch below)
* See the regression class's slides on ibm.biz/AlmadenML
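One consequence of the over-parameterization: adding a constant to every class score leaves Pr(y|x) unchanged, which is also the standard trick for computing softmax stably. A minimal DML sketch (shapes illustrative):

```
X = rand(rows=100, cols=10)           # one example per row
W = rand(rows=10, cols=4)             # jth column: jth class's weights
S = X %*% W                           # class scores
S = S - rowMaxs(S)                    # shift each row; Pr(y|x) is unchanged
P = exp(S) / rowSums(exp(S))          # row i, column c holds Pr(y = c | x_i)
print("mean row sum (should be 1): " + (sum(P) / nrow(P)))
```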
Generative Classifiers: Naïve Bayes
• Generative models "explain" the generation of the data
• Naïve Bayes assumes each feature is independent given the class label:
  $\Pr(x, y) = p_y \prod_j (p_{yj})^{n_j}$
  where $n_j$ is the count of feature $j$ in $x$
• A conjugate (Dirichlet) prior is used to avoid 0 probabilities; up to a normalizing constant:
  $\Pr(\{(x_i, y_i)\}) \propto \prod_y \left[\prod_j p_{yj}^{\lambda}\right] \prod_i \left[p_{y_i} \prod_j p_{y_i j}^{n_{ij}}\right]$
  s.t. $p_y \;\forall y$ and $p_{yj} \;\forall y \,\forall j$ form legal distributions
• The maximum is obtained when:
  $p_y = \frac{n_y}{\sum_{y'} n_{y'}} \;\;\forall y$,   $p_{yj} = \frac{\lambda + \sum_{i: y_i = y} n_{ij}}{m\lambda + \sum_j \sum_{i: y_i = y} n_{ij}} \;\;\forall y \,\forall j$
• This is multinomial naïve Bayes; other forms include multivariate Bernoulli*
* "A Comparison of Event Models for Naïve Bayes Text Classification" by McCallum and Nigam in AAAI/ICML-98 Workshop on Learning for Text Categorization
Naïve Bayes in DML
• Uses group-by aggregates
• Very efficient:
  • Non-iterative
• E.g., document classification with term-frequency feature vectors (bag-of-words)
[Script excerpt: a group-by aggregate, a matrix-vector op, and a group-by count]
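A minimal sketch of the group-by pattern, assuming X holds term-frequency rows and y holds class labels coded 1..k. The variable names here are illustrative, though the shipped script is built around the same kind of aggregate calls:

```
X = rand(rows=200, cols=50)                     # toy term frequencies
y = round(rand(rows=200, cols=1, min=1, max=4)) # toy labels in 1..4
k = 4                                           # number of classes (assumed)
lambda = 1.0                                    # Laplace smoothing

classCounts = aggregate(target=y, groups=y, fn="count", ngroups=k)  # n_y
prior = classCounts / nrow(X)                                       # p_y
featCounts = aggregate(target=X, groups=y, fn="sum", ngroups=k)     # k x m
cond = (featCounts + lambda) / (rowSums(featCounts) + ncol(X) * lambda)  # p_yj
```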
Deep Learning: Autoencoders
• Designed to discover the hidden subspace in which the data "lives"
• Layer-wise pretraining helps*
• Many of these can be stacked together
• Final layer is usually softmax (for classification)
• Weights may be tied or not, the output layer may have a non-linear activation function or not; many options☨
[Figure: an autoencoder with an input layer, a hidden layer, and an output layer]
* "A fast learning algorithm for deep belief nets" by Hinton, Osindero and Teh in Neural Computation 2006
☨ "On Optimization Methods for Deep Learning" by V. Le et al in ICML 2011
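A minimal DML sketch of one such layer's forward pass and reconstruction error. The sigmoid activation, tied decode weights, and layer width are illustrative choices among the options above, not the shipped script:

```
X = rand(rows=100, cols=20)                  # toy data
h = 5                                        # hidden width (assumed)
W = rand(rows=ncol(X), cols=h, min=-0.1, max=0.1)
b1 = matrix(0, rows=1, cols=h)
b2 = matrix(0, rows=1, cols=ncol(X))
H = 1 / (1 + exp(-(X %*% W + b1)))           # encode into the hidden subspace
Xhat = H %*% t(W) + b2                       # decode with tied weights W'
err = sum((X - Xhat)^2) / nrow(X)            # reconstruction error to minimize
print("reconstruction error: " + err)
```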
Deep Learning: Convolutional Neural Networks*
• Designed to exploit spatial and temporal symmetry
• A kernel is a feature whose weights are learnable
• The same kernel is used on all patches within an image
• SystemML surfaces various builtin functions, and also modules, to ease implementation of CNNs
• Builtin functions: conv2d, max_pool, conv2d_backward_data, conv2d_backward_filter, max_pool_backward
[Figure: convolution with 1 kernel]
* "Gradient-based Learning Applied to Document Recognition" by LeCun et al in Proceedings of the IEEE, 1998
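A minimal sketch of the conv2d builtin on one toy image; the shapes below are illustrative assumptions (images and filters are passed row-flattened, with the geometry supplied through the shape arguments):

```
N = 1                                    # one image
C = 1                                    # one channel
Hin = 8                                  # image height
Win = 8                                  # image width
K = 1                                    # one learnable kernel
R = 3                                    # kernel height
S = 3                                    # kernel width
img  = rand(rows=N, cols=C*Hin*Win)      # row-flattened image
filt = rand(rows=K, cols=C*R*S)          # row-flattened kernel
out = conv2d(img, filt, input_shape=[N,C,Hin,Win],
             filter_shape=[K,C,R,S], stride=[1,1], padding=[0,0])
# the same kernel slides over every 3x3 patch: out is N x (K*6*6) here
print("output cells per image: " + ncol(out))
```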
Decision Tree (Classification)
• Simple and easy-to-understand model for classification
• More interpretable results than other classifiers
• Recursively partitions the training data until the examples in each partition belong to one class or the partition becomes small enough
• Splitting tests for choosing feature $j$ ($x_j$: jth feature value of $x$):
  • Numerical: $x_j < \sigma$
  • Categorical: $x_j \in S$ where $S \subseteq$ Domain of feature $j$
• Measuring node impurity $\mathcal{J}$ over class frequencies $f_i$:
  • Entropy: $\sum_i -f_i \log(f_i)$
  • Gini: $\sum_i f_i(1 - f_i)$
• To find the best split across features, use information gain (see the sketch below):
  $\arg\max \; \mathcal{J}(X) - \frac{n_{left}}{n} \mathcal{J}(X_{left}) - \frac{n_{right}}{n} \mathcal{J}(X_{right})$
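A minimal DML sketch of both impurity measures at a node (the class counts are toy values):

```
counts = matrix("40 10 50", rows=3, cols=1)  # toy per-class counts at a node
f = counts / sum(counts)                     # class frequencies f_i
entropy = sum(-f * log(f))
gini = sum(f * (1 - f))
print("entropy: " + entropy + ", gini: " + gini)
# gain of a split = entropy - nLeft/n * entropyLeft - nRight/n * entropyRight,
# with entropyLeft/entropyRight computed the same way on each child's counts
```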
Decision Tree in DML
• Tree construction*:
  • Breadth-first expansion for nodes in the top levels
  • Depth-first expansion for nodes in the lower levels
• Input data needs to be transformed (dummy coded)
• Can control the complexity of the tree (pruning, early stopping)
* "PLANET: Massively Parallel Learning of Tree Ensembles with MapReduce" by Panda, Herbach, Basu, Bayardo in VLDB 2009
Random Forest (Classification)
• Ensemble of trees
• Each tree is learnt from a bootstrapped training set, sampled with replacement
• At each node, we sample a random subset of features to choose from
• Prediction is by majority voting
• In the script, we sample using the Poisson distribution (see the sketch below)
• By default, each tree is:
  • Trained using 2/3 of the training data
  • Tested on the remaining 1/3 (out-of-bag error estimation)
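A minimal sketch of the Poisson bootstrap, assuming rand's pdf="poisson" option: each row's multiplicity in a tree's sample is drawn Poisson(1), which approximates sampling with replacement and leaves roughly 1/e ≈ 1/3 of the rows out-of-bag:

```
X = rand(rows=1000, cols=10)                               # toy data
cnt = rand(rows=nrow(X), cols=1, pdf="poisson", lambda=1)  # row multiplicities
inBag = (cnt > 0)                            # ~2/3 of rows, used to train
oob = (cnt == 0)                             # ~1/3 of rows, used to test
print("out-of-bag fraction: " + (sum(oob) / nrow(X)))
```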
