Machine	Learning	101
Talha	Obaid
About	me
• Email	Security	@	Symantec
• Doing	Data	Science	to	fight	Spam	and	Malware
• Organizer	for	Python	Data	Science	Group	Singapore
• Monthly	regular	meet-ups—over	a	year
• http://meetup.com/pydata-sg >1.8K	members
• https://www.facebook.com/groups/pydatasg/ >1k members
• https://twitter.com/pydatasg
• https://engineers.sg/organizations/118 recorded and uploaded
• Previously	with	CENSAM	@	MIT
• Co-founded	startup(s)
• NUS	Alumni
• Some	questions
• How	many	of	you	have	heard	about	Machine	Learning	or	
ML?
• How	many	of	you	know	how	to	do	ML?
• How	many	of	you	earn	a	living	doing	ML?
• What	this	talk	offers
• Getting	a foot	in	the	door
• Grossly	oversimplifying	things
• How	to	learn	ML	from	literature
• Relate	to	ML	terms	when	thrown	at	you
• Types	of	ML
• Learning	ML	models	and	their	coding	(SciKit-learn	and	
why?)
• Linear	Regression
• Logistic	Regression
• Clustering
• Lessons	from	Practical	ML
@ObaidTal
Some	terminology
• Data	Science
• Data	Analytics
• Business	Analytics
• Artificial	Intelligence
• Machine	Learning
Ref.	Tuan	Q.	Phan
What	is	Data?
• Available	data	(Internal)
• Health	record
• Organization
• University
• …
• Available	data	(external)
• www.data.gov.sg
• Publicly	available	
corpuses	
• Quality	of	data
• Trustworthy	or	not
• Missing	data
• Huge	challenge	in	scientific	
community
• Other	jargon
• Tiny	Data:	Data	from	sensors
• Big	Data:	Data	on	massive	scale
• Fast	Data:	Hash-based	lookup
Machine	Learning,	defined
• A	field	of	study	that	gives	
computers	the	ability	to	learn	
without	being	explicitly	
programmed
– Arthur	Samuel	(1959)
• Samuel	wrote	a	program	to	play	
checkers	
• Eventually	his	program	learned	to	
play	better
Ref:	http://infolab.stanford.edu/pub/voy/museum/samuel.html
When	did	we	all	start	with	Machine	Learning?
• Take	a	look	at	the	following	(outputs)	and	guess	the	?:
• 1,	2,	3,	4,	5,	6,	?,	…,	?
• 2,	4,	6,	8,	10,	12,	?,	…,	?
• 3,	6,	9,	12,	?,	…,	?
• 1,	3,	9,	27,	?,	…,	?
• 4,	7,	10,	13,	?,	…,	?
• So how can we represent the above?
• Input -> [box] -> Output
• X -> [box] -> Y
• Call this box f()
• Output = f(Input) ... in maths
• Y = f(X)
• Answers
• Y=X
• Y=2*X
• Y=3*X	+	0
• Y=3^X (with input starting at 0; for input 1, 2, 3, … this is Y=3^(X-1))
• Y=3*X	+	1
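The "box" f() can be sketched as ordinary Python functions, one per sequence. This is only an illustration, assuming the inputs are 1, 2, 3, …; the names `rules` and `next_terms` are made up for this sketch.

```python
# Each sequence from the slide as a function f(x), with inputs x = 1, 2, 3, ...
rules = {
    "1, 2, 3, 4, ...": lambda x: x,
    "2, 4, 6, 8, ...": lambda x: 2 * x,
    "3, 6, 9, 12, ...": lambda x: 3 * x,
    "1, 3, 9, 27, ...": lambda x: 3 ** (x - 1),  # 3 ** x if the inputs start at 0
    "4, 7, 10, 13, ...": lambda x: 3 * x + 1,
}


def next_terms(f, known, count=2):
    """Continue a sequence whose first `known` terms used inputs 1..known."""
    return [f(x) for x in range(known + 1, known + count + 1)]


for name, f in rules.items():
    print(name, "->", next_terms(f, known=4))
```

Once f() is known, finding the "?" values is just a matter of calling it with the next inputs.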
In	school… Really,	how? How	to	find	‘…,?’	– A:	Equation	(Single	variable)
Assuming	input	is
1,2,3,4,5,6,…
Linear	Regression	– Statistical	term
Y=mx+b… from the last example, b=? & m=?
Answer: b = 1, m = 3
[Plot: the line Y=mx+b, with Input on the x-axis and Output on the y-axis]
Suppose this line is Y=3x+1.
Assume that this line is ‘surrounded’ by ‘+’ shaped points: the outputs we had, 4, 7, 10, 13, ?, …, ? (Y), for the inputs 1, 2, 3, 4, 5, 6, … (x).
The line Y=3x+1 ‘fits’ these points, which lets us find the ‘…, ?’.
Where	are	we	headed…
https://www.ltcconline.net/greenl/courses/154/factor/circle.htm
http://machinelearningmastery.com/basic-concepts-in-machine-learning/
[Diagram: x … and Y … go in, m and b come out]
Since x and Y are already known, we got Y=mx+b
So far, there is a single variable ‘x’ on the right-hand side. However, there can be more than one variable. Let’s sum up the right-hand sides of all the previous equations:
Y=x+2x+3x+(3x+1)+3^x
Let’s assume each ‘x’ on the right-hand side is a different variable:
Y=x1+2*x2+3*x3+(3*x4+1)+3^x5
How	to	fit	a	line	between	‘+’
shaped	points?
A:	Distance	formula
Making	sure	each	‘+’	is	
closest to	the	line	
or	vice	versa
x …
mx+b
Y
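A minimal sketch of fitting a line to the ‘+’ points, assuming "distance" means the vertical distance between each point and the line, squared and summed (ordinary least squares). The helper name `fit_line` is made up for this illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one feature: returns (m, b) of Y = m*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: how x and Y vary together, divided by how much x varies.
    m = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    # Intercept: the line must pass through the point of means.
    b = mean_y - m * mean_x
    return m, b


xs = [1, 2, 3, 4]
ys = [4, 7, 10, 13]
m, b = fit_line(xs, ys)
print("m =", m, "b =", b)
print("next outputs:", [m * x + b for x in (5, 6)])
```

Because these four points lie exactly on a line, the fit recovers the slope and intercept from the earlier slide; with noisy points it would return the line with the smallest total squared distance.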
Next,	let’s	move
to	different	types	of	
Machine	Learning…
Supervised	Learning
• We provide the inputs (a dataset) together with the known outputs, and the algorithm comes up with the answer, i.e. the model.
• In the literature, “The Boston housing prices” example is a “Regression problem”, i.e. the outcome being predicted is a continuous-valued variable.
• “Classification problem”: the variable we are trying to predict is discrete, e.g. in the spam problem the output is either 0 or 1
• There can always be more than one feature (input variable), i.e. a graph with multiple dimensions.
• Give the model the right answers, i.e. Y, train it on a number of input sets, i.e. x, and ask the algorithm or model to reproduce the same answers
Types	of	Machine	Learning
Ref.	Andrew	Y.	Ng.
1
Unsupervised	Learning
Types	of	Machine	Learning
• Data	is	given,	and	structure	must	be	inferred
• Clustering is	one	example	of	it
• Deep	Learning	is	also	considered	here
• Example	is	finding	clusters	in	
– Gene	data
– Image	processing,	grouping	pixels	together
– Social	network	analysis
– Lots	of	people	talking,	extracting	the	voice	of	single	person,	considering	
voices	of	others	as	noise	– Cocktail	party	problem
– Text	processing
• Independent component analysis (ICA) algorithm
Ref. Andrew Y. Ng.
2
Reinforcement	Learning
• A sequence of decisions is made over time
• Example
• Flying	an	autonomous	helicopter
• Reward	function
• Specify	what	you	want	to	get	done
• Specify	a	good	behavior	and	bad	behavior	in	Reward	function
• Learning	algorithm	will	decide	to	maximize	good	behavior	and	minimize	
bad	behavior
Types	of	Machine	Learning
Ref.	Andrew	Y.	Ng.
3
Getting	ready	– some	more	terms…
• The data set (input) is also called the training set, or observations
• The predictor is called a hypothesis for historical reasons; it is also called a classifier, estimator or predictor
• Boston housing price problem (we’ll see more of it)
• We will train/learn and predict the price
• Features or input variables are on the right side of Y = mx+b, i.e. x
• Price, i.e. Y, is the output or target variable of Y = mx+b
• The linear equation Y=mx+b can be used as the predictor, where m is the slope and b is the intercept
• Cost function: tells us which Y=mx+b is better (we will see more of it)
Let’s	get	coding… 1
• To	remember
• Will	expand	on
Popular Machine Learning Tool Kit – Introduction
Project | Language | Highlight
R | R | A language for statistical analysis and ML
Octave | Octave | A language to simulate Matlab for numerical computations
Scikit-learn | Python | Documentation, examples and tutorials available. General purpose with a simple API
TensorFlow | Python bindings | A library for numerical computation using data flow graphs
Orange | Python | General-purpose ML package
PyBrain | Python | Neural networks, unsupervised learning
MLlib | Python/Scala | Apache’s new library built into Spark
Mahout | Java | Apache’s framework based on Hadoop
Weka | Java | General-purpose ML package
GoLearn | Go | Machine learning in Go
Shogun | C++ | User interfaces to various languages
Machine	Learning	Kit	– which	to	choose
• Factors	to	consider
• Language
• Performance	(run	speed)
• Scalability
Ref.	T.	Obaid	&	H.	Zhang
• We choose scikit-learn
• Language: Python
• Performance (run speed): good enough
• Scalability: not critical, and can switch to MLlib in Spark for mass data
• Well documented, enough algorithms, clean API, robust, fast implementation, easy usage
Scikit Learn	– Machine	
Learning	in	Python
• Simple	and	efficient	tools	for	
data	mining	and	data	analysis
• Accessible	to	everybody,	and	
reusable	in	various	contexts
• Built	on	NumPy,	SciPy,	and	
matplotlib
• Open	source,	commercially	
usable	– BSD	license
Scikit Learn	– Examples
• A	lot	of	sample	codes	are	in	source	folder:
scikit-learn-0.16.1/examples
• Boston	housing	prices (we	will	work	with	this	example	dataset)
• Will	try	features	one	by	one	(test	only	3	of	them	in	this	session,	
please	try	more)
• Excerpt	of	data…	(how	our	data	actually	looks	like)
CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | MEDV
0.00632 | 18 | 2.31 | 0 | 0.538 | 6.575 | 65.2 | 4.09 | 1 | 296 | 15.3 | 396.9 | 4.98 | 24
0.02731 | 0 | 7.07 | 0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2 | 242 | 17.8 | 396.9 | 9.14 | 21.6
0.02729 | 0 | 7.07 | 0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2 | 242 | 17.8 | 392.83 | 4.03 | 34.7
0.03237 | 0 | 2.18 | 0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3 | 222 | 18.7 | 394.63 | 2.94 | 33.4
0.06905 | 0 | 2.18 | 0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3 | 222 | 18.7 | 396.9 | 5.33 | 36.2
Details	about	each	
feature	of	this	data	
are	coming	next…
• To	remember
• Please	explore	…
Features	of	Boston	housing	prices
1. CRIM	per	capita	crime	rate	by	town		
2. ZN	proportion	of	residential	land	zoned	for	lots	over	25,000	sq.ft.
3. INDUS	proportion	of	non-retail	business	acres	per	town		
4. CHAS	Charles	River	dummy	variable	(=	1	if	tract	bounds	river;	0	otherwise)		
5. NOX	nitric	oxides	concentration	(parts	per	10	million)		
6. RM	average	number	of	rooms	per	dwelling		
7. AGE	proportion	of	owner-occupied	units	built	prior	to	1940		
8. DIS	weighted	distances	to	five	Boston	employment	centers		
9. RAD	index	of	accessibility	to	radial	highways		
10. TAX	full-value	property-tax	rate	per	$10,000		
11. PTRATIO	pupil-teacher	ratio	by	town
12. B	1000(Bk - 0.63)^2	where	Bk is	the	proportion	of	blacks	by	town		
13. LSTAT	%	lower	status	of	the	population		
14. MEDV	Median	value	of	owner-occupied	homes	in	$1000s	
Features	and	their	details
Which	of	these	features	are	
significant:
• All	of	them?
• A	few	of	them?
• Another	one,	not	in	them?
Let’s	observe	these…
Scikit Learn	– Demo	code	for	Boston	house	
price.	Try	it!
# Note: load_boston was removed in scikit-learn 1.2; this demo needs an older release (the slides use 0.16.1)
import matplotlib.pyplot as plt  # for plotting
import numpy as np  # for matrix/array operations
from sklearn import datasets, linear_model  # datasets and linear models

boston = datasets.load_boston()
feature_index = 12  # 12 is LSTAT, 10 is PTRATIO, 5 is RM; try each one by one
boston_X_train = boston.data[:, np.newaxis, feature_index]  # one feature, shape (n_samples, 1)
boston_y_train = boston.target  # MEDV, median home value in $1000s

regr = linear_model.LinearRegression()  # estimator
regr.fit(boston_X_train, boston_y_train)  # train parameters
print('Coefficients', regr.coef_)
print('intercept', regr.intercept_)

fig, ax = plt.subplots()
ax.scatter(boston_X_train, boston_y_train, color='black')  # actual data points
ax.plot(boston_X_train, regr.predict(boston_X_train), color='green', linewidth=3)  # fitted line
ax.set_xlabel(boston.feature_names[feature_index])
ax.set_ylabel('MEDV (black: actual, green: predicted)')
plt.show()
Ref.	T.	Obaid	&	H.	Zhang
• Important	...
• Good	Feature?
• Not	so	Good	Feature?
• Comments
Scikit Learn	– Demo	result	for	Boston	house	price
• Parameters
(Coefficients,	 -0.95692593	)
(intercept,	 34.7411998746244)
• Feature:
• %	lower status of	the	population	
• y=-0.95692593	*LSTAT +	34.7411998746244
• Looks	good!
1st Try	with	LSTAT	%	lower	status	of	the	population
Demo	result	Contd.	
• Parameters
(Coefficients,	 -2.1571753)
(intercept,	 62.3446274748)
• Feature:
• pupil-teacher	ratio	by	town		
• y=-2.1571753*PTRATIO +	62.3446274748
• Doesn’t	look	good!
2nd Try	with	PTRATIO	pupil-teacher	ratio	by	town
Demo	result	Contd.
• Parameters
(Coefficients,	 9.126359)
(intercept,	 -34.7856369115583)
• Feature:
• average number of	rooms per	dwelling
• y=9.126359*RM -34.7856369115583	
• Looks good!
3rd Try	with	RM	average	number	of	rooms	per	dwelling
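The Parameters on these result slides are the fitted slope (coefficient) and intercept. A small sketch of how they can be read from a fitted estimator, using the toy sequence 4, 7, 10, 13 instead of the Boston data (an assumption made here so the snippet runs without that dataset):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1], [2], [3], [4]])  # one feature per row, shape (n_samples, 1)
y = np.array([4, 7, 10, 13])

regr = LinearRegression().fit(X, y)
slope = regr.coef_[0]        # one coefficient per feature
intercept = regr.intercept_
print("Coefficients", slope)
print("intercept", intercept)
print("prediction for x=5:", regr.predict([[5]])[0])
```

The same two attributes, `coef_` and `intercept_`, give the y = m*feature + b equation quoted for each feature.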
Cost	function	– the	lower the	cost,	the	better	the	model
Real LSTAT Predicted Difference Square
... ... ... ... ...
18.3 14.1 21.24854426 2.948544262 8.693913263
21.2 12.92 22.37771686 1.177716859 1.387017
17.5 15.1 20.29161833 2.791618332 7.793132909
16.8 14.33 21.0284513 4.228451298 17.87980038
22.4 9.67 25.48772613 3.087726132 9.534052663
20.6 9.08 26.05231243 5.45231243 29.72771084
23.9 5.64 29.34413763 5.444137629 29.63863453
22 6.48 28.54031985 6.540319848 42.77578372
11.9 7.88 27.20062355 15.30062355 234.1090809
Total: 19478.69458
Total/2 9739.347291
Ref.	Andrew	Y.	Ng.
Real RM Predicted Difference Square
... ... ... ... ...
18.3 5.794 18.09248713 -0.207512866 0.043061589
21.2 6.019 20.14591791 -1.054082091 1.111089054
17.5 5.569 16.03905636 -1.460943641 2.134356321
16.8 6.027 20.21892878 3.418928781 11.68907401
22.4 6.593 25.38444798 2.984447975 8.906929718
20.6 6.12 21.06768017 0.467680168 0.21872474
23.9 6.976 28.87984347 4.979843472 24.79884101
22 6.794 27.21884613 5.218846134 27.23635497
11.9 6.03 20.24630786 8.346307858 69.66085487
Total: 22062.73306
Total/2 11031.36653
LSTAT table: Predicted = -0.95692593 * LSTAT + 34.7411998746244
RM table: Predicted = 9.126359 * RM - 34.7856369115583
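Least-squares cost function
$J(m, b) = \frac{1}{2} \sum_{i=1}^{n} \left( (m x_i + b) - Y_i \right)^2$, where n is the number of rows in the table
Comment: here the summation is nothing but a for loop: for (i = 1; i <= n; i++)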
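As a sketch, the Difference, Square, Total and Total/2 columns of the tables can be reproduced with a plain loop. The function name `least_squares_cost` is made up for this illustration, and only the first two visible LSTAT rows are used, so the result is not the full-table total.

```python
def least_squares_cost(xs, ys, slope, intercept):
    """Half the sum of squared differences between predicted and real values."""
    total = 0.0
    for x, y in zip(xs, ys):  # the summation is just a loop over the rows
        predicted = slope * x + intercept
        difference = predicted - y
        total += difference ** 2
    return total / 2


# Two rows taken from the LSTAT table: (LSTAT, Real)
lstat = [14.1, 12.92]
real = [18.3, 21.2]
cost = least_squares_cost(lstat, real, slope=-0.95692593, intercept=34.7411998746244)
print(cost)
```

Running the same function over every row, once per candidate line, is how two candidate models can be compared: the lower cost wins.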
How well are we doing – Compare the Good ones
• 1 Good Feature VS Another Good Feature
Over-fitting	and	Under-fitting
The	Good	model	is	…	the	“Just	right!”	model	– Why?
• Under-fitting – high bias: the model does not match the data and the cost is too high
• Just right is what we need
• Over-fitting – high variance: happens mostly when too many features are used or the model is too complex
• The model should learn, not memorize
http://i.imgur.com/W0qejU0.png
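One way to see under-fitting and over-fitting is to fit polynomials of increasing degree to noisy data. The sketch below uses made-up data (a noisy sine curve) and NumPy's `polyfit`; the degrees 1, 3 and 9 are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.RandomState(0)
x_train = np.linspace(0, 1, 20)
x_test = np.linspace(0.025, 0.975, 20)  # points the model never saw


def truth(x):
    return np.sin(2 * np.pi * x)


y_train = truth(x_train) + rng.normal(scale=0.2, size=x_train.size)
y_test = truth(x_test) + rng.normal(scale=0.2, size=x_test.size)


def mean_squared_error(coeffs, x, y):
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))


train_error, test_error = {}, {}
for degree in (1, 3, 9):
    coeffs = np.polyfit(x_train, y_train, degree)  # least-squares polynomial fit
    train_error[degree] = mean_squared_error(coeffs, x_train, y_train)
    test_error[degree] = mean_squared_error(coeffs, x_test, y_test)
    print(degree, train_error[degree], test_error[degree])
```

The training error can only go down as the degree grows, since a more complex model can always memorize more. The error on unseen points is the one to watch: a straight line (degree 1) under-fits a sine curve, and a very flexible model typically starts chasing the noise.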
Scikit Learn	– Usage
from sklearn import linear_model

X = [[1.0], [2.0], [3.0], [4.0]]  # source data with shape (n_samples, n_features); placeholder values
Y = [4.0, 7.0, 10.0, 13.0]  # target values with shape (n_samples,)
clf = linear_model.LinearRegression()  # estimator (classifiers follow the same pattern)
clf = clf.fit(X, Y)  # learn parameters from existing data
Test = [[5.0], [6.0]]  # same number of features as X
print(clf.predict(Test))  # predict the target for data in Test
Ref.	T.	Obaid	&	H.	Zhang
The	model	program	skeleton	would	look	something	like…
• Important
1. Model
2. Fit
3. Predict
• Comments
Observations	from	code
• There is always a fit function call, i.e. learning/training on X to give Y.
• Likewise there is a predict function call: given X only, it pops out Y.
• The pandas library (imported as pd) can alternatively be used for a relatively simpler display of the data
• The train_test_split function call serves an important purpose: it shuffles the dataset so we don’t have selection bias. For instance, if the data is ordered by ascending price and halved, one half for training and one half for testing, then the training data may contain only the houses with lower prices.
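A short sketch of the selection-bias point, using made-up house prices sorted in ascending order. It assumes a recent scikit-learn, where `train_test_split` lives in `sklearn.model_selection` and accepts a `shuffle` argument (in the 0.16 release used in these slides it was in `sklearn.cross_validation`).

```python
import numpy as np
from sklearn.model_selection import train_test_split

prices = np.arange(100, 200)            # 100 made-up house prices, sorted ascending
rooms = (prices / 20.0).reshape(-1, 1)  # one made-up feature per house

# Naive halving: the first half trains, the second half tests.
X_tr_naive, X_te_naive, y_tr_naive, y_te_naive = train_test_split(
    rooms, prices, test_size=0.5, shuffle=False
)
# Default behaviour: shuffle before splitting.
X_tr, X_te, y_tr, y_te = train_test_split(
    rooms, prices, test_size=0.5, random_state=0
)

print("naive training prices:   ", y_tr_naive.min(), "to", y_tr_naive.max())
print("shuffled training prices:", y_tr.min(), "to", y_tr.max())
```

Without shuffling, the model never sees a house from the expensive half during training; with shuffling, both halves cover the whole price range.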
• To	remember
• Subtleties
• Probable	Issue
Scikit Learn	– Test	Data
• Scikit-learn	comes	with	a	few	standard	datasets,	for	instance	the	iris	and	
digits	datasets	for	classification	and	the	Boston	house	prices	dataset	for	
regression.
• boston (Boston house prices), iris (iris flower), mlcomp (20 newsgroups), svmlight_file/s, diabetes, lfw_pairs (labeled faces), sample_image/s (china and flower), digits (0-9 handwriting), lfw_people (labeled people), linnerud (for multivariate regression)
• scipy.misc.lena() (removed from later SciPy releases)
• Load	test	data	…	Try	others!
from sklearn import datasets
iris = datasets.load_iris()
digits = datasets.load_digits()
Subset	of	learning	datasets	– just	saw Boston	housing	prices
• … Seen	so	far
• Ahead	…
• Please	Explore	…
Scikit Learn	– Main	Algorithms
• Supervised	learning	(most	have	both	classifier	and	regressor)
• Linear model: LinearRegression, Lasso, Ridge, LogisticRegression, SGD
• SVM: LinearSVC, SVC, SVR
• Naïve Bayes: GaussianNB, MultinomialNB, BernoulliNB
• Decision Tree: DecisionTree (optimized version of CART)
• Ensemble method: RandomForest, AdaBoost, GradientBoosting (GBDT)
• Unsupervised learning
• Clustering: KMeans (k-means++, mini-batch), DBSCAN
• Manifold learning (dimension reduction): MDS, Isomap, LocallyLinearEmbedding.
• Algorithm	whole	list:
http://scikit-learn.org/stable/modules/classes.html
Subset	of	supported	algorithms	– we	just	saw LinearRegression
• … Seen	so	far
• Ahead	…
Logistic	(Classification)	Regression
• Regression is	when	our	labels	y	can	take	any	real	(continuous)	value.	
Examples	include:
• Predicting	stock	market.
• Predicting	sales.
• Detecting	the	age	of	a	person	from	a	picture.
• Classification is	when	our	labels	y	can	only	take	a	finite	set	of	values	
(categories).	Examples	include:
• Handwritten digit recognition: x is an image with a handwritten digit, y is a digit between 0 and 9.
• Spam filtering: x is an e-mail, and y is 0 or 1 depending on whether that e-mail is spam or not.
Linear	(Regression)	vs Logistic	(Classification)
Classification	(finite	output	values)	vs Regression	(continuous	output	values)
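To connect the two: logistic regression reuses the linear score mx+b, but passes it through the logistic (sigmoid) function so that the result lies between 0 and 1, and then thresholds it to get a discrete class. A minimal sketch, where `m`, `b` and the 0.5 threshold are made-up illustration values, not fitted to any data:

```python
import math


def logistic(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def classify(x, m, b, threshold=0.5):
    """Return 1 (e.g. spam) when the predicted probability reaches the threshold, else 0."""
    probability = logistic(m * x + b)
    return 1 if probability >= threshold else 0


# m and b are made-up values for illustration, not learned from data.
m, b = 2.0, -4.0
for x in (0.0, 2.0, 5.0):
    print(x, round(logistic(m * x + b), 3), classify(x, m, b))
```

The continuous value in the middle column is what regression would report; the last column is the finite-valued output that makes it classification.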
Logistic	Regression	– with	IRIS	example
• Categorical	output	instead	of	continuous	output
• Will	use	IRIS	dataset	– to	classify	3	species	of	plants	
• Number	of	Instances:	150	(50	in	each	of	three	classes)
• Number	of	Attributes:	4	numeric,	predictive	attributes	and	the	class
• Attribute/Feature Information:
• sepal	length	in	cm	(will	use	this)
• sepal	width	in	cm	(will	use	this)
• petal	length	in	cm
• petal	width	in	cm
• Classes	i.e.	Target:
• Iris-Setosa
• Iris-Versicolour
• Iris-Virginica
IRIS	is	a	database	of	flower	classes…	bears	a	little	bit	of	botany
Setosa Versicolour Virginica
• Petal	is	the	colored	part	of	the	flower
• Sepal	is	the	green	leaf	below	the	petal
Let’s	go	code…	Try	it!
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

iris = load_iris()
print("--- Keys ---\n", iris.keys())
print("--- Shape ---\n", iris.data.shape)
print("--- Feature Names ---\n", iris.feature_names)
print("--- Description ---\n", iris.DESCR)
print("--- Target ---\n", iris.target)
iri = pd.DataFrame(iris.data)
print("--- Panda Head ---\n", iri.head())
iri.columns = iris.feature_names
print("--- Panda Columns ---\n", iri.head())

logreg = LogisticRegression(C=1e5)
X = iris.data[:, :2]  # we only take the first two features
Y = iris.target
print("--- X ---\n", X)
print("--- y ---\n", Y)
# we fit the LogisticRegression instance to the data
logreg.fit(X, Y)  # again, the infamous fit method
Part	1 Part	2
• Preparation
• Important
• Debug
A	little	bit	more…	Try	it!
# Plotting
h = .02  # step size in the mesh
# Plot the decision boundary. For that, we will assign a color to each
# point in the mesh [x_min, x_max]x[y_min, y_max].
x_min, x_max = X[:, 0].min() - .5, X[:, 0].max() + .5
y_min, y_max = X[:, 1].min() - .5, X[:, 1].max() + .5
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
# Prediction
Z = logreg.predict(np.c_[xx.ravel(), yy.ravel()])
# Put the result into a color plot
Z = Z.reshape(xx.shape)
plt.figure(1, figsize=(4, 3))
plt.pcolormesh(xx, yy, Z, cmap=plt.cm.Paired)
# Plot also the training points
plt.scatter(X[:, 0], X[:, 1], c=Y, edgecolors='k', cmap=plt.cm.Paired)
plt.xlabel('Sepal length')
plt.ylabel('Sepal width')
plt.xlim(xx.min(), xx.max())
plt.ylim(yy.min(), yy.max())
plt.xticks(())
plt.yticks(())
plt.show()
Part	3 Part	4
• Plotting
• Important
• Debug
Classification	– output
Two	features,	thus	plotted	in	2D	plane
Clustering
• Unsupervised	learning
• Output	unknown
• Grouping	observation
K-Means
• One of the most popular "clustering" algorithms.
• Stores k centroids that it uses to define clusters.
• A point belongs to a particular cluster if it is closer to that cluster's centroid than to any other centroid.
• Finds the best centroids by alternating between
• assigning data points to clusters based on the current centroids
• choosing centroids (points which are the center of a cluster) based on the current assignment of data points to clusters.
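The two alternating steps can be sketched in a few lines of NumPy. This is a bare-bones illustration, not scikit-learn's implementation; the function name `k_means`, the fixed iteration count and the random starting centroids are assumptions made for the sketch.

```python
import numpy as np


def k_means(points, k, n_iterations=10, seed=0):
    """Plain alternating k-means on an array of shape (n_points, n_features)."""
    rng = np.random.RandomState(seed)
    # Start from k distinct data points chosen at random.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iterations):
        # Step 1: assign every point to its closest centroid.
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 2: move each centroid to the mean of its assigned points.
        for j in range(k):
            if np.any(labels == j):  # keep the old centroid if a cluster is empty
                centroids[j] = points[labels == j].mean(axis=0)
    return labels, centroids


points = np.array([[1.0], [2.0], [3.0], [101.0], [102.0], [103.0]])
labels, centroids = k_means(points, k=2)
print(labels, centroids.ravel())
```

On this made-up data the two groups are far apart, so the centroids settle at the middle of each group after a couple of iterations.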
Primitive clustering e.g.
[Figure: the sorted input data 34, 43, 49, 58, 70, 81, 89, 101, 116, 121, 131, 145, annotated with the values 11, 6, 9, 12, 11, 8, 12, 15 and the thresholds <=11, <=12, <=15]
Clustering	applied	on	IRIS	data
• We	used	the	same	IRIS	data,	as	used	in	logistic	regression	demo,	however	
changed	two	things:
• Added	a	feature,	i.e. three	features for	clustering,	that’s	why	a	3D	plot	as	output
• Removed	the	output,	to	demonstrate	unsupervised	learning
Three	features,	thus	plotted	in	3D	plane
Let’s code… Try it!
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # registers the 3D projection on older matplotlib
from sklearn.cluster import KMeans
from sklearn import datasets

np.random.seed(5)
iris = datasets.load_iris()
X = iris.data  # no use of Y here
est = KMeans(n_clusters=8)  # the number of clusters is chosen beforehand; 8 is the default, try others too
est.fit(X)  # NOTICE: no Y here, "unsupervised", yay!
labels = est.labels_

fig = plt.figure(1, figsize=(4, 3))
ax = fig.add_subplot(111, projection='3d')
ax.view_init(elev=48, azim=134)
ax.scatter(X[:, 3], X[:, 0], X[:, 2], c=labels.astype(float))
ax.xaxis.set_ticklabels([])
ax.yaxis.set_ticklabels([])
ax.zaxis.set_ticklabels([])
ax.set_xlabel('Petal width')
ax.set_ylabel('Sepal length')
ax.set_zlabel('Petal length')
plt.show()
Part	1 Part	2
• Preparation/Plotting
• Important
• Debug
Lessons	learned!
• The dataset on which the model is run here is available and well formatted, which is not always the case
• Data acquisition and preparation come prior to feature extraction
• Extracting the interesting features, “numerifying” them (converting to numbers, if not already) and later normalizing them come prior to running the model
• Features or data columns can be categorical or inferential variables, or can cause a singularity problem; these affect the performance of the model and hence the residual cost
• Selection of the model, linear or logistic, and observing the cost to select appropriate features, can also be achieved using R. The gold standard for the p-value would be ~0.05; at that value the probability of incorrectly rejecting a true null hypothesis is at least 23% (and typically close to 50%)
• Cross-validation (CV) is done by running training and testing a few times on different splits and measuring the difference
• A confusion matrix also provides visibility into how many predictions are right and wrong
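As an illustration of the last point, a confusion matrix is just a table of counts of (actual, predicted) pairs. The sketch below builds one by hand; the function name and the spam labels are made up for the example.

```python
from collections import Counter


def confusion_matrix(actual, predicted, labels):
    """Rows are actual classes, columns are predicted classes."""
    counts = Counter(zip(actual, predicted))
    return [[counts[(a, p)] for p in labels] for a in labels]


# Made-up spam labels for illustration: 1 = spam, 0 = not spam.
actual = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0]

matrix = confusion_matrix(actual, predicted, labels=[0, 1])
correct = sum(matrix[i][i] for i in range(len(matrix)))
for row in matrix:
    print(row)
print("right:", correct, "wrong:", len(actual) - correct)
```

The diagonal holds the right predictions; everything off the diagonal is a wrong one, and the position tells which kind of mistake it was (spam let through, or good mail flagged).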
From	real-life	Machine	Learning
Lessons	learned!
• If the data is a time series, and there is missing data within the time window, then we can apply interpolation or extrapolation. Interpolation works well for archived data, whereas extrapolation suits live data
• Before applying any regression, we may have to cluster the data and then apply regression over it. This helps control outliers, if any, which may impact the model performance. Outliers are not always noise in the data
• Selection bias happens when we train the model on data that is not a true representation of the real occurrences. For instance, splitting housing prices that are sorted in ascending order, and training on the first part, would skip the higher-valued homes. To avoid this, the data should be shuffled to achieve an even distribution
• Curse of dimensionality: arises when challenged with too many features. To deal with it, carefully remove the non-significant features, including dependent, categorical or composite features, where applicable
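For the time-series point, a minimal sketch of filling gaps by linear interpolation with NumPy. The hourly readings are made-up values, with NaN marking the missing ones.

```python
import numpy as np

hours = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
readings = np.array([10.0, np.nan, 14.0, np.nan, 18.0])  # made-up values, NaN = missing

known = ~np.isnan(readings)
filled = readings.copy()
# Linear interpolation: each gap is filled from the known neighbours on both sides.
filled[~known] = np.interp(hours[~known], hours[known], readings[known])
print(filled)
```

This needs a known value on both sides of every gap, which is why it suits archived data; for live data the next value does not exist yet, so the gap has to be extrapolated from the past instead.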
…	Continued
References
• Stanford’s	CS229	by	Prof	Andrew	Y.	Ng	– Highly	recommended!	
• https://www.youtube.com/watch?v=UzxYlbK2c7E
• Scikit-Learn	tutorial
• http://scikit-learn.org/stable/
• http://scikit-learn.org/stable/install.html
• http://www.shogun-toolbox.org/page/features/
• http://daoudclarke.github.io/machine%20learning%20in%20practice/2013/10/08/machine-learning-libraries/
• http://www-bcf.usc.edu/~gareth/ISL/ – Highly Recommended!
• http://bigdataexaminer.com/uncategorized/how-to-run-linear-regression-in-python-scikit-learn/
• http://ipython-books.github.io/featured-04/
• http://stanford.edu/~cpiech/cs221/handouts/kmeans.html
• http://scikit-learn.org/stable/auto_examples/cluster/plot_cluster_iris.html
• http://blog.minitab.com/blog/adventures-in-statistics/how-to-correctly-interpret-p-values
Thank	you!
Talha Obaid
• linkedin.com/in/talhaobaid
• twitter.com/ObaidTal
• github.com/TalhaObaid
• talhaobaid@gmail.com

More Related Content

Similar to Machine Learning 101

ClickHouse Analytical DBMS. Introduction and usage, by Alexander Zaitsev
ClickHouse Analytical DBMS. Introduction and usage, by Alexander ZaitsevClickHouse Analytical DBMS. Introduction and usage, by Alexander Zaitsev
ClickHouse Analytical DBMS. Introduction and usage, by Alexander Zaitsev
Altinity Ltd
 
生命を理解する道具としての計算機  SCSN@UCLA
生命を理解する道具としての計算機  SCSN@UCLA生命を理解する道具としての計算機  SCSN@UCLA
生命を理解する道具としての計算機  SCSN@UCLA
Keiichiro Ono
 
KLA 2013 Mobile Technology
KLA 2013 Mobile TechnologyKLA 2013 Mobile Technology
KLA 2013 Mobile TechnologyJason Griffey
 
Dunham - Data Mining.pdf
Dunham - Data Mining.pdfDunham - Data Mining.pdf
Dunham - Data Mining.pdf
PRAJITBHADURI
 
Dunham - Data Mining.pdf
Dunham - Data Mining.pdfDunham - Data Mining.pdf
Dunham - Data Mining.pdf
ssuserf71896
 
Scanning Channel Islands Cyberspace
Scanning Channel Islands Cyberspace Scanning Channel Islands Cyberspace
Scanning Channel Islands Cyberspace
Paul Dutot IEng MIET MBCS CITP OSCP CSTM
 
Architecting IoT with Machine Learning
Architecting IoT with Machine LearningArchitecting IoT with Machine Learning
Architecting IoT with Machine Learning
Rudradeb Mitra
 
Data Summer Conf 2018, “Architecting IoT system with Machine Learning (ENG)” ...
Data Summer Conf 2018, “Architecting IoT system with Machine Learning (ENG)” ...Data Summer Conf 2018, “Architecting IoT system with Machine Learning (ENG)” ...
Data Summer Conf 2018, “Architecting IoT system with Machine Learning (ENG)” ...
Provectus
 
Curation Markets #ethbuenosaires
Curation Markets #ethbuenosairesCuration Markets #ethbuenosaires
Curation Markets #ethbuenosaires
Simon de la Rouviere
 
#msignite2019 #msignite19 #msignite November 2019 by Metricool
#msignite2019  #msignite19 #msignite  November 2019  by Metricool#msignite2019  #msignite19 #msignite  November 2019  by Metricool
#msignite2019 #msignite19 #msignite November 2019 by Metricool
VF Marketing Consultant
 
Singapore's IoT Technical Standard Update at IoT Asia 2018
Singapore's IoT Technical Standard Update at IoT Asia 2018Singapore's IoT Technical Standard Update at IoT Asia 2018
Singapore's IoT Technical Standard Update at IoT Asia 2018
Colin Koh (許国仁)
 
Manifest Data S-1 Speculative Sensation Lab Duke Digital Studio Presentation ...
Manifest Data S-1 Speculative Sensation Lab Duke Digital Studio Presentation ...Manifest Data S-1 Speculative Sensation Lab Duke Digital Studio Presentation ...
Manifest Data S-1 Speculative Sensation Lab Duke Digital Studio Presentation ...
Amanda Starling Gould
 
When Computers are Everywhere, Will we have superpowers.
When Computers are Everywhere, Will we have superpowers.When Computers are Everywhere, Will we have superpowers.
When Computers are Everywhere, Will we have superpowers.
Guy Bieber
 
AI is Coming! Are You Ready? The story of “Self-Driving Datacenter”
AI is Coming! Are You Ready? The story of “Self-Driving Datacenter”AI is Coming! Are You Ready? The story of “Self-Driving Datacenter”
AI is Coming! Are You Ready? The story of “Self-Driving Datacenter”
Sergey A. Razin
 
Using Amazon Machine Learning to Identify Trends in IoT Data - Technical 201
Using Amazon Machine Learning to Identify Trends in IoT Data - Technical 201Using Amazon Machine Learning to Identify Trends in IoT Data - Technical 201
Using Amazon Machine Learning to Identify Trends in IoT Data - Technical 201
Amazon Web Services
 
Using amazon machine learning to identify trends in io t data technical 201
Using amazon machine learning to identify trends in io t data   technical 201Using amazon machine learning to identify trends in io t data   technical 201
Using amazon machine learning to identify trends in io t data technical 201
Amazon Web Services
 
3D Modeling The World Around
3D Modeling The World Around 3D Modeling The World Around
3D Modeling The World Around
Victor Gramm
 
FinalPresentation-GradProject
FinalPresentation-GradProjectFinalPresentation-GradProject
FinalPresentation-GradProjectManabu Mukohyoshi
 
Leveraging IOT and Latest Technologies
Leveraging IOT and Latest TechnologiesLeveraging IOT and Latest Technologies
Leveraging IOT and Latest Technologies
Mithileysh Sathiyanarayanan
 
Neotys PAC - Todd De Capua
Neotys PAC - Todd De CapuaNeotys PAC - Todd De Capua
Neotys PAC - Todd De Capua
Neotys_Partner
 

Similar to Machine Learning 101 (20)

ClickHouse Analytical DBMS. Introduction and usage, by Alexander Zaitsev
ClickHouse Analytical DBMS. Introduction and usage, by Alexander ZaitsevClickHouse Analytical DBMS. Introduction and usage, by Alexander Zaitsev
ClickHouse Analytical DBMS. Introduction and usage, by Alexander Zaitsev
 
生命を理解する道具としての計算機  SCSN@UCLA
生命を理解する道具としての計算機  SCSN@UCLA生命を理解する道具としての計算機  SCSN@UCLA
生命を理解する道具としての計算機  SCSN@UCLA
 
KLA 2013 Mobile Technology
KLA 2013 Mobile TechnologyKLA 2013 Mobile Technology
KLA 2013 Mobile Technology
 
Dunham - Data Mining.pdf
Dunham - Data Mining.pdfDunham - Data Mining.pdf
Dunham - Data Mining.pdf
 
Dunham - Data Mining.pdf
Dunham - Data Mining.pdfDunham - Data Mining.pdf
Dunham - Data Mining.pdf
 
Scanning Channel Islands Cyberspace
Scanning Channel Islands Cyberspace Scanning Channel Islands Cyberspace
Scanning Channel Islands Cyberspace
 
Architecting IoT with Machine Learning
Architecting IoT with Machine LearningArchitecting IoT with Machine Learning
Architecting IoT with Machine Learning
 
Data Summer Conf 2018, “Architecting IoT system with Machine Learning (ENG)” ...
Data Summer Conf 2018, “Architecting IoT system with Machine Learning (ENG)” ...Data Summer Conf 2018, “Architecting IoT system with Machine Learning (ENG)” ...
Data Summer Conf 2018, “Architecting IoT system with Machine Learning (ENG)” ...
 
Curation Markets #ethbuenosaires
Curation Markets #ethbuenosairesCuration Markets #ethbuenosaires
Curation Markets #ethbuenosaires
 
#msignite2019 #msignite19 #msignite November 2019 by Metricool
#msignite2019  #msignite19 #msignite  November 2019  by Metricool#msignite2019  #msignite19 #msignite  November 2019  by Metricool
#msignite2019 #msignite19 #msignite November 2019 by Metricool
 
Singapore's IoT Technical Standard Update at IoT Asia 2018
Singapore's IoT Technical Standard Update at IoT Asia 2018Singapore's IoT Technical Standard Update at IoT Asia 2018
Singapore's IoT Technical Standard Update at IoT Asia 2018
 
Manifest Data S-1 Speculative Sensation Lab Duke Digital Studio Presentation ...
Manifest Data S-1 Speculative Sensation Lab Duke Digital Studio Presentation ...Manifest Data S-1 Speculative Sensation Lab Duke Digital Studio Presentation ...
Manifest Data S-1 Speculative Sensation Lab Duke Digital Studio Presentation ...
 
When Computers are Everywhere, Will we have superpowers.
When Computers are Everywhere, Will we have superpowers.When Computers are Everywhere, Will we have superpowers.
When Computers are Everywhere, Will we have superpowers.
 
AI is Coming! Are You Ready? The story of “Self-Driving Datacenter”
AI is Coming! Are You Ready? The story of “Self-Driving Datacenter”AI is Coming! Are You Ready? The story of “Self-Driving Datacenter”
AI is Coming! Are You Ready? The story of “Self-Driving Datacenter”
 
Using Amazon Machine Learning to Identify Trends in IoT Data - Technical 201
Using Amazon Machine Learning to Identify Trends in IoT Data - Technical 201Using Amazon Machine Learning to Identify Trends in IoT Data - Technical 201
Using Amazon Machine Learning to Identify Trends in IoT Data - Technical 201
 
Using amazon machine learning to identify trends in io t data technical 201
Using amazon machine learning to identify trends in io t data   technical 201Using amazon machine learning to identify trends in io t data   technical 201
Using amazon machine learning to identify trends in io t data technical 201
 
3D Modeling The World Around
3D Modeling The World Around 3D Modeling The World Around
3D Modeling The World Around
 
FinalPresentation-GradProject
FinalPresentation-GradProjectFinalPresentation-GradProject
FinalPresentation-GradProject
 
Leveraging IOT and Latest Technologies
Leveraging IOT and Latest TechnologiesLeveraging IOT and Latest Technologies
Leveraging IOT and Latest Technologies
 
Neotys PAC - Todd De Capua
Neotys PAC - Todd De CapuaNeotys PAC - Todd De Capua
Neotys PAC - Todd De Capua
 

Machine Learning 101

  • 2. About me • Email Security @ Symantec • Doing Data Science to fight Spam and Malware • Organizer for Python Data Science Group Singapore • Monthly regular meet-ups—over a year • http://meetup.com/pydata-sg >1.8K members • https://www.facebook.com/groups/pydatasg/ >1k members • https://twitter.com/pydatasg • https://engineers.sg/organizations/118 recorded and uploaded • Previously with CENSAM @ MIT • Co-founded startup(s) • NUS Alumni • Some questions • How many of you have heard about Machine Learning or ML? • How many of you know how to do ML? • How many of you earn a living doing ML? • What this talk offers • Getting a foot in the door • Grossly oversimplifying things • How to learn ML from literature • Relate to ML terms when thrown at you • Types of ML • Learning ML models and their coding (SciKit-learn and why?) • Linear Regression • Logistic Regression • Clustering • Lessons from Practical ML @ObaidTal
  • 3. Some terminology • Data Science • Data Analytics • Business Analytics • Artificial Intelligence • Machine Learning Ref. Tuan Q. Phan
  • 4. What is Data? • Available data (Internal) • Health record • Organization • University • … • Available data (external) • www.data.gov.sg • Publicly available corpuses • Quality of data • Trustworthy or not • Missing data • Huge challenge in scientific community • Other jargon • Tiny Data: Data from sensors • Big Data: Data on massive scale • Fast Data: Hash-based lookup @ObaidTal
  • 5. Machine Learning, defined • A field of study that gives computers the ability to learn without being explicitly programmed – Arthur Samuel (1959) • Samuel wrote a program to play checkers • Eventually his program learned to play better @ObaidTal Ref: http://infolab.stanford.edu/pub/voy/museum/samuel.html
  • 6. When did we all start with Machine Learning? • Take a look at the following (outputs) and guess the ?: • 1, 2, 3, 4, 5, 6, ?, …, ? • 2, 4, 6, 8, 10, 12, ?, …, ? • 3, 6, 9, 12, ?, …, ? • 1, 3, 9, 27, ?, …, ? • 4, 7, 10, 13, ?, …, ? • So how can I represent the above? • Input -> [box] -> Output • X -> [box] -> Y • Call this box f() • Output = f(Input) ... in maths • Y = f(X) • Answers • Y=X • Y=2*X • Y=3*X + 0 • Y=3^(X-1) • Y=3*X + 1 In school… Really, how? How to find ‘…,?’ – A: Equation (Single variable) @ObaidTal Assuming input is 1,2,3,4,5,6,…
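The idea of the "box" f() on this slide can be sketched in a few lines of Python. This is only an illustration: the helper name `first_terms` and the `rules` dictionary are made up for this sketch, and the inputs are assumed to be 1, 2, 3, … as the slide states. Under that assumption the fourth sequence (1, 3, 9, 27) corresponds to 3^(X-1).

```python
def first_terms(f, count=6):
    """Apply the rule f to the inputs 1, 2, ..., count."""
    return [f(x) for x in range(1, count + 1)]

# Each sequence from the slide written as Y = f(X), with X = 1, 2, 3, ...
rules = {
    "Y = X": lambda x: x,
    "Y = 2*X": lambda x: 2 * x,
    "Y = 3*X": lambda x: 3 * x,
    "Y = 3^(X-1)": lambda x: 3 ** (x - 1),
    "Y = 3*X + 1": lambda x: 3 * x + 1,
}

for name, rule in rules.items():
    print(name, "->", first_terms(rule))
```

Once the rule is known, any later "?" is just another call of the same function on a larger input.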
  • 7. Linear Regression – Statistical term Y=mx+b… from last example, b=? & m=? b = 1 m = 3 Y=mx+b Output Input Suppose this line is Y=3x+1 Assume that this line is ‘surrounded’ by ‘+’ shaped points, which we had, i.e. (outputs) 4, 7, 10, 13, ?, …, ? (Y) having inputs 1, 2, 3, 4, 5, 6, … (x) The line Y=3x+1 kind of ‘fits’ in these points as to find out ‘…, ?’
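To make "the line fits the points" concrete, here is a minimal sketch that recovers m and b from the slide's points with the textbook least-squares formulas for a single input. The function name `fit_line` is hypothetical, and this is plain Python rather than the scikit-learn code used later in the talk.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one input: returns (m, b) of Y = m*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance of x and y divided by variance of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx
    b = mean_y - m * mean_x
    return m, b

# Inputs 1..4 and the outputs 4, 7, 10, 13 from the previous slide.
m, b = fit_line([1, 2, 3, 4], [4, 7, 10, 13])
next_y = m * 5 + b  # the first '?' in the sequence
print(m, b, next_y)
```

Because these four points lie exactly on one line, the fit reproduces it; with noisy data the same formulas give the line that is closest in the least-squares sense.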
  • 9. Supervised Learning • The algorithm is given a dataset (input) together with the known output, and comes up with the answer, i.e. the model. • In the literature, the “Boston housing prices” example is a “Regression problem”, i.e. the outcome being predicted is a continuous-valued variable. • “Classification problem”: the variable being predicted is discrete, e.g. in the spam problem the output is either 0 or 1. • There can always be more than one feature (input variable), i.e. a graph with multiple dimensions. • Give the model the right answers, i.e. Y, train it with a number of input sets, i.e. x, and ask the algorithm or model to replicate the same. Types of Machine Learning Ref. Andrew Y. Ng. 1
  • 10. Unsupervised Learning Types of Machine Learning • Data is given, and structure must be inferred • Clustering is one example of it • Deep Learning is also considered here • Example is finding clusters in – Gene data – Image processing, grouping pixels together – Social network analysis – Lots of people talking, extracting the voice of single person, considering voices of others as noise – Cocktail party problem – Text processing • Independent component analysis ICA algorithm Ref. Andrew Y. Ng. 2
  • 11. Reinforcement Learning • A sequence of decisions is made over time • Example • Flying an autonomous helicopter • Reward function • Specify what you want to get done • Specify good behavior and bad behavior in the reward function • The learning algorithm then tries to maximize good behavior and minimize bad behavior Types of Machine Learning Ref. Andrew Y. Ng. 3
  • 12. Getting ready – some more terms… • Data set/Input is also called training set, observation • The predictor is called hypothesis for historical reasons, and it is called classifier, estimator, predictor • Boston housing price problem (we’ll see more of it) • We will train/learn and predict price • Features or input variable on right side of Y = mx+b, i.e. x • Price, i.e. Y, output or target variable of Y = mx+b • Linear equation, i.e. Y=mx+b can be written as predictor, where m is slope and b is intercept • Cost function – which Y=mx+b is better (we will see more of it) @ObaidTal Let’s get coding… 1 • To remember • Will expand on
  • 13. Popular Machine Learning Tool Kit – Introduction
      Project | Language | Highlight
      R | R | A language for statistical analysis and ML
      Octave | Octave | A language to simulate Matlab for numerical computations
      Scikit-learn | Python | Documentation, examples, tutorials available. General purpose with simple API
      Tensorflow | Py bindings | A library for numerical computation using data flow graphs
      Orange | Python | General Purpose ML Package
      PyBrain | Python | Neural networks, unsupervised learning
      MLlib | Python/Scala | Apache’s new library based within Spark
      Mahout | Java | Apache’s framework based on Hadoop
      Weka | Java | General Purpose ML Package
      GoLearn | Go | Machine Learning by Go
      shogun | C++ | User interfaces to various languages
  • 14. Machine Learning Kit – which to choose • Factors to consider • Language • Performance (run speed) • Scalability Ref. T. Obaid & H. Zhang • We choose Scikit learn • Language: Python • Performance (run speed): good enough • Scalability: not critical, and can switch to MLlib in Spark for mass data • Well documented, enough algorithms, clean API, robust, fast implementation, easy usage Scikit Learn – Machine Learning in Python • Simple and efficient tools for data mining and data analysis • Accessible to everybody, and reusable in various contexts • Built on NumPy, SciPy, and matplotlib • Open source, commercially usable – BSD license
  • 15. Scikit Learn – Examples • A lot of sample code is in the source folder: scikit-learn-0.16.1/examples • Boston housing prices (we will work with this example dataset) • Will try features one by one (test only 3 of them in this session, please try more) • Excerpt of data… (what our data actually looks like)
      CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | MEDV
      0.00632 | 18 | 2.31 | 0 | 0.538 | 6.575 | 65.2 | 4.09 | 1 | 296 | 15.3 | 396.9 | 4.98 | 24
      0.02731 | 0 | 7.07 | 0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2 | 242 | 17.8 | 396.9 | 9.14 | 21.6
      0.02729 | 0 | 7.07 | 0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2 | 242 | 17.8 | 392.83 | 4.03 | 34.7
      0.03237 | 0 | 2.18 | 0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3 | 222 | 18.7 | 394.63 | 2.94 | 33.4
      0.06905 | 0 | 2.18 | 0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3 | 222 | 18.7 | 396.9 | 5.33 | 36.2
    Details about each feature of this data are coming next… • To remember • Please explore …
  • 16. Features of Boston housing prices. Features and their details:
      1. CRIM: per capita crime rate by town
      2. ZN: proportion of residential land zoned for lots over 25,000 sq.ft.
      3. INDUS: proportion of non-retail business acres per town
      4. CHAS: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
      5. NOX: nitric oxides concentration (parts per 10 million)
      6. RM: average number of rooms per dwelling
      7. AGE: proportion of owner-occupied units built prior to 1940
      8. DIS: weighted distances to five Boston employment centers
      9. RAD: index of accessibility to radial highways
      10. TAX: full-value property-tax rate per $10,000
      11. PTRATIO: pupil-teacher ratio by town
      12. B: 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
      13. LSTAT: % lower status of the population
      14. MEDV: Median value of owner-occupied homes in $1000s
    Which of these features are significant: • All of them? • A few of them? • Another one, not in them? Let’s observe these…
  • 17. Scikit Learn – Demo code for Boston house price. Try it!
      import matplotlib.pyplot as plt  # for plotting
      import numpy as np  # for matrix/array operations
      from sklearn import datasets, linear_model  # datasets and estimators

      # load_boston() was removed in scikit-learn 1.2, so this demo needs an older release
      boston = datasets.load_boston()
      boston_X = boston.data[:, np.newaxis]
      # feature index: 12 is LSTAT, 10 is PTRATIO, 5 is RM; try each one in turn
      boston_X_temp = boston_X[:, :, 12]
      boston_X_train = boston_X_temp[:]
      boston_y_train = boston.target[:]

      regr = linear_model.LinearRegression()  # estimator
      regr.fit(boston_X_train, boston_y_train)  # train parameters

      fig, ax = plt.subplots()
      ax.scatter(boston_X_train, boston_y_train, color='black')  # the training points
      # the fitted line; we could equally predict on a separate boston_X_test
      ax.plot(boston_X_train, regr.predict(boston_X_train), color='green', linewidth=3)
      ax.set_xlabel(boston.feature_names[12])  # use the same feature index as above
      ax.set_ylabel('Predicted')
      plt.show()
    Ref. T. Obaid & H. Zhang • Important ... • Good Feature? • Not so Good Feature? • Comments
  • 18. Scikit Learn – Demo result for Boston house price • Parameters (Coefficients, -0.95692593 ) (intercept, 34.7411998746244) • Feature: • % lower status of the population • y=-0.95692593 *LSTAT + 34.7411998746244 • Looks good! 1st Try with LSTAT % lower status of the population
  • 19. Demo result Contd. • Parameters (Coefficients, -2.1571753) (intercept, 62.3446274748) • Feature: • pupil-teacher ratio by town • y=-2.1571753*PTRATIO + 62.3446274748 • Doesn’t look good! 2nd Try with PTRATIO pupil-teacher ratio by town
  • 20. Demo result Contd. • Parameters (Coefficients, 9.126359) (intercept, -34.7856369115583) • Feature: • average number of rooms per dwelling • y=9.126359*RM -34.7856369115583 • Looks good! 3rd Try with RM average number of rooms per dwelling
  • 21. Cost function – the lower the cost, the better the model. How well are we doing – Compare the Good ones • 1 Good Feature VS • Another Good Feature
      Predicted = -0.95692593 * LSTAT + 34.7411998746244
      Real | LSTAT | Predicted | Difference | Square
      ... | ... | ... | ... | ...
      18.3 | 14.1 | 21.24854426 | 2.948544262 | 8.693913263
      21.2 | 12.92 | 22.37771686 | 1.177716859 | 1.387017
      17.5 | 15.1 | 20.29161833 | 2.791618332 | 7.793132909
      16.8 | 14.33 | 21.0284513 | 4.228451298 | 17.87980038
      22.4 | 9.67 | 25.48772613 | 3.087726132 | 9.534052663
      20.6 | 9.08 | 26.05231243 | 5.45231243 | 29.72771084
      23.9 | 5.64 | 29.34413763 | 5.444137629 | 29.63863453
      22 | 6.48 | 28.54031985 | 6.540319848 | 42.77578372
      11.9 | 7.88 | 27.20062355 | 15.30062355 | 234.1090809
      Total: 19478.69458 • Total/2: 9739.347291
      Predicted = 9.126359 * RM - 34.7856369115583
      Real | RM | Predicted | Difference | Square
      ... | ... | ... | ... | ...
      18.3 | 5.794 | 18.09248713 | -0.207512866 | 0.043061589
      21.2 | 6.019 | 20.14591791 | -1.054082091 | 1.111089054
      17.5 | 5.569 | 16.03905636 | -1.460943641 | 2.134356321
      16.8 | 6.027 | 20.21892878 | 3.418928781 | 11.68907401
      22.4 | 6.593 | 25.38444798 | 2.984447975 | 8.906929718
      20.6 | 6.12 | 21.06768017 | 0.467680168 | 0.21872474
      23.9 | 6.976 | 28.87984347 | 4.979843472 | 24.79884101
      22 | 6.794 | 27.21884613 | 5.218846134 | 27.23635497
      11.9 | 6.03 | 20.24630786 | 8.346307858 | 69.66085487
      Total: 22062.73306 • Total/2: 11031.36653
    Least-squares cost function = $\frac{1}{2}\sum_{i=1}^{m}(\text{Predicted}_i - \text{Real}_i)^2$
    Comment: here the summation is nothing but a for loop: for (i = 1; i <= m; i++)
    Ref. Andrew Y. Ng.
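The cost calculation on this slide can be sketched as a plain loop. The rows below are only the nine (Real, LSTAT) pairs visible in the slide excerpt, so the result is not the full-dataset total shown on the slide; the names `predict_from_lstat` and `least_squares_cost` are made up for this sketch.

```python
# (real MEDV, LSTAT) pairs visible in the slide excerpt, not the full dataset.
rows = [
    (18.3, 14.1), (21.2, 12.92), (17.5, 15.1), (16.8, 14.33), (22.4, 9.67),
    (20.6, 9.08), (23.9, 5.64), (22.0, 6.48), (11.9, 7.88),
]

def predict_from_lstat(lstat):
    """The fitted line reported on the LSTAT demo slide."""
    return -0.95692593 * lstat + 34.7411998746244

def least_squares_cost(rows, predict):
    """Half the sum of squared differences between predicted and real values."""
    total = 0.0
    for real, x in rows:           # the summation is just a for loop
        difference = predict(x) - real
        total += difference ** 2
    return total / 2

print(least_squares_cost(rows, predict_from_lstat))
```

Running the same function with the RM line and the RM column gives the second cost, and the two numbers can then be compared feature against feature.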
  • 22. Over-fitting and Under-fitting The Good model is … the “Just right!” model – Why? • Under-fitting – high bias not matching and cost too high • Just right is what we need • Over-fitting – High variance happens mostly when too many features are used or the model is too complex • The model should learn, not memorize http://i.imgur.com/W0qejU0.png
  • 23. Scikit Learn – Usage. The model program skeleton would look something like…
      from sklearn import linear_model

      # sample values taken from the earlier Y=3x+1 example
      X = [[1], [2], [3], [4]]  # source data with shape (n_samples, n_features)
      Y = [4, 7, 10, 13]  # target values with shape (n_samples,)
      clf = linear_model.LinearRegression()  # estimator (here a regressor)
      clf = clf.fit(X, Y)  # learn parameters from existing data
      Test = [[5], [6]]  # same number of features as X
      clf.predict(Test)  # predict the target for data in Test
    Ref. T. Obaid & H. Zhang • Important 1. Model 2. Fit 3. Predict • Comments
  • 24. Observations from code • There is always a fit function call, i.e. learning/training on X to give Y. • Likewise there is a predict function call: given X only, it pops out Y. • The pandas library (usually imported as pd) can alternatively be used for a relatively simpler display of data • The train_test_split function call serves an important purpose, as it shuffles the dataset so we don’t have selection bias, i.e. if for instance the data is ordered by price ascending, and half is used for training and half for testing, then the training data may contain only the houses with lower prices. • To remember • Subtleties • Probable Issue
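The selection-bias point can be illustrated without any library. The sketch below is not scikit-learn's train_test_split; `shuffled_split` is a hypothetical helper that mimics the idea (shuffle first, then cut), and the prices are invented values sorted in ascending order.

```python
import random

def shuffled_split(rows, test_fraction=0.5, seed=0):
    """Shuffle a copy of rows, then split it into (train, test)."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)  # fixed seed so the split is repeatable
    n_test = int(len(rows) * test_fraction)
    return rows[n_test:], rows[:n_test]

prices = list(range(10, 110, 10))   # hypothetical prices, sorted ascending

naive_train = prices[:5]            # first half only: just the cheapest houses
train, test = shuffled_split(prices)

print("naive train:", naive_train)
print("shuffled train:", sorted(train))
```

The naive split never shows the model a price above 50, whereas the shuffled split draws the training half from the whole price range.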
  • 25. Scikit Learn – Test Data • Scikit-learn comes with a few standard datasets, for instance the iris and digits datasets for classification and the Boston house prices dataset for regression. • boston (Boston house prices), iris (iris flower), mlcomp (20 newsgroups), svmlight_file/s, diabetes, lfw_pairs (labeled faces), sample_image/s (china and flower), digits (0-9 handwriting), lfw_people (labeled people), linnerud (for multivariate regression) • scipy.misc.lena() • Load test data … Try others! from sklearn import datasets iris = datasets.load_iris() digits = datasets.load_digits() Subset of learning datasets – just saw Boston housing prices • … Seen so far • Ahead … • Please Explore …
  • 26. Scikit Learn – Main Algorithms • Supervised learning (most have both classifier and regressor) • Linear model: LinearRegression, Lasso, Ridge, LogisticRegression, SGD • SVM: LinearSVC, SVC, SVR • Naïve Bayes: GaussianNB, MultinomialNB, BernoulliNB • Decision Tree: DecisionTree (optimized version of CART) • Ensemble method: RandomForest, AdaBoost, GradientBoosting (GBDT) • Unsupervised learning • Clustering: KMeans (k-means++, mini-batch), DBSCAN • Manifold learning (dimension reduction): MDS, Isomap, LocallyLinearEmbedding. • Algorithm whole list: http://scikit-learn.org/stable/modules/classes.html Subset of supported algorithms – we just saw LinearRegression • … Seen so far • Ahead …
  • 27. Logistic (Classification) Regression • Regression is when our labels y can take any real (continuous) value. Examples include: • Predicting stock market. • Predicting sales. • Detecting the age of a person from a picture. • Classification is when our labels y can only take a finite set of values (categories). Examples include: • Handwritten digit recognition: x is an image with a handwritten digit, y is a digit between 0 and 9. • Spam filtering: x is an e-mail, and y is 0 or 1 depending on whether that e-mail is spam or not. Linear (Regression) vs Logistic (Classification)
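One way to see the difference between the two is that logistic regression passes a linear score through the logistic (sigmoid) function and then thresholds the result to get a 0/1 label. The sketch below shows only that step; the weight and bias values are arbitrary illustrations, not fitted parameters, and `classify` is a hypothetical helper.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real score to a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def classify(x, weight, bias, threshold=0.5):
    """Return (label, probability) for a single-feature input x."""
    probability = sigmoid(weight * x + bias)
    label = 1 if probability >= threshold else 0
    return label, probability

# Arbitrary illustrative parameters (not learned from data).
for x in (1.0, 2.0, 3.0):
    print(x, classify(x, weight=2.0, bias=-4.0))
```

The linear part, weight * x + bias, is the same Y=mx+b form as before; only the output is squeezed into a probability and cut into categories.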
  • 29. Logistic Regression – with IRIS example • Categorical output instead of continuous output • Will use IRIS dataset – to classify 3 species of plants • Number of Instances: 150 (50 in each of three classes) • Number of Attributes: 4 numeric, predictive attributes and the class • Attribute/Feature Information: • sepal length in cm (will use this) • sepal width in cm (will use this) • petal length in cm • petal width in cm • Classes i.e. Target: • Iris-Setosa • Iris-Versicolour • Iris-Virginica IRIS is a database of flower classes… bears a little bit of botany Setosa Versicolour Virginica • Petal is the colored part of the flower • Sepal is the green leaf below the petal
  • 30. Let’s go code… Try it!
      import pandas as pd
      import numpy as np
      import matplotlib.pyplot as plt
      from sklearn.datasets import load_iris
      from sklearn.linear_model import LogisticRegression

      iris = load_iris()
      print("--- Keys ---\n", iris.keys())
      print("--- Shape ---\n", iris.data.shape)
      print("--- Feature Names ---\n", iris.feature_names)
      print("--- Description ---\n", iris.DESCR)
      print("--- Target ---\n", iris.target)
      iri = pd.DataFrame(iris.data)
      print("--- Panda Head ---\n", iri.head())
      iri.columns = iris.feature_names
      print("--- Panda Columns ---\n", iri.head())

      logreg = LogisticRegression(C=1e5)
      X = iris.data[:, :2]  # we only take the first two features
      Y = iris.target
      print("--- X ---\n", X)
      print("--- y ---\n", Y)
      # fit the LogisticRegression classifier created above to the data
      logreg.fit(X, Y)  # again, the infamous fit method
    Part 1 Part 2 • Preparation • Important • Debug
  • 31. A little bit more… Try it!
      # Plotting
      h = .02  # step size in the mesh
      # Plot the decision boundary. For that, we will assign a color to each
      # point in the mesh [x_min, x_max] x [y_min, y_max].
      x_min, x_max = X[:, 0].min() - .5, X[:, 0].max() + .5
      y_min, y_max = X[:, 1].min() - .5, X[:, 1].max() + .5
      xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))

      # Prediction
      Z = logreg.predict(np.c_[xx.ravel(), yy.ravel()])

      # Put the result into a color plot
      Z = Z.reshape(xx.shape)
      plt.figure(1, figsize=(4, 3))
      plt.pcolormesh(xx, yy, Z, cmap=plt.cm.Paired)

      # Plot also the training points
      plt.scatter(X[:, 0], X[:, 1], c=Y, edgecolors='k', cmap=plt.cm.Paired)
      plt.xlabel('Sepal length')
      plt.ylabel('Sepal width')
      plt.xlim(xx.min(), xx.max())
      plt.ylim(yy.min(), yy.max())
      plt.xticks(())
      plt.yticks(())
      plt.show()
    Part 3 Part 4 • Plotting • Important • Debug
  • 33. Clustering • Unsupervised learning • Output unknown • Grouping observations K-Means • One of the most popular "clustering" algorithms. • Stores k centroids that it uses to define clusters. • A point belongs to a particular cluster if it is closer to that cluster's centroid than to any other centroid. • Find the best centroids by alternating between • assigning data points to clusters based on the current centroids • choosing centroids (points which are the center of a cluster) based on the current assignment of data points to clusters. Primitive clustering e.g. (input data sorted): 34 43 49 58 70 81 89 101 116 121 131 145 • <=11 <=12 <=15 • 11 6 9 12 11 8 12 15
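The two alternating steps can be sketched in plain Python on the sorted numbers from this slide. This is a one-dimensional illustration, not scikit-learn's KMeans: the function name `kmeans_1d` is made up, and the three starting centroids (first, middle-ish and last value) are an arbitrary assumption of this sketch.

```python
def kmeans_1d(points, initial_centroids, max_iter=100):
    """Alternate between assigning points and recomputing centroids."""
    centroids = [float(c) for c in initial_centroids]
    labels = []
    for _ in range(max_iter):
        # Step 1: assign every point to its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda j: abs(p - centroids[j]))
            for p in points
        ]
        # Step 2: move each centroid to the mean of its assigned points.
        new_centroids = []
        for j, old in enumerate(centroids):
            members = [p for p, lab in zip(points, labels) if lab == j]
            new_centroids.append(sum(members) / len(members) if members else old)
        if new_centroids == centroids:  # nothing moved: converged
            break
        centroids = new_centroids
    return centroids, labels

data = [34, 43, 49, 58, 70, 81, 89, 101, 116, 121, 131, 145]
centroids, labels = kmeans_1d(data, initial_centroids=[34, 89, 145])
print(centroids)
print(labels)
```

Note that no target values are passed in anywhere; the grouping comes only from the distances between the points themselves.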
  • 34. Clustering applied on IRIS data • We used the same IRIS data, as used in logistic regression demo, however changed two things: • Added a feature, i.e. three features for clustering, that’s why a 3D plot as output • Removed the output, to demonstrate unsupervised learning Three features, thus plotted in 3D plane
  • 35. Let’s code. Try it!
      import numpy as np
      import matplotlib.pyplot as plt
      from sklearn.cluster import KMeans
      from sklearn import datasets

      np.random.seed(5)
      iris = datasets.load_iris()
      X = iris.data  # Y is not used here
      # the number of clusters is chosen beforehand via n_clusters; the default is 8
      est = KMeans()
      est.fit(X)  # NOTICE!, no Y here, "Unsupervised", Yay!
      labels = est.labels_

      fig = plt.figure(1, figsize=(4, 3))
      plt.clf()
      # add_subplot(projection='3d') replaces the older Axes3D(fig, ...) call,
      # which no longer attaches the axes to the figure in recent Matplotlib
      ax = fig.add_subplot(projection='3d', elev=48, azim=134)
      # np.float was removed from NumPy; the built-in float does the same job
      ax.scatter(X[:, 3], X[:, 0], X[:, 2], c=labels.astype(float))
      ax.xaxis.set_ticklabels([])
      ax.yaxis.set_ticklabels([])
      ax.zaxis.set_ticklabels([])
      ax.set_xlabel('Petal width')
      ax.set_ylabel('Sepal length')
      ax.set_zlabel('Petal length')
      plt.show()
    Part 1 Part 2 • Preparation/Plotting • Important • Debug
  • 36. Lessons learned! • The dataset on which the model is executed here, is available and well-formatted, which is not the case always • Data acquisition and preparation come prior to feature extraction • Extracting the interesting features, “numerifying” (converting to numbers, if not already) and later normalizing them, comes prior to running model on it • Features or data columns can be categorical or inferential variables, or can cause singularity problem; these affect the performance of data model and hence residual cost • Selection of model, linear or logistic, and observing cost to select appropriate features, can also be achieved using R, i.e. a gold standard of p (Probability of incorrectly rejecting a true null hypothesis) would be ~ 0.05 (At least 23% (and typically close to 50%)) • Cross validation (CV) is done by running test and training a few times and measuring difference • Confusion matrix also provides visibility into how many predictions are right and wrong @ObaidTal From real-life Machine Learning
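As a small illustration of the confusion-matrix point, the sketch below counts how many predictions were right and wrong for a binary problem. The label lists are invented for the example and `confusion_counts` is a hypothetical helper, not a scikit-learn function.

```python
from collections import Counter

def confusion_counts(actual, predicted):
    """Count (actual, predicted) label pairs."""
    return Counter(zip(actual, predicted))

# Invented labels: 1 = spam, 0 = not spam.
actual = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0]

counts = confusion_counts(actual, predicted)
true_pos, true_neg = counts[(1, 1)], counts[(0, 0)]
false_neg, false_pos = counts[(1, 0)], counts[(0, 1)]
accuracy = (true_pos + true_neg) / len(actual)

print("         pred 0  pred 1")
print("actual 0   ", true_neg, "    ", false_pos)
print("actual 1   ", false_neg, "    ", true_pos)
print("accuracy:", accuracy)
```

The off-diagonal cells show which kind of mistake the model makes, which a single accuracy number hides.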
  • 37. Lessons learned! • If the data is a time series, and there is missing data within the time window, then we can apply interpolation or extrapolation. Interpolation works well for archived data, whereas extrapolation suits live data • Before applying any regression, we may have to cluster the data first and then apply regression over it. This helps control outliers, if any, which may impact the model performance. Outliers are not always noise in the data • Selection bias happens when we train the model on data which is not a true representation of the real occurrences. For instance, splitting housing prices that are sorted in ascending order, and training on the first part, would skip the higher-valued homes. To avoid it, the data should be shuffled to achieve an even distribution • The curse of dimensionality arises when we are challenged with too many features. To deal with it, carefully reduce the non-significant features, including the dependent, categorical or composite features, where applicable … Continued @ObaidTal
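The interpolation idea for archived time series can be sketched as follows. This is a minimal linear-interpolation helper with a made-up name (`interpolate_missing`); missing readings are assumed to be marked as None, and gaps at the very start or end are left alone because filling them would be extrapolation.

```python
def interpolate_missing(values):
    """Fill interior None gaps linearly between the nearest known neighbours."""
    filled = list(values)
    known = [i for i, v in enumerate(filled) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (filled[right] - filled[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = filled[left] + step * (i - left)
    return filled

# Invented readings with two missing points inside and one at the end.
readings = [10.0, None, None, 16.0, None]
print(interpolate_missing(readings))
```

Interior gaps can be filled because both neighbours are known; the trailing gap has no right-hand neighbour yet, which is the live-data situation where extrapolation would be needed instead.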
  • 38. References • Stanford’s CS229 by Prof Andrew Y. Ng – Highly recommended! • https://www.youtube.com/watch?v=UzxYlbK2c7E • Scikit-Learn tutorial • http://scikit-learn.org/stable/ • http://scikit-learn.org/stable/install.html • http://www.shogun-toolbox.org/page/features/ • http://daoudclarke.github.io/machine%20learning%20in%20practice/2013/10/08/machine-learning-libraries/
  • 39. References • http://www-bcf.usc.edu/~gareth/ISL/ – Highly Recommended! • http://bigdataexaminer.com/uncategorized/how-to-run-linear-regression-in-python-scikit-learn/ • http://ipython-books.github.io/featured-04/ • http://stanford.edu/~cpiech/cs221/handouts/kmeans.html • http://scikit-learn.org/stable/auto_examples/cluster/plot_cluster_iris.html • http://blog.minitab.com/blog/adventures-in-statistics/how-to-correctly-interpret-p-values Continued…
  • 40. Thank you! Talha Obaid • linkedin.com/in/talhaobaid • twitter.com/ObaidTal • github.com/TalhaObaid • talhaobaid@gmail.com