The data deluge we are currently witnessing presents both opportunities and challenges. Never before have so many aspects of our world been so thoroughly quantified, and never before has data been so plentiful. At the same time, the complexity of the analyses required to extract useful information from these piles of data is rapidly increasing, rendering more traditional and simpler approaches unfeasible or unable to provide new insights.
In this tutorial we provide a practical introduction to some of the most important machine learning algorithms that are relevant to the field of Complex Networks in general, with a particular emphasis on the analysis and modeling of empirical data. The goal is to provide the fundamental concepts necessary to make sense of the more sophisticated data analysis approaches that are currently appearing in the literature, and to provide a field guide to the advantages and disadvantages of each algorithm.
In particular, we will cover unsupervised learning algorithms such as K-means and Expectation-Maximization, as well as supervised ones such as Support Vector Machines, Neural Networks and Deep Learning. Participants are expected to have a basic understanding of calculus and linear algebra as well as working proficiency with the Python programming language.
The data deluge we are currently witnessing presents both opportunities and challenges. Never before have so many aspects of our world been so thoroughly quantified, and never before has data been so plentiful. One of the main beneficiaries of the current data glut has been data-intensive approaches such as Neural Networks in general and Deep Learning in particular, with many important new practical applications arising in the last couple of years.
In this tutorial we provide a practical introduction to the most important concepts of Neural Networks as we gradually implement simple neural networks from scratch to perform simple tasks such as character recognition. Our emphasis will be on understanding the architecture, training, advantages and pitfalls of practical neural networks.
Participants are expected to have a basic familiarity with calculus and linear algebra as well as working proficiency with the Python programming language.
Neural networks for word embeddings have received a lot of attention since a team of Google researchers published word2vec in 2013. They showed that the internal state (the embeddings) that the neural network learned by "reading" a large corpus of text preserved semantic relations between words.
As a result, this type of embedding started being studied in more detail and applied to more serious Natural Language Processing (NLP) and Information Retrieval (IR) tasks such as summarization, query expansion, etc.
In this talk we will cover the intuitions and algorithms underlying the word2vec family of algorithms. In the second half of the presentation we will quickly review the basics of TensorFlow and analyze in detail the TensorFlow reference implementation of word2vec.
From list sorting to network routing, and from hash tables to capacity planning, a programmer's daily work is filled with probability. We use probabilistic algorithms, data structures, and systems constantly, often without even thinking about it. Experienced engineers reach for probabilistic algorithms frequently and intentionally, especially when building systems of serious scale. How do probabilistic algorithms actually work in practice? And how do we know they will be safe and reliable in our critical production systems? We'll address those questions, explore a few algorithms, and see why "with high probability" is often better than "exactly".
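One of the probabilistic data structures a talk like this typically covers is the Bloom filter, which answers set-membership queries "with high probability": false positives are possible, false negatives are not. A minimal sketch (the parameters m and k, and deriving the k hash functions by salting a single strong hash, are illustrative choices, not necessarily the talk's):

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: k hash positions over a bit array of size m.
    Queries may return false positives, never false negatives."""
    def __init__(self, m=1024, k=3):
        self.m = m              # number of bits
        self.k = k              # number of hash functions
        self.bits = [False] * m

    def _positions(self, item):
        # Derive k positions by salting one strong hash with the index i
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = True

    def __contains__(self, item):
        # True only if every position is set; may be a false positive
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter()
for word in ["route", "hash", "sort"]:
    bf.add(word)
print("route" in bf)  # → True (no false negatives)
```

Sizing m and k against the expected number of items is what controls the false-positive rate in practice.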
Deep learning continues to push the state of the art in domains such as computer vision, natural language understanding and recommendation engines. One of the key reasons for this progress is the availability of highly flexible and developer-friendly deep learning frameworks. During this workshop, members of the Amazon Machine Learning team will provide a short background on Deep Learning, focusing on relevant application domains, and an introduction to using the powerful and scalable Deep Learning framework MXNet. At the end of this tutorial you’ll gain hands-on experience targeting a variety of applications, including computer vision and recommendation engines, as well as exposure to using preconfigured Deep Learning AMIs and CloudFormation Templates to help speed your development.
- POSTECH EECE695J, "Deep Learning Basics and Applications to Steel Manufacturing Processes", Week 5
- Contents: Restricted Boltzmann Machine (RBM), various activation functions, data preprocessing, regularization methods, training of a neural network
- Video: https://youtu.be/v4rGPl-8wdo
Photo-realistic Single Image Super-resolution using a Generative Adversarial ... (Hansol Kang)
* Ledig, Christian, et al. "Photo-realistic single image super-resolution using a generative adversarial network." Proceedings of the IEEE conference on computer vision and pattern recognition. 2017.
Modern Algorithms and Data Structures - 1. Bloom Filters, Merkle Trees (Lorenzo Alberton)
The first part of a series of talks about modern algorithms and data structures, used by NoSQL databases like HBase and Cassandra. An explanation of Bloom Filters and several derivatives, and Merkle Trees.
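As an illustration of the second structure, a Merkle tree commits to a list of blocks with a single root hash, so two replicas can compare roots to detect divergence. A minimal sketch (duplicating the last node on odd levels is one common convention; real systems such as HBase or Cassandra differ in the details):

```python
import hashlib

def sha256(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def merkle_root(leaves):
    """Hash the leaves, then repeatedly hash adjacent pairs
    until a single root node remains."""
    level = [sha256(leaf) for leaf in leaves]
    while len(level) > 1:
        if len(level) % 2 == 1:
            level.append(level[-1])  # duplicate last node on odd levels
        level = [sha256(level[i] + level[i + 1])
                 for i in range(0, len(level), 2)]
    return level[0].hex()

root = merkle_root([b"block-a", b"block-b", b"block-c"])
print(root)
```

Changing any single leaf changes the root, which is what makes anti-entropy comparisons between replicas cheap.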
Mastering the game of Go with deep neural networks and tree search (article o...) (Ilya Kuzovkin)
"This is the first time that a computer program has defeated a human professional player in the full-sized game of Go, a feat previously thought to be at least a decade away." In this presentation I go through the pipeline following the steps the algorithm does to understand the process.
Deep Convolutional GANs - meaning of latent space (Hansol Kang)
DCGAN not only applies a convolutional network to the GAN architecture, but also finds meaning in the latent space.
A review of the DCGAN paper and a PyTorch-based implementation.
A review of the issues raised in the VAE seminar.
my github : https://github.com/messy-snail/GAN_PyTorch
[References]
https://github.com/znxlwm/pytorch-MNIST-CelebA-GAN-DCGAN
https://github.com/taeoh-kim/Pytorch_DCGAN
Radford, Alec, Luke Metz, and Soumith Chintala. "Unsupervised representation learning with deep convolutional generative adversarial networks." arXiv preprint arXiv:1511.06434 (2015).
Big Data is a new term used in Business Analytics to identify datasets that we cannot manage with current methodologies or data mining software tools due to their large size and complexity. Big Data mining is the capability of extracting useful information from these large datasets or streams of data. New mining techniques are necessary due to the volume, variability, and velocity of such data.
In this talk, we will focus on advanced techniques in Big Data mining in real time using evolving data stream techniques: using a small amount of time and memory resources, and being able to adapt to changes. We will discuss a social network application of data stream mining to compute user influence probabilities. And finally, we will present the MOA software framework with classification, regression, and frequent pattern methods, and the SAMOA distributed streaming software that runs on top of Storm, Samza and S4.
Simple representations for learning: factorizations and similarities (Gael Varoquaux)
Real-life data seldom comes in the ideal form for statistical learning. This talk focuses on high-dimensional problems for signals and discrete entities: when dealing with many correlated signals or entities, it is useful to extract representations that capture these correlations.
Matrix factorization models provide simple but powerful representations. They are used for recommender systems across discrete entities such as users and products, or to learn good dictionaries to represent images. However, they entail large computing costs on very high-dimensional data: databases with many products or high-resolution images. I will present an algorithm to factorize huge matrices based on stochastic subsampling that gives up to 10-fold speed-ups [1].
With discrete entities, the explosion of dimensionality may be due to variations in how a smaller number of categories are represented. Such a problem of "dirty categories" is typical of uncurated data sources. I will discuss how encoding this data based on similarities recovers a useful category structure with no preprocessing. I will show how it interpolates between one-hot encoding and techniques used in character-level natural language processing.
[1] A. Mensch, J. Mairal, B. Thirion, G. Varoquaux. "Stochastic subsampling for factorizing huge matrices." IEEE Transactions on Signal Processing 66 (1), 113-128 (2018).
[2] P. Cerda, G. Varoquaux, B. Kégl. "Similarity encoding for learning with dirty categorical variables." Machine Learning (2018): 1-18.
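To make the "dirty categories" idea concrete, here is a toy sketch of similarity encoding: each string is represented by its similarity to a set of known categories, so misspelled or suffixed variants land close to the right category without preprocessing. The character 3-gram Jaccard similarity used here is an illustrative choice; [2] studies several similarity measures:

```python
def ngrams(s, n=3):
    s = f"  {s.lower()} "  # pad so short strings still yield n-grams
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def similarity_encode(value, categories, n=3):
    """Encode a (possibly dirty) string as its n-gram Jaccard
    similarity to each known category; an exact match gives 1.0."""
    g = ngrams(value, n)
    return [len(g & ngrams(cat, n)) / len(g | ngrams(cat, n))
            for cat in categories]

cats = ["police officer", "fire fighter", "teacher"]
vec = similarity_encode("police officer", cats)
dirty = similarity_encode("Police Officer II", cats)
print(vec, dirty)
```

Note how this interpolates between one-hot encoding (a clean value scores 1.0 against its own category) and character-level representations (a dirty variant still scores highest against the right category).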
Word embeddings have received a lot of attention since Tomas Mikolov and collaborators published word2vec in 2013 and showed that the embeddings that the neural network learned by “reading” a large corpus of text preserved semantic relations between words. As a result, this type of embedding started being studied in more detail and applied to more serious NLP and IR tasks such as summarization, query expansion, etc. More recently, researchers and practitioners alike have come to appreciate the power of this type of approach and have started a cottage industry of adapting Mikolov’s original approach to many different areas.
In this talk we will cover the implementation and mathematical details underlying tools like word2vec and some of the applications word embeddings have found in various areas. Starting from an intuitive overview of the main concepts and algorithms underlying the neural network architecture used in word2vec, we will proceed to discussing the implementation details of the word2vec reference implementation in TensorFlow. Finally, we will provide a bird's-eye view of the emerging field of “2vec" (dna2vec, node2vec, etc.) methods that use variations of the word2vec neural network architecture.
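As a taste of the mechanics discussed in the talk, the skip-gram variant of word2vec trains on (center, context) pairs drawn from a sliding window over the corpus. A minimal sketch of the pair generation (the window size and toy corpus are illustrative):

```python
def skipgram_pairs(tokens, window=2):
    """Generate (center, context) training pairs as in word2vec's
    skip-gram model: each word predicts its neighbors in the window."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

pairs = skipgram_pairs("the quick brown fox".split(), window=1)
print(pairs)
# → [('the', 'quick'), ('quick', 'the'), ('quick', 'brown'),
#    ('brown', 'quick'), ('brown', 'fox'), ('fox', 'brown')]
```

The network then learns embeddings by maximizing the probability of the context word given the center word over pairs like these.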
This (short) version of the Tutorial was presented at #AIWTB https://ai.withthebest.com/. See https://bmtgoncalves.github.io/word2vec-and-friends/ for further details on future (and longer) editions and sign up to http://tinyletter.com/dataforscience for related news and updates.
We start by very briefly introducing the Twitter platform and detailing the demographics of the users and the biases they introduce. The relationship between geography, mobility and social network properties will be described using the Twitter service as a case study. Finally, tutorial attendees will get the chance to review the most seminal works in the area where spatial and geographic perspectives are highlighted.
The increasing availability of huge amounts of data on every aspect of societal and individual behavior has created unprecedented opportunities, but it has also created new challenges. On one hand, it is now much easier to acquire detailed datasets on practically anything, while on the other it is much harder to make sure that our observations and conclusions are well justified and robust. In this short 3h tutorial we will briefly introduce some of the fundamental principles and algorithms of Machine Learning in an intuitive and practical way. Mathematical requirements will be kept to a minimum, and short snippets of Python code will be presented to illustrate the application of each algorithm.
https://telecombcn-dl.github.io/2018-dlai/
Deep learning technologies are at the core of the current revolution in artificial intelligence for multimedia data analysis. The convergence of large-scale annotated datasets and affordable GPU hardware has allowed the training of neural networks for data analysis tasks which were previously addressed with hand-crafted features. Architectures such as convolutional neural networks, recurrent neural networks or Q-nets for reinforcement learning have shaped a brand new scenario in signal processing. This course will cover the basic principles of deep learning from both algorithmic and computational perspectives.
This is the "MNIST 10-Class Classifiers" project presentation that was part of "Introduction to Machine Learning" course at UCSC Silicon Valley Extension, CA. This Classifier was implemented in Jupyter Notebook, using Python, SciPy, Pandas, NumPy, and Matplotlib.
The goals were:
* Find the number of dimensions to use after performing dimension reduction with PCA.
* Train different supervised and unsupervised classifiers with the MNIST training data set.
* Validate the trained classifier with the MNIST testing data set.
* Compare and contrast the behaviour of supervised and unsupervised algorithms with respect to MNIST data.
* Measure classifier performance and compute metrics.
* Learn from each other to analyze and implement ML algorithms.
* Present the general training and testing approach/design and results.
* Focus on improving accuracy of classifiers, as HW is cheap compared to when the Classifier algorithms were designed.
* Across team - share ideas and review design, implementation and results.
Our team was called Cortes Machine Learning Group.
Machine Learning Tutorial Part - 1 | Machine Learning Tutorial For Beginners ... | Simplilearn
This presentation on Machine Learning will help you understand why Machine Learning came into the picture, what Machine Learning is, the types of Machine Learning, and Machine Learning algorithms, with a detailed explanation of linear regression, decision trees and support vector machines. At the end you will also see a use case implementation where we classify whether a recipe is for a cupcake or a muffin using the SVM algorithm. Machine learning is a core sub-area of artificial intelligence; it enables computers to get into a mode of self-learning without being explicitly programmed. When exposed to new data, these computer programs are enabled to learn, grow, change, and develop by themselves. So, to put it simply, the iterative aspect of machine learning is the ability to adapt to new data independently. Now, let us get started with this Machine Learning presentation and understand what it is and why it matters.
Below topics are explained in this Machine Learning presentation:
1. Why Machine Learning?
2. What is Machine Learning?
3. Types of Machine Learning
4. Machine Learning Algorithms
- Linear Regression
- Decision Trees
- Support Vector Machine
5. Use case: Classify whether a recipe is of a cupcake or a muffin using SVM
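A minimal sketch of the use case in item 5 with scikit-learn's SVC. The two features (cups of flour and butter per batch) and all data values are invented for illustration; the actual tutorial's recipe features may differ:

```python
from sklearn import svm

# Hypothetical recipes: (flour cups, butter cups) per batch
# labels: 0 = cupcake (more butter), 1 = muffin (more flour)
X = [[1.0, 0.50], [1.1, 0.55], [0.9, 0.45],
     [2.0, 0.25], [2.2, 0.20], [1.9, 0.30]]
y = [0, 0, 0, 1, 1, 1]

# A linear kernel suffices for this linearly separable toy data
clf = svm.SVC(kernel="linear")
clf.fit(X, y)

pred = clf.predict([[2.1, 0.22], [1.0, 0.52]])
print(pred)  # → [1 0]: first recipe is a muffin, second a cupcake
```

The SVM finds the maximum-margin boundary between the two classes; new recipes are classified by which side of that boundary they fall on.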
About Simplilearn Machine Learning course:
A form of artificial intelligence, Machine Learning is revolutionizing the world of computing as well as all people’s digital interactions. Machine Learning powers such innovative automated technologies as recommendation engines, facial recognition, fraud protection and even self-driving cars. This Machine Learning course prepares engineers, data scientists and other professionals with knowledge and hands-on skills required for certification and job competency in Machine Learning.
Why learn Machine Learning?
Machine Learning is taking over the world, and with that, there is a growing need among companies for professionals who know the ins and outs of Machine Learning.
The Machine Learning market size is expected to grow from USD 1.03 Billion in 2016 to USD 8.81 Billion by 2022, at a Compound Annual Growth Rate (CAGR) of 44.1% during the forecast period.
We recommend this Machine Learning training course for the following professionals in particular:
1. Developers aspiring to be a data scientist or Machine Learning engineer
2. Information architects who want to gain expertise in Machine Learning algorithms
3. Analytics professionals who want to work in Machine Learning or artificial intelligence
4. Graduates looking to build a career in data science and Machine Learning
Learn more at: https://www.simplilearn.com/
This slide deck is my presentation for a reading circle on the "Machine Learning Professional Series".
Japanese version is here.
http://www.slideshare.net/matsukenbook/ss-50545587
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ... (Subhajit Sahu)
Abstract: Levelwise PageRank is an alternative method of PageRank computation which decomposes the input graph into a directed acyclic block-graph of strongly connected components, and processes them in topological order, one level at a time. This enables calculation of ranks in a distributed fashion without per-iteration communication, unlike the standard method where all vertices are processed in each iteration. It however comes with a precondition of the absence of dead ends in the input graph. Here, the native non-distributed performance of Levelwise PageRank was compared against Monolithic PageRank on a CPU as well as a GPU. To ensure a fair comparison, Monolithic PageRank was also performed on a graph where vertices were split by components. Results indicate that Levelwise PageRank is about as fast as Monolithic PageRank on the CPU, but quite a bit slower on the GPU. The slowdown on the GPU is likely caused by a large submission of small workloads, and is expected to be a non-issue when the computation is performed on massive graphs.
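For reference, the "Monolithic" baseline mentioned above is plain PageRank by power iteration over all vertices. A minimal sketch, with dead-end rank redistributed uniformly across vertices (one common way to handle the dead-end issue; the report's own strategy differs):

```python
def pagerank(graph, d=0.85, tol=1e-10, max_iter=200):
    """Standard (monolithic) PageRank by power iteration.
    `graph` maps each vertex to its list of out-neighbors.
    Dead ends (no out-links) spread their rank uniformly."""
    nodes = list(graph)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(max_iter):
        # Rank held by dead ends, to be redistributed uniformly
        leaked = sum(rank[v] for v in nodes if not graph[v])
        new = {v: (1 - d) / n + d * leaked / n for v in nodes}
        for v in nodes:
            for u in graph[v]:
                new[u] += d * rank[v] / len(graph[v])
        delta = sum(abs(new[v] - rank[v]) for v in nodes)
        rank = new
        if delta < tol:
            break
    return rank

g = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": []}  # "d" is a dead end
r = pagerank(g)
print(sum(r.values()))  # ranks always sum to (approximately) 1
```

Every vertex is touched in every iteration, which is exactly the per-iteration cost that the levelwise, component-by-component scheme tries to avoid.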
As Europe's leading economic powerhouse and the fourth-largest economy globally, Germany stands at the forefront of innovation and industrial might. Renowned for its precision engineering and high-tech sectors, Germany's economic structure is heavily supported by a robust service industry, accounting for approximately 68% of its GDP. This economic clout and strategic geopolitical stance position Germany as a focal point in the global cyber threat landscape.
In the face of escalating global tensions, particularly those emanating from geopolitical disputes with nations like Russia and China, Germany has witnessed a significant uptick in targeted cyber operations. Our analysis indicates a marked increase in the sophistication of cyberattacks aimed at critical infrastructure and key industrial sectors. These attacks range from ransomware campaigns to Advanced Persistent Threats (APTs), threatening national security and business integrity.
🔑 Key findings include:
🔍 Increased frequency and complexity of cyber threats.
🔍 Escalation of state-sponsored and criminally motivated cyber operations.
🔍 Active dark web exchanges of malicious tools and tactics.
Our comprehensive report delves into these challenges, using a blend of open-source and proprietary data collection techniques. By monitoring activity on critical networks and analyzing attack patterns, our team provides a detailed overview of the threats facing German entities.
This report aims to equip stakeholders across public and private sectors with the knowledge to enhance their defensive strategies, reduce exposure to cyber risks, and reinforce Germany's resilience against cyber threats.
4. www.bgoncalves.com @bgoncalves
[Slide shows the first page of the "Expert Opinion" article: Alon Halevy, Peter Norvig, and Fernando Pereira (Google), "The Unreasonable Effectiveness of Data", IEEE Intelligent Systems, 2009. The article opens from Eugene Wigner's "The Unreasonable Effectiveness of Mathematics in the Natural Sciences" and argues that a large corpus could serve as the basis of a complete model for certain tasks, if only we knew how to extract the model from the data.]
http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/35179.pdf
7. From Data To Information
Statistics are numbers that summarize raw facts and figures in some meaningful way. They present key ideas that may not be immediately apparent by just looking at the raw data, and by data, we mean facts or figures from which we can draw conclusions. As an example, you don't have to wade through lots of football scores when all you want to know is the league position of your favorite team. You need a statistic to quickly give you the information you need.
The study of statistics covers where statistics come from, how to calculate them, and how you can use them effectively.
[Flow diagram: Gather data -> Analyze -> Draw conclusions]
- Gather data: at the root of statistics is data. Data can be gathered by looking through existing sources, conducting experiments, or conducting surveys.
- Analyze: once you have data, you can analyze it and generate statistics. You can calculate probabilities to see how likely certain events are, test ideas, and indicate how confident you are about your results.
- Draw conclusions: when you've analyzed your data, you make decisions and predictions.
11. Central Limit Theorem
- As $n \to \infty$, the random variables:
  $S_n = \frac{1}{n} \sum_i x_i$
- with:
  mean $\mu$ and variance $\sigma^2$
- converge to a normal distribution:
  $\sqrt{n}\,(S_n - \mu) \to \mathcal{N}(0, \sigma^2)$
- after some manipulations, we find:
  $S_n \sim \mu + \frac{\mathcal{N}(0, \sigma^2)}{\sqrt{n}}$
The estimation of the mean converges to the true mean with the square root of the number of samples:
  $\mathrm{SE} = \frac{\sigma}{\sqrt{n}}$
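A quick numerical check of the square-root scaling above (the sample sizes, distribution, and seed are arbitrary choices for illustration): quadrupling the sample size should halve the standard error of the mean.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 2.0
trials = 2000

def std_of_means(n):
    # Draw `trials` independent samples of size n and
    # measure the spread of their sample means
    samples = rng.normal(loc=5.0, scale=sigma, size=(trials, n))
    return samples.mean(axis=1).std()

se_100 = std_of_means(100)   # ~ sigma / sqrt(100) = 0.2
se_400 = std_of_means(400)   # ~ sigma / sqrt(400) = 0.1
print(se_100, se_400, se_100 / se_400)  # ratio ~ 2
```

The same scaling holds for non-Gaussian samples, which is the whole point of the theorem.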
13. Experimental Measurements
- Experimental errors are commonly assumed to be Gaussian distributed
- Many experimental measurements are actually averages:
  - Instruments have a finite response time and the quantity of interest varies quickly over time
  - Stochastic environmental factors
  - Etc.
14. MLE - Fitting a theoretical function to experimental data (Least Squares Fitting)
- In an experimental measurement, we expect (CLT) the experimental values to be normally distributed around the theoretical value with a certain variance. Mathematically, this means:
  $p(y) \approx \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left[-\frac{(y - f(x))^2}{2\sigma^2}\right]$
- where $y$ are the experimental values and $f(x)$ the theoretical ones. The log-likelihood is then:
  $\mathcal{L} = -\frac{N}{2} \log\left(2\pi\sigma^2\right) - \sum_i \frac{(y_i - f(x_i))^2}{2\sigma^2}$
- where we see that to maximize the likelihood we must minimize the sum of squares.
15. MLE - Linear Regression
- Let's say we want to fit a straight line to a set of points:
  $y = w \cdot x + b$
- The log-likelihood function then becomes:
  $\mathcal{L} = -\frac{N}{2} \log\left(2\pi\sigma^2\right) - \sum_i \frac{(y_i - w \cdot x_i - b)^2}{2\sigma^2}$
- With partial derivatives:
  $\frac{\partial \mathcal{L}}{\partial w} \propto \sum_i \left[2 x_i \left(y_i - w \cdot x_i - b\right)\right]$
  $\frac{\partial \mathcal{L}}{\partial b} \propto \sum_i \left[\left(y_i - w \cdot x_i - b\right)\right]$
- Setting to zero and solving for $\hat{w}$ and $\hat{b}$:
  $\hat{w} = \frac{\sum_i (x_i - \langle x \rangle)(y_i - \langle y \rangle)}{\sum_i (x_i - \langle x \rangle)^2}$
  $\hat{b} = \langle y \rangle - \hat{w} \langle x \rangle$
16. MLE for Linear Regression (MLElinear.py)

from __future__ import print_function
import sys
import numpy as np
from scipy import optimize

data = np.loadtxt(sys.argv[1])
x = data.T[0]
y = data.T[1]

# Closed-form MLE solution
meanx = np.mean(x)
meany = np.mean(y)
w = np.sum((x - meanx) * (y - meany)) / np.sum((x - meanx) ** 2)
b = meany - w * meanx
print(w, b)

# We can also optimize the likelihood expression directly
def likelihood(params):
    sigma = 1.0
    w, b = params
    return np.sum((y - w * x - b) ** 2) / (2 * sigma ** 2)

w, b = optimize.fmin_bfgs(likelihood, [1.0, 1.0])
print(w, b)
20. 3 Types of Machine Learning
- Unsupervised Learning
  - Autonomously learn a good representation of the dataset
  - Find clusters in the input
- Supervised Learning
  - Predict output given input
  - A training set of known inputs and outputs is provided
- Reinforcement Learning
  - Learn a sequence of actions to maximize payoff
  - Discount factor for delayed rewards
23. K-Means
- Choose k random points to be the initial centroids of the clusters
- Assign each point to the cluster whose centroid is closest
- Recompute the centroid positions (mean position of the cluster's points)
- Repeat until convergence
K-Means: Convergence
• How to quantify the "quality" of the solution found at each iteration n?
• Measure the "Inertia", the squared intra-cluster distance:

I_n = \sum_{i=0}^{N} \|x_i - \mu_i\|^2

where \mu_i are the coordinates of the centroid of the cluster to which x_i is assigned.
• Smaller values are better
• Can stop when the relative variation is smaller than some value:

\frac{|I_{n+1} - I_n|}{I_n} < tol
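The steps and stopping criterion above can be sketched in a few lines of numpy (a minimal illustration, not a production implementation; the initialization scheme and tolerance value here are arbitrary choices):

```python
import numpy as np

def kmeans(X, k, tol=1e-6, max_iter=100, seed=42):
    rng = np.random.RandomState(seed)
    # Choose k points at random to be the initial centroids
    centroids = X[rng.choice(len(X), k, replace=False)]
    inertia = np.inf
    for _ in range(max_iter):
        # Assign each point to the cluster whose centroid is closest
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute the centroid positions (mean cluster position)
        centroids = np.array([X[labels == c].mean(axis=0) for c in range(k)])
        # Stop when the relative variation of the inertia I_n is below tol
        new_inertia = np.sum((X - centroids[labels]) ** 2)
        if abs(inertia - new_inertia) / new_inertia < tol:
            break
        inertia = new_inertia
    return labels, centroids, new_inertia

rng = np.random.RandomState(0)
X = np.vstack([rng.randn(20, 2), rng.randn(20, 2) + 50])  # two compact blobs
labels, centroids, inertia = kmeans(X, 2)
```

For real work, `sklearn.cluster.KMeans` implements the same loop with a smarter initialization (k-means++).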
Silhouettes
• For each point x_i define a_c(x_i) as:

a_c(x_i) = \frac{1}{N_c} \sum_{j \in c} \|x_i - x_j\|

the average distance between point x_i and every other point within cluster c.
• Let b(x_i) be:

b(x_i) = \min_{c \neq c_i} a_c(x_i)

the minimum value of a_c(x_i) excluding the point's own cluster c_i.
• The silhouette of x_i is then:

s(x_i) = \frac{b(x_i) - a_{c_i}(x_i)}{\max\left\{b(x_i),\, a_{c_i}(x_i)\right\}}
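These definitions are implemented in scikit-learn; a quick sanity check on two well-separated blobs (the synthetic dataset here is made up for illustration, and values near 1 indicate well-separated clusters):

```python
import numpy as np
from sklearn.metrics import silhouette_samples, silhouette_score

rng = np.random.RandomState(0)
X = np.vstack([rng.randn(30, 2), rng.randn(30, 2) + 10])  # two well-separated blobs
labels = np.array([0] * 30 + [1] * 30)

s = silhouette_samples(X, labels)      # s(x_i) for every point
mean_s = silhouette_score(X, labels)   # average silhouette over all points
print(mean_s)
```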
Principal Component Analysis
• The Principal Component projection, T, of a matrix A is defined as:

T = AW

• where W is the eigenvector matrix of A^T A
• W also corresponds to the right singular vectors of A obtained by Singular Value
Decomposition (SVD), a generalization of the eigenvalue/eigenvector decomposition
for non-square matrices:

A = U \Sigma W^T

• So we can write:

T = AW = U \Sigma W^T W = U \Sigma

• Showing that the Principal Component projection corresponds to the left
singular vectors of A scaled by the respective singular values \Sigma
• Columns of T are ordered in order of decreasing variance.
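These identities are easy to verify numerically (a numpy sketch; the random matrix is just an example):

```python
import numpy as np

rng = np.random.RandomState(0)
A = rng.randn(100, 3)
A = A - A.mean(axis=0)  # PCA assumes centered data

U, s, Wt = np.linalg.svd(A, full_matrices=False)  # A = U Sigma W^T
W = Wt.T  # right singular vectors = eigenvectors of A^T A

T_proj = A @ W  # T = A W
T_svd = U * s   # T = U Sigma (each column of U scaled by its singular value)
print(np.allclose(T_proj, T_svd))  # True
```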
Principal Component Analysis
from __future__ import print_function
import sys
from sklearn.decomposition import PCA
import numpy as np
import matplotlib.pyplot as plt

data = np.loadtxt(sys.argv[1])
x = data.T[0]
y = data.T[1]

pca = PCA()
pca.fit(data)

meanX = np.mean(x)
meanY = np.mean(y)

plt.style.use('ggplot')
plt.plot(x, y, 'r*')
plt.plot([meanX, meanX+pca.components_[0][0]*pca.explained_variance_[0]],
         [meanY, meanY+pca.components_[0][1]*pca.explained_variance_[0]], 'b-')
plt.plot([meanX, meanX+pca.components_[1][0]*pca.explained_variance_[1]],
         [meanY, meanY+pca.components_[1][1]*pca.explained_variance_[1]], 'g-')
plt.title('PCA Visualization')
plt.legend(['data', 'PCA1', 'PCA2'], loc=2)
plt.savefig('PCA.png')
PCA.py
Supervised Learning - Classification
[Illustration: dataset as a matrix of N samples (rows) by M features (columns), with a Label column]
• Dataset formatted as an N×M matrix of N samples and M
features
• Each sample belongs to a specific class or has a specific label.
• The goal of classification is to predict the class to which a
previously unseen sample belongs by learning the defining
regularities of each class
• Common algorithms:
• K-Nearest Neighbors
• Support Vector Machines
• Neural Networks
• Two fundamental types of problems:
• Classification
• Regression (see Linear Regression above)
Supervised Learning - Overfitting
[Illustration: the N×M data matrix split row-wise into a Training subset and a Testing subset]
• “Learning the noise”
• “Memorization” instead of “generalization”
• How can we prevent it?
• Split the dataset into two subsets: Training and Testing
• Train the model using only the Training dataset and evaluate
the results on the previously unseen Testing dataset.
• Different heuristics on how to split:
• Single split
• k-fold cross validation: split the dataset into k parts, train on k-1
of them and evaluate on the remaining one; repeat k times and average the results.
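A minimal k-fold sketch with scikit-learn (the classifier and the synthetic data are arbitrary stand-ins; any model with fit/score works the same way):

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.RandomState(0)
X = np.vstack([rng.randn(50, 2), rng.randn(50, 2) + 5])
y = np.array([0] * 50 + [1] * 50)

scores = []
for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    model = KNeighborsClassifier(n_neighbors=3)
    model.fit(X[train_idx], y[train_idx])                  # train on k-1 parts
    scores.append(model.score(X[test_idx], y[test_idx]))   # evaluate on the held-out part
mean_score = np.mean(scores)
print(mean_score)
```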
K-nearest neighbors
• Perhaps the simplest of supervised learning algorithms
• Effectively memorizes all previously seen data
• Intuitively takes advantage of natural data clustering
• Defines the class of any datapoint to be the plurality class among its k
nearest neighbors
• It’s not obvious how to find the right value of k
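The whole algorithm fits in a few lines (a numpy sketch; the toy points are made up for illustration):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    # "Memorize" the training data; classify x by the plurality
    # vote of its k nearest neighbors
    dists = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(dists)[:k]
    return Counter(y_train[nearest]).most_common(1)[0][0]

X_train = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.9]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([0.2, 0.1])))  # → 0
```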
How the Brain “Works” (Cartoon version)
• Each neuron receives input from other neurons
• ~10^11 neurons, each with ~10^4 weights
• Weights can be positive or negative
• Weights adapt during the learning process
• “neurons that fire together wire together” (Hebb)
• Different areas perform different functions using same structure (Modularity)
Forward Propagation
• The output of a perceptron is determined by a sequence of steps:
• obtain the inputs
• multiply the inputs by the respective weights
• calculate output using the activation function
• To create a multi-layer perceptron, you can simply use the output of
one layer as the input to the next one.
• But how can we propagate back the errors and update the weights?
[Diagram: inputs x_1 … x_N and a constant 1 (bias) are multiplied by weights w_0j … w_Nj to produce the activation a_j from w^T x; the layer outputs a_1 … a_N, again with a constant 1, feed the next layer's weights w_0k … w_Nk to produce a_k from w^T a]
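The steps above can be sketched in numpy (a minimal forward pass, assuming a sigmoid activation; the layer sizes are arbitrary, and the constant 1 input carries the bias weight w_0):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weight_matrices):
    # Use the output of each layer as the input to the next one
    a = x
    for W in weight_matrices:
        a = np.append(a, 1.0)  # constant 1 input for the bias weight w_0
        a = sigmoid(W @ a)     # multiply inputs by weights, apply activation
    return a

rng = np.random.RandomState(0)
weights = [rng.randn(4, 4), rng.randn(2, 5)]  # 3 inputs -> 4 hidden -> 2 outputs
out = forward(np.array([0.5, -1.0, 2.0]), weights)
print(out.shape)  # (2,)
```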
Backward Propagation of Errors (BackProp)
• BackProp operates in two phases:
• Forward propagate the inputs and calculate the deltas
• Update the weights
• The error at the output is the squared difference between predicted
output and the observed one:

E = (t - y)^2

• Where t is the real output and y is the predicted one.
• For inner layers there is no “real output”!
Chain-rule
• From the forward propagation described above, we know that the
output of a neuron is:

y_j = w^T x

• But how can we calculate how to modify the weights w_{ij}?
• We take the derivative of the error with respect to the weights!
• Using the chain rule:

\frac{\partial E}{\partial w_{ij}} = \frac{\partial E}{\partial y_j} \frac{\partial y_j}{\partial w_{ij}}

• And finally we can update each weight in the previous layer:

w_{ij} \leftarrow w_{ij} - \alpha \frac{\partial E}{\partial w_{ij}}

• where \alpha is the learning rate
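For a single linear neuron the update rule above is easy to run end to end (a sketch with made-up noiseless data and an arbitrary learning rate):

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.randn(200, 3)
true_w = np.array([1.0, -2.0, 0.5])
t = X @ true_w  # noiseless targets from a known weight vector

w = np.zeros(3)
alpha = 0.01  # learning rate
for _ in range(500):
    for x_i, t_i in zip(X, t):
        y = w @ x_i                    # forward pass: y = w^T x
        grad = -2.0 * (t_i - y) * x_i  # dE/dw for E = (t - y)^2, via the chain rule
        w = w - alpha * grad           # w <- w - alpha * dE/dw
print(w)  # should approach true_w
```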
Backprop
• Backpropagation solved the fundamental problem underlying the training of neural networks
• Unfortunately, computers were still too slow for large networks to be trained successfully
• Also, in many cases, there wasn’t enough available data
Support Vector Machines
• The decision plane has the form:

w^T x = 0

• We want w^T x \geq b for points in the “positive” class and w^T x \leq -b for points in the negative
class, where the “margin” 2b is as large as possible.
• Normalize such that b = 1 and solve the optimization problem:

\min_w \|w\|^2

subject to:

y_i \left(w^T x_i\right) \geq 1

• The margin is:

\frac{2}{\|w\|}
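We can check the margin formula with scikit-learn's SVC (the four points are made up; a very large C approximates the hard-margin problem above):

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[0.0, 0.0], [0.0, 1.0], [4.0, 0.0], [4.0, 1.0]])
y = np.array([-1, -1, 1, 1])

clf = SVC(kernel='linear', C=1e6)  # large C ~ hard margin
clf.fit(X, y)

w = clf.coef_[0]
margin = 2.0 / np.linalg.norm(w)  # the two classes are 4 apart along x
print(margin)
```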
Kernel “trick”
• The SVM procedure uses only dot products between vectors, never the vectors themselves.
• We can therefore redefine the dot product in any way we wish, replacing it with a kernel function.
• In effect we are implicitly mapping from a non-linearly-separable input space to a feature space where the problem becomes linearly separable.
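A quick illustration (synthetic data): a blob surrounded by a ring cannot be separated by any straight line, but with an RBF kernel in place of the plain dot product the same SVM handles it:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
theta = rng.uniform(0, 2 * np.pi, 100)
inner = rng.randn(100, 2) * 0.3                          # blob at the origin
outer = np.c_[np.cos(theta), np.sin(theta)] * 3          # ring of radius 3
X = np.vstack([inner, outer + rng.randn(100, 2) * 0.1])
y = np.array([0] * 100 + [1] * 100)

linear_acc = SVC(kernel='linear').fit(X, y).score(X, y)
rbf_acc = SVC(kernel='rbf').fit(X, y).score(X, y)
print(linear_acc, rbf_acc)  # the RBF kernel should be near perfect
```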