7. Face recognition with traditional AI
Rules (If-else):
Did you handle scale?
Did you handle colors?
Did you handle pose?
Did you handle clutter?
Too many cases!! Let's learn the rule!
10. ML: How to learn w's?
Assume only 2 q's:
Basically → Search: a·x1 + b·x2 + c (Model),
w1 = a, w2 = b, w0 = c (bias, to be clear later)
Try many lines, and get the one that separates the Data better → How good? Loss
Practically → Smarter methods (Optimizer) are used, better than brute force or random search!
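To restate the slide's notation cleanly: the 2-feature linear model and one possible loss (the squared-error choice below is an assumption; the slide only says "Loss"):

```latex
% Linear model over the two questions (features) x1, x2
\hat{y} = f_w(x) = w_1 x_1 + w_2 x_2 + w_0, \qquad w_1 = a,\; w_2 = b,\; w_0 = c
% One common loss choice (assumed here): mean squared error over M examples
L(w) = \frac{1}{M}\sum_{i=1}^{M}\left(y^{(i)} - \hat{y}^{(i)}\right)^2
```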
11. How to decide on questions?
We call the questions vector $q^i = [q_1, q_2, \dots, q_N]$ the features vector.
In the case of structured data like above, we normally choose them based on experience.
In the case of unstructured data like images, the input vector is just the pixel values. So the question is:
what are the features?
The answer to this question defines whether we do ML or DL.
In general, if we feed raw pixels, we do DL; if we define specific features, we do ML.
This will get clearer.
12. Semantic gap - What can the computer see? Traditional CV pipeline with OpenCV
[Figure: Model mapping input X to output y]
18. How to close the Gap? The deeper you go, the more you close the gap from raw (X) to semantics (y)
[Figure: Model mapping X to y; how to obtain the model?]
The model is usually hierarchical = layered
With the fewest layers, the model is just input transformations
More "refinements" means more depth
The deeper you go, the more the gap closes: X → y
23. Parametric vs Non-parametric
• Parametric: assume a family of functions (= distribution) from which the model is searched: $\hat{y} = f_\theta(X)$, read as a mapping function (model) of input X to output y, parametrized by theta.
• Non-parametric: we don't assume any function family. Usually heuristics. Examples: K-NN and Naive Bayes.
25. What is AI best at today?
[Figure: under ML, Supervised Learning splits into Classification (e.g., ID, classes A vs. B) and Regression (e.g., Price)]
26. Regression
Car Price    150K  200K  250K  ?     350K
Horse Power  72    90    110   130   180

Job Salary   10K   20K   ?     35K   50K
Domain       X     Y     X     Z     Z
Location     CAI   NY    PAR   NY    CAI
Grade        B     A     A     B     A

Pure DB query = fail
Need to interpolate
An output could depend on many factors = features
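To make the "need to interpolate" point concrete, here is a minimal sketch that fills in the missing car price from the toy table above (the use of scikit-learn's LinearRegression is an assumption; any regressor would do):

```python
# Minimal sketch: interpolate the missing car price from horsepower.
# Values come from the toy table above; assumes scikit-learn is installed.
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[72], [90], [110], [180]])             # horsepower of the known rows
y = np.array([150_000, 200_000, 250_000, 350_000])   # corresponding car prices

model = LinearRegression().fit(X, y)
print(model.predict([[130]]))   # estimate the missing price for the 130 HP car
```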
31. How does it work?
Ingredients
• Data:
  – X, Y (supervised)
• Model: function mapping:
  – Y' = f_w(X), w = w1, w2, ... etc.
• Loss:
  – How well does the Model fit the Data? → Y vs. Y'
• Optimizer:
  – Search for the best f_w that makes the Model fit the Data with minimum Loss.
32. ML: How to learn w's?
Assume only 2 q's:
Basically → Search: a·x1 + b·x2 + c (Model),
w1 = a, w2 = b, w0 = c (bias, to be clear later)
Try many lines, and get the one that separates the Data better → How good? Loss
Practically → Smarter methods (Optimizer) are used, better than brute force or random search!
36. Ingredients of supervised learning
• Data:
  – X, Y (supervised)
• Model: function mapping:
  – Y' = f_w(X), w = w1, w2, ... etc.
• Loss:
  – How well does the Model fit the Data? → Y vs. Y'
• Optimizer:
  – Search for the best f_w that makes the Model fit the Data with minimum Loss.
37. Parameters vs. Hyperparameters
• Parameter = values adjusted by the optimizer algorithm
  • Only on training data! Never on validation or test.
  • Ex: weights
• Hyperparameter = values adjusted by the algorithm designer (human)
  • Only on validation/training data! Never on test.
  • Ex: network architecture (number of layers, number of neurons per layer, ...etc), and optimizer hyperparameters (learning rate, momentum, ...etc) (more on that later)
51. Decision layer/Output (Ex: linear regression)
ReLU (if a negative output is not permitted) or Linear
Regression Design Pattern
https://static.packt-cdn.com/products/9781786469786/graphics/B05478_03_11.png
54. Decision layer/Output (Ex: Logistic regression)
Knows ML or not?
Binary Classification
https://static.wixstatic.com/media/93352b_8e7c7a60715b4728a9a7d13612c5fa58~mv2.gif
55. Multi-class classification
• Model:
  – Layers → features
  – Output → multiple linear neurons (like linear regression)
  – Each output neuron gives a score (unnormalized probability or logit)
  – Softmax → normalize
  – Prediction = argmax
  – Ensures only 1 class is possible
https://deepnotes.io/public/images/softmax.png
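As a quick illustration of the score → softmax → argmax path described above (a minimal NumPy sketch; the logit values are made up):

```python
import numpy as np

def softmax(logits):
    # Subtract the max for numerical stability, then normalize to probabilities.
    z = logits - np.max(logits)
    e = np.exp(z)
    return e / e.sum()

logits = np.array([2.0, 0.5, -1.0])   # unnormalized scores from the output neurons
probs = softmax(logits)               # probabilities, they sum to 1
pred = np.argmax(probs)               # exactly one predicted class
print(probs, pred)
```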
58. Multi-label Classification
• Loss: BCE over each neuron
• Ask every neuron: is it Human? Is it a person? ...etc.
• More than 1 label is possible
https://miro.medium.com/max/3616/1*s6Tm6f3cPhHEFdEjuCMMKQ.jpeg
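A minimal Keras-style sketch of this setup (one sigmoid neuron per label, binary cross-entropy applied to each; the layer sizes, input size, and num_labels are illustrative assumptions):

```python
# Multi-label head: independent sigmoid per label, BCE evaluated on each neuron.
from tensorflow import keras
from tensorflow.keras import layers

num_labels = 5   # illustrative

model = keras.Sequential([
    keras.Input(shape=(100,)),                        # input size is illustrative
    layers.Dense(64, activation="relu"),
    layers.Dense(num_labels, activation="sigmoid"),   # per-label probabilities
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
# At prediction time, threshold each probability (e.g., > 0.5),
# so more than one label can be active at once.
```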
60. Let's code
• Classifying movie reviews: a binary classification example
• Digit classification: a multi-class classification example
• Classifying newswires: a multi-class classification example
• Predicting house prices: a regression example
62. Classifying newswires: a multi-class classification example
SparseCategoricalCrossentropy
Notice: although y_train is NOT one-hot encoded (OHE), we still set the model output to Dense(num_classes, "softmax").
Not only that, the predict API will still return num_classes probabilities, and we argmax them.
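A minimal Keras sketch of the point above (integer labels with a softmax output; the architecture and sizes are illustrative assumptions, loosely following the Reuters newswires setup):

```python
# y_train holds integer class ids (not one-hot), yet the output layer is still a
# softmax over num_classes, and the loss is SparseCategoricalCrossentropy.
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

num_classes = 46   # e.g., the Reuters newswires dataset has 46 topics

model = keras.Sequential([
    keras.Input(shape=(10000,)),        # vectorized text input (illustrative)
    layers.Dense(64, activation="relu"),
    layers.Dense(num_classes, activation="softmax"),
])
model.compile(optimizer="rmsprop",
              loss=keras.losses.SparseCategoricalCrossentropy(),
              metrics=["accuracy"])

# predict() still returns num_classes probabilities per sample; we argmax them:
# preds = model.predict(x_test)           # shape: (num_samples, num_classes)
# labels = np.argmax(preds, axis=-1)
```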
64. Introduction to sklearn
– Sklearn conventions
– Model persistence (save/load)
– Dealing with different data variables (categorical, nominal, ordinal, ...etc)
– Dealing with different problem types (Multi-label, Multi-class, ...)
65. Introduction to sklearn
https://colab.research.google.com/drive/1YMOpQAh3gN_A013IHMPvhFpoKcbMUK2B?usp=sharing
We have seen how to deal with different problem setups in Keras, which targets one model type (NN).
Now, let's explore a more generic framework that targets different ML models.
This is all you need from sklearn.
We won't go through everything, only some parts to stress; we see the rest as we go.
sklearn is generic for ML
keras/tf/pytorch are more for DNN
66. Introduction to sklearn
The main design principles are:
- Consistency. All objects share a consistent and simple interface.
- Inspection. All the estimator's hyperparameters are accessible directly via public instance variables (e.g., imputer.strategy), and all the estimator's learned parameters are also accessible via public instance variables with an underscore suffix (e.g., imputer.statistics_).
- Nonproliferation of classes. Datasets are represented as NumPy arrays or SciPy sparse matrices, instead of homemade classes. Hyperparameters are just regular Python strings or numbers.
- Composition. Existing building blocks are reused as much as possible. For example, it is easy to create a Pipeline estimator from an arbitrary sequence of transformers followed by a final estimator, as we will see.
- Sensible defaults. Scikit-Learn provides reasonable default values for most parameters, making it easy to create a baseline working system quickly.
67. Sklearn Interfaces
- Estimators. Any object that can estimate some parameters based on a dataset is called an estimator (e.g., an imputer is an estimator). The estimation itself is performed by the fit() method, and it takes only a dataset as a parameter (or two for supervised learning algorithms; the second dataset contains the labels). Any other parameter needed to guide the estimation process is considered a hyperparameter (such as an imputer's strategy), and it must be set as an instance variable (generally via a constructor parameter).
- Transformers. Some estimators (such as an imputer) can also transform a dataset; these are called transformers. Once again, the API is quite simple: the transformation is performed by the transform() method with the dataset to transform as a parameter. It returns the transformed dataset. This transformation generally relies on the learned parameters, as is the case for an imputer. All transformers also have a convenience method called fit_transform() that is equivalent to calling fit() and then transform() (but sometimes fit_transform() is optimized and runs much faster).
- Predictors. Finally, some estimators are capable of making predictions given a dataset; they are called predictors. For example, the LinearRegression model in the previous chapter was a predictor: it predicted life satisfaction given a country's GDP per capita. A predictor has a predict() method that takes a dataset of new instances and returns a dataset of corresponding predictions. It also has a score() method that measures the quality of the predictions given a test set (and the corresponding labels in the case of supervised learning algorithms).
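A compact sketch tying the three interfaces together (estimator, transformer, predictor) using standard scikit-learn classes; the tiny dataset is made up for illustration:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression

X = np.array([[1.0], [2.0], [np.nan], [4.0]])
y = np.array([2.0, 4.0, 6.0, 8.0])

# Estimator + Transformer: fit() learns the statistic, transform() applies it.
imputer = SimpleImputer(strategy="mean")       # strategy = hyperparameter
X_clean = imputer.fit_transform(X)
print(imputer.statistics_)                     # learned parameter, underscore suffix

# Predictor: fit(), then predict() and score().
reg = LinearRegression().fit(X_clean, y)
print(reg.predict([[5.0]]))
print(reg.score(X_clean, y))
```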
70. Gradient Based Optimization
– Optimization as a search problem
– Gradient based optimization for linear classifiers/regressors
– Linear Regression
– SGD
– Logistic Regression
– Overfitting vs Underfitting
– Regularization
– Deeper models
71. How does the optimizer work for Parametric models?
• Option 1: Closed-form solution → Let's solve the equations!
• Option 2: Search!
  – Brute force: try all possible values of w's and choose the set with min loss → Numerical solution = optimization
  – Random search: same as brute force, but choose a random subset of all possible w's
  – Guided search: start from random w's, measure the loss, then choose the next set of w's, guided by the change in the loss.
On which data do we compute the loss?
How do we use the loss change as guidance?
https://www.manning.com/books/deep-learning-with-python
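To make the search options concrete, here is a minimal random-search sketch for a one-feature linear model (the toy data and number of trials are illustrative assumptions; guided search is what gradient descent will do later):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.array([1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0                      # toy data generated with w1=2, w0=1

def mse(w1, w0):
    return np.mean((y - (w1 * x + w0)) ** 2)

# Random search: try random w's and keep the set with the minimum loss.
best = (None, None, np.inf)
for _ in range(10_000):
    w1, w0 = rng.uniform(-5, 5, size=2)
    loss = mse(w1, w0)
    if loss < best[2]:
        best = (w1, w0, loss)
print(best)   # close to (2, 1, ~0)
```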
72. On which data to compute the loss?
• Search in training data → Loss evaluation + Parameter updates
• Evaluate on unseen testing data → Loss evaluation, but NO parameter updates allowed
83. Normal equation
• Disadvantages:
  – Slow (X is a big matrix; the inverse is costly and does not always exist)
  – Uses all the data points → might overfit (more on that later)
  – It doesn't work for classification, only regression, since some equations will vanish (correct classifications), leading to more degrees of freedom and fewer constraints, and we'll have infinite solutions (infinite lines can separate the classes with 0 MSE)
https://raw.githubusercontent.com/qingkaikong/blog/master/43_support_vector_machine_part1/figures/figure_3.jpg
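For reference, the closed-form solution the slide refers to is $w = (X^\top X)^{-1} X^\top y$; a minimal NumPy sketch on toy data (pinv is used to sidestep the case where $X^\top X$ is not invertible):

```python
import numpy as np

# Toy regression data: y = 3x + 2, with a bias column appended to X.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = 3.0 * x + 2.0
X = np.column_stack([x, np.ones_like(x)])    # add bias column

# Normal equation: w = (X^T X)^{-1} X^T y  (pinv handles a singular X^T X)
w = np.linalg.pinv(X.T @ X) @ X.T @ y
print(w)   # ~[3., 2.]
```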
84. ANOTHER ISSUE: Non-representation of data
• Sources:
  – Sample noise: small data
  – Sample bias: gender, ethnicity, popularity, ...etc
  – Poor quality (garbage in, garbage out): mislabels
  – Missing data
• This results in poor generalization = overfitting
• The effect is magnified if the model is allowed to use all the training data, as in the case of the normal equation
• As can be seen, with the missing countries added, we get a completely different curve!
86. Gradient based optimization
A gradient is a rate of change of the output with a change in the input.
Meaning: how much y changes when we change x by a small amount.
In other words, it's the rate of change of y with x.
It also encodes the sensitivity of y to x.
In the case of a loss, the gradient $\frac{\partial \text{loss}}{\partial w}$ is how much the loss changes when we change a weight by a small amount. It guides our search for the weights that minimize the loss; if we change w a bit and the loss is affected a lot, it means that the loss is sensitive to this change.
In the case of MSE loss, for example, since the loss is convex, any change in w moves the loss away from the minimum. So if we move against the gradient, we reach the minimum. If this change value is small, it means we need to move a small step (we are already at the min: $\frac{\partial \text{loss}}{\partial w} = 0$), and vice versa.
https://www.manning.com/books/deep-learning-with-python
87. Gradient descent
When w_new = w_old, then the grad = 0, so we have reached the min (the same condition we wanted in the normal equation and closed-form solution).
grad = delta_W → a numerical estimate of the true gradient (dJ/dW) → How to calculate that estimate?
http://users.isr.ist.utl.pt/~wurmd/Livros/school/Bishop%20%20Pattern%20Recognition%20And%20Machine%20Learning%20-%20Springer%20%202006.pdf
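One simple way to calculate such an estimate is a finite-difference approximation, nudging each weight and measuring the change in loss (a minimal sketch; the example loss function and step size are illustrative):

```python
import numpy as np

def numerical_gradient(loss_fn, w, eps=1e-6):
    """Estimate dJ/dW: perturb each weight by +/- eps and measure the loss change."""
    grad = np.zeros_like(w)
    for i in range(len(w)):
        w_plus, w_minus = w.copy(), w.copy()
        w_plus[i] += eps
        w_minus[i] -= eps
        grad[i] = (loss_fn(w_plus) - loss_fn(w_minus)) / (2 * eps)
    return grad

# Example loss: J(w) = (w0 - 3)^2 + (w1 + 1)^2, minimum at [3, -1]
loss = lambda w: (w[0] - 3) ** 2 + (w[1] + 1) ** 2
print(numerical_gradient(loss, np.array([0.0, 0.0])))   # ~[-6., 2.]
```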
89. Training loop
For number of epochs:
  • For each example in the dataset:
    – Feedforward: compute the error
    – Feedback: compute the gradient
    – Update the params
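A minimal sketch of this per-example loop for a one-feature linear model (toy data; the learning rate and epoch count are illustrative assumptions):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0                        # toy data: true w1=2, w0=1
w1, w0, lr = 0.0, 0.0, 0.01

for epoch in range(200):                 # for number of epochs
    for xi, yi in zip(x, y):             # for each example in the dataset
        err = (w1 * xi + w0) - yi        # feedforward: compute the error
        dw1, dw0 = 2 * err * xi, 2 * err # feedback: gradient of the squared error
        w1 -= lr * dw1                   # update the params
        w0 -= lr * dw0
print(w1, w0)                            # approaches ~2, ~1
```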
90. Let's code: Batch Gradient Descent
When using Gradient Descent, you should ensure that all features have a similar scale (e.g., using Scikit-Learn's StandardScaler class), or else it will take much longer to converge.
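A minimal Batch GD sketch on scaled features (toy two-feature data; the learning rate, epoch count, and the specific use of StandardScaler here are illustrative assumptions):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Two features on very different scales (0..100 vs 0..1).
X = np.column_stack([rng.uniform(0, 100, 50), rng.uniform(0, 1, 50)])
y = 3 * X[:, 0] - 0.5 * X[:, 1] + 10

X_scaled = StandardScaler().fit_transform(X)               # similar scale -> faster convergence
Xb = np.column_stack([X_scaled, np.ones(len(X_scaled))])   # add bias column
w, lr = np.zeros(3), 0.1

for epoch in range(500):
    grad = 2 / len(Xb) * Xb.T @ (Xb @ w - y)   # gradient over ALL samples (batch GD)
    w -= lr * grad                             # one update per epoch
print(w)   # weights in the standardized feature space, plus bias
```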
92. Learning rate
As you can see, intuitively it's important to pick a reasonable value for the step factor.
If it's too small, the descent down the curve will take many iterations, and it could get stuck in a local minimum.
If the step is too large, your updates may end up taking you to completely random locations on the curve.
https://cdn-images-1.medium.com/max/1000/1*YQWdnHVTRPGjr0-VGuSxyA.jpeg
98. Momentum
Additionally, there exist multiple variants of SGD that differ by taking into account previous weight updates when computing the next weight update, rather than just looking at the current value of the gradients.
There is, for instance, SGD with momentum, as well as Adagrad, RMSProp, and several others. Momentum draws inspiration from physics.
A useful mental image here is to think of the optimization process as a small ball rolling down the loss curve. If it has enough momentum, the ball won't get stuck in a ravine and will end up at the global minimum. Momentum is implemented by moving the ball at each step based not only on the current slope value (current acceleration) but also on the current velocity (resulting from past acceleration).
In practice, this means updating the parameter w based not only on the current gradient value but also on the previous parameter update, such as in this naive implementation:
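The code screenshot is not included in this text export; below is a self-contained sketch of what such a naive momentum update looks like (the toy quadratic loss, learning rate, and momentum value are illustrative assumptions):

```python
# Naive momentum sketch on a toy quadratic loss J(w) = (w - 3)^2.
learning_rate, momentum = 0.1, 0.9
w, velocity = 0.0, 0.0

for step in range(200):
    gradient = 2 * (w - 3)                                      # current slope (dJ/dw)
    velocity = momentum * velocity - learning_rate * gradient   # keep memory of past updates
    w += velocity                                               # update uses velocity, not just the current gradient
print(w)   # ~3 (the minimum)
```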
[Figures: Normal GD vs. GD with momentum]
https://i.pinimg.com/236x/31/90/2e/31902e5c838575c5aae6535181740cb9.jpg?nii=t
https://machinelearningmastery.com/gradient-descent-with-momentum-from-scratch/
103. Repeated representation of data (ALL data): Batch GD
• Training Loops
  • For 1..Num epochs
    – For 1..Num examples (M): delta_W(M) += GRAD(EXAMPLE)
    – GD Update: w(k+1) = w(k) + delta_W(M)
104. Different training loop options - Introducing batches
GD (Batch GD)
The above GD algorithm updates the gradient with every new sample of x.
If X has 1000 samples, then delta_W[t+1] = step * gradient(f)(W[t]) is calculated and accumulated for every sample, but the weight update is applied once, after all 1000 samples have been fed.
• Update once at the end
• High accumulation of error, risk of saturation
• Slow feedback, slow convergence, but more stable
• Takes advantage of parallelism in matrix operations (comp. arch.)
https://colab.research.google.com/drive/11_-SxhdtxvRYdPukURz-cojd7-JdpiKR?authuser=1
Getting the gradient for all the data could be slow, and it accumulates many errors before one update.
loss = f(x, W) → 2 unknowns → we search for W, but what about x?
GD → dloss/dW, at which x?
So we estimate the gradient for different x's:
Batch = ALL
Stochastic = One
Mini-batch = in between
105. Introducing Batches - Why?
• Training Loops with batches
  • For 1..Num epochs
    – For 1..Num batches (N_batches)
      • For 1..Num of examples per batch (B)
      – GD Update: w(k+1) = w(k) + delta_W(B)
106. Effect of batch size
• Batch Gradient Descent: batch size is set to the total number of examples in the training dataset (M).
• Stochastic Gradient Descent: batch size is set to one.
• Minibatch Gradient Descent: batch size (=B) is set to more than one and less than the total number of examples in the training dataset.
For shorthand, the algorithm is often referred to as stochastic gradient descent regardless of the batch size.
Objective: B = M
However, given that very large datasets are often used to train deep learning neural networks, the batch size is rarely set to the size of the training dataset.
Batch size only affects convergence speed, not final performance.
108. Different training loop options
SGD
There are other options for updating:
1. SGD: each sample (batch_size=1)
  – Fast feedback, fast convergence (corrects itself a lot)
  – Could oscillate, unstable (tends to corrupt what it has learnt) → To overcome, reduce the LR
  – Does not take advantage of parallelism in matrix operations (comp. arch.)
  – Not saturating
  – Stochastic: every update is a representative sample of the true gradient
[Figure: SGD + High LR vs. SGD + Low LR]
One sample is not enough to estimate the gradient → so don't take large steps.
109. SGD vs BGD
• Batch Gradient Descent: use a relatively larger learning rate and more training epochs.
• Stochastic Gradient Descent: use a relatively smaller learning rate and fewer training epochs.
Mini-batch gradient descent provides an alternative approach.
110. Different training loop options
• Minibatch SGD: every group of samples (batch_size=B)
• Accumulate B gradients, update every B samples
  – More stable than 1 sample
  – Faster than GD
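A minimal minibatch loop sketch (toy data; the batch size, learning rate, and epoch count are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 1))
y = 2.0 * X[:, 0] + 1.0                        # toy data: true weight 2, bias 1
Xb = np.column_stack([X, np.ones(len(X))])     # add bias column
w, lr, B = np.zeros(2), 0.1, 10                # batch size B

for epoch in range(100):
    idx = rng.permutation(len(Xb))             # shuffle each epoch
    for start in range(0, len(Xb), B):
        batch = idx[start:start + B]
        grad = 2 / B * Xb[batch].T @ (Xb[batch] @ w - y[batch])   # gradient over B samples
        w -= lr * grad                          # one update per minibatch
print(w)   # ~[2., 1.]
```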
111. Effect of batch size in MBGD
The plots show that a small batch size generally results in rapid learning, but a volatile learning process with higher variance in the classification accuracy.
Larger batch sizes slow down the learning process, but the final stages result in convergence to a more stable model, exemplified by lower variance in classification accuracy.
113. Training loop options / Effect of batch size
• SGD (available in sklearn and Keras), Batch_sz = 1
  – Pros: fast feedback, fast convergence (corrects itself a lot); no risk of gradient saturation
  – Cons: needs a reduced LR since we make many corrections (could oscillate); inaccurate gradient estimate
• BGD (manual in sklearn, available in Keras with batch_size=M), Batch_sz = M (total samples)
  – Pros: takes advantage of parallelism in matrix operations (comp. arch.) → better gradient estimate → more stable
  – Cons: update once at the end → slow convergence; high accumulation of error → risk of saturation
• MBGD (manual in sklearn, available in Keras), Batch_sz = B (N_batches = M/B)
  – Pros: better estimates than SGD; faster than BGD
  – Cons: worse estimates than BGD (more oscillations); slower than SGD