Semantic gap - What can the computer see?
Semantic gap - What can the computer read?
Semantic gap in text - What can computers read?
In images, we have a matrix of pixel values
In text → ??
→ Naive = Bag of Words (BoW) against a vocabulary
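A minimal sketch of the naive BoW idea, assuming a tiny made-up vocabulary and whitespace tokenization (both are illustrative choices, not from the slides): the text becomes a vector of word counts, one entry per vocabulary word.

```python
from collections import Counter

def bag_of_words(text, vocab):
    """Count how often each vocabulary word occurs in `text`."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocab]

# Hypothetical toy vocabulary and sentence, for illustration only.
vocab = ["cat", "dog", "sat", "mat", "the"]
vector = bag_of_words("The cat sat on the mat", vocab)
print(vector)
```

Words outside the vocabulary (here "on") are simply dropped, and word order is lost, which is part of the semantic gap for text.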
Semantic gap - What is the best representation of sensory signals for computers?
Old problem →
Face recognition with traditional AI
Rule-based
Face recognition with traditional AI
Did you handle
Did you handle
Did you handle
Did you handle
Too many cases!! Let’s learn the rules!
Rule-based AI vs. ML
“Machine learning is the science of getting
computers to act without being explicitly
programmed”, Arthur Samuel 1959
Deep Learning vs. Machine Learning
ML: How to learn the w’s?
Assume only 2 q’s:
Basically → Search: ax1 + bx2 + c (Model) → w1=a, w2=b, w0=c (bias, to be clear later)
Try many lines, and keep the one that separates the data best → How good?
Practically → Smarter methods (optimizers) are used instead of brute-force or random search
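A minimal sketch of the "try many lines" idea, assuming a made-up 2-feature dataset and plain random search (the data, ranges, and function names are illustrative assumptions): each candidate line is scored by how many points it misclassifies, and the best one is kept.

```python
import random

# Hypothetical toy data: two features (x1, x2) and a label in {0, 1}.
data = [((1.0, 1.0), 0), ((2.0, 1.5), 0), ((1.5, 2.0), 0),
        ((4.0, 4.5), 1), ((5.0, 4.0), 1), ((4.5, 5.0), 1)]

def predict(w, x):
    """Model: the sign of w1*x1 + w2*x2 + w0 decides the class."""
    w0, w1, w2 = w
    return 1 if w1 * x[0] + w2 * x[1] + w0 > 0 else 0

def error_count(w):
    """How good? Count the misclassified points (lower is better)."""
    return sum(predict(w, x) != y for x, y in data)

def random_search(n_trials=2000, seed=0):
    """Try many random lines and keep the one with the fewest errors."""
    rng = random.Random(seed)
    best_w, best_err = None, len(data) + 1
    for _ in range(n_trials):
        w = tuple(rng.uniform(-10, 10) for _ in range(3))
        err = error_count(w)
        if err < best_err:
            best_w, best_err = w, err
    return best_w, best_err

best_w, best_err = random_search()
print(best_w, best_err)
```

This works on a toy problem with 3 weights, but the number of trials needed grows quickly with the number of weights, which is why smarter optimizers are used in practice.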
How to decide on questions?
We call the vector of questions $q^i=[q_1, q_2, \dots, q_N]$ the feature vector
In case of structured data like above, normally we choose them based on
In case of unstructured data like images, the input vector is just the pixel values. So the
what are the features?
The answer to this question defines if we do ML or DL
In general, if we feed raw pixels, we do DL, if we define specific features, we do ML.
This will become clearer later
Semantic gap - What can the computer see? Traditional CV pipeline with OpenCV
Deep Learning vs. Machine Learning
What do neurons represent?
What could be good face features?
What can neurons see?
Ex: Semantic gap in tabular data
Or you can have derived features
The minimum is a simple input transform
How to close the gap? The deeper you go, the more you close the gap from raw (X) to
How to obtain the
The model is usually hierarchical = layered
The minimum number of layers is a single input transformation
More “refinements” means more depth
The deeper you go, the more the gap X → y closes
The “Deep” in Deep Learning
Hierarchical Representation
What can AI do?
Artificial Narrow Intelligence: Structured or unstructured
Structured data Unstructured data
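➢ Parametric: assume a family of functions (= distribution) from which the model is searched. $f_\theta(X)$ is read as the
mapping function (model) of input X to output y, parametrized by $\theta$
➢ Non-parametric: we don’t assume any function family; the model is built directly from the stored data, usually with heuristics. Example: K-NN.
(Naive Bayes assumes a distribution per feature, so it is usually considered parametric.)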
Statistical Learning
Non-parametric Parametric
What is AI best at today?
Classification I
Regression Pric
Horse Power   72     90     110    130    180
Car Price     150K   200K   250K   ?      350K

Domain        X      Y      X      Z      Z
Grade         B      A      A      B      A
Job Salary    10K    20K    ?      35K    50K

Pure DB query = fail
Need to interpolate
An output could depend on many factors = features
Functional approximation
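A minimal sketch of why a pure lookup fails while function approximation works, using the horse power / car price table above. Piecewise linear interpolation is an assumption chosen for simplicity; a learned model would play the same role.

```python
# Table from the slide: horse power -> car price (in K); 130 HP is unknown.
known = {72: 150, 90: 200, 110: 250, 180: 350}

def db_lookup(hp):
    """Pure DB query: only exact matches can be answered."""
    return known.get(hp)  # None when the key was never stored

def interpolate(hp):
    """Linear interpolation between the two nearest known horse powers."""
    xs = sorted(known)
    for lo, hi in zip(xs, xs[1:]):
        if lo <= hp <= hi:
            t = (hp - lo) / (hi - lo)
            return known[lo] + t * (known[hi] - known[lo])
    raise ValueError("hp is outside the known range")

print(db_lookup(130))    # None: the query fails
print(interpolate(130))  # an estimate between 250K and 350K
```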
Horse Power    72    90    110   130    180
Car Category   Eco   Low   Mid   High   Premium

Discrimination = Classification
Optimization problem = Min. Error
How does it work?
• Data:
– X, Y (supervised)
• Model: function mapping:
– Y’ = f_w(X), w = (w1, w2, …)
• Loss:
– How well does the model fit the data? → Y vs. Y’
• Optimizer:
– Search for the best f_w, the one that makes the model fit the data with min. loss
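The four ingredients above can be sketched in a few lines. This assumes a made-up 1-D dataset, a linear model, MSE as the loss, and a brute-force grid as the optimizer; all of these are illustrative choices.

```python
# 1. Data: hypothetical supervised pairs (x, y), roughly y = 2x + 1.
X = [0.0, 1.0, 2.0, 3.0, 4.0]
Y = [1.1, 2.9, 5.2, 6.8, 9.1]

# 2. Model: Y' = f_w(X) with parameters w = (w0, w1).
def model(w, x):
    return w[0] + w[1] * x

# 3. Loss: how well does the model fit the data? Mean squared error.
def mse(w):
    return sum((model(w, x) - y) ** 2 for x, y in zip(X, Y)) / len(X)

# 4. Optimizer: here a brute-force grid search over candidate weights.
grid = [i / 10 for i in range(-50, 51)]  # -5.0, -4.9, ..., 5.0
best_w = min(((w0, w1) for w0 in grid for w1 in grid), key=mse)
print(best_w, mse(best_w))
```

Swapping the optimizer for gradient descent changes only step 4; the data, model, and loss stay the same.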
How DL works?
1. Model
2. Data
3. Loss
4. Optimizer
Optimizer does the
Using backpropagation
Ingredients of supervised learning
➢ Data:
● X, Y (supervised)
➢ Model: function mapping:
● Y’ = f_w(X), w = (w1, w2, …)
➢ Loss:
● How well does the model fit the data? → Y vs. Y’
➢ Optimizer:
● Search for the best f_w, the one that makes the model fit the data with min. loss
Parameters vs. Hyperparameters
• Parameter = value adjusted by the optimizer algorithm
• Only on training data! Never on validation or test.
• Ex: weights
• Hyperparameter = value adjusted by the algorithm designer (human)
• Only on validation/training data! Never on test.
• Ex: network architecture (number of layers, number of neurons per layer, …etc.), and optimizer hyperparameters (learning rate, momentum, …etc.) (more on that later)
[Figure: X → Model → y = 7]
Problem types
Classification   Regression
Problem type depends on the
target variable = y
Global ML/DL framework
Manual = Hand-crafted = Engineered = ML
Auto = NN = DL
Deep or shallow?
Source: Deep Learning Summer School, Montreal, 2015: Convolutional NN, Lee
Universal SL pattern: Encoder-Decoder design pattern
x → y
img → img2vec → v → vec2task → y
seq → seq2vec → v → vec2task → y
Activations and Losses
Model vs. Problem summary
Regardless of layers types, we can
design two things:
- Last layer activation
- Loss function
Based on problem type
Model (Ex: linear regression)
How much ML is known?
Decision layer/Output (Ex: linear regression)
ReLU (if negative is not permitted) or Linear
Loss (Ex: linear regression)
Model (Ex: Logistic regression)
Knows ML or not?
Decision layer/Output (Ex: Logistic regression)
Binary Classification
Multi-class Classification
➢ Model:
● Layers → features
● Output → Multiple linear neurons (like linear regression)
● Each output neuron gives a score (unnormalized probability or logit)
● Softmax → Normalize the scores into probabilities
● Prediction = argmax
● Ensures only 1 class is possible
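A minimal sketch of the output stage described above, assuming three made-up logits: softmax normalizes the scores into probabilities, and argmax picks exactly one class.

```python
import math

def softmax(logits):
    """Normalize raw scores (logits) into probabilities that sum to 1."""
    m = max(logits)                       # subtract max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits from three output neurons (one per class).
logits = [2.0, 1.0, 0.1]
probs = softmax(logits)
prediction = probs.index(max(probs))      # argmax: exactly one class wins
print(probs, prediction)
```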
Multi-label Classification
➢ Model:
● Layers
● Multiple
➢ Loss: BCE over each neuron
➢ Ask every neuron: is it Human? Is it
➢ More than 1 is possible
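A minimal sketch of the multi-label setup, assuming three made-up logits and labels: each output neuron gets its own sigmoid and its own binary cross-entropy (BCE), so the probabilities need not sum to 1 and several labels can be active at once.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def bce(y_true, p):
    """Binary cross-entropy for one output neuron."""
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1 - p))

# Hypothetical logits and labels for three independent yes/no questions.
logits = [2.5, -1.0, 0.8]
labels = [1, 0, 1]

probs = [sigmoid(z) for z in logits]
predicted = [int(p > 0.5) for p in probs]          # several can be 1 at once
loss = sum(bce(y, p) for y, p in zip(labels, probs)) / len(labels)
print(probs, predicted, loss)
```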
Model vs. Problem summary
➢ Classifying movie reviews: a binary classification example
➢ Digit classification: a multi-class classification example
➢ Classifying newswires: a multi-class classification example
➢ Predicting house prices: a regression example
Predicting House
Classifying newswires: a multi-class classification example
Sparse Categorical Cross-Entropy
Notice that while y_train is NOT one-hot encoded (OHE), we still set the model output to
Dense(num_classes, 'softmax').
Not only that, the predict API still returns num_classes probabilities, and we
argmax them
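A minimal sketch in plain Python (not the Keras implementation) of why the integer label and the one-hot label lead to the same loss. The probabilities and function names below are illustrative assumptions.

```python
import math

def categorical_ce(one_hot, probs):
    """Cross-entropy when the label is a one-hot vector."""
    return -sum(t * math.log(p) for t, p in zip(one_hot, probs))

def sparse_categorical_ce(class_index, probs):
    """Same loss when the label is just the integer class index."""
    return -math.log(probs[class_index])

# Hypothetical softmax output for num_classes = 4 and true class 2.
probs = [0.1, 0.2, 0.6, 0.1]
loss_sparse = sparse_categorical_ce(2, probs)
loss_one_hot = categorical_ce([0, 0, 1, 0], probs)
predicted_class = probs.index(max(probs))  # argmax over the num_classes probs
print(loss_sparse, loss_one_hot, predicted_class)
```

Only the label format differs; the model output still has num_classes entries in both cases.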
● Sklearn conventions
● Model persistence
● Dealing with different data variables (categorical,
● Dealing with different problem types (Multi-label, Multi-class,
Introduction to sklearn
We have seen how to deal with different problem setups in Keras, which targets one model type (NN)
Now, let’s explore a more generic framework that targets different ML models
This is all you need from sklearn.
We won’t go through all of it, only the parts worth stressing; we will see the rest as we go
sklearn is generic for ML
keras/tf/pytorch are more for DNN
Introduction to sklearn
The main design principles are:
- Consistency. All objects share a consistent and simple interface
- Inspection. All the estimator’s hyperparameters are accessible directly via public instance variables (e.g., imputer.strategy),
and all the estimator’s learned parameters are also accessible via public instance variables with an underscore suffix.
- Nonproliferation of classes. Datasets are represented as NumPy arrays or SciPy sparse matrices, instead of homemade
classes. Hyperparameters are just regular Python strings or numbers.
- Composition. Existing building blocks are reused as much as possible. For example, it is easy to create a Pipeline
estimator from an arbitrary sequence of transformers followed by a final estimator, as we will see.
- Sensible defaults. Scikit-Learn provides reasonable default values for most parameters, making it easy to create a baseline
working system quickly.
- Estimators. Any object that can estimate some parameters based on a dataset is called an estimator (e.g., an imputer is
an estimator). The estimation itself is performed by the fit() method, and it takes only a dataset as a parameter (or two for
supervised learning algorithms; the second dataset contains the labels). Any other parameter needed to guide the
estimation process is considered a hyperparameter (such as an imputer’s strategy), and it must be set as an instance
variable (generally via a constructor parameter).
- Transformers. Some estimators (such as an imputer) can also transform a dataset; these are called transformers. Once
again, the API is quite simple: the transformation is performed by the transform() method with the dataset to transform as a
parameter. It returns the transformed dataset. This transformation generally relies on the learned parameters, as is the
case for an imputer. All transformers also have a convenience method called fit_transform() that is equivalent to calling fit()
and then transform() (but sometimes fit_transform() is optimized and runs much faster).
- Predictors. Finally, some estimators are capable of making predictions given a dataset; they are called predictors. For
example, the LinearRegression model in the previous chapter was a predictor: it predicted life satisfaction given a country’s
GDP per capita. A predictor has a predict() method that takes a dataset of new instances and returns a dataset of
corresponding predictions. It also has a score() method that measures the quality of the predictions given a test set (and
the corresponding labels in the case of supervised learning algorithms).
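A minimal sketch of these conventions, assuming scikit-learn and NumPy are installed and using a made-up dataset with one missing value. It touches an estimator/transformer (SimpleImputer), a hyperparameter and a learned parameter, composition (Pipeline), and a predictor (LinearRegression).

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: one feature with a missing value, y = 2x.
X = np.array([[1.0], [2.0], [np.nan], [4.0], [5.0]])
y = np.array([2.0, 4.0, 6.0, 8.0, 10.0])

# Estimator + transformer: fit() learns, transform() applies.
imputer = SimpleImputer(strategy="median")
X_filled = imputer.fit_transform(X)
print(imputer.strategy)      # hyperparameter (public attribute)
print(imputer.statistics_)   # learned parameter (underscore suffix)

# Composition: transformers followed by a final predictor.
model = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
    ("regress", LinearRegression()),
])
model.fit(X, y)
print(model.predict(np.array([[3.0]])))
print(model.score(X, y))     # R^2 for regressors
```

The same fit / transform / predict / score vocabulary applies to most sklearn models, which is what makes swapping one model for another cheap.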
Parametric vs Non-parametric models:
● Optimization as a search
● Gradient based optimization for
linear classifiers/regressors
● Linear
● Logistic
● Overfitting vs
● Regularization
● Deeper models
How does the optimizer work for parametric models?
➢ Option 1: Closed-form solution → Let's solve the equations!
➢ Option 2: Search!
● Brute force: try all possible values of the w's and choose the set with min loss → Numerical solution =
● Random search: same as brute force, but choose a random subset of all possible w's
● Guided search: start from random w's, measure the loss, then choose the next set of w's, guided by the change in the loss.
On which data to compute the loss?
How to use the loss change as a
On which data to compute the loss?
➢ Search in training data → Loss evaluation + Parameters update
➢ Evaluate on unseen testing data → Loss evaluation but no parameter updates allowed
Is the separating boundary (classification) / fitting model (regression) always a line?
Further classification: Linear vs. Non-linear models
Statistical Learning
Non-parametric Parametric
Non-linear Linear
We start with Linear models
Linear Regression
Example: Life Satisfaction vs. GDP
Play with Pandas
for tabular data!
Example: Life Satisfaction vs. GDP
We can manually fit many
Normal equation
N-equations, in N
sklearn's LinearRegression is essentially the normal equation, solved with the pseudoinverse
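A minimal sketch of the normal equation $w = (X^T X)^{-1} X^T y$ and its pseudoinverse variant, assuming NumPy and a made-up dataset (not the life satisfaction data). This illustrates the formula only; it is not sklearn's internal code.

```python
import numpy as np

# Hypothetical toy data: y is roughly 3 + 2x.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([3.1, 4.9, 7.2, 8.8, 11.0])

# Add a column of ones so that w[0] plays the role of the bias.
X_b = np.column_stack([np.ones_like(x), x])

# Normal equation: w = (X^T X)^{-1} X^T y
w_normal = np.linalg.inv(X_b.T @ X_b) @ X_b.T @ y

# Pseudoinverse variant: also defined when X^T X is not invertible.
w_pinv = np.linalg.pinv(X_b) @ y

print(w_normal, w_pinv)
```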
➢ Disadvantages:
● Slow (X is a big matrix; the inverse is costly and does not always exist)
● Uses all the data points → Might overfit (more on that later)
● It doesn’t work for classification, only regression: some equations vanish (correct classifications), leaving more degrees of freedom and fewer constraints, so we get infinitely many solutions (infinitely many lines can separate the classes with 0 MSE)
➢ Sources:
– Sample noise: small data
– Sample bias: gender, ethnicity,
– Poor quality (garbage in, garbage out):
– Missing data
➢ This results in poor generalization = overfitting
➢ The effect is magnified if the model is allowed to use all the training data, as in the case of the normal equation
➢ As can be seen, with the missing countries added, we get a completely different curve!
ANOTHER ISSUE: Non-representative data
Gradient-based optimization
A gradient is a measure of how much the output changes with a change in the input.
Meaning: how much y changes when we change x by a small amount.
In other words, it's the rate of change of y with respect to x.
It also encodes the sensitivity of y to x.
In the case of the loss, the gradient $\frac{\partial \text{loss}}{\partial w}$ is how much the loss
changes when we change a weight by a small amount. It guides our search for
the weights that minimize the loss; if we change w a bit and the loss is affected
a lot, it means that the loss is sensitive to this change.
In the case of MSE loss with a linear model, for example, the loss is convex, and the gradient
points in the direction in which the loss increases. So if we move against it, we
approach the minimum. If the gradient is small, we only need to move a
small step (at the minimum itself, $\frac{\partial \text{loss}}{\partial w} = 0$), and vice versa.
When w_new = w_old, then the grad = 0
So we have reached the min. (same as we
wanted in normal equation and closed
form solution)
grad=delta_W → Numerical estimate of the
gradient (dJ/dW) → How to calc that
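A minimal sketch of one way to estimate the gradient numerically, assuming a made-up convex loss and a central finite difference (an illustrative choice). Each step then moves against the estimated gradient.

```python
def loss(w):
    """Hypothetical convex loss with its minimum at w = 3."""
    return (w - 3.0) ** 2

def numerical_gradient(f, w, eps=1e-6):
    """Central finite difference: (f(w + eps) - f(w - eps)) / (2 * eps)."""
    return (f(w + eps) - f(w - eps)) / (2 * eps)

step = 0.1
w = 0.0
for _ in range(100):
    grad = numerical_gradient(loss, w)
    delta_w = -step * grad   # move against the gradient
    w = w + delta_w

print(w, numerical_gradient(loss, w))
```

Finite differences need two loss evaluations per weight, so for models with many weights the gradient is computed analytically instead (backpropagation in neural networks).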
Batch Gradient Descent
For number of epochs
➢ For each example in the
● Feedfwd: Compute err
● Feedback: Compute
● Update params
Let’s code Batch Gradient Descent
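A minimal sketch of batch gradient descent for linear regression with MSE loss, assuming NumPy and synthetic data. The learning rate, number of epochs, and data are illustrative assumptions, not the values from the original slide.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: y = 4 + 3x plus a little noise.
X = rng.uniform(0, 2, size=(100, 1))
y = 4 + 3 * X[:, 0] + rng.normal(0, 0.1, size=100)
X_b = np.column_stack([np.ones(len(X)), X])  # bias column of ones

lr = 0.1          # learning rate (step factor)
n_epochs = 1000
w = np.zeros(2)   # [bias, slope]

for epoch in range(n_epochs):
    y_pred = X_b @ w                       # feedforward on ALL examples
    error = y_pred - y
    grad = 2 / len(X_b) * X_b.T @ error    # gradient of the MSE w.r.t. w
    w = w - lr * grad                      # one update per epoch

print(w)
```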
When using Gradient Descent, you
should ensure that all features have a
similar scale (e.g., using Scikit-Learn’s
StandardScaler class), or else it will
take much longer to converge
As you can see, intuitively it’s important to pick a reasonable value for the step factor.
If it’s too small, the descent down the curve will take many iterations, and it could get stuck in a local minimum.
If the step is too large, your updates may end up taking you to completely random locations on the curve.
Playing with the learning rate
- Learning rate
- Learning rate schedule and simulated annealing
High LR = Oscillation at a high level of loss
Monitor val_loss
Additionally, there exist multiple variants of SGD that differ by taking into
account previous weight updates when computing the next weight update,
rather than just looking at the current value of the gradients.
There is, for instance, SGD with momentum, as well as Adagrad, RMSProp, and
several others. Momentum draws inspiration from physics.
A useful mental image here is to think of the optimization process as a small ball
rolling down the loss curve. If it has enough momentum, the ball won’t get stuck in a
ravine and will end up at the global minimum. Momentum is implemented by
moving the ball at each step based not only on the current slope value (current
acceleration) but also on the current velocity (resulting from past acceleration).
In practice, this means updating the parameter w based not only on the current
gradient value but also on the previous parameter update.
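A minimal sketch of a momentum update, assuming a made-up 1-D quadratic loss and illustrative values for the learning rate and momentum: the velocity carries part of the previous update into the current one.

```python
def gradient(w):
    """Gradient of the hypothetical loss (w - 3)^2."""
    return 2.0 * (w - 3.0)

learning_rate = 0.1
momentum = 0.9
w = 0.0
velocity = 0.0

for _ in range(300):
    # Velocity keeps a decaying memory of past updates.
    velocity = momentum * velocity - learning_rate * gradient(w)
    w = w + velocity

print(w)
```

With momentum = 0 the loop reduces to plain gradient descent.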
[Figure: Normal GD vs. GD w/ momentum]
Repeated presentation of data (sample by sample = Stochastic GD)
➢ Training Loops
➢ For 1..Num epochs
▶ For 1..Num examples (M)
w(k+1) = w(k) + delta_W
Repeated presentation of data (ALL data) = Batch GD
➢ Training Loops
➢ For 1..Num epochs
▶ For 1..Num examples (M): accumulate the gradient
▶ GD Update (once per epoch): w(k+1) = w(k) + delta_W (M)
Different training loop options - Introducing batches
GD (Batch GD)
The above GD algorithm computes the gradient for every new sample of x.
If X has 1000 samples, then delta_W[t+1] = -step * gradient(f)(W[t]) is
calculated and accumulated for every sample, but the weight
update is applied once, after all 1000 samples are fed.
➢ Update once at the end
➢ High accumulation of error, risk of saturation
➢ Slow feedback, slow convergence, but more stable
➢ Takes advantage of parallelism in matrix operations (comp. arch.)
Getting the gradient for all data
could be slow, and include many
errors before one update
loss=f(x,W) → 2 unknowns → we search for W, but
what about x?
GD→ dloss/dW, at which x?
So we estimate the gradient for different x’s:
Mini batch
➢ Training Loops with batches
➢ For 1..Num epochs
▶ For 1..Num batches (N_batches)
• For 1..Num of examples per batch (B)
– GD Updates: w(k+1) = w(k) + delta_W (B)
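The mini-batch loop above can be sketched as follows, assuming NumPy, synthetic data, and illustrative values for M, B, and the learning rate. The gradient is averaged over the B examples of each batch, and the weights are updated once per batch.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical toy data: y = 4 + 3x plus noise; M examples in total.
M, B = 200, 20                       # B = batch size, so N_batches = M / B
X = rng.uniform(0, 2, size=M)
y = 4 + 3 * X + rng.normal(0, 0.1, size=M)
X_b = np.column_stack([np.ones(M), X])

lr, n_epochs = 0.05, 200
w = np.zeros(2)

for epoch in range(n_epochs):
    order = rng.permutation(M)       # reshuffle the examples each epoch
    for start in range(0, M, B):
        idx = order[start:start + B]
        error = X_b[idx] @ w - y[idx]
        grad = 2 / len(idx) * X_b[idx].T @ error   # gradient over the batch
        w = w - lr * grad            # one update per batch

print(w)
```

Setting B = 1 turns this loop into SGD, and B = M turns it into batch GD.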
Introducing batches - Why?
Effect of batch size
➢ Batch Gradient Descent. Batch size is set to the total number of examples in the training dataset (M)
➢ Stochastic Gradient Descent. Batch size is set to one.
➢ Minibatch Gradient Descent. Batch size (=B) is set to more than one and less than the total number of examples in the
training dataset.
For shorthand, the algorithm is often referred to as stochastic gradient descent regardless of the batch size.
Objective: B=M
However, given that very large datasets are often used to train deep learning neural networks, the batch size is rarely set to the
size of the training dataset.
Batch size only affecting convergence speed, but not
Built-in in sklearn = SGDRegressor
epochs =
SGD + High LR
Different training loop options
There are other options of
1. SGD: Each sample
● Fast feedback, fast convergence (corrects itself
a lot)
● Could oscillate, unstable (tend to corrupt what it learnt) → To overcome reduce LR
● Not taking advantage of parallelism in matrix operations (comp. arch.)
● Not saturating
● Stochastic: every update uses a single sample as a noisy estimate of the true gradient
SGD + Low LR
One sample is not enough to estimate the gradient → So don’t take large steps
➢ Batch Gradient Descent: Use a relatively larger learning rate and more training epochs.
➢ Stochastic Gradient Descent: Use a relatively smaller learning rate and fewer training epochs.
Mini-batch gradient descent provides an alternative approach.
➢ Minibatch SGD: Every group of samples
➢ Accumulate B gradients, update every B
● More stable than 1 sample
● Faster than GD
Different training loop options
Effect of batch size in MBGD
The plots show that small batch results generally in rapid
learning but a volatile learning process with higher variance
in the classification accuracy.
Larger batch sizes slow down the learning process but the
final stages result in a convergence to a more stable model
exemplified by lower variance in classification accuracy.
Training loop options / Effect of batch size
SGD (available in sklearn and Keras), batch size = 1
  Pros: Fast feedback, fast convergence (corrects itself a lot). No risk of gradient saturation.
  Cons: Needs to reduce the LR since we make many corrections (could oscillate). Inaccurate gradient estimate.
BGD (manual in sklearn, available in Keras with batch size set to M), batch size = M (total samples)
  Pros: Takes advantage of parallelism in matrix operations (comp. arch.) → Better gradient estimate → More stable.
  Cons: Update once at the end → slow convergence. High accumulation of error → Risk of saturation.
MBGD (manual in sklearn, available in Keras), batch size = B (N_batches = M/B)
  Pros: Better estimates than SGD. Faster than BGD.
  Cons: Worse estimates than BGD (more oscillations). Slower than SGD.
Logistic Regression
Decision layer/Output (Ex: Logistic
Knows ML or not?
Logistic Regression for Binary Classification
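A minimal sketch of logistic regression for binary classification, assuming a made-up 1-D dataset (hours of study → knows ML or not) and plain gradient descent on the mean binary cross-entropy. The data, learning rate, and iteration count are illustrative assumptions.

```python
import math

# Hypothetical toy data: hours of study -> knows ML (1) or not (0).
X = [0.5, 1.0, 1.5, 2.0, 3.0, 3.5, 4.0, 4.5]
Y = [0, 0, 0, 0, 1, 1, 1, 1]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

w, b, lr = 0.0, 0.0, 0.5
for _ in range(5000):
    # Gradient of the mean binary cross-entropy w.r.t. w and b.
    errors = [sigmoid(w * x + b) - y for x, y in zip(X, Y)]
    grad_w = sum(e * x for e, x in zip(errors, X)) / len(X)
    grad_b = sum(errors) / len(X)
    w, b = w - lr * grad_w, b - lr * grad_b

boundary = -b / w   # where the predicted probability is exactly 0.5
predictions = [int(sigmoid(w * x + b) > 0.5) for x in X]
print(w, b, boundary, predictions)
```

The model is still linear in x; the sigmoid only squashes the linear score into a probability, so the decision boundary is a single threshold on x.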
Logistic Regression in
Remember this “band”/street
when we deal with SVM
Multi-class and softmax in
We can still see the decision boundaries for the 3 classes this time

