7. Face recognition with traditional AI
Rules (If-else):
Did you handle scale?
Did you handle colors?
Did you handle pose?
Did you handle clutter?
Too many cases!! Let's learn the rule!
10. ML: How to learn w's?
Assume only 2 q's:
Basically → Search: a·x1 + b·x2 + c (Model),
w1 = a, w2 = b, w0 = c (bias, to be clear later)
Try many lines, and get the one that separates the Data better → How good? Loss
Practically → Smarter methods (Optimizer) are used, better than brute force or random search!
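To restate the slide's notation cleanly: the 2-feature linear model and one possible loss (the squared-error choice below is an assumption; the slide only says "Loss"):

```latex
% Linear model over the two questions (features) x1, x2
\hat{y} = f_w(x) = w_1 x_1 + w_2 x_2 + w_0, \qquad w_1 = a,\; w_2 = b,\; w_0 = c
% One common loss choice (assumed here): mean squared error over M examples
L(w) = \frac{1}{M}\sum_{i=1}^{M}\left(y^{(i)} - \hat{y}^{(i)}\right)^2
```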
11. How to decide on questions?
We call the questions vector $q^i = [q_1, q_2, \dots, q_N]$ the features vector.
In the case of structured data like above, we normally choose them based on experience.
In the case of unstructured data like images, the input vector is just the pixel values. So the question is:
what are the features?
The answer to this question defines whether we do ML or DL.
In general, if we feed raw pixels, we do DL; if we define specific features, we do ML.
This will get clearer.
12. Semantic gap - What can the computer see? Traditional CV pipeline with OpenCV
[Figure: Model mapping input X to output y]
18. How to close the Gap? The deeper you go, the more you close the gap from raw (X) to semantics (y)
[Figure: Model mapping X to y; how to obtain the model?]
The model is usually hierarchical = layered
With the fewest layers, the model is just input transformations
More "refinements" means more depth
The deeper you go, the more the gap closes: X → y
23. Parametric vs Non-parametric
• Parametric: assume a family of functions (= distribution) from which the model is searched: $\hat{y} = f_\theta(X)$, read as a mapping function (model) of input X to output y, parametrized by theta.
• Non-parametric: we don't assume any function family. Usually heuristics. Examples: K-NN and Naive Bayes.
25. What is AI best at today?
[Figure: under ML, Supervised Learning splits into Classification (e.g., ID, classes A vs. B) and Regression (e.g., Price)]
26. Regression
Car Price    150K  200K  250K  ?     350K
Horse Power  72    90    110   130   180

Job Salary   10K   20K   ?     35K   50K
Domain       X     Y     X     Z     Z
Location     CAI   NY    PAR   NY    CAI
Grade        B     A     A     B     A

Pure DB query = fail
Need to interpolate
An output could depend on many factors = features
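To make the "need to interpolate" point concrete, here is a minimal sketch that fills in the missing car price from the toy table above (the use of scikit-learn's LinearRegression is an assumption; any regressor would do):

```python
# Minimal sketch: interpolate the missing car price from horsepower.
# Values come from the toy table above; assumes scikit-learn is installed.
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[72], [90], [110], [180]])             # horsepower of the known rows
y = np.array([150_000, 200_000, 250_000, 350_000])   # corresponding car prices

model = LinearRegression().fit(X, y)
print(model.predict([[130]]))   # estimate the missing price for the 130 HP car
```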
31. How does it work?
Ingredients
• Data:
  – X, Y (supervised)
• Model: function mapping:
  – Y' = f_w(X), w = w1, w2, ... etc.
• Loss:
  – How well does the Model fit the Data? → Y vs. Y'
• Optimizer:
  – Search for the best f_w that makes the Model fit the Data with minimum Loss.
32. ML: How to learn w's?
Assume only 2 q's:
Basically → Search: a·x1 + b·x2 + c (Model),
w1 = a, w2 = b, w0 = c (bias, to be clear later)
Try many lines, and get the one that separates the Data better → How good? Loss
Practically → Smarter methods (Optimizer) are used, better than brute force or random search!
36. Ingredients of supervised learning
• Data:
  – X, Y (supervised)
• Model: function mapping:
  – Y' = f_w(X), w = w1, w2, ... etc.
• Loss:
  – How well does the Model fit the Data? → Y vs. Y'
• Optimizer:
  – Search for the best f_w that makes the Model fit the Data with minimum Loss.
37. Parameters vs. Hyperparameters
• Parameter = values adjusted by the optimizer algorithm
  • Only on training data! Never on validation or test.
  • Ex: weights
• Hyperparameter = values adjusted by the algorithm designer (human)
  • Only on validation/training data! Never on test.
  • Ex: network architecture (number of layers, number of neurons per layer, ...etc), and optimizer hyperparameters (learning rate, momentum, ...etc) (more on that later)
51. Decision layer/Output (Ex: linear regression)
ReLU (if a negative output is not permitted) or Linear
Regression Design Pattern
https://static.packt-cdn.com/products/9781786469786/graphics/B05478_03_11.png
54. Decision layer/Output (Ex: Logistic regression)
Knows ML or not?
Binary Classification
https://static.wixstatic.com/media/93352b_8e7c7a60715b4728a9a7d13612c5fa58~mv2.gif
55. Multi-class classification
• Model:
  – Layers → features
  – Output → multiple linear neurons (like linear regression)
  – Each output neuron gives a score (unnormalized probability or logit)
  – Softmax → normalize
  – Prediction = argmax
  – Ensures only 1 class is possible
https://deepnotes.io/public/images/softmax.png
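As a quick illustration of the score → softmax → argmax path described above (a minimal NumPy sketch; the logit values are made up):

```python
import numpy as np

def softmax(logits):
    # Subtract the max for numerical stability, then normalize to probabilities.
    z = logits - np.max(logits)
    e = np.exp(z)
    return e / e.sum()

logits = np.array([2.0, 0.5, -1.0])   # unnormalized scores from the output neurons
probs = softmax(logits)               # probabilities, they sum to 1
pred = np.argmax(probs)               # exactly one predicted class
print(probs, pred)
```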
58. Multi-label Classification
• Loss: BCE over each neuron
• Ask every neuron: is it Human? Is it a person? ...etc.
• More than 1 label is possible
https://miro.medium.com/max/3616/1*s6Tm6f3cPhHEFdEjuCMMKQ.jpeg
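A minimal Keras-style sketch of this setup (one sigmoid neuron per label, binary cross-entropy applied to each; the layer sizes, input size, and num_labels are illustrative assumptions):

```python
# Multi-label head: independent sigmoid per label, BCE evaluated on each neuron.
from tensorflow import keras
from tensorflow.keras import layers

num_labels = 5   # illustrative

model = keras.Sequential([
    keras.Input(shape=(100,)),                        # input size is illustrative
    layers.Dense(64, activation="relu"),
    layers.Dense(num_labels, activation="sigmoid"),   # per-label probabilities
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
# At prediction time, threshold each probability (e.g., > 0.5),
# so more than one label can be active at once.
```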
60. Let's code
• Classifying movie reviews: a binary classification example
• Digit classification: a multi-class classification example
• Classifying newswires: a multi-class classification example
• Predicting house prices: a regression example
62. Classifying newswires: a multi-class classification example
SparseCategoricalCrossentropy
Notice: although y_train is NOT one-hot encoded (OHE), we still set the model output to Dense(num_classes, "softmax").
Not only that, the predict API will still return num_classes probabilities, and we argmax them.
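A minimal Keras sketch of the point above (integer labels with a softmax output; the architecture and sizes are illustrative assumptions, loosely following the Reuters newswires setup):

```python
# y_train holds integer class ids (not one-hot), yet the output layer is still a
# softmax over num_classes, and the loss is SparseCategoricalCrossentropy.
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

num_classes = 46   # e.g., the Reuters newswires dataset has 46 topics

model = keras.Sequential([
    keras.Input(shape=(10000,)),        # vectorized text input (illustrative)
    layers.Dense(64, activation="relu"),
    layers.Dense(num_classes, activation="softmax"),
])
model.compile(optimizer="rmsprop",
              loss=keras.losses.SparseCategoricalCrossentropy(),
              metrics=["accuracy"])

# predict() still returns num_classes probabilities per sample; we argmax them:
# preds = model.predict(x_test)           # shape: (num_samples, num_classes)
# labels = np.argmax(preds, axis=-1)
```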
64. Introduction to sklearn
– Sklearn conventions
– Model persistence (save/load)
– Dealing with different data variables (categorical, nominal, ordinal, ...etc)
– Dealing with different problem types (Multi-label, Multi-class, ...)
65. Introduction to sklearn
https://colab.research.google.com/drive/1YMOpQAh3gN_A013IHMPvhFpoKcbMUK2B?usp=sharing
We have seen how to deal with different problem setups in Keras, which targets one model type (NN).
Now, let's explore a more generic framework that targets different ML models.
This is all you need from sklearn.
We won't go through everything, only some parts to stress; we see the rest as we go.
sklearn is generic for ML
keras/tf/pytorch are more for DNN
66. Introduction to sklearn
The main design principles are:
- Consistency. All objects share a consistent and simple interface.
- Inspection. All the estimator's hyperparameters are accessible directly via public instance variables (e.g., imputer.strategy), and all the estimator's learned parameters are also accessible via public instance variables with an underscore suffix (e.g., imputer.statistics_).
- Nonproliferation of classes. Datasets are represented as NumPy arrays or SciPy sparse matrices, instead of homemade classes. Hyperparameters are just regular Python strings or numbers.
- Composition. Existing building blocks are reused as much as possible. For example, it is easy to create a Pipeline estimator from an arbitrary sequence of transformers followed by a final estimator, as we will see.
- Sensible defaults. Scikit-Learn provides reasonable default values for most parameters, making it easy to create a baseline working system quickly.
67. Sklearn Interfaces
- Estimators. Any object that can estimate some parameters based on a dataset is called an estimator (e.g., an imputer is an estimator). The estimation itself is performed by the fit() method, and it takes only a dataset as a parameter (or two for supervised learning algorithms; the second dataset contains the labels). Any other parameter needed to guide the estimation process is considered a hyperparameter (such as an imputer's strategy), and it must be set as an instance variable (generally via a constructor parameter).
- Transformers. Some estimators (such as an imputer) can also transform a dataset; these are called transformers. Once again, the API is quite simple: the transformation is performed by the transform() method with the dataset to transform as a parameter. It returns the transformed dataset. This transformation generally relies on the learned parameters, as is the case for an imputer. All transformers also have a convenience method called fit_transform() that is equivalent to calling fit() and then transform() (but sometimes fit_transform() is optimized and runs much faster).
- Predictors. Finally, some estimators are capable of making predictions given a dataset; they are called predictors. For example, the LinearRegression model in the previous chapter was a predictor: it predicted life satisfaction given a country's GDP per capita. A predictor has a predict() method that takes a dataset of new instances and returns a dataset of corresponding predictions. It also has a score() method that measures the quality of the predictions given a test set (and the corresponding labels in the case of supervised learning algorithms).
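A compact sketch tying the three interfaces together (estimator, transformer, predictor) using standard scikit-learn classes; the tiny dataset is made up for illustration:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression

X = np.array([[1.0], [2.0], [np.nan], [4.0]])
y = np.array([2.0, 4.0, 6.0, 8.0])

# Estimator + Transformer: fit() learns the statistic, transform() applies it.
imputer = SimpleImputer(strategy="mean")       # strategy = hyperparameter
X_clean = imputer.fit_transform(X)
print(imputer.statistics_)                     # learned parameter, underscore suffix

# Predictor: fit(), then predict() and score().
reg = LinearRegression().fit(X_clean, y)
print(reg.predict([[5.0]]))
print(reg.score(X_clean, y))
```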
70. Gradient Based Optimization
– Optimization as a search problem
– Gradient based optimization for linear classifiers/regressors
– Linear Regression
– SGD
– Logistic Regression
– Overfitting vs Underfitting
– Regularization
– Deeper models
71. How does the optimizer work for Parametric models?
• Option 1: Closed-form solution → Let's solve the equations!
• Option 2: Search!
  – Brute force: try all possible values of w's and choose the set with min loss → Numerical solution = optimization
  – Random search: same as brute force, but choose a random subset of all possible w's
  – Guided search: start from random w's, measure the loss, then choose the next set of w's, guided by the change in the loss.
On which data do we compute the loss?
How do we use the loss change as guidance?
https://www.manning.com/books/deep-learning-with-python
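To make the search options concrete, here is a minimal random-search sketch for a one-feature linear model (the toy data and number of trials are illustrative assumptions; guided search is what gradient descent will do later):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.array([1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0                      # toy data generated with w1=2, w0=1

def mse(w1, w0):
    return np.mean((y - (w1 * x + w0)) ** 2)

# Random search: try random w's and keep the set with the minimum loss.
best = (None, None, np.inf)
for _ in range(10_000):
    w1, w0 = rng.uniform(-5, 5, size=2)
    loss = mse(w1, w0)
    if loss < best[2]:
        best = (w1, w0, loss)
print(best)   # close to (2, 1, ~0)
```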
72. On which data to compute the loss?
• Search in training data → Loss evaluation + Parameter updates
• Evaluate on unseen testing data → Loss evaluation, but NO parameter updates allowed
83. Normal equation
• Disadvantages:
  – Slow (X is a big matrix; the inverse is costly and does not always exist)
  – Uses all the data points → might overfit (more on that later)
  – It doesn't work for classification, only regression, since some equations will vanish (correct classifications), leading to more degrees of freedom and fewer constraints, and we'll have infinite solutions (infinite lines can separate the classes with 0 MSE)
https://raw.githubusercontent.com/qingkaikong/blog/master/43_support_vector_machine_part1/figures/figure_3.jpg
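For reference, the closed-form solution the slide refers to is $w = (X^\top X)^{-1} X^\top y$; a minimal NumPy sketch on toy data (pinv is used to sidestep the case where $X^\top X$ is not invertible):

```python
import numpy as np

# Toy regression data: y = 3x + 2, with a bias column appended to X.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = 3.0 * x + 2.0
X = np.column_stack([x, np.ones_like(x)])    # add bias column

# Normal equation: w = (X^T X)^{-1} X^T y  (pinv handles a singular X^T X)
w = np.linalg.pinv(X.T @ X) @ X.T @ y
print(w)   # ~[3., 2.]
```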
84. ANOTHER ISSUE: Non-representation of data
• Sources:
  – Sample noise: small data
  – Sample bias: gender, ethnicity, popularity, ...etc
  – Poor quality (garbage in, garbage out): mislabels
  – Missing data
• This results in poor generalization = overfitting
• The effect is magnified if the model is allowed to use all the training data, as in the case of the normal equation
• As can be seen, with the missing countries added, we get a completely different curve!
86. Gradient based optimization
A gradient is a rate of change of the output with a change in the input.
Meaning: how much y changes when we change x by a small amount.
In other words, it's the rate of change of y with x.
It also encodes the sensitivity of y to x.
In the case of a loss, the gradient $\frac{\partial \text{loss}}{\partial w}$ is how much the loss changes when we change a weight by a small amount. It guides our search for the weights that minimize the loss; if we change w a bit and the loss is affected a lot, it means that the loss is sensitive to this change.
In the case of MSE loss, for example, since the loss is convex, any change in w moves the loss away from the minimum. So if we move against the gradient, we reach the minimum. If this change value is small, it means we need to move a small step (we are already at the min: $\frac{\partial \text{loss}}{\partial w} = 0$), and vice versa.
https://www.manning.com/books/deep-learning-with-python
87. Gradient descent
When w_new = w_old, then the grad = 0, so we have reached the min (the same condition we wanted in the normal equation and closed-form solution).
grad = delta_W → a numerical estimate of the true gradient (dJ/dW) → How to calculate that estimate?
http://users.isr.ist.utl.pt/~wurmd/Livros/school/Bishop%20%20Pattern%20Recognition%20And%20Machine%20Learning%20-%20Springer%20%202006.pdf
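One simple way to calculate such an estimate is a finite-difference approximation, nudging each weight and measuring the change in loss (a minimal sketch; the example loss function and step size are illustrative):

```python
import numpy as np

def numerical_gradient(loss_fn, w, eps=1e-6):
    """Estimate dJ/dW: perturb each weight by +/- eps and measure the loss change."""
    grad = np.zeros_like(w)
    for i in range(len(w)):
        w_plus, w_minus = w.copy(), w.copy()
        w_plus[i] += eps
        w_minus[i] -= eps
        grad[i] = (loss_fn(w_plus) - loss_fn(w_minus)) / (2 * eps)
    return grad

# Example loss: J(w) = (w0 - 3)^2 + (w1 + 1)^2, minimum at [3, -1]
loss = lambda w: (w[0] - 3) ** 2 + (w[1] + 1) ** 2
print(numerical_gradient(loss, np.array([0.0, 0.0])))   # ~[-6., 2.]
```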
89. Training loop
For number of epochs:
  • For each example in the dataset:
    – Feedforward: compute the error
    – Feedback: compute the gradient
    – Update the params
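A minimal sketch of this per-example loop for a one-feature linear model (toy data; the learning rate and epoch count are illustrative assumptions):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0                        # toy data: true w1=2, w0=1
w1, w0, lr = 0.0, 0.0, 0.01

for epoch in range(200):                 # for number of epochs
    for xi, yi in zip(x, y):             # for each example in the dataset
        err = (w1 * xi + w0) - yi        # feedforward: compute the error
        dw1, dw0 = 2 * err * xi, 2 * err # feedback: gradient of the squared error
        w1 -= lr * dw1                   # update the params
        w0 -= lr * dw0
print(w1, w0)                            # approaches ~2, ~1
```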
90. Let's code: Batch Gradient Descent
When using Gradient Descent, you should ensure that all features have a similar scale (e.g., using Scikit-Learn's StandardScaler class), or else it will take much longer to converge.
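A minimal Batch GD sketch on scaled features (toy two-feature data; the learning rate, epoch count, and the specific use of StandardScaler here are illustrative assumptions):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Two features on very different scales (0..100 vs 0..1).
X = np.column_stack([rng.uniform(0, 100, 50), rng.uniform(0, 1, 50)])
y = 3 * X[:, 0] - 0.5 * X[:, 1] + 10

X_scaled = StandardScaler().fit_transform(X)               # similar scale -> faster convergence
Xb = np.column_stack([X_scaled, np.ones(len(X_scaled))])   # add bias column
w, lr = np.zeros(3), 0.1

for epoch in range(500):
    grad = 2 / len(Xb) * Xb.T @ (Xb @ w - y)   # gradient over ALL samples (batch GD)
    w -= lr * grad                             # one update per epoch
print(w)   # weights in the standardized feature space, plus bias
```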
92. Learning rate
As you can see, intuitively it's important to pick a reasonable value for the step factor.
If it's too small, the descent down the curve will take many iterations, and it could get stuck in a local minimum.
If the step is too large, your updates may end up taking you to completely random locations on the curve.
https://cdn-images-1.medium.com/max/1000/1*YQWdnHVTRPGjr0-VGuSxyA.jpeg
98. Momentum
Additionally, there exist multiple variants of SGD that differ by taking into account previous weight updates when computing the next weight update, rather than just looking at the current value of the gradients.
There is, for instance, SGD with momentum, as well as Adagrad, RMSProp, and several others. Momentum draws inspiration from physics.
A useful mental image here is to think of the optimization process as a small ball rolling down the loss curve. If it has enough momentum, the ball won't get stuck in a ravine and will end up at the global minimum. Momentum is implemented by moving the ball at each step based not only on the current slope value (current acceleration) but also on the current velocity (resulting from past acceleration).
In practice, this means updating the parameter w based not only on the current gradient value but also on the previous parameter update, such as in this naive implementation:
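The code screenshot is not included in this text export; below is a self-contained sketch of what such a naive momentum update looks like (the toy quadratic loss, learning rate, and momentum value are illustrative assumptions):

```python
# Naive momentum sketch on a toy quadratic loss J(w) = (w - 3)^2.
learning_rate, momentum = 0.1, 0.9
w, velocity = 0.0, 0.0

for step in range(200):
    gradient = 2 * (w - 3)                                      # current slope (dJ/dw)
    velocity = momentum * velocity - learning_rate * gradient   # keep memory of past updates
    w += velocity                                               # update uses velocity, not just the current gradient
print(w)   # ~3 (the minimum)
```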
[Figures: Normal GD vs. GD with momentum]
https://i.pinimg.com/236x/31/90/2e/31902e5c838575c5aae6535181740cb9.jpg?nii=t
https://machinelearningmastery.com/gradient-descent-with-momentum-from-scratch/
103. Repeated representation of data (ALL data): Batch GD
• Training Loops
  • For 1..Num epochs
    – For 1..Num examples (M): delta_W(M) += GRAD(EXAMPLE)
    – GD Update: w(k+1) = w(k) + delta_W(M)
104. Different training loop options - Introducing batches
GD (Batch GD)
The above GD algorithm updates the gradient with every new sample of x.
If X has 1000 samples, then delta_W[t+1] = step * gradient(f)(W[t]) is calculated and accumulated for every sample, but the weight update is applied once, after all 1000 samples have been fed.
• Update once at the end
• High accumulation of error, risk of saturation
• Slow feedback, slow convergence, but more stable
• Takes advantage of parallelism in matrix operations (comp. arch.)
https://colab.research.google.com/drive/11_-SxhdtxvRYdPukURz-cojd7-JdpiKR?authuser=1
Getting the gradient for all the data could be slow, and it accumulates many errors before one update.
loss = f(x, W) → 2 unknowns → we search for W, but what about x?
GD → dloss/dW, at which x?
So we estimate the gradient for different x's:
Batch = ALL
Stochastic = One
Mini-batch = in between
105. Introducing Batches - Why?
• Training Loops with batches
  • For 1..Num epochs
    – For 1..Num batches (N_batches)
      • For 1..Num of examples per batch (B)
      – GD Update: w(k+1) = w(k) + delta_W(B)
106. Effect of batch size
• Batch Gradient Descent: batch size is set to the total number of examples in the training dataset (M).
• Stochastic Gradient Descent: batch size is set to one.
• Minibatch Gradient Descent: batch size (=B) is set to more than one and less than the total number of examples in the training dataset.
For shorthand, the algorithm is often referred to as stochastic gradient descent regardless of the batch size.
Objective: B = M
However, given that very large datasets are often used to train deep learning neural networks, the batch size is rarely set to the size of the training dataset.
Batch size only affects convergence speed, not final performance.
108. Different training loop options
SGD
There are other options for updating:
1. SGD: each sample (batch_size=1)
  – Fast feedback, fast convergence (corrects itself a lot)
  – Could oscillate, unstable (tends to corrupt what it has learnt) → To overcome, reduce the LR
  – Does not take advantage of parallelism in matrix operations (comp. arch.)
  – Not saturating
  – Stochastic: every update is a representative sample of the true gradient
[Figure: SGD + High LR vs. SGD + Low LR]
One sample is not enough to estimate the gradient → so don't take large steps.
109. SGD vs BGD
• Batch Gradient Descent: use a relatively larger learning rate and more training epochs.
• Stochastic Gradient Descent: use a relatively smaller learning rate and fewer training epochs.
Mini-batch gradient descent provides an alternative approach.
110. Different training loop options
• Minibatch SGD: every group of samples (batch_size=B)
• Accumulate B gradients, update every B samples
  – More stable than 1 sample
  – Faster than GD
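A minimal minibatch loop sketch (toy data; the batch size, learning rate, and epoch count are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 1))
y = 2.0 * X[:, 0] + 1.0                        # toy data: true weight 2, bias 1
Xb = np.column_stack([X, np.ones(len(X))])     # add bias column
w, lr, B = np.zeros(2), 0.1, 10                # batch size B

for epoch in range(100):
    idx = rng.permutation(len(Xb))             # shuffle each epoch
    for start in range(0, len(Xb), B):
        batch = idx[start:start + B]
        grad = 2 / B * Xb[batch].T @ (Xb[batch] @ w - y[batch])   # gradient over B samples
        w -= lr * grad                          # one update per minibatch
print(w)   # ~[2., 1.]
```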
111. Effect of batch size in MBGD
The plots show that a small batch size generally results in rapid learning, but a volatile learning process with higher variance in the classification accuracy.
Larger batch sizes slow down the learning process, but the final stages result in convergence to a more stable model, exemplified by lower variance in classification accuracy.
113. Training loop options / Effect of batch size
• SGD (available in sklearn and Keras), Batch_sz = 1
  – Pros: fast feedback, fast convergence (corrects itself a lot); no risk of gradient saturation
  – Cons: needs a reduced LR since we make many corrections (could oscillate); inaccurate gradient estimate
• BGD (manual in sklearn, available in Keras with batch_size=M), Batch_sz = M (total samples)
  – Pros: takes advantage of parallelism in matrix operations (comp. arch.) → better gradient estimate → more stable
  – Cons: update once at the end → slow convergence; high accumulation of error → risk of saturation
• MBGD (manual in sklearn, available in Keras), Batch_sz = B (N_batches = M/B)
  – Pros: better estimates than SGD; faster than BGD
  – Cons: worse estimates than BGD (more oscillations); slower than SGD