Review
Semantic gap - What can the computer see?
????
Semantic gap - What can the computer read?
Semantic gap in text - What can computers read?
In images, we have a matrix representation.
In text → ??
→ Naive = BoW (bag of words) against a vocabulary
Semantic gap - What is the best representation of sensory signals for computers?
Old problem → ADC
Coursat.ai
Face recognition with traditional AI
Rules (If-else) → ID
Rule-based AI
Face recognition with traditional AI
Rules (If-else) → ID
Did you handle scale?
Did you handle colors?
Did you handle pose?
Did you handle clutter?
Too many cases!! Let’s learn the rule!
Rule-based AI vs. ML
“Machine learning is the science of getting
computers to act without being explicitly
programmed”, Arthur Samuel 1959
Deep Learning vs. Machine Learning
(Table: Price, Floor, Area, Year built)
ML: How to learn w’s?
Assume only 2 q’s:
Basically → Search: ax1 + bx2 + c (Model) → w1 = a, w2 = b, w0 = c (bias, to be clarified later)
Try many lines, and pick the one that separates the Data best → How good? Loss
Practically → Smarter methods (Optimizer) are used, better than brute force or random search!
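A minimal numpy sketch of this “try many lines” search; the toy 2-feature data and the misclassification loss below are illustrative assumptions, not from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 2-feature dataset: class 1 is roughly above the line x1 + x2 = 1
X = rng.uniform(0, 1, size=(200, 2))
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)

def loss(w1, w2, w0):
    """Misclassification rate of the line w1*x1 + w2*x2 + w0 = 0."""
    pred = (w1 * X[:, 0] + w2 * X[:, 1] + w0 > 0).astype(int)
    return np.mean(pred != y)

# Random search: try many lines, keep the one with the smallest loss
best_w, best_loss = None, np.inf
for _ in range(1000):
    w = rng.uniform(-1, 1, size=3)          # candidate (w1, w2, w0)
    l = loss(*w)
    if l < best_loss:
        best_w, best_loss = w, l

print("best w:", best_w, "loss:", best_loss)
```

In practice the optimizer replaces this random search with a guided (gradient-based) one, as discussed later.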
How to decide on questions?
We call the questions vector $q = [q_1, q_2, \dots, q_N]$ the features vector.
In the case of structured data like above, we normally choose them based on experience.
In the case of unstructured data like images, the input vector is just the pixel values. So the question is: what are the features?
The answer to this question defines whether we do ML or DL.
In general, if we feed raw pixels, we do DL; if we define specific features, we do ML.
This will become clearer.
Semantic gap - What can the computer see? Traditional CV pipeline with OpenCV
X → Model → y
Deep Learning vs. Machine Learning
ID
What do neurons represent?
What could be good face features?
What can neurons see?
Ex: Semantic gap in Tabular data
Or you can have derived features
The least is a simple input transform
How to close the Gap? The deeper you go, the more you close the gap from raw (X) to semantics (y)
X → Model → y
How to obtain the model?
The model is usually hierarchical = layered
The least #layers is input transformations
More “refinements” means more depth
The deeper you go, the more the gap closes: X → y
The “Deep” in Deep Learning
Hierarchical Representation
Supervised Learning
What can AI do?
Artificial Narrow Intelligence: Structured or unstructured
Structured data | Unstructured data
Parametric Models
Parametric vs Non-parametric
➢ Parametric: assume a family of functions (= distribution) from which the model is searched; read as a mapping function (model) of input X to output y, parametrized by theta.
➢ Non-parametric: we don’t assume any function family. Usually heuristics. Examples: K-NN and Naive Bayes.
Models Anatomy
Models → Learning Method: Rule-based vs. Statistical Learning (Supervised, Unsupervised, Reinforcement Learning)
Models → Modeling Choice: Non-parametric vs. Parametric
What is AI best at today?
Supervised Learning
Classification (ID) | Regression (Price)
Regression
Car Price:    150K  200K  250K  ?    350K
Horse Power:  72    90    110   130  180

Job Salary:  10K  20K  ?    35K  50K
Domain:      X    Y    X    Z    Z
Location:    CAI  NY   PAR  NY   CAI
Grade:       B    A    A    B    A

Pure DB query = fail
Need to interpolate
An output could depend on many factors = features
Functional approximation
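As a sketch of that functional approximation, we can fit a line to the horse power / price pairs above and interpolate the missing value (assuming sklearn is available; only the table values are used):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Known (horse power, price) pairs from the table; the price at 130 HP is unknown
hp = np.array([[72], [90], [110], [180]])
price = np.array([150, 200, 250, 350])        # in K

model = LinearRegression().fit(hp, price)     # functional approximation
print(model.predict([[130]]))                 # interpolated price for 130 HP, in K
```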
Classification
Car Category:  Eco  Low  Mid  High  Premium
Horse Power:   72   90   110  130   180

Car Category (multi-class): Eco=0, Low=1, Mid=2, High=3, Premium=4
Car Category (binary): Low=0, High=1

Discrimination = Classification
Optimization problem = Min. Error
How does it work?
Ingredients
• Data:
– X, Y (supervised)
• Model: function mapping:
– Y’ = fw(X), w = 1, 2, …etc.
• Loss:
– How well does the Model fit the Data? → Y vs. Y’
• Optimizer:
– Search for the best fw that makes the Model fit the Data with min. Loss.
ML: How to learn w’s?
Assume only 2 q’s:
Basically → Search: ax1 + bx2 + c (Model) → w1 = a, w2 = b, w0 = c (bias, to be clarified later)
Try many lines, and pick the one that separates the Data best → How good? Loss
Practically → Smarter methods (Optimizer) are used, better than brute force or random search!
How does DL work?
1. Model
How does DL work?
2. Data
How does DL work?
3. Loss
4. Optimizer
The Optimizer does the “search”
using backpropagation
Ingredients of supervised learning
➢ Data:
● X, Y (supervised)
➢ Model: function mapping:
● Y’ = fw(X), w = 1, 2, …etc.
➢ Loss:
● How well does the Model fit the Data? → Y vs. Y’
➢ Optimizer:
● Search for the best fw that makes the Model fit the Data with min. Loss.
Parameters vs. Hyperparameters
• Parameter = values adjusted by the optimizer algorithm
• Only on training data! Never on validation or test.
• Ex: weights
• Hyperparameter = values adjusted by the algorithm designer (human)
• Only on validation/training data! Never on test.
• Ex: network architecture (number of layers, number of neurons per layer, …etc.), and optimizer hyperparameters (learning rate, momentum, …etc.) (more on that later)
Let’s code
X → Model → y = 7
Basic Keras Program
Let’s code
Inference path:
X → Model → y = 7
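A minimal sketch of such a basic Keras program, covering both the training path (fit) and the inference path (predict); the toy data and layer sizes below are placeholders, not the slide’s actual code:

```python
import numpy as np
from tensorflow import keras

# Placeholder data: X -> y (binary labels)
X = np.random.rand(100, 4)
y = np.random.randint(0, 2, size=(100,))

# Model
model = keras.Sequential([
    keras.Input(shape=(4,)),
    keras.layers.Dense(8, activation="relu"),
    keras.layers.Dense(1, activation="sigmoid"),
])

# Loss + Optimizer
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Training path: fit the model to (X, y)
model.fit(X, y, epochs=5, batch_size=16, verbose=0)

# Inference path: X -> Model -> y
print(model.predict(X[:3]))
```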
Let’s code
Problem types
Binary Classification | Multi-class Classification | Regression
X → y
Problem type depends on the target variable = y
Supervised problem types
Global ML/DL framework
Input → Features (Wfeatures) → Decision (Wcls) → Output
Features: Manual = Hand-crafted = Engineered = ML
Auto = NN = DL
Deep or shallow?
https://www.coursera.org/learn/ai-for-everyone
Deep or shallow?
https://www.manning.com/books/deep-learning-with-python | Deep learning summer school, Montreal, 2015: Convolutional NN, Lee
Universal SL pattern: Encoder-Decoder Design Pattern
Input → Features → Decision → Output
x → Encoder (x2v) → v → Decoder (v2y) → y
img → img2vec → v → vec2task → y
seq → seq2vec → v → vec2task → y
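A minimal Keras sketch of this encoder-decoder pattern; the layer sizes and the Flatten-based img2vec encoder are illustrative assumptions:

```python
from tensorflow import keras
from tensorflow.keras import layers

# Encoder (x2v): image -> vector v
encoder = keras.Sequential([
    keras.Input(shape=(28, 28)),
    layers.Flatten(),
    layers.Dense(64, activation="relu"),   # v
], name="img2vec")

# Decoder (v2y): vector v -> task output (here a 10-class decision)
decoder = keras.Sequential([
    keras.Input(shape=(64,)),
    layers.Dense(10, activation="softmax"),
], name="vec2task")

# Input -> Encoder -> v -> Decoder -> Output
model = keras.Sequential([encoder, decoder])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.summary()
```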
Activations and Losses
Model vs. Problem summary
https://www.manning.com/books/deep-learning-with-python
Regardless of layer types, we can design two things:
- Last layer activation
- Loss function
based on the problem type
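As a sketch of those two design choices, the usual last-layer activation / loss pairings look like this in Keras compile calls (the layer sizes are placeholders):

```python
from tensorflow import keras
from tensorflow.keras import layers

def head(units, activation):
    """A placeholder model whose last layer we vary per problem type."""
    return keras.Sequential([
        keras.Input(shape=(8,)),
        layers.Dense(16, activation="relu"),
        layers.Dense(units, activation=activation),
    ])

# Binary classification: sigmoid + binary crossentropy
head(1, "sigmoid").compile(optimizer="adam", loss="binary_crossentropy")

# Multi-class, single label: softmax + categorical crossentropy
head(10, "softmax").compile(optimizer="adam", loss="categorical_crossentropy")

# Multi-label: one sigmoid per class + binary crossentropy
head(10, "sigmoid").compile(optimizer="adam", loss="binary_crossentropy")

# Regression: linear output (no activation) + MSE
head(1, None).compile(optimizer="adam", loss="mse")
```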
Regression Design Pattern
Model (Ex: linear regression)
How much ML do they know?
https://miro.medium.com/max/1826/1*L9xLcwKhuZ2cuS8fF0ZjwA.png
Decision layer/Output (Ex: linear regression)
ReLU (if negative values are not permitted) or Linear
Regression Design Pattern
https://static.packt-cdn.com/products/9781786469786/graphics/B05478_03_11.png
Loss (Ex: linear regression)
Regression Design Pattern
Binary Classification
Model (Ex: Logistic regression)
Knows ML or not?
Decision layer/Output (Ex: Logistic regression)
Knows ML or not?
Binary Classification
https://static.wixstatic.com/media/93352b_8e7c7a60715b4728a9a7d13612c5fa58~mv2.gif
➢ Model:
● Layers → Wfeatures
● Output → Multiple linear neurons (like linear regression)
● Each output neuron gives a score (unnormalized probability or logit)
● Softmax → Normalize
● Prediction = argmax
● Ensures only 1 class is possible
Multi-class classification
https://deepnotes.io/public/images/softmax.png
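A small numpy sketch of the softmax-then-argmax step described above (the logit values are made up):

```python
import numpy as np

logits = np.array([2.0, 1.0, 0.1])       # unnormalized scores from the output neurons

def softmax(z):
    e = np.exp(z - np.max(z))            # shift for numerical stability
    return e / e.sum()

probs = softmax(logits)                  # normalized probabilities, sum to 1
pred = np.argmax(probs)                  # single predicted class

print(probs, pred)
```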
Binary Cross Entropy
https://gombru.github.io/assets/cross_entropy_loss/intro.png
Multi-label Classification
➢ Model:
● Layers
● Multiple sigmoids
https://miro.medium.com/max/3616/1*s6Tm6f3cPhHEFdEjuCMMKQ.jpeg
➢ Loss: BCE over each neuron
➢ Ask every neuron: is it Human? Is it a person? ...etc.
➢ More than 1 is possible
Multi-label Classification
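A minimal Keras sketch of this multi-label head: one sigmoid per label, BCE over each neuron (the label count and layer sizes are placeholders):

```python
from tensorflow import keras
from tensorflow.keras import layers

num_labels = 5   # e.g. human, person, car, dog, cat (placeholder label set)

model = keras.Sequential([
    keras.Input(shape=(20,)),
    layers.Dense(32, activation="relu"),
    layers.Dense(num_labels, activation="sigmoid"),  # one independent sigmoid per label
])

# BCE is applied to each output neuron; more than one label can be "on" at once
model.compile(optimizer="adam", loss="binary_crossentropy")
```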
Model vs. Problem summary
https://www.manning.com/books/deep-learning-with-python
Let’s code
➢ Classifying movie reviews: a binary classification example
➢ Digit classification: a multi-class classification example
➢ Classifying newswires: a multi-class classification example
➢ Predicting house prices: a regression example
Assignment
Predicting House Prices:
Classifying newswires: a multi-class classification example
Sparse Categorical Cross Entropy
Notice: while y_train is NOT one-hot encoded, we still set the model output to Dense(num_classes, ‘softmax’).
Not only that, the predict API shall still return num_classes probabilities, and we argmax them.
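A minimal sketch of that setup: integer (non-OHE) labels with sparse categorical cross entropy, a Dense(num_classes, ‘softmax’) output, and argmax over the predicted probabilities; the data shapes are placeholders, not the Reuters example itself:

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

num_classes = 46                                    # e.g. newswire topics
X = np.random.rand(100, 1000).astype("float32")     # placeholder features
y = np.random.randint(0, num_classes, size=(100,))  # integer labels, NOT one-hot

model = keras.Sequential([
    keras.Input(shape=(1000,)),
    layers.Dense(64, activation="relu"),
    layers.Dense(num_classes, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",  # accepts integer labels
              metrics=["accuracy"])
model.fit(X, y, epochs=2, batch_size=32, verbose=0)

probs = model.predict(X[:3])           # shape (3, num_classes): probabilities
print(np.argmax(probs, axis=1))        # argmax to get the class index
```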
Introduction to sklearn
● sklearn conventions
● Model persistence (save/load)
● Dealing with different data variables (categorical, nominal, ordinal, …etc.)
● Dealing with different problem types (Multi-label, Multi-class, …)
Introduction to sklearn
https://colab.research.google.com/drive/1YMOpQAh3gN_A013IHMPvhFpoKcbMUK2B?usp=sharing
We have seen how to deal with different problem setups in Keras, which targets one model type (NN).
Now, let’s explore a more generic framework that targets different ML models.
This is all you need from sklearn.
We won’t go through all of it, only some parts to stress; we will see the rest as we go.
sklearn is generic for ML
keras/tf/pytorch are more for DNNs
Introduction to sklearn
The main design principles are:
- Consistency. All objects share a consistent and simple interface
- Inspection. All the estimator’s hyperparameters are accessible directly via public instance variables (e.g., imputer.strategy),
and all the estimator’s learned parameters are also accessible via public instance variables with an underscore suffix (e.g.,
imputer.statistics_).
- Nonproliferation of classes. Datasets are represented as NumPy arrays or SciPy sparse matrices, instead of homemade
classes. Hyperparameters are just regular Python strings or numbers.
- Composition. Existing building blocks are reused as much as possible. For example, it is easy to create a Pipeline
estimator from an arbitrary sequence of transformers followed by a final estimator, as we will see.
- Sensible defaults. Scikit-Learn provides reasonable default values for most parameters, making it easy to create a baseline
working system quickly.
SklearnInterfaces
- Estimators. Any object that can estimate some parameters based on a dataset is called an estimator (e.g., an imputer is
an estimator). The estimation itself is performed by the fit() method, and it takes only a dataset as a parameter (or two for
supervised learning algorithms; the second dataset contains the labels). Any other parameter needed to guide the
estimation process is considered a hyperparameter (such as an imputer’s strategy), and it must be set as an instance
variable (generally via a constructor parameter).
- Transformers. Some estimators (such as an imputer) can also transform a dataset; these are called transformers. Once
again, the API is quite simple: the transformation is performed by the transform() method with the dataset to transform as a
parameter. It returns the transformed dataset. This transformation generally relies on the learned parameters, as is the
case for an imputer. All transformers also have a convenience method called fit_transform() that is equivalent to calling fit()
and then transform() (but sometimes fit_transform() is optimized and runs much faster).
- Predictors. Finally, some estimators are capable of making predictions given a dataset; they are called predictors. For
example, the LinearRegression model in the previous chapter was a predictor: it predicted life satisfaction given a country’s
GDP per capita. A predictor has a predict() method that takes a dataset of new instances and returns a dataset of
corresponding predictions. It also has a score() method that measures the quality of the predictions given a test set (and
the corresponding labels in the case of supervised learning algorithms).
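A small sketch of these three interfaces with real sklearn classes (SimpleImputer as estimator/transformer, LinearRegression as predictor); the toy data is made up:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression

X = np.array([[1.0, 2.0], [np.nan, 3.0], [7.0, 6.0]])
y = np.array([1.0, 2.0, 3.0])

# Estimator + Transformer: fit() learns parameters, transform() uses them
imputer = SimpleImputer(strategy="mean")      # "strategy" is a hyperparameter
X_clean = imputer.fit_transform(X)            # = fit() then transform()
print(imputer.statistics_)                    # learned parameters end with an underscore

# Predictor: fit(), predict(), score()
reg = LinearRegression().fit(X_clean, y)
print(reg.predict(X_clean))
print(reg.score(X_clean, y))                  # quality of predictions (R^2) on the given set
```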
Sklearn program
Let’s code
Parametric vs Non-parametric models:
https://colab.research.google.com/drive/1Oc0D_jr5Rmqvhf7315zjDY7Vj8Sr9RGV#scrollTo=aSpylFzopToZ
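A small sketch of the contrast that notebook explores, on assumed toy data: a parametric LinearRegression (fixed function family, learns coefficients) vs. a non-parametric KNeighborsRegressor (no assumed family, keeps the data):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(50, 1))
y = 2.0 * X[:, 0] + 1.0 + rng.normal(0, 1, size=50)

# Parametric: assumes y = w*x + b and searches for w, b
lin = LinearRegression().fit(X, y)
print("learned parameters:", lin.coef_, lin.intercept_)

# Non-parametric: no assumed function family; predictions come from stored neighbors
knn = KNeighborsRegressor(n_neighbors=5).fit(X, y)
print(lin.predict([[5.0]]), knn.predict([[5.0]]))
```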
Gradient Based Optimization
● Optimization as a search problem
● Gradient based optimization for linear classifiers/regressors
● Linear Regression
● SGD
● Logistic Regression
● Overfitting vs Underfitting
● Regularization
● Deeper models
How does the optimizer work for Parametric models?
➢ Option 1: Closed form solution. Let's solve the equations!
➢ Option 2: Search!
● Brute force: try all possible values of w's and choose the set with min loss → Numerical solution = optimization
● Random search: same as brute force, but choose a random subset of all possible w's
● Guided search: start from random w's, measure the loss, then choose the next set of w's, guided by the change in the loss.
On which data to compute the loss?
How to use the loss change as guidance?
https://www.manning.com/books/deep-learning-with-python
On which data to compute the loss?
➢ Search in training data → Loss evaluation + Parameter updates
➢ Evaluate on unseen testing data → Loss evaluation, but NO parameter updates allowed
Is the separating boundary (classification) / fitting model (regression) always a line?
Further classification: Linear vs. Non-linear models
Models → Learning Method: Rule-based vs. Statistical Learning (Supervised, Unsupervised, Reinforcement Learning)
Models → Modeling Choice: Non-parametric vs. Parametric → Non-linear vs. Linear
We start with Linear models
Linear Regression
Example: Life Satisfaction vs. GDP
Play with Pandas for tabular data!
Example: Life Satisfaction vs. GDP
We can manually fit many solutions!
Normal equation
Normal equation
N equations, in N unknowns
Let’s code:
Let’s code:
sklearn.LinearRegression is just the normal equation, with the pseudoinverse
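A small numpy sketch of that statement: solving the normal equation via the pseudoinverse and comparing against sklearn's LinearRegression (toy data assumed):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(0, 2, size=(100, 1))
y = 4.0 + 3.0 * X[:, 0] + rng.normal(0, 0.5, size=100)

# Normal equation: w = (X_b^T X_b)^(-1) X_b^T y, computed here via the pseudoinverse
X_b = np.c_[np.ones((100, 1)), X]        # add a bias column
w = np.linalg.pinv(X_b) @ y
print("normal equation:", w)             # [bias, slope]

reg = LinearRegression().fit(X, y)
print("sklearn:        ", reg.intercept_, reg.coef_)
```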
➢ Disadvantages:
● Slow (X is a big matrix; the inverse is costly and does not always exist)
● Uses all the data points → might overfit (more on that later)
● It doesn’t work for classification, only regression, since some equations will vanish (correct classifications), leading to more degrees of freedom and fewer constraints, and we'll have infinite solutions (infinitely many lines can separate the classes with 0 MSE)
Normal equation
https://raw.githubusercontent.com/qingkaikong/blog/master/43_support_vector_machine_part1/figures/figure_3.jpg
➢ Sources:
– Sample noise: small data
– Sample bias: gender, ethnicity, popularity, ..etc.
– Poor quality (garbage in, garbage out): mislabels
– Missing data
➢ This results in poor generalization = overfitting
➢ The effect is magnified if the model is allowed to use all the training data, as in the case of the normal equation
➢ As can be seen, with the missing countries added, we get a completely different curve!
ANOTHER ISSUE: Non-representative data
Gradient Based Optimization
Gradient based optimization
A gradient is a measure of the change in output with a change in input.
Meaning: how much y changes when we change x by a small amount.
In other words, it's the rate of change of y with x.
It also encodes the sensitivity of y to x.
In the case of a loss, the gradient $\frac{\partial \text{loss}}{\partial w}$ is how much the loss changes when we change a weight by a small amount. It guides our search for the weights that minimize the loss: if we change w a bit and the loss is affected a lot, it means that the loss is sensitive to this change.
In the case of MSE loss, for example, since the loss is convex, any change in w moves the loss away from the minimum. So if we move against it, we reach the minimum. If this change value is small, it means we need to move a small step (we are already at the min: $\frac{\partial \text{loss}}{\partial w} = 0$), and vice versa.
https://www.manning.com/books/deep-learning-with-python
Gradient descent
When w_new = w_old, then the grad = 0, so we have reached the min (the same condition we wanted in the normal equation / closed form solution).
grad = delta_W → Numerical estimate of the true gradient (dJ/dW) → How to calculate that estimate?
http://users.isr.ist.utl.pt/~wurmd/Livros/school/Bishop%20%20Pattern%20Recognition%20And%20Machine%20Learning%20-%20Springer%20%202006.pdf
Batch Gradient descent
Training loop
For number of epochs:
➢ For each example in the dataset:
● Feedforward: compute the error
● Feedback: compute the gradient
● Update the params
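A numpy sketch of that loop for a 1-feature linear model, updating per example as written above; the learning rate, epoch count, and toy data are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(0, 2, size=100)
y = 4.0 + 3.0 * X + rng.normal(0, 0.5, size=100)

w, b = 0.0, 0.0
lr = 0.05

for epoch in range(50):                    # for number of epochs
    for x_i, y_i in zip(X, y):             # for each example in the dataset
        err = (w * x_i + b) - y_i          # feedforward: compute the error
        grad_w, grad_b = err * x_i, err    # feedback: gradient of 0.5 * err^2
        w -= lr * grad_w                   # update the params
        b -= lr * grad_b

print(w, b)                                # should approach roughly 3 and 4
```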
Let’s code Batch Gradient Descent
When using Gradient Descent, you should ensure that all features have a similar scale (e.g., using Scikit-Learn’s StandardScaler class), or else it will take much longer to converge.
GD Hyperparameters
Learning rate
As you can see, intuitively it’s important to pick a reasonable value for the step factor.
If it’s too small, the descent down the curve will take many iterations, and it could get stuck in a local minimum.
If the step is too large, your updates may end up taking you to completely random locations on the curve.
https://cdn-images-1.medium.com/max/1000/1*YQWdnHVTRPGjr0-VGuSxyA.jpeg3
Playing with Learning rate
Finding LR
- Learning rate schedules
Finding LR
- Learning rate schedule and simulated annealing
Cyclic LR Finder
High LR = Oscillation at a high level of loss
ReduceLROnPlateau
Monitor val_loss
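A sketch of wiring that callback in Keras; the factor and patience values, and the model/data it would be attached to, are placeholder choices:

```python
from tensorflow import keras

reduce_lr = keras.callbacks.ReduceLROnPlateau(
    monitor="val_loss",   # watch the validation loss
    factor=0.5,           # halve the LR when it plateaus
    patience=3,           # after 3 epochs without improvement
    min_lr=1e-6,
)

# Used like: model.fit(X, y, validation_split=0.2, epochs=50, callbacks=[reduce_lr])
```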
Additionally, there exist multiple variants of SGD that differ by taking into
account previous weight updates when computing the next weight update,
rather than just looking at the current value of the gradients.
There is, for instance, SGD with momentum, as well as Adagrad, RMSProp, and
several others. Momentum draws inspiration from physics.
A useful mental image here is to think of the optimization process as a small ball
rolling down the loss curve. If it has enough momentum, the ball won’t get stuck in a
ravine and will end up at the global minimum. Momentum is implemented by
moving the ball at each step based not only on the current slope value (current
acceleration) but also on the current velocity (resulting from past acceleration).
In practice, this means updating the parameter w based not only on the current
gradient value but also on the previous parameter update, such as in this naive
implementation:
https://i.pinimg.com/236x/31/90/2e/31902e5c838575c5aae6535181740cb9.jpg?nii=t
Momentum
https://machinelearningmastery.com/gradient-descent-with-momentum-from-scratch/
Normal GD
GD w/ momentum
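The naive momentum update described above can be sketched as follows; the toy quadratic loss and the momentum / learning-rate values are assumptions, not the book's exact snippet:

```python
# Gradient descent with momentum on a toy loss: loss(w) = (w - 3)^2
def gradient(w):
    return 2.0 * (w - 3.0)

w = 0.0
velocity = 0.0          # "past acceleration" accumulated so far
learning_rate = 0.1
momentum = 0.9

for step in range(200):
    # The update depends on the current gradient AND the previous update (velocity)
    velocity = momentum * velocity - learning_rate * gradient(w)
    w = w + velocity

print(w)   # approaches the minimum at w = 3
```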
EarlyStopping
EarlyStopping in Keras
https://keras.io/api/callbacks/early_stopping/
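A sketch of using that callback; the patience and restore_best_weights settings are typical but assumed choices:

```python
from tensorflow import keras

early_stop = keras.callbacks.EarlyStopping(
    monitor="val_loss",          # stop when the validation loss stops improving
    patience=5,                  # wait 5 epochs before stopping
    restore_best_weights=True,   # roll back to the best epoch
)

# Used like: model.fit(X, y, validation_split=0.2, epochs=100, callbacks=[early_stop])
```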
Repeated representation of data (Sample by sample = Stochastic GD)
➢ Training Loops
➢ For 1..Num epochs
▶ For 1..Num examples (M)
w(k+1) = w(k) + delta_W (M)
Batches
Repeated representation of data (ALL data) = Batch GD
➢ Training Loops
➢ For 1..Num epochs
▶ For 1..Num examples (M)
delta_W(M) += GRAD(EXAMPLE)
▶ GD Update: w(k+1) = w(k) + delta_W(M)
Different training loop options - Introducing batches
GD (Batch GD)
The above GD algorithm updates the gradient with every new sample of x.
If X is of 1000 samples, then delta_W[t+1] = step * gradient(f)(W[t]) is calculated and accumulated for every sample, but the weight update is applied once, after all 1000 samples are fed.
➢ Update once at the end
➢ High accumulation of error, risk of saturation
➢ Slow feedback, slow convergence, but more stable
➢ Takes advantage of parallelism in matrix operations (comp. arch.)
https://colab.research.google.com/drive/11_-SxhdtxvRYdPukURz-cojd7-JdpiKR?authuser=1
Getting the gradient for all the data could be slow, and include many errors before one update
loss = f(x, W) → 2 unknowns → we search for W, but what about x?
GD → dloss/dW, at which x?
So we estimate the gradient for different x’s:
Batch = ALL
Stochastic = One
Mini batch
➢ Training Loops with batches
➢ For 1..Num epochs
▶ For 1..Num batches (N_batches)
• For 1..Num of examples per batch (B)
– GD Update: w(k+1) = w(k) + delta_W(B)
Introducing Batches - Why?
Effect of batch size
➢ Batch Gradient Descent: batch size is set to the total number of examples in the training dataset (M).
➢ Stochastic Gradient Descent: batch size is set to one.
➢ Minibatch Gradient Descent: batch size (=B) is set to more than one and less than the total number of examples in the training dataset.
For shorthand, the algorithm is often referred to as stochastic gradient descent regardless of the batch size.
Objective: B = M
However, given that very large datasets are often used to train deep learning neural networks, the batch size is rarely set to the size of the training dataset.
Batch size mainly affects convergence speed, not final performance.
SGD with sklearn
Built-in in sklearn = SGDRegressor
epochs = max_iter
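A small sketch of that built-in on assumed toy data; note that max_iter plays the role of the number of epochs:

```python
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.uniform(0, 2, size=(200, 1))
y = 4.0 + 3.0 * X[:, 0] + rng.normal(0, 0.5, size=200)

X_scaled = StandardScaler().fit_transform(X)      # scale features before SGD

# max_iter = number of passes over the data (epochs)
sgd = SGDRegressor(max_iter=1000).fit(X_scaled, y)
print(sgd.intercept_, sgd.coef_)                  # coefficients are in the scaled space
```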
SGD + High LR
Different training loop options
SGD
There are other options for updating:
1. SGD: each sample (batch_size=1)
● Fast feedback, fast convergence (corrects itself a lot)
● Could oscillate, unstable (tends to corrupt what it has learnt) → to overcome, reduce the LR
● Does not take advantage of parallelism in matrix operations (comp. arch.)
● Not saturating
● Stochastic: every update is a representative sample of the true gradient
SGD + Low LR
One sample is not enough to estimate the gradient → so don’t take large steps
SGD vs BGD
➢ Batch Gradient Descent: use a relatively larger learning rate and more training epochs.
➢ Stochastic Gradient Descent: use a relatively smaller learning rate and fewer training epochs.
Mini-batch gradient descent provides an alternative approach.
➢ Minibatch SGD: every group of samples (batch_size = B)
➢ Accumulate B gradients, update every B samples
● More stable than 1 sample
● Faster than GD
Different training loop options
Effect of batch size in MBGD
The plots show that a small batch size generally results in rapid learning but a volatile learning process, with higher variance in the classification accuracy.
Larger batch sizes slow down the learning process, but the final stages converge to a more stable model, exemplified by lower variance in classification accuracy.
SGD vs. BGD vs. MBGD
Training loop options / Effect of Batch size
SGD (available in sklearn and Keras) | Batch_sz = 1
+ Fast feedback, fast convergence (corrects itself a lot)
+ No risk of gradient saturation
- Needs a reduced LR, since we make many corrections (could oscillate)
- Inaccurate gradient estimate

BGD (manual in sklearn, available in Keras with batch_size=M) | Batch_sz = M (total samples)
+ Takes advantage of parallelism in matrix operations (comp. arch.) → better gradient estimate → more stable
- Update once at the end → slow convergence
- High accumulation of error → risk of saturation

MBGD (manual in sklearn, available in Keras) | Batch_sz = B (N_batches = M/B)
+ Better estimates than SGD, faster than BGD
- Worse estimates than BGD (more oscillations), slower than SGD
Logistic Regression
Decision layer/Output (Ex: Logistic regression)
Knows ML or not?
Logistic Regression for Binary Classification
Coursat.ai
https://static.wixstatic.com/media/93352b_8e7c7a60715b4728a9a7d13612c5fa58~mv2.gif
Logistic Regression in sklearn
Remember this “band”/street when we deal with SVM
Multi-class and softmax in sklearn
We can still see the decision boundaries for the 3 classes this time
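A small sklearn sketch of both cases, binary and 3-class (softmax), using the Iris dataset as a stand-in example:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Binary: class 2 vs. the rest (sigmoid output)
binary = LogisticRegression(max_iter=1000).fit(X, (y == 2).astype(int))
print(binary.predict_proba(X[:2]))          # P(not class 2), P(class 2)

# Multi-class over the 3 classes (softmax over per-class scores)
multi = LogisticRegression(max_iter=1000).fit(X, y)
print(multi.predict(X[:2]), multi.predict_proba(X[:2]).shape)
```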