ML Workshop at SACON 2018
1. SACON
SACON Pune 2018
India | Pune | May 18 – 19 | Hotel Hyatt Pune
Learning Machine Learning
Subrat Panda
Capillary Technologies
Principal Architect, AI and Data Sciences
2. SACON
LEARNING MACHINE LEARNING
Subrat Panda
Principal Architect, AI and Data Sciences,
Capillary Technologies (www.capillarytech.com)
Co-Founder : IDLI (Indian Deep Learning Initiative)
https://www.facebook.com/groups/idliai/
BTech(2002), PhD(2009) IIT KGP.
https://www.linkedin.com/in/subratpanda/
Email : subratpanda@gmail.com
Acknowledgements:
Biswa Gourav Singh
Co-Founder : IDLI (Indian Deep Learning Initiative)
https://www.linkedin.com/in/biswagsingh/
Email: biswagourav.singh@gmail.com
AI Community Across the Globe
6. SACON
Gartner Says By 2020,
Artificial Intelligence Will
Create More Jobs Than It
Eliminates
7. SACON
What this talk can motivate people to do
STUDENTS:
Participate in data science competitions
Learn further and add the expertise to the resume
Final-year and fun projects
PROFESSIONALS:
Find interesting data in your current project and apply machine learning
Learn further and consider a change of profession: data scientists and machine
learning engineers are highly paid professionals
TEACHERS:
Spread knowledge in their university
Conduct hackathons
SACON 2018 - Pune
9. SACON
Machine Learning Classical Definition
Arthur Samuel (1959): "computer's ability to learn without being
explicitly programmed."
Tom M. Mitchell (1997): "A computer program is said to learn from
experience E with respect to some class of tasks T and performance
measure P if its performance at tasks in T, as measured by P,
improves with experience E."
Optimize a performance criterion using example data or past
experience.
10. SACON
Types of Machine Learning Algorithms
Supervised Learning: Input data with
labeled responses
Regression : Given a picture of a person, we
have to predict their age on the basis of the
given picture
Classification : Given a patient with a tumor,
we have to predict whether the tumor is
malignant or benign.
[Figures: Iris dataset species classification; text classification; image classification; linear regression vs. non-linear regression]
11. SACON
Types of Machine Learning Algorithms
Unsupervised Learning: Input data without labeled responses.
Clustering: Take a collection of 1,000,000 different genes, and find a way to
automatically group these genes into groups that are somehow similar or
related by different variables, such as lifespan, location, roles, and so on.
Non Clustering: Exploratory data analysis (PCA, Auto-encoders)
[Figures: customer segmentation; MNIST digit segmentation]
14. SACON
Pop Quiz
Predicting housing prices based on input parameters like house
size, number of rooms, location of house etc. falls under which
category of machine learning problem:
A) Regression
B) Classification
C) Clustering
D) None
Automatically segmenting your customers according to the customer
information falls under which category of machine learning.
A) Regression
B) Classification
C) Clustering
D) None
16. SACON
Linear Regression
• Linear regression is a simple form of supervised learning.
• In a regression problem the target variable is continuous.
Living Area (sq. feet)   Year Built   Price ($1000s)
2104                     2012         400
1600                     2013         300
2400                     2014         369
1416                     2013         232
3000                     2015         540
...                      ...          ...
Predict Housing Price from Historical data
17. SACON
Linear Regression
• The goal is to learn a function that assumes a linear relationship
between the target variable Y and the input variable X
18. SACON
Linear Regression
• In supervised learning, our goal is, given a training set, to learn a function h : X
→ Y so that h(x) is a “good” predictor for the corresponding value of Y.
Living Area (sq. feet)   Year Built   Price ($1000s)
2104                     2012         400
1600                     2013         300
2400                     2014         369
1416                     2013         232
3000                     2015         540
...                      ...          ...
• Let's consider the housing data above. Each X is a two-dimensional vector (living area, year built)
and Y is the price of the house.
20. SACON
Cost Function I
• Let's approximate Y as a linear function of X. Hence the hypothesis function
will be given by:
$h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2$
• θ’s are the parameters (also called weights) parameterizing the space of linear
functions mapping from X to Y.
• How do we pick, or learn, the parameters θ? One reasonable method seems to
be to make h(x) close to y, at least for the training examples we have. The cost
function is given by: (Considering θ1
• This is the least-squares cost function that gives rise to the ordinary least
squares regression model
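A minimal sketch of the least-squares cost, evaluated on the housing rows from the earlier slide. It assumes the hypothesis h(x) = θ0 + θ1·x1 + θ2·x2 and the scaling J(θ) = ½ Σ (h(x) − y)². The ½ factor is an assumption (some courses divide by 2m instead), and the names `HOUSES`, `hypothesis` and `cost` are made up for illustration.

```python
# Housing rows: (living area in sq. feet, year built, price in $1000s)
HOUSES = [
    (2104, 2012, 400),
    (1600, 2013, 300),
    (2400, 2014, 369),
    (1416, 2013, 232),
    (3000, 2015, 540),
]


def hypothesis(theta, x):
    """Linear hypothesis: theta[0] is the intercept, theta[1:] multiply the features."""
    return theta[0] + sum(t * xi for t, xi in zip(theta[1:], x))


def cost(theta, data):
    """Least-squares cost: half the sum of squared prediction errors (assumed scaling)."""
    return 0.5 * sum((hypothesis(theta, row[:-1]) - row[-1]) ** 2 for row in data)


# Predicting 0 for every house is a poor fit ...
print(cost([0.0, 0.0, 0.0], HOUSES))
# ... while a rough guess of $180 per sq. foot gives a much smaller cost.
print(cost([0.0, 0.18, 0.0], HOUSES))
```

A smaller value of `cost` means h(x) is closer to y on the training examples, which is what "picking θ" aims for.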
21. SACON
Cost Function II
We want to choose θ so as to minimize J(θ).
We can see the cost associated with different values of θ and we can see the
graph has a slight bowl to its shape.
The goal is to “roll down the hill”, and find θ corresponding to the bottom of
the bowl.
22. SACON
Gradient Descent
We should use a search algorithm that starts with some “initial guess” for θ, and that
repeatedly changes θ to make J(θ) smaller, until we converge to a value of θ that
minimizes J(θ).
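The algorithm we choose is the gradient descent algorithm, which starts with some
initial θ and repeatedly performs the following update (simultaneously for all j):
$\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)$
If we calculate the partial derivative of the least-squares cost, we get the following update:
$\theta_j := \theta_j - \alpha \sum_i (h_\theta(x^{(i)}) - y^{(i)}) x_j^{(i)}$ (with an extra factor 1/m if J is averaged over the m examples)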
α = Learning Rate
If α is too small: slow convergence.
If α is too large: may not decrease on every iteration and thus may
not converge.
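The update rule can be sketched for a single feature plus an intercept. This is an illustration under assumptions: the cost is taken as J(θ) = ½ Σ (h(x) − y)², the living area is rescaled to thousands of sq. feet so that one learning rate suits both parameters, and `gradient_descent`, `alpha=0.01` and the iteration count are choices made here, not values from the slides.

```python
def gradient_descent(xs, ys, alpha=0.01, iterations=20000):
    """Batch gradient descent for h(x) = theta0 + theta1 * x."""
    theta0, theta1 = 0.0, 0.0  # initial guess
    for _ in range(iterations):
        errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
        grad0 = sum(errors)                             # dJ/dtheta0
        grad1 = sum(e * x for e, x in zip(errors, xs))  # dJ/dtheta1
        # Simultaneous update of both parameters.
        theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1
    return theta0, theta1


areas = [2.104, 1.600, 2.400, 1.416, 3.000]  # thousands of sq. feet
prices = [400, 300, 369, 232, 540]           # $1000s
theta0, theta1 = gradient_descent(areas, prices)
print(theta0, theta1)
```

With this data a learning rate above roughly 0.07 makes the cost grow instead of shrink, which is the "α too large" case described above.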
25. SACON
Other Optimization Methods:
There is an alternative to batch gradient descent that also works very well.
Consider the following algorithm:
Each time we encounter a training example, we update the parameters
according to the gradient of the error with respect to that single training
example only. This algorithm is called Stochastic Gradient Descent (SGD).
Other examples of optimization algorithms: BFGS, L-BFGS
Mini-batch gradient descent: performs an update for every mini-batch of training examples.
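A small sketch of the per-example update used by stochastic gradient descent, assuming a one-feature linear hypothesis and a squared-error loss per example. The function name `sgd`, the learning rate, the epoch count and the toy data (points on the line y = 2x + 1) are all illustrative assumptions.

```python
import random


def sgd(xs, ys, alpha=0.05, epochs=500, seed=0):
    """Stochastic gradient descent for h(x) = theta0 + theta1 * x."""
    rng = random.Random(seed)
    theta0, theta1 = 0.0, 0.0
    order = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(order)  # visit the examples in a fresh random order
        for i in order:
            # Error on this single training example only.
            error = theta0 + theta1 * xs[i] - ys[i]
            theta0 -= alpha * error
            theta1 -= alpha * error * xs[i]
    return theta0, theta1


xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]  # exactly y = 2x + 1
theta0, theta1 = sgd(xs, ys)
print(theta0, theta1)
```

On noisy data a constant learning rate keeps the parameters jittering around the minimum, so in practice the rate is usually decreased over time.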
26. SACON
Normal Equation
Normal Equation is a method to solve for θ analytically.
Our cost function looks like:
To minimize a quadratic function, set its partial derivatives with respect to
each parameter to zero.
27. SACON
Normal Equation
Given a training set with m examples and n features, define the
design matrix X to be the m-by-n matrix whose rows are the training inputs
(m-by-(n+1) if a column of ones is added for the intercept term).
Let y be the m-dimensional vector containing all the target values
from the training set.
Thus, the value of θ that minimizes J(θ) is given
in closed form by the equation
$\theta = (X^T X)^{-1} X^T y$
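As a sketch, the normal equation (XᵀX) θ = Xᵀy can be solved by hand when the design matrix has only two columns, a column of ones for the intercept and one feature. The function name `normal_equation` and the choice of a single feature are assumptions made to keep the example free of matrix libraries.

```python
def normal_equation(xs, ys):
    """Solve (X^T X) theta = X^T y for a design matrix with rows [1, x]."""
    m = len(xs)
    sx = sum(xs)
    sxx = sum(x * x for x in xs)
    sy = sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    # X^T X = [[m, sx], [sx, sxx]] and X^T y = [sy, sxy]
    det = m * sxx - sx * sx  # zero only if all x values are identical
    theta0 = (sxx * sy - sx * sxy) / det
    theta1 = (m * sxy - sx * sy) / det
    return theta0, theta1


areas = [2.104, 1.600, 2.400, 1.416, 3.000]  # thousands of sq. feet
prices = [400, 300, 369, 232, 540]           # $1000s
print(normal_equation(areas, prices))
```

Unlike gradient descent there is no learning rate and no iteration, but for many features the matrix XᵀX becomes expensive to invert.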
30. SACON
Introduction
Logistic regression is an approach to the classification problem.
The output is either 1 or 0 instead of a continuous range of
values:
y ∈ {0,1}
This is a binary classification problem (two values).
Linear regression won't work well for the classification problem.
[Figure: image classification]
31. SACON
Logistic Regression: Hypothesis
The hypothesis should satisfy
0 ≤ hθ(x) ≤ 1
We want to restrict the range to between 0
and 1. This is accomplished by
plugging θTx into the "Sigmoid Function," also called
the "Logistic Function":
hθ(x) = g(θTx), where g(z) = 1 / (1 + exp(−z))
32. SACON
Decision Boundary
In order to get our discrete 0 or 1 classification, we can translate the output of the
hypothesis function as follows:
hθ(x)≥0.5→y=1
hθ(x)<0.5→y=0
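The hypothesis and the 0.5 threshold can be sketched as follows. The names `sigmoid`, `hypothesis` and `predict`, and the example parameters, are illustrative assumptions; the first entry of each input vector is taken to be a constant 1 for the intercept.

```python
import math


def sigmoid(z):
    """Logistic function; squashes any real z into the interval (0, 1)."""
    # Note: math.exp overflows for z below about -709; fine for a sketch.
    return 1.0 / (1.0 + math.exp(-z))


def hypothesis(theta, x):
    """h(x) = sigmoid(theta^T x)."""
    return sigmoid(sum(t * xi for t, xi in zip(theta, x)))


def predict(theta, x):
    """Translate the probability-like output into a discrete class."""
    return 1 if hypothesis(theta, x) >= 0.5 else 0


theta = [-3.0, 1.0]  # hypothetical parameters: boundary at x1 = 3
print(predict(theta, [1.0, 5.0]), predict(theta, [1.0, 1.0]))
```

Because sigmoid(z) ≥ 0.5 exactly when z ≥ 0, the decision boundary is the set of points where θTx = 0.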
33. SACON
Cost Function
We cannot use the squared-error cost function: combined with the logistic function it
becomes non-convex ("wavy"), with many local optima.
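A commonly used convex replacement is the binary cross-entropy (log loss), which this deck mentions later as an example loss. The sketch below assumes that choice and averages over the examples; the function name `log_loss` and the sample numbers are illustrative.

```python
import math


def log_loss(y_true, y_prob):
    """Average binary cross-entropy between labels (0 or 1) and predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        # Only one of the two terms is non-zero for each example.
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(y_true)


labels = [1, 0, 1]
print(log_loss(labels, [0.9, 0.1, 0.8]))  # confident and correct: small loss
print(log_loss(labels, [0.5, 0.5, 0.5]))  # uninformative: larger loss
```

Probabilities of exactly 0 or 1 would make `math.log` fail, so real implementations clip the predictions slightly away from the ends.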
37. SACON
Overview
Intro. to Support Vector Machines (SVM)
Properties of SVM
Applications
Discussion
38. SACON
Introduction
A Support Vector Machine (SVM) is a supervised machine
learning algorithm that can be employed for both classification
and regression purposes.
SVMs are more commonly used in classification problems.
The plot shows the size and weight of several
people, and there is also a way to
distinguish between men and women.
39. SACON
Separating Hyperplane
We can see that it is possible to separate the data into classes.
We could trace a line and then all the data points representing men will be
above the line, and all the data points representing women will be below the
line.
40. SACON
What is the Optimal Separating Hyperplane
Many separating hyperplanes are possible. Which one is best?
41. SACON
What is the Optimal Separating Hyperplane
• We will try to select a hyperplane as far as possible from the data
points of each category (the best hyperplane)
• Because it correctly classifies the training data
• And because it is the one which will generalize better to unseen data
42. SACON
Large Margin Classifier
• Given a particular hyperplane, we can compute the distance between the hyperplane and the
closest data points (the support vectors). Twice this distance is the margin.
• Basically the margin is a no man's land. For linearly separable data, there will never be any data point inside the margin.
The optimal hyperplane will be the one with the
biggest margin. Margin A is better than Margin B.
46. SACON
Non-linear SVMs
Datasets that are linearly separable with some
noise work out great:
But what are we going to do if the dataset is just
too hard?
How about… mapping data to a higher-
dimensional space:
[Figures: one-dimensional data plotted along x; the same data mapped to two dimensions with axes x and x²]
47. SACON
Non-linear SVMs: Feature spaces
General idea: the original input space can always
be mapped to some higher-dimensional feature
space where the training set is separable:
Φ: x → φ(x)
48. SACON
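The "Kernel Trick"
The linear classifier relies on the dot product between vectors: $K(x_i, x_j) = x_i^T x_j$
If every data point is mapped into a high-dimensional space via some
transformation Φ: x → φ(x), the dot product becomes:
$K(x_i, x_j) = \varphi(x_i)^T \varphi(x_j)$
A kernel function is some function that corresponds to an inner product in
some expanded feature space.
Example:
2-dimensional vectors $x = [x_1 \; x_2]$; let $K(x_i, x_j) = (1 + x_i^T x_j)^2$.
Need to show that $K(x_i, x_j) = \varphi(x_i)^T \varphi(x_j)$:
$K(x_i, x_j) = (1 + x_i^T x_j)^2$
$= 1 + x_{i1}^2 x_{j1}^2 + 2 x_{i1} x_{j1} x_{i2} x_{j2} + x_{i2}^2 x_{j2}^2 + 2 x_{i1} x_{j1} + 2 x_{i2} x_{j2}$
$= [1 \;\; x_{i1}^2 \;\; \sqrt{2} x_{i1} x_{i2} \;\; x_{i2}^2 \;\; \sqrt{2} x_{i1} \;\; \sqrt{2} x_{i2}]^T \, [1 \;\; x_{j1}^2 \;\; \sqrt{2} x_{j1} x_{j2} \;\; x_{j2}^2 \;\; \sqrt{2} x_{j1} \;\; \sqrt{2} x_{j2}]$
$= \varphi(x_i)^T \varphi(x_j)$, where $\varphi(x) = [1 \;\; x_1^2 \;\; \sqrt{2} x_1 x_2 \;\; x_2^2 \;\; \sqrt{2} x_1 \;\; \sqrt{2} x_2]$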
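The identity in this example can be checked numerically. The sketch below computes the kernel directly in the 2-dimensional input space and again as a dot product of the 6-dimensional feature vectors; the function names and the two sample points are illustrative assumptions.

```python
import math


def poly_kernel(x, z):
    """K(x, z) = (1 + x^T z)^2, computed in the original 2-D space."""
    return (1 + x[0] * z[0] + x[1] * z[1]) ** 2


def phi(x):
    """Explicit 6-D feature map matching the polynomial kernel above."""
    r2 = math.sqrt(2)
    return [1, x[0] ** 2, r2 * x[0] * x[1], x[1] ** 2, r2 * x[0], r2 * x[1]]


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


xi, xj = [1.0, 2.0], [3.0, -1.0]
print(poly_kernel(xi, xj), dot(phi(xi), phi(xj)))  # the two values agree
```

The kernel needs one 2-D dot product, while the explicit route builds two 6-D vectors first; that saving is the point of the trick.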
49. SACON
What Functions are Kernels?
For some functions $K(x_i, x_j)$ checking that
$K(x_i, x_j) = \varphi(x_i)^T \varphi(x_j)$ can be cumbersome.
Mercer's theorem:
Every positive semi-definite symmetric function is a kernel
Positive semi-definite symmetric functions correspond to a positive
semi-definite symmetric Gram matrix:
$$K = \begin{bmatrix} K(x_1,x_1) & K(x_1,x_2) & K(x_1,x_3) & \cdots & K(x_1,x_N) \\ K(x_2,x_1) & K(x_2,x_2) & K(x_2,x_3) & \cdots & K(x_2,x_N) \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ K(x_N,x_1) & K(x_N,x_2) & K(x_N,x_3) & \cdots & K(x_N,x_N) \end{bmatrix}$$
50. SACON
Examples of Kernel Functions
Linear: $K(x_i, x_j) = x_i^T x_j$
Polynomial of power p: $K(x_i, x_j) = (1 + x_i^T x_j)^p$
Gaussian (radial-basis function network): $K(x_i, x_j) = \exp\left(-\frac{\|x_i - x_j\|^2}{2\sigma^2}\right)$
Sigmoid: $K(x_i, x_j) = \tanh(\beta_0 x_i^T x_j + \beta_1)$
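The four kernels listed on this slide can be written as plain functions. This is a sketch; the default parameter values (`p=2`, `sigma=1.0`, `beta0=1.0`, `beta1=0.0`) are arbitrary choices for illustration, not recommendations from the slides.

```python
import math


def dot(x, z):
    return sum(a * b for a, b in zip(x, z))


def linear_kernel(x, z):
    return dot(x, z)


def polynomial_kernel(x, z, p=2):
    return (1 + dot(x, z)) ** p


def gaussian_kernel(x, z, sigma=1.0):
    # Squared Euclidean distance between the two points.
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sq_dist / (2 * sigma ** 2))


def sigmoid_kernel(x, z, beta0=1.0, beta1=0.0):
    return math.tanh(beta0 * dot(x, z) + beta1)


x, z = [1.0, 2.0], [2.0, 0.0]
for kernel in (linear_kernel, polynomial_kernel, gaussian_kernel, sigmoid_kernel):
    print(kernel.__name__, kernel(x, z))
```

The Gaussian kernel equals 1 when both points coincide and decays towards 0 as they move apart, at a rate set by σ.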
51. SACON
Non-linear SVMs Mathematically
Dual problem formulation:
Find α1…αN such that
Q(α) = Σαi - ½ΣΣαiαjyiyjK(xi, xj) is maximized and
(1) Σαiyi = 0
(2) αi ≥ 0 for all αi
The solution is:
f(x) = ΣαiyiK(xi, x) + b
Optimization techniques for finding αi’s remain the same!
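Once the αi are known, classifying a new point only requires kernel evaluations against the support vectors. The sketch below evaluates f(x) for a tiny hand-made problem; the support vectors, the αi values and b are chosen by hand for illustration (they satisfy both constraints of the dual) and are not the output of an actual solver.

```python
def linear_kernel(x, z):
    return sum(a * b for a, b in zip(x, z))


def decision_function(x, support_vectors, labels, alphas, b, kernel=linear_kernel):
    """f(x) = sum_i alpha_i * y_i * K(x_i, x) + b; the sign of f(x) gives the class."""
    return sum(a * y * kernel(sv, x)
               for sv, y, a in zip(support_vectors, labels, alphas)) + b


support_vectors = [[1.0, 1.0], [-1.0, -1.0]]
labels = [1, -1]
alphas = [0.25, 0.25]  # sum(alpha_i * y_i) = 0 and every alpha_i >= 0
b = 0.0

print(decision_function([2.0, 0.0], support_vectors, labels, alphas, b))
```

Passing a different `kernel` function turns the same code into a non-linear classifier, which is why the optimization techniques do not need to change.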
52. SACON
Nonlinear SVM - Overview
SVM locates a separating hyperplane in the feature space and classifies points in
that space.
It does not need to represent the space explicitly; defining a kernel
function is enough.
The kernel function plays the role of the dot product in the feature space.
53. SACON
Properties of SVM
Flexibility in choosing a similarity function
Sparseness of solution when dealing with large data sets
- only support vectors are used to specify the separating hyperplane
Ability to handle large feature spaces
- complexity does not depend on the dimensionality of the feature space
Overfitting can be controlled by soft margin approach
Nice math property: a simple convex optimization problem which is
guaranteed to converge to a single global solution
Feature Selection
54. SACON
SVM Applications
SVM has been used successfully in many real-world problems
Text (and hypertext) categorization
Image classification
Bioinformatics (Protein classification, Cancer classification)
Hand-written character recognition
55. SACON
Application 1: Cancer Classification
High Dimensional
- p>1000; n<100
Imbalanced
- less positive samples
Many irrelevant features
Noisy
[Table: gene-expression matrix with patients P-1 … P-n as rows and genes g-1 … g-p as columns]
FEATURE SELECTION
In the linear case, $w_i^2$ gives the ranking of dimension i
SVM is sensitive to noisy (mis-labeled) data
56. SACON
Weakness of SVM
It is sensitive to noise
- A relatively small number of mislabeled examples can dramatically decrease
the performance
It only considers two classes
- how to do multi-class classification with SVM?
- Answer:
1) With output arity m, learn m SVMs
SVM 1 learns “Output==1” vs “Output != 1”
SVM 2 learns “Output==2” vs “Output != 2”
:
SVM m learns “Output==m” vs “Output != m”
2) To predict the output for a new input, just predict with each SVM and
find out which one puts the prediction the furthest into the positive region.
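The prediction step of this one-vs-rest scheme can be sketched as picking the class whose SVM returns the largest score. The three linear scorers below are hypothetical stand-ins for trained SVMs, and the helper names are made up for illustration.

```python
def one_vs_rest_predict(scorers, x):
    """Return the label whose scorer puts x furthest into the positive region."""
    scores = {label: score(x) for label, score in scorers.items()}
    return max(scores, key=scores.get)


def make_linear_scorer(w, b):
    """Hypothetical stand-in for a trained SVM: returns w.x + b."""
    return lambda x: sum(wi * xi for wi, xi in zip(w, x)) + b


scorers = {
    1: make_linear_scorer([1.0, 0.0], 0.0),    # "Output==1" vs rest
    2: make_linear_scorer([0.0, 1.0], 0.0),    # "Output==2" vs rest
    3: make_linear_scorer([-1.0, -1.0], 0.0),  # "Output==3" vs rest
}

print(one_vs_rest_predict(scorers, [3.0, 1.0]))
```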
57. SACON
Application 2: Text Categorization
Task: The classification of natural text (or hypertext) documents into a
fixed number of predefined categories based on their content.
- email filtering, web searching, sorting documents by topic, etc.
A document can be assigned to more than one category, so this can
be viewed as a series of binary classification problems, one for each
category
58. SACON
Representation of Text
IR’s vector space model (aka bag-of-words
representation)
A doc is represented by a vector indexed by a pre-
fixed set or dictionary of terms
Values of an entry can be binary or weights
Normalization, stop words, word stems
Doc x => φ(x)
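A minimal bag-of-words sketch: each document becomes a vector indexed by a fixed dictionary of terms, with either binary entries or term counts. The tokenization (lower-casing and splitting on whitespace), the function name and the tiny vocabulary are simplifying assumptions; stop-word removal and stemming are left out.

```python
def bag_of_words(doc, vocabulary, binary=False):
    """Map a document to a vector with one entry per vocabulary term."""
    tokens = doc.lower().split()
    counts = [tokens.count(term) for term in vocabulary]
    if binary:
        return [1 if c > 0 else 0 for c in counts]
    return counts


vocabulary = ["cat", "dog", "the"]
print(bag_of_words("The cat sat on the mat", vocabulary))
print(bag_of_words("The cat sat on the mat", vocabulary, binary=True))
```

Words outside the dictionary ("sat", "on", "mat") are simply ignored, and word order is lost, which is what makes this a "bag".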
59. SACON
Text Categorization using SVM
The similarity between two documents is the inner product φ(x)·φ(z)
K(x,z) = 〈φ(x), φ(z)〉 is a valid kernel, so SVM can be used with K(x,z) for
discrimination.
Why SVM?
High dimensional input space
Few irrelevant features (dense concept)
Sparse document vectors (sparse instances)
Most text categorization problems are linearly separable
60. SACON
Some Issues
Choice of kernel
Gaussian or polynomial kernel is default
If ineffective, more elaborate kernels are needed
Domain experts can give assistance in formulating appropriate
similarity measures
Choice of kernel parameters
e.g. σ in Gaussian kernel
σ can be set to the distance between the closest points with different classifications
In the absence of reliable criteria, applications rely on the use of a
validation set or cross-validation to set such parameters.
Optimization criterion – Hard margin vs. soft margin
Tuning often takes a lengthy series of experiments in which various parameters are tested
62. SACON
k-Nearest Neighbor Classification
(kNN)
Unlike all the previous learning methods, kNN does not build a model
from the training data.
To classify a test instance d, define the k-neighborhood P as the k nearest
neighbors of d
Count the number n of training instances in P that belong to class cj
Estimate Pr(cj|d) as n/k
No training is needed. Classification time is linear in the training set size
for each test case.
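The procedure can be sketched in a few lines. Euclidean distance is assumed as the distance function, and the function name `knn_predict` and the toy training set are illustrative.

```python
import math
from collections import Counter


def knn_predict(train, query, k=3):
    """Return the majority class among the k nearest neighbors and its estimated probability."""
    # Scan the whole training set: this is the "linear in training set size" cost.
    neighbors = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    label, n = votes.most_common(1)[0]
    return label, n / k  # n/k estimates Pr(label | query)


train = [
    ((1.0, 1.0), "a"), ((1.2, 0.8), "a"), ((0.9, 1.1), "a"),
    ((5.0, 5.0), "b"), ((5.2, 4.9), "b"),
]
print(knn_predict(train, (1.1, 1.0)))
print(knn_predict(train, (4.0, 4.0)))
```

There is no fitting step at all; the "model" is the stored training data itself.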
63. SACON
kNN Algorithm
k is usually chosen empirically via a validation set
or cross-validation by trying a range of k values.
Distance function is crucial, but depends on
applications.
65. SACON
Discussions
kNN can deal with complex and arbitrary decision boundaries.
Despite its simplicity, researchers have shown that the classification
accuracy of kNN can be quite strong and in many cases as accurate
as more elaborate methods.
kNN is slow at classification time
kNN does not produce an understandable model
67. SACON
INTRODUCTION-
What is clustering?
Clustering is the classification of objects into different groups, or more
precisely, the partitioning of a data set into subsets (clusters), so that the data
in each subset (ideally) share some common trait - often according to some
defined distance measure.
68. SACON
TYPES OF CLUSTERING
Hierarchical algorithms: these find successive clusters using previously
established clusters.
Agglomerative ("bottom-up"): Agglomerative algorithms begin with each element as
a separate cluster and merge them into successively larger clusters.
Divisive ("top-down"): Divisive algorithms begin with the whole set and proceed to
divide it into successively smaller clusters.
[Figure: cluster dendrogram]
69. SACON
TYPES OF CLUSTERING
Partitional clustering: Partitional algorithms determine all clusters at
once. They include:
K-means and derivatives
Fuzzy c-means clustering
QT clustering algorithm
70. SACON
TYPES OF CLUSTERING
The distance measure determines how the similarity of two
elements is calculated, and it influences the shape of the
clusters.
Distance measures include:
The Euclidean distance (also called 2-norm distance) is given
by:
$d(x, y) = \sqrt{\sum_i (x_i - y_i)^2}$
The Manhattan distance (also called taxicab norm or 1-norm) is
given by:
$d(x, y) = \sum_i |x_i - y_i|$
71. SACON
The maximum norm is given by:
$d(x, y) = \max_i |x_i - y_i|$
The Mahalanobis distance corrects data for different scales and
correlations in the variables.
Inner product space: The angle between two vectors can be used as a
distance measure when clustering high dimensional data
Hamming distance measures the minimum
number of substitutions required to change one member into another;
edit distance generalizes it by also allowing insertions and deletions.
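Several of the distance measures above can be written directly from their definitions. This is a sketch with illustrative function names; the Mahalanobis distance is omitted because it needs a covariance matrix.

```python
import math


def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def manhattan(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))


def maximum_norm(x, y):
    return max(abs(a - b) for a, b in zip(x, y))


def angle(x, y):
    """Angle in radians between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norms = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    # Clamp to [-1, 1] to guard against rounding just outside acos's domain.
    return math.acos(max(-1.0, min(1.0, dot / norms)))


def hamming(s, t):
    """Number of positions at which two equal-length sequences differ."""
    if len(s) != len(t):
        raise ValueError("Hamming distance needs sequences of equal length")
    return sum(a != b for a, b in zip(s, t))


p, q = (0, 0), (3, 4)
print(euclidean(p, q), manhattan(p, q), maximum_norm(p, q))
print(hamming("karolin", "kathrin"))
```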
72. SACON
K-MEANS CLUSTERING
The k-means algorithm is an algorithm to cluster n
objects based on attributes into k partitions, where k
< n.
It is similar to the expectation-maximization algorithm
for mixtures of Gaussians in that they both attempt to
find the centers of natural clusters in the data.
It assumes that the object attributes form a vector
space.
73. SACON
An algorithm for partitioning (or clustering) N data points into K disjoint
subsets Sj containing data points so as to minimize the sum-of-squares
criterion
$J = \sum_{j=1}^{K} \sum_{n \in S_j} \|x_n - \mu_j\|^2$
where xn is a vector representing the nth data point and μj is the
geometric centroid of the data points in Sj.
Simply speaking, k-means clustering is an algorithm to categorize or
group the objects based on attributes/features into K groups.
K is a positive integer.
The grouping is done by minimizing the sum of squares of distances
between data and the corresponding cluster centroid.
74. SACON
HOW K-MEANS CLUSTERING WORKS?
Step 1: Begin with a decision on the value of k = Number of
clusters
Step 2: Choose any initial partition that classifies the data into k
clusters. You may assign the training samples randomly, or
systematically as follows:
Take the first k training samples as single-element
clusters
Assign each of the remaining (N-k) training samples to the
cluster with the nearest centroid. After each assignment,
recompute the centroid of the gaining cluster.
Step 3: Take each sample in sequence and compute its distance
from the centroid of each of the clusters. If a sample is not
currently in the cluster with the closest centroid, switch this
sample to that cluster and update the centroid of the cluster
gaining the new sample and the cluster losing the sample.
Step 4 . Repeat step 3 until convergence is achieved, that is until a
pass through the training sample causes no new assignments.
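The steps above can be sketched as follows. Note one deliberate simplification: instead of moving one sample at a time and updating centroids immediately (Step 3 as written), this sketch uses the common batch variant, often called Lloyd's algorithm, which reassigns all samples and then recomputes all centroids. Seeding with the first k samples follows Step 2; the function name and the toy points are illustrative.

```python
import math


def kmeans(points, k, max_iter=100):
    """Batch k-means; returns the centroids and the cluster index of each point."""
    centroids = [list(p) for p in points[:k]]  # first k samples seed the clusters
    assignment = None
    for _ in range(max_iter):
        # Assign every point to the cluster with the nearest centroid.
        new_assignment = [
            min(range(k), key=lambda j: math.dist(p, centroids[j])) for p in points
        ]
        if new_assignment == assignment:
            break  # a full pass caused no new assignments: converged
        assignment = new_assignment
        # Recompute each centroid as the mean of its members.
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[j] = [sum(c) / len(members) for c in zip(*members)]
    return centroids, assignment


points = [(1.0, 1.0), (5.0, 5.0), (1.2, 0.8), (5.2, 4.8), (0.8, 1.2), (4.8, 5.2)]
centroids, assignment = kmeans(points, k=2)
print(centroids, assignment)
```

The result depends on the initial partition, so in practice the algorithm is often run several times from different starting points.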
77. SACON
Bias/Variance
Bias is the algorithm's tendency to
consistently learn the wrong thing by
not taking into account all the
information in the data
Variance is the algorithm's tendency to
learn random things irrespective of the
real signal by fitting highly flexible
models that follow the error/noise in
the data too closely
78. SACON
Problem of high Bias/Variance
• Generalization ability is an algorithm's ability to give accurate
predictions on new, previously unseen data
• Models that are too complex for the amount of training data
available are said to overfit and are not likely to generalize well to
new examples
• High variance can cause an algorithm to model the random noise in
the training data, rather than the intended outputs (overfitting).
• Models that are too simple, that do not even do well on training data,
are said to underfit and are also not likely to generalize well.
• High bias can cause an algorithm to miss the relevant relations
between features and target outputs (underfitting).
80. SACON
Bias/Variance is a Way to Understand
Overfitting and Underfitting
[Figure: error/loss on the training set Dtrain and on an unseen test set Dtest, plotted from simple classifier ("too simple") to complex classifier ("too complex"); high error marked on the plot]
81. SACON
Definitions
• Overfitting: too much reliance on the training data
• Underfitting: a failure to learn the relationships in the training data
• High Variance: model changes significantly based on training data
• High Bias: assumptions about model lead to ignoring training data
• Overfitting and underfitting cause poor generalization on the test set
• A validation set for model tuning helps detect under- and overfitting
82. SACON
Ways to Deal with
Overfitting and Underfitting
Underfitting:
Easier to resolve
Try different machine learning models
Try stronger models with higher capacity (hyperparameter
tuning)
Try more features
Overfitting
Use a resampling technique like K-fold cross validation
Improve the feature quality or remove some features
Training with more data
Early stopping
Regularization
Ensembling
83. SACON
Regularization
• Regularization penalizes the coefficients. In neural networks, it
penalizes the weight matrices of the nodes.
• L1 and L2 are the most common types of regularization.
• These update the general cost function by adding another term
known as the regularization term.
Cost function = Loss (say, binary cross entropy) +
Regularization term
84. SACON
L1 and L2 Regularization
In L2, we have:
Here, lambda is the regularization parameter. It is the hyperparameter whose
value is optimized for better results. L2 regularization is also known as weight
decay as it forces the weights to decay towards zero (but not exactly zero).
In L1, we have:
In this, we penalize the absolute value of the weights. Unlike L2, the weights may
be reduced to zero here.
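The two penalties can be sketched as follows. The scaling is an assumption: here the regularization term is simply λ times the sum of squared weights (L2) or of absolute weights (L1), while some texts add a factor such as 1/(2m). The function names and the sample numbers are illustrative.

```python
def l2_penalty(weights, lam):
    """lambda * sum of squared weights (assumed scaling)."""
    return lam * sum(w * w for w in weights)


def l1_penalty(weights, lam):
    """lambda * sum of absolute weights (assumed scaling)."""
    return lam * sum(abs(w) for w in weights)


def regularized_cost(loss, weights, lam, kind="l2"):
    """Cost function = loss + regularization term."""
    penalty = l2_penalty(weights, lam) if kind == "l2" else l1_penalty(weights, lam)
    return loss + penalty


weights = [3.0, -4.0]
print(regularized_cost(1.0, weights, lam=0.1, kind="l2"))
print(regularized_cost(1.0, weights, lam=0.1, kind="l1"))
```

Larger λ pushes the optimizer towards smaller weights; with λ = 0 the cost reduces to the plain loss.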
86. SACON
Artificial Neural Networks
A Single Neuron: The basic unit of computation in a neural network is
the neuron, often called a node or unit.
The function f is non-linear and is called the Activation Function.
The idea of ANNs is based on the belief that the working of the human brain, which makes
the right connections, can be imitated using silicon and wires in place of
living neurons and dendrites.
87. SACON
Activation Function
Sigmoid: takes a real-valued input and squashes it to range between 0 and 1.
σ(x) = 1 / (1 + exp(−x))
tanh: takes a real-valued input and squashes it to the range [-1, 1]
tanh(x) = 2σ(2x) − 1
ReLU: ReLU stands for Rectified Linear Unit. It takes a real-valued input and
thresholds it at zero (replaces negative values with zero)
f(x) = max(0, x)
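The three activation functions translate directly into code. In this sketch tanh is computed through the identity tanh(x) = 2σ(2x) − 1 given above, so it can be compared against the standard library's `math.tanh`.

```python
import math


def sigmoid(x):
    """Squashes a real input into the range between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-x))


def tanh(x):
    """Squashes a real input into [-1, 1], written in terms of the sigmoid."""
    return 2.0 * sigmoid(2.0 * x) - 1.0


def relu(x):
    """Thresholds at zero: negative inputs become 0."""
    return max(0.0, x)


for x in (-2.0, 0.0, 0.7):
    print(x, sigmoid(x), tanh(x), math.tanh(x), relu(x))
```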
89. SACON
Neural Network Intuition (Multiple Layers)
A multi-layer neural network is capable of learning complex
functions.
Let's consider the XNOR operation, with A = X1 and B = X2.
• CASE1: X1 XNOR X2 = (A’.B’) + (A.B)
[Figure: NN representation]
• CASE2: X1 XNOR X2 = NOT [ (A+B).(A’+B’) ]
NN representation = ?
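CASE1 can be sketched as a two-layer network of sigmoid neurons: one hidden neuron computes A.B, another computes A’.B’, and the output neuron ORs them. The weights below are hand-picked for illustration (a common textbook choice), not learned, and the helper names are assumptions.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def neuron(weights, inputs):
    """weights[0] is the bias; the remaining weights multiply the inputs."""
    z = weights[0] + sum(w * x for w, x in zip(weights[1:], inputs))
    return sigmoid(z)


AND = [-30.0, 20.0, 20.0]   # fires only when both inputs are 1
NOR = [10.0, -20.0, -20.0]  # fires only when both inputs are 0 (A'.B')
OR = [-10.0, 20.0, 20.0]    # fires when at least one input is 1


def xnor(a, b):
    hidden = [neuron(AND, [a, b]), neuron(NOR, [a, b])]  # hidden layer
    return neuron(OR, hidden)                             # output layer


for a in (0, 1):
    for b in (0, 1):
        print(a, b, round(xnor(a, b)))
```

A single neuron cannot represent XNOR because the two classes are not linearly separable; the hidden layer is what makes it possible.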
90. SACON
Back-Propagation
The back-propagation (BP) algorithm works by
determining the loss (or error) at the output and
then propagating it back into the network.
The weights are updated to reduce each neuron's
contribution to the error.
91. SACON
Regularization: Dropout
At every iteration, it randomly selects some nodes
and removes them along with all of their incoming
and outgoing connections
We need to choose the dropout parameter such
that we get the appropriate fitting
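A sketch of dropout applied to one layer's activations during training. It assumes the "inverted dropout" convention, in which the surviving activations are divided by the keep probability so their expected value is unchanged; that rescaling, the parameter name `keep_prob` and its default are assumptions, not something stated on the slide.

```python
import random


def dropout(activations, keep_prob=0.8, rng=None):
    """Randomly zero out nodes; rescale the survivors by 1 / keep_prob (assumed convention)."""
    rng = rng or random.Random()
    return [a / keep_prob if rng.random() < keep_prob else 0.0 for a in activations]


layer_output = [0.5, 1.0, 1.5, 2.0, 2.5]
print(dropout(layer_output, keep_prob=0.8, rng=random.Random(0)))
```

A fresh random mask is drawn at every iteration, and dropout is switched off when the trained network is used for prediction.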
92. SACON
Deep Learning
• Deep neural networks have been very successful recently in the fields
of computer vision, natural language processing, speech recognition
and many more.
• Some of the important/successful networks are
• Convolutional neural network: has been very successful in computer vision
• Recurrent neural network: has been successful in natural language
processing and speech recognition as well.
94. SACON
Decision Tree
A decision tree is a supervised learning algorithm.
We split the population or sample into two or more homogeneous sets (or
sub-populations) based on the most significant differentiator in the input variables.
1. Root Node: It represents the entire
population or sample, and this further
gets divided into two or more
homogeneous sets.
2. Splitting: It is the process of dividing
a node into two or more sub-nodes.
3. Decision Node: When a sub-node
splits into further sub-nodes, it is
called a decision node.
4. Leaf/Terminal Node: Nodes that do not
split are called leaf or terminal nodes.
96. SACON
Methods of splitting: Information gain
Which node can be described most easily?
Information theory provides a measure of this degree of disorganization in a
system, known as entropy:
Entropy = −p log2(p) − q log2(q)
Here p and q are the probabilities of success and failure respectively in
that node.
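A sketch of the entropy of a node with two outcomes, assuming the usual definition Entropy = −p·log2(p) − q·log2(q) with q = 1 − p, and treating 0·log2(0) as 0. The function name is illustrative.

```python
import math


def entropy(p):
    """Entropy in bits of a node where p is the probability of success."""
    q = 1.0 - p
    # Skip zero probabilities: the term 0 * log2(0) is taken to be 0.
    return -sum(x * math.log2(x) for x in (p, q) if x > 0)


print(entropy(0.5))  # evenly mixed node: hardest to describe
print(entropy(0.9))  # mostly one class: easier to describe
```

A split with high information gain is one whose child nodes have much lower entropy than the parent.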
97. SACON
Other Tree based methods
Trade-off management of bias-variance errors.
Bagging is a simple ensembling technique in which we
build many independent predictors/models/learners and
combine them using some model averaging techniques.
Ensemble methods combine a group of predictive models to
achieve better accuracy and model stability.
Random Forest: Multiple Trees instead of
single tree. It’s a bagging method
To classify a new object based on
attributes, each tree gives a classification
and we say the tree “votes” for that class.
98. SACON
Other Tree based methods
Gradient Boosting is a tree ensemble technique that creates a strong classifier
from a number of weak classifiers.
It builds an additive model from weak learners, adding one at a time.
Boosting is an ensemble technique in which the predictors are not made
independently, but sequentially.
99. SACON
Iris Dataset
Three species of Iris (Iris setosa, Iris virginica and Iris versicolor).
Four features were measured from each sample: the length and the width of
the sepals and petals, in centimeters.