Masters in Computer Science — Machine Learning Concepts
!
Goal : very briefly touch upon some of the important terminologies and fundamental concepts for selecting a machine
learning algorithm.
!
!
!
** concept : a function or mapping from objects to membership. A mapping between objects in the world and
membership in a set.
** instance : a vector of attribute-value pairs (the input space of a concept, e.g. the pixels of a picture, or credit
scores)
** target concept : the actual answer that is being searched for in the space of multiple candidate concepts.
** hypothesis : a candidate concept that helps to predict the target concept (the actual answer)
*** apply candidate concepts to a testing set (which should include lots of examples)
*** apply inductive learning to choose a hypothesis from a given set of examples

We need to ask some relevant questions to choose a hypothesis:

!What’s the Inductive Bias for the classification function ?
>> Inductive bias helps us find a general rule from examples.
>> Generalization is the whole point of machine learning.
!What’s Occam’s Razor ?
>> Prefer the simplest hypothesis that fits the data.
!What’s the Restriction Bias ?
>> Consider only those hypotheses which can be represented by the chosen algorithm.
!Supervised classification => Function Approximation : predicting the outcome when we know the different classifications
example: predicting the type of flower (setosa, versicolor, or virginica) based on sepal width/length
!Unsupervised classification => Category Clustering : predicting the outcome when we don’t know what the different
classifications are.
example: splitting all the data for sepal width/length into different groups (clustering similar data together)
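A minimal sketch of the two settings on the iris data mentioned above (assuming scikit-learn is available; the choice of a decision tree and k-means here is just for illustration):

```python
# Supervised vs. unsupervised on the iris sepal measurements.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.cluster import KMeans

X, y = load_iris(return_X_y=True)
X = X[:, :2]                                  # keep only sepal length and sepal width

# Supervised: function approximation, the labels (setosa/versicolor/virginica) are known
clf = DecisionTreeClassifier(max_depth=3).fit(X, y)
print("supervised prediction:", clf.predict([[5.0, 3.5]]))

# Unsupervised: category clustering, no labels are used at all
clusters = KMeans(n_clusters=3, n_init=10).fit_predict(X)
print("cluster assignments for first 10 samples:", clusters[:10])
```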
!Reinforcement Learning => Learning from Delayed Reward.
!Eager & Lazy Learners :
! Eager Learners : decision trees, regression, neural networks, SVMs, Bayes nets
! find a function that best fits the training data, i.e. spend time up front to learn from the data; when new inputs are received, the input features are fed
into that function. Here we consider inputs on a global scale and avoid local sensitivities.
!Lazy Learners : lazy learners do not compute a function to fit the training data before new data is received, so we save
significant time up front; new instances are compared to the training data to make a classification / regression
decision. This considers local-scale estimation.
!
!
MLAlgo Preference Bias Learning Function Performance Enhancements Usage
Bayesian
!(Eager Learner)
- Classification
Preference Bias :
Prior domain knowledge:
~ Pr(h) : prior probability of each candidate hypothesis h
~ Pr(D|h) : probability distribution over the observed data D for each h
!Occam’s Razor ?
- select the h with minimum description length
!** there is at least one maximally probable hypothesis:
h_MAP = argmax_h P(h|D)
      -> argmax_h P(D|h)
(for a uniform prior)
Posterior probability (Bayes rule)
P(h|D) = P(D|h) . P(h) / P(D)
!Key assumption : every hypothesis h_i is equally probable a priori => P(h_i) = 1/|H|
!* Noise-free data, uniformly distributed hypotheses in the version space VS *
!P(h) = 1 / |H| ,
P(D|h) = { 1 if d_i = h(x_i) for all i , 0 otherwise }
P(h|D) = 1 / |VS| for every h consistent with D
!* Noisy data (d_i = f(x_i) + noise) *
h_ML = argmax_h P(D|h)
     = argmax_h Π_i P(d_i|h)
!* for Gaussian noise, maximizing Σ_i ln P(d_i|h) reduces to minimizing Σ_i (d_i − h(x_i))²
!* Most probable classification of a new instance :
v_MAP = argmax_v Σ_h P(v|h) . P(h|D)
!Cons :
* significant computational cost to find the Bayes-optimal hypothesis
* sometimes a huge number of hypotheses needs to be surveyed
!Pros :
* no need to commit up front to a single given hypothesis
* NB handles missing data very well: it just excludes the attribute with missing data when computing the posterior probability (i.e. the probability of a class given a data point)
* for a smaller training set, NB is a good bet
* Use Bayesian
Learning to
represent
Conditional
Independence of
variables !
* Assumes real-
valued attributes
are normally
distributed. As a
result, NB can only
have linear, elliptic,
or parabolic
decision
boundaries.
* Example: misclassification, pruning, fitting errors
!* spam example: the class "spam" conditions the features Lottery, Bank, College
!P(spam | lottery , not bank , not college)
∝ P(spam) . P(lottery|spam) . P(¬bank|spam) . P(¬college|spam)
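A minimal sketch of the Naive Bayes computation for the spam example above. The priors and conditional probabilities here are made-up placeholders; in practice they would be estimated from counts over training emails:

```python
# Naive Bayes for the spam example; the probability values are hypothetical placeholders.
priors = {"spam": 0.4, "ham": 0.6}
# P(feature present | class), assumed conditionally independent given the class
likelihoods = {
    "spam": {"lottery": 0.30, "bank": 0.20, "college": 0.05},
    "ham":  {"lottery": 0.01, "bank": 0.10, "college": 0.15},
}

def posterior_scores(evidence):
    """evidence maps feature name -> True/False (present / absent)."""
    scores = {}
    for cls in priors:
        score = priors[cls]
        for feature, present in evidence.items():
            p = likelihoods[cls][feature]
            score *= p if present else (1.0 - p)
        scores[cls] = score
    # normalize so the scores sum to 1 (this is the divide-by-P(D) step)
    total = sum(scores.values())
    return {cls: s / total for cls, s in scores.items()}

# P(spam | lottery, not bank, not college)
print(posterior_scores({"lottery": True, "bank": False, "college": False}))
```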
Algo
Decision Tree :
!(Eager Learner)
!ID3 , C4.5
! approximate discrete-valued functions
! hypothesis = a disjunction of conjunctions of constraints on attribute values
!Description
Classification
: for discrete input data
: for continuous input data (consider a range selection as the split condition, e.g. "> 20%")
Preference Bias
Occam’s Razor ?
: prefer the shorter tree
Other biases :
: information gain prefers attributes with many possible values
: prefer trees that place high-information-gain attributes close to the root (the attribute with the best answers, NOT the best splits)
Learning Function
Gain(S, A) = Entropy(S) − Σ_{v ∈ Values(A)} ( |S_v| / |S| ) · Entropy(S_v)
** i.e. entropy minus the weighted sum of the entropies of the partitions
* Entropy(S) = − Σ_i p_i · log2(p_i)
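A small sketch of the entropy and information-gain formulas above, on a toy attribute (the labels and attribute values are illustrative only):

```python
# Entropy and information gain, as used by ID3 to pick the next split attribute.
from collections import Counter
from math import log2

def entropy(labels):
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in counts.values())

def information_gain(labels, attribute_values):
    """Gain(S, A) = Entropy(S) - sum_v (|S_v|/|S|) * Entropy(S_v)."""
    total = len(labels)
    partitions = {}
    for label, value in zip(labels, attribute_values):
        partitions.setdefault(value, []).append(label)
    weighted = sum((len(part) / total) * entropy(part) for part in partitions.values())
    return entropy(labels) - weighted

labels = ["yes", "yes", "no", "no", "yes", "no"]
outlook = ["sunny", "rain", "sunny", "rain", "rain", "sunny"]
print("Gain(S, outlook) =", information_gain(labels, outlook))
```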
Performance
The usual problem for a decision tree : with N boolean attributes there are 2^N possible rows
and 2^(2^N) possible output functions to search over.
!** so instead of iterating over all rows, first work upon only the attributes which have the highest info gain.
** handles noise , handles missing values
!=============
Scope of
improvement :
!Decision trees,
however, often
achieve lower
generalization
accuracy, compared
to other learning
methods, such as
support vector
machines and neural
networks. One
common way to
improve their
accuracy is boosting
Enhancement
pros : computes the best attribute in one move
!cons :
* does not look ahead or behind (this problem is addressed by hill-climbing …)
* tends to overfit as it looks into many different combinations of features
* logistic regression avoids overfitting more elegantly
!** Overfitting solutions for a decision tree :
>> stop growing the tree before it grows too large
>> prune after a certain threshold
* consider interdependency between attributes, P(Y=y | X=x)
* consider GainRatio and SplitInfo
Usage
- restaurant
selection decision
based on cost,
menu , appetite,
weather, and other
features.
-
Decision Tree :
Regression
!Classification
: for continuous output data
!Lazy, distance-based learning function :
For each training sample s_l in S :
d_l = dist(s_l, query)   (e.g. sum of squared differences)
w_j = d_max − d_j   (closer samples receive larger weights)
Advantages of decision trees include:
● computational scalability
● handling of messy data (missing values, various feature types)
!● ability to deal with irrelevant features : the algorithm selects “relevant” features first, and generally ignores irrelevant features.
● If the decision tree is short, it is easy for a human to interpret: decision trees do not produce a black-box model.
Algo
Linear
Regression :
!(Eager Learner)
!Model a linear
relationship between a
dependent variable (y)
and independent
variables (x1,x2..)
!Regression, as a term,
stems from the
observation that
individual instances
of any observed
attribute tend to
regress towards the
mean.
!Description
Classification :
Scalar input , continuous output
Vector input , continuous output
!** Vector input -> combinations of multiple features into a single feature
Preference Bias
Regress to the mean
!Gradient :
* for one variable, the derivative is the slope of the tangent line
* for several variables, the gradient is the direction of the fastest increase of the function
Learning Function
ŷ = θᵀx (prediction) , y_i = observed value
minimize the Sum of Squared Errors :
J(θ) = ½ Σ_i (ŷ_i − y_i)²
!Gradient descent update : θ₁ = θ₀ − α ∇J(θ₀)
θ₁ -> next position
θ₀ -> current position
α is the learning rate, so that the function takes a small step in the direction opposite to that of ∇J (the direction of fastest increase)
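A minimal batch gradient-descent sketch for the squared-error objective above, on synthetic data (the true relationship y = 2x + 1 is assumed purely for illustration):

```python
# Linear regression via batch gradient descent on synthetic data.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100)
y = 2.0 * x + 1.0 + rng.normal(0, 0.5, size=100)

X = np.column_stack([np.ones_like(x), x])   # add intercept column
theta = np.zeros(2)                         # current position theta_0
alpha = 0.01                                # learning rate

for _ in range(2000):
    residuals = X @ theta - y               # (y_hat - y)
    grad = X.T @ residuals / len(y)         # gradient of the mean squared-error objective
    theta = theta - alpha * grad            # step opposite to the gradient

print("learned [intercept, slope]:", theta)
```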
Performance
!Cons:
Function should be
differentiable
!Caution :
Learning rate must
not be very small or
very large
Enhancement
! Usage
!Housing Price
prediction
Polynomial
Regression
Algo
Multi-Layer
Perceptron
!!(Eager Learner)
!Description
Classification
Preference Bias
Initial weights should
be chosen to be small
and random values:
!— random values help avoid getting stuck in the same local minima
— small values give variability and low complexity (larger weights equate to larger complexity).
Learning Function
A perceptron is a linear function that defines a hyperplane in n dimensions, perpendicular to the weight vector
w = (w_1, …, w_n). The perceptron classifies things on one side of the hyperplane as positive and things on the other side as negative.
Perceptron rule
!Guarantees finite convergence, but only if the data are linearly separable.
Δw_i = η (y − ŷ) x_i
!Gradient Descent (delta) rule
!Calculus-based. More robust to data sets that are not linearly separable; however, it converges only to a local minimum / optimum.
Δw_i = η (y − a) x_i , where a is the unthresholded activation w·x
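A minimal sketch of the perceptron training rule Δw_i = η(y − ŷ)x_i on a tiny linearly separable problem (the logical OR function); the data and learning rate are illustrative:

```python
# Perceptron rule on the OR function (linearly separable, so it converges).
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 1])                      # OR labels
X = np.column_stack([np.ones(len(X)), X])       # fold the bias term into the weights

w = np.zeros(3)
eta = 0.1

for _ in range(20):                             # a few epochs are enough here
    for xi, target in zip(X, y):
        y_hat = 1 if xi @ w > 0 else 0          # threshold unit
        w += eta * (target - y_hat) * xi        # perceptron update

print("weights:", w)
print("predictions:", [(1 if xi @ w > 0 else 0) for xi in X])
```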
!
!
Performance
!Neural networks
have low
restriction bias,
because they can
model many
different functions.
Therefore they
have the danger
of overfitting.
!Neural Networks
consist of:
!Perceptron: half-
spaces
Sigmoids (instead
of step functions):
much more
complex
Hidden Layers
(groups of sigmoid
functions)
!So it allows for
modeling many
types of
functions /
behaviors, such as:
!Boolean: network
of threshold-like
units
Continuous:
through hidden
layers (e.g. use of
sigmoids instead
of step)
Arbitrary (non-
continuous):
multiple hidden
layers
Enhancement
!The addition of hidden layers helps map continuous functions (a change in the input changes the output very smoothly)
!Multiply weights only if we
don’t get better errors !
Usage
!One obvious
advantage of
artificial neural
networks - ability
to produce any
number of
outputs, (multi-
class) while support
vector machines
have only one. The
most direct way to
create an n-ary
classifier with
support vector
machines is to
create n support
vector machines
and train each of
them one by one.
On the other hand,
an n-ary classifier
with neural
networks can be
trained in one go.
===========
A multi-layer perceptron is able to find relations between features. For example, this is necessary in computer vision, where a raw image is provided to the learning algorithm and sophisticated features must then be computed. Essentially, the intermediate levels can calculate new, previously unknown features.
Algo
K Nearest
Neighbors -
Classification
!remembers mapping,
fast lookup
!
Preference Bias :
Why consider KNN
over other ?
* near points are
similar to one another
(locality)
* smoothly changing
behavior from one
neighborhood to
another neighborhood.
* so we can choose
best distance function
Learning Function
!Choose the best distance function.
!!Manhattan (ℓ1) :
d = |y2 − y1| + |x2 − x1|
!Euclidean (ℓ2) :
d = sqrt( (y2 − y1)² + (x2 − x1)² )
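A minimal k-nearest-neighbors sketch using the Euclidean distance above with a majority vote; the training points and query are illustrative:

```python
# KNN classification: find the k closest training points and take a majority vote.
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, query, k=3):
    distances = np.sqrt(((X_train - query) ** 2).sum(axis=1))   # Euclidean distance
    nearest = np.argsort(distances)[:k]                          # indices of the k closest
    votes = Counter(y_train[i] for i in nearest)                 # majority vote
    return votes.most_common(1)[0][0]

X_train = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.8]])
y_train = ["A", "A", "B", "B"]
print(knn_predict(X_train, y_train, np.array([4.9, 5.1]), k=3))   # -> "B"
```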
Performance :
!Problem : curse of
dimensionality :
!… as the number
of features grow,
the amount of data
required for
accurate
generalization
grows
exponentially .
> O(2^d)
Reducing weights
will help curb the
effect of
dimensionality.
When k is small, models have low bias but high variance, fitting at a strongly local level.
Larger k creates models with higher bias but lower variance (smoother, more averaged predictions).
Cons :
* KNN doesn't
know which
attributes are
more important
* Doesn't handle
missing data
gracefully
!
!
Enhancements :
!generalization - NO
overfitting - YES
!///
!
Usage
!No assumption
about data
distribution (Great
Advantage over
NB)
It is highly non-parametric
Algo
K Nearest Neighbors -
Regression.
!LWR (locally
weighted
regression)
Learning Function
!It combines the
traditional regression
with instance based
learning’s sensitivity to
training items with high
similarity to the test
point
Performance :
!-- reduce the pull
effect of far-away
points through
Kernels
-- the squared
deviations are
weighted by a kernel
function that
decreases with
distance, such that
for a new test
instance, a
regression function is
found for that specific
point that
emphasizes fitting
closeby points and
ignoring the pull of
faraway points…
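A minimal locally weighted regression sketch along the lines described above: a Gaussian kernel down-weights far-away points and a weighted least-squares fit is solved for each query point. The bandwidth tau and the synthetic data are assumptions for illustration:

```python
# Locally weighted regression (LWR) for a single query point.
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 10, 80)
y = np.sin(x) + rng.normal(0, 0.1, size=x.shape)

def lwr_predict(query, x, y, tau=0.5):
    X = np.column_stack([np.ones_like(x), x])          # local linear model with intercept
    w = np.exp(-((x - query) ** 2) / (2 * tau ** 2))   # kernel: weight decreases with distance
    W = np.diag(w)
    # solve (X^T W X) theta = X^T W y  -> a fit that emphasizes close-by points
    theta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
    return theta[0] + theta[1] * query

print("prediction at x=3.0:", lwr_predict(3.0, x, y), " true sin(3.0):", np.sin(3.0))
```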
Preference Bias :
!- An individual rule (the result of learning over a subset of the data) does not provide the answer on its own, but when the rules are combined, the complex rule works well.
!Choose those examples where this combination offers better performance on testing subsets of the data than fitting a 4th-order polynomial.
Learning Function
Pr_D : the distribution over the training examples
!** boost up the distribution ….
! h1 h2 h3
x1 +1 -1 +1
x2 -1 -1 +1
x3 +1 -1 +1
!** at each time-step t, find a hypothesis H_t with small error on the current distribution (a Weak Classifier), constantly creating new distributions … (Boosting)
!
** Final Hypothesis :
the sgn (sign) function of the weighted sum of all of the rules.

Performance :
!Why Boosting does
so well ?
!>> if there are some
samples which do
not provide good
result, then boosting
can re-rate the
samples so that some
of ‘past under-
performers’ become
more important.
>>
!Use Gradient Boosting to handle noisy data in a decision tree :
https://en.wikipedia.org/wiki/Gradient_boosting
!>> Boosting does overfit if the Weak Learner uses a NN with many layers of nodes
!Choosing Subsets:
!Instead of selecting
subsets randomly,
we can pick subsets
containing hardest
examples—those
examples that don’t
perform well given
current rule.
!Combine:
!Instead of a mean,
consider a weighted
mean.
Enhancements:
● Computationally efficient.
● No difficult parameters to set.
● Versatile : a wide range of base learners can be used with AdaBoost.
Caveats:
● The algorithm seems susceptible to uniform noise.
● The weak learner should not be too complex, to avoid overfitting.
● There needs to be enough data so that the weak-learning requirement is satisfied : the base learner should perform consistently better than random guessing, with generalization error < 0.5 for binary classification problems.
usage
body: contains word
manly → YES
from: your spouse →
NO
body short length →
YES
body: only contains
urls → YES
body: just an image
→ YES
body: contains words
belonging to blacklist
(misspellings) → YES
!All of these rules are
useful, however, no
specific one can
determine spam (or
not) on its own. We
need to find a way to
combine them.
!!find which Wiki pages can be recommended for an extended period of time (the feature set is a combination of binary, text, and numeric features)
!Ref : http://statweb.stanford.edu/~tibs/ElemStatLearn/
!http://media.nips.cc/Conferences/2007/Tutorials/Slides/schapire-NIPS-07-tutorial.pdf
!************
If you have a dense feature set, go with boosting.
Algo
Ensemble Learning
!Solves Classification Problems.
!*************
Boosting is a meta-learning technique, i.e. something you put on top of a set of learners to form an ensemble

Notes on Ensemble Learning (Boosting)	

!An important difference between Ensemble Learners and other types of Learners :
-- a NN already knows the network structure and tries to learn the weights
-- a DTree gradually builds up the rules
!But an Ensemble Learner finds the best combination of rules.

!1. Initialize the importance weights w_i = 1/N for all training examples i.
2. For m = 1 to M:
   a) Fit a classifier G_m(x) to the training data using the weights w_i.
   b) Compute the error: err_m = Σ_i w_i · I(y_i ≠ G_m(x_i)) / Σ_i w_i
   c) Compute α_m = log((1 − err_m) / err_m)
   d) Update the weights: w_i ← w_i · exp[ α_m · I(y_i ≠ G_m(x_i)) ] for i = 1, 2, ..., N
3. Return G(x) = sign[ Σ_m α_m G_m(x) ].
We can see that for err_m < 0.5, the α_m parameter is positive.
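A minimal code sketch of the AdaBoost steps above, using depth-1 decision trees ("stumps") from scikit-learn as the weak learner G_m; the synthetic data and M = 20 rounds are illustrative assumptions:

```python
# AdaBoost with decision stumps as the weak learner.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)          # labels in {-1, +1}

N, M = len(y), 20
w = np.full(N, 1.0 / N)                             # 1. initialize the weights
stumps, alphas = [], []

for m in range(M):                                  # 2. for m = 1..M
    stump = DecisionTreeClassifier(max_depth=1)
    stump.fit(X, y, sample_weight=w)                # a) fit the weak learner with weights
    pred = stump.predict(X)
    miss = (pred != y).astype(float)
    err = np.sum(w * miss) / np.sum(w)              # b) weighted error
    alpha = np.log((1 - err) / (err + 1e-12))       # c) classifier weight
    w = w * np.exp(alpha * miss)                    # d) re-weight the hard examples
    w /= w.sum()
    stumps.append(stump)
    alphas.append(alpha)

# 3. final hypothesis: sign of the weighted sum of the weak learners
F = sum(a * s.predict(X) for a, s in zip(alphas, stumps))
print("training accuracy:", np.mean(np.sign(F) == y))
```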
Preference Bias :
!Support : the goal
with the support
vector machine is to
maximize the margin,
m, subject to the
constraint that we
classify everything
correctly. Together,
this can be defined
mathematically as:
!maximize the margin m subject to : y_i (wᵀ x_i + b) ≥ 1 , ∀ i
Learning Function
!Finding the line of least commitment in a linearly separable set of data is the basis behind support vector machines
>> a line that leaves as much space as possible from the boundaries.
y = wᵀ x + b
where: y is the classification label, y ∈ {−1, +1} , with
y > 0 : in class , y < 0 : out of class
wᵀ and b are the parameters of the plane
Performance :
>> similar to KNN, but here, instead of being completely lazy, we spend upfront effort solving a complicated quadratic program in order to keep only the required points (the support vectors).
!
>> For
classification tasks
involving more than
two groups, a
common strategy is
to use multiple
binary classifiers to
decide on a single-
best class for new
instances	

Enhancements:
!y = wᵀ φ(x) + b
— use a kernel when the feature vector φ(x) is of higher dimension.
!Many machine learning algorithms can be written to use only dot products, and then we can replace the dot products with kernels.
usage
!Mostly binary
classification (linear
and non-linear)	

1) If you have sparse
feature set, go with
linear svm (or other
linear model)	

!2) If you don't care
about speed and
memory, try kernel
svm.	

!*************
In order to eliminate expensive parameter tuning and to better handle a high-dimensional input space, we can use a Kernelized SVM for text classification (tens of thousands of support vectors, each having hundreds of thousands of features)
Algo
SVM
!The classifier output is greater than or equal to 1 for the positive examples and less than or equal to -1 for the negative examples; the margin is related to the difference between the vector x and the vector x projected onto the decision boundary.
!Classification
Notes on Support Vector Machines - SVM

Here, instead of Polynomial Regression, we consider a Polynomial Kernel; the kernel represents domain knowledge
= projecting into some higher-dimensional space.
!For data that is separable, but not linearly, we can use a kernel function to capture a nonlinear dividing curve. The kernel function should capture
some aspect of similarity in our data.
Ref : https://www.quora.com/What-are-Kernels-in-Machine-Learning-and-SVM
Simple example of a kernel : x = (x1, x2, x3); y = (y1, y2, y3). For the feature map f(x) = (x1x1, x1x2, x1x3, x2x1, x2x2, x2x3, x3x1, x3x2, x3x3), the
corresponding kernel is K(x, y) = ⟨x, y⟩².
Let's plug in some numbers to make this more intuitive:
suppose x = (1, 2, 3); y = (4, 5, 6). Then:
f(x) = (1, 2, 3, 2, 4, 6, 3, 6, 9)
f(y) = (16, 20, 24, 20, 25, 30, 24, 30, 36)
⟨f(x), f(y)⟩ = 16 + 40 + 72 + 40 + 100 + 180 + 72 + 180 + 324 = 1024
That is a lot of algebra, since f maps from 3-dimensional to 9-dimensional space.
Now let us use the kernel instead:
K(x, y) = (4 + 10 + 18)² = 32² = 1024. Same result, but the calculation is much easier.
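A quick numeric check of this identity; the explicit 9-dimensional feature map f gives the same inner product as the much cheaper kernel (x·y)²:

```python
# Verify that <f(x), f(y)> equals K(x, y) = (x . y)^2 for the example above.
import numpy as np

def f(v):
    # explicit feature map: all pairwise products v_i * v_j (9 numbers for 3-d input)
    return np.outer(v, v).ravel()

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 5.0, 6.0])

print(np.dot(f(x), f(y)))          # 1024.0 via the 9-dimensional space
print(np.dot(x, y) ** 2)           # 1024.0 via the kernel, no explicit mapping needed
```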
!
!
!!!!!!!!!
Types of Errors	

!
In sample error = error resulted from applying the prediction algorithm to the training dataset	

!
Out of sample error = error resulted from applying the prediction algorithm to a new test data set	

!
In sample error << Out of sample error => the model is overfitting, i.e. the model is too optimized for the initial
dataset

!
Regression Errors:	

!
Bias-Variance Estimates
!It is very important to calculate ‘Bias Errors’ and ‘Variance Errors’ while comparing various algorithms.
!Error due to Bias = when a prediction model is built multiple times, the Bias Error is the difference between the ‘Expected Prediction value’ and the Correct value.
Bias measures how far the range of predictions deviates from the real values.
Example of low bias == the tendency of the mean of all the sample points to converge towards the mean of the real values
!
*
!Error due to Variance = how much the predictions for a given point vary between different implementations of the model.	

Example of high variability == sample points tend to be dispersed away from each other.	

!Reference : http://scott.fortmann-roe.com/docs/BiasVariance.html	

!
!
!
So it is often better to give up a little accuracy for more robustness when predicting on new data.

!
Classification Errors: 	

!
Positive = identified and Negative = rejected	

True positive = correctly identified (predicted true when true)	

False positive = incorrectly identified (predicted true when false)	

True negative = correctly rejected (predicted false when false)	

False negative = incorrectly rejected (predicted false when true)	

!
example: medical testing	

!
True positive = Sick people correctly diagnosed as sick	

False positive = Healthy people incorrectly identified as sick	

True negative = Healthy people correctly identified as healthy	

False negative = Sick people incorrectly identified as healthy	

!
!
!
!
κ = (accuracy − P(e)) / (1 − P(e))

!
P(e) = ((TP+FP) / total) × ((TP+FN) / total) + ((TN+FN) / total) × ((FP+TN) / total)
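A small sketch computing the kappa statistic from confusion-matrix counts using the formulas above; the TP/FP/TN/FN values are made up for illustration:

```python
# Cohen's kappa from confusion-matrix counts.
TP, FP, TN, FN = 80, 10, 90, 20
total = TP + FP + TN + FN

accuracy = (TP + TN) / total
p_e = ((TP + FP) / total) * ((TP + FN) / total) + ((TN + FN) / total) * ((FP + TN) / total)
kappa = (accuracy - p_e) / (1 - p_e)

print("accuracy =", accuracy, " P(e) =", round(p_e, 3), " kappa =", round(kappa, 3))
```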

!
!
Receiver Operating Characteristic curves :	

!
x-axis = 1 − specificity (i.e., the false positive rate)

y-axis = sensitivity (i.e., the true positive rate)

area under the curve = quantifies whether the prediction model is viable or not

i.e. higher area → better predictor

area = 0.5 → effectively random guessing (the diagonal line in the ROC curve)

area = 1 → a perfect classifier

area = 0.8 → considered good for a prediction algorithm
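A minimal ROC sketch (assuming scikit-learn is available) that computes the curve and the area under it for some illustrative labels and scores:

```python
# ROC curve and AUC for a toy set of true labels and predicted scores.
from sklearn.metrics import roc_curve, auc

y_true  = [0, 0, 1, 1, 0, 1, 1, 0]
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.6, 0.3]   # e.g. predicted probabilities

fpr, tpr, thresholds = roc_curve(y_true, y_score)      # x = FPR (1 - specificity), y = TPR (sensitivity)
print("AUC =", auc(fpr, tpr))                          # 0.5 ~ random guessing, 1.0 = perfect
```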

References :
http://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-867-machine-learning-fall-2006/lecture-notes/
http://www.stat.cmu.edu/~cshalizi/350/
http://www.quora.com/Machine-Learning/What-are-some-good-resources-for-learning-about-machine-learning-Why
https://www.udacity.com/course/machine-learning--ud262
https://www.coursera.org/learn/machine-learning
http://sux13.github.io/DataScienceSpCourseNotes/8_PREDMACHLEARN/Practical_Machine_Learning_Course_Notes.html
