Analytics Boot Camp Course
Aditya Joshi
Kumar Rishabh
Srinivas Nv Gannavarapu
“The science of examining raw data with the purpose of drawing conclusions about that information.” - TechTarget
Data analytics refers to qualitative and
quantitative techniques and processes used
to enhance productivity and business gain.
Data is extracted and categorized to identify
and analyze behavioral data and patterns,
and techniques vary according to
organizational requirements. - Techopedia
Brainstorm
Where else is Data Analytics used?
Workflow
• Planning, organizing and requirement gathering
• Gathering Data
• Data Cleaning
• Analyzing Data, Predictive Modelling and Result Generation
• Result Presentation
Learning Process (Training): Obtain Data → Extract Features → Train Model
Learning Process (Prediction): New Data → Extract Features → Model → Result
Evaluation
How is data obtained?
Train-Test data
Evaluation methodologies
Precision, Recall, F-Measure
Other methods of evaluation, Confusion Matrix
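These measures fall out of a confusion matrix directly; a minimal Python sketch, using made-up actual and predicted labels purely for illustration:

```python
# Precision, recall and F-measure from confusion-matrix counts, for the
# evaluation methodologies listed above. Labels are made-up examples.
actual    = ["pos", "pos", "pos", "neg", "neg", "pos", "neg", "pos"]
predicted = ["pos", "neg", "pos", "neg", "pos", "pos", "neg", "neg"]

tp = sum(a == "pos" and p == "pos" for a, p in zip(actual, predicted))
fp = sum(a == "neg" and p == "pos" for a, p in zip(actual, predicted))
fn = sum(a == "pos" and p == "neg" for a, p in zip(actual, predicted))
tn = sum(a == "neg" and p == "neg" for a, p in zip(actual, predicted))

precision = tp / (tp + fp)          # of the predicted positives, how many were right
recall = tp / (tp + fn)             # of the actual positives, how many were found
f_measure = 2 * precision * recall / (precision + recall)
print(tp, fp, fn, tn, round(precision, 3), round(recall, 3), round(f_measure, 3))
```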
Data Attributes Types
Nominal, Categorical, Ordinal, Continuous
Tools We Use
R Programming Language
Pros:
• Best suited for data-oriented problems
• Strong package ecosystem
• Graphics and charting
• Simple learning curve
Cons:
• Memory management, speed, and efficiency
• Not a complete programming language
• Not for advanced programmers
Matlab
Pros:
• Contains a lot of advanced toolboxes
• Good documentation
• Good customer support
Cons:
• Expensive
• Not as much open source code available, because Matlab requires a license
• Cannot integrate your code into a web service
Python
Pros:
• Great data analysis libraries (pandas, statsmodels)
• Code can be easily integrated into a web service
• Simple and easy to begin
Cons:
• A lot of cutting-edge, advanced academic research is still being done in R/Matlab
• Advanced features of the language might offer a steep learning curve for newcomers
SAS
Pros:
• Easy to learn
• Dedicated customer service along with the community
• Simple learning curve
Cons:
• Expensive enterprise tool
• Limited functionality
Julia
Pros:
• High performance, efficient
• Good for writing computationally intensive programs that use multiple CPUs
• Decent visualization capabilities
Cons:
• Relatively new; growing community
• Not syntactically optimized for statistical operations on data arrays
Brainstorm
What’s the Best Language?
The Big Blue
Cloud Computing, Bluemix
and Analytics
SaaS – PaaS – IaaS
IBM Bluemix
Platform as a Service
Zero Infrastructure, Lower Risk
Lower cost and improved profitability
Easy and quick development, Monetize quickly
Reusable code and business logic
Integration with other web services
Google App Engine
Heroku
IBM Bluemix
Openshift
Cloud Foundry
…
PaaS Offerings
https://www.zoho.com/creator/paas.html
Bluemix Offerings
Storage • Analytics • Watson • Mobile • IoT • Containers • and much more
IBM Bluemix Data & Analytics
Data Storage:
• Cloudant NoSQL DB
• Redis
• IBM dashDB
Graph Processing:
• IBM Graph
Number Crunching:
• IBM Analytics for Apache Spark
Why Bluemix?
Cloud Service
Ready to Use Platform
Because it’s IBM
Open Source Tools
DashDB
Data Warehousing and Analytics
• Relational Data
• Special data types, e.g., geospatial data
Data Analysis
• SQL Access
• Advanced Built-in Analytics
• R Studio
Performance
• IBM’s BLU Acceleration
• processes terabytes of data extremely quickly
DashDB
Database vs. Data Warehouse

Database:
• Collection of data organized for storage, accessibility, and retrieval.
• OLTP.
• Very quick transaction processing for a single application.
• Optimized for performing read-write operations on single-point transactions.
• Analysis is hard because of slow multiple joins; queries become very complex.

Data Warehouse:
• Integrated copies of transaction data from multiple sources, optimized for analytical use.
• OLTP + OLAP.
• Data from multiple applications, used to guide analysis and decision-making.
• Optimized for efficiently reading/retrieving large data sets and for aggregating data.
• Analysis queries are easy to form and execute faster.
Let’s Do It
Getting Started
Setup and Basics
Basics of R Programming
Learning further with swirl
Need a Break?
Overview of Machine Learning
What is Machine Learning?
A computer program is said to learn from
experience E with respect to some class of
tasks T and performance measure P if its
performance at tasks in T, as measured by
P, improves with experience E
- Tom Mitchell
Supervised Learning - Regression
Linear Regression
modeling the relationship between a scalar dependent variable and one or more
explanatory variables (or independent variables)
If we have only one independent variable, the model is called simple linear
regression; otherwise, multiple linear regression.
Example: Grade Point Average / Average Marks (independent variable) → Salary in the First Job (dependent variable)
Linear Regression
Goal: Find the line such that the total distance from the line to the points is
minimized.
We “fit” the points with a line so that an “objective function” is minimized.
The line we thus obtain minimizes the sum of squared residues (least squares).
Σᵢ (predictedᵢ – actualᵢ)² = Σᵢ (residueᵢ)²
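The least-squares fit above has a closed-form solution; a minimal Python sketch, where the GPA/salary numbers are made up purely for illustration:

```python
# Simple linear regression via least squares: fit y = a + b*x, minimizing
# the sum of squared residues described above. Data is made up.
xs = [6.5, 7.0, 7.8, 8.2, 9.0]   # GPA / average marks
ys = [30, 34, 40, 44, 52]        # salary in the first job (in thousands)

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Slope b and intercept a from the normal equations.
b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
    / sum((x - mean_x) ** 2 for x in xs)
a = mean_y - b * mean_x

residues = [y - (a + b * x) for x, y in zip(xs, ys)]
sse = sum(r ** 2 for r in residues)   # the minimized objective function
print(round(b, 3), round(a, 3), round(sse, 3))
```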
Logistic Regression
A regression model where the dependent variable (DV) is categorical.
Logistic regression is technically a classification technique – do not get confused by
the word “Regression”
(Scatter plot: Assessment Score in Programming Skills vs. Grade Point Average / Average Marks, with points labeled “Got a job” / “Didn’t get job”; a “?” marks a new, unlabeled candidate.)
Linear Regression – Animation
Logistic Regression
Goal: Find the parameters of the logistic curve that best fit the data.
We “fit” the points with an S-shaped (logistic) curve rather than a straight
line; the parameters are typically estimated by maximizing the likelihood of
the data rather than by least squares.
The logistic function
• Ŷᵢ = eᵘ / (1 + eᵘ), where Ŷᵢ is the estimated probability that the i-th case is in a category and u is the regular linear regression equation:
• u = A + B₁X₁ + B₂X₂ + … + B_K X_K
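The logistic function above is a few lines of code; the coefficients below are hypothetical, chosen only to illustrate the formula:

```python
import math

# The logistic function from the slide: Y_hat = e^u / (1 + e^u),
# with u = A + B1*X1 + ... + BK*XK. The coefficients are made up
# purely for illustration, not fitted to any real data.
def logistic(u):
    return math.exp(u) / (1.0 + math.exp(u))

A, B1, B2 = -10.0, 0.8, 0.05          # hypothetical coefficients
gpa, assessment = 8.5, 70             # one candidate's feature values
u = A + B1 * gpa + B2 * assessment
p_job = logistic(u)                   # estimated probability of "got a job"
print(round(p_job, 3))
```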
Supervised Learning - Classification
Nearest Neighbor Approaches
Find k closest training examples, and poll their class values
(Scatter plot: Assessment Score in Programming Skills vs. Grade Point Average / Average Marks, with points labeled “Got a job” / “Didn’t get job”.)
K Nearest Neighbors (k-NN)
k-NN is a type of instance-based learning, or lazy learning, where the
function is only approximated locally and all computation is deferred
until classification.
One of the simplest machine learning algorithms.
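The find-k-closest-and-poll idea fits in a few lines; a sketch with made-up (GPA, assessment score) training points:

```python
from collections import Counter

# Minimal k-NN classifier as described above: find the k closest
# training examples and poll (majority-vote) their class values.
# Points are (GPA, assessment score); the labels are made-up examples.
train = [((9.0, 80), "job"), ((8.5, 75), "job"), ((8.8, 60), "job"),
         ((6.0, 40), "no job"), ((5.5, 55), "no job"), ((6.5, 30), "no job")]

def knn_predict(x, k=3):
    # Sort training points by squared Euclidean distance to x.
    nearest = sorted(train,
                     key=lambda p: (p[0][0] - x[0]) ** 2 + (p[0][1] - x[1]) ** 2)
    votes = Counter(label for _, label in nearest[:k])
    return votes.most_common(1)[0][0]

print(knn_predict((8.7, 70)))   # a point near the "job" cluster
```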
K Nearest Neighbors (k-NN)
Pros:
• Requires no training
• Easy to understand
• Easy for active learning processes
Cons:
• Doesn’t scale up very well for large training sets (requires special implementations like KD-trees)
• ‘k’ is hard to determine
• Requires the full dataset in memory
Decision Trees
Find a model for the class attribute as a function of the values of the other attributes.
(Scatter plots: Assessment Score in Programming Skills vs. Grade Point Average / Average Marks, with points labeled “Got a job” / “Didn’t get job”, before and after splitting.)
Decision Trees
Goal: Build a tree; at each node, split the data on the basis of the one
attribute that provides the best split.
> If Dt contains records that belong to the same class yt,
then t is a leaf node labeled as yt
> If Dt is an empty set, then t is a leaf node labeled by
the default class, yd
> If Dt contains records that belong to more than one
class, use an attribute test to split the data into smaller
subsets. Recursively apply the procedure to each
subset.
Decision Trees
• Determine how to split the records
• How to specify the attribute test condition?
• How to determine the best split?
• Determine when to stop splitting
Decision Trees
When a node p is split into k partitions (children), the quality of the split is computed as

GINI_split = Σ_{i=1..k} (nᵢ / n) · GINI(i)

where nᵢ = number of records at child i, and n = number of records at node p.
An alternate method uses Information Gain and Entropy.
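The weighted GINI_split above can be checked numerically; a sketch with made-up child-node class counts:

```python
# Gini impurity of one node, and the weighted GINI_split from the slide:
# GINI_split = sum_i (n_i / n) * GINI(i). The counts are made-up examples.
def gini(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

# A parent node with 10 records split into two children;
# each child holds [class A count, class B count].
children = [[4, 0], [1, 5]]
n = sum(sum(c) for c in children)
gini_split = sum(sum(c) / n * gini(c) for c in children)
print(round(gini_split, 4))   # pure child contributes 0; impure child is weighted
```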
Decision Trees – Travel Time to Office
(Tree diagram: the root splits on “Leave At” {8 AM, 9 AM, 10 AM}; the 9 AM branch splits on “Accident?” and the 10 AM branch on “Stall?”; the leaves are Short, Medium, and Long commute times.)
Decision Trees – Travel Time to Office
if hour == 8am:
    commute time = Short
else if hour == 9am:
    if accident == yes:
        commute time = Long
    else:
        commute time = Medium
else if hour == 10am:
    if stall == yes:
        commute time = Long
    else:
        commute time = Medium
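The rule set above translates directly into a runnable function; a minimal sketch:

```python
# The commute-time decision tree above, as a runnable function.
def commute_time(hour, accident=False, stall=False):
    if hour == 8:
        return "Short"
    elif hour == 9:
        return "Long" if accident else "Medium"
    elif hour == 10:
        return "Long" if stall else "Medium"
    return None  # an hour the tree was never built for

print(commute_time(9, accident=True))
```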
Decision Trees
Pros:
• Inexpensive to construct
• Extremely fast at classifying unknown records
• Easy to interpret for small-sized trees
Cons:
• Time for building a tree may be higher than for other types of classifiers
• Error propagates as the number of classes increases
Random Forests
An ensemble classifier that contains many decision trees and outputs the class
that is the mode of the classes output by the individual trees.
Random Forests
Pros:
• Runs efficiently on large databases
• Can handle thousands of input variables without variable deletion
• Has methods for balancing error in class-imbalanced data sets
Cons:
• Has been observed to overfit on noisy datasets
• For data including categorical variables with different numbers of levels, random forests are biased in favor of attributes with more levels
Support Vector Machines (SVM)
(Scatter plots: Assessment Score in Programming Skills vs. Grade Point Average / Average Marks, with points labeled “Got a job” / “Didn’t get job”, illustrating possible separating lines and the maximum-margin separator.)
Support Vector Machines (Slide by Andrew W. Moore, CMU)
• Plus-plane = { x : w . x + b = +1 }
• Minus-plane = { x : w . x + b = -1 }
Classify as +1 if w . x + b >= 1, and as -1 if w . x + b <= -1.
What we know:
• w . x+ + b = +1
• w . x- + b = -1
• x+ = x- + λw
• |x+ - x-| = M
Subtracting the first two equations gives w . (x+ - x-) = 2, so λ (w . w) = 2 and λ = 2 / (w . w).
M = Margin Width = |x+ - x-| = |λw| = λ √(w . w) = 2 / √(w . w)
Maximize Σ_{k=1..R} α_k - ½ Σ_{k=1..R} Σ_{l=1..R} α_k α_l Q_kl, where Q_kl = y_k y_l (x_k . x_l)
Subject to these constraints: 0 ≤ α_k ≤ C for all k, and Σ_{k=1..R} α_k y_k = 0
Then define: w = Σ_{k=1..R} α_k y_k x_k
b = y_K (1 - ε_K) - x_K . w, where K = argmax_k α_k
Then classify with: f(x, w, b) = sign(w . x - b)
Slide by Andrew W. Moore, CMU
Support Vector Machines - Kernels
Let’s consider data points in only one dimension for simplicity
The linear classifier relies on the dot product between vectors: K(x_i, x_j) = x_iᵀ x_j
If every data point is mapped into a high-dimensional space via some transformation Φ: x → φ(x), the dot product becomes K(x_i, x_j) = φ(x_i)ᵀ φ(x_j)
A kernel function is a function that corresponds to an inner product in some expanded feature space.
Example:
2-dimensional vectors x = [x_1 x_2]; let K(x_i, x_j) = (1 + x_iᵀ x_j)².
Need to show that K(x_i, x_j) = φ(x_i)ᵀ φ(x_j):
K(x_i, x_j) = (1 + x_iᵀ x_j)²
= 1 + x_i1² x_j1² + 2 x_i1 x_j1 x_i2 x_j2 + x_i2² x_j2² + 2 x_i1 x_j1 + 2 x_i2 x_j2
= [1, x_i1², √2 x_i1 x_i2, x_i2², √2 x_i1, √2 x_i2]ᵀ [1, x_j1², √2 x_j1 x_j2, x_j2², √2 x_j1, √2 x_j2]
= φ(x_i)ᵀ φ(x_j), where φ(x) = [1, x_1², √2 x_1 x_2, x_2², √2 x_1, √2 x_2]
(Figures: the same data, inseparable in 1D, becomes linearly separable after mapping into 3D.)
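The kernel identity above can be verified numerically; a short sketch:

```python
import math
import random

# Numeric check of the kernel identity from the slide:
# (1 + x_i . x_j)^2 == phi(x_i) . phi(x_j), with
# phi(x) = [1, x1^2, sqrt(2)*x1*x2, x2^2, sqrt(2)*x1, sqrt(2)*x2].
def K(xi, xj):
    return (1 + xi[0] * xj[0] + xi[1] * xj[1]) ** 2

def phi(x):
    r2 = math.sqrt(2)
    return [1, x[0] ** 2, r2 * x[0] * x[1], x[1] ** 2, r2 * x[0], r2 * x[1]]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

random.seed(0)
xi = [random.uniform(-2, 2), random.uniform(-2, 2)]
xj = [random.uniform(-2, 2), random.uniform(-2, 2)]
print(abs(K(xi, xj) - dot(phi(xi), phi(xj))) < 1e-9)
```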
Support Vector Machines - Animation
Support Vector Machines
Pros:
• Less prone to overfitting
• Needs less memory to store the predictive model
• Reaches the global optimum, thanks to quadratic programming
• Works well with smaller-sized datasets
Cons:
• Hyperparameter search is important, and complex
• Deep neural networks now perform better than previous SVM-based state-of-the-art solutions
Naïve Bayes
Apply Bayes’ theorem with the “naive” assumption of independence between
every pair of features
Naïve Bayes
Before the evidence is obtained; prior probability
• P(a) the prior probability that the proposition is true
• P(cavity)=0.1
After the evidence is obtained; posterior probability
• P(a|b)
• The probability of a given that all we know is b
• P(cavity|toothache)=0.8
Naïve Bayes
Bayes’ Theorem (Thomas Bayes, 1763):
P(b | a) = P(a | b) P(b) / P(a)
Based on Bayes’ Theorem, we can compute the Maximum A Posteriori (MAP)
hypothesis for the data.
We are interested in the best hypothesis from some space H given observed
training data D.
h_MAP = argmax_{h ∈ H} P(h | D)
      = argmax_{h ∈ H} P(D | h) P(h) / P(D)
      = argmax_{h ∈ H} P(D | h) P(h)
We can drop P(D) because it is independent of the hypothesis.
Naïve Bayes
Training set: instances of different classes c_j, described as conjunctions of
attribute values.
Classify a new instance d, described by attribute values x_1, x_2, …, x_n, into
one of the classes c_j ∈ C.
Key idea: assign the most probable class c_MAP using Bayes’ Theorem:
c_MAP = argmax_{c_j ∈ C} P(c_j | x_1, x_2, …, x_n)
      = argmax_{c_j ∈ C} P(x_1, x_2, …, x_n | c_j) P(c_j) / P(x_1, x_2, …, x_n)
      = argmax_{c_j ∈ C} P(x_1, x_2, …, x_n | c_j) P(c_j)
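The c_MAP rule, combined with the naive independence assumption, is enough for a tiny classifier; a sketch over made-up categorical training rows (no smoothing):

```python
from collections import Counter, defaultdict

# Tiny categorical Naive Bayes following the c_MAP rule above:
# c_MAP = argmax_c P(c) * prod_i P(x_i | c), under the naive
# independence assumption. Rows are made-up (grade band,
# assessment band) -> outcome examples.
train = [(("high", "high"), "job"), (("high", "low"), "job"),
         (("high", "high"), "job"), (("low", "low"), "no job"),
         (("low", "high"), "no job"), (("low", "low"), "no job")]

prior = Counter(c for _, c in train)          # class counts
cond = defaultdict(Counter)                   # cond[(i, c)][value] counts
for x, c in train:
    for i, v in enumerate(x):
        cond[(i, c)][v] += 1

def predict(x):
    def score(c):
        p = prior[c] / len(train)             # P(c)
        for i, v in enumerate(x):
            p *= cond[(i, c)][v] / prior[c]   # P(x_i | c), no smoothing
        return p
    return max(prior, key=score)

print(predict(("high", "high")))
```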
Naïve Bayes
Pros:
• Performs at state-of-the-art level for some use cases
• Performs well on small datasets
• Converges quickly
Cons:
• Not suitable for very large datasets
• Performs poorly if features are correlated
Brainstorm
Which classifier should I use?
Questions
What is the relative importance of each predictor?
How does each variable affect the outcome?
Does a predictor make the solution better or worse or have no effect?
Can parameters be accurately predicted?
How good is the model at classifying cases for which the outcome is known?
What is the prediction equation in the presence of covariates?
Can prediction models be tested for relative fit to the data?
So-called “goodness of fit” statistics
What is the strength of association between the outcome variable and a set of
predictors?
Often in model comparison you want non-significant differences, so strength of
association is reported even for non-significant effects.
Unsupervised Learning
Clustering
Draw inferences from datasets consisting of input data without labeled responses.
Clustering is used for exploratory data analysis to find hidden patterns or grouping
in data
Where is Clustering Used?
Marketing: segment customer behaviors
Banking: fraud detection
Gene Analysis: identify genes responsible for a disease
Image Processing: identifying objects in an image (e.g. face recognition)
Insurance: identify policy holders with high average claim cost
K-Means Algorithm
1. Fix a number k = number of required clusters
K=3
K-Means Algorithm
2. Select K random points in the given space
K=3
K-Means Algorithm
3. For each point, find distance from the ‘k’ centroids, and assign it to the closest centroid
K-Means Algorithm
4. Within each newly formed cluster, find the new centroid
K-Means Algorithm
5. Repeat the process till the centroids are stable (converge)
K-Means Algorithm
6. Obtain final clusters
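Steps 1–6 above can be sketched in a few lines; made-up one-dimensional points keep the sketch short:

```python
import random

# Minimal k-means following steps 1-6 above: pick k random starting
# centroids, assign each point to the closest centroid, recompute the
# centroids, and repeat until they stop moving. Points are made up.
def kmeans(points, k, seed=0):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)            # step 2: k random points
    while True:
        clusters = [[] for _ in range(k)]
        for p in points:                         # step 3: assign to closest
            i = min(range(k), key=lambda c: (p - centroids[c]) ** 2)
            clusters[i].append(p)
        new = [sum(c) / len(c) if c else centroids[i]
               for i, c in enumerate(clusters)]  # step 4: new centroids
        if new == centroids:                     # step 5: stable -> converged
            return new, clusters                 # step 6: final clusters
        centroids = new

points = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]
centroids, clusters = kmeans(points, k=2)
print(sorted(round(c, 2) for c in centroids))
```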
Hands On
What’s Next?
