Analytics Boot Camp Course
Aditya Joshi
Kumar Rishabh
Srinivas Nv Gannavarapu
“The science of examining raw data with the purpose of drawing conclusions about that information.” - TechTarget
Data analytics refers to qualitative and
quantitative techniques and processes used
to enhance productivity and business gain.
Data is extracted and categorized to identify
and analyze behavioral data and patterns,
and techniques vary according to
organizational requirements. - Techopedia
Brainstorm
Where else is Data Analytics used?
Workflow
• Planning, organizing and requirement gathering
• Gathering Data
• Data Cleaning
• Analyzing Data, Predictive Modelling and Result Generation
• Result Presentation
Learning Process (Training): Obtain Data → Extract Features → Train Model
Learning Process (Prediction): New Data → Extract Features → Model → Result
Evaluation
How is data obtained?
Train-Test data
Evaluation methodologies
Precision, Recall, F-Measure
Other methods of evaluation, Confusion Matrix
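These measures fall out of a confusion matrix directly; a minimal Python sketch, using made-up actual and predicted labels purely for illustration:

```python
# Precision, recall and F-measure from confusion-matrix counts, for the
# evaluation methodologies listed above. Labels are made-up examples.
actual    = ["pos", "pos", "pos", "neg", "neg", "pos", "neg", "pos"]
predicted = ["pos", "neg", "pos", "neg", "pos", "pos", "neg", "neg"]

tp = sum(a == "pos" and p == "pos" for a, p in zip(actual, predicted))
fp = sum(a == "neg" and p == "pos" for a, p in zip(actual, predicted))
fn = sum(a == "pos" and p == "neg" for a, p in zip(actual, predicted))
tn = sum(a == "neg" and p == "neg" for a, p in zip(actual, predicted))

precision = tp / (tp + fp)          # of the predicted positives, how many were right
recall = tp / (tp + fn)             # of the actual positives, how many were found
f_measure = 2 * precision * recall / (precision + recall)
print(tp, fp, fn, tn, round(precision, 3), round(recall, 3), round(f_measure, 3))
```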
Data Attributes Types
Nominal, Categorical, Ordinal, Continuous
Tools We Use
R Programming Language
Pros:
• Best suited for data-oriented problems
• Strong package ecosystem
• Graphics and charting
• Simple learning curve
Cons:
• Memory management, speed, and efficiency
• Not a complete programming language
• Not for advanced programmers
Matlab
Pros:
• Contains a lot of advanced toolboxes
• Good documentation
• Good customer support
Cons:
• Expensive
• Not as much open source code available, because Matlab requires a license
• Cannot integrate your code into a web service
Python
Pros:
• Great data analysis libraries (pandas, statsmodels)
• Code can be easily integrated into a web service
• Simple and easy to begin
Cons:
• A lot of cutting-edge, advanced academic research is still being done in R/Matlab
• Advanced features of the language might offer a steep learning curve for newcomers
SAS
Pros:
• Easy to learn
• Dedicated customer service along with the community
• Simple learning curve
Cons:
• Expensive enterprise tool
• Limited functionality
Julia
Pros:
• High performance, efficient
• Good for writing computationally intensive programs that use multiple CPUs
• Decent visualization capabilities
Cons:
• Relatively new; growing community
• Not syntactically optimized for statistical operations on data arrays
Brainstorm
What’s the Best Language?
The Big Blue
Cloud Computing, Bluemix
and Analytics
SaaS – PaaS – IaaS
IBM Bluemix
Platform as a Service
Zero Infrastructure, Lower Risk
Lower cost and improved profitability
Easy and quick development, Monetize quickly
Reusable code and business logic
Integration with other web services
Google App Engine
Heroku
IBM Bluemix
Openshift
Cloud Foundry
…
PaaS Offerings
https://www.zoho.com/creator/paas.html
Bluemix Offerings
Storage • Analytics • Watson • Mobile • IoT • Containers • and much more
IBM Bluemix Data & Analytics
Data Storage:
• Cloudant NoSQL DB
• Redis
• IBM dashDB
Graph Processing:
• IBM Graph
Number Crunching:
• IBM Analytics for Apache Spark
Why Bluemix?
Cloud Service
Ready to Use Platform
Because it’s IBM
Open Source Tools
DashDB
Data Warehousing and Analytics
• Relational Data
• Special data types, e.g., geospatial data
Data Analysis
• SQL Access
• Advanced Built-in Analytics
• R Studio
Performance
• IBM’s BLU Acceleration
• processes terabytes of data extremely quickly
DashDB
Database vs. Data Warehouse

Database:
• Collection of data organized for storage, accessibility, and retrieval.
• OLTP.
• Very quick transaction processing for a single application.
• Optimized for performing read-write operations on single-point transactions.
• Analysis is hard because of slow multiple joins; queries become very complex.

Data Warehouse:
• Integrated copies of transaction data from multiple sources, optimized for analytical use.
• OLTP + OLAP.
• Data from multiple applications, used to guide analysis and decision-making.
• Optimized for efficiently reading/retrieving large data sets and for aggregating data.
• Analysis queries are easy to form and execute faster.
Let’s Do It
Getting Started
Setup and Basics
Basics of R Programming
Learning further with swirl
Need a Break?
Overview of Machine Learning
What is Machine Learning?
A computer program is said to learn from
experience E with respect to some class of
tasks T and performance measure P if its
performance at tasks in T, as measured by
P, improves with experience E
- Tom Mitchell
Supervised Learning - Regression
Linear Regression
modeling the relationship between a scalar dependent variable and one or more
explanatory variables (or independent variables)
If we have only one independent variable, the model is called simple linear
regression; otherwise, multiple linear regression.
Example: Grade Point Average / Average Marks (independent variable) → Salary in the First Job (dependent variable)
Linear Regression
Goal: Find the line such that the total distance from the line to the points is
minimized.
We “fit” the points with a line so that an “objective function” is minimized.
The line we thus obtain minimizes the sum of squared residues (least squares).
Σᵢ (predictedᵢ – actualᵢ)² = Σᵢ (residueᵢ)²
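The least-squares fit above has a closed-form solution; a minimal Python sketch, where the GPA/salary numbers are made up purely for illustration:

```python
# Simple linear regression via least squares: fit y = a + b*x, minimizing
# the sum of squared residues described above. Data is made up.
xs = [6.5, 7.0, 7.8, 8.2, 9.0]   # GPA / average marks
ys = [30, 34, 40, 44, 52]        # salary in the first job (in thousands)

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Slope b and intercept a from the normal equations.
b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
    / sum((x - mean_x) ** 2 for x in xs)
a = mean_y - b * mean_x

residues = [y - (a + b * x) for x, y in zip(xs, ys)]
sse = sum(r ** 2 for r in residues)   # the minimized objective function
print(round(b, 3), round(a, 3), round(sse, 3))
```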
Logistic Regression
A regression model where the dependent variable (DV) is categorical.
Logistic regression is technically a classification technique – do not get confused by
the word “Regression”
(Scatter plot: Assessment Score in Programming Skills vs. Grade Point Average / Average Marks, with points labeled “Got a job” / “Didn’t get job”; a “?” marks a new, unlabeled candidate.)
Linear Regression – Animation
Logistic Regression
Goal: Find the parameters of the logistic curve that best fit the data.
We “fit” the points with an S-shaped (logistic) curve rather than a straight
line; the parameters are typically estimated by maximizing the likelihood of
the data rather than by least squares.
The logistic function
• Ŷᵢ = eᵘ / (1 + eᵘ), where Ŷᵢ is the estimated probability that the i-th case is in a category and u is the regular linear regression equation:
• u = A + B₁X₁ + B₂X₂ + … + B_K X_K
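The logistic function above is a few lines of code; the coefficients below are hypothetical, chosen only to illustrate the formula:

```python
import math

# The logistic function from the slide: Y_hat = e^u / (1 + e^u),
# with u = A + B1*X1 + ... + BK*XK. The coefficients are made up
# purely for illustration, not fitted to any real data.
def logistic(u):
    return math.exp(u) / (1.0 + math.exp(u))

A, B1, B2 = -10.0, 0.8, 0.05          # hypothetical coefficients
gpa, assessment = 8.5, 70             # one candidate's feature values
u = A + B1 * gpa + B2 * assessment
p_job = logistic(u)                   # estimated probability of "got a job"
print(round(p_job, 3))
```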
Supervised Learning - Classification
Nearest Neighbor Approaches
Find k closest training examples, and poll their class values
(Scatter plot: Assessment Score in Programming Skills vs. Grade Point Average / Average Marks, with points labeled “Got a job” / “Didn’t get job”.)
K Nearest Neighbors (k-NN)
k-NN is a type of instance-based learning, or lazy learning, where the
function is only approximated locally and all computation is deferred
until classification.
One of the simplest machine learning algorithms.
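The find-k-closest-and-poll idea fits in a few lines; a sketch with made-up (GPA, assessment score) training points:

```python
from collections import Counter

# Minimal k-NN classifier as described above: find the k closest
# training examples and poll (majority-vote) their class values.
# Points are (GPA, assessment score); the labels are made-up examples.
train = [((9.0, 80), "job"), ((8.5, 75), "job"), ((8.8, 60), "job"),
         ((6.0, 40), "no job"), ((5.5, 55), "no job"), ((6.5, 30), "no job")]

def knn_predict(x, k=3):
    # Sort training points by squared Euclidean distance to x.
    nearest = sorted(train,
                     key=lambda p: (p[0][0] - x[0]) ** 2 + (p[0][1] - x[1]) ** 2)
    votes = Counter(label for _, label in nearest[:k])
    return votes.most_common(1)[0][0]

print(knn_predict((8.7, 70)))   # a point near the "job" cluster
```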
K Nearest Neighbors (k-NN)
Pros:
• Requires no training
• Easy to understand
• Easy for active learning processes
Cons:
• Doesn’t scale up very well for large training sets (requires special implementations like KD-trees)
• ‘k’ is hard to determine
• Requires the full dataset in memory
Decision Trees
Find a model for the class attribute as a function of the values of the other attributes.
(Scatter plots: Assessment Score in Programming Skills vs. Grade Point Average / Average Marks, with points labeled “Got a job” / “Didn’t get job”, before and after splitting.)
Decision Trees
Goal: Build a tree; at each node, split the data on the basis of the one
attribute that provides the best split.
> If Dt contains records that belong to the same class yt,
then t is a leaf node labeled as yt
> If Dt is an empty set, then t is a leaf node labeled by
the default class, yd
> If Dt contains records that belong to more than one
class, use an attribute test to split the data into smaller
subsets. Recursively apply the procedure to each
subset.
Decision Trees
• Determine how to split the records
• How to specify the attribute test condition?
• How to determine the best split?
• Determine when to stop splitting
Decision Trees
When a node p is split into k partitions (children), the quality of the split is computed as

GINI_split = Σ_{i=1..k} (nᵢ / n) · GINI(i)

where nᵢ = number of records at child i, and n = number of records at node p.
An alternate method uses Information Gain and Entropy.
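The weighted GINI_split above can be checked numerically; a sketch with made-up child-node class counts:

```python
# Gini impurity of one node, and the weighted GINI_split from the slide:
# GINI_split = sum_i (n_i / n) * GINI(i). The counts are made-up examples.
def gini(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

# A parent node with 10 records split into two children;
# each child holds [class A count, class B count].
children = [[4, 0], [1, 5]]
n = sum(sum(c) for c in children)
gini_split = sum(sum(c) / n * gini(c) for c in children)
print(round(gini_split, 4))   # pure child contributes 0; impure child is weighted
```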
Decision Trees – Travel Time to Office
(Tree diagram: the root splits on “Leave At” {8 AM, 9 AM, 10 AM}; the 9 AM branch splits on “Accident?” and the 10 AM branch on “Stall?”; the leaves are Short, Medium, and Long commute times.)
Decision Trees – Travel Time to Office
if hour == 8am:
    commute time = Short
else if hour == 9am:
    if accident == yes:
        commute time = Long
    else:
        commute time = Medium
else if hour == 10am:
    if stall == yes:
        commute time = Long
    else:
        commute time = Medium
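The rule set above translates directly into a runnable function; a minimal sketch:

```python
# The commute-time decision tree above, as a runnable function.
def commute_time(hour, accident=False, stall=False):
    if hour == 8:
        return "Short"
    elif hour == 9:
        return "Long" if accident else "Medium"
    elif hour == 10:
        return "Long" if stall else "Medium"
    return None  # an hour the tree was never built for

print(commute_time(9, accident=True))
```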
Decision Trees
Pros:
• Inexpensive to construct
• Extremely fast at classifying unknown records
• Easy to interpret for small-sized trees
Cons:
• Time for building a tree may be higher than for other types of classifiers
• Error propagates as the number of classes increases
Random Forests
An ensemble classifier that contains many decision trees and outputs the class
that is the mode of the classes output by the individual trees.
Random Forests
Pros:
• Runs efficiently on large databases
• Can handle thousands of input variables without variable deletion
• Has methods for balancing error in class-imbalanced data sets
Cons:
• Has been observed to overfit on noisy datasets
• For data including categorical variables with different numbers of levels, random forests are biased in favor of attributes with more levels
Support Vector Machines (SVM)
(Scatter plots: Assessment Score in Programming Skills vs. Grade Point Average / Average Marks, with points labeled “Got a job” / “Didn’t get job”, illustrating possible separating lines and the maximum-margin separator.)
Support Vector Machines (Slide by Andrew W. Moore, CMU)
• Plus-plane = { x : w . x + b = +1 }
• Minus-plane = { x : w . x + b = -1 }
Classify as +1 if w . x + b >= 1, and as -1 if w . x + b <= -1.
What we know:
• w . x+ + b = +1
• w . x- + b = -1
• x+ = x- + λw
• |x+ - x-| = M
Subtracting the first two equations gives w . (x+ - x-) = 2, so λ (w . w) = 2 and λ = 2 / (w . w).
M = Margin Width = |x+ - x-| = |λw| = λ √(w . w) = 2 / √(w . w)
Maximize Σ_{k=1..R} α_k - ½ Σ_{k=1..R} Σ_{l=1..R} α_k α_l Q_kl, where Q_kl = y_k y_l (x_k . x_l)
Subject to these constraints: 0 ≤ α_k ≤ C for all k, and Σ_{k=1..R} α_k y_k = 0
Then define: w = Σ_{k=1..R} α_k y_k x_k
b = y_K (1 - ε_K) - x_K . w, where K = argmax_k α_k
Then classify with: f(x, w, b) = sign(w . x - b)
Slide by Andrew W. Moore, CMU
Support Vector Machines - Kernels
Let’s consider data points in only one dimension for simplicity
The linear classifier relies on the dot product between vectors: K(x_i, x_j) = x_iᵀ x_j
If every data point is mapped into a high-dimensional space via some transformation Φ: x → φ(x), the dot product becomes K(x_i, x_j) = φ(x_i)ᵀ φ(x_j)
A kernel function is a function that corresponds to an inner product in some expanded feature space.
Example:
2-dimensional vectors x = [x_1 x_2]; let K(x_i, x_j) = (1 + x_iᵀ x_j)².
Need to show that K(x_i, x_j) = φ(x_i)ᵀ φ(x_j):
K(x_i, x_j) = (1 + x_iᵀ x_j)²
= 1 + x_i1² x_j1² + 2 x_i1 x_j1 x_i2 x_j2 + x_i2² x_j2² + 2 x_i1 x_j1 + 2 x_i2 x_j2
= [1, x_i1², √2 x_i1 x_i2, x_i2², √2 x_i1, √2 x_i2]ᵀ [1, x_j1², √2 x_j1 x_j2, x_j2², √2 x_j1, √2 x_j2]
= φ(x_i)ᵀ φ(x_j), where φ(x) = [1, x_1², √2 x_1 x_2, x_2², √2 x_1, √2 x_2]
(Figures: the same data, inseparable in 1D, becomes linearly separable after mapping into 3D.)
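The kernel identity above can be verified numerically; a short sketch:

```python
import math
import random

# Numeric check of the kernel identity from the slide:
# (1 + x_i . x_j)^2 == phi(x_i) . phi(x_j), with
# phi(x) = [1, x1^2, sqrt(2)*x1*x2, x2^2, sqrt(2)*x1, sqrt(2)*x2].
def K(xi, xj):
    return (1 + xi[0] * xj[0] + xi[1] * xj[1]) ** 2

def phi(x):
    r2 = math.sqrt(2)
    return [1, x[0] ** 2, r2 * x[0] * x[1], x[1] ** 2, r2 * x[0], r2 * x[1]]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

random.seed(0)
xi = [random.uniform(-2, 2), random.uniform(-2, 2)]
xj = [random.uniform(-2, 2), random.uniform(-2, 2)]
print(abs(K(xi, xj) - dot(phi(xi), phi(xj))) < 1e-9)
```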
Support Vector Machines - Animation
Support Vector Machines
Pros:
• Less prone to overfitting
• Needs less memory to store the predictive model
• Reaches the global optimum, thanks to quadratic programming
• Works well with smaller-sized datasets
Cons:
• Hyperparameter search is important, and complex
• Deep neural networks now perform better than previous SVM-based state-of-the-art solutions
Naïve Bayes
Apply Bayes’ theorem with the “naive” assumption of independence between
every pair of features
Naïve Bayes
Before the evidence is obtained; prior probability
• P(a) the prior probability that the proposition is true
• P(cavity)=0.1
After the evidence is obtained; posterior probability
• P(a|b)
• The probability of a given that all we know is b
• P(cavity|toothache)=0.8
Naïve Bayes
Bayes’ Theorem (Thomas Bayes, 1763):
P(b | a) = P(a | b) P(b) / P(a)
Based on Bayes’ Theorem, we can compute the Maximum A Posteriori (MAP)
hypothesis for the data.
We are interested in the best hypothesis from some space H given observed
training data D.
h_MAP = argmax_{h ∈ H} P(h | D)
      = argmax_{h ∈ H} P(D | h) P(h) / P(D)
      = argmax_{h ∈ H} P(D | h) P(h)
We can drop P(D) because it is independent of the hypothesis.
Naïve Bayes
Training set: instances of different classes c_j, described as conjunctions of
attribute values.
Classify a new instance d, described by attribute values x_1, x_2, …, x_n, into
one of the classes c_j ∈ C.
Key idea: assign the most probable class c_MAP using Bayes’ Theorem:
c_MAP = argmax_{c_j ∈ C} P(c_j | x_1, x_2, …, x_n)
      = argmax_{c_j ∈ C} P(x_1, x_2, …, x_n | c_j) P(c_j) / P(x_1, x_2, …, x_n)
      = argmax_{c_j ∈ C} P(x_1, x_2, …, x_n | c_j) P(c_j)
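The c_MAP rule, combined with the naive independence assumption, is enough for a tiny classifier; a sketch over made-up categorical training rows (no smoothing):

```python
from collections import Counter, defaultdict

# Tiny categorical Naive Bayes following the c_MAP rule above:
# c_MAP = argmax_c P(c) * prod_i P(x_i | c), under the naive
# independence assumption. Rows are made-up (grade band,
# assessment band) -> outcome examples.
train = [(("high", "high"), "job"), (("high", "low"), "job"),
         (("high", "high"), "job"), (("low", "low"), "no job"),
         (("low", "high"), "no job"), (("low", "low"), "no job")]

prior = Counter(c for _, c in train)          # class counts
cond = defaultdict(Counter)                   # cond[(i, c)][value] counts
for x, c in train:
    for i, v in enumerate(x):
        cond[(i, c)][v] += 1

def predict(x):
    def score(c):
        p = prior[c] / len(train)             # P(c)
        for i, v in enumerate(x):
            p *= cond[(i, c)][v] / prior[c]   # P(x_i | c), no smoothing
        return p
    return max(prior, key=score)

print(predict(("high", "high")))
```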
Naïve Bayes
Pros:
• Performs at state-of-the-art level for some use cases
• Performs well on small datasets
• Converges quickly
Cons:
• Not suitable for very large datasets
• Performs poorly if features are correlated
Brainstorm
Which classifier should I use?
Questions
What is the relative importance of each predictor?
How does each variable affect the outcome?
Does a predictor make the solution better or worse or have no effect?
Can parameters be accurately predicted?
How good is the model at classifying cases for which the outcome is known?
What is the prediction equation in the presence of covariates?
Can prediction models be tested for relative fit to the data?
So-called “goodness of fit” statistics
What is the strength of association between the outcome variable and a set of
predictors?
Often in model comparison you want non-significant differences, so strength of
association is reported even for non-significant effects.
Unsupervised Learning
Clustering
Draw inferences from datasets consisting of input data without labeled responses.
Clustering is used for exploratory data analysis to find hidden patterns or grouping
in data
Where is Clustering Used?
Marketing: segment customer behaviors
Banking: fraud detection
Gene Analysis: identify genes responsible for a disease
Image Processing: identifying objects in an image (e.g. face recognition)
Insurance: identify policy holders with high average claim cost
K-Means Algorithm
1. Fix a number k = number of required clusters
K=3
K-Means Algorithm
2. Select K random points in the given space
K=3
K-Means Algorithm
3. For each point, find distance from the ‘k’ centroids, and assign it to the closest centroid
K-Means Algorithm
4. Within each newly formed cluster, find the new centroid
K-Means Algorithm
5. Repeat the process till the centroids are stable (converge)
K-Means Algorithm
6. Obtain final clusters
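Steps 1–6 above can be sketched in a few lines; made-up one-dimensional points keep the sketch short:

```python
import random

# Minimal k-means following steps 1-6 above: pick k random starting
# centroids, assign each point to the closest centroid, recompute the
# centroids, and repeat until they stop moving. Points are made up.
def kmeans(points, k, seed=0):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)            # step 2: k random points
    while True:
        clusters = [[] for _ in range(k)]
        for p in points:                         # step 3: assign to closest
            i = min(range(k), key=lambda c: (p - centroids[c]) ** 2)
            clusters[i].append(p)
        new = [sum(c) / len(c) if c else centroids[i]
               for i, c in enumerate(clusters)]  # step 4: new centroids
        if new == centroids:                     # step 5: stable -> converged
            return new, clusters                 # step 6: final clusters
        centroids = new

points = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]
centroids, clusters = kmeans(points, k=2)
print(sorted(round(c, 2) for c in centroids))
```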
Hands On
What’s Next?
