Evolution of Predictive
Analytics
What is Predictive Analytics?
Predictive analytics is the practice of extracting
insights from an existing data set with the help of
data mining, statistical modeling and machine
learning techniques, and using them to predict
unobserved/unknown events.
 Identifying cause-effect relationships across the
variables from the historical data.
 Discovering hidden insights and patterns with
the help of data mining techniques.
 Applying observed patterns to unknowns in the
past, present or future.
Predictive Analytics Process Cycle (figure)
Analytics & Predictive Analytics
 Analytics is the understanding of existing
(retrospective) data with the goal of
understanding trends via comparison
 Developing analytics is the first step towards
deriving predictive analytics
 Predictive Analytics are more sophisticated analytics
that are "forward thinking" in nature
 used for gaining insights from mathematical
and/or financial modeling by enhancing
understanding, interpretation and judgment for
the purpose of good decision making
Comparative Study: Analytics and Predictive Analytics

Attribute     | Analytics                                 | Predictive Analytics
Purpose       | Understand the past; observe trends;      | Gain insights; make decisions;
              | catalyst for discussion                   | take action
View          | Historical and current                    | Future oriented
Metrics Type  | Lagging indicators                        | Leading indicators
Data Used     | Raw & compiled                            | Information
Data Type     | Structured                                | Structured and unstructured
Users         | Middle & senior mgt; analysts, end users  | C-level & senior mgt;
              |                                           | strategists, analysts, mgrs
Benefits      | Gaining an understanding of data;         | Gaining information & insights;
              | productivity improvements                 | process improvements
Benefits
Benefits of Analytics:
– Productivity gains through improved data-gathering processes
– Less time required for producing reports and metrics
– Beneficial, but not scalable and not repeatable
Benefits of Predictive Analytics:
– Process improvement gains through improved revenue generation & cost structures
– Enhanced decision making
– Beneficial, scalable, repeatable
Common Predictive Analytics
• Regression:
 Predicting output variable using its cause-effect
relationship with input variables. OLS Regression, GLM,
Random forests, ANN etc.
• Classification:
Predicting the item class. Decision Tree, Logistic
Regression, ANN, SVM, Naïve Bayes classifier etc.
• Time Series Forecasting:
Predicting future time events given past history. AR,
MA, ARIMA, Triple Exponential Smoothing, Holt-
Winters etc.
Common Predictive Analytics
• Association rule mining:
Mining items occurring together. Apriori Algorithm.
• Clustering:
Finding natural groups or clusters in the data. K-means,
hierarchical, spectral, density-based, EM-algorithm
clustering, etc.
• Text mining:
Model and structure the information content of
textual sources. Sentiment Analysis, NLP
Regression
 There are three main types of regression models:
 linear regression,
 polynomial or multiple regression, and
 logistic (or log-linear) regression.
Linear Regression
• The simplest form of regression to visualize is
linear regression with a single predictor. A
linear regression technique can be used if the
relationship between x and y can be
approximated with a straight line
Y = α + βX
Nonlinear Regression
• When the relationship between x and y cannot be
approximated with a straight line, a nonlinear
regression technique may be used. Alternatively,
the data could be preprocessed to make the
relationship linear.
Multivariate Regression
• Multivariate regression refers to regression
with multiple predictors (x1 , x2 , ..., xn). For
purposes of illustration:
Y = b0 + b1X1 + b2X2
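A minimal sketch of fitting such a model with ordinary least squares in NumPy; the data values below are made up purely for illustration.

    import numpy as np

    # Hypothetical data: two predictors x1, x2 and a response y
    X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 5.0]])
    y = np.array([3.1, 3.9, 7.2, 8.1, 10.0])

    # Add an intercept column so the model is y = b0 + b1*x1 + b2*x2
    X1 = np.column_stack([np.ones(len(X)), X])

    # Solve the least-squares problem for [b0, b1, b2]
    coef, *_ = np.linalg.lstsq(X1, y, rcond=None)
    b0, b1, b2 = coef
    print(f"Y = {b0:.2f} + {b1:.2f}*X1 + {b2:.2f}*X2")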
Decision Tree Induction
• A decision tree is a structure that includes a
root node, branches, and leaf nodes.
• Each internal node denotes a test on an
attribute,
• Each branch denotes the outcome of a test,
and
• Each leaf node holds a class label.
• The topmost node in the tree is the root node.
Balanced Decision Tree
Decision Tree Issues
• Choosing Splitting Attributes
• Ordering of Splitting Attributes
• Splits (No. of Splits to take)
• Tree Structure (few levels are required for a
balanced tree)
• Stopping Criteria
– the tree stops growing once the training data is perfectly classified
– stopping earlier can help avoid overfitting
• Training Data
– Neither too small nor too big
• Pruning – improve the tree by removing sub-trees when required
Entropy
• Entropy is the measure of disorder in a data set,
or the amount of uncertainty in the data set, H(S):
H(S) = − Σ_{x∈X} p(x) log₂ p(x)
• S – the current data set for which entropy is
being calculated (changes for every iteration of
the ID3 algorithm)
• X – the set of classes in S
• p(x) – the ratio of the number of elements in
class x to the number of elements of set S
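A minimal sketch of this formula in Python; `labels` is a hypothetical list of class labels for the current data set S.

    import math
    from collections import Counter

    def entropy(labels):
        """H(S) = -sum over classes x of p(x) * log2(p(x))."""
        n = len(labels)
        counts = Counter(labels)
        return -sum((c / n) * math.log2(c / n) for c in counts.values())

    # 24 stars and 25 diamonds, as in the partition example that follows
    print(entropy(["star"] * 24 + ["diamond"] * 25))  # close to 1.0 (near 50/50)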
Here our partition is on colour, not on shape. After the partition we have
24 stars and 25 diamonds, i.e., 24 + 25 = 49 items in total.
Entropy
• Entropy is calculated for all stars and all
diamonds
Information Gain
• Information gain is a measure of the decrease
of disorder achieved by partitioning the
original data set
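A minimal sketch of information gain in Python, under the assumption that the split is expressed as a list of partitions of the parent's labels; the entropy helper repeats the formula sketched earlier.

    import math
    from collections import Counter

    def entropy(labels):
        n = len(labels)
        return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

    def information_gain(parent_labels, partitions):
        """Entropy of the parent minus the weighted entropy of the partitions."""
        n = len(parent_labels)
        weighted = sum(len(p) / n * entropy(p) for p in partitions)
        return entropy(parent_labels) - weighted

    # Hypothetical split: a perfectly separating partition gives maximal gain
    parent = ["star"] * 24 + ["diamond"] * 25
    print(information_gain(parent, [["star"] * 24, ["diamond"] * 25]))  # ~1.0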
CART
• Create Binary Tree
• Uses entropy
• Formula to choose split point, s, for node t:
Φ(s|t) = 2 P_L P_R Σ_j |P(C_j|t_L) − P(C_j|t_R)|
• P_L, P_R – the probability that a tuple in the training set will
be on the left or right side of the tree.
CART Example
• At the start, there are six choices for split
point (right branch on equality):
– P(Gender)=2(6/15)(9/15)(2/15 + 4/15 + 3/15)=0.224
– P(1.6) = 0
– P(1.7) = 2(2/15)(13/15)(0 + 8/15 + 3/15) = 0.169
– P(1.8) = 2(5/15)(10/15)(4/15 + 6/15 + 3/15) = 0.385
– P(1.9) = 2(9/15)(6/15)(4/15 + 2/15 + 3/15) = 0.256
– P(2.0) = 2(12/15)(3/15)(4/15 + 8/15 + 3/15) = 0.32
• Split at 1.8
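A minimal sketch reproducing the goodness-of-split numbers above, assuming the formula Φ(s|t) = 2·P_L·P_R·Σ|P(C_j|t_L) − P(C_j|t_R)|; the per-class difference terms are taken directly from the example.

    def goodness(p_left, p_right, class_prob_diffs):
        # 2 * P_L * P_R * sum of per-class probability differences
        return 2 * p_left * p_right * sum(class_prob_diffs)

    # Split at height 1.8: 5 tuples go left, 10 go right, out of 15
    print(goodness(5 / 15, 10 / 15, [4 / 15, 6 / 15, 3 / 15]))  # ~0.385, the maximum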
What is a Neural Network
• A computer system modeled on the human
brain and nervous system
• A NN is usually organized in layers
• Layers are made up of a number of interconnected nodes
• The connection strengths between neurons are
called weights, which are used to store the
acquired information (training examples)
Contd..Neural Networks
• During the learning process the weights are
modified in order to model the particular
learning task correctly on the training
examples.
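A minimal sketch of this idea: a single-neuron perceptron whose weights are modified on the training examples until the task (here, a hypothetical AND function) is modeled correctly.

    import numpy as np

    X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
    y = np.array([0, 0, 0, 1])            # AND function as the learning task
    w = np.zeros(2); b = 0.0; lr = 0.1    # weights store the acquired information

    for _ in range(20):                   # repeat over the training examples
        for xi, ti in zip(X, y):
            pred = int(w @ xi + b > 0)
            w += lr * (ti - pred) * xi    # modify weights on errors
            b += lr * (ti - pred)

    print(w, b)                           # learned weights for AND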
Propagation
(Figure: a tuple's input values propagate through the network to produce an output.)
CLASSIFICATION
(RULE BASED)
Classification Using Rules
• Perform classification using If-Then rules
• Classification Rule: r = <a,c>
Antecedent, Consequent
• May be generated from other techniques
(DT, NN) or generated directly.
• Algorithms: Gen, RX, 1R, PRISM
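A minimal sketch of rule-based classification: each rule is an <antecedent, consequent> pair, and the first rule whose antecedent matches the record fires. The predicates and labels are hypothetical.

    rules = [
        (lambda r: r["height"] < 1.7, "short"),
        (lambda r: r["height"] >= 2.0, "tall"),
        (lambda r: True, "medium"),          # default rule
    ]

    def classify(record):
        # Fire the first rule whose antecedent matches the record
        for antecedent, consequent in rules:
            if antecedent(record):
                return consequent

    print(classify({"height": 1.8}))  # -> "medium"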
Generating Rules Example
CLASSIFICATION TREES
(ENSEMBLE METHODS)
Ensemble Methods
• In statistical mechanics, an ensemble is a collection
of a large number of replicas (or mental/virtual
copies) of the microstate of a system under a
macroscopic condition;
or
• ensembling is a technique of combining two
or more algorithms of similar or dissimilar
types, called base learners,
• that incorporates the predictions from those
base learners
• E.g.: the final decision on a candidate depends on
the feedback of all the interviewers across the
various rounds
Ensemble Methods
• Types of Ensembling (3 types):
–Averaging
• Taking the average of the predictions from
the models in a regression problem
• Averaging the predicted probabilities in a
classification problem
Ensemble Methods
• Types of Ensembling
–Majority vote
• Taking the prediction with the maximum number
of votes across the recommendations from
multiple models
Ensemble Methods
• Types of Ensembling
–Weighted Average:
• Different weights are applied to the predictions
from multiple models
• The weighted average is then taken, which means
giving higher or lower importance to a specific
model in the output
Ensemble Modeling Techniques
• Mostly 3 Techniques are used. They are
–Bagging
–Boosting
–Stacking
Ensemble Modeling Techniques
• Mostly 3 Techniques are used. They are
–Bagging
• Referred to as bootstrap aggregation
• BootStrap or Bootstrapping
–It is a sampling technique
–We choose ‘n’ observations from the
original dataset
–The probability of selecting each row
from the dataset is equal for all in each
iteration
Ensemble Modeling Techniques
• To build a bootstrap sample we choose one row at a time
• Here Row 2 has been selected
Ensemble Modeling Techniques
• Row 2 still exists in the data even after being selected
into the bootstrapped sample, because sampling is done
with replacement
• A row can therefore be drawn more than once: here Row 1
is selected from the data into the bootstrapped sample again
Ensemble Modeling Techniques
• Now the bootstrapped samples are ready for growing trees
• The trees grown on these samples use majority vote or
averaging to get the final prediction
• Bagging is mainly used to reduce variance
– Eg: Random Forest
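A minimal sketch of bagging: draw bootstrap samples with replacement (so a row may appear more than once or not at all), train one base learner per sample, and aggregate by majority vote. The base learners here are hypothetical stand-ins.

    import random
    from collections import Counter

    def bootstrap(data):
        # Sample len(data) rows with replacement; each row equally likely each draw
        return [random.choice(data) for _ in range(len(data))]

    def bagging_predict(models, x):
        votes = Counter(m(x) for m in models)
        return votes.most_common(1)[0][0]   # majority vote

    # Toy usage: three hypothetical base learners voting on a label
    models = [lambda x: "yes", lambda x: "no", lambda x: "yes"]
    print(bagging_predict(models, x=None))  # -> "yes"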
Ensemble Modeling Techniques
• Mostly 3 Techniques are used. They are
–Boosting
• A sequential technique in which the first
algorithm is trained on the entire dataset
• Subsequent algorithms are built by fitting
the residuals of the previous algorithm,
• giving higher weight to the observations
that the previous model predicted poorly
Ensemble Modeling Techniques
–E.g.: XGBoost, GBM, AdaBoost, etc.
• F1(x) = F0(x) + λ0·h0(x) = 5008.3 + 0.5 × 4991.6 = 7504.1
• So this person earns $7504.1 per month according to
our model.
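A minimal sketch of the boosting update shown above: the new model F1 adds a shrunken correction h0 (fit to the residuals) to the previous prediction F0. The numbers are from the example.

    F0 = 5008.3          # prediction of the initial model
    h0 = 4991.6          # correction fit to the residuals of F0
    lam = 0.5            # learning rate (shrinkage), lambda_0

    F1 = F0 + lam * h0
    print(round(F1, 1))  # 7504.1 dollars per month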
Ensemble Modeling Techniques
• Third Technique
–Stacking has two layers of machine learning:
• the bottom-layer models d1, d2, d3 receive the
original input features (x)
• the top-layer model f() takes the outputs of the
bottom layers (d1, d2, d3) and predicts the
output (y)
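A minimal sketch of this two-layer structure; the base learners d1, d2, d3 and the top-layer combiner f are hypothetical stand-ins.

    # Bottom layer: three base learners that see the original features x
    def d1(x): return x[0] * 0.5
    def d2(x): return x[1] * 2.0
    def d3(x): return sum(x) / len(x)

    def f(level1_outputs):
        # Top layer: here simply a weighted average of the base predictions
        weights = [0.5, 0.3, 0.2]
        return sum(w * p for w, p in zip(weights, level1_outputs))

    def stacked_predict(x):
        return f([d1(x), d2(x), d3(x)])

    print(stacked_predict([2.0, 3.0]))  # 0.5*1.0 + 0.3*6.0 + 0.2*2.5 = 2.8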
Ensemble Modeling Techniques
• Key Principles for selecting models
– The individual models fulfill particular accuracy
criteria.
– The model predictions of various individual
models are not highly correlated with the
predictions of other models
– Note: This top layer model can also be
replaced by many other simpler formulas like:
• Averaging
• Majority vote
• Weighted Average
Ensemble Modeling Techniques
• Advantages
–Proven method of improving accuracy
–Key ingredient for winning almost all machine
learning hackathons (a hackathon is a large
gathering where programmers code intensively
over a short period of time)
–Makes the model robust and stable, with decent
performance on test cases
–Ensembling can be used to capture both simple
linear and complex non-linear relationships in
the data
Ensemble Modeling Techniques
• Disadvantages
–Reduces model interpretability
–Difficult to draw crucial business insights
at the end
–Time-consuming, so not suitable for
real-time scenarios
–Selecting the models for an ensemble is difficult
Association Rules
Association Rules
• Association Rule mining:
–"finding frequent patterns, associations,
correlations, or causal structures among sets
of items or objects in transactional/relational
databases or any other information repository"
• Applications: cross-marketing, catalogue design
What is Association Rule
Given a set of transactions, find rules that will
predict the occurrence of an item based on the
occurrences of other items in the transaction.
• Examples:
• {bread} → {milk}
• {soda} → {chips}
• {bread} → {jam}
Support and Confidence
• Support(X → Y): the fraction of transactions that contain both X and Y
• Confidence(X → Y): the fraction of transactions containing X that also
contain Y, i.e., support(X ∪ Y) / support(X)
Goal of Association Mining
• For a given set of transactions T, the goal of
association mining is to find the rules having
– Support ≥ a minsup threshold
– Confidence ≥ a minconf threshold
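A minimal sketch computing support and confidence over a hypothetical transaction list built from the bread/milk examples above.

    transactions = [
        {"bread", "milk"},
        {"bread", "jam"},
        {"soda", "chips"},
        {"bread", "milk", "jam"},
    ]

    def support(itemset):
        # Fraction of transactions that contain every item in `itemset`
        return sum(itemset <= t for t in transactions) / len(transactions)

    def confidence(lhs, rhs):
        return support(lhs | rhs) / support(lhs)

    print(support({"bread", "milk"}))        # 2/4 = 0.5
    print(confidence({"bread"}, {"milk"}))   # 0.5 / 0.75 = 0.667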
Association Rule Mining
Association Rule Techniques
Two-step approach:
1. Frequent itemset generation
• Generate all itemsets whose support ≥ minsup
2. Rule generation
• Generate rules from the frequent itemsets
• Each rule is a binary partitioning of a frequent itemset
• Note: frequent itemset generation is computationally expensive
Algorithm to Generate ARs
Apriori Algorithm
1. C1 = Itemsets of size one in I;
2. Determine all large itemsets of size 1, L1;
3. i = 1;
4. Repeat
5. i = i + 1;
6. Ci = Apriori-Gen(Li-1);
7. Count Ci to determine Li;
8. until no more large itemsets found;
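A minimal sketch of this loop in Python, reusing the transactions from the support/confidence example; the candidate-generation step is a simplified Apriori-Gen that joins frequent (i−1)-itemsets.

    def apriori(transactions, minsup):
        items = {frozenset([x]) for t in transactions for x in t}
        sup = lambda s: sum(s <= t for t in transactions) / len(transactions)
        L = {s for s in items if sup(s) >= minsup}          # L1: large 1-itemsets
        frequent, i = set(L), 1
        while L:
            i += 1
            # Simplified Apriori-Gen: join frequent (i-1)-itemsets
            candidates = {a | b for a in L for b in L if len(a | b) == i}
            L = {c for c in candidates if sup(c) >= minsup}  # count Ci to get Li
            frequent |= L
        return frequent

    transactions = [{"bread", "milk"}, {"bread", "jam"},
                    {"soda", "chips"}, {"bread", "milk", "jam"}]
    print(apriori(transactions, minsup=0.5))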
Working of Apriori Principle
Apriori
Advantages/Disadvantages
 Advantages:
– Uses large itemset property.
– Easily parallelized
– Easy to implement.
 Disadvantages:
– Assumes transaction database is memory
resident.
– Requires up to m database scans.
Sequence Rules
Sequence Rules
• For a database D, finding the maximal
sequences among the 'n' sequences that
have a certain user-specified minimum
support and confidence
Segmentation
Segmentation
• For a database D, partitioning the records into
segments (groups) so that records within a
segment are more similar to one another than
to records in other segments
Clustering Approaches
(Taxonomy figure)
• Hierarchical: Agglomerative, Divisive
• Partitional
• Categorical
• Large DB: Sampling, Compression
Types of Clustering
• Hierarchical – Nested set of clusters created.
• Partitional – One set of clusters created.
• Incremental – Each element handled one at a
time.
• Simultaneous – All elements handled
together.
• Overlapping/Non-overlapping
Hierarchical Clustering
• Build a tree-based hierarchical taxonomy
(dendrogram) from a set of unlabeled examples
• Recursive application of a standard clustering
algorithm can produce a hierarchical clustering
(Example taxonomy: animal → vertebrate {fish, reptile, amphib., mammal}
and invertebrate {worm, insect, crustacean})
Hierarchical Clustering
• Agglomerative
– Bottom up approach
– Start with single-instance clusters
– At each step join the two closest clusters
– Design Decision: distance between clusters
• Eg: two closest instances in clusters vs distance
between means
• Divisive (deglomerative)
– Top Down Approach
– Start with one universal cluster
– Find two clusters
– Proceed recursively on each subset
– Can be very fast
Both produce a dendrogram (divisive = top-down approach,
agglomerative = bottom-up approach).
Hierarchical Agglomerative Clustering
(HAC) or Agglomerative Clustering
• Starts with each doc in a separate cluster
–then repeatedly joins the closest pair of
clusters, until there is only one cluster.
• The history of merging forms a binary tree
or hierarchy.
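A minimal sketch of agglomerative clustering with SciPy: `linkage` builds the merge history (the dendrogram as a binary tree), and `fcluster` cuts it into flat clusters. The points are made up for illustration.

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    points = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                       [5.0, 5.0], [5.1, 5.2]])

    Z = linkage(points, method="single")    # repeatedly join the two closest clusters
    labels = fcluster(Z, t=2, criterion="maxclust")
    print(labels)                           # e.g. [1 1 1 2 2]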
K-Means Algorithm
K – Means Clustering
Now group this data into two clusters
K – Means Clustering
• Randomly initialize two points called centroids
• Colour the data points red and blue according to
the nearest (Red or Blue) centroid
• Move the centroids
K – Means Clustering
• Calculate the mean of the points assigned to each
colour, i.e., Red and Blue
• Reassign each point to whichever of the two
centroids is closest
K – Means Clustering
K-means clustering
• Calculate the
average of blue
points and red
points and move
the centroid again
K-means Clustering
• Calculate the mean of the blue points and the red
points, then move the centroids again
• Continue iterating these steps
• At a particular point the cluster centroids
will no longer change
K-Means
• Initial set of clusters randomly chosen.
• Iteratively, items are moved among sets of
clusters until the desired set is reached.
• High degree of similarity among elements
in a cluster is obtained.
• Given a cluster Ki={ti1,ti2,…,tim}, the cluster
mean is mi = (1/m)(ti1 + … + tim)
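A minimal sketch of the k-means loop described above: assign each point to its nearest centroid, then move each centroid to the mean of its points, until the centroids stop changing. The points are made up for illustration.

    import numpy as np

    def kmeans(points, k, iters=100, seed=0):
        rng = np.random.default_rng(seed)
        centroids = points[rng.choice(len(points), k, replace=False)]
        for _ in range(iters):
            # Assign each point to the nearest centroid
            d = np.linalg.norm(points[:, None] - centroids[None, :], axis=2)
            labels = d.argmin(axis=1)
            # Move each centroid to the mean of its assigned points
            new = np.array([points[labels == j].mean(axis=0) for j in range(k)])
            if np.allclose(new, centroids):
                break                       # centroids no longer change
            centroids = new
        return labels, centroids

    pts = np.array([[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9]])
    labels, cents = kmeans(pts, k=2)
    print(labels, cents)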
Social Network Analysis
(SNA)
Social network analysis is:
• a set of relational methods for systematically
understanding and identifying connections among actors
Introduction
• Actors (nodes, points, vertices):
- Individuals, Organizations, Events …
• Relations (lines, arcs, edges, ties): between pairs of actors.
- Undirected (symmetric) / Directed (asymmetric)
- Binary / Valued
Basic concepts
Network Components
1) Egocentered Networks
• Data on a respondent (ego) and the people they are connected
to.
Measures:
Size
Types of relations
Basic concepts
Types of network data:
Def: An ego-centered network, or egonet, represents
the one-hop neighbourhood of the node of interest;
that is, it consists of a particular node and its
immediate neighbours.
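A minimal sketch of extracting an egonet with NetworkX; the edge list is made up for illustration.

    import networkx as nx

    G = nx.Graph([("ego", "a"), ("ego", "b"), ("a", "b"), ("b", "c")])
    egonet = nx.ego_graph(G, "ego", radius=1)   # one-hop neighbourhood
    print(egonet.nodes())                       # ego plus its immediate neighbours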
2) Complete Networks
• Connections among all members of
a population.
• Data on all actors within a
particular (relevant) boundary.
• Never exactly complete (due to
missing data), but boundaries are set
• Ex: Friendships among workers in a company.
Measures:
Graph properties
Density
Sub-groups
Positions
The unit of interest in a network is the combined set of
actors and their relations.
We represent actors with points and relations with lines.
Example: (figure: a five-actor network with nodes a, b, c, d, e)
Social Network data
In general, a relation can be:
Undirected / Directed
Binary / Valued
(Figures: the same five-node network a–e drawn four ways: undirected binary,
directed binary, undirected valued, and directed valued, the valued versions
carrying edge weights such as 1, 3, 4, 2.)
Social Network data
From pictures to matrices
(Figures: the undirected binary and directed binary graphs re-expressed as
5×5 adjacency matrices over actors a–e; a 1 in cell (i, j) records a tie from
actor i to actor j, and the undirected matrix is symmetric.)
Basic Data Structures
Social Network data
Indirect connections are what make networks systems: one
actor can reach another if there is a path in the graph
connecting them.
(Figure: a path through nodes a, b, c, d, e, f illustrating reachability.)
Connectivity
Measuring Networks
Distance is measured by the (weighted) number of
relations separating a pair, using the shortest path.
In the example figure, actor "a" is:
1 step from 4 actors
2 steps from 5 actors
3 steps from 4 actors
4 steps from 3 actors
5 steps from 1 actor
Distance & number of paths
Measuring Networks
(Figure: an information network of email exchanges within the Reagan
White House, early 1980s; source: Blanton, 1995.)
Measuring Networks
Centrality refers to (one dimension of) location, identifying
where an actor resides in a network.
Centrality
Measuring Networks
Centrality is fairly straightforward: we want to
identify which nodes are in the 'center' of the
network, in the sense that they have many and
important connections.
Three standard centrality measures capture a wide
range of “importance” in a network:
Degree
Closeness
Betweenness
The most intuitive notion of centrality focuses on
degree. Degree is the number of lines, and the
actor with the most lines is the most important:
Centrality
Measuring Networks
Degree Centrality:
C_D(p_k) = Σ_{i=1}^{n} a(p_i, p_k)
Relative measure of Degree Centrality:
C'_D(p_k) = [Σ_{i=1}^{n} a(p_i, p_k)] / (n − 1)
where a(p_i, p_k) = 1 if actors p_i and p_k are directly connected, and 0 otherwise.
A second measure is closeness centrality. An actor
is considered important if he/she is relatively close to all
other actors.
Closeness is based on the inverse of the
distance of each actor to every other actor in the
network.
Closeness Centrality:
C_C(p_k) = [Σ_{i=1}^{n} d(p_i, p_k)]^{−1}
Relative Closeness Centrality:
C'_C(p_k) = (n − 1) / [Σ_{i=1}^{n} d(p_i, p_k)]
where d(p_i, p_k) is the shortest-path distance between actors p_i and p_k.
Centrality
Measuring Networks
Betweenness Centrality:
Model based on communication flow: A person who lies
on communication paths can control communication flow, and is
thus important. Betweenness centrality counts the number of
shortest paths between i and k that actor j resides on.
(Figure: actors a–h arranged so that some lie on the communication
paths between the others.)
Centrality
Measuring Networks
Betweenness centrality can be defined in terms of probability. Let
g_ij = the number of geodesics that bond actors p_i and p_j,
g_ij(p_k) = the number of geodesics that bond p_i and p_j and contain p_k,
I_ij(p_k) = g_ij(p_k) / g_ij = the probability that actor p_k is on a geodesic
randomly chosen among the ones which join p_i and p_j.
Betweenness centrality is the sum of these probabilities (Freeman, 1979):
C_B(p_k) = Σ_{i<j} I_ij(p_k) = Σ_{i<j} g_ij(p_k) / g_ij
Normalized: C'_B(p_k) = C_B(p_k) / [(n − 1)(n − 2) / 2]
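A minimal sketch computing the three centrality measures (and the density measure discussed below) with NetworkX, which implements the normalized Freeman definitions given above; the edge list is made up for illustration.

    import networkx as nx

    G = nx.Graph([("a", "b"), ("b", "c"), ("c", "d"), ("c", "e")])

    print(nx.degree_centrality(G))       # degree / (n - 1)
    print(nx.closeness_centrality(G))    # (n - 1) / sum of shortest-path distances
    print(nx.betweenness_centrality(G))  # normalized shortest-path counts
    print(nx.density(G))                 # L / [n(n - 1)/2] for an undirected graph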
If we want to measure the degree to which the graph as a whole is centralized, we look
at the dispersion of centrality:
Freeman's general formula for centralization (which ranges from 0 to 1):
C_D = [Σ_{i=1}^{n} (C_D(p*) − C_D(p_i))] / [(n − 1)(n − 2)]
where C_D(p*) is the largest degree centrality observed in the network.
Centralization
Measuring Networks
Degree Centralization Scores
(Figures: three example networks with Freeman centralization scores of 1.0, .02, and 0.0.)
Centralization
Measuring Networks
Density
Measuring Networks
The more actors are connected to one another, the denser the network will be.
Undirected network: n(n − 1)/2 possible pairs of actors, so Δ = L / [n(n − 1)/2]
Directed network: n(n − 1) possible lines, so Δ_D = L / [n(n − 1)]
where L is the number of lines (ties) present.
(Figures: three example networks with densities .25, .23, and 0.25.)
UCINET
•The standard network analysis program; runs in Windows
•Good for computing measures of network topography for single nets
•Input/output uses a special 2-file format, but it can now read PAJEK files directly
•Not optimal for large networks
•Available from: Analytic Technologies
Social Network Software
PAJEK
•Program for analyzing and plotting very large networks
•Intuitive Windows interface
•Started mainly as a graphics program, but has expanded to a wide range of
analytic capabilities
•Can link to the R statistical package
•Free
•Available from: http://vlado.fmf.uni-lj.si/pub/networks/pajek/
Social Network Software
NetDraw
•Also relatively new, but from one of the best-known names in
network analysis software.
•Free
Social Network Software