The document discusses probabilistic graphical models (PGMs) and Bayesian networks. It provides an overview of constructing Bayesian networks and using them for problem solving. Specifically, it discusses:
1. Constructing a Bayesian network from a dataset to identify probabilistic relationships between variables and determine the most influential variables for a given target.
2. Using the network to perform forward and backward analysis to understand how variable changes affect the target and determine variable values to achieve a desired target value.
3. Demonstrating these concepts using a Bayesian network constructed from an account-level dataset, with participation rate as the target variable. The network identifies additional influential variables beyond what was known initially.
1. — 0
Essex Lake Group LLC www.essexlg.com
Probabilistic Graphical Model
(Bayesian Network)
Advanced Distributed Machine Learning Labs (ADML)
Essex Lake Group LLC
Knowledge Sharing Session
July, 2015
2. — 1
Embracing Bayesian approaches for business problem solving
Objective:
Technical Objectives:
• How can we identify the probabilistic relationships among the variables in the given data?
• For any given target, how can we identify the most influential variables?
• Can Bayesian networks help us find better insights than the existing Pivot approach?
Business Objective:
• For the MET004 account-level data, how do we evaluate the performance of “Participation Rate” using a PGM?
3. — 2
Agenda
▪ How do we construct Probabilistic Graphical Models (PGMs)?
▪ PGMs on Sales Ratio Data
▪ PGM with respect to CART
4. — 3
What are Probabilistic Graphical Models /Bayesian Networks?
The graph’s joint probability table is tabulated using the following equation, where A and B are the parents of T, and T and C are the parents of D:
P(A, B, C, T, D) = P(A) * P(B) * P(C) * P(T | A, B) * P(D | T, C)
• A Bayesian network is a directed acyclic graph (DAG)
• Nodes are variables and edges are relationships between variables
• Each node has a set of values
• The direction of an edge indicates causality, i.e. cause-effect relationships among variables
• They are graphical models, capable of displaying relationships clearly and intuitively
• They are directional, and so can represent cause-effect relationships
• They handle uncertainty through the established theory of probability
• They can represent indirect as well as direct causation
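Example: “Car”
[Figure: two network diagrams. Generic network: nodes A, B, C, D and T, with the roles Parent 1, Parent 2, Child and Spouse. Car network: nodes A to D labelled “gas pedal pressed”, “steering wheel angle”, “gear” and “car direction”.]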
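As a minimal sketch of how a joint probability table follows from the network structure, the snippet below multiplies one factor per node for a five-node network in which A and B are the parents of T, and T and C are the parents of D. All variables are assumed to be binary and every probability is a hypothetical, illustrative number, not a value from this deck.

```python
from itertools import product

# Hypothetical priors and CPTs (illustrative numbers only).
p_a = {True: 0.3, False: 0.7}
p_b = {True: 0.6, False: 0.4}
p_c = {True: 0.5, False: 0.5}
# P(T = True | A, B), keyed by (a, b)
p_t_given_ab = {(True, True): 0.9, (True, False): 0.6,
                (False, True): 0.4, (False, False): 0.1}
# P(D = True | T, C), keyed by (t, c)
p_d_given_tc = {(True, True): 0.8, (True, False): 0.5,
                (False, True): 0.3, (False, False): 0.05}


def bernoulli(p_true, value):
    """Probability of `value` for a binary variable with P(True) = p_true."""
    return p_true if value else 1.0 - p_true


def joint(a, b, c, t, d):
    """P(A, B, C, T, D) as the product of one factor per node."""
    return (p_a[a] * p_b[b] * p_c[c]
            * bernoulli(p_t_given_ab[(a, b)], t)
            * bernoulli(p_d_given_tc[(t, c)], d))


# The 32 entries of the joint table must sum to 1.
total = sum(joint(*values) for values in product([True, False], repeat=5))
print(round(total, 10))
```

Because each node contributes exactly one factor conditioned on its parents, the full table never has to be stored; any entry can be computed on demand.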
5. — 4
Network Terminology
Markov Blanket: for a given target variable, what are the most influential variables?
Dependency among the variables of a system (Car)
The Markov Blanket of a (Target) node consists of all nodes that make this Target conditionally independent of all the other nodes in the model:
• The Parents (blue nodes in the example) cut off the information coming from their ancestors
• The Children (red node) cut off the information coming from their descendants
• The Spouses (or co-parents, dark green) cut off the information coming from the ancestors of the Children
Once one variable is observed to take a value, the likely values of the variables that depend on it change accordingly
[Figure: Car network with evidence. First panel: “car direction” = Forward. Second panel: “gas pedal pressed” = Yes, “gear” = Forward, “car direction” = Forward, “steering wheel angle” = 90.]
[Figure: Markov Blanket of the Target node T, showing Parent 1, Parent 2, Child and Spouse.]
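The Markov Blanket definition (parents, children, and the children's other parents) can be sketched directly from an edge list. The node names and edges below are hypothetical and only mirror the Parent/Child/Spouse roles; node E is added as an ancestor of A to show that it falls outside the blanket.

```python
def markov_blanket(edges, target):
    """Return parents, children and spouses (co-parents) of `target`.

    `edges` is an iterable of (parent, child) pairs describing a DAG.
    """
    parents = {p for p, c in edges if c == target}
    children = {c for p, c in edges if p == target}
    spouses = {p for p, c in edges if c in children and p != target}
    return parents | children | spouses


# Hypothetical DAG: A and B are parents of T; D is a child of T;
# C is the other parent of D (a spouse); E is an ancestor of A.
edges = [("E", "A"), ("A", "T"), ("B", "T"), ("T", "D"), ("C", "D")]
blanket = markov_blanket(edges, "T")
print(sorted(blanket))
```

Given the blanket variables, E carries no further information about T, which is why it is excluded.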
6. — 5
Constructing Bayesian Network - Process Flow
01. Processing data
▪ Data obtained after processing must be categorical; continuous values must be binned.
▪ All data must be converted to factors.
02. Creating relationships between variables
▪ To fit the Bayesian network, a scoring function φ is used.
▪ Given the data and the scoring function φ, find a network that maximizes the value of φ.
▪ The model tries various permutations of adding and removing arcs so that the final network has the maximum posterior probability, starting from a prior probability distribution over the possible networks and conditioning on the data.
03. Build CPTs of final fitted network
▪ While learning the network, the joint probability distribution is created for all the variables in the data set.
▪ This joint probability distribution is converted to conditional probability tables (CPTs) for each dependent variable.
▪ These CPTs are used to answer the analysis questions.
04. Record Network
▪ Import the learned structure and its CPTs into SamIam to form hypotheses and verify them.
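As a rough illustration of the CPT-building step, the sketch below estimates P(child | parents) from categorical records by relative frequency. This is plain maximum-likelihood counting, which is an assumption here; the deck does not say which estimator its tooling uses. The column names and the four records are hypothetical.

```python
from collections import Counter, defaultdict


def estimate_cpt(records, child, parents):
    """Estimate P(child | parents) by counting categorical records.

    Returns {parent_values_tuple: {child_value: probability}}.
    """
    counts = defaultdict(Counter)
    for row in records:
        key = tuple(row[p] for p in parents)
        counts[key][row[child]] += 1
    return {
        key: {value: n / sum(counter.values()) for value, n in counter.items()}
        for key, counter in counts.items()
    }


# Hypothetical, already-binned records.
records = [
    {"payroll_deduct": "Y", "bucket": "high"},
    {"payroll_deduct": "Y", "bucket": "high"},
    {"payroll_deduct": "Y", "bucket": "low"},
    {"payroll_deduct": "N", "bucket": "low"},
]
cpt = estimate_cpt(records, "bucket", ["payroll_deduct"])
print(cpt)
```

Parent combinations that never occur in the data get no row at all, which is one reason continuous values are binned coarsely before learning.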
7. — 6
How do we interpret the Bayesian Network for problem solving ?
A. Forward Analysis:
Q1. I have not done any mailing activity this year. How did this affect my participation percentage?
Q2. Are there conditional dependencies, such as mailing activity not being helpful if the account is already active?
B. Backward Analysis:
Q3. If we know that our participation is high, which variables are responsible, and what values should they take to replicate the result in the future?
Q4. For my participation to lie in the higher bucket, should I offer payroll deduct or not?
All probability values are for Participation being in the 2.5%-5% (high) bucket.
Variable Name | Evidence Set | Participation Bucket value before evidence | Participation Bucket value after evidence | Change in Values
Sponsored Mail 2014 | Y | 15.15% | 19.46% | +4.31%
Non Sponsored Mail 2013 | Y | 15.15% | 18.85% | +3.70%
Mail Status 2013 | Y | 15.15% | 18.68% | +3.53%
Sponsored Mail 2014 | N | 15.15% | 13.80% | -1.35%
Non Sponsored Mail 2013 | N | 15.15% | 10.30% | -4.85%
Mail Status 2013 | N | 15.15% | 10.32% | -4.83%
Example: 1
[Fig: Markov Blanket for a target variable. Target: Participation Bucket. Other nodes: ES_Supp_Mail_2014, NS_Mail_Status_2013, Onsite_Event_ind_2013, Mail_status_2013.]
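Forward analysis of this kind amounts to comparing the target's probability before and after clamping a variable to its evidence value. The sketch below does this by brute-force enumeration on a hypothetical two-parent network (mail and payroll both pointing to the bucket); the variable names and all probabilities are made up for illustration and are not the deck's fitted CPTs.

```python
from itertools import product

# Hypothetical priors for two parents of the target.
p_mail = {"Y": 0.6, "N": 0.4}
p_payroll = {"Y": 0.35, "N": 0.65}
# Hypothetical P(bucket = high | mail, payroll), keyed by (mail, payroll).
p_high = {("Y", "Y"): 0.30, ("Y", "N"): 0.15,
          ("N", "Y"): 0.20, ("N", "N"): 0.05}


def prob_high(evidence=None):
    """P(bucket = high | evidence), summing over parents consistent with it."""
    evidence = evidence or {}
    numerator = denominator = 0.0
    for mail, payroll in product("YN", repeat=2):
        assignment = {"mail": mail, "payroll": payroll}
        if any(assignment[name] != value for name, value in evidence.items()):
            continue  # skip assignments that contradict the evidence
        weight = p_mail[mail] * p_payroll[payroll]
        numerator += weight * p_high[(mail, payroll)]
        denominator += weight
    return numerator / denominator


before = prob_high()                 # no evidence set
after = prob_high({"mail": "Y"})     # evidence: mail was sent
change = after - before
print(f"before={before:.4f} after={after:.4f} change={change:+.4f}")
```

Enumeration is exponential in the number of variables, so a tool such as SamIam relies on more efficient inference for networks of realistic size; the before/after comparison is the same.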
8. — 7
One more example: How do we interpret the Bayesian Network for problem solving?
Example: 2
All probability values are for Participation being in the 2.5%-5% (high) bucket.
Evidence | Non-Sponsored Mail 2013 | Payroll Deduct | Account Status
Participation Bucket = 2.5-5% | Y | Y | ACTIVE
Participation Bucket | Variable | Evidence | Participation Bucket value before evidence | Participation Bucket value after evidence | Change in Values
2.5% - 5% | Non-Sponsored Mail 2013, Payroll Deduct, Account Status | Y, Y, ACTIVE | 15.15% | 28.55% | +13.40%
[Figure: network fragment with PD_AVAIL_, NS_Mail_Status_ind_2013, ACCOUNT_STATUS and Participation Bucket = 2.5-5%.]
9. — 8
Most Probable Explanation for given variable (MAP): Process Flow
01. Selection of Target Variable
▪ The network consists of all the variables in the data set.
▪ To perform any kind of analysis, the target variable must be selected.
▪ The network is built without fixing a target variable, so if analysis needs to be done on any other variable, that variable can be selected in the same network.
02. Find Target’s Markov Blanket
▪ The target’s parent nodes, child nodes and all other parents of the target’s child nodes form its Markov Blanket.
▪ These variables make the target variable independent of the rest of the variables in the network.
03. Set Evidence on Target
▪ In SamIam, for the given Bayesian network, set evidence on the target variable, which is the variable of interest.
▪ We need to find what values the other variables should have in order to observe the evidenced value of the target variable.
04. Run MAP Query on Markov Blanket
▪ Select the target variable’s Markov Blanket variables for which MAP values are to be computed.
▪ Run the MAP query.
▪ The query returns the values of the selected variables that are jointly most probable given the evidence on the target.
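The MAP step can be sketched as a brute-force search: fix the evidence on the target, score every joint assignment of the selected blanket variables by its unnormalized posterior, and keep the best one. The two-parent network and its probabilities below are hypothetical; this is not how SamIam computes MAP internally, only an illustration of what the query returns.

```python
from itertools import product

# Hypothetical priors for two Markov Blanket variables of the target.
p_mail = {"Y": 0.6, "N": 0.4}
p_payroll = {"Y": 0.35, "N": 0.65}
# Hypothetical P(bucket = high | mail, payroll), keyed by (mail, payroll).
p_high = {("Y", "Y"): 0.30, ("Y", "N"): 0.15,
          ("N", "Y"): 0.20, ("N", "N"): 0.05}


def map_given_high():
    """Return the (mail, payroll) assignment maximizing P(mail, payroll | bucket = high)."""
    scores = {
        (mail, payroll): p_mail[mail] * p_payroll[payroll] * p_high[(mail, payroll)]
        for mail, payroll in product("YN", repeat=2)
    }
    evidence_prob = sum(scores.values())   # P(bucket = high)
    best = max(scores, key=scores.get)
    return best, scores[best] / evidence_prob


best_assignment, posterior = map_given_high()
print(best_assignment, round(posterior, 4))
```

Restricting the search to the Markov Blanket keeps the number of candidate assignments manageable, since variables outside the blanket add nothing once the blanket is fixed.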
11. — 10
[Figure: network over Employee Sponsored Mail, Participation Bucket, Onsite Event, Payroll Deduct and Email Status.]
More details on Network Structure:
• Payroll Deduct, Employee Sponsored Mail and Participation Bucket form a V-Structure.
• This V-Structure means that Payroll Deduct and Employee Sponsored Mail are independent of each other as long as Participation Bucket is not observed.
• That is, a change in Payroll Deduct won’t influence Employee Sponsored Mail, and vice versa.
• Any change in Email Status, which is the sibling of Participation Bucket, will also change Participation Bucket, because there is an active trail between the two.
• Payroll Deduct can, however, affect Onsite Event through Participation Bucket, but not Email Status, because of the V-Structure between Payroll Deduct and Employee Sponsored Mail, which is the parent of Email Status.
• Similarly, Employee Sponsored Mail can affect Onsite Event because of the active trail between them.
• If Participation Bucket is evidenced, the V-Structure is activated and influence can flow between Payroll Deduct and Employee Sponsored Mail.
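The V-Structure behaviour described above can be checked numerically on a three-node toy network, payroll -> bucket <- esm. The sketch below uses hypothetical probabilities (not the deck's CPTs) to show that the two parents are independent with no evidence, and become dependent once the common child is observed.

```python
from itertools import product

# Hypothetical priors for the two parents of the V-Structure.
p_payroll = {"Y": 0.35, "N": 0.65}
p_esm = {"Y": 0.6, "N": 0.4}
# Hypothetical P(bucket = high | payroll, esm), keyed by (payroll, esm).
p_high = {("Y", "Y"): 0.30, ("Y", "N"): 0.20,
          ("N", "Y"): 0.15, ("N", "N"): 0.05}


def joint(payroll, esm, bucket):
    """P(payroll, esm, bucket) for the network payroll -> bucket <- esm."""
    high = p_high[(payroll, esm)]
    return p_payroll[payroll] * p_esm[esm] * (high if bucket == "high" else 1.0 - high)


def prob(query, given=None):
    """P(query | given) by enumerating all eight joint assignments."""
    given = given or {}
    numerator = denominator = 0.0
    for payroll, esm, bucket in product("YN", "YN", ["high", "low"]):
        world = {"payroll": payroll, "esm": esm, "bucket": bucket}
        if any(world[name] != value for name, value in given.items()):
            continue
        p = joint(payroll, esm, bucket)
        denominator += p
        if all(world[name] == value for name, value in query.items()):
            numerator += p
    return numerator / denominator


# No evidence on the child: learning esm does not change belief in payroll.
prior = prob({"payroll": "Y"})
given_esm = prob({"payroll": "Y"}, {"esm": "Y"})
# Child observed: learning esm now shifts belief in payroll.
given_high = prob({"payroll": "Y"}, {"bucket": "high"})
given_high_and_esm = prob({"payroll": "Y"}, {"bucket": "high", "esm": "Y"})
print(prior, given_esm, given_high, given_high_and_esm)
```

With these numbers the first two values coincide while the last two differ, which is the "activated V-Structure" effect.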
12. — 11
Agenda
▪ What are Probabilistic Graphical Models (PGM)
▪ PGMs on Sales Ratio Data
▪ PGM with respect to CART
13. — 12
Our work in 3 modules:
1. Data Preparation: Raw Data, Identify the variables, Imputation, Binning
2. Model Building: Bayesian Network Construction
3. Network Analysis - Hypothesis Testing: Sensitivity Analysis (SamIam, a tool)
Module 1:
Module 2: Pivot Approach vs. Bayesian Network Approach (Pivots, Bayesian Network)
Module 3: Observations on CART and Bayesian Networks (CART, Bayesian Network)
14. — 13
Data Details
Account Variables: Company Name, Market Segment, Company State
Mail Variables: Marketing Activity, Non-Sponsored Mail, Employee Sponsored Mail
Flag Variables: New or Existing account flag
Policy Variables: Program Type, Payroll Deduct, Employer Type
1. Sales Account Data (2013 to 2015)
2. Demographic details
3. MetLife’s mailing
4. Nature and tenure policy
5. Agent and employer type details
6. No. of participants and eligible accounts
1. Target Variable : Participation Bucket
(Ratio of no. of participants and no. of eligible)
2. No. of observations : 5007
1. Company Name and Company State describe the account. Market Segment can be local, national, regional, USD or other; the majority of the accounts are regional
2. The mailing variable (Non Sponsored Mail) is crucial for the Participation Bucket
3. The policy variable (Payroll Deduct) is essential to improving the Participation Bucket
4. Most of the accounts are of the Single Carrier type and don’t have Payroll Deduct
[Chart: frequency of accounts by Participation Bucket (Total, 0%-1%, 1%-2.5%, 2.5-5%, >5%); y-axis: Frequency, 0 to 6000.]
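Since the target is the ratio of participants to eligible members binned into four buckets, the binning can be sketched as below. The bucket labels come from the slides; how the boundary values (exactly 1%, 2.5% or 5%) are assigned is an assumption, as is the requirement that the eligible count is positive.

```python
def participation_bucket(participants, eligible):
    """Bin the participation rate (in percent) into the four buckets used as the target.

    Assumes eligible > 0 and that each upper bound is inclusive (an assumption).
    """
    rate = 100.0 * participants / eligible
    if rate <= 1.0:
        return "0%-1%"
    if rate <= 2.5:
        return "1%-2.5%"
    if rate <= 5.0:
        return "2.5%-5%"
    return ">5%"


# Hypothetical accounts: (participants, eligible)
accounts = [(0, 50), (2, 100), (3, 100), (10, 100)]
buckets = [participation_bucket(p, e) for p, e in accounts]
print(buckets)
```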
15. — 14
Bivariate Analysis on Target Variable “Participation Bucket”
[Charts: one panel per variable, showing the N and Y shares of accounts across the Participation Buckets 0%-1%, 1%-2.5%, 2.5%-5%, >5% and Grand Total.]
• Marketing Activity 2013 - For Participation Bucket in range 2.5%-5%, 11% of accounts have marketing activity in 2013
• Onsite Event 2013 - For Participation Bucket in range 2.5%-5%, 3% of accounts have Onsite Events in 2013
• Mail Status 2013 - For Participation Bucket in range 2.5%-5%, 11% of accounts have Mail Status 2013
• Employee Sponsored Mail 2013 - For Participation Bucket in range 2.5%-5%, 5.81% of accounts have Employee Sponsored Mail 2013
• Payroll Deduct - For Participation Bucket in range 2.5%-5%, 9% of accounts have Payroll Deduct
• Non Sponsored Mail 2013 - For Participation Bucket in range 2.5%-5%, 11% of accounts have Non Sponsored Mail 2013
16. — 15
Profiling of Important Variables against Target Variable: “Participation Bucket”
Variable | N | Y
Marketing Activity 2013 | 32.22% | 67.78%
Onsite Event 2013 | 88.28% | 11.72%
Payroll Deduct | 64.80% | 35.20%
Non Sponsored Mail 2013 | 35.01% | 64.99%
Mail Status 2013 | 32.55% | 67.45%
Employee Sponsored Mail 2013 | 75.40% | 24.60%
• 68% of the accounts have Marketing Activity 2013
• 88% of the accounts don’t have Onsite Event 2013
• 65% of the accounts don’t have Payroll Deduct
• 65% of the accounts have Non Sponsored Mail 2013
• 25% of the accounts received employee sponsored mails
17. — 16
The Bayesian Network discovers not only the variables identified by the business users but also other highly influential variables
Description: To show how to analyze the variables in a Bayesian Network, we looked for the values of the variables that influence the target variable the most, so as to maximize or minimize the probability of the desired target variable class.
The following variables were found using the Markov Blanket of the Target Variable. Taken together, they make the Target Variable independent of the rest of the network, so they can be said to hold the maximum influence over the target.
Target Variable:
Participation Bucket
1st Level Markov Blanket Variables:
• Payroll Deduct
• Non Sponsored Mail 2013
• Marketing Activity 2013, 2015
• Mail Status 2013
• Employee Sponsored Mail 2014
• Account Type
• Account Status
• Onsite Event 2013
2nd Level Markov Blanket Variables:
• Eligibility File
• Non Sponsored Mail 2013
• Employee Sponsored Mail 2014
• Program Type
• Tenure Years
[Figure: Level 1 and Level 2 Markov Blanket diagrams around the Target, with the Business Users’ variables highlighted.]
Module 1
18. — 17
Better Results with Bayesian Network (All Variables)
Participation Bucket probabilities read from the two charts (legend: Before Evidence, After Evidence):
Variables Selected: Evidence | 0%-1% | 1%-2.5% | 2.5%-5% | >5%
Payroll Deduct -> Yes | 23.47% | 24.49% | 23.27% | 28.77%
Payroll Deduct -> Yes; Non-Sponsored Mail Status 2013 -> Yes; Mail Status 2013 -> Yes | 13.51% | 19.57% | 28.23% | 38.69%
Inference:
• Payroll Deduct is Critical to account performance
• Payroll Deduct without the support of any Mailing Activity cannot drive account performance.
Module 1
19. — 18
Full Model Analysis on Bayesian Network
Participation Bucket probabilities read from the two charts (legend: Before Evidence, After Evidence):
Variables Selected: Evidence | 0%-1% | 1%-2.5% | 2.5%-5% | >5%
Eligibility File -> Yes | 33.03% | 24.35% | 19.61% | 23.01%
Eligibility File -> Yes; Non-Sponsored Mail -> Yes; Employee Sponsored Mail -> Yes | 6.81% | 16.01% | 28.15% | 49.02%
Inference:
• Payroll Deduct causes the maximum change in the Target Variable. Apart from that, Payroll Deduct’s Markov blanket variables, such as Eligibility File and Employee Sponsored Mail, should be considered as well.
Module 1
20. — 19
Hypothesis: Mailing activity significantly improves participation bucket with mailing activity in 2013 having more influence than 2014
Full Model Analysis on Bayesian Network
Participation Bucket probabilities read from the two charts (legend: Before Evidence, After Evidence):
Variables Selected: Evidence | 0%-1% | 1%-2.5% | 2.5%-5% | >5%
Employee Sponsored Mail 2014 -> Yes; Non-Sponsored Mail 2014 -> Yes; Onsite Event 2014 -> Yes | 29.05% | 26.59% | 19.84% | 24.51%
Employee Sponsored Mail 2013 -> Yes; Non-Sponsored Mail 2013 -> Yes; Onsite Event 2013 -> Yes | 4.91% | 17.90% | 27.20% | 49.98%
Inference:
• Improving Mailing Activity for the year 2013 has more influence than doing so for the year 2014
• Certain combinations of marketing activities tend to be more effective; in particular, a combination of Non-Sponsored Mail, Employee Sponsored Mail and Onsite Events for the year 2013 yields a higher Participation Bucket (12.05% higher)
Module 1
21. — 20
Hypothesis: Participation Bucket depends on type of Marketing Channel – Eligibility File, Employee Sponsored Mail, Onsite Events,
Payroll Deduct.
Full Model Analysis on Bayesian Network
Participation Bucket probabilities read from the two charts (legend: Before Evidence, After Evidence):
Variables Selected: Evidence | 0%-1% | 1%-2.5% | 2.5%-5% | >5%
Marketing Activity 2013 -> Yes; Payroll Deduct -> Yes | 15.80% | 20.70% | 26.97% | 36.53%
Marketing Activity 2013 -> Yes; Onsite Event 2013 -> Yes | 15.70% | 25.61% | 23.79% | 34.91%
Inference:
• Along with Marketing Activity, performance also depends on the type of channel, with Payroll Deduct and Onsite Event Activity 2013 being the most important
22. — 21
Hypothesis: Newer accounts can improve their Participation Bucket by Mailing Activity
Full Model Analysis on Bayesian Network
Variables Selected: Evidence
Tenure Years -> 0
Non-Sponsored Mail 2013 -> Yes
Employee Sponsored Mail 2013 -> Yes
Onsite Event 2013 -> Yes
Inference:
• Newer accounts can improve their Participation Bucket through mailing activity: specifically, by sending non-sponsored and employee
sponsored mail and holding some onsite events.
[Chart: Difference % between probabilities (Before Evidence vs. After Evidence) by Participation Bucket. Data labels: 0%-1% -19.48%; 1%-2.5% 0.74%; 2.5%-5% 7.25%; >5% 11.49%]
23. — 22
Same insights from both Pivot and Bayesian Analysis (3 variables)
Description: To test the accuracy of the Bayesian network, we built models with only those variables that were considered in the pivot
analysis. We then observed the target variable, i.e. the Participation Bucket, with respect to the selected variables. The following results
were obtained:
Pivot Analysis
Program Type (Participation Bucket)
                 0%-1%    1%-2.5%   2.5%-5%   >5%
BROKER CHOICE    25.04%   21.41%    24.74%    28.81%
MARSH COMBINED   41.61%   29.93%    11.68%    16.79%
METLIFE CHOICE   20.00%   31.11%    26.67%    22.22%
BROKER CHOICE    25.04%   21.41%    24.74%    28.81%
Payroll Deduct (Participation Bucket)
                 0%-1%    1%-2.5%   2.5%-5%   >5%
Pivot N          54.64%   27.22%    12.16%    5.98%
Pivot Y          16.09%   20.96%    28.16%    34.79%
Bayesian Analysis
Payroll Deduct (Participation Bucket)
                 0%-1%    1%-2.5%   2.5%-5%   >5%
Bayesian N       54.34%   27.19%    12.30%    6.17%
Bayesian Y       16.16%   21.01%    28.11%    34.72%
Program Type (Participation Bucket)
                 0%-1%    1%-2.5%   2.5%-5%   >5%
SINGLE CARRIER   25.02%   21.46%    24.72%    28.80%
MARSH COMBINED   41.41%   29.87%    11.84%    16.88%
METLIFE CHOICE   20.18%   30.89%    26.61%    22.32%
BROKER CHOICE    28.64%   28.64%    25.40%    17.32%
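One simple way to quantify how closely the two analyses agree is the largest absolute gap, in percentage points, between matching cells. The sketch below applies that check to the Payroll Deduct rows reported on this slide; the choice of metric and the helper name `max_abs_gap` are illustrative assumptions, not something the slides specify.

```python
# Payroll Deduct rows as reported on the slide (percent per Participation Bucket).
pivot = {"N": [54.64, 27.22, 12.16, 5.98], "Y": [16.09, 20.96, 28.16, 34.79]}
bayes = {"N": [54.34, 27.19, 12.30, 6.17], "Y": [16.16, 21.01, 28.11, 34.72]}

def max_abs_gap(a, b):
    """Largest absolute difference between two rows, in percentage points."""
    return max(abs(x - y) for x, y in zip(a, b))

gaps = {key: round(max_abs_gap(pivot[key], bayes[key]), 2) for key in pivot}
print(gaps)
```

Both rows differ by well under one percentage point, which is consistent with the slide's claim that the two methods give the same insights on this small variable set.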
Module 2
24. — 23
More insights with more interactions: Pivot and Bayesian (6 variables)
Description: To test the accuracy of the Bayesian network, we built models with only those variables that were considered in the pivot
analysis. We then observed the target variable, i.e. the Participation Bucket, with respect to the selected variables. The following results
were obtained:
Pivot Analysis
Non-Sponsored Mail (Participation Bucket)
      0%-1%    1%-2.5%   2.5%-5%   >5%
N     61.32%   24.53%    3.77%     10.38%
Y     40.54%   23.94%    20.46%    15.06%
Bayesian Analysis
Non-Sponsored Mail (Participation Bucket)
      0%-1%    1%-2.5%   2.5%-5%   >5%
N     59.32%   24.77%    4.77%     11.14%
Y     40.47%   23.99%    20.35%    15.18%
Bayesian Advantage:
• Pivot tables often return zero values when multiple filters are applied and no data rows match the combination of filters.
• Bayesian networks work on conditional probabilities rather than on raw values. Thus, even when matching data rows are unavailable,
the network still returns the probability of each bucket given the evidence set on the filter variables.
Variables Selected:
Non Sponsored Mail 2014
Employee Sponsored Mail 2014
Onsite Event 2014
Microsite Activity 2014
Participation Bucket 2014
Email Status 2013
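The Bayesian Advantage point can be illustrated with a toy data set in which no row matches a two-filter combination. The rows, the two bucket labels, and the network structure (the bucket as the single parent of both filter variables) are all assumptions made for this sketch; the structure learned in the study may differ.

```python
from collections import Counter

# Hypothetical rows: (mail, onsite, bucket). No row has mail == onsite == "Yes".
rows = [
    ("Yes", "No", "high"), ("Yes", "No", "high"), ("Yes", "No", "low"),
    ("No", "Yes", "high"), ("No", "Yes", "low"),
    ("No", "No", "low"), ("No", "No", "low"), ("No", "No", "high"),
]

def pivot_counts(rows, mail, onsite):
    """Pivot-style filter: count buckets among rows matching both filters."""
    return Counter(b for m, o, b in rows if m == mail and o == onsite)

def network_posterior(rows, mail, onsite):
    """P(bucket | mail, onsite) under the assumed structure
    bucket -> mail, bucket -> onsite, estimated from frequencies."""
    buckets = Counter(b for _, _, b in rows)
    scores = {}
    for bucket, n in buckets.items():
        in_bucket = [(m, o) for m, o, b in rows if b == bucket]
        p_mail = sum(m == mail for m, _ in in_bucket) / n
        p_onsite = sum(o == onsite for _, o in in_bucket) / n
        scores[bucket] = (n / len(rows)) * p_mail * p_onsite
    total = sum(scores.values())  # assumed nonzero for this toy data
    return {b: s / total for b, s in scores.items()}

empty = pivot_counts(rows, "Yes", "Yes")
posterior = network_posterior(rows, "Yes", "Yes")
print(dict(empty), posterior)
```

The pivot filter finds no rows, while the factorized model still returns a distribution. That distribution is only as trustworthy as the independence assumptions encoded in the structure.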
25. — 24
Better Insights with Bayesian (Interaction among all variables)
Description: To test the accuracy of the Bayesian network, we built models with only those variables that were considered in the pivot
analysis. We then observed the target variable, i.e. the Participation Bucket, with respect to the selected variables. The following results
were obtained:
Pivot Analysis
Payroll Deduct (Participation Bucket)
           0%-1%    1%-2.5%   2.5%-5%   >5%
Pivot N    54.64%   27.22%    12.16%    5.98%
Pivot Y    16.09%   20.96%    28.16%    34.79%
Non-Sponsored Mail (Participation Bucket)
           0%-1%    1%-2.5%   2.5%-5%   >5%
N          61.32%   24.53%    3.77%     10.38%
Y          40.54%   23.94%    20.46%    15.06%
Bayesian Analysis
Payroll Deduct (Participation Bucket)
           0%-1%    1%-2.5%   2.5%-5%   >5%
N          65.33%   18.70%    8.19%     7.78%
Y          23.47%   24.49%    23.27%    28.77%
Non-Sponsored Mail (Participation Bucket)
           0%-1%    1%-2.5%   2.5%-5%   >5%
N          57.85%   20.74%    10.30%    11.11%
Y          38.28%   22.24%    18.85%    20.62%
26. — 25
Agenda
▪ What are Probabilistic Graphical Models (PGM)
▪ PGMs on Sales Ratio Data
▪ PGM with respect to CART
27. — 26
Observations On Bayesian Networks and CART
CART: Target-variable dependent; problem specific and rigid in terms of modification.
Bayesian Network: Doesn’t involve a fixed target variable; it is easier to build and modify.
CART: To perform second-level analysis, we need to construct a new classification tree taking an important variable as the target variable.
Bayesian Network: Second-level analysis can be done by considering the Markov blanket of an important variable in the target’s Markov blanket.
[Diagram labels: Participation Bucket, Payroll Deduct]
CART: No variable can be forcibly added to the tree, even if needed.
Bayesian Network: As all variables of the data set are already in the model, such a need never arises.
Module 3
28. — 27
CART (Classification And Regression Tree): For the same data set, we check how many business questions the Bayesian network
and CART models can answer.
[Diagrams: CART tree and Bayesian network. Node labels: ES_Supp_Mail_ind_2014, NS_Mail_Status_ind_2013, Participation Bucket, Onsite_Event_ind_2013, Mail_status_ind_2013, Zone, Open to New Business, Eligibility File]
Questions | CART | Bayesian Network
If we know that our participation is high, then find the responsible variables and their values to replicate the result in the future. | Yes | Yes
For my participation to lie in the higher bucket, should I offer payroll deduct or not? | Yes | Yes
I have not done any mailing activity this year. How did this affect my participation percentage? | No | Yes
To check for conditional dependencies, like mailing activity not being helpful if the account is already active. | No | Yes
Is the model easily interpretable? | Yes | Yes
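The first question in the table (participation is known to be high, so which variable values are responsible?) is a backward, or diagnostic, query. A minimal sketch using Bayes' rule follows. The two likelihoods are the ">5%" values from the Payroll Deduct pivot rows reported earlier in the deck; the 50/50 prior on Payroll Deduct is a hypothetical assumption, since the slides do not state how many accounts offer it.

```python
# P(bucket ">5%" | Payroll Deduct), taken from the Pivot Y / Pivot N rows.
p_high_given_pd = {"Y": 0.3479, "N": 0.0598}

# Hypothetical prior: the share of accounts offering payroll deduct is assumed.
prior_pd = {"Y": 0.5, "N": 0.5}

def posterior_payroll_deduct(likelihood, prior):
    """Bayes' rule: P(pd | high) is proportional to P(high | pd) * P(pd)."""
    joint = {k: likelihood[k] * prior[k] for k in prior}
    total = sum(joint.values())
    return {k: v / total for k, v in joint.items()}

posterior = posterior_payroll_deduct(p_high_given_pd, prior_pd)
print({k: round(v, 4) for k, v in posterior.items()})
```

Under this assumed prior, observing the highest bucket makes Payroll Deduct = Y far more likely than the prior alone, which is the sense in which a network supports reasoning from the target back to its drivers.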
Observations On Bayesian Networks and CART