Data Mining Tasks /
Operations
Managers need to perform a variety of tasks with the data to formulate the best
course of action and take decisions. These tasks are described in detail in
this unit.
Data mining tasks/operations relevant to CRM
1. Classification: mapping a given data item into one of several predefined
classes (a predictive technique)
2. Regression: predicting the value of a dependent variable based on the values of
one or more independent variables
3. Link Analysis: establishing relationships between items or variables in database
records to expose patterns and trends. Examples: association rules, sequential
patterns, time sequences
4. Segmentation: identifying a finite set of naturally occurring clusters or categories
to describe the data (clustering)
5. Deviation Detection: discovering the most significant changes in the data from
previously measured or expected values.
• Classification is perhaps the most basic form of data analysis.
• Classification is a process that maps a given data item into one of the several predefined
classes (Weiss & Kulikowski, 1991)
• A retail store allows customers to purchase on credit and pay later in installments. QUESTION:
which customers should receive this facility?
Answer → use a classification technique to predict whether a customer is creditworthy
• CRM uses classification for a variety of purposes like behavior prediction, product & customer
categorization. For example - The recipient of an offer can respond or not respond. An applicant
for a loan can repay on time, repay late or declare bankruptcy. A credit card transaction can be
normal or fraudulent. Classification helps in all the above cases
• Classification is meaningful only if we have predefined classes.
• In classification we are trying to give a correct label to a given input
• It is a directed (supervised) style data mining technique (there is a target variable that the
manager is interested in) and it is predictive.
1. Classification
Classification
Goal of Classification: mapping previously unseen records to a pre-
defined class as accurately as possible
1. Given a collection of historical records (used as a training set)
2. Each record contains a set of attributes, one of the attributes is the
‘CLASS’.
3. Build a model for class attribute as a function of the values of other
attributes.
4. A test set is used to determine the accuracy of the model. Usually, the
given data set is divided into training and test sets, with training set
used to build the model and test set used to validate it.
5. New customers can then be classified into two classes, for example:
those likely to default on installment payments and those unlikely to
default.
Example of use of classification in Income tax
department
• Identify people who would have underpaid tax
• classify people (historical data) into 2 groups – those who pay required tax (not
cheat) and those who underpay tax (cheat)
• Variables used to build a model – claim for tax refund, marital status, taxable
income
• Historical data divided into training set and test set. Model built with training
set and accuracy of the prediction checked with test set
Example of use of Classification (Income
tax)
Training set:

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Test set:

Refund  Marital Status  Taxable Income  Cheat
No      Single          75K             ?
Yes     Married         50K             ?
No      Married         150K            ?
Yes     Divorced        90K             ?
No      Single          40K             ?
No      Married         80K             ?

[Diagram: the training set is used to learn a classifier (the model); the test set is then used to validate it.]
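As a hedged illustration of this workflow, the sketch below trains a classifier on a toy version of the income-tax table above and scores it on a held-out split. The use of pandas/scikit-learn, the one-hot encoding, and the choice of a decision tree are assumptions for illustration, not part of the original slides.

```python
# Minimal sketch of the classification workflow (train/test split, learn, validate, predict),
# assuming pandas and scikit-learn are available. Data mirrors the income-tax slide.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

data = pd.DataFrame({
    "Refund":        ["Yes", "No", "No", "Yes", "No", "No", "Yes", "No", "No", "No"],
    "MaritalStatus": ["Single", "Married", "Single", "Married", "Divorced",
                      "Married", "Divorced", "Single", "Married", "Single"],
    "TaxableIncome": [125, 100, 70, 120, 95, 60, 220, 85, 75, 90],   # in thousands
    "Cheat":         ["No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"],
})

X = pd.get_dummies(data[["Refund", "MaritalStatus", "TaxableIncome"]])  # encode categoricals
y = data["Cheat"]                                                       # the CLASS attribute

# Divide the historical records into a training set and a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

model = DecisionTreeClassifier().fit(X_train, y_train)                  # learn the classifier
print("Test-set accuracy:", accuracy_score(y_test, model.predict(X_test)))

# Classify a new, previously unseen customer
new_customer = pd.DataFrame([{"Refund": "No", "MaritalStatus": "Single", "TaxableIncome": 75}])
new_customer = pd.get_dummies(new_customer).reindex(columns=X.columns, fill_value=0)
print("Predicted class:", model.predict(new_customer)[0])
```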
Classification: Applications
Direct Marketing
Goal: Reduce cost of mailing by targeting a set
of consumers most likely to buy a new
product.
Approach:
1. Use the data for a similar product
introduced before.
2. We know which customers decided
to buy and who decided otherwise.
This {buy, don’t buy} decision
forms the class attribute.
3. Use various demographic, lifestyle,
and company-interaction related
information about all such
customers.
4. Use this information as input
attributes to learn a classifier
model.
5. Use model on new data and send
mails only to those who are
predicted to buy
Credit Rating
Goal: Predict who will default payments for a
bank loan/credit scheme of a retail store
Approach:
1. Use demographics and behavioral data as
attributes: age, salary, marital status, tax
paid, etc.
2. Label past customers as low and high risk
(class attribute)
3. Learn a model for the class of the
transactions.
4. Use this model to detect possible
defaulters
5. Extend loans only to customers in the
‘low risk’ class
2. Regression (Prediction Task)
Regression Basics
• Managerial decisions are often based on relationship between 2 or more
variables
• Regression is a data mining task which helps us to find the relation between a
dependent variable (Y) and one (X) or more independent variables (X1, X2, X3 …..)
• We can quantify the strength of the relationship between the two variables.
• This can then be used to predict the value of Y, given the value of X.
• It is a predictive technique.
• It is a directed data mining technique
• Both dependent and independent variables are numeric in linear regression.
WHERE DO WE USE REGRESSION?
• Companies would like to know about factors significantly associated with their Key
Performance Indicators in order to be in a position to predict them.
• Marketing – Sales, Customer Satisfaction, Churn, New product acceptance
• Finance – Credit-risk
• HRM – Attrition, Job satisfaction
• Operations – efficiency
Regression uses in CRM
• Predicting sales using data on ad spending and
price (FMCG)
• Predicting demand from price (airline, FMCG, etc.)
• Predicting customer churn based on customer bill
details (mobile)
• Modelling sales with variables like shop opening
hours, discounts given, etc. (retail)
Regression basics
• Dependent (response/outcome - Y) and
Independent (explanatory/input - X) variable
• Scatter plot
• Line of best fit- The relationship between the
two variables is approximated by a straight
line
• b0 - This is the intercept of the regression
line with the y-axis. It is the value of Y if the
value of X = 0
• b1 - This is the SLOPE of the regression line.
Thus this is the amount that the Y variable
(dependent) will change for each 1 unit
change in the X variable
y = b0 + b1x + e
Linear regression calculates the equation
that minimizes the (squared vertical) distance between the
fitted line and the data points.
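To make the fitted-line idea concrete, here is a minimal sketch that estimates b0 and b1 by least squares; the ad-spend and sales figures are hypothetical and numpy's polyfit is an assumed tool choice.

```python
# Minimal simple-linear-regression sketch using numpy's least-squares fit.
# The advertising-spend vs. sales figures are hypothetical illustration data.
import numpy as np

ad_spend = np.array([10, 15, 20, 25, 30, 35])   # X: independent variable
sales    = np.array([25, 33, 41, 50, 57, 66])   # Y: dependent variable

b1, b0 = np.polyfit(ad_spend, sales, deg=1)     # slope (b1) and intercept (b0)
print(f"Fitted line: y = {b0:.2f} + {b1:.2f} x")

# Predict Y for a new value of X
new_x = 28
print("Predicted sales at ad spend 28:", round(b0 + b1 * new_x, 2))
```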
Find the answer
• If the slope of the regression line is calculated to be 2.5 and the
intercept 16 then what is the value of Y (dependent variable) when X
(independent) is 4?
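Worked answer (substituting into y = b0 + b1x): Y = 16 + 2.5 × 4 = 26.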
Simple Linear Regression Equation
Positive Linear Relationship
[Figure: scatter of E(y) against x with the regression line; slope b1 is positive, intercept b0 on the y-axis]
Corporate Example: Increase in end retailer knowledge of offers positively
correlated to pre-paid recharge sales in mobile
Simple Linear Regression Equation
Negative Linear Relationship
[Figure: scatter of E(y) against x with the regression line; slope b1 is negative, intercept b0 on the y-axis]
Price and demand of products negatively correlated
3. LINK ANALYSIS
It is a data mining technique that addresses relationships and
connections
Link Analysis
• Based on a branch of mathematics called Graph Theory, which represents
relationships between different objects as edges in a graph.
• Nodes (vertices): the things in the graph that have relationships, e.g. people,
organizations, objects, places, transactions, etc.
• Edges: The relationship connection between nodes
Examples of Graphs with Nodes and Edges
Link Analysis
• Link analysis seeks to establish relationships between items or variables in a database
to expose patterns and trends.
• It can also trace connections between items in records over a period of time. It is
mostly undirected.
• Helps in visualizing relationships. It is a descriptive technique
• Has limited application - cannot solve all types of problems
• Yields good results in the following CRM-related applications:
• Analyzing telephone call patterns
• Analyzing link between cities / places travelled by customers (airline,
transportation companies)
• Understanding referral patterns
• Analyzing patterns of purchase in shopping cart (characterizing product
affinities in retail purchases) – most common use
Understanding referral patterns
Link analysis helps us to
analyze referral patterns.
This helps us to locate the most
influential customers and
target them with offers
Telephone call patterns
Market Basket analysis
Used in understanding the link between
items in grocery shopping
Other Applications
• Analyzing link between web pages
• Social Network Analysis
• Spread of diseases (like flu) / Spread of ideas / diffusion of
innovation
• Crime & terror links
Google – Directed Graph Example
• Web pages = nodes
• Hyperlinks = edges
• Spiders & web crawlers keep the
link graph updated
Google – example continued
• Authority versus mere
popularity
• Rank by number of unrelated
sites linking to a site yields
popularity
• Rank by number of subject-
related hubs that point to
them yields authority
• Helps to overcome the
situation that often arises with
popularity ranking, where the real
authority is ranked lower
because links to it are less
popular
Page Rank
• Each link acts as a vote which
helps in ranking
• Analyse the incoming links to your
page
• Evaluate the quality of links
• Analyse the link building strategy
• Improve your page rankings
• Used in SEO
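A minimal sketch of the "links as votes" idea, assuming the networkx library; the four pages and the hyperlinks between them are invented for illustration.

```python
# Minimal PageRank sketch using networkx; the pages and their links are hypothetical.
import networkx as nx

g = nx.DiGraph()                      # web pages = nodes, hyperlinks = directed edges
g.add_edges_from([
    ("home", "products"), ("home", "blog"),
    ("blog", "products"), ("products", "checkout"),
    ("checkout", "home"),
])

scores = nx.pagerank(g, alpha=0.85)   # each incoming link acts as a weighted vote
for page, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{page:10s} {score:.3f}")
```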
LINK ANALYSIS
Criminal Investigation
Helps in finding terrorist groups
4. Deviation Detection
Anomalies or
Deviations
• Outliers are data points that are
considerably different from the
remainder of the data
• Anomalous events occur
infrequently but the consequences
can be dramatic and very often
negative
• In 1985 the British Antarctic Survey showed
a 10% depletion of the ozone layer.
Why did NASA’s Nimbus-7 satellite
not capture this fact? (Reportedly, its
data-processing software had been treating
such extreme readings as anomalies and
discarding them.)
Deviation Detection
• Focuses on discovering the most significant changes in the data from
previously measured, expected or normative values.
• Most CRM solutions have a DD task running in parallel on a regular
basis.
• Deviation might be a result of random occurrence or a result of true
change in fundamentals.
• It is predictive.
• It is exploratory in nature and does not ascertain the reason for the
occurrence of the detected pattern.
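A hedged sketch of one simple way to flag deviations: a z-score rule applied to a stream of daily store revenues. The revenue figures and the 3-standard-deviation threshold are illustrative assumptions, not a prescription.

```python
# Minimal deviation-detection sketch: flag values far from the historical mean (z-score rule).
# The daily revenue figures and the threshold of 3 standard deviations are assumptions.
import numpy as np

daily_revenue = np.array([102, 98, 105, 101, 99, 97, 103, 100, 104, 55, 101, 98])

mean, std = daily_revenue.mean(), daily_revenue.std()
z_scores = (daily_revenue - mean) / std

for day, (value, z) in enumerate(zip(daily_revenue, z_scores), start=1):
    if abs(z) > 3:
        print(f"Day {day}: revenue {value} deviates strongly from the norm (z = {z:.1f})")
```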
Application of Deviation detection
Mostly used for security purposes
• Fraud detection in insurance claims (fictitious, inflated claims)
• Deviation in credit card / ATM / debit card usage (purchases not fitting the profile of the
customer, large e-commerce purchases, flagging sudden
international transactions, small purchases followed by big purchases)
• Telecommunications fraud detection (subscription fraud, superimposed fraud)
• Store sales deviation (in chain stores – revenues, expenses)
• Deviation in medical parameters (genes linked diseases)
• Network intrusion detection
5. Segmentation
Segmentation
Segmentation is a key data
mining technique.
Segmentation refers to the business task
of identifying groups of records that are
similar to each other (A segment is a
group of consumers that react in a
similar way to a particular marketing
approach)
Discussion
• What is the difference between Segmentation and
Clustering?
Clustering ( a tool for segmentation) is the process of finding
similarities in customers so that they can be grouped, and
therefore segmented.
• What is the difference between Segmentation and
Classification?
Classification, in the most traditional sense, has training
labels and assigns each record to one of the
predefined classes. (It is a predictive technique)
Segmentation can be directed or undirected
• Directed - One simple way of segmentation may be segmenting customers based
on the type of credit cards that they use – silver, gold, platinum etc.
• Undirected – Sometimes target variables may not be known in advance; the goal
then is to search for patterns that suggest targets. The clustering tool is used for this
purpose. Example – grouping people together based on their sizes for stitching
military uniforms
Need for segmentation
• Today businesses that cast a wide net in their marketing activities cannot
maximize their profitability and return on investments.
• The ability to identify segments by latent needs, purchase behavior patterns,
attitudes, etc. makes it easier to design actionable marketing programs
Segmentation helps in
1. Providing a platform to identify and serve your most profitable customers
2. Focusing retention efforts
3. Customizing messages and offers
4. Identifying less-served groups
Some factors that can be used to segment
customers
• Geography / Demographics
• Psychographics – attitudes, interests, opinions, etc.
• Buying behaviour – usage, purchase occasion
• Length of relationship / Tenure in current subscription
• Revenue / profitability / CLV
• Loyalty / propensity to churn
• Channel used
An Example
Data Mining – tools
/techniques
Data mining tools
1.Decision Tree: a tool that classifies cases into a finite number
of classes by considering one variable at a time and dividing the entire set based
on it. (Helps in the classification task)
2.Rule Induction (Association rules): a tool that helps extract
formal rules from a set of observations. (Helps in finding links between
products)
3.Nearest Neighbor Technique: It is a tool which uses past data instances with
known output values to predict an unknown output value of a new data
instance. (Helps in deviation detection and classification)
4.Clustering: Clustering is the process of grouping observations of similar
kinds into smaller groups within the larger population. (Helps in
segmentation task)
5.Visualization: Graphical representation of Data
1. Decision tree
• A decision tree is a tool that classifies cases into a finite number of classes
by considering one variable at a time and dividing the entire set based on it
• Classification is possible because at each node, the decision tree splits the
database into segments based on the value of the variable considered
• A decision tree is a hierarchical collection of rules that describes how to divide a large
collection of records into successively smaller groups of records. With each
successive division, the members of the resulting segments become more and
more similar to one another with respect to the target
How it works?
• Decision trees recursively split data into smaller and smaller cells which
are increasingly “pure” in the sense of having similar values of the
target
• The decision tree uses the target variable to determine how each input should
be partitioned
• Taken together, the rules for all the segments form the decision tree
model
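A minimal sketch of inspecting such a hierarchical rule set in code, assuming scikit-learn; the tiny age/income credit-risk data is invented purely for illustration.

```python
# Minimal sketch: fit a small decision tree and print the hierarchical rules it learned.
# The toy credit-risk data (age, income -> risk class) is purely illustrative.
from sklearn.tree import DecisionTreeClassifier, export_text

X = [[25, 20], [30, 25], [45, 60], [50, 80], [35, 30], [55, 90]]   # [age, income in thousands]
y = ["high", "high", "low", "low", "high", "low"]                  # target: credit risk class

tree = DecisionTreeClassifier(max_depth=2).fit(X, y)
print(export_text(tree, feature_names=["age", "income"]))          # the tree as if-then rules
```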
2. Rule Induction
Rule induction is an area of machine learning in which formal rules are extracted
from a set of observations.
Rule Induction - Association Rules
• In rule induction, we focus on association rules in our syllabus.
• Association rules were originally derived from point-of-sale data that
describe transactions consisting of several products
• These are rules of the form ‘if A then B’ (A → B).
• These rules do not give the exact nature of the relationship but only point
towards a general interaction between the two items.
• Association analysis is a type of undirected data mining that finds patterns in
the data where the target is not specified beforehand
• Useful rules contain high-quality, actionable information. After the pattern is
found, justifying it by telling a story can lead to insights and action.
Association Rule – application in retail sector
• Association rules help in market
basket analysis in the retail sector:
what are the product affinities in a
shopping cart?
• IF <LHS> THEN <RHS>
• IF { X } THEN { Y }
• They imply a relationship between
X and Y, where X and Y can be
single items or sets of items.
• X is sometimes referred to as the
antecedent and Y as the
consequent
Rules Discovered:
{Milk} --> {Coke}
{Diaper, Milk} --> {Beer}
Rules are not always useful. They might be
actionable at times, and trivial or inexplicable
at other times
• Wal-Mart customers who purchase Barbie dolls have a 60% likelihood of
also purchasing one of the three types of a brand of candy bars (actionable).
• Customers who purchase maintenance agreements are very likely to
purchase large appliances (trivial findings)
• When a new hardware store opens, one of the most commonly sold items is
toilet cleaners (inexplicable)
Developing Association Rules
• The table on the right side indicates
transactions (of 5 different
customers) from a grocery store.
• Each transaction gives information
about which products are purchased
with which other products
• The information is then copied on to
a co-occurrence matrix that records the
number of times any pair of products
was purchased together.
• The goal is to turn item sets into
association rules.
Co-Occurrence of Products
[Image: co-occurrence matrix built from the five transactions]
The numbers in the co-occurrence matrix
represent the number of transactions in which a pair of products appears together.
For example, window cleaner and orange
juice occur together only once, while soda
and orange juice occur together in 2
transactions out of 5.
How good is an association rule
Three traditional methods for
measuring the goodness of an
association rule are
•Support
•Confidence
•Lift
Support, Confidence and Lift
• Support shows the prevalence of the rule in the given set of transactions. It
measures the ratio of transactions that contain all the items in the rule to
the total number of transactions. A high level of support indicates that the
rule is frequent enough for the business to be interested in it.
• Confidence measures how good a rule is at predicting the right hand side,
by comparing how often the right-hand side appears when the condition
on the left-hand side is true. It is the probability of the occurrence of the
consequent once the antecedent has taken place (conditional probability). It
is calculated as the ratio of number of transactions supporting the entire
rule to the number of transactions supporting the left-hand side of the rule.
• Lift measures the power of association of the rule. Lift indicates how much
more likely the consequent is to be purchased if the customer has bought
the antecedent as compared to the likelihood of consequent being
purchased without the other item. Lift must be above 1 for the association
to be of interest. Lift is calculated as the ratio of the confidence of the rule to
the prevalence (support) of the right-hand side of the rule.
Calculating Support
Let A = soda and B = orange juice.
Support(A → B) = n(A ∩ B) / total number of transactions
• Number of transactions in which
soda and orange juice
appear together = 2
• Total number of transactions
(total customer billings) = 5
• Support for IF <Soda> THEN
<OJ> = 2/5 = 40%
Calculating Confidence
Confidence(A → B) = Support(A ∩ B) / Support(A)
Support(A ∩ B) = 2/5
Support(A) = 3/5
Confidence for IF <Soda> THEN <OJ> = (2/5) / (3/5) = 67%
Calculating Lift
• Lift (also called improvement) measures the power of
association of the full rule.
• The reason for doing that is because the items on the RHS
might be very common so the confidence may not be telling us
anything regarding the actual predictability of the rule
Lift(A → B) = Support(A ∩ B) / (Support(A) × Support(B))
Or, equivalently:
• Lift(A → B) = Confidence(A → B) / Support(B)
Calculating Lift
Confidence of the rule A → B = 67%
Support(B) = 4/5 = 80%
Lift(A → B) = 0.67 / 0.80 ≈ 0.84
Here Lift is < 1, so the rule is not useful.
If Lift is > 1, the association rule
is well established: the RHS appears in a larger
proportion of transactions when the LHS is present
than when it is not.
Hypothetically, if Lift in this case were 1.25, then
we could say that ‘‘OJ’’ is 25% more likely to be in the
transaction when ‘‘Soda’’ is purchased than when soda is
not purchased.
Association Rule : Application
Retail
1. Supermarket shelf management
2. Bundling products with offers
3. Inventory Management
4. Pricing associated products
5. Catalog designing
6. Recommender systems
Other sectors
1. Items purchased on a credit card such as rental cars and hotel rooms.
2. Optional services purchased by telecommunication customer (call forwarding, caller tune etc.)
3. Banking products used by retail customers (money market accounts, investment services, car
loans etc.)
4. Medical patient histories can give indications of likely complications on certain combinations of
treatments.
Support, confidence and Lift
1. Milk, Bread, Orange juice
2. Milk, Bread, Floor cleaner
3. Milk, Bread, Chocolate
4. Chocolate, Floor cleaner
Support of ‘If milk then bread’ = 75%
Confidence = 100%
Lift of ‘If milk then bread’ = 0.75 / (0.75 × 0.75)
= 1.33
Inference → there is a strong positive
association

1. Milk, Cornflakes, Noodles
2. Milk, Shampoo, Bread
3. Shampoo, Soap, Noodles
4. Milk, Cornflakes
Support of ‘If milk then shampoo’ = 25%
Confidence = 33%
Lift of ‘If milk then shampoo’ = 0.25 / (0.75 × 0.5)
= 0.66
Inference → Lift < 1, so no positive association
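The two worked examples above can be reproduced with a short script. This is a minimal sketch in plain Python; the helper function name is arbitrary.

```python
# Minimal sketch: compute support, confidence and lift for an "if A then B" rule
# directly from the transaction lists on this slide.
def rule_metrics(transactions, a, b):
    n = len(transactions)
    support_a  = sum(a in t for t in transactions) / n
    support_b  = sum(b in t for t in transactions) / n
    support_ab = sum(a in t and b in t for t in transactions) / n
    confidence = support_ab / support_a
    lift = confidence / support_b
    return support_ab, confidence, lift

baskets = [
    {"Milk", "Bread", "Orange juice"},
    {"Milk", "Bread", "Floor cleaner"},
    {"Milk", "Bread", "Chocolate"},
    {"Chocolate", "Floor cleaner"},
]
print(rule_metrics(baskets, "Milk", "Bread"))    # -> support 0.75, confidence 1.0, lift ~1.33
```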
3. Nearest Neighbor
Technique
NNT is a Classification and Predictive data mining technique
NNT - Introduction
• Distance and similarity are central to the nearest neighbour
technique
• Uses past data instances with known output values to predict an
unknown output value of a new data instance.
• It is based on:
1. Memory-Based Reasoning(MBR)
Identifying similar cases from past experience and applying the
information from these cases to the problem at hand.
2. Collaborative filtering or Social filtering
Finds neighbours that are similar (with similar preferences) to a new
record and uses them for classification and prediction.
Height and weight of
known cases
• X axis – Height
• Y axis – weight
• Suppose the dark circles represent Men and
White circles represent women. Given the height
and weight of the new data point we can predict if
it is a man or a woman.
• NNT depends on two operations (a minimal sketch follows below):
- a distance function capable of calculating the distance
between any two records
- a combination function capable of combining results
from several neighbours to arrive at an answer
• The K value (how many neighbours are considered
while combining) has to be specified
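A minimal k-nearest-neighbour sketch on the height/weight idea from this slide, assuming scikit-learn; the measurements, labels and the choice K = 3 are invented for illustration.

```python
# Minimal k-NN sketch: classify a new point by combining its K nearest neighbours.
# The height/weight measurements and labels are hypothetical.
from sklearn.neighbors import KNeighborsClassifier

X = [[180, 82], [175, 78], [183, 88], [160, 55], [165, 60], [158, 52]]  # [height cm, weight kg]
y = ["man", "man", "man", "woman", "woman", "woman"]

knn = KNeighborsClassifier(n_neighbors=3)   # K = 3 neighbours, combined by majority vote
knn.fit(X, y)

print(knn.predict([[170, 65]]))             # classify a new, unlabeled data point
```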
Attributes to consider
• People living in the same neighborhood generally have similar
household incomes; hence, if a new person moves into the
neighborhood, there is a high chance that their income is similar to
that of the other people in the neighborhood
• However, a variety of other factors can affect a person’s
income, such as the degree attained or the college attended,
and these may be better predictors of the person’s income
• So ‘near’ can mean the same degree attained or college attended rather
than the same neighborhood
Applications of NNT
• Deviation detection – flag cases for further investigation based on
nearest neighbour
• Customer response prediction – Next likely customers for campaigns
• Loan default prediction – predicting default based on income and
debt or any other available variables
• Medical treatment – Diagnosis and treatment of cases (classifying
anomalies in mammograms)
• Using to fill Missing values in data - classification
• Plagiarism testing
• Making recommendations - Netflix
NNT characteristics
• NNT uses data “as is” and is not very sensitive to the format of records, as the algorithms are robust to
missing and erroneous data
• It has many advantages, but NNT is a resource hog: it requires a large amount of historical data and a
large amount of processing time
Finding nearest neighbor for audio files
• Shazam identifies the song in half a minute
and provides the name of the song, artist etc
• Over 100 million downloads
• Many challenges for this app –direct
comparison of snippets of music will not
work
• Uses nearest neighbor technique
• To classify songs based on their audio
signature
Spectrogram: Picture of a song in
the frequency domain. (difficult
to identify)
Some apps are listening to you through the smartphone’s mic to
track your TV viewing, says report
• The apps use software from
Alphonso, a start-up that collects TV-
viewing data for advertisers. Using a
smartphone’s microphone,
Alphonso’s software can detail what
people watch by identifying audio
signals in TV ads and shows,
sometimes even matching that
information with the places people
visit and the movies they see. The
information can then be used to
target ads more precisely and to try
to analyze things like which ads
prompted a person to go to a car
dealership.
This feature is based on NNT
Measuring Distance and similarity – NNT
works with all kinds of data
Marketing data for five customers – NNT
works with categorical data also
Distance between the data points
4. Clustering
Clustering is the process of grouping
observations of similar kinds into smaller groups
within the larger population.
Clusters are homogeneous within and
heterogeneous across, based on given
characteristics.
Example – Sales data for 9 stores

Store  Electric Kettles  Refrigerators
S1     35                12
S2     42                15
S3     40                10
S4     10                20
S5     14                18
S6     13                19
S7     4                 25
S8     8                 30
S9     6                 27
Clusters of stores
[Scatter plot: electric kettle sales against refrigerator sales for the nine stores, showing the store clusters]
Characteristics of Clustering
• No dependent variable; no causal relationships
• Popularly called subjective segmentation
• Undirected / unsupervised data mining technique
Characteristics of Clustering
• It relies on geometric interpretation of data as points in space
• Based upon distance and similarity concept
• Clustering can be a preliminary step for data mining.
Data preparation for clustering
• Data preparation issues same as those for memory based reasoning because both rely on the
concept of distance.
• Scaling and weighting are important
• Scaling adjusts the value of variables to take into account the fact that different
variables are measured in different units or over different ranges.
• Weighting provides a relative adjustment for a variable because some variables are more
important than others
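A minimal sketch of scaling before clustering, assuming numpy; the two variables (annual revenue and visits per year) and their values are hypothetical. Weighting could then be applied by multiplying a column by its relative importance.

```python
# Minimal sketch: z-score standardisation so variables in different units become comparable
# before distances are computed. The customer figures below are hypothetical.
import numpy as np

# columns: annual revenue (thousands), number of visits per year
customers = np.array([[120.0, 4], [80.0, 12], [200.0, 2], [95.0, 9]])

scaled = (customers - customers.mean(axis=0)) / customers.std(axis=0)
print(scaled)
```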
The K-Means clustering algorithm
• One of the most commonly used clustering algorithms.
• “K” refers to the fact that the algorithm looks for a fixed number of
clusters
• K is specified by the user
• Each record is considered a point in a scatter plot, which implies that
all the input variables are numeric
• The data can be pictured as clouds on the scatter plot
Steps in K-means clustering Algorithm
• K-means algorithm starts with an initial guess and uses a series of steps to improve upon
it.
• Initial cluster seeds are chosen randomly
• Each record is ASSIGNED to the cluster defined by its nearest cluster seed
• Next the algorithm identifies the Centroid for each cluster. A Centroid is the average
position of cluster members in each dimension.
• The data points are now UPDATED to new clusters based on their distance from the
centroid
• Algorithm alternates between two steps
• Assignment step
• Update step
Steps in K-means clustering Algorithm
• When the assignment step changes the cluster membership, it causes the
update step to be executed again. Successive iteration is done till no new
clusters emerge
• The algorithm continues alternating between update and assignment until
no new assignment is made
• A diagrammatic representation of the K-means clustering is shown in the
following slides
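A minimal sketch of the algorithm described above, assuming scikit-learn and reusing the nine-store kettle/refrigerator data from the earlier example; choosing K = 3 is an illustrative assumption.

```python
# Minimal K-means sketch on the nine-store sales data (electric kettles, refrigerators).
# K = 3 is the user-specified number of clusters; the library handles the assign/update loop.
import numpy as np
from sklearn.cluster import KMeans

stores = np.array([
    [35, 12], [42, 15], [40, 10],   # S1-S3
    [10, 20], [14, 18], [13, 19],   # S4-S6
    [ 4, 25], [ 8, 30], [ 6, 27],   # S7-S9
])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(stores)
print("Cluster assignments:", kmeans.labels_)
print("Cluster centroids:\n", kmeans.cluster_centers_)
```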
K – Means Clustering Algorithm – First
step – random cluster seeds are chosen
K – Means Clustering Algorithm –
Assignment step
K – Means Clustering Algorithm –
Calculating Centroid
K – Means Clustering Algorithm – Update
step
Hard and soft clusters
• A cluster model can make “hard” or “soft” cluster
assignments
• Hard clustering assigns each record to a single cluster
• Soft clustering associates each record with several clusters
with varying degree of strength.
Evaluating clusters
• Good cluster centre should be the densest part of the data cloud.
• The best assignment of cluster centres could be defined as the one that minimizes
the sum of distance from every data point to its nearest cluster centre.
• Clusters should have members with high degree of similarity (Close to each other)
• Clusters themselves should be widely spaced
• Clusters should have roughly the same number of members (except for detecting
fraud and anomalies)
Some basic features about clustering
• There may not be any prior knowledge concerning the clusters
• Cluster results are dynamic as the cluster membership may change over
time
• There is no single correct answer to a clustering problem
• Outliers are difficult to handle in clustering – they either end up in solitary
clusters or are forced into the nearest cluster
Clustering: USES
• Grouping customers for targeting offers
• Grouping customers for cross selling and upselling
• Grouping customers for customizing
• Categorizing products for mark down (In the first scatter plot electric kettles
can be discounted in certain stores and refrigerators can be marked down in
certain others)
• Calculating footfall to branches of a store based on their location
Voronoi Diagram
• Based on the concept of clustering
• Each region on a Voronoi diagram consists of all the points closest to one of
the cluster centres.
• It is a diagram whose lines mark the points that are equidistant from the two
nearest cluster seeds.
• Voronoi diagrams facilitate nearest-location queries, such as finding the
closest metro station, cell phone tower, etc. to a given address.
• A Voronoi representation can be used to find the catchment zone for outlets of
a chain store. It can be used to understand the number of students who will
enrol in the nearest primary school, the population who will visit the
closest healthcare centre, etc.
Example
There are 19 pizza outlets in Mumbai belonging to the same company, selling
the same products with the same pricing and quality of service. We want to
estimate the customer footfall for a particular store in one region of Mumbai.
We assume that each customer visits the closest store. A Voronoi
diagram will define the catchment area for each of the
stores, which gives us a rough estimate of the customer footfall for each of
the stores.
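A minimal sketch of the nearest-outlet idea behind these catchment zones, assuming scipy; the outlet coordinates and the customer location are invented map positions, and plain Euclidean distance is used for simplicity.

```python
# Minimal sketch: assign a customer to the nearest outlet (the owner of that Voronoi cell).
# Outlet and customer coordinates are hypothetical map positions.
import numpy as np
from scipy.spatial import cKDTree

outlets = np.array([[72.83, 19.07], [72.86, 19.12], [72.88, 19.05], [72.90, 19.10]])
tree = cKDTree(outlets)                    # fast nearest-neighbour lookup over outlet locations

customer = np.array([72.87, 19.08])
distance, index = tree.query(customer)     # nearest outlet = owner of the Voronoi cell
print(f"Customer falls in the catchment of outlet {index} (distance {distance:.3f})")
```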
More Related Content

Similar to Module_6_-_Datamining_tasks_and_tools_uGuVaDv4iv-2.pptx

Measurement and scaling
Measurement and scalingMeasurement and scaling
Rd big data & analytics v1.0
Rd big data & analytics v1.0Rd big data & analytics v1.0
Rd big data & analytics v1.0
Yadu Balehosur
 
Machine_Learning.pptx
Machine_Learning.pptxMachine_Learning.pptx
Machine_Learning.pptx
VickyKumar131533
 
Hair_EOMA_1e_Chap001_PPT.pptx
Hair_EOMA_1e_Chap001_PPT.pptxHair_EOMA_1e_Chap001_PPT.pptx
Hair_EOMA_1e_Chap001_PPT.pptx
AsadAli104515
 
Introduction to Business Analytics---PPT
Introduction to Business Analytics---PPTIntroduction to Business Analytics---PPT
Introduction to Business Analytics---PPT
Neerupa Chauhan
 
Analytics
AnalyticsAnalytics
Chapter 4 Classification in data sience .pdf
Chapter 4 Classification in data sience .pdfChapter 4 Classification in data sience .pdf
Chapter 4 Classification in data sience .pdf
AschalewAyele2
 
2023 Supervised_Learning_Association_Rules
2023 Supervised_Learning_Association_Rules2023 Supervised_Learning_Association_Rules
2023 Supervised_Learning_Association_Rules
FEG
 
Cluster2
Cluster2Cluster2
Cluster2work
 
Statistical Learning - Introduction.pptx
Statistical Learning - Introduction.pptxStatistical Learning - Introduction.pptx
Statistical Learning - Introduction.pptx
JayaprakashGururaj
 
Business intelligence
Business intelligenceBusiness intelligence
Business intelligenceFaisal Aziz
 
Machine Learning - Algorithms and simple business cases
Machine Learning - Algorithms and simple business casesMachine Learning - Algorithms and simple business cases
Machine Learning - Algorithms and simple business cases
Claudio Mirti
 
Supply Chain Analytics, Supply Chain Management, Supply Chain Data Analytics
Supply Chain Analytics, Supply Chain Management, Supply Chain Data AnalyticsSupply Chain Analytics, Supply Chain Management, Supply Chain Data Analytics
Supply Chain Analytics, Supply Chain Management, Supply Chain Data Analytics
MujtabaAliKhan12
 
[MPKD1] Introduction to business analytics and simulation
[MPKD1] Introduction to business analytics and simulation[MPKD1] Introduction to business analytics and simulation
[MPKD1] Introduction to business analytics and simulation
Nguyen Ngoc Binh Phuong
 
Measurement and scaling
Measurement and scalingMeasurement and scaling
The Art and Science of Database(d) Marketing
The Art and Science of Database(d) MarketingThe Art and Science of Database(d) Marketing
The Art and Science of Database(d) MarketingRobert Fulmer
 
Introduction to machine learning
Introduction to machine learningIntroduction to machine learning
Introduction to machine learning
Sanghamitra Deb
 
Data Analysis - Approach & Techniques
Data Analysis - Approach & TechniquesData Analysis - Approach & Techniques
Data Analysis - Approach & Techniques
InvenkLearn
 
Customer_Churn_prediction.pptx
Customer_Churn_prediction.pptxCustomer_Churn_prediction.pptx
Customer_Churn_prediction.pptx
Aniket Patil
 
Customer_Churn_prediction.pptx
Customer_Churn_prediction.pptxCustomer_Churn_prediction.pptx
Customer_Churn_prediction.pptx
patilaniket2418
 

Similar to Module_6_-_Datamining_tasks_and_tools_uGuVaDv4iv-2.pptx (20)

Measurement and scaling
Measurement and scalingMeasurement and scaling
Measurement and scaling
 
Rd big data & analytics v1.0
Rd big data & analytics v1.0Rd big data & analytics v1.0
Rd big data & analytics v1.0
 
Machine_Learning.pptx
Machine_Learning.pptxMachine_Learning.pptx
Machine_Learning.pptx
 
Hair_EOMA_1e_Chap001_PPT.pptx
Hair_EOMA_1e_Chap001_PPT.pptxHair_EOMA_1e_Chap001_PPT.pptx
Hair_EOMA_1e_Chap001_PPT.pptx
 
Introduction to Business Analytics---PPT
Introduction to Business Analytics---PPTIntroduction to Business Analytics---PPT
Introduction to Business Analytics---PPT
 
Analytics
AnalyticsAnalytics
Analytics
 
Chapter 4 Classification in data sience .pdf
Chapter 4 Classification in data sience .pdfChapter 4 Classification in data sience .pdf
Chapter 4 Classification in data sience .pdf
 
2023 Supervised_Learning_Association_Rules
2023 Supervised_Learning_Association_Rules2023 Supervised_Learning_Association_Rules
2023 Supervised_Learning_Association_Rules
 
Cluster2
Cluster2Cluster2
Cluster2
 
Statistical Learning - Introduction.pptx
Statistical Learning - Introduction.pptxStatistical Learning - Introduction.pptx
Statistical Learning - Introduction.pptx
 
Business intelligence
Business intelligenceBusiness intelligence
Business intelligence
 
Machine Learning - Algorithms and simple business cases
Machine Learning - Algorithms and simple business casesMachine Learning - Algorithms and simple business cases
Machine Learning - Algorithms and simple business cases
 
Supply Chain Analytics, Supply Chain Management, Supply Chain Data Analytics
Supply Chain Analytics, Supply Chain Management, Supply Chain Data AnalyticsSupply Chain Analytics, Supply Chain Management, Supply Chain Data Analytics
Supply Chain Analytics, Supply Chain Management, Supply Chain Data Analytics
 
[MPKD1] Introduction to business analytics and simulation
[MPKD1] Introduction to business analytics and simulation[MPKD1] Introduction to business analytics and simulation
[MPKD1] Introduction to business analytics and simulation
 
Measurement and scaling
Measurement and scalingMeasurement and scaling
Measurement and scaling
 
The Art and Science of Database(d) Marketing
The Art and Science of Database(d) MarketingThe Art and Science of Database(d) Marketing
The Art and Science of Database(d) Marketing
 
Introduction to machine learning
Introduction to machine learningIntroduction to machine learning
Introduction to machine learning
 
Data Analysis - Approach & Techniques
Data Analysis - Approach & TechniquesData Analysis - Approach & Techniques
Data Analysis - Approach & Techniques
 
Customer_Churn_prediction.pptx
Customer_Churn_prediction.pptxCustomer_Churn_prediction.pptx
Customer_Churn_prediction.pptx
 
Customer_Churn_prediction.pptx
Customer_Churn_prediction.pptxCustomer_Churn_prediction.pptx
Customer_Churn_prediction.pptx
 

Recently uploaded

Cree_Rey_BrandIdentityKit.PDF_PersonalBd
Cree_Rey_BrandIdentityKit.PDF_PersonalBdCree_Rey_BrandIdentityKit.PDF_PersonalBd
Cree_Rey_BrandIdentityKit.PDF_PersonalBd
creerey
 
LA HUG - Video Testimonials with Chynna Morgan - June 2024
LA HUG - Video Testimonials with Chynna Morgan - June 2024LA HUG - Video Testimonials with Chynna Morgan - June 2024
LA HUG - Video Testimonials with Chynna Morgan - June 2024
Lital Barkan
 
Training my puppy and implementation in this story
Training my puppy and implementation in this storyTraining my puppy and implementation in this story
Training my puppy and implementation in this story
WilliamRodrigues148
 
ikea_woodgreen_petscharity_dog-alogue_digital.pdf
ikea_woodgreen_petscharity_dog-alogue_digital.pdfikea_woodgreen_petscharity_dog-alogue_digital.pdf
ikea_woodgreen_petscharity_dog-alogue_digital.pdf
agatadrynko
 
Company Valuation webinar series - Tuesday, 4 June 2024
Company Valuation webinar series - Tuesday, 4 June 2024Company Valuation webinar series - Tuesday, 4 June 2024
Company Valuation webinar series - Tuesday, 4 June 2024
FelixPerez547899
 
Cracking the Workplace Discipline Code Main.pptx
Cracking the Workplace Discipline Code Main.pptxCracking the Workplace Discipline Code Main.pptx
Cracking the Workplace Discipline Code Main.pptx
Workforce Group
 
Authentically Social by Corey Perlman - EO Puerto Rico
Authentically Social by Corey Perlman - EO Puerto RicoAuthentically Social by Corey Perlman - EO Puerto Rico
Authentically Social by Corey Perlman - EO Puerto Rico
Corey Perlman, Social Media Speaker and Consultant
 
Premium MEAN Stack Development Solutions for Modern Businesses
Premium MEAN Stack Development Solutions for Modern BusinessesPremium MEAN Stack Development Solutions for Modern Businesses
Premium MEAN Stack Development Solutions for Modern Businesses
SynapseIndia
 
Exploring Patterns of Connection with Social Dreaming
Exploring Patterns of Connection with Social DreamingExploring Patterns of Connection with Social Dreaming
Exploring Patterns of Connection with Social Dreaming
Nicola Wreford-Howard
 
Meas_Dylan_DMBS_PB1_2024-05XX_Revised.pdf
Meas_Dylan_DMBS_PB1_2024-05XX_Revised.pdfMeas_Dylan_DMBS_PB1_2024-05XX_Revised.pdf
Meas_Dylan_DMBS_PB1_2024-05XX_Revised.pdf
dylandmeas
 
amptalk_RecruitingDeck_english_2024.06.05
amptalk_RecruitingDeck_english_2024.06.05amptalk_RecruitingDeck_english_2024.06.05
amptalk_RecruitingDeck_english_2024.06.05
marketing317746
 
Search Disrupted Google’s Leaked Documents Rock the SEO World.pdf
Search Disrupted Google’s Leaked Documents Rock the SEO World.pdfSearch Disrupted Google’s Leaked Documents Rock the SEO World.pdf
Search Disrupted Google’s Leaked Documents Rock the SEO World.pdf
Arihant Webtech Pvt. Ltd
 
Bài tập - Tiếng anh 11 Global Success UNIT 1 - Bản HS.doc.pdf
Bài tập - Tiếng anh 11 Global Success UNIT 1 - Bản HS.doc.pdfBài tập - Tiếng anh 11 Global Success UNIT 1 - Bản HS.doc.pdf
Bài tập - Tiếng anh 11 Global Success UNIT 1 - Bản HS.doc.pdf
daothibichhang1
 
Putting the SPARK into Virtual Training.pptx
Putting the SPARK into Virtual Training.pptxPutting the SPARK into Virtual Training.pptx
Putting the SPARK into Virtual Training.pptx
Cynthia Clay
 
ikea_woodgreen_petscharity_cat-alogue_digital.pdf
ikea_woodgreen_petscharity_cat-alogue_digital.pdfikea_woodgreen_petscharity_cat-alogue_digital.pdf
ikea_woodgreen_petscharity_cat-alogue_digital.pdf
agatadrynko
 
Buy Verified PayPal Account | Buy Google 5 Star Reviews
Buy Verified PayPal Account | Buy Google 5 Star ReviewsBuy Verified PayPal Account | Buy Google 5 Star Reviews
Buy Verified PayPal Account | Buy Google 5 Star Reviews
usawebmarket
 
Agency Managed Advisory Board As a Solution To Career Path Defining Business ...
Agency Managed Advisory Board As a Solution To Career Path Defining Business ...Agency Managed Advisory Board As a Solution To Career Path Defining Business ...
Agency Managed Advisory Board As a Solution To Career Path Defining Business ...
Boris Ziegler
 
FINAL PRESENTATION.pptx12143241324134134
FINAL PRESENTATION.pptx12143241324134134FINAL PRESENTATION.pptx12143241324134134
FINAL PRESENTATION.pptx12143241324134134
LR1709MUSIC
 
Sustainability: Balancing the Environment, Equity & Economy
Sustainability: Balancing the Environment, Equity & EconomySustainability: Balancing the Environment, Equity & Economy
Sustainability: Balancing the Environment, Equity & Economy
Operational Excellence Consulting
 
Kseniya Leshchenko: Shared development support service model as the way to ma...
Kseniya Leshchenko: Shared development support service model as the way to ma...Kseniya Leshchenko: Shared development support service model as the way to ma...
Kseniya Leshchenko: Shared development support service model as the way to ma...
Lviv Startup Club
 

Recently uploaded (20)

Cree_Rey_BrandIdentityKit.PDF_PersonalBd
Cree_Rey_BrandIdentityKit.PDF_PersonalBdCree_Rey_BrandIdentityKit.PDF_PersonalBd
Cree_Rey_BrandIdentityKit.PDF_PersonalBd
 
LA HUG - Video Testimonials with Chynna Morgan - June 2024
LA HUG - Video Testimonials with Chynna Morgan - June 2024LA HUG - Video Testimonials with Chynna Morgan - June 2024
LA HUG - Video Testimonials with Chynna Morgan - June 2024
 
Training my puppy and implementation in this story
Training my puppy and implementation in this storyTraining my puppy and implementation in this story
Training my puppy and implementation in this story
 
ikea_woodgreen_petscharity_dog-alogue_digital.pdf
ikea_woodgreen_petscharity_dog-alogue_digital.pdfikea_woodgreen_petscharity_dog-alogue_digital.pdf
ikea_woodgreen_petscharity_dog-alogue_digital.pdf
 
Company Valuation webinar series - Tuesday, 4 June 2024
Company Valuation webinar series - Tuesday, 4 June 2024Company Valuation webinar series - Tuesday, 4 June 2024
Company Valuation webinar series - Tuesday, 4 June 2024
 
Cracking the Workplace Discipline Code Main.pptx
Cracking the Workplace Discipline Code Main.pptxCracking the Workplace Discipline Code Main.pptx
Cracking the Workplace Discipline Code Main.pptx
 
Authentically Social by Corey Perlman - EO Puerto Rico
Authentically Social by Corey Perlman - EO Puerto RicoAuthentically Social by Corey Perlman - EO Puerto Rico
Authentically Social by Corey Perlman - EO Puerto Rico
 
Premium MEAN Stack Development Solutions for Modern Businesses
Premium MEAN Stack Development Solutions for Modern BusinessesPremium MEAN Stack Development Solutions for Modern Businesses
Premium MEAN Stack Development Solutions for Modern Businesses
 
Exploring Patterns of Connection with Social Dreaming
Exploring Patterns of Connection with Social DreamingExploring Patterns of Connection with Social Dreaming
Exploring Patterns of Connection with Social Dreaming
 
Meas_Dylan_DMBS_PB1_2024-05XX_Revised.pdf
Meas_Dylan_DMBS_PB1_2024-05XX_Revised.pdfMeas_Dylan_DMBS_PB1_2024-05XX_Revised.pdf
Meas_Dylan_DMBS_PB1_2024-05XX_Revised.pdf
 
amptalk_RecruitingDeck_english_2024.06.05
amptalk_RecruitingDeck_english_2024.06.05amptalk_RecruitingDeck_english_2024.06.05
amptalk_RecruitingDeck_english_2024.06.05
 
Search Disrupted Google’s Leaked Documents Rock the SEO World.pdf
Search Disrupted Google’s Leaked Documents Rock the SEO World.pdfSearch Disrupted Google’s Leaked Documents Rock the SEO World.pdf
Search Disrupted Google’s Leaked Documents Rock the SEO World.pdf
 
Bài tập - Tiếng anh 11 Global Success UNIT 1 - Bản HS.doc.pdf
Bài tập - Tiếng anh 11 Global Success UNIT 1 - Bản HS.doc.pdfBài tập - Tiếng anh 11 Global Success UNIT 1 - Bản HS.doc.pdf
Bài tập - Tiếng anh 11 Global Success UNIT 1 - Bản HS.doc.pdf
 
Putting the SPARK into Virtual Training.pptx
Putting the SPARK into Virtual Training.pptxPutting the SPARK into Virtual Training.pptx
Putting the SPARK into Virtual Training.pptx
 
ikea_woodgreen_petscharity_cat-alogue_digital.pdf
ikea_woodgreen_petscharity_cat-alogue_digital.pdfikea_woodgreen_petscharity_cat-alogue_digital.pdf
ikea_woodgreen_petscharity_cat-alogue_digital.pdf
 
Buy Verified PayPal Account | Buy Google 5 Star Reviews
Buy Verified PayPal Account | Buy Google 5 Star ReviewsBuy Verified PayPal Account | Buy Google 5 Star Reviews
Buy Verified PayPal Account | Buy Google 5 Star Reviews
 
Agency Managed Advisory Board As a Solution To Career Path Defining Business ...
Agency Managed Advisory Board As a Solution To Career Path Defining Business ...Agency Managed Advisory Board As a Solution To Career Path Defining Business ...
Agency Managed Advisory Board As a Solution To Career Path Defining Business ...
 
FINAL PRESENTATION.pptx12143241324134134
FINAL PRESENTATION.pptx12143241324134134FINAL PRESENTATION.pptx12143241324134134
FINAL PRESENTATION.pptx12143241324134134
 
Sustainability: Balancing the Environment, Equity & Economy
Sustainability: Balancing the Environment, Equity & EconomySustainability: Balancing the Environment, Equity & Economy
Sustainability: Balancing the Environment, Equity & Economy
 
Kseniya Leshchenko: Shared development support service model as the way to ma...
Kseniya Leshchenko: Shared development support service model as the way to ma...Kseniya Leshchenko: Shared development support service model as the way to ma...
Kseniya Leshchenko: Shared development support service model as the way to ma...
 

Module_6_-_Datamining_tasks_and_tools_uGuVaDv4iv-2.pptx

  • 1. Data Mining Tasks / Operations Managers need to perform a variety of tasks with the data to formulate best course of action / take decisions. The tasks are given in detail in this unit.
  • 2. Data mining tasks/operations relevant to CRM 1. Classification: mapping a given data item into one of the several predefined classes. (It is a predictive technique) 2. Regression: Predicting the value of a dependent variable based on values of independent variable/s 3. Link Analysis: Establishing relationship between items or variables in a database record to expose patterns and trends. Example- Association rules, Sequential patterns, Time Sequences 4. Segmentation: Identifying a finite set of naturally occurring clusters or categories to describe data : Clustering 5. Deviation Detection: Discovering the most significant change in the data from previously measured or expected values.
  • 3. • Classification is perhaps the most basic form of data analysis. • Classification is a process that maps a given data item into one of the several predefined classes (Weiss & Kulikowski, 1991) • A Retail store allows customers to purchase on credit & pay later in installments QUESTION- Which of the customers are going to receive this facility?? Answer  use classification technique to predict if customer is worthy of credit • CRM uses classification for a variety of purposes like behavior prediction, product & customer categorization. For example - The recipient of an offer can respond or not respond. An applicant for a loan can repay on time, repay late or declare bankruptcy. A credit card transaction can be normal or fraudulent. Classification helps in all the above cases • Classification is meaningful only if we have predefined classes. • In classification we are trying to give a correct label to a given input • It is a directed (supervised) style data mining technique (there is a target variable that the manager is interested in) and it is predictive. 1. Classification
  • 4. Classification Goal of Classification: mapping previously unseen records to a a pre- defined class as accurately as possible 1. Given a collection of historical records (used a training set) 2. Each record contains a set of attributes, one of the attributes is the ‘CLASS’. 3. Build a model for class attribute as a function of the values of other attributes. 4. A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with training set used to build the model and test set used to validate it. 5. New customers can then be classified into two classes. Example: those likely to default payment and those unlikely to default payment of installments.
  • 5.
  • 6. Example of use of classification in Income tax department • Identify people who would have underpaid tax • classify people (historical data) into 2 groups – those who pay required tax (not cheat) and those who underpay tax (cheat) • Variables used to build a model – claim for tax refund, marital status, taxable income • Historical data divided into training set and test set. Model built with training set and accuracy of the prediction checked with test set
  • 7. Example of use of Classification (Income tax) Tid Refund Marital Status Taxable Income Cheat 1 Yes Single 125K No 2 No Married 100K No 3 No Single 70K No 4 Yes Married 120K No 5 No Divorced 95K Yes 6 No Married 60K No 7 Yes Divorced 220K No 8 No Single 85K Yes 9 No Married 75K No 10 No Single 90K Yes 10 Refund Marital Status Taxable Income Cheat No Single 75K ? Yes Married 50K ? No Married 150K ? Yes Divorced 90K ? No Single 40K ? No Married 80K ? 10 Test Set Training Set Model Learn Classifier
  • 8. Classification: Applications Direct Marketing Goal: Reduce cost of mailing by targeting a set of consumers most likely to buy a new product. Approach: 1. Use the data for a similar product introduced before. 2. We know which customers decided to buy and who decided otherwise. This {buy, don’t buy} decision forms the class attribute. 3. Use various demographic, lifestyle, and company-interaction related information about all such customers. 4. Use this information as input attributes to learn a classifier model. 5. Use model on new data and send mails only to those who are predicted to buy Credit Rating Goal: Predict who will default payments for a bank loan/credit scheme of a retail store Approach: 1. Use demographics and behavioral data as attributes -Age, salary, marital status, tax paid etc 2. Label past customers as low and high risk (class attribute) 3. Learn a model for the class of the transactions. 4. Use this model to detect possible defaulters 5. Extend loan only to the class who are of ‘low risk class’
  • 10. Regression Basics • Managerial decisions are often based on relationship between 2 or more variables • Regression is a data mining task which helps us to find the relation between a dependent variable (Y) and one (X) or more independent variables (X1, X2, X3 …..) • We can quantify the strength of the relationship between the two variables. • This can then be used to predict the value of Y, given the value of X. • It is a predictive technique. • It is a directed data mining techniques • Both dependent and independent variables are numeric in linear regression.
  • 11. WHERE DO WE USE REGRESSION? • Companies would like to know about factors significantly associated with their Key Performance Indicators in order to be in a position to predict them. • Marketing – Sales, Customer Satisfaction, Churn, New product acceptance • Finance – Credit-risk • HRM – Attrition, Job satisfaction • Operations – efficiency
  • 12. Regression uses in CRM • The prediction of sales using data of ad spending, price (FMCG) • Predicting demand with price (airline, FMCG etc) • To predict customer churn based on customer bill details. (Mobile) • Model sales with variables like hours when shop is open, discounts given etc (retail).
  • 13. Regression basics • Dependent (response/outcome - Y) and Independent (explanatory/input - X) variable • Scatter plot • Line of best fit- The relationship between the two variables is approximated by a straight line • b0 - This is the intercept of the regression line with the y-axis. It is the value of Y if the value of X = 0 • b1 - This is the SLOPE of the regression line. Thus this is the amount that the Y variable (dependent) will change for each 1 unit change in the X variable y = b0 + b1x +e Linear regression calculates an equation that minimizes the distance between the fitted line and all of the data points.
  • 14. Find the answer • If the slope of the regression line is calculated to be 2.5 and the intercept 16 then what is the value of Y (dependent variable) when X (independent) is 4?
  • 15. Simple Linear Regression Equation Positive Linear Relationship E(y) x Slope b1 is positive Regression line Intercept b0 Corporate Example: Increase in end retailer knowledge of offers positively correlated to pre-paid recharge sales in mobile
  • 16. Simple Linear Regression Equation Negative Linear Relationship E(y) x Slope b1 is negative Regression line Intercept b0 Price and demand of products negatively correlated
  • 17.
  • 18. 3. LINK ANALYSIS It is a data mining technique that addresses relationships and connections
  • 19. Link Analysis • Based on a branch of mathematics called Graph Theory, which represents relationships between different objects as edges in a graph. • Nodes: (Vertices) Things in the graphs that have relationships Eg. People , Organization, Objects, places, transactions etc. • Edges: The relationship connection between nodes
  • 20. Examples of Graphs with Nodes and Edges
  • 21. Link Analysis • Link analysis seeks to establish relation between items or variables in a database to expose patterns and trends. • It can also trace connections between items in records over a period of time. It is mostly undirected. • Helps in visualizing relationship. It is descriptive techniques • Has limited application - Cannot solve all types of problems • Yields good results in following CRM related applications: • Analyzing telephone call patterns • Analyzing link between cities / places travelled by customers (airline, transportation companies) • Understanding referral patterns • Analyzing patterns of purchase in shopping cart (characterizing product affinities in retail purchases) – most common use
  • 22. Understanding referral patterns Link analysis helps us to analyze referral patterns This help us to locate most influential customers and target them with offers
  • 24. Market Basket analysis Used in understanding the link between items in grocery shopping
  • 25. Other Applications • Analyzing link between web pages • Social Network Analysis • Spread of diseases (like flu) / Spread of ideas / diffusion of innovation • Crime & terror links
  • 26. Google – Directed Graph Example • Web pages = nodes • Hyperlinks = edges • Spiders & Web crawlers updating
  • 27. Google – example continued • Authority versus mere popularity • Rank by number of unrelated sites linking to a site yields popularity • Rank by number of subject- related hubs that point to them yields authority • Helps to overcome the situation that often arises in popularity where the real authority is ranked lower because of lack of popularity of links to it
  • 28. Page Rank • Each link acts as a vote which helps in ranking • Analyse the incoming links to your page • Evaluate the quality of link • Analyse the link building strategy • Improve your page rankings • Used in SEO
  • 29. LINK ANALYSIS Criminal Investigation Helps in finding terrorist groups
  • 31. Anomalies or Deviations • Outliers are data points that are considerably different from the remainder of the data • Anomalous events occur infrequently but the consequences can be dramatic and very often negative • 1985 British Antarctic survey showed depletion of ozone layer by 10%. Why did NASA’s nimbus 7 satellite not capture this fact?
  • 32. Deviation Detection • Focuses on discovering the most significant changes in the data from previously measured, expected or normative values. • Most CRM solutions have a DD task running in parallel on a regular basis. • Deviation might be a result of random occurrence or a result of true change in fundamentals. • It is predictive. • It is exploratory in nature and does not ascertain the reason for the occurrence of the detected pattern.
  • 33. Application of Deviation detection Mostly used for security purposes • Fraud detection in insurance claims (fictitious, inflated claims) • Deviation in Credit card/ATM/Debit card usage (purchases not fitting the profile of customer, large amount of e-commerce purchase, flagging off sudden international cases, small purchases followed by big purchases) • Telecommunications fraud detection (subscription fraud, superimposed fraud) • Store sales deviation (in chain stores – revenues, expenses) • Deviation in medical parameters (genes linked diseases) • Network intrusion detection
  • 35. Segmentation Segmentation is a key data mining technique. Segmentation refers to the business task of identifying groups of records that are similar to each other (A segment is a group of consumers that react in a similar way to a particular marketing approach)
  • 36. Discussion • What is the difference between Segmentation and Clustering? Clustering ( a tool for segmentation) is the process of finding similarities in customers so that they can be grouped, and therefore segmented. • What is the difference between Segmentation and Classification? Classification, in the most traditional sense has training labels, and is in charge of classifying something as a class or not. (It predictive technique)
  • 37. Segmentation can be directed or undirected • Directed – one simple way of segmentation may be segmenting customers based on the type of credit cards that they use – silver, gold, platinum etc. • Undirected – sometimes target variables may not be known in advance; the goal then is to search for patterns that suggest targets. The clustering tool is used for this purpose. Example – grouping people together based on their size fit for stitching military uniforms
  • 38. Need for segmentation • Today, businesses that cast a wide net in their marketing activities cannot maximize their profitability and return on investment. • The ability to identify segments by latent need, purchase behaviour patterns, attitudes, etc. makes it easier to design actionable marketing programs. Segmentation helps to 1. Identify and serve your most profitable customers 2. Focus retention efforts 3. Customize messages and offers 4. Identify less-served groups
  • 39. Some factors that can be used to segment customers • Geography / Demographics • Psychographics – Attitudes, Interest, opinions etc • Buying behaviour – usage, purchase occasion • Length of relationship / Tenure in current subscription • Revenue / profitability / CLV • Loyalty / propensity to churn • Channel used
  • 41. Data Mining – tools /techniques
  • 42. Data mining tools 1. Decision Tree: a tool that classifies cases into a finite number of classes by considering one variable at a time and dividing the entire set based on it. (Helps in the classification task) 2. Rule Induction (Association rules): a tool that extracts formal rules from a set of observations. (Helps in finding links between products) 3. Nearest Neighbor Technique: a tool that uses past data instances with known output values to predict an unknown output value of a new data instance. (Helps in deviation detection and classification) 4. Clustering: the process of grouping observations of similar kinds into smaller groups within the larger population. (Helps in the segmentation task) 5. Visualization: graphical representation of data
  • 43. 1. Decision tree • A decision tree is a tool that classifies cases into a finite number of classes by considering one variable at a time and dividing the entire set based on it • Classification is possible because at each node, the decision tree splits the database into segments based on the value of the variable considered • A decision tree is a hierarchical collection of rules that describe how to divide a large collection of records into successively smaller groups of records. With each successive division, the members of the resulting segments become more and more similar to one another with respect to the target
  • 44. How it works? • Decision trees recursively split data into smaller and smaller cells which are increasingly "pure" in the sense of having similar values of the target • A decision tree uses the target variable to determine how each input should be partitioned • Taken together, the rules for all the segments form the decision tree model
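A minimal, hedged sketch of how such a tree can be built in practice, using scikit-learn's DecisionTreeClassifier on a made-up CRM-style question (will a customer respond to an offer?); the features, values and labels below are illustrative assumptions, not data taken from the slides.

```python
# Tiny hypothetical decision-tree example for a CRM-style classification task.
# Data, feature names and labels are made up purely for illustration.
from sklearn.tree import DecisionTreeClassifier, export_text

# columns: [age, annual_income_k, years_as_customer]
X = [[25, 30, 1], [45, 80, 10], [35, 60, 4], [52, 90, 12],
     [23, 25, 1], [40, 70, 8], [30, 45, 2], [48, 85, 9]]
y = ["no", "yes", "no", "yes", "no", "yes", "no", "yes"]   # responded to offer?

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
print(export_text(tree, feature_names=["age", "income_k", "tenure_yrs"]))
print(tree.predict([[38, 75, 6]]))   # classify a previously unseen customer
```

The printed rules show exactly the hierarchical splits described above: each node tests one variable and divides the records into purer and purer segments with respect to the target.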
  • 47. 2. Rule Induction Rule induction is an area of machine learning in which formal rules are extracted from a set of observations.
  • 48. Rule Induction - Association Rules • In rule induction, we focus on association rules in our syllabus. • Association rules were originally derived from point-of-sale data that describes transactions consisting of several products • These are rules of the form 'if A then B' (A → B). • These rules do not give the exact nature of the relationship but only point towards a general interaction between the two items. • Association analysis is a type of undirected data mining that finds patterns in the data where the target is not specified beforehand • Useful rules contain high-quality, actionable information. After the pattern is found, justifying it by telling a story can lead to insights and action.
  • 49. Association Rule – application in retail sector • Association rules help in market basket analysis in the retail sector – what are the product affinities in a shopping cart? • IF <LHS> THEN <RHS> • IF { X } THEN { Y } • They imply a relationship between X and Y, where X and Y can be single items or sets of items. • X is sometimes referred to as the antecedent and Y as the consequent Rules Discovered: {Milk} --> {Coke} {Diaper, Milk} --> {Beer}
  • 50. Rules are not always useful. They might be actionable at times, and trivial or inexplicable at other times • Wal-Mart customers who purchase Barbie dolls have a 60% likelihood of also purchasing one of three types of a brand of candy bars (actionable) • Customers who purchase maintenance agreements are very likely to purchase large appliances (trivial finding) • When a new hardware store opens, one of the most commonly sold items is toilet cleaners (inexplicable)
  • 51. Developing Association Rules • The table on the right side indicates transactions (of 5 different customers) from a grocery store. • Each transaction gives information about which products are purchased with which other products • The information is then copied on to a co-occurrence matrix that tells the number of times any pair of products was purchased together. • The goal is to turn item sets into association rules.
  • 52. Co-Occurrence of Products The numbers in the co-occurrence matrix represent the number of transactions. For example, window cleaners and orange juice occur together only once, while soda and orange juice occur together in 2 transactions out of 5
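A minimal sketch of how such a co-occurrence matrix can be counted in code. Because the slide's transaction table is an image, the five baskets below are assumed contents, chosen only to be consistent with the counts quoted above (window cleaner + orange juice once, soda + orange juice twice).

```python
# Building a product co-occurrence count from five hypothetical baskets.
# The basket contents are assumptions; only the counting logic is the point.
from itertools import combinations
from collections import Counter

baskets = [
    {"orange juice", "soda"},
    {"milk", "orange juice", "window cleaner"},
    {"orange juice", "detergent"},
    {"orange juice", "detergent", "soda"},
    {"window cleaner", "soda"},
]

co_occurrence = Counter()
for basket in baskets:
    for pair in combinations(sorted(basket), 2):   # every pair bought together
        co_occurrence[pair] += 1

for (a, b), count in co_occurrence.most_common():
    print(f"{a:15s} + {b:15s} : {count} of {len(baskets)} transactions")
```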
  • 53. How good is an association rule? Three traditional measures of the goodness of an association rule are • Support • Confidence • Lift
  • 54. Support, Confidence and Lift • Support shows the prevalence of the rule in the given set of transactions. It measures the ratio of transactions that contain all the items in the rule to the total number of transactions. A high level of support indicates that the rule is frequent enough for the business to be interested in it. • Confidence measures how good a rule is at predicting the right-hand side, by comparing how often the right-hand side appears when the condition on the left-hand side is true. It is the probability of the occurrence of the consequent once the antecedent has taken place (conditional probability). It is calculated as the ratio of the number of transactions supporting the entire rule to the number of transactions supporting the left-hand side of the rule. • Lift measures the power of association of the rule. Lift indicates how much more likely the consequent is to be purchased if the customer has bought the antecedent, compared with the likelihood of the consequent being purchased without the antecedent. Lift must be above 1 for the association to be of interest. Lift is calculated as the ratio of the confidence of the rule to the prevalence (support) of the right-hand side of the rule.
  • 55. Calculating Support Let A = soda and B = orange juice. Support(A → B) = n(A ∩ B) / total number of transactions • Number of transactions where soda and orange juice appear together = 2 • Total number of transactions (total customer billings) = 5 • Support for IF <Soda> THEN <OJ> = 2/5 = 40%
  • 56. Calculating Confidence Confidence(A → B) = Support(A ∩ B) / Support(A) Support(A ∩ B) = 2/5, Support(A) = 3/5 Confidence of IF <Soda> THEN <OJ> = (2/5) / (3/5) = 67%
  • 57. Calculating Lift • Lift (also called improvement) measures the power of association of the full rule. • The reason for doing this is that the items on the RHS might be very common, so confidence alone may not tell us anything about the actual predictability of the rule Lift(A → B) = Support(A ∩ B) / (Support(A) × Support(B)) Or, equivalently, Confidence(A → B) / Support(B)
  • 58. Calculating Lift Confidence of the rule A → B = 67% Support(B) = 4/5 = 80% Lift(A → B) = 0.67 / 0.80 ≈ 0.84 Here Lift is < 1, so the rule is not useful. If Lift is > 1, the association rule is well established: the RHS appears in more cases when the LHS is present than when it is not. Hypothetically, if Lift in this case were 1.25, we could say that OJ is 25% more likely to be in the transaction when soda is purchased than when soda is not purchased
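The same arithmetic, scripted end to end for the {soda} → {orange juice} rule over the five assumed baskets used earlier; the values it prints (support 40%, confidence 67%, lift ≈ 0.83) line up with the worked figures on the slides.

```python
# Support, confidence and lift for the rule {soda} -> {orange juice}
# over the same five hypothetical baskets assumed earlier.
baskets = [
    {"orange juice", "soda"},
    {"milk", "orange juice", "window cleaner"},
    {"orange juice", "detergent"},
    {"orange juice", "detergent", "soda"},
    {"window cleaner", "soda"},
]
n = len(baskets)
lhs, rhs = {"soda"}, {"orange juice"}

support_lhs = sum(lhs <= b for b in baskets) / n            # P(soda)
support_rhs = sum(rhs <= b for b in baskets) / n            # P(orange juice)
support_rule = sum((lhs | rhs) <= b for b in baskets) / n   # P(soda and OJ)

confidence = support_rule / support_lhs                     # P(OJ | soda)
lift = confidence / support_rhs                             # vs. baseline P(OJ)

print(f"support={support_rule:.0%}  confidence={confidence:.0%}  lift={lift:.2f}")
```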
  • 59. Association Rule : Application Retail 1. Supermarket shelf management 2. Bundling products with offers 3. Inventory Management 4. Pricing associated products 5. Catalog designing 6. Recommender systems Other sectors 1. Items purchased on a credit card such as rental cars and hotel rooms. 2. Optional services purchased by telecommunication customer (call forwarding, caller tune etc.) 3. Banking products used by retail customers (money market accounts, investment services, car loans etc.) 4. Medical patient histories can give indications of likely complications on certain combinations of treatments.
  • 60. Support, confidence and Lift
Basket set 1:
1. Milk, Bread, Orange juice
2. Milk, Bread, Floor cleaner
3. Milk, Bread, Chocolate
4. Chocolate, Floor cleaner
Support of IF milk THEN bread = 75%; Confidence = 100%; Lift = 0.75 / (0.75 × 0.75) = 1.33
Inference: there is a strong positive association
Basket set 2:
1. Milk, Cornflakes, Noodles
2. Milk, Shampoo, Bread
3. Shampoo, Soap, Noodles
4. Milk, Cornflakes
Support of IF milk THEN shampoo = 25%; Confidence = 33%; Lift = 0.25 / (0.75 × 0.5) = 0.66
Inference: a weak, negative association
  • 61. 3. Nearest Neighbor Technique NNT is a Classification and Predictive data mining technique
  • 62. NNT - Introduction • Distance and similarity are important to the nearest neighbour technique • Uses past data instances with known output values to predict an unknown output value of a new data instance. • It is based on: 1. Memory-Based Reasoning (MBR): identifying similar cases from past experience and applying the information from these cases to the problem at hand. 2. Collaborative filtering or social filtering: finds neighbours who are similar (with similar preferences) to a new record and uses them for classification and prediction.
  • 63. Height and weight of known cases • X axis – height • Y axis – weight • Suppose the dark circles represent men and the white circles represent women. Given the height and weight of a new data point, we can predict whether it is a man or a woman. • NNT relies on two operations: – a distance function capable of calculating a distance between any two records – a combination function capable of combining results from several neighbours to arrive at an answer • The K value (how many neighbours are considered while combining) has to be specified
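A minimal sketch of those two operations in code: Euclidean distance finds the k nearest known records, and a majority vote combines their labels; the height/weight points and k = 3 are made-up illustrative values, not data from the slide's scatter plot.

```python
# Bare-bones k-nearest-neighbour sketch: a distance function plus a
# combination (voting) function. Training points and k=3 are made up.
from math import dist
from collections import Counter

known = [  # (height_cm, weight_kg) -> label
    ((180, 82), "man"), ((175, 78), "man"), ((183, 90), "man"),
    ((160, 55), "woman"), ((165, 60), "woman"), ((158, 52), "woman"),
]

def classify(point, k=3):
    # distance function: Euclidean distance from the new point to every record
    neighbours = sorted(known, key=lambda rec: dist(point, rec[0]))[:k]
    # combination function: majority vote among the k nearest labels
    return Counter(label for _, label in neighbours).most_common(1)[0][0]

print(classify((170, 68)))   # predict the class of a new data point
```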
  • 64. Attributes to consider • People living in the same neighborhood generally have similar household incomes, so if a new person moves into the neighborhood, there is a high chance that the new person's income is similar to that of the other people living there • However, other factors, such as the degree attained or the college attended, can affect a person's income and may be better predictors of it • So 'near' can be defined by the degree attained or the college attended rather than by the neighborhood
  • 65. Applications of NNT • Deviation detection – flag cases for further investigation based on nearest neighbours • Customer response prediction – next likely customers for campaigns • Loan default prediction – predicting default based on income and debt or any other available variables • Medical treatment – diagnosis and treatment of cases (classifying anomalies in mammograms) • Filling missing values in data – classification • Plagiarism testing • Making recommendations – Netflix
  • 66. NNT characteristics • NNT uses data "as is" and is not affected by the format of records, as the algorithms are robust to missing and erroneous data • Many advantages, but NNT is a resource hog!! – it requires a large amount of historical data and a large amount of processing time
  • 67. Finding nearest neighbor for audio files • Shazam identifies the song in half a minute and provides the name of the song, artist etc • Over 100 million downloads • Many challenges for this app –direct comparison of snippets of music will not work • Uses nearest neighbor technique • To classify songs based on their audio signature
  • 68. Spectrogram: Picture of a song in the frequency domain. (difficult to identify)
  • 69. Some apps are listening to you through the smartphone’s mic to track your TV viewing, says report • The apps use software from Alphonso, a start-up that collects TV- viewing data for advertisers. Using a smartphone’s microphone, Alphonso’s software can detail what people watch by identifying audio signals in TV ads and shows, sometimes even matching that information with the places people visit and the movies they see. The information can then be used to target ads more precisely and to try to analyze things like which ads prompted a person to go to a car dealership. This feature is based on NNT
  • 70. Measuring Distance and similarity – NNT works with all kinds of data
  • 71. Marketing data for five customers – NNT works with categorical data also
  • 72. Distance between the data points
  • 73. 4. Clustering Clustering is the process of grouping observations of similar kinds into smaller groups within the larger population. Clusters are homogeneous within and heterogeneous across, based on given characteristics.
  • 74. Example - Sales data for 9 stores
Store  Electric kettles  Refrigerators
S1     35                12
S2     42                15
S3     40                10
S4     10                20
S5     14                18
S6     13                19
S7      4                25
S8      8                30
S9      6                27
  • 75. Clusters of stores (scatter plot of electric kettle sales vs refrigerator sales, showing the stores falling into distinct clusters)
  • 76. Characteristics of Clustering • No dependent variable; no causal relationships • Popularly called subjective segmentation • Undirected / unsupervised data mining technique
  • 77. Characteristics of Clustering • It relies on geometric interpretation of data as points in space • Based upon distance and similarity concept • Clustering can be a preliminary step for data mining.
  • 78. Data preparation for clustering • Data preparation issues are the same as those for memory-based reasoning, because both rely on the concept of distance. • Scaling and weighting are important • Scaling adjusts the values of variables to take into account the fact that different variables are measured in different units or over different ranges. • Weighting provides a relative adjustment for a variable, because some variables are more important than others
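One hedged way to do this in practice: standardize each column to mean 0 and standard deviation 1 with scikit-learn's StandardScaler, then optionally multiply a column by a weight. The customer values and the weight of 2 below are made-up assumptions, shown only to illustrate the idea.

```python
# Scaling before clustering: without it, income (tens of thousands) would
# swamp age whenever distances are computed. Values and the weight are made up.
from sklearn.preprocessing import StandardScaler

customers = [[25, 32_000], [40, 95_000], [33, 61_000], [52, 120_000]]  # [age, income]
scaled = StandardScaler().fit_transform(customers)   # each column: mean 0, std 1
scaled[:, 1] *= 2.0   # crude weighting: treat income as twice as important
print(scaled.round(2))
```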
  • 79. The K-Means clustering algorithm • One of the most commonly used clustering algorithms. • “K” refers to the fact that the algorithm looks for a fixed number of clusters • K is specified by the user • Each record is considered a point in a scatter plot, which implies that all the input variables are numeric • The data can be pictured as clouds on the scatter plot
  • 80. Steps in K-means clustering Algorithm • The K-means algorithm starts with an initial guess and uses a series of steps to improve upon it. • Initial cluster seeds are chosen randomly • Each record is ASSIGNED to the cluster defined by its nearest cluster seed • Next, the algorithm identifies the centroid of each cluster. A centroid is the average position of the cluster members in each dimension. • The data points are then REASSIGNED to clusters based on their distance from the new centroids • The algorithm alternates between two steps • Assignment step • Update step
  • 81. Steps in K-means clustering Algorithm • When the assignment step changes the cluster membership, it causes the update step to be executed again. Successive iterations are performed until no new clusters emerge • The algorithm continues alternating between the update and assignment steps until no new assignment is made • A diagrammatic representation of K-means clustering is shown in the following slides
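A bare-bones sketch of the assignment / update loop just described, run on the nine-store sales data from the earlier slide; k = 3, the random seed choice and the iteration cap are illustrative assumptions rather than part of the slides.

```python
# Minimal K-means sketch mirroring the ASSIGNMENT and UPDATE steps,
# applied to the nine-store (kettles, refrigerators) sales data.
from math import dist
from statistics import mean
import random

points = [(35, 12), (42, 15), (40, 10), (10, 20), (14, 18),
          (13, 19), (4, 25), (8, 30), (6, 27)]
k = 3
random.seed(0)
centroids = random.sample(points, k)          # initial cluster seeds

for _ in range(20):
    # ASSIGNMENT step: each record joins the cluster of its nearest centroid
    clusters = [[] for _ in range(k)]
    for p in points:
        nearest = min(range(k), key=lambda i: dist(p, centroids[i]))
        clusters[nearest].append(p)
    # UPDATE step: recompute each centroid as the mean position of its members
    new_centroids = [
        (mean(x for x, _ in c), mean(y for _, y in c)) if c else centroids[i]
        for i, c in enumerate(clusters)
    ]
    if new_centroids == centroids:            # no change -> converged
        break
    centroids = new_centroids

for i, c in enumerate(clusters):
    print(f"cluster {i}: centroid={centroids[i]}, members={c}")
```

On this data the loop settles quickly, separating the high-kettle stores from the high-refrigerator stores, which matches the groups visible in the scatter plot slide.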
  • 82. K – Means Clustering Algorithm – First step – random cluster seeds are chosen
  • 83. K – Means Clustering Algorithm – Assignment step
  • 84. K – Means Clustering Algorithm – Calculating Centroid
  • 85. K – Means Clustering Algorithm – Update step
  • 86. Hard and soft clusters • A cluster model can make “hard” or “soft” cluster assignments • Hard clustering assigns each record to a single cluster • Soft clustering associates each record with several clusters with varying degree of strength.
  • 87. Evaluating clusters • A good cluster centre should lie in the densest part of the data cloud. • The best assignment of cluster centres could be defined as the one that minimizes the sum of distances from every data point to its nearest cluster centre. • Clusters should have members with a high degree of similarity (close to each other) • The clusters themselves should be widely spaced • Clusters should have roughly the same number of members (except when detecting fraud and anomalies)
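A hedged sketch of putting that evaluation into numbers for the nine-store data: scikit-learn's inertia_ is the within-cluster sum of squared distances (a close cousin of the sum-of-distances criterion above), and the silhouette score, an additional measure not named on the slide, rewards clusters that are tight and widely spaced; the choice of k values to try is an assumption.

```python
# Comparing cluster quality for a few values of k on the nine-store data.
# inertia_ = within-cluster sum of squared distances (lower = tighter);
# silhouette ranges from -1 to 1 (higher = tight, well-separated clusters).
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

sales = [[35, 12], [42, 15], [40, 10], [10, 20], [14, 18],
         [13, 19], [4, 25], [8, 30], [6, 27]]

for k in (2, 3, 4):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(sales)
    print(f"k={k}  inertia={km.inertia_:6.1f}  "
          f"silhouette={silhouette_score(sales, km.labels_):.2f}")
```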
  • 88. Some basic features of clustering • There may not be any prior knowledge concerning the clusters • Cluster results are dynamic, as cluster membership may change over time • There is no single correct answer to a clustering problem • Outliers are difficult to handle in clustering – they either end up in solitary clusters or are forced into the nearest cluster
  • 89. Clustering: USES • Grouping customers for targeting offers • Grouping customers for cross selling and upselling • Grouping customers for customizing • Categorizing products for mark down (In the first scatter plot electric kettles can be discounted in certain stores and refrigerators can be marked down in certain others) • Calculating footfall to branches of a store based on their location
  • 90. Voronoi Diagram • Based on the concept of clustering • Each region of a Voronoi diagram consists of all the points closest to one of the cluster centres. • It is a diagram whose lines mark the points that are equidistant from the two nearest cluster seeds. • Voronoi diagrams facilitate nearest-location queries, such as finding the closest metro station or cell phone tower to a given address. • A Voronoi representation can be used to find the catchment zone for the outlets of a chain store. It can be used to estimate the number of students who will enrol in the nearest primary school, the population that will visit the closest healthcare centre, etc.
  • 91. Example There are 19 pizza outlets in Mumbai belonging to the same company, selling the same products, with the same pricing and service quality. We want to estimate the customer footfall for a particular store in one region of Mumbai. We assume that, in this case, customers visit whichever store is closest to them. A Voronoi diagram will define the catchment area for each of the stores, which gives us a rough estimate of the customer footfall for each store
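A rough, hedged sketch of the same idea in code: instead of drawing the diagram (scipy.spatial.Voronoi could do that), it assigns every cell of a coarse grid over the city to its nearest outlet and uses each outlet's share of cells as a stand-in for relative footfall. The four outlet names, their coordinates and the grid size are all made-up illustrative values, not the 19 real outlets.

```python
# Catchment-share sketch in the spirit of a Voronoi partition: every grid
# point is assigned to its nearest outlet. Outlets and the grid are made up.
from math import dist
from collections import Counter

outlets = {"Andheri": (3, 8), "Bandra": (4, 5), "Colaba": (6, 1), "Thane": (9, 9)}

catchment = Counter()
for x in range(11):                 # coarse 11 x 11 grid over the city
    for y in range(11):
        nearest = min(outlets, key=lambda name: dist((x, y), outlets[name]))
        catchment[nearest] += 1

total = sum(catchment.values())
for name, cells in catchment.most_common():
    print(f"{name:8s} ~{cells / total:.0%} of the mapped area")
```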