Customer
Segmentation
Presented By:
Tanu Rupa
Diksha Milind
Spoorthi Supriya
Mentor Name:
Bapuram Pallavi
Business Problem:
Customer Misclassification
Customer misclassification in business refers to incorrectly categorizing customers, leading to
ineffective marketing, sales strategies, and customer service. This can result in wasted resources,
reduced customer satisfaction, and lost revenue.
Objective:
1. To precisely classify and segment customers based on shared
attributes and behavioral patterns, thereby optimizing targeted
marketing strategies, enhancing customer journey personalization,
and maximizing overall business profitability.
2. The aim is to classify clients into specific groups to customize
strategies, enhance service delivery, and boost revenue through
targeted approaches.
Project Flow:
• Data Preprocessing: Clean and prepare the gathered data for analysis, ensuring it is
uniform, accurate, and suitable for segmentation analysis.
• Feature Engineering: Feature engineering is the process of transforming raw data into
informative features that improve the performance of machine learning algorithms by
capturing relevant patterns and relationships within the data.
• Feature Selection: Feature selection is the process of identifying and choosing a subset of
relevant features from a larger set of features in a dataset, aiming to improve model
performance, reduce computational complexity, and mitigate overfitting by selecting the
most informative and discriminative features.
• Model Development: Model development is the process of constructing and refining
mathematical or computational representations that predict outcomes or patterns based on
input data, often involving iterative optimization and validation procedures.
Data set details:
Exploratory Data Analysis
(EDA):
Checking Missing
Values:
To assess null values within a
dataset, employ Pandas functions
such as "isnull()" or "isna()".
The bar plot analysis
indicates the presence of null
values within the 'Income'
feature.
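A minimal sketch of this check, assuming the data has been loaded into a pandas DataFrame named df (the file name below is illustrative):

import pandas as pd

# Load the data (file name is illustrative).
df = pd.read_csv("marketing_campaign.csv")

# Count null values per column; 'Income' is expected to show missing entries.
print(df.isnull().sum())
print(df.isna().sum().sum())  # total number of missing cells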
Imputation:
• To address the null values in
the dataset, we will remove the
records or observations that
contain null values.
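A short sketch of this step, continuing with the same DataFrame df:

# Remove records (rows) that contain null values, e.g. rows with a missing 'Income'.
df = df.dropna()
print(df.isnull().sum().sum())  # expected to be 0 after dropping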
Checking Duplicates:
• To check for duplicates in a dataset using
Pandas, we can use the duplicated() function,
which returns a boolean Series indicating
whether each row is a duplicate or not.
• After conducting this operation on our
dataset, it has been determined that there are
no duplicate values present, as each
observation appears only once throughout the
entire dataset.
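A short sketch of the duplicate check with pandas, continuing with the same df:

# duplicated() returns a boolean Series; True marks rows that repeat an earlier row.
print(df.duplicated().sum())  # 0 for our dataset

# If duplicates were present, they could be removed like this:
df = df.drop_duplicates()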
Checking Outliers:
• In our dataset, the attribute
'income' exhibits outlier data
points, which are values that
deviate markedly from the
other observations,
potentially impacting the
statistical analysis and the
performance of any
predictive models by
introducing skewness and
variability.
Imputation:
• To address the outlier data
points in the dataset, we will
employ quantile-based
methods to identify and
remove outliers from the
'income' attribute. This
involves calculating specific
quantiles (25%, 50%, 75%)
and excluding data points
that fall below or above
these thresholds, thereby
mitigating the impact of
extreme values on
subsequent analyses.
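A sketch of the quantile-based removal on 'Income'; the slide only states that the 25%, 50%, and 75% quantiles are computed, so the exact cut-offs below (1.5 × IQR beyond the 25th and 75th percentiles) are an assumption:

# Quantile-based outlier removal on 'Income' (1.5*IQR fences are an assumed choice).
q1 = df["Income"].quantile(0.25)
q3 = df["Income"].quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
df = df[(df["Income"] >= lower) & (df["Income"] <= upper)]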
Visualisation:
• This scatter plot represents
the relationship between
the 'ID' and 'MntWines'
features.
• Each point corresponds to
an individual observation
from the dataset, plotted
according to its 'ID' and
'MntWines' values,
facilitating the analysis of
patterns and potential
correlations between these
two variables.
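A minimal sketch of how such a scatter plot can be produced with matplotlib:

import matplotlib.pyplot as plt

# Scatter plot of 'ID' against 'MntWines'.
plt.scatter(df["ID"], df["MntWines"], s=10)
plt.xlabel("ID")
plt.ylabel("MntWines")
plt.title("ID vs MntWines")
plt.show()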
Histogram:
• This histogram is plotted for 'Recency' to identify patterns
such as skewness or the presence of outliers in the data distribution.
• Through this histogram we analyse the distribution of 'Recency'
and understand the spread of the data in the dataset.
Bar Plot:
• This bar plot is plotted for 'AcceptedCmp3' to compare
offer acceptance by customers in Campaign 3.
• Here we observed that Campaign 2 has the lowest acceptance
compared to the other campaigns.
Feature Engineering
(Labeling And Scaling):
Methodologies:
Labeling Methods:
• Label Encoder
• One-Hot Encoder
• Dummies
Scaling Methods:
• Standard Scaler
• MinMax Scaler
• Robust Scaler
• Normalizer
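As an illustration of the scaling methods listed above, a sketch using StandardScaler on the numeric columns (which scaler to use and which columns to scale is a modelling choice, not something fixed by the slides):

from sklearn.preprocessing import StandardScaler

# Scale numeric columns to zero mean and unit variance.
scaler = StandardScaler()
numeric_cols = df.select_dtypes(include="number").columns
df[numeric_cols] = scaler.fit_transform(df[numeric_cols])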
Label Encoder:
• LabelEncoder is a utility in the Python scikit-learn library used for converting
categorical data into numerical format.
• It assigns a unique integer to each category, enabling machine learning algorithms
to operate on such data.
• This transformation enables algorithms to effectively interpret and learn from
categorical data during model training.
There are three columns, "Education", "Marital_Status", and "Dt_Customer", in our dataset
that are in object and date-time format; we will convert them into numerical values by
label encoding.
Check Correlation:
• To check correlation between
variables in a dataset, Pandas
provides the corr() function,
yielding a correlation matrix.
• It was observed that the attributes
'Z_CostContact' and 'Z_Revenue'
exhibit a significantly high
correlation.
Imputation:
• Attributes exhibiting high correlation were identified and subsequently
removed from the dataset.
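A sketch of the correlation check and the subsequent removal of the flagged attributes:

# Correlation matrix over the numeric columns.
corr = df.corr(numeric_only=True)
print(corr)

# Drop the attributes identified as highly correlated on the previous slide.
df = df.drop(columns=["Z_CostContact", "Z_Revenue"])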
Feature Selection:
(Variance Threshold)
Variance Threshold:
• Variance Threshold is a feature selection technique used to remove
low-variance features from a dataset.
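A sketch of applying VarianceThreshold to the encoded, numeric DataFrame (the 0.1 threshold is an illustrative value, not one stated on the slides):

from sklearn.feature_selection import VarianceThreshold

# Keep only features whose variance exceeds the threshold.
selector = VarianceThreshold(threshold=0.1)
X_selected = selector.fit_transform(df)
print(selector.get_support())  # boolean mask of retained features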
Building Model
PCA
PCA is a dimensionality reduction technique that transforms the data
into a new coordinate system. The new coordinates, or principal
components, are linear combinations of the original features. These
components are ordered by the amount of variance they capture from
the data. The first principal component captures the most variance.
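A sketch of the PCA step on the selected features; reducing to 2 components is an assumption made here so the clusters can be plotted, as the slides do not state the number:

from sklearn.decomposition import PCA

# Project the selected features onto 2 principal components.
pca = PCA(n_components=2)
pca_trf = pca.fit_transform(X_selected)
print(pca.explained_variance_ratio_)  # variance captured by each component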
K MEANS
K-Means is a popular clustering algorithm used to partition a dataset
into K distinct, non-overlapping subsets (or clusters). It aims to
minimize the within-cluster variance, ensuring that points within each
cluster are as similar as possible.
Elbow method
The elbow method is a commonly
used technique in clustering to
determine the optimal number of
clusters (K). The goal is to find a
balance between having too few
and too many clusters, leading to a
meaningful segmentation of the
data.
2 is the best number of clusters based on the Elbow Method.
If the Elbow Method indicates that 2 is the optimal number of clusters,
then apply K-Means clustering with 𝐾=2.
Steps:
• Fit the K-Means algorithm with 𝐾=2.
• Predict the cluster labels for each data point.
• Visualize the clusters and the cluster centroids.
• Analyze and interpret the resulting clusters.
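A sketch of the elbow method and the final K-Means fit on the PCA-transformed data (the K range 1–10 and the random_state are illustrative choices):

from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

# Elbow method: within-cluster sum of squares (inertia) for K = 1..10.
inertias = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(pca_trf)
    inertias.append(km.inertia_)
plt.plot(range(1, 11), inertias, marker="o")
plt.xlabel("K")
plt.ylabel("Inertia")
plt.show()

# Fit the final model with K = 2 and predict cluster labels.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
labels = kmeans.fit_predict(pca_trf)

# Visualize the clusters and the cluster centroids.
plt.scatter(pca_trf[:, 0], pca_trf[:, 1], c=labels, s=10)
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
            c="red", marker="x", s=100)
plt.show()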
Clustering Result
After applying K-Means clustering
with 𝐾=2, we can delve into the
characteristics of each cluster to
better understand the customer
segments.
• Cluster 1: Represents older,
lower-income customers with
lower spending and shorter
engagement.
• Cluster 2: Represents younger,
higher-income customers with
higher spending and longer
engagement.
Cluster Profiling
Cluster profiling involves
analyzing and summarizing the
characteristics of clusters that have
been identified through clustering
algorithms.
Customer 1: High-Earning Singles
• This cluster predominantly comprises
individuals born between the mid-1980s
and early 1990s, typically possessing
advanced educational qualifications
such as master's degrees or doctorates.
• These customers are primarily single or
cohabiting without dependent children.
• They exhibit high annual household
incomes, significantly exceeding the
median, which enables financially
secure lifestyles characterized by
frequent, discretionary spending on
premium and high-quality goods and
services.
• Their residence is primarily urban,
facilitating an affluent lifestyle that
encompasses fine dining, extensive
travel, and investment in personal
development and enrichment activities.
Customer 2: Middle-Income Families
• This segment comprises individuals
born between the early 1970s and
mid-1980s.
• Predominantly married, they
manage households with one or
more children, including teenagers.
• Their educational attainment and
their household incomes range from
moderate to upper-middle class.
• Their expenditure patterns are
driven by family-oriented needs, with
a significant allocation towards
education, groceries, and household
necessities.
• Residing primarily in suburban
areas, they place a high value on
community and safety.
• While their shopping frequency is
lower relative to other segments,
their purchasing decisions are
deliberate and well-considered.
DBSCAN Clustering
• DBSCAN Initialization: The DBSCAN algorithm is initialized with
eps=0.5 and min_samples=2.
• eps: The maximum distance between two samples for one to be
considered as in the neighborhood of the other.
• min_samples: The number of samples (or total weight) in a neighborhood
for a point to be considered as a core point.
• Fit and Predict: The fit_predict method applies DBSCAN to the PCA-
transformed data (pca_trf), resulting in cluster labels (label_db).
• Silhouette Score: The silhouette score, which measures how similar an
object is to its own cluster compared to other clusters, is calculated. A
higher silhouette score indicates better-defined clusters.
• Output: The silhouette score for DBSCAN is printed:
0.23263068785884128.
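A sketch of the DBSCAN step as described above, using the stated parameters on the PCA-transformed data:

from sklearn.cluster import DBSCAN
from sklearn.metrics import silhouette_score

# DBSCAN with the parameters from the slide.
db = DBSCAN(eps=0.5, min_samples=2)
label_db = db.fit_predict(pca_trf)

# Silhouette score over the resulting labels (the slide reports ~0.23).
print(silhouette_score(pca_trf, label_db))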
Visual Outputs
Silhouette Score for DBSCAN:
• The silhouette score for DBSCAN clustering is 0.23, indicating that the clusters are not well-
defined compared to other clustering methods used (like KMeans).
Scatter Plot:
• The scatter plot visualizes the PCA-transformed data with the clusters identified by DBSCAN.
• Points are colored based on their cluster assignments, showing how DBSCAN grouped the data.
• The plot highlights the density-based clusters and shows potential noise points (usually assigned a
different color or labeled as -1 in DBSCAN).
Summary:
➔ Checked null values.
➔ Checked duplicates.
➔ Checked outliers.
➔ Feature Engineering.
➔ Feature Selection.
➔ Model Building.
➔ Clustering Result.
➔ Clustering Profile.
➔ DBSCAN Clustering.
Thank You
