Customer
Segmentation
Presented By:
Tanu Rupa
Diksha Milind
Spoorthi Supriya
Mentor Name:
Bapuram Pallavi
Business Problem:
Customer Misclassification
Customer misclassification in business refers to incorrectly categorizing customers, leading to
ineffective marketing, sales strategies, and customer service. This can result in wasted resources,
reduced customer satisfaction, and lost revenue.
Objective:
1. To precisely classify and segment customers based on shared
attributes and behavioral patterns, thereby optimizing targeted
marketing strategies, enhancing customer journey personalization,
and maximizing overall business profitability.
2. The aim is to classify clients into specific groups to customize
strategies, enhance service delivery, and boost revenue through
targeted approaches.
Project Flow:
• Data Preprocessing: Clean and prepare the gathered data for analysis, ensuring it is
uniform, accurate, and suitable for segmentation analysis.
• Feature Engineering: Feature engineering is the process of transforming raw data into
informative features that improve the performance of machine learning algorithms by
capturing relevant patterns and relationships within the data.
• Feature Selection: Feature selection is the process of identifying and choosing a subset of
relevant features from a larger set of features in a dataset, aiming to improve model
performance, reduce computational complexity, and mitigate overfitting by selecting the
most informative and discriminative features.
• Model Development: Model development is the process of constructing and refining
mathematical or computational representations that predict outcomes or patterns based on
input data, often involving iterative optimization and validation procedures.
Data set details:
Exploratory Data Analysis
(EDA):
Checking Missing
Values:
To assess null values within a
dataset, employ Pandas functions
such as "isnull()" or "isna()".
The bar plot analysis
indicates the presence of null
values within the 'Income'
feature.
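A minimal sketch of this check, assuming the data has been loaded into a pandas DataFrame named df (the file name below is illustrative):

import pandas as pd

# Load the data (file name is illustrative).
df = pd.read_csv("marketing_campaign.csv")

# Count null values per column; 'Income' is expected to show missing entries.
print(df.isnull().sum())
print(df.isna().sum().sum())  # total number of missing cells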
Imputation:
• To address the null values in
the dataset, we will remove the
records or observations that
contain null values.
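A short sketch of this step, continuing with the same DataFrame df:

# Remove records (rows) that contain null values, e.g. rows with a missing 'Income'.
df = df.dropna()
print(df.isnull().sum().sum())  # expected to be 0 after dropping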
Checking Duplicates:
• To check for duplicates in a dataset using
Pandas, we can use the duplicated() function,
which returns a boolean Series indicating
whether each row is a duplicate or not.
• After conducting this operation on our
dataset, it has been determined that there are
no duplicate values present, as each
observation appears only once throughout the
entire dataset.
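A short sketch of the duplicate check with pandas, continuing with the same df:

# duplicated() returns a boolean Series; True marks rows that repeat an earlier row.
print(df.duplicated().sum())  # 0 for our dataset

# If duplicates were present, they could be removed like this:
df = df.drop_duplicates()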
Checking Outliers:
• In our dataset, the attribute
'income' exhibits outlier data
points, which are values that
deviate markedly from the
other observations,
potentially impacting the
statistical analysis and the
performance of any
predictive models by
introducing skewness and
variability.
Imputation:
• To address the outlier data
points in the dataset, we will
employ quantile-based
methods to identify and
remove outliers from the
'income' attribute. This
involves calculating specific
quantiles (25%, 50%, 75%)
and excluding data points
that fall below or above
these thresholds, thereby
mitigating the impact of
extreme values on
subsequent analyses.
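A sketch of the quantile-based removal on 'Income'; the slide only states that the 25%, 50%, and 75% quantiles are computed, so the exact cut-offs below (1.5 × IQR beyond the 25th and 75th percentiles) are an assumption:

# Quantile-based outlier removal on 'Income' (1.5*IQR fences are an assumed choice).
q1 = df["Income"].quantile(0.25)
q3 = df["Income"].quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
df = df[(df["Income"] >= lower) & (df["Income"] <= upper)]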
Visualisation:
• This scatter plot represents
the relationship between
the 'ID' and 'MntWines'
features.
• Each point corresponds to
an individual observation
from the dataset, plotted
according to its 'ID' and
'MntWines' values,
facilitating the analysis of
patterns and potential
correlations between these
two variables.
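A minimal sketch of how such a scatter plot can be produced with matplotlib:

import matplotlib.pyplot as plt

# Scatter plot of 'ID' against 'MntWines'.
plt.scatter(df["ID"], df["MntWines"], s=10)
plt.xlabel("ID")
plt.ylabel("MntWines")
plt.title("ID vs MntWines")
plt.show()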
Histogram:
• This histogram is plotted for 'Recency' to identify patterns
such as skewness or the presence of outliers in the data distribution.
• Through this histogram we analyse the distribution of 'Recency'
and understand the spread of the data in the dataset.
Bar Plot:
• This bar plot is plotted for 'AcceptedCmp3' to compare
offer acceptance by customers in Campaign 3.
• Here we observed that Campaign 2 has the lowest acceptance
compared to the other campaigns.
Feature Engineering
(Labeling And Scaling):
Methodologies:
Labeling Methods:
• Label Encoder
• One-Hot Encoder
• Dummies
Scaling Methods:
• Standard Scaler
• MinMax Scaler
• Robust Scaler
• Normalizer
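As an illustration of the scaling methods listed above, a sketch using StandardScaler on the numeric columns (which scaler to use and which columns to scale is a modelling choice, not something fixed by the slides):

from sklearn.preprocessing import StandardScaler

# Scale numeric columns to zero mean and unit variance.
scaler = StandardScaler()
numeric_cols = df.select_dtypes(include="number").columns
df[numeric_cols] = scaler.fit_transform(df[numeric_cols])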
Label Encoder:
• LabelEncoder is a utility in the Python scikit-learn library used for converting
categorical data into numerical format.
• It assigns a unique integer to each category, enabling machine learning algorithms
to operate on such data.
• This transformation enables algorithms to effectively interpret and learn from
categorical data during model training.
There are three columns, "Education", "Marital_Status", and "Dt_Customer", in our dataset
that are in object and date-time format; we will convert them into numerical values by
label encoding.
Check Correlation:
• To check correlation between
variables in a dataset, Pandas
provides the corr() function,
yielding a correlation matrix.
• It was observed that the attributes
'Z_CostContact' and 'Z_Revenue'
exhibit a significantly high
correlation.
Imputation:
• Attributes exhibiting high correlation were identified and subsequently
removed from the dataset.
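A sketch of the correlation check and the subsequent removal of the flagged attributes:

# Correlation matrix over the numeric columns.
corr = df.corr(numeric_only=True)
print(corr)

# Drop the attributes identified as highly correlated on the previous slide.
df = df.drop(columns=["Z_CostContact", "Z_Revenue"])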
Feature Selection:
(Variance Threshold)
Variance Threshold:
• Variance Threshold is a feature selection technique used to remove
low-variance features from a dataset.
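A sketch of applying VarianceThreshold to the encoded, numeric DataFrame (the 0.1 threshold is an illustrative value, not one stated on the slides):

from sklearn.feature_selection import VarianceThreshold

# Keep only features whose variance exceeds the threshold.
selector = VarianceThreshold(threshold=0.1)
X_selected = selector.fit_transform(df)
print(selector.get_support())  # boolean mask of retained features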
Building Model
PCA
PCA is a dimensionality reduction technique that transforms the data
into a new coordinate system. The new coordinates, or principal
components, are linear combinations of the original features. These
components are ordered by the amount of variance they capture from
the data. The first principal component captures the most variance.
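A sketch of the PCA step on the selected features; reducing to 2 components is an assumption made here so the clusters can be plotted, as the slides do not state the number:

from sklearn.decomposition import PCA

# Project the selected features onto 2 principal components.
pca = PCA(n_components=2)
pca_trf = pca.fit_transform(X_selected)
print(pca.explained_variance_ratio_)  # variance captured by each component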
K MEANS
K-Means is a popular clustering algorithm used to partition a dataset
into K distinct, non-overlapping subsets (or clusters). It aims to
minimize the within-cluster variance, ensuring that points within each
cluster are as similar as possible.
Elbow method
The elbow method is a commonly
used technique in clustering to
determine the optimal number of
clusters (K). The goal is to find a
balance between having too few
and too many clusters, leading to a
meaningful segmentation of the
data.
2 is the best number of clusters based on the Elbow Method.
If the Elbow Method indicates that 2 is the optimal number of clusters,
then apply K-Means clustering with 𝐾=2.
Steps:
• Fit the K-Means algorithm with 𝐾=2.
• Predict the cluster labels for each data point.
• Visualize the clusters and the cluster centroids.
• Analyze and interpret the resulting clusters.
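A sketch of the elbow method and the final K-Means fit on the PCA-transformed data (the K range 1–10 and the random_state are illustrative choices):

from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

# Elbow method: within-cluster sum of squares (inertia) for K = 1..10.
inertias = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(pca_trf)
    inertias.append(km.inertia_)
plt.plot(range(1, 11), inertias, marker="o")
plt.xlabel("K")
plt.ylabel("Inertia")
plt.show()

# Fit the final model with K = 2 and predict cluster labels.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
labels = kmeans.fit_predict(pca_trf)

# Visualize the clusters and the cluster centroids.
plt.scatter(pca_trf[:, 0], pca_trf[:, 1], c=labels, s=10)
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
            c="red", marker="x", s=100)
plt.show()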
Clustering Result
After applying K-Means clustering
with 𝐾=2, we can delve into the
characteristics of each cluster to
better understand the customer
segments.
• Cluster 1: Represents older,
lower-income customers with
lower spending and shorter
engagement.
• Cluster 2: Represents younger,
higher-income customers with
higher spending and longer
engagement.
Cluster Profiling
Cluster profiling involves
analyzing and summarizing the
characteristics of clusters that have
been identified through clustering
algorithms.
Customer 1: High-Earning Singles
• This cluster predominantly comprises
individuals born between the mid-1980s
and early 1990s, typically possessing
advanced educational qualifications
such as master's degrees or doctorates.
• These customers are primarily single or
cohabiting without dependent children.
• They exhibit high annual household
incomes, significantly exceeding the
median, which enables financially
secure lifestyles characterized by
frequent, discretionary spending on
premium and high-quality goods and
services.
• Their residence is primarily urban,
facilitating an affluent lifestyle that
encompasses fine dining, extensive
travel, and investment in personal
development and enrichment activities.
Customer 2: Middle-Income Families
• This segment comprises individuals
born between the early 1970s and
mid-1980s.
• Predominantly married, they
manage households with one or
more children, including teenagers.
• Their educational attainment and
their household incomes range from
moderate to upper-middle class.
• Their expenditure patterns are
driven by family-oriented needs, with
a significant allocation towards
education, groceries, and household
necessities.
• Residing primarily in suburban
areas, they place a high value on
community and safety.
• While their shopping frequency is
lower relative to other segments,
their purchasing decisions are
deliberate and well-considered.
DBSCAN Clustering
• DBSCAN Initialization: The DBSCAN algorithm is initialized with
eps=0.5 and min_samples=2.
• eps: The maximum distance between two samples for one to be
considered as in the neighborhood of the other.
• min_samples: The number of samples (or total weight) in a neighborhood
for a point to be considered as a core point.
• Fit and Predict: The fit_predict method applies DBSCAN to the PCA-
transformed data (pca_trf), resulting in cluster labels (label_db).
• Silhouette Score: The silhouette score, which measures how similar an
object is to its own cluster compared to other clusters, is calculated. A
higher silhouette score indicates better-defined clusters.
• Output: The silhouette score for DBSCAN is printed:
0.23263068785884128.
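A sketch of the DBSCAN step as described above, using the stated parameters on the PCA-transformed data:

from sklearn.cluster import DBSCAN
from sklearn.metrics import silhouette_score

# DBSCAN with the parameters from the slide.
db = DBSCAN(eps=0.5, min_samples=2)
label_db = db.fit_predict(pca_trf)

# Silhouette score over the resulting labels (the slide reports ~0.23).
print(silhouette_score(pca_trf, label_db))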
Visual Outputs
Silhouette Score for DBSCAN:
• The silhouette score for DBSCAN clustering is 0.23, indicating that the clusters are not well-
defined compared to other clustering methods used (like KMeans).
Scatter Plot:
• The scatter plot visualizes the PCA-transformed data with the clusters identified by DBSCAN.
• Points are colored based on their cluster assignments, showing how DBSCAN grouped the data.
• The plot highlights the density-based clusters and shows potential noise points (usually assigned a
different color or labeled as -1 in DBSCAN).
Summary:
➔ Checked null values.
➔ Checked duplicates.
➔ Checked outliers.
➔ Feature Engineering.
➔ Feature Selection.
➔ Model Building.
➔ Clustering Result.
➔ Clustering Profile.
➔ DBSCAN Clustering.
Thank You
