This data analysis project, presented by the Boston Institute of Analytics, explores customer churn prediction in the insurance industry. It covers analysis of historical data and customer demographics, identification of predictive indicators, and development of churn prediction models, together with actionable recommendations derived from the analysis. To learn more about the institute's data science and artificial intelligence programs, visit https://bostoninstituteofanalytics.org/data-science-and-artificial-intelligence/.
3. Introduction
• The insurance sector is undergoing rapid transformation, propelled by technological advancements, shifting consumer preferences, and a fiercely competitive market.
• Policyholder churn, the phenomenon of policyholders terminating their relationship with an insurance company, presents distinct challenges and opportunities. When an insurance company loses policyholders, its revenue and market position can suffer significantly.
Through data-driven insights and predictive modeling, this presentation showcases my Machine Learning Capstone Project focused on predicting customer churn in the insurance sector.
4. Why the Insurance Domain?
I chose the insurance domain for my Capstone Project because:
Customer Behavior: Insurance is all about understanding customer behavior. Predicting how policyholders decide to keep or drop their policies, and what influences those decisions, is like solving an intriguing puzzle.
Regulatory Environment: The insurance market is full of rules and regulations. These rules keep changing, and adapting to these changes is a challenge, but it also keeps things interesting.
Data Privacy: Customer data in insurance is confidential. We need to figure out how to analyze it without compromising privacy, making it a complex but fascinating task.
Diverse Customers: Every customer's coverage needs are different. Managing relationships with a diverse customer base adds another layer of complexity.
Technological Advancements: New technologies are always emerging, especially in insurance. Figuring out how to leverage these technological innovations to enhance the customer experience is part of the adventure. 😊
5. Project's Significance and its Benefits to the Insurance Company
• Better Customer Experience: By predicting churn, we can create personalized strategies that
improve relationships with customers and make them happier.
• Saving Money: It’s cheaper to keep existing customers than to find new ones. By predicting and
reducing churn, we can keep more customers and increase our profits.
• Reducing Risk: By figuring out which customers might leave, we can take steps to prevent it. This
helps us manage risks and plan better.
• Staying Competitive: By managing churn effectively, our insurance company can stand out by
offering services that are tailored to each customer’s needs. This gives us an edge over our
competitors.
• Long-Term Success: Our project doesn't just help retain customers; it also helps the insurance company succeed in the long run. By focusing on customers and reducing churn, we're building a business that lasts.
6. Dataset Information
Our dataset is a comprehensive collection of 9,134 records (consistent with the 7,307-row training set shown later, which is 80% of the data). Each record represents a unique customer, contributing to the depth and breadth of our analysis.
The dataset includes a diverse set of 24 features, each offering valuable insights into customer behavior, preferences, and insurance policies. These features form the foundation of our predictive modeling.
7. Exploratory Data Analysis (EDA)
• Exploring the data gave us a comprehensive overview of the data's structure. It uncovered potential patterns, helped us identify key trends, and surfaced essential insights from the dataset.
• Throughout the EDA process, we analyzed the distribution of
individual features, investigated correlations, and explored any
inherent relationships between variables.
• Visualizations also played a crucial role in providing a clear
representation of the data, offering insights into customer behavior
and identifying the factors that may contribute to customer churn.
8. Exploratory Data Analysis (EDA)
• First, we verified that there were no null values or duplicate records in the dataset. Luckily, there weren't any; our dataset was clean to begin with.
• In our exploratory data analysis, we discovered an imbalance in our target variable, "Response". Over 7,000 individuals had not responded, creating a class imbalance. This insight guided our next steps in addressing the issue for a more balanced and accurate predictive model.
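The checks described on this slide can be sketched in a few lines of pandas. This is a minimal illustration on a toy dataframe, not the project's actual code; only the target column name "Response" comes from the slides.

```python
import pandas as pd

# Toy stand-in for the project's dataframe; the real dataset has 24 columns.
# "Response" is the target column named in the slides.
df = pd.DataFrame({
    "Response": ["No"] * 8 + ["Yes"] * 2,
    "Income": [30000, 45000, 52000, 61000, 28000,
               75000, 39000, 48000, 55000, 62000],
})

# Sanity checks described on this slide: no nulls, no duplicate rows.
assert df.isnull().sum().sum() == 0
assert df.duplicated().sum() == 0

# Class distribution of the target reveals the imbalance.
response_counts = df["Response"].value_counts()
print(response_counts)
```

On the real data, the same `value_counts()` call is what surfaces the 7,000+ "No" responses mentioned above.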
9. Visualizations
• Most responses fall under the "No" category, with Two-Door Cars having the highest count, over 3,500.
• There are significantly fewer "Yes" responses across all vehicle classes.
• In the CLV histogram, the x-axis represents ranges of Customer Lifetime Value (CLV), and the y-axis represents the number of customers falling within those ranges.
• The majority of customers have a CLV between 0 and 10,000.
• As customer lifetime value increases, the frequency decreases sharply.
10. • Churn Customers Demographics: This table shows the percentage of churned customers by
employment status and gender. For example, 70.31% of female and 74.03% of male churned
customers are retired.
• Response by Education: This bar graph shows the count of customers based on their education
levels: Bachelor, College, High School or Below, Master, Doctor. The highest count is for customers
with a Bachelor’s degree, and the lowest is for those with a Doctorate.
11. • The x-axis represents the days, marked at intervals of 5 up to 35 days.
• The y-axis represents the sum or number of policies, ranging from 0 to 1000.
• The orange line represents the trend in policies.
• For most of the duration (up to day 30), the policy count fluctuates between approximately 500 and
just under a thousand.
• After day 30, there’s a significant drop in policy count.
12. This heatmap provides insights into the relationships between different variables, which could be useful for
understanding patterns and dependencies in the data.
Red indicates positive correlation while blue indicates negative correlation.
Variables like “Customer Lifetime Value”, “Income”, “Monthly Premium Auto”, “Months Since Last Claim”,
“Months Since Policy Inception”, “Number of Open Complaints”, “Number of Policies” and “Total Claim
Amount” are being compared.
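The correlation matrix behind a heatmap like this is typically computed with pandas. Below is a minimal sketch with made-up values for four of the columns named on the slide; the real project compares all eight numeric variables and would render the matrix with a plotting library such as seaborn.

```python
import pandas as pd

# Hypothetical sample values for a few numeric columns from the slide.
df = pd.DataFrame({
    "Customer Lifetime Value": [2800.0, 6900.0, 12900.0, 7600.0, 2800.0],
    "Income": [56274, 0, 48767, 0, 43836],
    "Monthly Premium Auto": [69, 94, 108, 106, 73],
    "Total Claim Amount": [384.8, 1131.5, 566.5, 529.9, 138.1],
})

# Pairwise Pearson correlations; in the project this matrix is rendered
# as a heatmap where red = positive and blue = negative correlation.
corr = df.corr()
print(corr.round(2))
```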
13. Preprocessing
• Dropping Irrelevant Columns: The ‘Customer’ and ‘Effective To Date’ columns are
dropped from the dataframe ‘df’ as they are deemed irrelevant for the analysis.
• Encoding Target Variable: The ‘Response’ column, which is the target variable, is
encoded from ‘Yes’ or ‘No’ to ‘1’ or ‘0’. This is done to facilitate machine learning
algorithms which work better with numerical data.
Splitting the Data into X and y
• In this step, we partitioned the dataset into two components: X and y.
• The variable X encompasses all independent variables, representing the features
that contribute to our predictions.
• On the other hand, y encapsulates the dependent variable or target variable,
serving as the outcome we aim to predict.
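The preprocessing steps above can be sketched as follows. This is a minimal, self-contained illustration on a three-row toy dataframe; the column names ("Customer", "Effective To Date", "Response") come from the slides, while the sample values are invented.

```python
import pandas as pd

# Toy dataframe with invented values; column names follow the slides.
df = pd.DataFrame({
    "Customer": ["BU79786", "QZ44356", "AI49188"],
    "Effective To Date": ["2/24/11", "1/31/11", "2/19/11"],
    "Income": [56274, 0, 48767],
    "Response": ["No", "Yes", "No"],
})

# Dropping irrelevant identifier/date columns.
df = df.drop(columns=["Customer", "Effective To Date"])

# Encoding the target: "Yes" -> 1, "No" -> 0.
df["Response"] = df["Response"].map({"Yes": 1, "No": 0})

# Splitting into features X and target y.
X = df.drop(columns=["Response"])
y = df["Response"]
print(X.columns.tolist(), y.tolist())
```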
14. Train-Test Split
• We then split the dataset into training data and testing data.
• We did an 80:20 split, meaning 80% of our data is Training Data and 20% of our data is
Testing Data. So, our test size was set to 0.2.
• We set the random state to 123, which guarantees the reproducibility of our results across different runs.
Standard Scaler
• We used StandardScaler to standardize the features of the dataset, bringing them all onto a comparable scale (zero mean, unit variance).
• Standardization is crucial for certain machine learning algorithms, promoting optimal model performance by mitigating the influence of varying magnitudes among features.
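The split-and-scale steps can be sketched with scikit-learn as below. The data here is synthetic; the 80:20 split and `random_state=123` follow the slides. Note that the scaler is fit on the training data only, then applied to both sets, so no test-set information leaks into preprocessing.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic feature matrix standing in for the encoded dataset.
rng = np.random.default_rng(123)
X = rng.normal(loc=50000, scale=15000, size=(100, 3))
y = rng.integers(0, 2, size=100)

# 80:20 split with random_state=123, as in the slides.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=123
)

# Fit the scaler on training data only, then transform both sets.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Each training feature now has mean ~0 and standard deviation 1.
print(X_train_scaled.mean(axis=0).round(6))
```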
15. Over-Sampling with SMOTE
• We had data imbalance within our target variable. Initially, we evaluated our model's
accuracy in the presence of this imbalance.
• Then, to rectify the issue of imbalance, we implemented the Synthetic Minority Over-
Sampling Technique (SMOTE) as an oversampling method.
• We then compared the model accuracies before and after addressing the data imbalance using
SMOTE, providing valuable insights into the impact of this preprocessing technique.
• Distribution of our y_train before oversampling:
  Not Churned: 6261 | Churned: 1046
• Distribution of our y_train after oversampling:
  Not Churned: 6261 | Churned: 6261
16. Applying Machine Learning Algorithms
This insurance company churn problem is a binary classification problem.
Models used:
• Logistic Regression: Logistic Regression is a powerful tool in binary classification. It's very good at modeling the probability of an event occurring, making it suitable for scenarios where understanding the likelihood of customers churning is essential.
• Support Vector Classification (SVC) : Support Vector Classification is a robust algorithm employed for
classification tasks, especially when there's a need for clear separation between classes. In the context of customer
churn prediction, it draws distinct decision boundaries between loyal and potential churned customers.
• Random Forest Classification: Random Forest Classification is a powerful algorithm used for both classification
and regression tasks. It operates by constructing multiple decision trees during training and outputting the class
that is the mode of the classes for classification, or mean prediction for regression. In the context of customer
churn prediction, it can handle a large number of features and identify the most significant ones, making it
effective in predicting customer churn.
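Fitting the three models named above can be sketched with scikit-learn as follows. The data is synthetic (standing in for the scaled, SMOTE-balanced training set), and scoring on the training data is for illustration only; the project evaluates on the held-out test set.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification

# Synthetic binary-classification data standing in for the
# scaled, SMOTE-balanced training set.
X, y = make_classification(n_samples=300, n_features=8, random_state=123)

# The three models named on this slide.
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "SVC": SVC(),
    "Random Forest": RandomForestClassifier(random_state=123),
}

accuracies = {}
for name, model in models.items():
    model.fit(X, y)                       # train on the toy data
    accuracies[name] = model.score(X, y)  # training accuracy, illustrative
print(accuracies)
```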
17. Model Selection and Considerations
The Random Forest model was chosen for its high accuracy of 98% and its resistance to
overfitting. This was confirmed by consistently high cross-validation scores, averaging
around 0.984. Therefore, due to its performance and stability, the Random Forest model was found to be the best fit for this analysis.
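The cross-validation check behind the ~0.984 figure can be sketched as below. The data is again synthetic; on the project's real data the mean of five such fold scores is what this slide reports.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.datasets import make_classification

# Toy data; in the project this is the scaled, resampled training set.
X, y = make_classification(n_samples=300, n_features=8, random_state=123)

# 5-fold cross-validation of the chosen Random Forest model; consistently
# high fold scores indicate the model is not simply overfitting one split.
rf = RandomForestClassifier(random_state=123)
scores = cross_val_score(rf, X, y, cv=5)
print(scores.round(3), scores.mean().round(3))
```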
18. Conclusion
• With the help of several insights, patterns and trends in our data, we’ve used Machine Learning to
address the intricate challenge of predicting Customer Churn.
• This project offers significant benefits to an insurance company:
By predicting potential churners, insurance companies can adopt proactive strategies to
retain valuable customers. This involves personalized interventions, loyalty programs, and
targeted communication to address customer concerns and enhance satisfaction.
Understanding the factors influencing customer churn enables insurance companies to tailor
their services to meet individual needs. This level of personalization fosters stronger customer
relationships, increases loyalty, and enhances the overall policyholder experience.