This document is a report analyzing revenue decline at a Portuguese banking institution. It uses data on 41188 clients to predict subscription to term deposits through machine learning models. The methodology section describes data preparation including handling missing values and outliers. Various data exploration visualizations are presented. KNN and decision tree models are applied at different train-test splits, with decision tree achieving slightly better accuracy scores. The duration attribute is found to most influence subscription. The report concludes decision trees perform better than KNN for this prediction problem.
Classification Problem with KNN
Practical Data Science
Assignment – 2
Report on Revenue Decline for
Portuguese Banking Institution
Authors:
Phalgun Haribabu Chintal, s3702107
Santhosh Kumaravel Sundaravadivelu, s3729461
Table of contents
1. Introduction
2. Methodology
2.1 Data Preparation
2.2 Data Exploration
2.3 Data Modelling
3. Results
4. Discussion
5. Conclusion
Abstract
The purpose of this report is to predict whether each client of a Portuguese banking institution will subscribe to a term deposit, using data collected from direct marketing campaigns. The institution is attempting to grow its subscriber base. The findings show that some clients had difficulty taking up a term-deposit subscription and that, overall, the outcome depends most clearly on the duration attribute, which strongly affects the target variable. The report concludes by predicting whether or not each client holds the subscription.
1. Introduction
A term deposit earns interest once a set amount has been deposited with the bank. The bank has numerous rules and regulations for term deposits, chiefly that the money must be kept for a period of time that the client agrees to. The Portuguese banking organization experienced an unprecedented major decline in revenue and was seeking a solution to overcome this drawback. Some clients declined the subscription outright; for these records the duration is 0, as the outcome was decided even before the call was processed. When investigated, the central setback was that clients were not depositing money consistently. The idea behind term deposits is that the bank secures a financial gain by retaining the amount for a specific time period, which yields a profit. Furthermore, term deposits boost the chances of clients taking up other products or insurance, which gives the institution a further route to increasing revenue. For these reasons, the institution is working to close this gap and overcome the problem. Since this is a classification problem, we have used K-Nearest Neighbors (KNN) and decision tree algorithms.
2. Methodology
2.1 Data Preparation
2.1.1 Loading packages and dataset:
By default, not all packages are available in the Jupyter notebook, so all the packages required to perform the tasks are imported first. The dataset 'bank.csv' is loaded into the notebook using the pandas library, which provides convenient data structures and data-analysis tools for Python. ';' is used as the separator parameter because the columns in this dataset are separated by ';'. The dataset contains 41188 observations and 21 variables.
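As an illustrative sketch of this loading step (the file path is an assumption):

    import pandas as pd

    # Load the semicolon-delimited bank marketing data into a DataFrame.
    bank = pd.read_csv("bank.csv", sep=";")

    print(bank.shape)  # expected to show (41188, 21)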
2.1.2 Setting the column names:
The variables in the dataset are given new names to remove ambiguity. All 21 variable names are assigned through the DataFrame's columns attribute.
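A sketch of the renaming step; the 21 names below are illustrative (based on the public UCI bank marketing attributes and the terms used later in this report) rather than the exact names chosen by the authors:

    # Illustrative column names; the report does not list the exact names used.
    bank.columns = [
        "age", "job", "marital", "education", "default", "housing", "loan",
        "contact", "month", "day_of_week", "duration", "campaign", "pdays",
        "previous", "poutcome", "variation_rate", "price_index",
        "confidence_index", "euribor", "num_employees", "subscription",
    ]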
2.1.3 Removal of whitespace
Observations in this dataset may contain stray whitespace. With 21 variables in the bank dataset, it is time consuming to check each one for whitespace manually, so a stripping function is applied across the variables. A helper remove_whitespace is defined with argument x, which stands for each value; if the value is a string containing surrounding whitespace, the whitespace is removed, otherwise the value is kept as the original observation.
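A minimal sketch of the remove_whitespace helper described above:

    def remove_whitespace(x):
        # Strip leading/trailing whitespace from string values; leave others unchanged.
        if isinstance(x, str):
            return x.strip()
        return x

    bank = bank.applymap(remove_whitespace)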
2.1.4 Converting string observations to lower case
The dataset carries a large number of string values, which makes it difficult to review all the observations. Some values may appear in upper case, which can cause errors when processed further. The recommended fix is to convert all strings to lower case. A helper remove_letter is defined with argument x, which stands for each value; if the value is a string containing upper-case characters, it is converted to lower case, otherwise the value is kept as the original.
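A corresponding sketch of the remove_letter helper for lower-casing strings:

    def remove_letter(x):
        # Convert string values to lower case; leave non-strings unchanged.
        if isinstance(x, str):
            return x.lower()
        return x

    bank = bank.applymap(remove_letter)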
2.1.5 Typo errors:
Occasionally, a dataset contains multiple typographical errors. From a close inspection of this dataset, no typographical errors were found.
2.1.6 Dealing with the missing values:
The bank dataset holds various 'unknown' observations, which stand in for missing values in some categorical attributes. To deal with these missing observations, they are first converted into NaN values, and the ffill method is then applied to forward-fill all the NaN values.
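A sketch of this missing-value handling, assuming the categorical missing values are literally coded as the string 'unknown':

    import numpy as np

    # Treat 'unknown' entries as missing, then forward-fill them.
    bank = bank.replace("unknown", np.nan)
    bank = bank.ffill()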
2.2 Data Exploration:
The box plot in fig.1 graphically describes groups of numerical data through their quartiles. The minimum duration is 0, while the upper end of the main body of the data is 74; anything outside this range is displayed as an outlier.
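For illustration only, a box plot like fig.1 could be produced with pandas and matplotlib (the column name is an assumption):

    import matplotlib.pyplot as plt

    # Box plot of call duration; values beyond the whiskers are drawn as outliers.
    bank.boxplot(column="duration")
    plt.show()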
The bar chart in fig.2 shows the counts of the number-of-employees variable. The value 5228.1 has the highest count, above 16000, while 5023.5 has the lowest, below 2000.

Fig.3 is a density plot showing the distribution of a numeric variable; the density curve is a smoothed histogram. The variable is the number of days passed, and the y-axis is the estimated density. Around 1000 days passed has the highest probability, and the area under the curve between two values gives an estimate of the probability of falling between them.
Fig.4 illustrates the proportion of the two contact types used by the Portuguese banking institution. Cellular contact accounts for about 60% of contacts, with telephone making up the remaining portion.

The box plot in fig.5 shows the variation rate of the institution. A large portion of the cases have a value greater than the median, and few have a lower value. There is one outlier, meaning that one value falls outside the inner fences.

The bar graph in fig.6 shows the number of previous contacts made by the banking organization. Clearly 0 previous contacts is by far the largest group. One previous contact accounts for just under 5000 cases, followed by 2 with at least 1000 counts, while 3 previous contacts accounts for only a few cases, the lowest figure in the chart.
The three-month rate in fig.7 is shown as a density curve. There is a peak rise in density at the value 0.

The pie chart in fig.8 shows the outcome variable for the institution. The 'nonexistent' outcome makes up the largest portion, failure is the second most common result, and success plays the smallest role.

Fig.9 shows the density plot of price_index; the curve reaches a peak density of about 500.

The density curve in fig.10 for the campaign variable of the banking institution shows a peak rise between 0 and 500.
Fig.11 illustrates the relationship between duration and the target variable, the subscription deposit. Durations ranging from 0 to 2000 are associated with approval of the term deposit, whereas durations between 50 and 2200 are associated with disapproval of the term deposit.

In fig.12, for the number-of-employees variable, the portion of clients who deposit is larger than the portion of those who do not.

The bar chart of the Euribor rate in fig.13 compares the chances of acceptance and rejection of the term deposit. When the Euribor is around 5, the count rises from 10000 to 14000 and a good portion take up the term deposit, while the unsuccessful outcome stays around 500. When the Euribor is between 1 and 2, and near 4, a higher portion of successful term deposits is observed. When the Euribor is near 1, the chances are equal for both subscription outcomes.
As shown in fig.14, when the number of days passed is 999, successful term deposits number about 35000, compared with 4000 declined. In contrast, when the days passed is between 0 and 20, the chance of rejection is just above that of a successful deposit.

In fig.15, the variation-rate values show more subscriptions than failed term deposits. A variation rate of 1 extends to about 15000 subscriptions, while -1 sits at the bottom.

The bar chart in fig.16 provides information about the price index of the bank institution. At a price index of 94.0 the subscription count was about 14000, higher than the rest of the index values by a very large margin. For price index values above 94.5, counts are lower for both subscription outcomes.

Fig.17 is the bar graph of campaign against subscription. Campaign values from 0 to 5 were the most significant, accounting for about 25000 subscriptions, while campaign values between 5 and 18 accounted for fewer than 2500 subscriptions, and fewer still did not subscribe.

Fig.18 shows the total number of subscriptions by the number of previous contacts. Zero previous contacts is fairly high with more than 35000 subscriptions, whereas 2, 3 and 4 previous contacts show roughly equal chances of subscription.
2.3 Data Modeling
This section describes the procedure for building the model that determines which clients are expected to subscribe to a term deposit. The target variable has binary observations, 'yes' and 'no', so this is a classification problem in which the data is classified with the help of the class label. Once the data was examined, multiple categorical variables were discovered; in order to fit them in the model, the categorical variables are converted into numeric variables. On further processing, it was seen that several variables still had missing data in them, so they were removed. The duration variable, on the other hand, is included because of its high correlation with whether a client takes up a subscription with the bank. Random Forest is used for feature selection, with the F1 score as the selection criterion, and a pipeline is built that links the Random Forest selector and KNN so that the best features are used together. K-Nearest Neighbors (KNN) and a decision tree are the two different models fitted to determine their performance in predicting whether a client subscribes to a term deposit or not. The data is split into test and train sets at ratios of 20% : 80%, 40% : 60%, and 50% : 50% respectively, as sketched below.
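A minimal sketch of this modelling step, assuming the cleaned DataFrame is called bank, the target column is named subscription, and illustrative hyperparameter grids (the exact settings used in the report are not listed):

    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.feature_selection import SelectFromModel
    from sklearn.model_selection import GridSearchCV, train_test_split
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.pipeline import Pipeline
    from sklearn.tree import DecisionTreeClassifier

    # One-hot encode the categorical predictors and binarise the target.
    X = pd.get_dummies(bank.drop(columns=["subscription"]))
    y = (bank["subscription"] == "yes").astype(int)

    for test_size in (0.2, 0.4, 0.5):  # the three test:train ratios used in the report
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=test_size, random_state=0, stratify=y)

        # Pipeline: Random Forest selects the most important features, KNN classifies.
        knn_pipe = Pipeline([
            ("select", SelectFromModel(RandomForestClassifier(n_estimators=100, random_state=0))),
            ("knn", KNeighborsClassifier()),
        ])
        knn_search = GridSearchCV(knn_pipe, {"knn__n_neighbors": [3, 5, 7, 9]},
                                  scoring="f1", cv=5)
        knn_search.fit(X_train, y_train)

        # Decision tree tuned over different depths, as mentioned in the discussion.
        tree_search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                                   {"max_depth": [3, 5, 7, 10]}, scoring="f1", cv=5)
        tree_search.fit(X_train, y_train)

        # Accuracy of the refitted best estimators on the held-out test split.
        print(test_size,
              knn_search.best_estimator_.score(X_test, y_test),
              tree_search.best_estimator_.score(X_test, y_test))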
3. Results
Results obtained after applying both models to the three splits are given below. For the K-NN model:
TEST TO TRAIN RATIO ACCURACY CLASSIFICATION ERROR
20% : 80% 0.91247 0.087521
40% : 60% 0.91090 0.089099
50% : 50% 0.90905 0.09094
Results for the Decision Tree model (best score) on the three splits are as follows:
TEST TO TRAIN RATIO ACCURACY CLASSIFICATION ERROR
20% : 80% 0.918062 0.081938
40% : 60% 0.916788 0.083212
50% : 50% 0.914538 0.085462
Classification report for K-NN Model:

TEST TO TRAIN RATIO  ACCURACY  BEST SCORE  INSTANCE  PRECISION  RECALL  F1-SCORE
20% : 80%  0.91247  0.90725  0  0.93  0.97  0.95
20% : 80%  0.91247  0.90725  1  0.68  0.42  0.52
40% : 60%  0.91090  0.90830  0  0.93  0.97  0.95
40% : 60%  0.91090  0.90830  1  0.66  0.44  0.53
50% : 50%  0.90905  0.90909  0  0.94  0.96  0.95
50% : 50%  0.90905  0.90909  1  0.62  0.50  0.55
Classification report for Decision Tree Model:

TEST TO TRAIN RATIO  ACCURACY  BEST SCORE  INSTANCE  PRECISION  RECALL  F1-SCORE
20% : 80%  0.918062  0.91402  0  0.93  0.98  0.95
20% : 80%  0.918062  0.91402  1  0.71  0.46  0.56
40% : 60%  0.916788  0.91299  0  0.95  0.96  0.95
40% : 60%  0.916788  0.91299  1  0.65  0.56  0.60
50% : 50%  0.914538  0.91264  0  0.94  0.96  0.95
50% : 50%  0.914538  0.91264  1  0.65  0.53  0.58
4. Discussion
The aim of the prediction was to determine what leads a client to subscribe to the term deposit, and both models performed well under the different circumstances, namely the 80:20, 60:40 and 50:50 splits. The KNN model performed well because a pipeline was used along with the plain classifier in order to get good results: Random Forest was able to filter out the best features, and a suitable number of neighbors was found for each split. The Decision Tree was more straightforward to apply than the KNN model; different depths were explored before selecting the right depth to get good results.
A few limitations were observed. The results may be biased because there is an imbalance in the target variable, which in turn may affect the overall result. This could be addressed with an undersampling or oversampling method in future work to get more reliable results; one possible approach is sketched below.
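As an illustration only (not a step performed in the report), the minority class could be oversampled before training, assuming the X_train and y_train objects from the modelling sketch above:

    import pandas as pd
    from sklearn.utils import resample

    # Put features and target side by side so rows can be resampled by class.
    train = pd.concat([X_train, y_train.rename("subscription")], axis=1)
    majority = train[train["subscription"] == 0]
    minority = train[train["subscription"] == 1]

    # Oversample the minority ('yes') class up to the size of the majority class.
    minority_up = resample(minority, replace=True,
                           n_samples=len(majority), random_state=0)
    balanced = pd.concat([majority, minority_up])

    X_bal = balanced.drop(columns=["subscription"])
    y_bal = balanced["subscription"]
    # The KNN pipeline or the decision tree can then be refitted on X_bal, y_bal.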
5. Conclusion
The objective of this investigation was to discover which attributes determine whether a client takes out a term deposit or not. In this study, different numbers of features were selected under the different circumstances when predicting the term-deposit outcome, while the remaining attributes had the smallest influence on the decision. The duration of the call and the number of previous contacts play the main role: the higher these attributes, the higher the chances of a term-deposit subscription. The bank can focus on these influential variables to target clients for a term deposit. To sum up, the Decision Tree achieves higher scores than the K-NN model, so the Decision Tree is the better model according to the results obtained.
References
Archive.ics.uci.edu (2019). UCI Machine Learning Repository: Bank Marketing Data Set. [online] Available at: https://archive.ics.uci.edu/ml/datasets/Bank+Marketing [Accessed 20 May 2019].

En.wikipedia.org (2019). Box plot. [online] Available at: https://en.wikipedia.org/wiki/Box_plot [Accessed 24 May 2019].