This is a complete end-to-end project for beginners on banking data, which predicts whether a client is going to pay the next month's premium. The project also covers data pre-processing, including univariate analysis, bivariate analysis, outlier detection, and imputation strategies, before finally making predictions.
1. Problem Statement:
Your client is an insurance company and they need your help in building a model to predict whether the policyholder (customer) will pay the next premium on time or not.
By looking at the problem statement we can understand that this is a classification problem.
Hypothesis Generation
1. Clients with high income will have a higher chance of paying the next premium
2. Clients with a high default rate have a higher chance of not paying the next premium
3. Clients with low income have a higher chance of not paying the next premium
4. Clients with medium income have a higher chance of not paying the premium if the premium cost is high
5. Clients of higher age have a higher chance of not paying the premium
(A small sketch for eyeballing the income-related hypotheses follows this list.)
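As a minimal sketch (not part of the original notebook), hypotheses 1 and 3 can be eyeballed by bucketing clients into income quartiles and comparing the share that pays the next premium (target = 1) in each bucket; it assumes the same train.csv and Income/target columns that are loaded later in this notebook.
import pandas as pd

Train = pd.read_csv("train.csv")
# Split clients into four equal-sized income buckets
income_band = pd.qcut(Train["Income"], 4, labels=["low", "mid-low", "mid-high", "high"])
# Mean of a 0/1 target per bucket = share of clients in that bucket who paid
print(Train.groupby(income_band)["target"].mean())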
In [2]: # Loading libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
In [3]: # It gives the current working directory
import os
os.getcwd()
Data Extraction
In [6]: # Load the data sets
Test = pd.read_csv("test.csv")
Train = pd.read_csv("train.csv")
Exploratory Data Analysis
Steps in EDA:
1. Variable Identification
2. Univariate Analysis for Continuous Variables
3. Univariate Analysis for Categorical Variables
4. Bivariate Analysis for both Continuous and Categorical Variables
5. Treating Missing Values
6. Outlier Treatment
7. Variable Transformation
In [12]: # Variable Description
# id - Unique ID of the policy
# perc_premium_paid_by_cash_credit - Percentage of premium amount paid by cash or credit card
# age_in_days - Age in days of policy holder
# Income - Monthly Income of policy holder
# Count_3-6_months_late - No of premiums late by 3 to 6 months
# Count_6-12_months_late - No of premiums late by 6 to 12 months
# Count_more_than_12_months_late - No of premiums late by more than 12 months
# application_underwriting_score - No applications under the score of 90 are insured
# no_of_premiums_paid - Total premiums paid on time till now
# sourcing_channel - Sourcing channel for application
# residence_area_type - Area type of Residence (Urban/Rural)
In [14]: # Generally int and float columns are continuous variables, but sometimes integer variables are categorical in nature
Train["Count_3-6_months_late"].value_counts()
In [18]: # If you notice, the age_in_days variable is in days; let's transform it to years
Train['age_in_days'] = Train["age_in_days"]/365
Note:
1. The perc_premium_paid_by_cash_credit variable has no outliers
2. The Age distribution is normal; values above 90 are outliers
3. application_underwriting_score is left-skewed and scores below 98 are outliers
(A generic IQR-based outlier check is sketched after this note.)
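As a hedged alternative to reading outliers off the box plots, here is a minimal sketch of the common 1.5*IQR rule; it assumes the Train frame loaded above (age_in_days has already been converted to years at this point).
def iqr_outlier_mask(series):
    # Flag values outside the 1.5*IQR whiskers that the box plots use
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - 1.5 * iqr) | (series > q3 + 1.5 * iqr)

print(iqr_outlier_mask(Train["age_in_days"]).sum())  # number of flagged ages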
Univariate Analysis for Categorical Variables
Note
1. (Count_3-6_months_late, Count_6-12_months_late, Count_more_than_12_months_late) are 0 most of the time; very rarely do people miss payments for 3-12 months
2. no_of_premiums_paid: the largest group of clients paid 8 times, and the counts keep decreasing from there
3. sourcing_channel: 50% of the clients came from channel A
4. residence_area_type: nearly 60% of clients are from urban areas
5. Less than 10% of clients are defaulters
In [41]: # Categorical - Continuous Bivariate analysis
Train.groupby("Count_3-6_months_late")['Income'].mean().plot.bar()
# No clear trend, but some clients have both higher income and a higher default rate
# Some clients have low income and a higher default rate
In [42]: Train.groupby("Count_6-12_months_late")['Income'].mean().plot.bar()
# No clear trend, but some clients have both higher income and a higher default rate
# Some clients have low income and a higher default rate
In [43]: Train.groupby("Count_more_than_12_months_late")['Income'].mean().plot.bar()
# This attribute shows a clearer trend: the higher the income, the higher the default rate
In [44]: Train.groupby("sourcing_channel")['Income'].mean().plot.bar()
# Sourcing channel E has high income
In [45]: Train.groupby("residence_area_type")['Income'].mean().plot.bar()
# Incomes of rural and urban clients are almost the same
In [46]: Train.groupby("target")['Income'].mean().plot.bar()
# Income less than 175k has a higher chance of default
In [47]: Train.groupby("target")['Age'].mean().plot.bar()
# Age less than 50 has a higher chance of default
In [48]: fig = plt.figure(figsize=(18, 7))
Train.groupby("no_of_premiums_paid")['Age'].mean().plot.bar()
# As age increases, the number of premiums paid increases
Categorical - Categorical Bivariate Analysis
In [49]: # Create 2-way tables
pd.crosstab(Train['sourcing_channel'], Train["target"])
# From this we can understand that sourcing channel 'A' has low income and a high chance of default
# Overall, percentage-wise, channel B has a higher chance of default
In [50]: pd.crosstab(Train['residence_area_type'], Train["target"])
# Rural clients have a higher chance of default
Missing Values Treatment
In [51]: Train.isnull().sum()
In [57]: Train.isnull().sum()
In [58]: # Drop unwanted columns
Train = Train.drop(['no_of_premiums_paid'], axis=1)
In [59]: Test = Test.drop(['no_of_premiums_paid'], axis=1)
Outlier Treatment
In [60]: # Replace any age above 90 with the mean
import numpy as np
Train.loc[Train["Age"] > 90, 'Age'] = np.mean(Train["Age"])
Data Transformation
In [62]: # We have to convert categorical variables to numbers; editing manually takes a lot of time, so we will use the LabelEncoder function
from sklearn.preprocessing import LabelEncoder
number = LabelEncoder()
Train["sourcing_channel"] = number.fit_transform(Train["sourcing_channel"].astype('str'))
Test["sourcing_channel"] = number.fit_transform(Test["sourcing_channel"].astype("str"))
Model Building
In [65]: # Drop variables that are not correlated with our target variable in both the train and test data sets
x_train = Train.drop(['target', 'Age', 'Income', 'application_underwriting_score', 'residence_area_type', 'sourcing_channel'], axis=1)
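As a hedged sketch of an alternative to the hand-picked drop list, weakly correlated features could be filtered programmatically; the 0.05 threshold below is an arbitrary illustration, not a value from this notebook.
# Absolute correlation of each numeric column with the target
corr_with_target = Train.select_dtypes(include="number").corr()["target"].abs()
weak = [c for c in corr_with_target[corr_with_target < 0.05].index if c != "target"]
x_train_auto = Train.drop(columns=["target"] + weak)
print(sorted(weak))  # features dropped by the automatic rule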
Out[3]: 'C:\\Users\\dell'
In [4]: # To change the working directory of the system
os.chdir('C:\\Users\\dell\\Downloads')
In [5]: os.getcwd()
Out[5]: 'C:\\Users\\dell\\Downloads'
In [7]: # Check whether the data loaded or not
Test.head()
Out[7]:
      id  perc_premium_paid_by_cash_credit  age_in_days  Income  Count_3-6_months_late  Count_6-12_months_late  ...
0    649                             0.001        27384   51150                    0.0                     0.0
1  81136                             0.124        23735  285140                    0.0                     0.0
2  70762                             1.000        17170  186030                    0.0                     0.0
3  53935                             0.198        16068  123540                    0.0                     0.0
4  15476                             0.041        10591  200020                    1.0                     0.0
In [8]: Train.head()
Out[8]:
       id  perc_premium_paid_by_cash_credit  age_in_days  Income  Count_3-6_months_late  Count_6-12_months_late  ...
0  110936                             0.429        12058  355060                    0.0                     0.0
1   41492                             0.010        21546  315150                    0.0                     0.0
2   31300                             0.917        17531   84140                    2.0                     3.0
3   19415                             0.049        15341  250510                    0.0                     0.0
4   99379                             0.052        31400  198680                    0.0                     0.0
In [9]: Test.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 34224 entries, 0 to 34223
Data columns (total 11 columns):
id                                  34224 non-null int64
perc_premium_paid_by_cash_credit    34224 non-null float64
age_in_days                         34224 non-null int64
Income                              34224 non-null int64
Count_3-6_months_late               34193 non-null float64
Count_6-12_months_late              34193 non-null float64
Count_more_than_12_months_late      34193 non-null float64
application_underwriting_score      32901 non-null float64
no_of_premiums_paid                 34224 non-null int64
sourcing_channel                    34224 non-null object
residence_area_type                 34224 non-null object
dtypes: float64(5), int64(4), object(2)
memory usage: 2.9+ MB
In [10]: Train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 79853 entries, 0 to 79852
Data columns (total 13 columns):
id                                  79853 non-null int64
perc_premium_paid_by_cash_credit    79853 non-null float64
age_in_days                         79853 non-null int64
Income                              79853 non-null int64
Count_3-6_months_late               79756 non-null float64
Count_6-12_months_late              79756 non-null float64
Count_more_than_12_months_late      79756 non-null float64
application_underwriting_score      76879 non-null float64
no_of_premiums_paid                 79853 non-null int64
sourcing_channel                    79853 non-null object
residence_area_type                 79853 non-null object
premium                             79853 non-null int64
target                              79853 non-null int64
dtypes: float64(5), int64(6), object(2)
memory usage: 7.9+ MB
If you observe carefully, the Test data set has only 11 columns whereas the Train data set has 13 columns. We will remove the premium column from the Train data set.
In [11]: Train = Train.drop(['premium'], axis=1)
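As a quick sanity check (a sketch, not in the original notebook), after dropping premium the only column Train should still have that Test lacks is the label itself:
print(set(Train.columns) - set(Test.columns))  # expected: {'target'}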
Variable Identification
In [13]: # Identify continuous and categorical variables
Train.dtypes
Out[13]: id                                    int64
         perc_premium_paid_by_cash_credit   float64
         age_in_days                          int64
         Income                               int64
         Count_3-6_months_late              float64
         Count_6-12_months_late             float64
         Count_more_than_12_months_late     float64
         application_underwriting_score     float64
         no_of_premiums_paid                  int64
         sourcing_channel                    object
         residence_area_type                 object
         target                               int64
         dtype: object
From the above output we notice that sourcing_channel and residence_area_type are categorical variables.
Out[14]: 0.0 66801
1.0 8826
2.0 2519
3.0 954
4.0 374
5.0 168
6.0 68
7.0 23
8.0 15
9.0 4
11.0 1
12.0 1
13.0 1
10.0 1
Name: Count_3-6_months_late, dtype: int64
In [15]: Train["Count_6-12_months_late"]. value_counts()
Out[15]: 0.0 75831
1.0 2680
2.0 693
3.0 317
4.0 130
5.0 46
6.0 26
7.0 11
8.0 5
10.0 4
9.0 4
14.0 2
11.0 2
13.0 2
17.0 1
12.0 1
15.0 1
Name: Count_6-12_months_late, dtype: int64
In [16]: Train["Count_more_than_12_months_late"].value_counts()
Out[16]: 0.0 76038
1.0 2996
2.0 498
3.0 151
4.0 48
5.0 13
6.0 6
7.0 3
8.0 2
11.0 1
Name: Count_more_than_12_months_late, dtype: int64
In [17]: Train["no_of_premiums_paid"]. value_counts()
Out[17]: 8 7184
9 7158
10 6873
7 6623
11 6395
6 5635
12 5407
13 4752
5 4215
14 3988
15 3264
4 2907
16 2678
17 2148
18 1799
3 1746
19 1355
20 1134
21 838
2 726
22 713
23 503
24 386
25 305
26 241
27 186
28 152
29 119
30 91
31 61
32 51
33 43
34 38
35 31
36 23
37 14
38 8
42 7
40 6
41 6
39 5
47 5
44 4
43 3
45 3
56 3
48 3
50 3
51 3
58 2
52 2
53 2
54 2
59 1
55 1
49 1
60 1
Name: no_of_premiums_paid, dtype: int64
In this section we noticed that the below variables are categorical in nature:
Count_3-6_months_late
Count_6-12_months_late
Count_more_than_12_months_late
application_underwriting_score
no_of_premiums_paid
sourcing_channel
residence_area_type
premium
target
In [19]: Test['age_in_days'] = Test["age_in_days"]/365
In [20]: # Now rename the 'age_in_days' variable to 'Age'
Train = Train.rename(columns={"age_in_days": "Age"})
In [21]: Test = Test.rename(columns={"age_in_days": "Age"})
Univariate Analysis of Continuous Variables
In [22]: Train[["perc_premium_paid_by_cash_credit", "Age", "Income", "application_underwriting_score"]].describe()
Out[22]:
       perc_premium_paid_by_cash_credit           Age        Income  application_underwriting_score
count                      79853.000000  79853.000000  7.985300e+04                    76879.000000
mean                           0.314288     51.634786  2.088472e+05                       99.067291
std                            0.334915     14.270463  4.965826e+05                        0.739799
min                            0.000000     21.013699  2.403000e+04                       91.900000
25%                            0.034000     41.024658  1.080100e+05                       98.810000
50%                            0.167000     51.027397  1.665600e+05                       99.210000
75%                            0.538000     62.016438  2.520900e+05                       99.540000
max                            1.000000    103.019178  9.026260e+07                       99.890000
In [23]: Train['perc_premium_paid_by_cash_credit'].plot.hist()
Out[23]: <matplotlib.axes._subplots.AxesSubplot at 0x25fc24e2550>
In [24]: Train['perc_premium_paid_by_cash_credit'].plot.box()
Out[24]: <matplotlib.axes._subplots.AxesSubplot at 0x25fc220d780>
In [25]: Train['Age'].plot.hist()
Out[25]: <matplotlib.axes._subplots.AxesSubplot at 0x25fc228bb00>
In [26]: Train['Age'].plot.box()
Out[26]: <matplotlib.axes._subplots.AxesSubplot at 0x25fc230e278>
In [27]: Train['application_underwriting_score'].plot.hist()
Out[27]: <matplotlib.axes._subplots.AxesSubplot at 0x25fc23bcba8>
In [28]: Train['application_underwriting_score'].plot.box()
Out[28]: <matplotlib.axes._subplots.AxesSubplot at 0x25fc2432c18>
In [29]: # Creating frequency tables and bar plots for the categorical variables
Train['Count_3-6_months_late'].value_counts().plot.bar()
Out[29]: <matplotlib.axes._subplots.AxesSubplot at 0x25fc2490978>
In [30]: Train['Count_6-12_months_late'].value_counts().plot.bar()
Out[30]: <matplotlib.axes._subplots.AxesSubplot at 0x25fc2a2d940>
In [31]: Train['Count_more_than_12_months_late'].value_counts().plot.bar()
Out[31]: <matplotlib.axes._subplots.AxesSubplot at 0x25fc2acc320>
In [32]: Train['application_underwriting_score'].value_counts().plot.hist()
Out[32]: <matplotlib.axes._subplots.AxesSubplot at 0x25fc2caab70>
In [33]: fig = plt.figure(figsize=(12, 7))
Train['no_of_premiums_paid'].value_counts().plot.bar()
Out[33]: <matplotlib.axes._subplots.AxesSubplot at 0x25fc2ca04e0>
In [34]: Train['sourcing_channel'].value_counts().plot.bar()
Out[34]: <matplotlib.axes._subplots.AxesSubplot at 0x25fc38047f0>
In [35]: Train['residence_area_type'].value_counts().plot.bar()
Out[35]: <matplotlib.axes._subplots.AxesSubplot at 0x25fc382f128>
In [36]: (Train['residence_area_type'].value_counts()/len(Train['residence_area_type'])).plot.bar()
Out[36]: <matplotlib.axes._subplots.AxesSubplot at 0x25fc1ebee10>
In [37]: Train['target'].value_counts().plot.bar()
Out[37]: <matplotlib.axes._subplots.AxesSubplot at 0x25fc2dc1f60>
In [38]: (Train['target'].value_counts()/len(Train['target'])).plot.bar()
Out[38]: <matplotlib.axes._subplots.AxesSubplot at 0x25fc36f9c18>
Continuous - Continuous Bivariate Analysis
In [39]: Train[["perc_premium_paid_by_cash_credit", "Age", "Income", "application_underwriting_score", "target"]].corr()
Out[39]:
                                  perc_premium_paid_by_cash_credit       Age    Income  application_underwriting_score
perc_premium_paid_by_cash_credit                          1.000000 -0.259131 -0.031868                       -0.142670
Age                                                      -0.259131  1.000000  0.029308                        0.049888
Income                                                   -0.031868  0.029308  1.000000                        0.085746
application_underwriting_score                           -0.142670  0.049888  0.085746                        1.000000
target                                                   -0.240980  0.095103  0.016541                        0.068715
In [40]: Train.head()
Out[40]:
       id  perc_premium_paid_by_cash_credit        Age  Income  Count_3-6_months_late  Count_6-12_months_late  ...
0  110936                             0.429  33.035616  355060                    0.0                     0.0
1   41492                             0.010  59.030137  315150                    0.0                     0.0
2   31300                             0.917  48.030137   84140                    2.0                     3.0
3   19415                             0.049  42.030137  250510                    0.0                     0.0
4   99379                             0.052  86.027397  198680                    0.0                     0.0
Out[41]: <matplotlib.axes._subplots.AxesSubplot at 0x25fc37770b8>
Out[42]: <matplotlib.axes._subplots.AxesSubplot at 0x25fc399f0b8>
Out[43]: <matplotlib.axes._subplots.AxesSubplot at 0x25fc3a41588>
Out[44]: <matplotlib.axes._subplots.AxesSubplot at 0x25fc3ac50b8>
Out[45]: <matplotlib.axes._subplots.AxesSubplot at 0x25fc3b20748>
Out[46]: <matplotlib.axes._subplots.AxesSubplot at 0x25fc3ac5c50>
Out[47]: <matplotlib.axes._subplots.AxesSubplot at 0x25fc3be6080>
Out[48]: <matplotlib.axes._subplots.AxesSubplot at 0x25fc3c41a20>
Out[49]:
target 0 1
sourcing_channel
A 2349 40785
B 1066 15446
C 903 11136
D 634 6925
E 46 563
Out[50]:
target 0 1
residence_area_type
Rural 1998 29672
Urban 3000 45183
Out[51]: id                                     0
         perc_premium_paid_by_cash_credit       0
         Age                                    0
         Income                                 0
         Count_3-6_months_late                 97
         Count_6-12_months_late                97
         Count_more_than_12_months_late        97
         application_underwriting_score      2974
         no_of_premiums_paid                    0
         sourcing_channel                       0
         residence_area_type                    0
         target                                 0
         dtype: int64
In [52]: # Replace missing application_underwriting_score values with 99 (approximately the column mean)
Train['application_underwriting_score'].fillna(99, inplace=True)
In [53]: # Replace in the Test set too
Test['application_underwriting_score'].fillna(99, inplace=True)
In [54]: # Now drop the remaining null values in the Train data set
Train = Train.dropna()
In [55]: # In the Test data set, replace the remaining nulls with 0
Test.fillna(0, inplace=True)
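As a hedged alternative sketch (not what this notebook does), the underwriting score could be imputed with the actual column mean computed from Train, used in place of the constant 99 above and reused for Test so both sets see the same value:
# Sketch: mean imputation as an alternative to filling with the constant 99
uw_mean = Train['application_underwriting_score'].mean()
Train['application_underwriting_score'].fillna(uw_mean, inplace=True)
Test['application_underwriting_score'].fillna(uw_mean, inplace=True)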
In [56]: # Now check for null values in both test and train data sets
Test.isnull().sum()
Out[56]: id                                  0
         perc_premium_paid_by_cash_credit    0
         Age                                 0
         Income                              0
         Count_3-6_months_late               0
         Count_6-12_months_late              0
         Count_more_than_12_months_late      0
         application_underwriting_score      0
         no_of_premiums_paid                 0
         sourcing_channel                    0
         residence_area_type                 0
         dtype: int64
Out[57]: id                                  0
         perc_premium_paid_by_cash_credit    0
         Age                                 0
         Income                              0
         Count_3-6_months_late               0
         Count_6-12_months_late              0
         Count_more_than_12_months_late      0
         application_underwriting_score      0
         no_of_premiums_paid                 0
         sourcing_channel                    0
         residence_area_type                 0
         target                              0
         dtype: int64
In [61]: np.power(Train['Income'], 1/5).plot.hist()
Out[61]: <matplotlib.axes._subplots.AxesSubplot at 0x25fc367d6d8>
In [63]: Train["residence_area_type"] = number.fit_transform(Train["residence_area_type"].astype('str'))
Test["residence_area_type"] = number.fit_transform(Test["residence_area_type"].astype("str"))
In [64]: Train.corr()
Out[64]:
                                        id  perc_premium_paid_by_cash_credit       Age    Income  Count_3-6_months_late  ...
id                                1.000000                         -0.004772  0.004306 -0.001816              -0.005660
perc_premium_paid_by_cash_credit -0.004772                          1.000000 -0.255676 -0.031341               0.214470
Age                               0.004306                         -0.255676  1.000000  0.030214              -0.057878
Income                           -0.001816                         -0.031341  0.030214  1.000000              -0.001403
Count_3-6_months_late            -0.005660                          0.214470 -0.057878 -0.001403               1.000000
Count_6-12_months_late           -0.002125                          0.214951 -0.072484 -0.017347               0.204228
Count_more_than_12_months_late    0.003424                          0.168125 -0.059602 -0.012399               0.296085
application_underwriting_score   -0.002084                         -0.138657  0.043666  0.062699              -0.081463
sourcing_channel                  0.001364                          0.082878 -0.215420  0.059663               0.058662
residence_area_type               0.001803                         -0.002013  0.000577  0.003470               0.001592
target                           -0.005365                         -0.237210  0.093163  0.015911              -0.248900
In [72]: Test_1 = Test.drop(['Age', 'Income', 'application_underwriting_score', 'residence_area_type', 'sourcing_channel'], axis=1)
In [67]: y_train = Train['target']
In [68]: import sklearn
In [69]: from sklearn.tree import DecisionTreeClassifier
In [70]: model_1 = DecisionTreeClassifier()
In [71]: # Training the model on the train data set
model_1.fit(x_train, y_train)
Out[71]: DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
max_features=None, max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, presort=False, random_state=None,
splitter='best')
In [80]: # Create a new variable target in test set
Test["target"] = model_1.predict(Test_1)
In [75]: # score of our model on train data set
model_1.score(x_train,y_train)
Out[75]: 1.0
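A training accuracy of 1.0 from an unpruned decision tree usually signals that the model has memorised the training data rather than learned something that generalises. The sketch below is not part of the original notebook; it assumes x_train and y_train as built above, and the max_depth and random_state values are illustrative, showing one way to get a more honest estimate with a held-out split and cross-validation:
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split, cross_val_score

# Hold out 20% of the training data for validation, preserving the class balance
X_tr, X_val, y_tr, y_val = train_test_split(
    x_train, y_train, test_size=0.2, random_state=42, stratify=y_train)
pruned_tree = DecisionTreeClassifier(max_depth=5, random_state=42)  # limit depth to curb overfitting
pruned_tree.fit(X_tr, y_tr)
print("validation accuracy:", pruned_tree.score(X_val, y_val))
print("5-fold CV accuracy:", cross_val_score(pruned_tree, x_train, y_train, cv=5).mean())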
In [81]: Test_2 = Test[["id", "target"]]
In [83]: Test_2.set_index('id').head()
Out[83]:
       target
id
649         1
81136       1
70762       1
53935       1
15476       1
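Finally, as a small sketch (the filename is an assumption, not from the notebook), the id/target pairs in Test_2 could be written out as a submission file:
Test_2.to_csv("submission.csv", index=False)  # two columns: id, predicted target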