How to create an explainable scorecard model using machine learning to optimize its performance, with results and insights from applying it to a stock picking problem.
Creating an Explainable Machine Learning Algorithm
Explainable Machine Learning
Creating an explainable ML algorithm, with results and insights from applying it to picking stocks for possible investment.
One of the major roadblocks in getting ML models into use is the difficulty in explaining
how the ML model works to users, management and independent model review
personnel. In comparison, statistical models are much easier to explain and a variety
of standard statistical tests are available to support model validation.
Statistical models include:
• Linear regression
• Logistic regression
• Discriminant models
• Clustering
For most of these, there are standard, Bayesian and regularized modules and packages in SAS, R, Python, etc. that are widely accepted and easy to use.
On the other hand, it is more difficult to explain Neural Networks and the various
versions of Random Forests to people not familiar with them.
The Importance of Easily Understood Models
In highly regulated financial applications, banks for example, it is extremely difficult to get models reviewed and implemented in short time frames. The more explanation that is necessary to understand how the model works, where it might not work, and how it was developed, the longer the model build-to-implementation cycle. For this reason, logistic regression models are, for the most part, the model of choice for credit risk.
Widely Used Models
There are several modeling approaches that are widely used in classification in industry
that are easily explainable.
Logistic Regression
For logistic regression, the result of the linear combination is input into a sigmoid to obtain the maximum likelihood estimator for the probabilities of a 0 or 1 outcome.
(1)  probability(y = 1) = exp(X * b) / (1.0 + exp(X * b))
usually written as:
(2)  p(y = 1) = 1.0 / (1.0 + exp(−1.0 * X * b))
Linear Regression
In some cases, the simpler linear regression model is used.
(3)  y = X * b
where y ∈ [−1, 1].
Mathematical Programming
Mathematical programming is another approach that is used to estimate a linear model. One formulation is:
(4)  min z = ∑_i e_i
subject to:
(5)  ∑_j (x_{i,j} * w_j) + e_i ≥ 0   ∀ i ∈ G1
(6)  ∑_j (x_{i,j} * w_j) − e_i ≤ 0   ∀ i ∈ G0
(7)  ∑_j w_j = 1
where:
G0 and G1 are the sets of observations with an outcome of 0 or 1 respectively.
The math programming approach results in a regression model that minimizes the sum of the absolute values of the errors in underestimating an observation with an outcome of 1 in G1 or overestimating an observation with an outcome of 0 in G0.
The objective function can be modified to maximize the separation of the mean linear scores of the sets G0 and G1 while penalizing the errors.
(8)  max z = N0 * ∑_{i ∈ G1} ∑_j (x_{i,j} * w_j) − N1 * ∑_{i ∈ G0} ∑_j (x_{i,j} * w_j) − ∑_i e_i
where N0 and N1 are the number of observations in the sets G0 and G1.
If desired, the predictions from this model can be used as X values in a logistic model to convert the predicted y values to probabilities.
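To make the formulation concrete, the following is a minimal sketch of estimating (4)-(7) with an off-the-shelf linear-programming solver (scipy's linprog). The function name, the cutoff of 0 and the use of linprog are illustrative assumptions, not part of the original model build.

```python
# Minimal sketch of the LP formulation (4)-(7): variable vector is [w_1..w_J, e_1..e_N].
import numpy as np
from scipy.optimize import linprog

def fit_lp_discriminant(X, y):
    n, J = X.shape
    # objective: minimize the sum of errors e_i (the weights w_j carry zero cost)
    c = np.concatenate([np.zeros(J), np.ones(n)])

    # inequality constraints written in A_ub @ z <= b_ub form
    rows = []
    for i in range(n):
        e_part = np.zeros(n)
        e_part[i] = 1.0
        if y[i] == 1:   # sum_j x_ij*w_j + e_i >= 0  ->  -x_i.w - e_i <= 0
            rows.append(np.concatenate([-X[i], -e_part]))
        else:           # sum_j x_ij*w_j - e_i <= 0
            rows.append(np.concatenate([X[i], -e_part]))
    A_ub = np.vstack(rows)
    b_ub = np.zeros(n)

    # normalization constraint: sum_j w_j = 1
    A_eq = np.concatenate([np.ones(J), np.zeros(n)]).reshape(1, -1)
    b_eq = np.array([1.0])

    bounds = [(None, None)] * J + [(0, None)] * n   # w free, e >= 0
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                  bounds=bounds, method="highs")
    return res.x[:J]    # the estimated weights
```

The returned weights give the linear score; as noted above, they can be passed through a logistic model if probabilities are needed.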
Model Estimation
Advantages of linear regression approaches
Software exists that makes estimating the above models easy. SAS, R and Python all
have procedures that are well-documented and widely used.
Automatic variable selection is a feature on most procedures.
Methods of avoiding overfitting are readily available:
• Probabilities for entry into the model / keeping a variable in the model can be set.
• The maximum number of variables allowed in the model can be limited.
• A set of variables that must be in the models can be specified.
• Regularization and cross-validation are built into some model estimation procedures (see the sketch after this list).
• Mathematical programming restrictions avoid overfitting.
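As one example of the regularization and cross-validation point above, here is a minimal scikit-learn sketch; the synthetic dataset and parameter choices are purely illustrative.

```python
# Minimal sketch: L1-regularized logistic regression with built-in cross-validation.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegressionCV

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# The L1 penalty limits the number of non-zero weights; the penalty strength
# is chosen by 5-fold cross-validation over a grid of 10 candidate values.
model = LogisticRegressionCV(Cs=10, cv=5, penalty="l1", solver="liblinear")
model.fit(X, y)
n_selected = (model.coef_ != 0).sum()   # number of predictors kept in the model
```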
Scorecards
In credit risk applications, models are estimated over binned variables. This results in
an X design matrix of 0/1s.
There are several advantages of doing this:
• Character variables can be grouped together by similar outcome probabilities.
• Missing data for any of the variables can be explicitly modeled in a 0/1 column of the
X matrix.
• Continuous variables are grouped together into bins over intervals. This allows for
the slope to change and for step changes in the predictors.
• The model weights can be transformed to integers then applied to the binned x
variables in a simple to understand way to generate the score.
• For example, if age is in a scorecard (a short code sketch follows this list):
• If age is below 23 subtract 10 points.
• If age is above 40 add 5.
• The binned categories for a given x variable can be used to generate a
transformed x variable by running a regression model on the bins of the variable.
This can be used to transform a character variable with many categories into 1
predictor variable.
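As a small illustration of the age example in the list above, here is a minimal sketch of applying binned integer points; treating ages 23-40 and missing values as 0 points is an assumption made only for this illustration.

```python
# Minimal sketch of scoring with binned, integer scorecard points.
def age_points(age):
    if age is None:
        return 0          # assumed: missing age contributes nothing
    if age < 23:
        return -10        # "if age is below 23 subtract 10 points"
    if age > 40:
        return 5          # "if age is above 40 add 5"
    return 0              # assumed: ages 23-40 contribute nothing

def score(applicant, rules):
    # rules maps a variable name to its bin -> points function
    return sum(rule(applicant.get(name)) for name, rule in rules.items())

print(score({"age": 45}, {"age": age_points}))   # -> 5
```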
Usage
Scorecards estimated with logistic regression are used or can be used in a variety of
industries to manage risk.
• Financial institutions for application scoring.
• Financial institutions for fraud modeling.
• Survival modeling.
• Stock picking.
• Gambling - picking the horses or winners of matches.
• Managing bookings and no-shows in the transportation and hospitality industries.
• Customer churn.
• Political vote prediction.
• etc.
A Role for ML?
So, given the wide use and acceptance of models that produce an easily understandable linear score, which may or may not be transformed into a probability or scorecard, what role is there for ML in an environment that values explainability?
At first glance, it would seem that ML might add value in a couple of areas:
Binning: if the predictor variables are binned, an ML algorithm could be programmed to do the binning in a manner that is beneficial to the goal of producing a model that generates good predictions. However, in many cases the binning produced by a modeler or an algorithm is overridden by project sponsors due to the desire to manage a particular segment or due to past experience.
Feature selection: modeling software does an acceptable job on this task. Most packages apply a greedy-type algorithm based on variables currently in the model to pick the next variable to enter or to remove a variable from the model. Variables that must be in the model can also be specified. In the statistical models, limited restrictions on weights are available. Some algorithms use the 1-norm to limit the number of predictors in a model. Bayesian approaches can also be used to limit the number of predictors or the magnitude of weights. ML might then be used to select features in a non-sequential manner, learning which features are best in combination.
However, there exists another area where an explainable ML model might add value.
It is illustrated well by looking at a gambling problem.
Going to the Races
Are Horse Race Bettors More Sophisticated Than Bankers?
Bettors at race tracks take information (data) into consideration when placing their
bets. It may include:
• Pole position of the horse
• Jockey and horse past performances
• Number of horses in the race
• Type of race
• Length of race
• Odds on the payoff
• Kelly Criterion
Research has shown that in the US the betting pool on the favorites that generates the odds is a fairly accurate estimation of winning probabilities (see Ziemba), even after the noise added to the above list by bettors who bet on horses' names or jockey colors, etc. is factored in.
In the bulleted list above there is one item that isn’t quite like the others.
Sophisticated gamblers use the Kelly Criterion to decide whether and how much to
bet.
Kelly bet formula
(9)  f = p − (1 − p) * b / g
where:
f is the fraction of the available cash pool to bet
p is the probability of winning
b is the amount lost in the odds ratio given a loss
g is the amount of profit in the odds ratio given a win
The Kelly bet formula maximizes the rate of growth of the bettor's cash pool given the known probability of winning and the known payoff odds to bet, g/b. If f < 0, no bet is made; the Kelly bettor would be betting the wrong side of the odds. Any betting strategy that bets more than f will lead to gambler's ruin.
Sophisticated bettors who use the Kelly Criterion, or bet some fraction of f, factor the payoff into their decision.
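A minimal sketch of equation (9) in code; the 55% / even-odds example is made up for illustration.

```python
# Minimal sketch of the Kelly bet fraction in equation (9).
# p, b and g follow the definitions above.
def kelly_fraction(p, b, g):
    f = p - (1.0 - p) * b / g
    return max(f, 0.0)   # f < 0 means the odds are against you: do not bet

# e.g. a 55% chance of winning at even odds (lose 1 to win 1)
stake = kelly_fraction(p=0.55, b=1.0, g=1.0)   # -> 0.10 of the cash pool
```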
This strategy isn’t built into most scorecard building processes used by financial
institutions. Usually models are built that have 2 components, one for the probability of
outcomes and the other for the gain or loss associated with the outcome.
ML Model Objectives
A useful ML model that might be implemented by those with a bias towards
explainability would have the following features:
1. Easily understood. A scorecard with integer weights is significantly more easily explained and understood than a neural network or random forest.
2. Do something linear regression models don't do. For example, combine the probability of financial gain with the expected gain or loss.
3. Flexibility. The algorithm should be flexible enough to handle different objectives, constraints, hyperparameters, etc.
Possible Applications
Marketing / Credit Application Scoring / Account Management
Marketing models could be improved by incorporating budget constraints into the
model and the expected return and any risk into the objective function.
Credit card holders fall into categories of usage:
• Revolvers — those who carry a balance from statement to statement
• Transactors — those who pay off the outstanding balance every month
• Dormants — those who have a card but do not use it
• Defaulters — those who use the credit but do not pay the balance
• Fraudsters — those who apply for credit cards but have no intention to pay their
balance.
The usual models that are built are probability of default models. These can be either
for account management or to make the accept / reject decision in application scoring.
It isn’t difficult to see that a model that incorporates the expected profit / loss in
account management or application scoring would be beneficial.
Changing the limit on credit cards and lines of credit is another example.
Additionally, application scoring might be improved in the case where the application scorecard is under development (a process that might take a year in a bank) but a second score is desired in order to reject a subset of applications of a given size while minimizing the financial effect on profitability.
Repeated Actions with Costs, Benefits and Constraints
Operational constraints enter into some decisions. For example, fraud referrals and
stock picking.
Possible fraudulent transactions may be referred to the fraud operations department.
Typically, there will be an upper limit on the number of transactions that may be worked in a day, while there is a cost / benefit relationship between rejecting transactions and preventing fraud that portfolio managers emphasize.
Some problems require making a decision over sequential time periods. For example,
every Friday picking 10 stocks to highlight for an investor report.
One More Possible Advantage of an ML Model
Every observation is important in the statistical and mathematical programming approaches. Regression models use all the observations available to estimate or validate the model, based on the likelihood function and any penalty functions. Likewise with the mathematical programming approach, where the objective function is evaluated over all the observations.
It is possible to design an ML model that concentrates only on learning the features to include in a scorecard, and their weights, so as to optimize a more specific objective function.
Picking Stocks — Example
Statement of the Problem
Maximize the expected return over a 13-week (1 quarter) performance period by picking 10 stocks to invest in over a 10-week period of picks, using a scorecard over the available data where the weights in the card are integers.
(10)  max z = ∑_t ∑_{i ∈ t} (∑_j (x_{i,j,t} * w_j) > cutoff_t) * profit_{i,t}
subject to:
(11)  ∑_{i ∈ t} (∑_j (x_{i,j,t} * w_j) > cutoff_t) ≥ 10   ∀ t
where:
(12)  cutoff_t = percentile(scores_{i,t}, 100 − (1/N_t))
(13)  score_i = ∑_j (x_{i,j} * w_j)   ∀ i
N_t is the total number of observations at week t.
t is the week the observation is from.
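To illustrate the objective (10)-(13), here is a minimal sketch of the evaluation step; the function and column names are my own assumptions. It scores every stock with a candidate set of integer points, takes the 10 highest-scoring stocks each week, and sums their returns.

```python
# Minimal sketch of the "oracle" implied by (10)-(13).
import numpy as np
import pandas as pd

def oracle_profit(df, weights, picks_per_week=10):
    """df has columns 'week', 'profit', and the binned 0/1 feature columns
    named in `weights` (a dict of feature -> integer points)."""
    feats = list(weights)
    scores = df[feats].to_numpy() @ np.array([weights[f] for f in feats])
    out = df.assign(score=scores)
    total = 0.0
    for _, wk in out.groupby("week"):
        picks = wk.nlargest(picks_per_week, "score")   # the top-scoring stocks that week
        total += picks["profit"].sum()
    return total
```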
Exclusions: Stocks with current prices < $8. Financial stocks due to too many
investment funds present in the data set.
ML Tasks
The algorithm will have to accomplish 4 tasks.
1. Transform the feature values into binned data.
2. Select a set of the features from (1) for possible model inclusion.
3. Determine a good performing set of features for the model.
4. Estimate a good performing set of integer weights.
Learning
At each iteration, the algorithm generates a guess solution based on solutions it remembers, submits it to the oracle for evaluation (compute the profit on the picks), decides whether or not to remember the solution and, if the memory is full, decides which solution to purge and replace with the current solution.
This is where the opportunity to get creative and experiment lies. So as not to ruin it for those who would like to design such an algorithm, the details of the learning are limited to the paragraph above. Several different approaches can be implemented, including the popular population evolutionary algorithms or search procedures over the combinations.
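For readers who only want the shape of such a loop, one possible skeleton is sketched below; it is an assumption on my part, and the proposal and memory strategies are deliberately left as placeholders since the details above are intentionally left open.

```python
# One possible skeleton of the iterate / evaluate / remember loop described above.
# `propose` and `oracle` are placeholders supplied by the designer.
def learn(propose, oracle, memory_size=50, max_iter=5000, patience=100):
    memory, best, stalled = [], None, 0
    for _ in range(max_iter):
        guess = propose(memory)                 # new candidate scorecard from remembered solutions
        value = oracle(guess)                   # profit of its weekly top-10 picks
        if best is None or value > best[1]:
            best, stalled = (guess, value), 0
        else:
            stalled += 1
        memory.append((guess, value))
        if len(memory) > memory_size:           # memory full: purge the weakest remembered solution
            memory.pop(min(range(len(memory)), key=lambda k: memory[k][1]))
        if stalled >= patience:                 # stop after 100 iterations without improvement
            break
    return best
```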
[Figure: learning flowchart]
Hyperparameters:
Binning. The number of bins is set at 4, plus 1 more if missing values are present. A minimum of 5% of the observations in the model sample have to fall in a bin for it to be a valid bin (see the sketch after these settings).
Max Iterations: Stop after 5000 iterations are performed.
Other Stopping: Stop when the number of different combinations of features in the
memory falls below a certain level. Stop if 100 iterations are performed with no
improvement in the objective function.
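A minimal pandas-based sketch of the binning settings above, written by me for illustration: four quantile bins, one extra bin for missing values, and a check that each bin holds at least 5% of the model sample.

```python
# Minimal sketch of the binning hyperparameters.
import pandas as pd

def bin_feature(x, n_bins=4, min_share=0.05):
    x = pd.Series(x, dtype="float")
    bins = pd.qcut(x, q=n_bins, duplicates="drop").astype(object)   # 4 quantile bins
    bins[x.isna()] = "Missing"                       # the "+1 if missing values are present" bin
    share = bins.value_counts(normalize=True)
    valid = share[share >= min_share].index          # bins holding at least 5% of observations
    return bins, list(valid)
```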
Data
• Stock feature and performance history
• Snapshot
• Weekly at Friday close
• Snapshots taken at 13 individual weeks
• Exclusions
• Stocks from the financial sector -- Banks, Investment funds
• Stocks with a price < $8 at the snapshot point
• The price history is affected significantly by reverse splits
• Features
• 37 candidate features
• Technical / Fundamental / Valuation / General
• Performance History
• Performance Measure
• Return or percent change in Price 13 weeks after the snapshot
• Model Test Segmentation
• 10% of the stocks are held out as a validation sample chosen by
Ticker symbol
• Model sample: 22,852 — Validation sample: 2,903
Results
Population mean return: 8.9%
Model picks return: 32.6%
Validation picks return: 17.9%
These plots show returns over the weeks sampled and the overall return distribution.
[Plots: Sample Returns; Overall distribution of % Return]
Picks (in blue) show significantly less risk than the stocks that aren't picked (in red) on the sample of stock performances the model was built on. The validation sample, which is significantly smaller, still shows increased returns while at the same time reduced risk of loss of investment funds.
Both the model and validation plots are based on the pooled picks of 10 stocks each week. The learning algorithm only sees the results of the model sample observations. Because of this, it isn't unreasonable to expect the validation sample to perform at a lower level: it has to go deeper into the score distribution.
Picking the top 10 scoring stocks each week in the sample produces densities which indicate that a higher performance than the validation sample's might be possible.
The score distribution peaks at slightly less than 0. It is important to remember that the only scores taken into consideration in the ML optimization are the top 10 in each of the 10 weeks in the population sample. In spite of that, the score distribution is spread from -200 to 300.
The second plot shows the value of the objective function for the model sample and the validation sample. The validation sample is shown for informative purposes; it does not enter into the decision to stop the model iterations.
Expectation Plots
The observations in the model sample are binned with cuts at each 5%. For each bin the average score, probability of profit, average gain and average loss are plotted. The average gain and loss are scaled by dividing by the maximum average gain to allow plotting all 3 values on the same scale.
The model sample illustrates that the algorithm is performing as designed. The highest score bin has the best performance, as indicated by having both the highest probability of a gain and the largest spread between average gain and average loss.
(Plot key: Probability, Loss, Gain)
The plot illustrates a problem that exists in building a model to maximize stock investment returns: often the highest returns look a lot like the most risky investments.
The validation sample has more volatile expectations, but the same observations can be made. The best performance is at the high end, although there is a slight reversal in the upper percentiles.
Expected Margin by Score
The expected margin within each of the bins is computed from the values above.
(14)  margin = prob_gain * average_gain − prob_loss * average_loss
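A minimal sketch, with assumed column names, of how the bin statistics behind the expectation plots and the margin in (14) could be computed:

```python
# Minimal sketch: 5% score bins, then probability of profit, average gain,
# average loss and the margin from (14) per bin.
import pandas as pd

def margin_by_score_bin(scores, returns, n_bins=20):   # 20 bins = cuts at each 5%
    df = pd.DataFrame({"score": scores, "ret": returns})
    df["bin"] = pd.qcut(df["score"], q=n_bins, labels=False, duplicates="drop")
    out = df.groupby("bin").apply(
        lambda g: pd.Series({
            "avg_score": g["score"].mean(),
            "prob_gain": (g["ret"] > 0).mean(),
            "avg_gain": g.loc[g["ret"] > 0, "ret"].mean(),
            "avg_loss": -g.loc[g["ret"] <= 0, "ret"].mean(),
        }))
    out["margin"] = (out["prob_gain"] * out["avg_gain"]
                     - (1 - out["prob_gain"]) * out["avg_loss"])
    return out
```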
The model sample shows an almost linear pattern in the margin by score bin.
The validation sample has a lot more
noise, but in general the model
performs well on the validation
sample.
Scorecard — Samples from Model

Price to Earnings
Bin  From     To       Points
1    -inf     10.55    -1
2    10.55    13.29    -7
3    13.29    53.97    10
4    53.97    inf      17
5    Missing  Missing  -17

Performance over Last Quarter - % Change
Bin  From     To       Points
1    -inf     -43.11   -19
2    -43.11   -13.69   -1
3    -13.69   19.92    -1
4    19.92    inf      33

Analyst Recommendation
Bin  Recommendation  Points
1    1               -12
2    2               10
3    3               -2
4    4+              -10
5    Missing         -11
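To show how the sample scorecard reads, here is a minimal sketch that scores one stock against the three tables above; the input values in the example are made up.

```python
# Minimal sketch: look up the points for each binned variable and sum them.
def pe_points(pe):
    if pe is None:   return -17   # Missing
    if pe < 10.55:   return -1
    if pe < 13.29:   return -7
    if pe < 53.97:   return 10
    return 17

def perf_points(chg):
    if chg < -43.11: return -19
    if chg < -13.69: return -1
    if chg < 19.92:  return -1
    return 33

def rec_points(rec):
    if rec is None:  return -11   # Missing
    return {1: -12, 2: 10, 3: -2}.get(rec, -10)   # 4+ -> -10

score = pe_points(15.2) + perf_points(8.4) + rec_points(2)   # 10 + (-1) + 10 = 19
```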
Conclusions
It is possible to build an easily explainable Machine Learning model in the form of a
scorecard that will improve decisions.
The objective function can be specified to target the area of operation and constraints
instead of using a loss function that is applied to all the data.
Both probability and profit can be built into the objective function.
An algorithm to accomplish this task can be set up to output a model in the form of an
easily understandable scorecard allowing end users to check if the model agrees with
their experience.
As the model learns, the solution set that is remembered can be used to provide an
ensemble of predictions if so desired.
End user input can be easily incorporated via binning boundaries or restrictions.
The algorithm does not follow a greedy approach of selecting one predictor (regression) or making one partition (decision trees) and then making successive decisions on feature selection based on those already made; instead it explores combinations of the candidate feature set, remembering those with the best performance.
As a bonus, solving targeted problems may result in a solution that rank orders well
enough over the entire population.
Questions
If you have any questions, suggestions for applications, or a problem and its data for which you would like to see whether an algorithm like this provides a solution, please feel free to email me at bill.fite@miner3.com