This document discusses the problem of over-fitting in statistical models and how to avoid it. Specifically, it addresses:
1) How stepwise regression can lead to over-fitting by continuing to add predictors even when they do not truly improve prediction accuracy.
2) An example where stepwise regression was used to build a model for predicting stock returns using random noise as predictors, resulting in a model that fit the sample data well but failed to generalize.
3) Methods for detecting and avoiding over-fitting, including using the Bonferroni correction and holding back validation data to test model performance on new cases.
2. Overview
Stepwise models
Select the most predictive features from a list of candidate features that you provide, incrementally improving the fit of the model by as much as possible at each step. When automated, the search continues so long as the next feature improves the model enough, as gauged by its p-value.

Over-fitting [1]
If the search is allowed to choose predictors too "easily", stepwise selection will identify predictors that ought not to be in the model, producing an artificially good fit when in fact the model has been getting worse and worse.

Bonferroni rule
The Bonferroni rule lets us halt the search without having to set aside a validation sample, allowing us to use all the data for finding a predictive model rather than a subset. Though the search is automatic, you should still use your knowledge of the context to offer more informed choices of features to consider for the modeling.

[1] For another example of over-fitting when modeling stock returns, see BAUR pages 220-227.
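Because the notes describe the stepwise search only in words (and carry it out in JMP), here is a minimal sketch of forward stepwise selection driven by a "prob to enter" threshold. It assumes a pandas DataFrame X of candidate features and a response Series y; the function name and the use of statsmodels are illustrative choices, not part of the original notes.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

def forward_stepwise(X: pd.DataFrame, y: pd.Series, prob_to_enter: float):
    """Greedy forward selection: at each step add the candidate with the
    smallest p-value, stopping when no candidate beats prob_to_enter."""
    selected, remaining = [], list(X.columns)
    while remaining:
        pvals = {}
        for col in remaining:
            design = sm.add_constant(X[selected + [col]])
            pvals[col] = sm.OLS(y, design).fit().pvalues[col]
        best = min(pvals, key=pvals.get)
        # Stop if even the best remaining candidate fails the entry threshold
        # (or its p-value is undefined because the fit has become saturated).
        if not np.isfinite(pvals[best]) or pvals[best] > prob_to_enter:
            break
        selected.append(best)
        remaining.remove(best)
    return selected
```

The choice of prob_to_enter is exactly the knob discussed in the rest of these notes: a lax value lets the search cascade, while a Bonferroni-style value of 0.05/m makes it very hard for a spurious predictor to enter.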
3. From previous classes...
Cost of uncertainty
An accurate estimate of mean demand improves profits. This suggests that we should use more predictors in models, including more combinations of features that capture synergies among the features (interactions).

Stepwise regression
Automates the tedious process of working through the various interactions and other candidate features.

Problem: Over-confidence
The combination of a desire for more accurate predictions and automated searches that maximize the fitted R2 creates the possibility that our predictions are not as accurate as we think. Over-fitting results when the modeling process leads us to build a model that captures random patterns in the data that will not be present when predicting new cases. The fit of the model looks better on paper than in reality.

Other situations with over-confidence
Subjective confidence intervals
Winner's curse in auctions

Two methods for recognizing and avoiding over-fitting
Bonferroni p-values, which do not require the use of a validation sample in order to test the model.
Cross-validation, which requires setting aside data to test the fit of a model.
4. Over-fitting
False optimism
Is your model as good as it claims? Or has your hard work to improve its fit to the data exaggerated its accuracy?

"Optimization capitalizes on chance."
When we use the same data to both fit and evaluate a model, we get an "optimistic" impression of how well the model predicts. This process, which leads to an exaggerated sense of accuracy, is known as over-fitting. When a model has been over-fit, predictors that appear significant in the output do not in fact improve the model's ability to predict new cases. Perhaps many of the predictors in the model have arrived by chance alone, because we have considered so many possible models.

Over-fitting
Adding features to a model that improve its fit to the observed data, but that degrade the model's ability to predict new cases. Iterative refinement of a model (either manually or by an automated algorithm) in order to improve the usual summaries (e.g., R2 and p-values) typically generates a better fit to the observed data used to pick the predictors than will be seen when predicting new data. No good deed goes unpunished!

It's the process, not the model
Over-fitting does not happen if we pick a large group of predictors and simply fit one big model, without iteratively trying to improve its fit.
5. An Example of Over-fitting (nyse_2003.jmp)
Stock market analysis
Over-fitting is common in domains in which there is a lot of pressure to obtain accurate predictions, as in the case of predicting the direction of the stock market.
Data: daily returns on the NYSE composite index in October and November 2003.
Objective: Build a model to predict what will happen in December 2003, using a battery of 12 trading rules (labeled X1 to X12). These are a few very basic technical trading rules.

Model selection criteria
Many numerical criteria have been proposed as alternatives to maximizing R2 for judging the quality of a model. The table below lists several well-known criteria. To use these in forward stepwise regression, control the forward search by setting the corresponding "Prob-to-enter" value.

Name          Prob-to-Enter   Approximate |t| for inclusion   Idea
Adjusted R2   .33             |t| > 1                         Decrease RMSE
AIC, Cp       .16             |t| > √2                        Unbiased estimate of prediction accuracy
BIC           Depends on n    |t| > √(log n)                  Bayesian probability
Bonferroni    1/m             |t| > √(2 log m)                Minimize worst-case, family-wide error rate
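To make the link between a "Prob-to-enter" value and the implied t-statistic cutoff concrete, here is a small sketch using the normal approximation to the two-sided t test. The first two thresholds come from the table above; the third is the 0.05/m Bonferroni value used later in these notes. The use of scipy is an illustrative choice.

```python
from scipy.stats import norm

# A predictor enters when its p-value falls below prob_to_enter, which for
# moderate sample sizes corresponds roughly to |t| exceeding this normal quantile.
for name, p_enter in [("Adjusted R2", 0.33), ("AIC / Cp", 0.16), ("Bonferroni, 0.05/m, m=90", 0.05 / 90)]:
    cutoff = norm.ppf(1 - p_enter / 2)
    print(f"{name:>25s}: prob-to-enter = {p_enter:.5f}  ->  |t| > {cutoff:.2f}")
```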
6. Search domain for the example
Consider interactions among the 12 exogenous features. The total number of features available to stepwise is then
m = 12 + 12 + 12 × 11/2 = 24 + 66 = 90

Wide data set
There are 42 trading days in October and November. With interactions, we have more features than cases to use: m = 90 > n = 42. Hence we cannot fit the saturated model with all features. [2]

AIC criterion for forward search
Set "Prob to enter" = 0.16 and run the search forward. The stepwise search never stops! A greedy search becomes gluttonous when offered so many choices relative to the number of cases that are available.

[2] You can show that often the best model is the so-called "saturated" model that has every feature included as a predictor in the fit. But you can only do this when you have more cases than features, typically at least 3 per predictor (a crude rule of thumb for the ratio n/m).
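The 90-feature candidate set (12 raw rules, their 12 squares, and the 66 pairwise interactions) can be built mechanically. The sketch below uses scikit-learn's PolynomialFeatures purely to verify the count; the random placeholder columns stand in for the actual trading rules.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import PolynomialFeatures

n, k = 42, 12                                    # 42 trading days, 12 "rules"
X_raw = pd.DataFrame(np.random.default_rng(0).normal(size=(n, k)),
                     columns=[f"X{i+1}" for i in range(k)])

# Degree-2 expansion without the constant column: k raw terms,
# k squares, and k*(k-1)/2 pairwise interactions.
poly = PolynomialFeatures(degree=2, include_bias=False)
X_all = pd.DataFrame(poly.fit_transform(X_raw),
                     columns=poly.get_feature_names_out())
print(X_all.shape)   # (42, 90): m = 90 candidate features, more than the n = 42 cases
```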
7. To avoid the cascade, make it harder to add a predictor; reducing the "Prob to enter" to 0.10 gives this result: the search stops after adding 20 predictors.
Optionally, following a common convention, we can "clean up" the fit and make it appear more impressive by stepping backward to remove collinear predictors that are redundant. The backward elimination removes 3 predictors.
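The backward "clean up" step is described only in words; here is a minimal sketch of how such an elimination pass might look, continuing the hypothetical forward_stepwise helper shown earlier. The 0.10 removal threshold is an assumption chosen to mirror the "Prob to enter" used above, not a value taken from the notes.

```python
import statsmodels.api as sm

def backward_eliminate(X, y, selected, prob_to_remove=0.10):
    """Repeatedly drop the selected predictor with the largest p-value
    until every remaining predictor clears the removal threshold."""
    selected = list(selected)
    while selected:
        fit = sm.OLS(y, sm.add_constant(X[selected])).fit()
        pvals = fit.pvalues.drop("const")   # p-values of the predictors only
        worst = pvals.idxmax()
        if pvals[worst] <= prob_to_remove:
            break
        selected.remove(worst)
    return selected
```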
8. Make the model and obtain the usual summary.
This "Summary of Fit" suggests a great model. Any diagnostic procedure that ignores how we chose the features to include in this model finds no problem. All conclude that this is a great-fitting model, one that is highly statistically significant. Look at all of the predictors whose p-value is < 0.0001; these easily meet the Bonferroni threshold, when applied after the fact.
Summary of Fit
RSquare 0.949
Root Mean Square Error 0.191
Mean of Response 0.177
Observations (or Sum Wgts) 42
Analysis of Variance
Source     DF   Sum of Squares   Mean Square   F Ratio   Prob > F
Model      17   16.214437        0.953790      26.1361   <.0001
Error      24   0.875838         0.036493
C. Total   41   17.090274
Parameter Estimates
Term Est Std Err t Ratio Prob>|t|
Intercept -0.090 0.058 -1.56 0.1317
Exogenous 6 0.093 0.036 2.60 0.0156
Exogenous 9 0.256 0.046 5.59 <.0001
Exogenous 10 0.326 0.058 5.62 <.0001
(Exogenous 2-0.19088)*(Exogenous 3+0.07326) 0.192 0.035 5.52 <.0001
(Exogenous 2-0.19088)*(Exogenous 5-0.11786) 0.181 0.043 4.19 0.0003
(Exogenous 3+0.07326)*(Exogenous 5-0.11786) -0.209 0.038 -5.45 <.0001
(Exogenous 5-0.11786)*(Exogenous 6-0.07955) 0.178 0.030 5.88 <.0001
(Exogenous 8+0.13772)*(Exogenous 8+0.13772) 0.087 0.031 2.78 0.0105
(Exogenous 1+0.21142)*(Exogenous 9-0.32728) -0.412 0.048 -8.66 <.0001
(Exogenous 2-0.19088)*(Exogenous 9-0.32728) 0.198 0.044 4.51 0.0001
(Exogenous 5-0.11786)*(Exogenous 9-0.32728) 0.384 0.062 6.18 <.0001
(Exogenous 6-0.07955)*(Exogenous 10+0.03726) 0.183 0.036 5.05 <.0001
(Exogenous 7-0.23689)*(Exogenous 10+0.03726) 0.252 0.057 4.45 0.0002
(Exogenous 10+0.03726)*(Exogenous 10+0.03726) 0.202 0.027 7.38 <.0001
(Exogenous 2-0.19088)*(Exogenous 11+0.04288) -0.115 0.047 -2.46 0.0215
(Exogenous 6-0.07955)*(Exogenous 11+0.04288) 0.132 0.057 2.30 0.0304
(Exogenous 10+0.03726)*(Exogenous 12+0.18472) 0.263 0.046 5.69 <.0001
9. Visualization
The surface contour shows that there is a lot of curvature in the fit of the model, but unlike the curvature seen in several prior examples, the data do not seem to show visual evidence of the curvature. No pair of predictors appears particularly predictive, although the overall model is. This plot shows the curvature of the prediction formula using predictors 8 and 10 along the bottom. [3]

[3] Save the prediction formula from your regression model. Then select Graphics > Surface Plot and fill the dialog for the variables with the prediction formula as well as the column that holds the response data. To produce such a plot, you need a recent version of JMP.
10. Common Sense Test: Hold back some data
Question
Is this fit an example of the ability of multiple regression to find "hidden effects" that simpler models miss? There is no real substance to rely upon to find an explanation for the model. We have more explanatory variables than we can sensibly interpret.

Simple idea (cross-validation)
Reserve some data in order to test the model, such as the next month of returns. Fit the model to a training/estimation sample, then predict the cases in a test/validation sample.

Catch-22
How much should be reserved, or set aside, for checking the model? There is no clear-cut answer.
Save a little. This choice leaves too much variation in your measure of how well the model has done. A model might look good simply by chance. If we were to reserve only, say, 5 cases to test the model, then it might "get lucky" and predict these 5 well simply by chance.
Save a lot. This choice leaves too few cases available to find good predictors. We end up with a good estimate of the performance of a poor model. When trying to improve a model or find complex effects, we will do better with more data to identify the effects.
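The hold-back idea can be sketched in a few lines of code: fit on an estimation window, score a later validation window, and compare the two error estimates. The data frame, column names, and plain OLS model below are placeholders standing in for the NYSE data and the stepwise fit, not the actual analysis from the notes.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

def holdback_check(df: pd.DataFrame, features, response, n_holdback: int):
    """Fit on all but the last n_holdback rows, then compare the in-sample RMSE
    with the RMSE on the held-back rows."""
    design = lambda d: sm.add_constant(d[features], has_constant="add")
    train, test = df.iloc[:-n_holdback], df.iloc[-n_holdback:]
    fit = sm.OLS(train[response], design(train)).fit()
    rmse = lambda d: float(np.sqrt(np.mean((d[response] - fit.predict(design(d))) ** 2)))
    return rmse(train), rmse(test)

# Hypothetical usage: hold back roughly one month (about 21 trading days) of returns.
# in_rmse, out_rmse = holdback_check(returns_df, selected_features, "return", 21)
```

An over-fit model shows up as a large gap between the two numbers: a small in-sample RMSE alongside a much larger hold-back RMSE.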
11. Example of Predicting Stocks
What happens in December?
The model that looks so good on paper flops miserably when put to this simple test. The fitted equation predicts the estimation cases remarkably well, but produces large prediction errors when extended out-of-sample to the next month.

Plot of the prediction errors
Left: in-sample errors, the residuals from the fitted model. Right: out-of-sample errors in the forecast period. [4] The residuals are small during the estimation period (October through November), in contrast to the size of the errors when the model is used to predict the returns on the NYSE during December.

[Figure: prediction error plotted against Cal_Date from 20031001 to 20040101; the residuals hug zero through October and November, while the December forecast errors are far larger.]

This model has been over-fit, producing poor forecasts for December. The usual summary statistics conceal the selection process that was used to identify the model.

[4] The horizontal gaps between the dots are the weekends or holidays.
12. What are those other predictors?
Random noise!
The 12 basic features X1, X2, ..., X12 that were called "technical trading rules" are in fact columns of simulated samples from normal distributions. [5] Any model that uses these as predictors over-fits the data.

But the final model looks so good!
True, but the out-of-sample predictions show how poor it is. A better prediction would be to use the average of the historical data instead. In this example, we know (because the "exogenous rules" are simulated random noise) that the true coefficients for these variables are all zero.

Why doesn't the final overall F-ratio find the problem?
The standard test statistics work "once", as if you postulated one model before you saw the data. Stepwise tries hundreds of variables before choosing these. Finding a p-value less than 0.05 is not unusual if you look at, say, 100 possible features. Among these, you would expect to find 5 whose p-value is < 0.05 by chance alone.

Cannot let the stepwise procedure add such variables
In this example, the first step picks the worst variable: one that actually adds nothing but claims to do a lot. The effect of adding this spurious predictor is to bias the estimate of error variation. That is, the RMSE is now smaller than it should be. The bias inflates the t-statistics for every other feature.

[5] Thereby giving away my opinion of many technical trading rules.
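The "5 out of 100 by chance alone" arithmetic is easy to check by simulation. The sketch below regresses a pure-noise response on 100 unrelated noise predictors, one at a time, and counts how many clear p < 0.05; the seed and sample size are arbitrary choices for illustration.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n, m = 200, 100
y = rng.normal(size=n)                        # response unrelated to every predictor
X = rng.normal(size=(n, m))

false_positives = 0
for j in range(m):
    fit = sm.OLS(y, sm.add_constant(X[:, j])).fit()
    if fit.pvalues[1] < 0.05:                 # p-value of the slope coefficient
        false_positives += 1
print(false_positives)                        # about 5 on average (0.05 * 100)
```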
13. Source of the cascade
Suppose stepwise selection incorrectly picks a predictor
that it should not have, one for which β = 0.
The reason that it picks the wrong predictor is that, by
chance, this predictor explains a lot of variation (has a large
correlation with the response, here stock returns). The
predictor is useless out-of-sample but looks good within the
estimation sample.
As a result, the model looks better while actually
performing worse. The result is a biased estimate of the
amount of unexplained variation: the RMSE gets smaller
when in fact the model fits worse; it should be larger, not
smaller, after adding this feature.
The biased RMSE, being too small, makes all of the other
features look better; t-statistics of features that are not yet in
the model suddenly get larger than they should be.
These inflated t-stats make it easier to add other useless
features, forming a cascade as more spurious predictors join
the model (like the EverReady bunny, it keeps going and going).
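The cascade can be seen in a small simulation, a sketch with assumed sizes rather than the actual stock data: let a stepwise-style search grab the single noise column most correlated with the response, then compare the residual SD and the t-statistics of the remaining noise columns before and after.

    # Sketch: adding the "luckiest" noise predictor typically shrinks the RMSE
    # relative to the intercept-only fit and inflates the t-statistics of the
    # remaining (useless) features. Simulated data; sizes are assumptions.
    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(3)
    n, k, sigma = 40, 12, 1.0
    y = rng.normal(0, sigma, n)                  # response: pure noise, true betas = 0
    X = rng.normal(size=(n, k))                  # 12 noise "trading rules"

    # First step of a stepwise search: pick the column with the largest |correlation|.
    corr = np.array([np.corrcoef(X[:, j], y)[0, 1] for j in range(k)])
    best = int(np.argmax(np.abs(corr)))

    base = sm.OLS(y, np.ones((n, 1))).fit()                    # intercept-only model
    with_best = sm.OLS(y, sm.add_constant(X[:, best])).fit()   # after the first step

    print("true error SD              :", sigma)
    print("RMSE, intercept only       :", round(np.sqrt(base.mse_resid), 3))
    print("RMSE, after adding 'best'  :", round(np.sqrt(with_best.mse_resid), 3))

    # |t| of each remaining feature, with and without the spurious predictor in the model.
    t_alone, t_after = [], []
    for j in range(k):
        if j == best:
            continue
        t_alone.append(abs(sm.OLS(y, sm.add_constant(X[:, j])).fit().tvalues[1]))
        t_after.append(abs(sm.OLS(y, sm.add_constant(X[:, [best, j]])).fit().tvalues[2]))
    print("mean |t| of other features, alone          :", round(np.mean(t_alone), 2))
    print("mean |t| of other features, with spurious  :", round(np.mean(t_after), 2))

On a typical run the RMSE drops below the intercept-only value and the remaining noise features' t-statistics creep upward, which is exactly the mechanism that feeds the cascade.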
Protection from Over-fitting
Many have been "burned" by using a method like stepwise
regression and over-fitting. A frequently heard complaint:
"The model looked fine when we built it, but when we
rolled it out in the field it failed completely. Statistics is
useless. Lies, damn lies, statistics."
Protections from over-fitting include the following:
(a) Avoid automatic methods
Sure, and why not use an abacus, slide rule, and normal
table while you're at it? It's not the computer per se, but
rather the shoddy way that we have used the automatic
search. The same concerns apply to tedious manual
searches as well.
(b) Arrogant: Stick to substantively-motivated predictors
Are you so confident that you know all there is to know
about which factors affect the response?
This is particularly troubling when it comes to interactions.
Even so, you can use stepwise selection as a diagnostic
after picking a model. That is, use stepwise to learn
whether a substantively motivated model has missed
structure.
Start with a non-trivial substantively motivated model. It
should include the predictors that your knowledge of the
domain tells you belong. Then run stepwise to see whether
it finds other things that might be relevant.
(c) Cautious: Use a more stringent threshold
Add a feature only when the results are convincing that the
feature has a real effect, not a coincidence.
We can do this by using the Bonferroni rule. If you have a
list of m candidate features, then set "Prob to enter" =
0.05/m.
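A minimal forward-stepwise loop with a Bonferroni "prob to enter" might look like the sketch below. This is a generic illustration, not JMP's implementation; the data and the helper name bonferroni_stepwise are hypothetical.

    # Sketch: forward stepwise selection that only admits a feature whose p-value
    # beats the Bonferroni threshold 0.05/m. Generic illustration, not JMP's algorithm.
    import numpy as np
    import statsmodels.api as sm

    def bonferroni_stepwise(X, y, alpha=0.05):
        """Greedy forward selection with prob-to-enter = alpha / (number of candidates)."""
        m = X.shape[1]
        threshold = alpha / m
        selected = []
        while True:
            best_j, best_p = None, threshold
            for j in range(m):
                if j in selected:
                    continue
                fit = sm.OLS(y, sm.add_constant(X[:, selected + [j]])).fit()
                p = fit.pvalues[-1]              # p-value of the candidate just added
                if p < best_p:
                    best_j, best_p = j, p
            if best_j is None:                   # nothing beats the Bonferroni threshold
                return selected
            selected.append(best_j)

    # Example with pure-noise candidates: the search should usually select nothing.
    rng = np.random.default_rng(4)
    X, y = rng.normal(size=(60, 90)), rng.normal(size=60)
    print("selected columns:", bonferroni_stepwise(X, y))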
Bonferroni = Right Answer + Added Bonus
What happens in the stock example?
Set the Prob-to-enter threshold to 0.05 divided by m, the
number of features being considered.
In this example, the number of considered features is
12 "raw" + 12 "squares" + 12×11/2 "interactions" = 90
"Prob to enter" = 0.05/90 = 0.00056
Remove all of the predictors from the stepwise dialog,
change the "Prob to enter" field to 0.00056, and click go.6
The search finds the right answer: it adds nothing! No
predictor enters the model, and we're left with a regression
with just an intercept.
None should be in the model; the "null model" is the truth.7
The "technical trading rules" used as predictors are random
noise, totally unrelated to the response.
Added bonus
The use of the Bonferroni rule for guiding the selection
process avoids the need to reserve a validation sample in
order to test your model and avoid over-fitting.
Just set the appropriate "Prob to enter" and use all of the
data to fit the model. A larger sample allows the modeling
to identify more subtle features that would otherwise be
missed.
6 JMP rounds the p-to-enter value shown in the box of the stepwise dialog, even though the underlying code uses the exact value that you entered.
7 Some of the predictors in the stepwise model claim to have p-values that pass the Bonferroni rule. Once stepwise introduces noise into the regression, it can add more and more variables that appear to look fine. You need to apply Bonferroni before adding the variables, not after.
Other Applications of the Bonferroni Rule
You can (and generally should) use the Bonferroni rule in
other situations in regression as well.
Any time that you look at a collection of p-values to judge
statistical significance, consider using a Bonferroni
adjustment to the p-values.
Testing in multiple regression
Suppose you fit a multiple regression with 5 predictors.
No selection or stepwise, just fit the model with these
predictors.
How should you judge the statistical results?
Two-stage process
(1) Check the overall F-ratio, shown in the Anova summary
of the model. This tests whether the R2 of the model is
large given the number of predictors in the fitted model
and the number of observations.
(2) If the overall F-ratio is statistically significant, then
consider the individual t-statistics for the coefficients
using a Bonferroni rule for these.
Suppose the model as a whole is significant, and you have
moved to the individual slopes. If you are looking at p-
values of a model with 5 predictors, then compare them to
0.05/5 = 0.01 before you get excited about finding a
statistically significant effect.
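In code, the two-stage check might look like the following sketch, using statsmodels on hypothetical data; the 5-predictor setup is only an example.

    # Sketch of the two-stage check: overall F-test first, then Bonferroni-adjusted
    # t-tests on the individual slopes. Hypothetical data with 5 predictors.
    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(5)
    n, k = 100, 5
    X = rng.normal(size=(n, k))
    y = 2.0 + 0.5 * X[:, 0] + rng.normal(size=n)      # only the first slope is real

    fit = sm.OLS(y, sm.add_constant(X)).fit()

    # Stage 1: is the model as a whole significant?
    if fit.f_pvalue < 0.05:
        # Stage 2: judge each slope against the Bonferroni cutoff 0.05 / k.
        cutoff = 0.05 / k
        for j, p in enumerate(fit.pvalues[1:], start=1):
            verdict = "significant" if p < cutoff else "not significant"
            print(f"x{j}: p = {p:.4f} -> {verdict} at Bonferroni cutoff {cutoff:.3f}")
    else:
        print("Overall F-test is not significant; stop here.")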
Tukey comparisons
The use of Tukey-Kramer comparisons among several
means is an alternative way to avoid claiming artificial
statistical significance in the specific case of comparing
many averages.
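For completeness, here is a sketch of a Tukey-Kramer comparison of several group means using statsmodels; the four groups and their values are made up purely to show the call.

    # Sketch: Tukey-Kramer (Tukey HSD) comparisons among several group means.
    import numpy as np
    from statsmodels.stats.multicomp import pairwise_tukeyhsd

    rng = np.random.default_rng(6)
    labels = ["A", "B", "C", "D"]
    true_means = {"A": 0.0, "B": 0.0, "C": 0.8, "D": 0.0}    # one group differs
    groups = np.repeat(labels, 25)
    values = np.concatenate([rng.normal(true_means[g], 1.0, 25) for g in labels])

    # Adjusted pairwise comparisons of all group means.
    print(pairwise_tukeyhsd(values, groups, alpha=0.05))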
Detecting Over-fitting with a Validation Sample
Bonferroni is not always possible.
Some methods do not allow this type of control on over-
fitting because they do not offer p-values.
Reserve a validation sample
It is common in time series modeling to set aside future
data to check the predictions from your model. We did it
with the stocks without giving it much thought.8
Divide the data set into two batches, one for fitting the
model and the second for evaluating the model.
The validation sample should be "locked away": excluded
from the modeling process, and certainly not "shown" to
the search procedure.
Software issues
JMP's "Column Shuffle" command makes this separation
into two batches easy to do. For example, a formula column
can label a random sample of 50 cases (rows) as validation
cases, with the rest labeled as estimation cases.9
Then use the "Exclude" & "Hide" commands from the
rows menu to set aside and conceal the validation cases.
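Outside JMP, the same separation takes only a couple of lines. The sketch below mirrors the formula described above on a hypothetical pandas DataFrame df, labeling a random sample of 50 rows as validation cases.

    # Sketch: label a random sample of 50 rows as validation cases and the rest
    # as estimation cases. `df` is a stand-in for the modeling data.
    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(7)
    df = pd.DataFrame({"y": rng.normal(size=200), "x": rng.normal(size=200)})   # stand-in data

    validation_rows = rng.choice(df.index, size=50, replace=False)
    df["role"] = np.where(df.index.isin(validation_rows), "validation", "estimation")

    estimation = df[df["role"] == "estimation"]     # use for fitting / stepwise search
    validation = df[df["role"] == "validation"]     # lock away until the model is chosen
    print(df["role"].value_counts())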
8 At some point with time series models, you won't be able to set aside data. If you're trying to predict tomorrow, do you really want to use a model fit to data that is a month old?
9 Only 47 cases appear in the validation sample in the next example because it so happened that 3 excluded outliers fall among the validation cases.
Questions when using a validation sample
1. How many observations should I put into the validation
sample?
2. How can I use the validation sample to identify over-
fitting?
In the blocks example introduced in Module 7, we have n =
200 runs to build a model.10 That produces the following
paradox:
If we set aside, say, half for validation, then we'll have a
hard time finding good predictors.
On the other hand, if we set aside only, say, 10 cases for
validation, these may be insufficient to give a valid
impression of how well the model has done. A fit might
do well on these 10 by chance.
Multi-fold cross-validation
A better alternative, if we had the software needed to
automate the process, repeats the validation process over
and over.
5-fold cross-validation:
Divide the data into 5 subsets, each with 20% of the cases.
Fit your model on 4 subsets, then predict the other. Do
this 5 times, each time omitting a different subset.
Accumulate the prediction errors.
Repeat!
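If software is available, the procedure is easy to automate. Here is a minimal sketch with scikit-learn's KFold on made-up data; the linear model is just a stand-in for whatever model is being validated.

    # Sketch: 5-fold cross-validation. Fit on 4 folds, predict the held-out fold,
    # and accumulate the prediction errors. Data and model are stand-ins.
    import numpy as np
    from sklearn.model_selection import KFold
    from sklearn.linear_model import LinearRegression

    rng = np.random.default_rng(8)
    X = rng.normal(size=(200, 6))
    y = 1.0 + X[:, 0] - 0.5 * X[:, 1] + rng.normal(size=200)

    errors = []
    for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
        model = LinearRegression().fit(X[train_idx], y[train_idx])
        errors.append(y[test_idx] - model.predict(X[test_idx]))

    cv_rmse = np.sqrt(np.mean(np.concatenate(errors) ** 2))
    print("5-fold cross-validated RMSE:", round(cv_rmse, 3))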
10 So, why not go back to the client and say "I need more data"? Getting data is expensive unless it's already been captured in the system. Often, as in this example, the features for each run have to be found by manually searching back through records.
Controlling Stepwise with a Validation Sample (block.jmp)
The prior version of the cost-accounting model had 15 predictors,
with an R2 of 69% and RMSE of $5.80.
Using the Bonferroni rule to control the stepwise search gives
the model shown on the next page...
It is hard to count how many predictors JMP can choose
from because categorical terms get turned into several
dummy variables. We can estimate m by counting the
number of "screens" needed to show the candidate features.
With m ≈ 385 features to consider, the Bonferroni threshold
for the "Prob to enter" criterion is
0.05/385 = 0.00013
The resulting model appears on the next page. It is more
parsimonious and does not claim the precision produced by
the prior search.
The model has 4 predictors, with R2 = 0.47, RMSE = $6.80
It also avoids weird variables like the type of music!
Actual by Predicted Plot
[Figure: Ave_Cost Actual vs. Ave_Cost Predicted, P < .0001, RSq = 0.47, RMSE = 6.83]
Summary of Fit
  RSquare                      0.465
  RSquare Adj                  0.454
  Root Mean Square Error       6.834
  Mean of Response            39.694
  Observations (or Sum Wgts)     197

Analysis of Variance
  Source     DF   Sum of Squares   Mean Square   F Ratio   Prob > F
  Model       4         7800.251       1950.06     41.75     <.0001
  Error     192         8967.959         46.71
  C. Total  196        16768.210

Parameter Estimates
  Term                                            Estimate   Std Error   t Ratio   Prob>|t|
  Intercept                                          20.22        1.84     10.97     <.0001
  Labor_hrs                                           38.68        4.17      9.27     <.0001
  (Abstemp-4.6)*(Abstemp-4.6)                          0.07        0.01      6.09     <.0001
  (Cost_Kg-1.8)*(Materialcost-2.3)                     0.86        0.15      5.69     <.0001
  (Manager{J-R&L}+0.22)*(Brkdown/units-0.00634)     -372.50       89.07     -4.18     <.0001
Leverage plots suggest that the model has found some
additional highly leveraged points that were not identified
previously.
What should we do about these?
What can we learn from these?
[Leverage plots: Ave_Cost leverage residuals vs. Abstemp*Abstemp (P < .0001) and vs. Cost_Kg*Materialcost (P < .0001).]
Visualization of the model reveals some of its structure.11
These plots are more interesting if you color-code the
points for old and new plants.
Do you see the two groups of points?
11 JMP will produce a surface plot only for models produced by Fit Model.
Back to Business
Allure of fancy tools
It is easy to become so enamored of fancy tools that you
lose sight of the problem that you're trying to solve.
The client wants a model that predicts the cost of a
production run.
We've now learned enough to be able to return to the client
with questions of our own. We're doing much better than
the naïve initial model (5 predictors, R2 = 0.30 versus the
improved model with only 4 predictors yet higher R2 =
0.47).
What questions should you ask the client in order to
understand what's been found by the model?
What are those leveraged outliers?
What's up with temperature controls? Do these have the
same effect in both plants? (You'll have to do some data
analysis to answer this one.)
What do you make of the categorical factor?
In other words...
Stepwise methods leave ample opportunity to exploit what
you know about the context. You can design more
sensible features to consider by using what you "know"
about the problem.
Ideally, by simplifying the search for additional predictors,
stepwise methods (or other search technologies) allow you
to have more time to think about the modeling problem.
Here are a few substantively motivated comments:
The features 1/Units and Breakdown/Units make more
sense (and are more interpretable) as ways of tracking
fixed costs.
Similarly, why use Cost/Kg when you can figure out the
material cost as the product cost/kg × weight?
Finally, make note of the so-called nesting of managers
within the different plants. Consider the following table:
Plant by Manager (counts)
          JEAN   LEE   PAT   RANDY   TERRY   Total
  NEW       40     0     0       0      30      70
  OLD        0    44    42      41       0     127
  Total     40    44    42      41      30     197
Jean and Terry work in the new plant, with the others
working in the old plant. Can you compare Jean to Lee,
for example? Or does that amount to comparing the two
plants?
These two features, Manager and Plant, are confounded
and cannot be separated by this analysis. (We can,
however, compare Jean to Terry since they do work in
the same plant.)
Appendix: Bonferroni Method
The Bonferroni Inequality
The Bonferroni inequality (a.k.a. Boole's inequality) gives a
simple upper bound for the probability of a union of events. If
you simply ignore the double counting, then it follows that
P(E1 or E2 or ... or Em) ≤ P(E1) + P(E2) + ... + P(Em)
In the special case that all of the events have equal probability
p = P(Ej), this reduces to
P(E1 or E2 or ... or Em) ≤ m p
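A short simulation (independent events with an arbitrary p, purely illustrative) confirms numerically that the probability of the union stays below m p.

    # Sketch: numerical check of the Bonferroni (Boole) bound with independent events.
    import numpy as np

    rng = np.random.default_rng(9)
    m, p, reps = 20, 0.03, 200_000
    events = rng.random((reps, m)) < p            # m independent events, each with prob p
    p_union = events.any(axis=1).mean()           # estimated P(E1 or ... or Em)
    print("estimated P(union):", round(p_union, 4), "   bound m*p:", m * p)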
Use in Model Selection
In model selection for stepwise regression, we start with a list
of m possible features of the data that we consider for use in
the model. Often, this list will include interactions that we
want considered in the model but are not really very
sure about.
If the list of possible predictors is large, then we need to avoid
"false positives": adding a variable to the model that is not
actually helpful. Once the modeling begins to add unneeded
predictors, it tends to "cascade" by adding more and more.
We'll avoid this by trying never to add a predictor that's not
helpful.
Bonferroni Rule for p-values
Let the events E1 through Em denote errors in the modeling,
adding the jth variable when it actually does not affect the
response. The chance for making any error when we consider
all m of these is then
P(some false positive) = P(E1 or E2 or ... or Em) ≤ m p
If we add a feature as a predictor in the model only if its p-value
is smaller than 0.05/m, say, then the chance of incorrectly
including a predictor is less than
P(some false positive) ≤ m × (0.05/m) = 0.05
There's only a 5% chance of making any mistake.
It's really pretty good
Some would say that using this so-called "Bonferroni rule" is
too conservative: it makes it too hard to find useful predictors.
It's actually not so bad.
(1) For example, suppose that we have m = 1000 possible
features to sort through. Then the Bonferroni rule says to
add a feature only if its p-value is smaller than 0.05/1000 =
0.00005.
That seems really small at first, but convert it to a t-ratio.
How large (in absolute size) does the t-ratio need to be in
order for the p-value to be smaller than 0.00005? The answer
is about 4.6.
In other words, once the t-ratio is larger than around 5, a
model selection procedure will add the variable. A t-ratio
of 5 does not seem so unattainable. Sure, it requires a large
effect, but with so many possibilities, we need to be careful.
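The cutoff can be computed directly. The sketch below does so for a few assumed residual degrees of freedom; the exact value depends on the degrees of freedom, which is why the figure quoted above is approximate and the "around 5" rule of thumb is safe.

    # Sketch: what t-ratio corresponds to a two-sided p-value of 0.05/1000 = 0.00005?
    # The residual degrees of freedom below are assumptions for illustration.
    from scipy import stats

    p_cut = 0.05 / 1000
    for df in (30, 100, 1000):
        t_needed = stats.t.ppf(1 - p_cut / 2, df)
        print(f"df = {df:4d}: need |t| > {t_needed:.2f}")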
(2) Another way to see that Bonferroni is pretty good is to put
a lower bound on the probability of a false positive. If all of
the events are independent, then
P(some false positive) = 1 - P(none)
                       = 1 - P(E1^c and E2^c and ... and Em^c)
                       = 1 - P(E1^c) × P(E2^c) × ... × P(Em^c)
                       = 1 - (1 - p)^m
                       = 1 - e^(m log(1 - p))
                       ≥ 1 - e^(-m p)
and the last step follows because log(1 + x) ≤ x, so that
m log(1 - p) ≤ -m p.
Combined with the Bonferroni inequality, we have (for
independent tests)
1 - e^(-m p) ≤ P(some false positive) ≤ m p
This table summarizes the implications. It shows that as m
grows and p gets smaller, the bounds from these inequalities
are really very tight.
   m      p        m p     Bounds
  50      0.01     0.50    0.39 to 0.50
  50      0.005    0.25    0.22 to 0.25
 100      0.0001   0.01    0.0099 to 0.0100
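The table entries can be reproduced directly from the two formulas above; nothing beyond the bounds themselves is used here.

    # Sketch: reproduce the bounds 1 - exp(-m*p) <= P(some false positive) <= m*p.
    import numpy as np

    for m, p in [(50, 0.01), (50, 0.005), (100, 0.0001)]:
        lower = 1 - np.exp(-m * p)
        print(f"m = {m:3d}, p = {p:<7}  bounds: {lower:.4g} to {m * p:.4g}")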