This document introduces a Python package called loss_mob for automated variable transformation in property and casualty loss models. The package uses monotonic binning algorithms to transform raw predictor variables into new variables with improved linearity with the target variable (e.g. loss amount). It screens variables to identify important drivers, applies different binning algorithms including gradient boosting machine binning, and handles missing values. The document demonstrates the package's functionality on a motor third-party liability claims dataset, screening variables, transforming variables, and visualizing the results. The package aims to automate and streamline tedious data preparation tasks to allow modelers to focus on modeling methodology.
2. Opportunities in P&C Modeling
A tremendous amount of effort is spent on data preparation and exploration work that can be automated and streamlined. Let machines handle the tedious data work so that modelers can focus on modeling methodology and statistical inference.
Model development data workflow: Model Development Data -> Data Screen -> Anomaly Treatment -> Data Transformation -> Predictive Ranking.
Data Screen: filter redundant data fields; retain relevant information.
Anomaly Treatment: impute missing values; winsorize data outliers; recode special values.
Data Transformation: explore data distributions; identify the best transformation to improve linearity.
Predictive Ranking: assess variable predictiveness; identify important model drivers.
Data preparation consumed 50+% of the time in model development.
Heterogeneous data sources: credit, vehicle, telematics, geographic.
3. Banking Practice
In retail credit risk models, the Weight of Evidence (WoE) transformation* has been widely used to improve the efficiency of model development:
$$\mathrm{WoE}_i = \ln\left(\frac{\#\{Y=1 \text{ in } i\text{th category}\}\,/\,\#\{Y=0 \text{ in } i\text{th category}\}}{\text{total } \#\{Y=1\}\,/\,\text{total } \#\{Y=0\}}\right)$$

i.e. the log of the ratio between the odds in the 𝑖th category and the overall odds.
The number of categories (i.e. bins) is derived from discretization on the 𝑿 vector, with missing values being handled differently.
In consideration of regulatory scrutiny and model interpretation, a strict monotonicity is assumed between 𝑿 and 𝑾𝒐𝑬𝑿.
All monotonic functions of 𝑿, e.g. logarithm, exponential, or linear, should converge to the same monotonic 𝑾𝒐𝑬𝑿 transformation.
* https://pypi.org/project/py-mob/
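For illustration, below is a minimal sketch of the per-bin WoE calculation written with plain pandas rather than the py-mob package referenced above; bin labels are assumed to be pre-assigned, and zero counts (which would need smoothing in practice) are ignored.

```python
import numpy as np
import pandas as pd

def woe_table(xbin, y):
    # xbin: bin label per record; y: binary outcome coded 0/1
    df = pd.DataFrame({"bin": xbin, "y": np.asarray(y)})
    tot1, tot0 = (df["y"] == 1).sum(), (df["y"] == 0).sum()
    # WoE_i = ln( (#{Y=1 in bin i} / #{Y=0 in bin i}) / (total #{Y=1} / total #{Y=0}) )
    woe = df.groupby("bin")["y"].apply(
        lambda s: np.log(((s == 1).sum() / (s == 0).sum()) / (tot1 / tot0)))
    return woe.rename("woe").reset_index()
```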
4. Adoption in P&C Models
For P&C loss models, a modified approach is proposed to mimic the idea of the 𝑾𝒐𝑬 transformation, as shown below.
$$F(X_i) = \ln\left(\frac{\text{losses in } i\text{th category}\,/\,\#\text{ of cases in } i\text{th category}}{\text{total losses}\,/\,\text{total }\#\text{ of cases}}\right)$$
The interpretation of 𝑭(𝑿𝒊) is intuitive: it is the logarithm of the ratio between the average loss in the 𝑖th category and the overall average loss.
With missing values falling into a standalone category or being combined with a similar neighbor, no special treatment (i.e. imputation) is necessary.
Since the transformation projects raw values of 𝑿 into the data space of 𝒀 based on their rank order, concerns around outliers in 𝑿 are neutralized.
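As a minimal sketch of this idea (not the loss_mob implementation), the average-loss version of the transformation can be computed per bin with pandas, again assuming bin labels have already been assigned by some discretization:

```python
import numpy as np
import pandas as pd

def loss_ratio_transform(xbin, loss):
    # xbin: bin label per record; loss: observed loss amount per record
    df = pd.DataFrame({"bin": xbin, "loss": np.asarray(loss, dtype = float)})
    overall_avg = df["loss"].mean()                      # total losses / total # of cases
    # F(X_i) = ln( average loss in bin i / overall average loss )
    # note: bins with zero loss would need a floor in practice
    newx = df.groupby("bin")["loss"].mean().div(overall_avg).pipe(np.log)
    return newx.rename("newx").reset_index()
```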
5. Outline of Loss_Mob Package
The Python package loss_mob (https://pypi.org/project/loss-mob) is my weekend project, an attempt to tackle the most tedious yet critical task in P&C loss model development.
Core Functionality
Variable Information: coefficient of variation; Spearman and distance correlation coefficients; mutual information score; Gini coefficient.
Binning Algorithms: fine binning based on GBM or isotonic regression; coarse binning based on density or value range; customized binning based on user inputs.
Utility Functions: tabulation of binning results; application of binning outcomes to new data; verification of data transformations; sMAPE for model performance.
6. Demo Based on MTPL Data
The French Motor Third-Party Liability (MTPL) Claims dataset from OpenML* is used in the subsequent demo.
import loss_mob as mob, pandas as pd, numpy as np, statsmodels.api as sm
dt = mob.get_mtpl() # https://github.com/dutangc/CASdatasets
dt.keys()
# dict_keys(['idpol', 'claimnb', 'exposure', 'area', 'vehpower', 'vehage', 'drivage', 'bonusmalus',
#            'vehbrand', 'vehgas', 'density', 'region', 'claimamount', 'purepremium'])
pd.DataFrame(dt).head(3)
… vehpower vehage drivage bonusmalus vehbrand vehgas density region claimamount purepremium
… 7 1 61 50 B12 Regular 27000 R11 303.00 404.0000
… 12 5 50 60 B12 Diesel 56 R25 1981.84 14156.0000
… 4 0 36 85 B12 Regular 4792 R11 1456.55 10403.9286
* https://www.openml.org
7. Variable Screening
The screen() function assesses the association between each 𝑿 and 𝒀.
The consistent magnitude between the Spearman and distance correlations indicates a strong linear association in the context of GLM.
# variable list to screen
vlst = ["vehpower", "vehage", "drivage", "bonusmalus", "density"]
# screen through each attribute
summ = [{"variable": _, **mob.screen(dt[_], dt["purepremium"])} for _ in vlst]
# sort the summary by distance correlation
pd.DataFrame(sorted(summ, key = lambda x: -x["distance correlation"]))
variable … coefficient of variation spearman correlation distance correlation gini coefficient
bonusmalus … 0.261651 0.057169 0.043454 0.364684
drivage … 0.310719 -0.004906 0.014289 0.319361
density … 2.208544 0.020221 0.011069 0.075396
vehage … 0.804375 0.019526 0.010801 0.093274
vehpower … 0.317741 0.002307 0.003570 0.026760
8. Variable Screening in Parallel
Scalability is at the heart of the development philosophy. Functions in loss_mob can be easily parallelized and scaled to ~1,000+ predictors.
# first, define a wrapper to be consumed by the parallel map
def pscreen(v):
    return {"variable": v, **mob.screen(dt[v], dt["purepremium"])}

# next, load necessary modules
from multiprocessing import Pool, cpu_count
from contextlib import closing

with closing(Pool(processes = cpu_count())) as pool:
    psum = pool.map(pscreen, vlst)
    pool.terminate()
pd.DataFrame(sorted(psum, key = lambda x: -x["gini coefficient"])).head(3)
variable ... spearman correlation distance correlation gini coefficient
bonusmalus ... 0.057169 0.043454 0.364684
drivage ... -0.004906 0.014289 0.319361
vehage ... 0.019526 0.010801 0.093274
10. Visual of Variable Transformation
Because 𝐹(𝑋), i.e. 𝑵𝒆𝒘𝑿, is strictly linear with respect to 𝐿𝑛(𝑌), the linearity of model predictors in the GLM is enhanced.
Each category of 𝐹(𝑋) is an aggregation over a segment of records. As a result, the model estimated with the transformed 𝑿 should be more stable and less prone to overfitting.
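The plot behind this slide can be approximated with a few lines of matplotlib; this is only a sketch and assumes, as the later demo suggests, that mob.cal_newx() returns one record per input value in the original order.

```python
import matplotlib.pyplot as plt

# Illustrative only: visualize the relationship between NewX and Ln(average loss).
bin0 = mob.gbm_bin(dt["bonusmalus"], dt["purepremium"])
out0 = mob.cal_newx(dt["bonusmalus"], bin0)                  # assumed to preserve record order
tmp = pd.DataFrame({"newx": [_["newx"] for _ in out0], "y": dt["purepremium"]})
avg = tmp.groupby("newx")["y"].mean()                        # average loss per transformed value
plt.plot(avg.index, np.log(avg.values), "o-")
plt.xlabel("NewX = F(X)")
plt.ylabel("Ln(average loss per category)")
plt.show()
```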
11. Treatment of Missing Values - I
Case I - The binning algorithm groups all missing values into a standalone category and then assigns a value to 𝑵𝒆𝒘𝑿 based on the corresponding average loss.
np.random.seed(1)
test_x = np.where(np.random.uniform(size = len(dt["bonusmalus"])) > 0.8, np.nan, np.array(dt["bonusmalus"]))
mob.view_bin(mob.gbm_bin(test_x, dt["purepremium"]))
| bin | freq | miss | ysum | yavg | newx | rule |
|-------|--------|--------|---------------|-----------|-------------|-------------------------------------|
| 0 | 135182 | 135182 | 35578527.4354 | 263.1898 | -0.37584005 | numpy.isnan($X$) |
| 1 | 307519 | 0 | 67711301.2865 | 220.1857 | -0.55424410 | $X$ <= 50.0 |
| 2 | 77102 | 0 | 19992914.3670 | 259.3047 | -0.39071162 | $X$ > 50.0 and $X$ <= 61.0 |
| 3 | 43211 | 0 | 15596493.6976 | 360.9380 | -0.06000929 | $X$ > 61.0 and $X$ <= 69.0 |
| 4 | 890 | 0 | 414193.4095 | 465.3859 | 0.19415125 | $X$ > 69.0 and $X$ <= 70.0 |
| 5 | 35072 | 0 | 18566866.2822 | 529.3929 | 0.32301519 | $X$ > 70.0 and $X$ <= 79.0 |
| 6 | 56945 | 0 | 46318023.6829 | 813.3817 | 0.75248495 | $X$ > 79.0 and $X$ <= 96.0 |
| 7 | 153 | 0 | 227944.8955 | 1489.8359 | 1.35770566 | $X$ > 96.0 and $X$ <= 99.0 |
| 8 | 21939 | 0 | 55449516.2623 | 2527.4405 | 1.88624679 | $X$ > 99.0 |
12. Treatment of Missing Values - II
Case II - When no loss was incurred for missing values, all records with missing values will be merged into the category with the lowest average loss.
test_x = np.where(np.logical_and(np.random.uniform(size = len(dt["bonusmalus"])) > 0.8,
                                 np.array(dt["purepremium"]) == 0), np.nan, np.array(dt["bonusmalus"]))
mob.view_bin(mob.gbm_bin(test_x, dt["purepremium"]))
| bin | freq | miss | ysum | yavg | newx | rule |
|-------|--------|--------|---------------|-----------|-------------|-------------------------------------|
| 1 | 439893 | 130194 | 80777461.9272 | 183.6298 | -0.73579385 | $X$ <= 50.0 or numpy.isnan($X$) |
| 2 | 77806 | 0 | 25208175.3327 | 323.9876 | -0.16801052 | $X$ > 50.0 and $X$ <= 61.0 |
| 3 | 43781 | 0 | 17814900.3274 | 406.9094 | 0.05987494 | $X$ > 61.0 and $X$ <= 69.0 |
| 4 | 901 | 0 | 505020.2406 | 560.5108 | 0.38013292 | $X$ > 69.0 and $X$ <= 70.0 |
| 5 | 33871 | 0 | 19639592.0866 | 579.8350 | 0.41402801 | $X$ > 70.0 and $X$ <= 76.0 |
| 6 | 1550 | 0 | 933807.2930 | 602.4563 | 0.45229955 | $X$ > 76.0 and $X$ <= 79.0 |
| 7 | 57623 | 0 | 52056038.4906 | 903.3899 | 0.85743868 | $X$ > 79.0 and $X$ <= 96.0 |
| 8 | 157 | 0 | 238409.0720 | 1518.5291 | 1.37678185 | $X$ > 96.0 and $X$ <= 99.0 |
| 9 | 21991 | 0 | 60879166.6831 | 2768.3674 | 1.97729742 | $X$ > 99.0 and $X$ <= 139.0 |
| 10 | 440 | 0 | 1803209.8657 | 4098.2042 | 2.36958856 | $X$ > 139.0 |
14. Variable Importance after Transformation
Because the monotonic binning provides a rank-ordering of each attribute with respect to the target, the binning outcomes can be leveraged to calculate the Gini coefficient in order to evaluate the predictiveness of each predictor after transformation.
The Gini outcome is highly consistent with the distance correlation.
# calculate gini-coefficient for each binned attribute
gout = [{"variable": _, "gini": mob.bin_gini(bout[_])} for _ in vlst]
# sort all attributes by gini-coefficients
pd.DataFrame(sorted(gout, key = lambda x: -x["gini"]))
| variable   | gini     | gini before binning |
|------------|----------|---------------------|
| bonusmalus | 0.373600 | 0.364684            |
| drivage    | 0.335541 | 0.319361            |
| vehage     | 0.130189 | 0.093274            |
| density    | 0.129020 | 0.075396            |
| vehpower   | 0.076282 | 0.026760            |

The Gini coefficient improved after binning for every attribute.
15. Transforming New Data
Functions are provided to apply transformations to new data and to verify the outcome.
bin1 = mob.qtl_bin(dt["bonusmalus"], dt["purepremium"])
# score new data based on the binning outcome
out1 = mob.cal_newx(dt['bonusmalus'], bin1)
mob.head(out1, 3)
# {'x': 50, 'bin': 1, 'newx': -0.60031106}
# {'x': 60, 'bin': 3, 'newx': -0.24666305}
# {'x': 85, 'bin': 4, 'newx': 0.51388758}
mob.chk_newx(out1)
| bin | newx | freq | dist | xrng |
|-------|-------------|--------|------------|--------------------------------|
| 1 | -0.60031106 | 384156 | 56.6591% | 50 <==> 50 |
| 2 | -0.34039004 | 68334 | 10.0786% | 51 <==> 57 |
| 3 | -0.24666305 | 80831 | 11.9217% | 58 <==> 68 |
| 4 | 0.51388758 | 82308 | 12.1396% | 69 <==> 85 |
| 5 | 1.25057961 | 62384 | 9.2010% | 86 <==> 230 |
16. Model Fitting without Transformation
Estimate a Tweedie GLM with the aforementioned predictors without any transformation. Only one variable is statistically significant.
Y = dt["purepremium"]
# use raw variables
X1 = sm.add_constant(pd.DataFrame({v: dt[v] for v in vlst}), prepend = True)
m1 = sm.GLM(Y, X1, family = sm.families.Tweedie(sm.families.links.Log(), var_power = 1.8)).fit()
==============================================================================
coef std err z P>|z| [0.025 0.975] VIF
------------------------------------------------------------------------------
const 3.8697 0.575 6.729 0.000 2.743 4.997
bonusmalus 0.0344 0.005 6.792 0.000 0.024 0.044 1.3222
drivage -0.0055 0.006 -0.907 0.364 -0.017 0.006 1.3018
vehage -0.0024 0.013 -0.183 0.855 -0.029 0.024 1.0165
density -7.882e-06 1.92e-05 -0.411 0.681 -4.55e-05 2.98e-05 1.0194
vehpower 0.0124 0.037 0.338 0.735 -0.060 0.085 1.0084
17. Model Fitting with Transformation
Estimate a Tweedie GLM with the same predictors after transformation. There are three statistically significant variables.
bout = dict((v, mob.iso_bin(dt[v], dt["purepremium"])) for v in vlst)
xout = dict((v, mob.cal_newx(dt[v], bout[v])) for v in vlst)
X2 = sm.add_constant(pd.DataFrame(dict((v, [_["newx"] for _ in xout[v]]) for v in vlst)), prepend = True)
m2 = sm.GLM(Y, X2, family = sm.families.Tweedie(sm.families.links.Log(), var_power = 1.8)).fit()
==============================================================================
coef std err z P>|z| [0.025 0.975] VIF
------------------------------------------------------------------------------
const 5.9635 0.066 91.001 0.000 5.835 6.092
bonusmalus 0.4727 0.115 4.115 0.000 0.248 0.698 1.3983
drivage 0.6632 0.126 5.254 0.000 0.416 0.911 1.3794
vehage 0.0726 0.228 0.319 0.750 -0.374 0.519 1.0119
density 0.4690 0.227 2.063 0.039 0.023 0.915 1.0215
vehpower 0.5291 0.417 1.269 0.205 -0.288 1.347 1.0055
18. Model Performance
A performance comparison between the model without variable transformation and the model with variable transformation is provided below.
| Statistical Metrics | Without Transformation | With Transformation |
|---------------------|------------------------|---------------------|
| AIC                 | 848,321                | 821,825             |
| Gini                | 0.3847                 | 0.4103              |
| sMAPE               | 1.9724                 | 1.9744              |
| MAE                 | 734.6655               | 717.6874            |
| D2 Tweedie Score    | 0.0393                 | 0.0553              |
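For reference, a comparison along these lines could be assembled as sketched below, assuming the fitted models m1 and m2 and design matrices from the previous two slides. It is only a sketch: the argument order of mob.smape() is assumed, scikit-learn is used for MAE and the D2 Tweedie score, and the Gini on predictions is omitted since its exact calculation in the deck is not shown.

```python
from sklearn.metrics import mean_absolute_error, d2_tweedie_score

def perf_summary(model, X, y, power = 1.8):
    # summarize a fitted statsmodels GLM on the given design matrix and target
    yhat = model.predict(X)
    return {"AIC": model.aic,
            "sMAPE": mob.smape(y, yhat),                       # assumed argument order
            "MAE": mean_absolute_error(y, yhat),
            "D2 Tweedie Score": d2_tweedie_score(y, yhat, power = power)}

pd.DataFrame({"Without Transformation": perf_summary(m1, X1, Y),
              "With Transformation": perf_summary(m2, X2, Y)})
```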
19. Appendix I: Distance Correlation
Distance correlation is a dependence measure between two paired vectors.
(Figure: illustrative scatter plots comparing distance correlation with Spearman correlation.)
Source: github.com/vnmabus/dcor
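As a small illustration of the difference, the dcor package from the cited repository can be compared with the Spearman correlation on a dependence that is strong but not monotonic; the data here are synthetic and chosen only to make the contrast visible.

```python
import numpy as np
import dcor                          # pip install dcor
from scipy.stats import spearmanr

rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, size = 1000)
y = x ** 2 + rng.normal(scale = 0.1, size = 1000)    # strong but non-monotonic dependence

print(spearmanr(x, y).correlation)                   # near zero: monotonic measure misses it
print(dcor.distance_correlation(x, y))               # clearly positive: dependence detected
```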
20. Appendix II: Core Functions of Loss_Mob
loss_mob
|-- qtl_bin() : Iterative discretization based on quantiles of X.
|-- los_bin() : Revised iterative discretization for records with Y > 0.
|-- iso_bin() : Discretization driven by the isotonic regression.
|-- val_bin() : Revised iterative discretization based on unique values of X.
|-- rng_bin() : Revised iterative discretization based on the equal-width range of X.
|-- kmn_bin() : Iterative discretization based on the k-means clustering of X.
|-- gbm_bin() : Discretization based on the gradient boosting machine (GBM).
|-- cus_bin() : Customized discretization based on pre-determined cut points.
|-- view_bin() : Displays the binning outcome in a tabular form.
|-- cal_newx() : Applies the variable transformation to a numeric vector based on binning outcome.
|-- chk_newx() : Verifies the transformation generated from the cal_newx() function.
|-- mi_score() : Calculates the Mutual Information (MI) score between X and Y.
|-- screen() : Calculates Spearman and Distance Correlations between X and Y.
|-- bin_gini() : Calculates the gini-coefficient between X and Y based on the binning object.
|-- num_gini() : Calculates the gini-coefficient between raw values of X and Y.
|-- smape() : Calculates the sMAPE value between Y and Yhat.
`-- get_mtpl() : Extracts the French Motor Third-Party Liability Claims dataset from OpenML.