A Modified Logistic Regression Approach Enhanced by
New Interactions and Scaling Detections through
Random Forest and GBM
Yulin Ning, Senior Director
Next Gen Analytics
Global Analytics Platform and Capabilities
Citi Global Consumer Bank
Predictive Analytics World, New York
Oct 30th, 2017
Presenter’s View Only – does not reflect Citigroup’s View
- 1 -
Who We Are and What Our Strategy Is
Next Generation Analytics Strategy
• Infrastructure / Data
• Processes
• Platforms and Capabilities
• POC Management Framework
• Ongoing Industry Research
• Talent / Data Scientist Disciplines
• Capabilities Development / POCs
• Innovative Culture
Citi Global Consumer Priorities
• Customer Centricity
• Global
• Digital Bank
- 2 -
Today’s Topic
A Modified Logistic Regression Approach Enhanced by New Interactions and Scaling Detections through Random Forest and GBM: a practical adoption of machine learning into the predictive analytics discipline
Problem statement: algorithm selection intertwines with variable selection methods
Literature review
Key components of predictive analytics
Algorithm selection intertwining with variable selection
Use Random Forest (RF) and GBM to enhance variable selection
A modified logistic regression approach that incorporates the benefits of Random Forest and GBM
- 3 -
Problem Statement: Different Variable Selection Methods
Generated Different Drivers During our POC Work
• Our 3 MAs (Liana, Jonathan, and Jason) explored a variety of methods for the initial variable/driver selection in one of our POCs
• Stepwise variable selection and the information value method, both based on logistic regression
• Variable importance based on Random Forest and GBM
• Different methods generate different results, which shows the need to refine the modeling options to ensure scaled, unbiased, and conditional variable importance
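The information value method mentioned above can be sketched as follows. This is a minimal illustration of the standard weight-of-evidence and information-value calculation for a categorical or pre-binned variable, not the POC's actual code; the function name, the `smoothing` adjustment, and the toy data are all assumptions made for the example.

```python
import math
from collections import Counter

def information_value(values, targets, smoothing=0.5):
    """Information value of a categorical (or pre-binned) variable.

    values:  bin/category label per row
    targets: 1 = event, 0 = non-event
    smoothing is an assumed additive adjustment that avoids log(0) on empty cells.
    """
    events = Counter(v for v, t in zip(values, targets) if t == 1)
    non_events = Counter(v for v, t in zip(values, targets) if t == 0)
    bins = sorted(set(values))
    total_events = sum(events.values()) + smoothing * len(bins)
    total_non_events = sum(non_events.values()) + smoothing * len(bins)

    iv = 0.0
    for b in bins:
        pct_event = (events[b] + smoothing) / total_events
        pct_non_event = (non_events[b] + smoothing) / total_non_events
        woe = math.log(pct_non_event / pct_event)  # weight of evidence of this bin
        iv += (pct_non_event - pct_event) * woe
    return iv

# Hypothetical toy data: two bins of 40 rows each.
bins = ["A"] * 40 + ["B"] * 40
same_rate_y = ([0] * 30 + [1] * 10) * 2                      # same event rate in A and B
diff_rate_y = [0] * 35 + [1] * 5 + [0] * 25 + [1] * 15       # event rate differs by bin

iv_uninformative = information_value(bins, same_rate_y)
iv_informative = information_value(bins, diff_rate_y)
```

A variable whose bins all share the same event rate gets an information value of zero, while a variable whose bins separate events from non-events gets a positive value, which is what makes the measure usable as a ranking criterion.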
*Learnings from POC
Binary Indicator Variables
(a total of 17 Variables)
Category variables
Continuous variables
(7 Variables are exactly
the same, 22 similar type,
but not exactly the same)
Continuous Balance
Variables(5 Variables)
 15 Closed Account Indicators
 6 Other Indicators
 16 are count related variables
 3 Balance related fields
• Only 60% overlap for similar types of variables
• Only 14% of variables are exactly the same
• Binary variables are skewed to the left and continuous variables are skewed to the right
 10 Balance variables
 13 count related fields
 4 Indicators
 3 Categories
Logistic
Random
Forest
- 4 -
Our Experience Triggered a Literature Review on
Variable Selection Method Comparisons
Although there are many algorithms to choose from, the good news is that only a handful of them have consistently outperformed the rest.
1. Random Forest is found to perform better than logistic regression in the literature
Why and how to use random forest variable importance measures, Carolin Strobl (LMU München) and Achim Zeileis (WU Wien), 2008, Dortmund
Elements of Optimal Predictive Modeling Success in Data Science: An Analysis of Survey Data for the ‘Give Me Some Credit’ Competition Hosted on Kaggle, Dhruv Sharma, March 2, 2013
Random forests have been found to outperform standard logistic regression models by 5-7% and to be well suited for credit scoring out of the box, due to their ability to deal with correlated variables, which confound the modeling process when regression methods are used (Sharma, 2009; Sharma, 2011)
2. Kaggle competition results also indicate that Random Forest was a favorite approach among data scientists a few years earlier
3. GBM is a relatively new phenomenon
4. Deep learning has recently gained popularity for handling unstructured data
Key Components of Predictive Analytics
Feature Generation
• Event
• Interactions
• Non-linear Transformation
• Peer Statistics
• ……
Algorithms
• Decision Trees
• Logistic Regression
• Support Vector Machine
• Random Forest
• GBM
• Deep Learning
• AI
• ……
Computing Proficiency
Scalable parallel processing
In-memory pull (Spark)
Iterations for converging
Variable Selection
Adaptive Feedback Loop
Adaptive learning
Real Time Delivery
Real time business benefit
Profile Data
Transaction Data
Event
Contextual
Interaction
Digital
Social
Value of Data
Value of Algorithms
Driver Analysis
Value of Time
- 6 -
Algorithm Selection Intertwining with Variable Selection
and Fundamental Algorithm Assumptions
• Algorithm selection affects both the final variables selected and the assumptions made
Variable
Selection
Model Fitting
Model
Validations
Assumptions
Algorithm
Selection
Key Points
 Distribution
 Categorical
 Non-linear relationship
 Type of relationship
- 7 -
Fundamental Understandings: Random Forest’s
Strengths and Weaknesses versus Logistic Regression
Logistic Regression
Strengths
• A conventional method that is easy to execute
• Unbiased estimate for binary variables
Weaknesses
• Too much focus on the significance level and too little on the weight of the coefficients
• It is up to the modeler to detect and incorporate non-linear transformations as well as interactions
Random Forest
Strengths
• The most powerful aspect of random forests is the variable importance ranking, which estimates the predictive value of a variable by scrambling it and measuring how much the model performance drops
• Variable importance captures non-linear as well as interaction effects among variables
• Because each probability is derived from many different trees, variable selection is more stable
• Variable importance reflects both the significance level and the weight of the coefficient
Weaknesses
• The default option is biased towards continuous variables and disfavors categorical and binary variables
• The unbiased solution is very complex and computationally intensive (cforest package)
Ensemble Approach is Needed
• Incorporate binary variable impact
• Incorporate non-linear relationships
• Incorporate interactions
• Scale the continuous variables coming from the Random Forest approach to reduce biases
• No one method is perfect; an ensemble approach tends to perform better
• Unbiased, conditional, and scaled variable importance in Random Forest is very computationally intensive and therefore less practical
• A simplified, modified logistic regression that uses Random Forest as a variable selection and non-linear transformation tool is more appropriate
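The "scramble the variable and measure the drop" idea behind the variable importance ranking can be sketched in a few lines. This is a simplified, model-agnostic version for illustration only: real Random Forest implementations typically compute it per tree on out-of-bag rows, and the toy data and the rule-based `model` below are hypothetical stand-ins for a fitted model.

```python
import random

def accuracy(model, rows, labels):
    """Share of rows for which the model predicts the right label."""
    return sum(model(r) == y for r, y in zip(rows, labels)) / len(rows)

def permutation_importance(model, rows, labels, column, n_repeats=10, seed=0):
    """Average drop in accuracy after shuffling one column of the data."""
    rng = random.Random(seed)
    baseline = accuracy(model, rows, labels)
    drops = []
    for _ in range(n_repeats):
        shuffled = [r[column] for r in rows]
        rng.shuffle(shuffled)  # break the link between this column and the label
        scrambled = [r[:column] + [v] + r[column + 1:] for r, v in zip(rows, shuffled)]
        drops.append(baseline - accuracy(model, scrambled, labels))
    return sum(drops) / n_repeats

# Hypothetical toy data: column 0 drives the label, column 1 is pure noise.
data_rng = random.Random(42)
rows = [[data_rng.random(), data_rng.random()] for _ in range(500)]
labels = [int(r[0] > 0.5) for r in rows]

def model(row):
    """Stand-in for a fitted model: a rule that only looks at column 0."""
    return int(row[0] > 0.5)

importance_signal = permutation_importance(model, rows, labels, column=0)
importance_noise = permutation_importance(model, rows, labels, column=1)
```

Scrambling the driver column costs the model roughly half of its accuracy, while scrambling the noise column costs nothing, so the ranking puts the driver first without relying on a coefficient or a p-value.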
- 8 -
Fundamental Understandings: Random Forest as
Compared with GBM
Random Forest
Diagram: random sampling with replacement, independent samples, average of predictions
Key Observations:
• Random sampling with replacement
• Multiple datasets, same algorithm or parameter controls
• Random variable columns and random row selections (unbiased)
• Variable importance is derived from dropping the variable, a backward calculation
GBM
Diagram: random sampling from residuals, additive predictions
Key Observations:
• Random sampling without replacement
• Sequential residual datasets
• Weak learners
• Same independent variables, different models
• Additive predictions
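The contrast between the two ensembles can be sketched with one-split regression trees (stumps) on a single feature. This is an illustration under simplifying assumptions, not a production Random Forest or GBM: the function names, the squared-error loss, the `learning_rate` and `subsample` defaults, and the sine-curve toy data are all choices made for the example.

```python
import math
import random

def fit_stump(xs, ys):
    """Least-squares regression stump: one cutoff, one mean on each side."""
    best = None
    for cut in sorted(set(xs))[1:]:
        left = [y for x, y in zip(xs, ys) if x < cut]
        right = [y for x, y in zip(xs, ys) if x >= cut]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((y - lm) ** 2 for y in left) + sum((y - rm) ** 2 for y in right)
        if best is None or sse < best[0]:
            best = (sse, cut, lm, rm)
    if best is None:  # every x identical: fall back to the mean
        mean = sum(ys) / len(ys)
        return lambda x: mean
    _, cut, lm, rm = best
    return lambda x: lm if x < cut else rm

def bagged_stumps(xs, ys, n_trees, rng):
    """Random-Forest style: independent bootstrap samples, predictions averaged."""
    n = len(xs)
    trees = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # sampling WITH replacement
        trees.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return lambda x: sum(t(x) for t in trees) / n_trees

def boosted_stumps(xs, ys, n_trees, rng, learning_rate=0.3, subsample=0.8):
    """GBM style: each weak learner is fit to the current residuals and added."""
    n = len(xs)
    base = sum(ys) / n
    residuals = [y - base for y in ys]
    trees = []
    for _ in range(n_trees):
        idx = rng.sample(range(n), max(2, int(subsample * n)))  # WITHOUT replacement
        stump = fit_stump([xs[i] for i in idx], [residuals[i] for i in idx])
        trees.append(stump)
        residuals = [r - learning_rate * stump(x) for x, r in zip(xs, residuals)]
    return lambda x: base + learning_rate * sum(t(x) for t in trees)

def mse(model, xs, ys):
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Hypothetical toy data: a noisy sine curve.
rng = random.Random(0)
xs = [i / 10 for i in range(63)]
ys = [math.sin(x) + rng.gauss(0, 0.1) for x in xs]
mean_y = sum(ys) / len(ys)

mse_baseline = mse(lambda x: mean_y, xs, ys)
mse_bagged = mse(bagged_stumps(xs, ys, 25, rng), xs, ys)
mse_boosted = mse(boosted_stumps(xs, ys, 50, rng), xs, ys)
```

The bagged model never looks at its own errors, since every tree sees a fresh bootstrap sample of the original target. The boosted model only ever sees what the previous trees left unexplained, which is why its trees must be built sequentially.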
- 9 -
Fundamental Understandings: GBM’s Strengths and
Weaknesses versus Random Forest
Common to Random Forest and GBM
• Both are ensemble learning methods that predict from individual trees. They differ in the order in which the trees are built and in the way the results are combined.
• Neither requires scaling of inputs, so careful feature normalization is not needed. Both can learn higher-order interactions quickly and scale well computationally. As tree-based machine learning approaches, Random Forest and GBM intrinsically perform their own feature selection.
Random Forest
• Random Forest trains each tree independently. This randomness helps reduce bias and makes the model less likely to overfit.
• Fewer parameters to tune
• Data drawn: random with replacement; same algorithm, different samples; predictions are averaged
• Trade-off: low bias + high variance
• Strengths: easier to tune; unbiased estimate for binary variables
• Weaknesses: does not do well with time; sometimes misinterpreted
GBM
• GBM builds trees one at a time, where each new tree helps to correct errors made by the previously trained trees. With each tree added
• More parameters to tune, such as the number of trees, the depth of trees, and the learning rate; each tree built is generally shallow
• Data drawn: random without replacement; residuals; additive prediction
• Trade-off: high bias + low variance
• Strengths: a well-tuned model outperforms the RF method; gradient boosting does a better job of handling multicollinearity than RF; GBM tends to concentrate on a few variables
• Weaknesses: easy to overfit; more parameters to tune; takes longer to train because of its sequential nature
- 10 -
Leverage RF/GBM to do Variable Selection and Linear
Transformations of Important Interactions
1. Develop a short list of candidates for non-linear relationships: leverage the Random Forest and GBM approach to narrow down variables quickly and to pick up variables that are not selected by logistic regression
• Reduce interactions quickly to a selected set, such as the top 10 from all possible 400 variables
2. Explore all of the following transformations to linearize the non-linear relationships
• Non-linear relationship: non-linear detection based on multiple layers or cutoffs in the decision trees
• Interactions: effective detection of interaction items through fewer selected candidates
• Scaling: variable transformation based on ranking rather than raw value (non-parametric approach)
• Categorical transformation will help with long-tail distributions
• Identify different transformations and different cutoffs, and create more binary variables
3. Test the above transformations, then feed them into logistic regression together with the other significant variables
*Learnings from POC
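The three kinds of transformation on this slide can be sketched as small helper functions. This is a minimal illustration: the function names and sample balances are hypothetical, and the cutoffs are placeholders for values that would in practice be read off the split points of the fitted trees.

```python
from itertools import combinations

def rank_scale(values):
    """Scaling: replace raw values by their average rank, scaled to (0, 1]."""
    n = len(values)
    first, count = {}, {}
    for pos, v in enumerate(sorted(values), start=1):
        first.setdefault(v, pos)          # first 1-based position of each value
        count[v] = count.get(v, 0) + 1    # number of ties
    return [(first[v] + (count[v] - 1) / 2) / n for v in values]

def cutoff_indicators(values, cutoffs):
    """Non-linear relationship: one binary indicator per cutoff (1 if value > cutoff)."""
    return [[int(v > c) for c in cutoffs] for v in values]

def interaction_candidates(names):
    """Interactions: all pairwise candidates among the given variables."""
    return list(combinations(names, 2))

# Hypothetical long-tailed balance variable.
balances = [0, 120, 50, 50, 1_000_000]
scaled = rank_scale(balances)                             # [0.2, 0.8, 0.5, 0.5, 1.0]
indicators = cutoff_indicators(balances, [100, 10_000])   # placeholder cutoffs

# Size of the pairwise search space before and after narrowing to the top 10.
pairs_all = len(interaction_candidates(range(400)))       # 79800
pairs_top10 = len(interaction_candidates(range(10)))      # 45
```

Ranking removes the influence of the one extreme balance, and narrowing 400 variables to the top 10 shrinks the pairwise interaction search from 79,800 candidates to 45, which is what makes a stepwise test of each candidate practical.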
- 11 -
Our Recommended Solution: Improve Logistic Regression by Leveraging Random Forest and GBM Variable Selection to Incorporate Non-linear Relationships, Interactions, and Non-parametric Ranking While Keeping the Binary Variable Impacts
Current Approach
Logistic approach variable selection (p-value, information value), then logistic regression, giving the current performance
Recommended Approach
Random Forest approach variable selection, combined with the binary variables kept from logistic regression, then modified logistic regression, giving improved performance
• Select top variables based on variable importance ranking
• Interaction items are tested stepwise
• A limited set of interactions is incorporated in the next steps
• Non-linear transformation and scaling are done as well
Recommended Steps
1. Still use logistic regression to capture binary variable impacts
2. Use the Random Forest and GBM ranking results to generate a manageable set of candidate variables, so that interaction effects can be identified within a small search space
3. Use the variable importance in an ordered search for interaction variables and build a stepwise regression based on validation performance
4. For continuous variables selected by Random Forest, use normalized or scaled, categorical versions to reduce biases
5. Run logistic regression again on the enhanced variable list
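Steps 2, 3, and 5 above can be sketched as an ordered forward search: candidate interactions are tried one at a time and kept only when the validation log loss improves. This is an illustration under stated assumptions, not the POC's implementation. The gradient-descent logistic regression, the `min_gain` threshold, the candidate ordering (assumed to come from a tree-based importance ranking), and the synthetic data, where the label depends on an interaction of variables 0 and 1, are all hypothetical.

```python
import math
import random

def predict_logit(w, row):
    return w[0] + sum(wi * xi for wi, xi in zip(w[1:], row))

def fit_logistic(X, y, lr=1.0, n_iter=400):
    """Plain batch gradient descent; returns [intercept, w1, ...]."""
    w = [0.0] * (len(X[0]) + 1)
    for _ in range(n_iter):
        grad = [0.0] * len(w)
        for row, target in zip(X, y):
            err = 1.0 / (1.0 + math.exp(-predict_logit(w, row))) - target
            grad[0] += err
            for j, xi in enumerate(row, start=1):
                grad[j] += err * xi
        w = [wi - lr * g / len(X) for wi, g in zip(w, grad)]
    return w

def log_loss(w, X, y):
    total = 0.0
    for row, target in zip(X, y):
        p = 1.0 / (1.0 + math.exp(-predict_logit(w, row)))
        p = min(max(p, 1e-12), 1 - 1e-12)
        total -= target * math.log(p) + (1 - target) * math.log(1 - p)
    return total / len(X)

def add_interactions(rows, pairs):
    """Append one product column per (i, j) pair."""
    return [row + [row[i] * row[j] for i, j in pairs] for row in rows]

def forward_interaction_search(X_tr, y_tr, X_va, y_va, candidates, min_gain=0.005):
    """Keep a candidate interaction only if validation log loss improves by min_gain."""
    baseline = log_loss(fit_logistic(X_tr, y_tr), X_va, y_va)
    kept, best = [], baseline
    for pair in candidates:  # assumed to be ordered by tree-based importance
        trial = kept + [pair]
        w = fit_logistic(add_interactions(X_tr, trial), y_tr)
        loss = log_loss(w, add_interactions(X_va, trial), y_va)
        if best - loss > min_gain:
            kept, best = trial, loss
    return kept, best, baseline

def make_data(n, rng, flip=0.1):
    """Hypothetical data: label is x0 XOR x1 with 10% label noise; x2 is irrelevant."""
    X, y = [], []
    for _ in range(n):
        row = [rng.randint(0, 1) for _ in range(3)]
        label = row[0] ^ row[1]
        if rng.random() < flip:
            label = 1 - label
        X.append(row)
        y.append(label)
    return X, y

rng = random.Random(7)
X_train, y_train = make_data(200, rng)
X_valid, y_valid = make_data(200, rng)
kept, best_loss, baseline_loss = forward_interaction_search(
    X_train, y_train, X_valid, y_valid, candidates=[(0, 1), (0, 2), (1, 2)])
```

On this toy data a main-effects-only logistic regression cannot represent the relationship at all, so the first candidate interaction is accepted and the validation loss falls well below the baseline. Judging each candidate on held-out data rather than on in-sample significance is the design choice that step 3 describes.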
- 12 -
Benefit from Recommended Approach: A 0-10% Lift is Observed
• A 0-10% lift is observed if we leverage Random Forest for non-linear relationship and variable interaction detection
- 13 -
Conclusions
• Although there are many algorithms to choose from, the good news is that only a handful of them have consistently outperformed the rest, so we did not need to start from scratch. Although there are many different ways of coding (R, Java, Python), the underlying algorithms are mostly similar. Most of the differences arise in the iteration process, data re-engineering, and feature creation.
• The model variable selection process is a key component of predictive analytics. Whereas logistic regression depends on feature selection/discovery being done beforehand, tree-based machine learning approaches such as Random Forest and GBM intrinsically perform their own feature selection.
• However, the default options for Random Forest and GBM are biased towards continuous variables and disfavor categorical and binary variables. The unbiased solution is very computationally intensive.
• We proposed a new approach that uses the intrinsic feature selection of Random Forest and GBM to detect non-linear relationships, scaling, and interaction items, and applies these transformations in the traditional logistic regression approach. A 0-10% lift is observed.

Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
 
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
 
Customer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxCustomer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptx
 
RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998
 
Brighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingBrighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data Storytelling
 
Amazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptx
Amazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptxAmazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptx
Amazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptx
 
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptxPKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
 
Decoding Loan Approval: Predictive Modeling in Action
Decoding Loan Approval: Predictive Modeling in ActionDecoding Loan Approval: Predictive Modeling in Action
Decoding Loan Approval: Predictive Modeling in Action
 
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
 
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
 

1555 track 2 ning_using our laptop

  • 1. A Modified Logistic Regression Approach Enhanced by New Interactions and Scaling Detections through Random Forest and GBM Yulin Ning, Senior Director Next Gen Analytics Global Analytics Platform and Capabilities Citi Global Consumer Bank Predictive Analytics World, New York Oct 30th, 2017 Presenter’s View Only – does not reflect Citigroup’s View
  • 2. - 1 - Who Are We and What Our Strategy Is Infrastructure / Data Processes Platforms and Capabilities Citi Global Consumer Priorities POC Management Framework Ongoing Industry Research Talent / Data Scientist Disciplines Capabilities Development / POCs Innovative Culture Next Generation Analytics Strategy Customer Centricity Global Digital Bank
  • 3. - 2 - Today’s Topic A Modified Logistic Regression Approach Enhanced by New Interactions and Scaling Detections through Random Forest and GBM - A practical adoption of machine learning into predictive analytic discipline Problem statement: algorithm selection intertwines with variable selection methods Literature review Key components of predictive analytics Algorithm selection Intertwining with variable selection Use Random Forest (RF) and GBM to enhance variable selection A modified logistic regression approach to incorporate benefit from Random Forest and GBM
  • 4. - 3 - Problem Statement: Different Variable Selection Methods Generated Different Drivers During our POC Work • A variety of methods were explored by our 3 MAs (Liana, Jonathan, and Jason) to do the initial variable/driver selection for one of our POCs • Stepwise variable selection and information value method based on logistic regression • Variable importance based on Random Forest and GBM • Different methods generate different results, reflecting the need to refine the modeling options to ensure scaled, unbiased, and conditional variable importance *Learnings from POC Binary Indicator Variables (a total of 17 Variables) Category variables Continuous variables (7 Variables are exactly the same, 22 similar type, but not exactly the same) Continuous Balance Variables (5 Variables) • 15 Closed Account Indicators • 6 Other Indicators • 16 are count related variables • 3 Balance related fields • Only 60% overlap for similar types of variables • Only 14% of variables are exactly the same • Binary variables are skewed to the left and continuous variables are skewed to the right • 10 Balance variables • 13 count related fields • 4 Indicators • 3 Categories Logistic Random Forest
  • 5. - 4 - Our Experience Triggered us to Do: Literature Review on Variable Selection Methods Comparisons Although there are many options for algorithms, the good news is that only a handful of algorithms have consistently outperformed. 1. Random Forest is found to perform better than logistic regression in the literature Why and how to use random forest variable importance measures, Carolin Strobl (LMU München) and Achim Zeileis (WU Wien), 2008, Dortmund Elements of Optimal Predictive Modeling Success in Data Science: An Analysis of Survey Data for the ‘Give Me Some Credit’ Competition Hosted on Kaggle, Dhruv Sharma, March 2, 2013 Random forests have been found to outperform standard logistic regression models by 5-7% and are well suited for credit scoring out of the box due to their ability to deal with correlated variables, which confound the modeling process using regression methods (Sharma, 2009, Sharma, 2011) 2. Kaggle competition results also indicate that Random Forest was a favorite approach among data scientists a few years earlier 3. GBM is a relatively new phenomenon 4. Deep learning has recently gained popularity for handling unstructured data
  • 6. Key Components of Predictive Analytics Feature Generations • Event • Interactions • Non-linear Transformation • Peer Statistics • …… Algorithms • Decision Trees • Logistic Regression • Support Vector Machine • Random Forest • GBM • Deep Learning • AI • …… Computing Proficiency Scalable parallel processing In-memory pull (Spark) Iterations for converging Variable Selection Adaptive Feedback Loop Adaptive learning Real Time Delivery Real time business benefit Profile Data Transaction Data Event Contextual Interaction Digital Social Value of Data Value of Algorithms Driver Analysis Value of Time
  • 7. - 6 - Algorithm Selection Intertwining with Variable Selection and Fundamental Algorithm Assumptions • Algorithm selection impacts final variable selected and assumptions made Variable Selection Model Fitting Model Validations Assumptions Algorithm Selection Key Points • Distribution • Categorical • Non-linear relationship • Type of relationship
  • 8. - 7 - Fundamental Understandings: Random Forest’s Strengths and Weaknesses versus Logistic Regression Logistic Regression Random Forest Strength • Conventional method and easy to execute • Unbiased estimate for binary variables • The most powerful aspect of random forests is the variable importance ranking, which estimates the predictive value of a variable by scrambling it and seeing how much the model performance drops. • Variable importance captures non-linear as well as interaction impacts among variables • A probability is associated with many different trees, so variable selection will be more stable • Variable importance reflects both the significance level and the weight of the coefficient Weakness • Too much focus on the significance level, less focus on the weight of the coefficients • It is up to the modeler to detect and incorporate non-linear transformations as well as interactions • Default option is biased towards continuous variables and against categorical and binary variables • Unbiased solution is very complex and computationally intensive (cforest package) Ensemble Approach is Needed • Incorporate binary variable impact • Incorporate non-linear relationships • Incorporate interactions • Scale continuous variables from the Random Forest approach to reduce biases • No one method is perfect; an ensemble approach tends to perform better • Unbiased, conditional, and scaled variable importance in Random Forest is very computationally intensive and is less practical • A simplified modified logistic regression is more appropriate, using Random Forest as a variable selection and non-linear transformation tool
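Slide 8 describes variable importance as scrambling one variable and measuring how much model performance drops. A minimal sketch of that idea in plain Python follows. The toy data, the `toy_model` stand-in for a fitted model, and the choice of accuracy as the metric are all assumptions made for illustration; they are not taken from the talk or from any Random Forest package.

```python
import random

def accuracy(model, rows, labels):
    """Fraction of rows the model classifies correctly."""
    hits = sum(1 for row, y in zip(rows, labels) if model(row) == y)
    return hits / len(rows)

def permutation_importance(model, rows, labels, n_features, n_repeats=20, seed=0):
    """Mean drop in accuracy after shuffling each feature column in turn."""
    rng = random.Random(seed)
    baseline = accuracy(model, rows, labels)
    drops = []
    for j in range(n_features):
        total = 0.0
        for _ in range(n_repeats):
            # Scramble column j only; every other column stays intact.
            column = [row[j] for row in rows]
            rng.shuffle(column)
            shuffled = [row[:j] + [v] + row[j + 1:] for row, v in zip(rows, column)]
            total += baseline - accuracy(model, shuffled, labels)
        drops.append(total / n_repeats)
    return drops

# Hypothetical toy data: the label depends on feature 0 only; feature 1 is noise.
data_rng = random.Random(42)
rows = [[data_rng.random(), data_rng.random()] for _ in range(500)]
labels = [1 if row[0] > 0.5 else 0 for row in rows]

def toy_model(row):
    # Stand-in for a fitted model; it uses feature 0 and ignores feature 1.
    return 1 if row[0] > 0.5 else 0

importances = permutation_importance(toy_model, rows, labels, n_features=2)
print(importances)  # feature 0 shows a large drop, feature 1 shows none
```

Because the score is a drop in predictive performance rather than a p-value, it reflects any way the model uses the variable, including non-linear and interaction effects.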
  • 9. - 8 - Fundamental Understandings: Random Forest as Compared with GBM Key Observations: • Random sampling with replacement • Multiple datasets, same algorithms or parameter controls • Random variable columns and random row selections (unbiased) • Variable importance is derived from dropping the variable, a backward calculation …… Random with replacement Independent sampling Average of Predictions …… Random sampling from residual Additive of Predictions Key Observations: • Random sampling without replacement • Sequential residual datasets • Weak learners • Same independent variables, different models • Additive of predictions
  • 10. - 9 - Fundamental Understandings: GBM’s Strengths and Weaknesses versus Random Forest Random Forest GBM Common • Both are ensemble learning methods and predict from individual trees. They differ in the order in which the trees are built and the way the results are combined. • No scaling of inputs is needed, so you do not need to do careful feature normalization; both can learn higher-order interactions quickly and are scalable for computing. Random Forest and GBM tree based machine learning approaches intrinsically enact their feature selections. • Random Forest trains each tree independently. This randomness helps reduce biases and makes it less likely to overfit. • Fewer parameters to tune • GBM builds trees one at a time, where each new tree helps to correct errors made by previously trained trees. With each tree added • More parameters to tune, such as number of trees, depth of trees and learning rate, and each tree built is generally shallow. Data Drawn • Random with replacement • Same algorithm, different samples • Average • Random without replacement • Residual • Additive prediction Trade-offs • Low Bias + High Variance • High Bias + Low Variance Strength • Easier to tune • Unbiased estimate for binary variables • A well-tuned model outperforms the RF method • Gradient boosting does a better job of handling multicollinearity than RF • GBM has a tendency to concentrate on a few variables Weakness • Does not do well with time • Sometimes misinterpreted • Easy to overfit • More parameters to tune • Takes longer to train because of its sequential nature
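The contrast on slide 9 (independent trees on samples drawn with replacement, then averaged, versus sequential weak learners fit to residuals, then added up) can be sketched with one-split regression stumps on one-dimensional data. This is an illustrative toy under stated assumptions, not the presenter's implementation: the data, the stump learner, and the learning rate are hypothetical. It also omits the random column selection of a real Random Forest and the row subsampling without replacement that the slide attributes to GBM.

```python
import random

def fit_stump(xs, ys):
    """Best single-split regression stump on 1-D data (least squares)."""
    best = None
    for t in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        if not left or not right:
            continue
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((y - lm) ** 2 for y in left) + sum((y - rm) ** 2 for y in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    if best is None:  # all x values identical: predict the mean everywhere
        mean = sum(ys) / len(ys)
        return lambda x: mean
    _, t, lm, rm = best
    return lambda x: lm if x <= t else rm

def bagging(xs, ys, n_trees, rng):
    """Random-Forest-style: independent stumps on bootstrap samples, averaged."""
    n = len(xs)
    stumps = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # sampling with replacement
        stumps.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return lambda x: sum(s(x) for s in stumps) / len(stumps)

def boosting(xs, ys, n_trees, learning_rate=0.5):
    """GBM-style: each stump is fit to the residuals left by the previous ones."""
    base = sum(ys) / len(ys)
    residuals = [y - base for y in ys]
    stumps = []
    for _ in range(n_trees):
        s = fit_stump(xs, residuals)
        stumps.append(s)
        residuals = [r - learning_rate * s(x) for x, r in zip(xs, residuals)]
    # Additive prediction: the base value plus the shrunken sum of all stumps.
    return lambda x: base + learning_rate * sum(s(x) for s in stumps)

def mse(model, xs, ys):
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Hypothetical data: a non-linear target that a single stump cannot capture.
xs = [i / 50 - 1 for i in range(101)]
ys = [x * x for x in xs]
mean_y = sum(ys) / len(ys)

baseline_mse = mse(lambda x: mean_y, xs, ys)
bagged = bagging(xs, ys, n_trees=30, rng=random.Random(0))
boosted = boosting(xs, ys, n_trees=30)
print(baseline_mse, mse(bagged, xs, ys), mse(boosted, xs, ys))
```

The only structural difference between the two functions is what each new stump is trained on (a fresh bootstrap sample versus the current residuals) and how the stumps are combined (average versus sum).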
  • 11. - 10 - Leverage RF/GBM to do Variable Selection and Linear Transformations of Important Interactions Reduce Interactions quickly to a selected set such as top 10 from all possible 400 variables 1. Develop a short list of Candidates for Non-Linear Relationships: Leverage Random Forest and GBM Approach to narrow down variables quickly and to pick up those variables that are not selected by logistic regression Non linear Relationship Non Linear Detection based on Multiple Layer or Cutoffs in the Decision Trees Interactions Effective Interaction Items Detection through fewer selected Candidates Scaling Variable Transformation based on Ranking rather than Raw Value (non-parametric Approach) Categorical transformation will help long tail distributions Identify different transformation, different cutoffs and create more binary variables 2. Exploring All of the Following Transformations to Linearize the non-linear relationship 3. Test the above transformations, then feed into logistic regression with other significant variables together *Learnings from POC
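Two of the transformations named on slide 11, scaling by rank rather than raw value and creating binary variables from cutoffs, can be sketched as follows. The balance values and the cutoffs of 100 and 500 are hypothetical; in the approach described, cutoffs would presumably come from the split points found by the trees.

```python
def rank_scale(values):
    """Map raw values to (0, 1] by average rank; tied values share a rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole group of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    return [r / len(values) for r in ranks]

def cutoff_indicators(values, cutoffs):
    """One binary column per cutoff: 1 if the value exceeds it, else 0."""
    return [[1 if v > c else 0 for c in cutoffs] for v in values]

# Hypothetical long-tailed balance variable.
balances = [0, 0, 120, 450, 900, 1_000_000]
scaled = rank_scale(balances)
flags = cutoff_indicators(balances, cutoffs=[100, 500])
print(scaled)
print(flags)
```

The rank transform keeps the extreme balance from dominating the scale, and the indicator columns let a logistic regression pick up a step-shaped relationship that a single linear term would miss.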
  • 12. - 11 - Our Recommended Solution: Improve Logistic Regression by Leveraging the Random Forest and GBM Variable Selection Approach to Incorporate Non-linear Effects, Interactions, and Non-Parametric Ranking While Keeping the Binary Variable Impacts Recommended Approach Logistic Approach Variable Selection Random Forest Approach Logistic Regression Modified Logistic Regression Current Approach Select top variables based on variable importance ranking Interaction items are tested stepwise A limited set of interactions is incorporated in the next steps Non-linear transformation and scaling are done as well Keep binary variables from logistic regression Improved Performance P-value Information Value Current Performance Recommended Steps 1. Still use logistic regression to capture binary variable impacts 2. Use the Random Forest and GBM ranking results to generate a manageable set of candidate variables for identifying interaction effects within a small search space 3. Use the variable importance in an ordered search for interaction variables and build a stepwise regression based on validation performance 4. Use normalized or scaled categorical variables for the continuous variables selected by Random Forest to reduce biases 5. Use the above enhanced variable list to run logistic regression again
  • 13. - 12 - Benefit from Recommended Approach: A 0-10% Lift is Observed if we leverage Random Forest for Non-linear Relationship and Variable Interaction Detection
  • 14. - 13 - Conclusions • Although there are many options for algorithms, the good news is that only a handful of algorithms have consistently outperformed. We did not need to start from scratch. Although there are many different ways of coding (R, Java, Python), the underlying algorithms are mostly similar. Most of the differences arise in the iteration process, data re-engineering, and feature creation. • The model variable selection process is a key component of predictive analytics. Whereas logistic regression depends on feature selection/discovery being done beforehand, Random Forest and GBM tree based machine learning approaches intrinsically enact their feature selections. • However, default options for Random Forest and GBM are biased towards continuous variables and against categorical and binary variables. An unbiased solution is very computationally intensive. • We proposed a new approach that uses the intrinsic feature selection of Random Forest and GBM to detect non-linear relationships, scaling, and interaction items, and applies these non-linear transformations in the traditional approach. A 0-10% lift is observed.
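Steps 2 and 3 on slide 12 amount to restricting the interaction search to the top-ranked variables and visiting candidate pairs in importance order. A minimal sketch follows; the variable names and importance scores are hypothetical, and ordering pairs by the sum of their importances is one possible reading of "ordered search", not something the slides specify. For scale, restricting to the top 10 of 400 variables leaves 45 pairwise candidates instead of 79,800.

```python
from itertools import combinations

def top_k(importance, k):
    """Names of the k variables with the highest importance score."""
    ranked = sorted(importance.items(), key=lambda kv: kv[1], reverse=True)
    return [name for name, _ in ranked[:k]]

def interaction_candidates(importance, k):
    """Pairs among the top-k variables, most important pairs first."""
    pairs = list(combinations(top_k(importance, k), 2))
    pairs.sort(key=lambda p: importance[p[0]] + importance[p[1]], reverse=True)
    return pairs

def add_interactions(row, pairs):
    """Copy of a record (dict) extended with product terms for each pair."""
    out = dict(row)
    for a, b in pairs:
        out[f"{a}_x_{b}"] = row[a] * row[b]
    return out

# Hypothetical importance scores, e.g. taken from a Random Forest or GBM run.
importance = {
    "balance": 0.40,
    "txn_count": 0.25,
    "tenure": 0.20,
    "closed_flag": 0.10,
    "region_code": 0.05,
}

pairs = interaction_candidates(importance, k=3)
record = {"balance": 2.0, "txn_count": 3.0, "tenure": 5.0,
          "closed_flag": 1.0, "region_code": 7.0}
enhanced = add_interactions(record, pairs)
print(pairs)
print(enhanced)
```

Each candidate product term would then be tested stepwise in the logistic regression and kept only if it improves validation performance, as step 3 describes.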