Population Stability Index (PSI) for the Big Data World
1. Population Stability Index (PSI)
How to apply PSI, a statistic widely used for scorecard validation, to a Big Data problem.
A presentation for the American Statistical Association, Orange County, CA Chapter.
By
JEOMOAN KURIAN
Director-Risk Management Analytics, Mitsubishi UFJ Union Bank
Jeo.kurian@gmail.com
2. Let’s start with a super model
• A logistic regression model that predicts the probability of default.
• At least 3 versions are now active: Ver2 and Ver8 are the most common. Ver9 is the most recent.
What tells us the model needs a revision?
3. Front End Validation-Oversimplified
(1)       (2)      (3)       (4)      (5)       (6)        (7)        (8)      (9)
FICO      # Dev.   # Recent  % Dev.   % Recent  Change     Ratio      WoE      PSI Portion
Range     Sample   Sample    Sample   Sample    (5) - (4)  (5) / (4)  Ln (7)   (6)*(8)
<500      4000     2000      17.2%    11.8%     -5.4%      0.687      -0.376   0.020
500-620   2331     1200      10.0%    7.1%      -2.9%      0.707      -0.347   0.010
621-660   2448     500       10.5%    2.9%      -7.6%      0.280      -1.271   0.096
661-700   2614     3000      11.2%    17.7%     6.5%       1.576      0.455    0.029
701-740   2916     2700      12.5%    15.9%     3.4%       1.271      0.240    0.008
740-780   2241     1900      9.6%     11.2%     1.6%       1.164      0.152    0.002
781-820   2664     2400      11.4%    14.1%     2.7%       1.237      0.213    0.006
820+      4086     3269      17.5%    19.3%     1.7%       1.098      0.094    0.002
TOTAL     23,300   16,969    100%     100%
PSI (Sum of Column 9) = 0.174
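The column arithmetic in the table can be reproduced in a few lines. This is a minimal sketch, not the presenter's code: the function name `psi_portions` and the variable names are made up for illustration, and only the counts in columns (2) and (3) are taken from the slide. Each bin contributes (% recent - % dev) * ln(% recent / % dev), and PSI is the sum of those portions. Recomputed from the printed counts, the total comes to about 0.174.

```python
import math

# Counts per FICO band, copied from columns (2) and (3) of the table.
dev_counts = [4000, 2331, 2448, 2614, 2916, 2241, 2664, 4086]
recent_counts = [2000, 1200, 500, 3000, 2700, 1900, 2400, 3269]


def psi_portions(expected_counts, actual_counts):
    """Return one PSI portion per bin: (actual% - expected%) * ln(actual% / expected%)."""
    expected_total = sum(expected_counts)
    actual_total = sum(actual_counts)
    portions = []
    for expected, actual in zip(expected_counts, actual_counts):
        expected_pct = expected / expected_total   # column (4)
        actual_pct = actual / actual_total         # column (5)
        change = actual_pct - expected_pct         # column (6)
        woe = math.log(actual_pct / expected_pct)  # column (8), Ln of column (7)
        portions.append(change * woe)              # column (9)
    return portions


portions = psi_portions(dev_counts, recent_counts)
psi = sum(portions)
print([round(p, 3) for p in portions])
print(round(psi, 3))
```

Every portion is non-negative, because the change and its logarithm always share the same sign, so a large shift in any single band can only push the total up.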
[Chart: Population Stability Index. % Dev. Sample (Expected) vs. % Recent Sample by FICO range (<500, 500-620, 621-660, 661-700, 701-740, 740-780, 781-820, 820+); vertical axis 0% to 25%]
[Chart: Debt to Income Ratio Distribution. % Dev. Sample vs. % Recent Sample by DTI band (0-10%, 11-20%, 21-30%, 31-40%, 41-50%, 51+%); vertical axis 0% to 40%]
Characteristics Analysis: Let's look at one of the explanatory variables that caused the change.
The development sample had more people with high debt levels. This may be a recession effect.
How does PSI help?
• An early indicator that something has changed compared to a baseline.
• A single statistic that represents a whole set of data.
• 0.1 or less: Little or no change.
• 0.1 to 0.25: Some change that requires close monitoring.
• 0.25 or higher: A major shift that requires review.
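The three bands above translate directly into a small lookup. The sketch below is illustrative only: `classify_psi` and its parameter names are hypothetical, and the handling of the boundary values is an assumption (exactly 0.1 is treated as "little or no change" and exactly 0.25 as a major shift, following the wording "0.1 or less" and "0.25 or higher").

```python
def classify_psi(psi, monitor_above=0.10, review_from=0.25):
    """Map a PSI value onto the three rule-of-thumb bands from the slide.

    Boundary handling is an assumption: 0.10 itself counts as stable,
    0.25 itself counts as a major shift.
    """
    if psi <= monitor_above:
        return "little or no change"
    if psi < review_from:
        return "monitor closely"
    return "major shift, review"


for value in (0.05, 0.10, 0.174, 0.25, 0.40):
    print(value, "->", classify_psi(value))
```

Keeping the cut-offs as parameters makes it easy to tighten or relax them per scorecard without touching the logic.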
4. Marketing Analytics Big Data Question
How to automatically validate a file’s content and detect a bad file?
With multiple marketing channels and disparate data sources, the data scene is messy. Text/XML files often arrive with bad content.
Data sources: Clickstreams, Weblogs, Social Media, Direct Mail/Email, ERP/In-store, CRM/Online, Vendor Files
Pipeline: FILE SOURCE -> ETL/STAGING -> STORAGE/HADOOP
Big Challenge: How to validate each input file for completeness?
The file structure is intact, but the content is not what's expected.
• Bad data affects model outcomes and results in inefficient processes (channel attribution and subsequent spending).
• It's expensive to clean up the data at a later point.
PSI?
• The display ad channel was 20% last time but dropped to 2% this time. ETL does not detect this as a technical problem.
• In-store sales dropped to 20%, compared to 70% last month.
• A reversal of trend is difficult to detect while loading data, but it's important to review such instances before the data is loaded.
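To see why PSI picks up the two shifts mentioned above, it helps to work out the portion each channel would contribute on its own. This is a rough sketch with a hypothetical helper, `psi_portion`; the shares (20% to 2%, 70% to 20%) come from the bullets, and nothing is assumed about the other channels in the file.

```python
import math


def psi_portion(expected_share, actual_share):
    """PSI contribution of one category: (actual - expected) * ln(actual / expected)."""
    return (actual_share - expected_share) * math.log(actual_share / expected_share)


# Display ads: 20% of records last time, 2% this time.
display_ads = psi_portion(0.20, 0.02)

# In-store sales: 70% last month, 20% this month.
in_store = psi_portion(0.70, 0.20)

print(round(display_ads, 3))
print(round(in_store, 3))
```

Both portions are well above 0.25 by themselves. Since every category's portion is non-negative, the PSI for the whole file can only be larger, so either file would trip a 0.25 review threshold even though it loads cleanly.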
5. File validation using PSI: Advertisement channels
[Chart: Web Channel Categories (Adwords, BingAds, Display, Flash Ads, Retarget, Video); series: % Last 3 Months vs. % Recent Month; vertical axis 0% to 50%]
Population Stability Index
No records received from the Adwords sub-channel. This needs a review before we proceed to the data loading step.
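A channel with no records at all is a special case: its recent share is zero, and ln(0) is undefined, so the plain formula cannot be evaluated. One common workaround, shown here as an assumption rather than something taken from the slides, is to floor every share at a small value and to report empty channels explicitly. The function name `psi_with_floor`, the floor of 1e-4, and the record counts below are all made up for illustration.

```python
import math


def psi_with_floor(expected_counts, actual_counts, floor=1e-4):
    """Return (psi, missing_channels) for two dicts of record counts.

    Shares are floored at `floor` (an arbitrary choice) so that a channel
    with zero records yields a large finite portion instead of an error.
    """
    channels = sorted(set(expected_counts) | set(actual_counts))
    expected_total = sum(expected_counts.values())
    actual_total = sum(actual_counts.values())

    # Channels that were present in the baseline but delivered nothing now.
    missing = [
        c for c in channels
        if expected_counts.get(c, 0) > 0 and actual_counts.get(c, 0) == 0
    ]

    psi = 0.0
    for c in channels:
        expected_share = max(expected_counts.get(c, 0) / expected_total, floor)
        actual_share = max(actual_counts.get(c, 0) / actual_total, floor)
        psi += (actual_share - expected_share) * math.log(actual_share / expected_share)
    return psi, missing


# Hypothetical counts, NOT taken from the chart.
last_three_months = {"Adwords": 400, "BingAds": 100, "Display": 200,
                     "Flash Ads": 50, "Retarget": 150, "Video": 100}
recent_month = {"Adwords": 0, "BingAds": 100, "Display": 200,
                "Flash Ads": 50, "Retarget": 150, "Video": 100}

psi, missing = psi_with_floor(last_three_months, recent_month)
print(round(psi, 2), missing)
```

The smaller the floor, the larger the PSI an empty channel produces, so the exact value matters less than the fact that it clears the threshold and that the missing channel is named in the result.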
So how did PSI help?
• Set a threshold, say 0.25, to trigger a possible data issue review.
• PSI provides a statistic to evaluate content quality and compare it with previous months.
• Every significant variance from the expectation leads to a higher PSI number.
• A moving average benchmark self-adjusts for gradual migration from one channel to another.
• A configurable benchmark helps to handle expected scenarios, say, no email channel expected this month.
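The bullets above describe three knobs: a review threshold, a moving average benchmark, and a configurable list of expected exceptions. A minimal sketch of how they might fit together follows. All names (`validate_file`, `moving_average_benchmark`, `not_expected`, `floor`) and the sample counts are hypothetical, and averaging the monthly shares is only one possible reading of "moving average benchmark".

```python
import math


def shares(counts):
    """Convert record counts per channel into fractions of the file total."""
    total = sum(counts.values())
    return {channel: n / total for channel, n in counts.items()}


def moving_average_benchmark(monthly_counts):
    """Average each channel's share over the supplied months."""
    monthly_shares = [shares(month) for month in monthly_counts]
    channels = set().union(*monthly_shares)
    return {
        c: sum(s.get(c, 0.0) for s in monthly_shares) / len(monthly_shares)
        for c in channels
    }


def drop_and_rescale(share_map, skip):
    """Remove channels configured as not expected and rescale the rest to 100%."""
    kept = {c: v for c, v in share_map.items() if c not in skip}
    total = sum(kept.values())
    return {c: v / total for c, v in kept.items()}


def validate_file(history, current_counts, threshold=0.25, window=3,
                  not_expected=(), floor=1e-4):
    """Return (psi, needs_review) for the current file against recent history."""
    benchmark = drop_and_rescale(moving_average_benchmark(history[-window:]), not_expected)
    current = drop_and_rescale(shares(current_counts), not_expected)
    psi = 0.0
    for c in set(benchmark) | set(current):
        expected = max(benchmark.get(c, 0.0), floor)  # floor avoids ln(0)
        actual = max(current.get(c, 0.0), floor)
        psi += (actual - expected) * math.log(actual / expected)
    return psi, psi >= threshold


# Hypothetical history: three similar months, then a file with no email records.
history = [{"Social": 100, "Web": 300, "Email": 100, "Instore": 500}] * 3
current = {"Social": 110, "Web": 330, "Instore": 560}

print(validate_file(history, current))                           # email missing unexpectedly
print(validate_file(history, current, not_expected=("Email",)))  # email absence configured
```

With the default settings the missing email channel pushes PSI past the threshold. Once "Email" is listed as not expected, the remaining shares are compared on their own and the file passes.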
(1)       (2)            (3)        (4)            (5)       (6)        (7)        (8)      (9)
Channel   # Records      # Records  % Prev.        % Recent  Change     Ratio      WoE      PSI Portion
          Prev. 3 Months Recent Mo. 3 Months       Month     (5) - (4)  (5) / (4)  Ln (7)   (6)*(8)
Social    8,000          2,000      10.1%          12.8%     2.7%       1.266      0.236    0.006
Web       24,000         1,600      30.4%          10.3%     -20.1%     0.338      -1.086   0.219
Email     4,000          1,000      5.1%           6.4%      1.3%       1.266      0.236    0.003
Print     3,000          1,000      3.8%           6.4%      2.6%       1.688      0.524    0.014
Instore   40,000         10,000     50.6%          64.1%     13.5%      1.266      0.236    0.032
TOTAL     79,000         15,600     100%           100%
PSI (Sum of Column 9) = 0.273
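The same calculation applied to the channel table also shows which channel to investigate first. The sketch below uses the record counts from columns (2) and (3); the function name `psi_by_channel` and the ranking step are illustrative additions, not part of the slides. Recomputed from the counts, the total is roughly 0.273, above the 0.25 review threshold.

```python
import math

# Record counts copied from columns (2) and (3) of the table.
previous_three_months = {"Social": 8000, "Web": 24000, "Email": 4000,
                         "Print": 3000, "Instore": 40000}
recent_month = {"Social": 2000, "Web": 1600, "Email": 1000,
                "Print": 1000, "Instore": 10000}


def psi_by_channel(expected, actual):
    """Return each channel's PSI portion, keyed by channel name."""
    expected_total = sum(expected.values())
    actual_total = sum(actual.values())
    portions = {}
    for channel in expected:
        expected_share = expected[channel] / expected_total
        actual_share = actual[channel] / actual_total
        portions[channel] = (actual_share - expected_share) * math.log(
            actual_share / expected_share
        )
    return portions


contributions = psi_by_channel(previous_three_months, recent_month)
total_psi = sum(contributions.values())

# Largest contributors first: these are the channels to drill into.
ranked = sorted(contributions, key=contributions.get, reverse=True)
for channel in ranked:
    print(f"{channel:8s} {contributions[channel]:.3f}")
print(f"PSI      {total_psi:.3f}")
```

Web accounts for most of the total, which is why the characteristics analysis looks inside the web channel next.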
[Chart: share of records by channel (Social, Web, Email, Print, Instore); series: % Prev. 3 Months (Expected) vs. % Recent Month; vertical axis 0% to 70%]
Characteristics Analysis: Let's look at what in the web channel caused the change.