Profile detection, applications in healthcare and econometrics (Geissler)

7 July 2014
Christophe Geissler,
Quinten,
IAF.
PROFILE DETECTION:
APPLICATIONS IN HEALTHCARE AND
ECONOMETRICS

OUTLINE
1) QUINTEN IN SHORT
2) LIFE SCIENCES AND PREDICTION
3) CASE STUDIES
4) COMPARISON OF METHODS
5) RESEARCH TOPICS
Tribute to AK, ~780-850

QUINTEN IN SHORT
A company providing data-oriented strategic advisory.
Since 2008, over 100 missions for more than 25 clients
Historical focus on Life Sciences and Healthcare
Now extending to CRM, Insurance and Investment
18 employees, self-financed, annual average growth of 40%
80% of the revenue reinvested in R&D each year, including a proprietary learning
technology
Active member of several technology clusters: Medicen

THE HEALTHCARE SECTOR AS A DRIVER OF ADVANCED
ALGORITHMS?
The prediction/classification needs in life sciences have evolved.
Huge increase of available variables
Limited size of samples (often < 1000) for economic reasons
These needs are not fully met by predictive approaches.
Need for evidence-based methods
Trade-off between predictive power and auditability of recommendations
The exponential increase in computation capacity opens the way for exploration-based
methods,
with an increasing risk of overfitting the data.
Correlation with similar trends in CRM.
Customer profiling: data gathering is key.

ALGORITHMIC NEEDS IN EPIDEMIOLOGICAL STUDIES
Databases have large sets of variables (#V >> #Obs)
Practitioners often wish to get rid of a priori selection (or hierarchization) of
variables
Poor tractability by most kinds of regression models
Using ‘sparsity’, i.e. penalizing complexity in order to simplify the model, does not fully solve the
problem
Leaving the Cartesian paradigm: a single (very complex) function globally driving all
visible phenomena
For a heuristic approach: accepting that multiple, local, partially
correlated causes are to be discovered: the ‘profiles’.
Interpretability of the profiles and descriptive parsimony are mandatory: no black-box
or randomized results.

PREDICTION VS DESCRIPTION IN SUPERVISED
METHODS
Supervised problems, i.e. problems where the training data are ‘labeled’ by a variable Y to be
explained. Y is the ‘phenomenon of interest’.
Y can be a boolean (treatment outcome) or a continuous variable (loss amount, etc.).
Explanatory variables $X = (X_i)_{i=1..V} \in \mathbb{R}^V$, continuous or discrete, possibly with missing values.
Predictor: a function $\hat{Y} = F(X)$, $F : \mathbb{R}^V \to \mathrm{Dom}(Y)$, satisfying $\mathrm{Var}(\hat{Y} - Y \mid X) < \mathrm{Var}(Y)$.
Explanatory power: capacity to ‘simply’ describe the sets $F^{-1}([s, 1])$, i.e. answering the
question ‘Who are the strong responders?’
Simplicity can be formalized; any formalization involves the number of variables used in the
predictors.
Simplicity is key when targeting large sets of ‘new’ individuals (not in the training sample).

THE PREDICTIVE VS EXPLANATORY TRADE-OFF
Problem: ‘nicely’ separating red from blue points in $\mathbb{R}^2$.
Dark colors in the training sample, light colors in the test sample.

THE PREDICTIVE VS EXPLANATORY TRADE-OFF
Running four prediction techniques on the previous set.
Colored areas depend on the predicted value.
How many words are needed to describe the dark shaded areas?
The poor response of linear separators (SVM) indicates that more dimensions could be
needed in order to improve the description.

PROFILE SEARCH VS DECISION TREES
Decision trees look for optimal cut-offs on explanatory variables: they partition space into non-overlapping regions.
Profile search allows for some controlled degree of intersection.
Toy database with a phenomenon taking place on two overlapping rectangles on variables a and b, hidden
among 250 random variables.
CART response: up to 14 levels to partition space.
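A minimal sketch of such a toy database, to make the overlap concrete. The rectangle coordinates, the sample size of 2000 and the uniform distributions are assumptions chosen for illustration; the slides only state that the phenomenon lives on two overlapping rectangles on a and b, hidden among 250 random variables.

```python
import random

# Two overlapping rectangles on variables a and b: (a_min, a_max, b_min, b_max).
# The coordinates are hypothetical, chosen so that the rectangles intersect.
RECT_1 = (0.2, 0.6, 0.2, 0.6)
RECT_2 = (0.4, 0.8, 0.4, 0.8)

def in_rect(a, b, rect):
    """Return True if the point (a, b) lies inside the rectangle."""
    a_min, a_max, b_min, b_max = rect
    return a_min <= a <= a_max and b_min <= b <= b_max

def make_toy_database(n_obs=2000, n_noise=250, seed=0):
    """Rows of (a, b, noise variables, y) with y = 1 inside either rectangle."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n_obs):
        a, b = rng.random(), rng.random()
        noise = [rng.random() for _ in range(n_noise)]  # irrelevant variables
        y = int(in_rect(a, b, RECT_1) or in_rect(a, b, RECT_2))
        rows.append((a, b, noise, y))
    return rows

rows = make_toy_database()
in_1 = [in_rect(row[0], row[1], RECT_1) for row in rows]
in_2 = [in_rect(row[0], row[1], RECT_2) for row in rows]
n_both = sum(p and q for p, q in zip(in_1, in_2))
n_either = sum(p or q for p, q in zip(in_1, in_2))
n_positive = sum(row[3] for row in rows)

print(f"profile 1: {sum(in_1)} rows, profile 2: {sum(in_2)} rows, overlap: {n_both}")
print(f"union of the two profiles: {n_either} rows, positives: {n_positive}")
```

Two profiles with two conditions each describe the phenomenon exactly, and the rows in the overlap belong to both. A tree has to cut that same region into disjoint leaves, which is one plausible reading of why the CART description gets deeper.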

USE CASE IN HEALTHCARE
CLUSTERING: AN UNSUPERVISED APPROACH
Database: 2000 patients / 1000 variables
10% experienced Adverse Event X (AEX) (200 patients)
Legend: patient without Adverse Event X / patient with Adverse Event X
Method: Singular Value Decomposition (SVD), clustering (PCA, K-Means ...)
Typology 1: 507 patients, 6.4% AEX
Typology 2: 507 patients, 10% AEX
Typology 3: 808 patients, 13% AEX
Are there various typologies of patients in this database?
Do these typologies show any deviations with regard to Adverse Event X?
Are these differences important enough to avoid treating some typologies?

ASSOCIATIVE RULES DISCOVERY: QFINDER ALGORITHM
Identification and characterization of singular profiles
Database: 2000 patients / 1000 variables
10% experienced Adverse Event X (200 patients)
Legend: patient without Adverse Event X / patient with Adverse Event X
Data processing (QFinder)
What are the various profiles of patients with the highest risk of Adverse Event X?
What are the key characteristics of each of these profiles?
How to prevent Adverse Event X?
Profile: Age > 56, Average Daily Dose = High, Treatment duration > 50 days. 126 patients, 47% Adverse Event X
Profile: Gender = female, Diabetes = Yes, Menopause = Yes. 108 patients, 60% Adverse Event X
Profile: Blood Pressure = High, Dyslipidemia = Yes. 59 patients, 75% Adverse Event X
Interpretable and actionable results
Optimality of recommendations

MANY CRITERIA HAVE LITTLE OR NO INFLUENCE
EXAMPLE OF PROFILE
Detection of mutually influential factors not seen by regressions
Database size: 2000 patients (100%). Average rate of adverse events: 10% (90% without).
AGE > 56. Size: 739 patients (37%). Adverse events: 11% (89% without).
TREATMENT DURATION > 50 days. Size: 936 patients (47%). Adverse events: 8% (92% without).
AVERAGE DAILY DOSE: HIGH. Size: 647 patients (32%). Adverse events: 13% (87% without).
HOWEVER Q-FINDER WAS ABLE TO DETECT THEIR COMBINED
INFLUENCE WHEN RELEVANT
Patients matching the following characteristics:
AGE > 56
TREATMENT DURATION > 50 days
AVERAGE DAILY DOSE: HIGH
are 4.7 times more likely to trigger adverse events.
Profile size: 126 patients (6.3%). Adverse events: 47% (53% without).
ACTION: AVOID THE HIGH DOSE ON PATIENTS > 56 TREATED > 50 DAYS
AVOID TREATING MORE THAN 50 DAYS PATIENTS > 56 WITH THE HIGH DOSE
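The figures on this slide can be checked with a short calculation. The sketch below uses only the numbers quoted above; the names `lift` and `single_criteria` are introduced here for illustration and are not part of Q-Finder.

```python
def lift(rate_in_subgroup, base_rate):
    """Ratio of the adverse-event rate in a subgroup to the rate in the whole database."""
    return rate_in_subgroup / base_rate

n_total = 2000      # database size
base_rate = 0.10    # average rate of adverse events

# Each criterion taken alone: adverse-event rate as quoted on the slide.
single_criteria = {
    "AGE > 56": 0.11,
    "TREATMENT DURATION > 50 days": 0.08,
    "AVERAGE DAILY DOSE: HIGH": 0.13,
}
for name, rate in single_criteria.items():
    print(f"{name}: lift = {lift(rate, base_rate):.1f}")

# The three criteria combined into one profile.
profile_size = 126
profile_rate = 0.47
coverage = profile_size / n_total
profile_lift = lift(profile_rate, base_rate)
print(f"profile: coverage = {coverage:.1%}, lift = {profile_lift:.1f}")
```

Taken alone, each criterion stays close to a lift of 1, while the combination reaches 4.7 on 6.3% of the database, which is the contrast the slide is making.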

USING PROFILE DETECTION IN INVESTMENT
Using machine learning for the detection of recurrent biases in the returns of the main asset
classes (interest rates, equity indices, currencies).
Empirical facts:
Financial markets are interaction hubs for investors with a huge diversity of horizons and
risk aversion.
Fluctuations can therefore be caused by a large number of potential factors.
The influence of these factors is not uniform through time.
GLM-type approaches are too difficult to calibrate and yield unstable results.
Retained approach:
Search for significant profiles, characterized by conditions on a limited number of variables.
Profiles may partially overlap. No predefined hierarchy on the variables.
Creating derived variables from primary variables: stationarity and variety.

Example:
• $Y(t) = \Delta\mathrm{Bund}(\text{1 month}) / \mathrm{stdev}(\Delta\mathrm{Bund}(\text{1 month}))$
• 250 explanatory variables:
• Eurozone and US economic indicators
• Interest rate levels and dynamics
• Central bank money data
• Inflation expectations (inflation swaps)
• Risk premia on equity markets
• Energy prices
• Volatilities, correlations
• Training period: 1999-2013.
$\mathrm{Average}(Y(t), \text{training period}) = +0.15\,\sigma$
[Figure: $\Delta\mathrm{Bund} = f(t)$, 19991101 to 20120919; vertical axis from -15 to 10]
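A minimal sketch of how the target Y(t) of this example could be built: the 1-month change in the Bund divided by the standard deviation of those changes. The series is a synthetic random walk. The 20-day horizon standing for one month, the use of simple differences, and the function name `standardized_changes` are all assumptions made for illustration.

```python
import random
import statistics

def standardized_changes(series, horizon=20):
    """Changes over `horizon` steps, divided by their sample standard deviation."""
    changes = [series[t] - series[t - horizon] for t in range(horizon, len(series))]
    scale = statistics.stdev(changes)
    return [change / scale for change in changes]

# Synthetic daily series (a random walk), standing in for the Bund.
rng = random.Random(1)
level = 100.0
series = []
for _ in range(1000):
    level += rng.gauss(0.0, 0.3)
    series.append(level)

y = standardized_changes(series)
print(f"{len(y)} values, mean {statistics.mean(y):+.2f}, stdev {statistics.stdev(y):.2f}")
```

Because Y is expressed in units of its own standard deviation, averages such as the +0.15 quoted for the training period read directly as fractions of one standard deviation.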

Stylized fact 1: Sharp drop in German equities → increase in risk aversion → rise in
German Govt Bonds.
Validating the hypothesis:
$X_1 = \mathrm{Decile}(\Delta(\text{E/P ratio}(\text{DAX}) - \text{Bobl yield}))$.
Interpretation: 3-month variation in the German equity risk
premium
$r = \mathrm{Correlation}(X_1, Y) = 9\%$, $R^2 = 0.8\%$
Decile analysis: $E(Y \mid X_1)$
Non-linearity
General trend consistent with intuition
[Figure: $E(\Delta\mathrm{Bund}) = f(\Delta\,\text{DAX risk premium})$, by decile band 0 - 2, 1 - 3, ..., 8 - 10; vertical axis from -0.3 to 0.6]
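A minimal sketch of a decile analysis of this kind, on synthetic data. The banding convention is an assumption: a band labeled "low - high" is read as covering decile indices low to high - 1, which gives the nine bands 0 - 2 to 8 - 10 seen on the chart. The function names and the synthetic relation between x and y are hypothetical.

```python
import random
from statistics import mean

def decile_ranks(values):
    """Map each value to a decile index 0..9 according to its rank in the sample."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for position, i in enumerate(order):
        ranks[i] = position * 10 // len(values)
    return ranks

def banded_conditional_means(deciles, y, width=2):
    """Mean of y over overlapping decile bands [low, low + width), low = 0..10 - width."""
    means = {}
    for low in range(0, 10 - width + 1):
        band = [yi for d, yi in zip(deciles, y) if low <= d < low + width]
        means[(low, low + width)] = mean(band)
    return means

# Synthetic data (an assumption, not the market data of the slides).
rng = random.Random(42)
x = [rng.gauss(0.0, 1.0) for _ in range(2000)]
y = [0.5 * xi + rng.gauss(0.0, 1.0) for xi in x]

x_deciles = decile_ranks(x)
curve = banded_conditional_means(x_deciles, y)
for (low, high), value in curve.items():
    print(f"deciles {low} - {high}: E(Y | X) = {value:+.2f}")
```

Reading the conditional mean band by band is what lets a non-linear shape show up even when the overall correlation is small.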

Stylized fact 2: Growth acceleration in monetary aggregates → future rise in inflation →
loss in Govt Bonds.
Hypothesis validation:
$X_2 = \mathrm{Decile}(\Delta\mathrm{M3}(\text{3 months}))$.
$r = \mathrm{Correlation}(X_2, Y) = -1.5\%$, $R^2 = 0.4\%$.
Decile analysis:
Non-linearity
General trend consistent with intuition
[Figure: $E(\Delta\mathrm{Bund}) = f(\Delta\mathrm{M3})$, by decile band 0 - 3, 1 - 4, ..., 8 - 11; vertical axis from -0.2 to 0.5]

When: $X_1 \geq 5$, i.e. $\Delta$(DAX risk premium) > 5th decile
AND
$X_2 \in [2, 6]$, i.e. $\Delta$(M3) between 2nd and 6th decile
Then:
$E(Y \mid X_1, X_2) = +0.83\,\sigma$, true on 21.5% of observations between 1999 and 2012.
These conditions form a market profile. Information ratio: $0.83 \times (21.5\% \times 260/20)^{0.5} = 1.05$
Strong synergy between variables: +90% increase in the conditional expectation of Bund performance.
[Figure: combined influence of the two variables: conditional expectation, from -1.5 σ to 1.5 σ in bands of 0.5 σ, by decile (0 to 9) of each variable]
[Figure: historical occurrences of the profile (0/1 indicator) over time, 19991101 to 20130125]
113 independent occurrences in 14 years
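A minimal sketch of how a two-variable profile of this kind can be evaluated: select the observations that satisfy both conditions, then report their share of the sample and the conditional mean of Y. The data are synthetic, with independent decile values and an artificial effect inside the profile, so the numbers will not match the slide. `evaluate_profile` is a hypothetical helper, not the Q-Finder implementation.

```python
import random
from statistics import mean

def evaluate_profile(x1, x2, y):
    """Coverage and conditional mean of y for the profile X1 >= 5 and 2 <= X2 <= 6."""
    selected = [yi for a, b, yi in zip(x1, x2, y) if a >= 5 and 2 <= b <= 6]
    return len(selected) / len(y), mean(selected)

# Synthetic sample (an assumption): decile indices 0..9 drawn independently,
# with y shifted upward only when both conditions hold.
rng = random.Random(7)
n = 5000
x1 = [rng.randrange(10) for _ in range(n)]
x2 = [rng.randrange(10) for _ in range(n)]
y = [
    rng.gauss(0.15, 1.0) + (0.7 if a >= 5 and 2 <= b <= 6 else 0.0)
    for a, b in zip(x1, x2)
]

coverage, conditional_mean = evaluate_profile(x1, x2, y)
print(f"coverage: {coverage:.1%}")
print(f"E(Y | profile) = {conditional_mean:+.2f}, E(Y) = {mean(y):+.2f}")
```

Coverage and conditional mean are the two quantities the slide combines into its information ratio; in practice the conditions would be searched for rather than fixed in advance.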

MANAGING THE RISK OF OVERFITTING
Parameter | Role | Influence on overfitting risk
P | Size of training sample | P↑: risk↓
ρ | Average coding compression rate of variables (= #modalities / P) | ρ↓: risk↓
y | Proportion of 1’s in dependent variable | y↑: risk↓
k | Maximum profile complexity | k↓: risk↓
V | Total number of variables | V↓: risk↓
ε | Maximum admissible probability of finding any configuration by random search | ε↓: risk↓
[Figure: maximum number of profiles as a function of the coding compression of variables, for #V = 1, 2, 3, 4; vertical axis from 0 to 60]

RISK AND REWARDS OF
COMBINATORIAL EXPLORATION
No preselection of variables, no hierarchy, localized search: more freedom is
granted
No free lunch: computation time increases (linear in #Obs, polynomial in #V)
But parallel computation and cloud computing are well suited to the task
Risk of overfitting must be carefully controlled
The richness of the descriptive language must be kept at a parsimonious level
in order to prevent ‘nugget-fishing’: interesting maths behind the scenes.
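A rough illustration of why the cost is polynomial in #V: if a profile involves at most k variables, the number of variable subsets to explore is a sum of binomial coefficients, which grows like V^k for fixed k. This counts subsets only and ignores the choice of thresholds on each variable; the values V = 250 and k up to 3 are used as an example, and the function name is hypothetical.

```python
from math import comb

def n_variable_subsets(n_variables, max_complexity):
    """Number of non-empty variable subsets of size at most max_complexity."""
    return sum(comb(n_variables, k) for k in range(1, max_complexity + 1))

for k in (1, 2, 3):
    print(f"V = 250, k <= {k}: {n_variable_subsets(250, k):,} subsets")
```

Scoring one candidate requires a pass over the observations, which is consistent with a cost linear in #Obs. The candidates are independent of one another, so the search can be split across parallel workers.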

CURRENT RESEARCH AREAS
Improving the dynamic aggregation of predictors:
Using prediction as a topology on data: COBRA algorithm (G. Biau, B. Guedj).
Weighting schemes based on regret (Lugosi, Stoltz) or regularity (Wintenberger).
Embedding time stationarity requirements in profile search.
Incremental production of backtests.
Visualization of an audit trail between variables and final recommendations.
GPU calculations
…

CONTACT
11, rue Galvani 75017 Paris, France
+33 (0)1 45 74 33 05
http://www.quinten-france.com
@QuintenFrance
Christophe GEISSLER
33 (0)6 08 60 46 14
c.geissler@quinten-france.com
