SlideShare a Scribd company logo
1 of 49
Download to read offline
Introduction Multilevel Distribution
Distributed multilevel matrix completion for
medical databases
Julie Josse
Ecole Polytechnique, INRIA
Women in Machine Learning & Data Science Meetup
January 25, 2018
1 / 33
Introduction Multilevel Distribution
Overview
1 Introduction
2 Single imputation for mixed multilevel data
3 Distribution
2 / 33
Introduction Multilevel Distribution
Outline
1 Introduction
2 Single imputation for mixed multilevel data
3 Distribution
3 / 33
Introduction Multilevel Distribution
Research activities
• Dimensionality reduction methods to visualize complex data
(PCA based): multi-sources, textual, arrays, questionnaire
• Missing values - matrix completion
• Low rank estimation, selection of regularization parameters
• Fields of application: bio-sciences (agronomy, sensory
analysis), health data (hospital data)
• R community: book R for Statistics, R foundation, R
taskforce Forwards, JSS papers and R packages:
FactoMineR explore continuous, categorical, multiple contingency
tables (correspondence analysis), combine clustering and PC, ..
MissMDA for single and multiple imputation, PCA with missing
denoiseR to denoise data
4 / 33
Introduction Multilevel Distribution
Collaborators
Imputation of mixed data with multilevel SVD
Distributed computation
Genevieve Robin, François Husson, Balasubramanian Narasimhan
Polytechnique, Agrocampus, Stanford (stat/biomedical)
5 / 33
Introduction Multilevel Distribution
Public Assistance - Paris Hospitals
Traumabase: 15000 patients/ 250 variables/
Center Accident Age Sex Weight Height BMI BP SBP
1 Beaujon Fall 54 m 85 NR NR 180 110
2 Lille Other 33 m 80 1.8 24.69 130 62
3 Pitie Salpetriere Gun 26 m NR NR NR 131 62
4 Beaujon AVP moto 63 m 80 1.8 24.69 145 89
6 Pitie Salpetriere AVP bicycle 33 m 75 NR NR 104 86
7 Pitie Salpetriere AVP pedestrian 30 w NR NR NR 107 66
9 HEGP White weapon 16 m 98 1.92 26.58 118 54
10 Toulon White weapon 20 m NR NR NR 124 73
11 Bicetre Fall 61 m 84 1.7 29.07 144 105
...................
SpO2 Temperature Lactates Hb Glasgow Transfusion ...........
1 97 35.6 <NA> 12.7 12 yes
2 100 36.5 4.8 11.1 15 no
3 100 36 3.9 11.4 3 no
4 100 36.7 1.66 13 15 yes
6 100 36 NM 14.4 15 no
7 100 36.6 NM 14.3 15 yes
9 100 37.5 13 15.9 15 yes
10 100 36.9 NM 13.7 15 no
11 100 36.6 1.2 14.2 14 no
............
⇒ Predict the Glasgow score, whether to start a blood transfusion,
to administer fresh frozen plasma, etc...
⇒ (Logistic) regressions with categorical/continuous values
6 / 33
Introduction Multilevel Distribution
Public Assistance - Paris Hospitals
Traumabase: 15000 patients/ 250 variables/ 6 hospitals
Center Accident Age Sex Weight Height BMI BP SBP
1 Beaujon Fall 54 m 85 NR NR 180 110
2 Lille Other 33 m 80 1.8 24.69 130 62
3 Pitie Salpetriere Gun 26 m NR NR NR 131 62
4 Beaujon AVP moto 63 m 80 1.8 24.69 145 89
6 Pitie Salpetriere AVP bicycle 33 m 75 NR NR 104 86
7 Pitie Salpetriere AVP pedestrian 30 w NR NR NR 107 66
9 HEGP White weapon 16 m 98 1.92 26.58 118 54
10 Toulon White weapon 20 m NR NR NR 124 73
11 Bicetre Fall 61 m 84 1.7 29.07 144 105
...................
SpO2 Temperature Lactates Hb Glasgow Transfusion ...........
1 97 35.6 <NA> 12.7 12 yes
2 100 36.5 4.8 11.1 15 no
3 100 36 3.9 11.4 3 no
4 100 36.7 1.66 13 15 yes
6 100 36 NM 14.4 15 no
7 100 36.6 NM 14.3 15 yes
9 100 37.5 13 15.9 15 yes
10 100 36.9 NM 13.7 15 no
11 100 36.6 1.2 14.2 14 no
............
⇒ Predict the Glasgow score, whether to start a blood transfusion,
to administer fresh frozen plasma, etc...
⇒ (Logistic) regressions with missing categorical/continuous values
6 / 33
Introduction Multilevel Distribution
Public Assistance - Paris Hospitals
⇒ 6 hospitals: multilevel structure (patients within hospital)
Hopital effect: lack of standardization
⇒ Missing values
Center Accident Age Sex Weight Height BMI BP SBP
1 Beaujon Fall 54 m 85 NR NR 180 110
2 Lille Other 33 m 80 1.8 24.69 130 62
3 Pitie Salpetriere Gun 26 m NR NR NR 131 62
4 Beaujon AVP moto 63 m 80 1.8 24.69 145 89
6 Pitie Salpetriere AVP bicycle 33 m 75 NR NR 104 86
7 Pitie Salpetriere AVP pedestrian 30 w NR NR NR 107 66
9 HEGP White weapon 16 m 98 1.92 26.58 118 54
...................
SpO2 Temperature Lactates Hb Glasgow Transfusion ...........
1 97 35.6 <NA> 12.7 12 yes
2 100 36.5 4.8 11.1 15 no
3 100 36 3.9 11.4 3 no
4 100 36.7 1.66 13 15 yes
6 100 36 NM 14.4 15 no
7 100 36.6 NM 14.3 15 yes
9 100 37.5 13 15.9 15 yes
1) Handling missing values in multilevel data
7 / 33
Introduction Multilevel Distribution
Public Assistance - Paris Hospitals
⇒ 6 hospitals: multilevel structure (patients within hospital)
Hopital effect: lack of standardization
2) Data not aggregated but which stay on each site
⇒ Missing values
Center Accident Age Sex Weight Height BMI BP SBP
1 Beaujon Fall 54 m 85 NR NR 180 110
2 Lille Other 33 m 80 1.8 24.69 130 62
3 Pitie Salpetriere Gun 26 m NR NR NR 131 62
4 Beaujon AVP moto 63 m 80 1.8 24.69 145 89
6 Pitie Salpetriere AVP bicycle 33 m 75 NR NR 104 86
7 Pitie Salpetriere AVP pedestrian 30 w NR NR NR 107 66
9 HEGP White weapon 16 m 98 1.92 26.58 118 54
...................
SpO2 Temperature Lactates Hb Glasgow Transfusion ...........
1 97 35.6 <NA> 12.7 12 yes
2 100 36.5 4.8 11.1 15 no
3 100 36 3.9 11.4 3 no
4 100 36.7 1.66 13 15 yes
6 100 36 NM 14.4 15 no
7 100 36.6 NM 14.3 15 yes
9 100 37.5 13 15.9 15 yes
1) Handling missing values in multilevel data
7 / 33
Introduction Multilevel Distribution
Missing values
are everywhere: unanswered questions in a survey, lost
data, damaged plants, machines that fail...
The best thing to do with missing values is not to have
any” Gertrude Mary Cox.
⇒ Still an issue with "big data"
Data integration: data from different sources
i
I
1
1
k1
1
1
kI
ki
1 J
Xi
XI
X1
?
?
?
?
?
?
?
?
?
?
?
?
?
?
Multilevel structure: sporadically - systematic missing values (one
variable missing in one hospital) 8 / 33
Introduction Multilevel Distribution
Solutions to handle missing values
Litterature: Schaefer (2002); Little & Rubin (2002); Gelman & Meng (2004); Kim &
Shao (2013); Carpenter & Kenward (2013); van Buuren (2015)
⇒ Modify the estimation process to deal with missing values.
Maximum likelihood: EM algorithm to obtain point estimates +
Supplemented EM (Meng & Rubin, 1991) ; Louis for their variability
One specific algorithm for each statistical method...
⇒ Imputation (multiple) to get a completed data set on which you
can perform any statistical method (Rubin, 1976)
Famous imputation based on SVD (PCA) - quantitative
9 / 33
Introduction Multilevel Distribution
Solutions to handle missing values
Litterature: Schaefer (2002); Little & Rubin (2002); Gelman & Meng (2004); Kim &
Shao (2013); Carpenter & Kenward (2013); van Buuren (2015)
⇒ Modify the estimation process to deal with missing values.
Maximum likelihood: EM algorithm to obtain point estimates +
Supplemented EM (Meng & Rubin, 1991) ; Louis for their variability
One specific algorithm for each statistical method...
⇒ Imputation (multiple) to get a completed data set on which you
can perform any statistical method (Rubin, 1976)
Famous imputation based on SVD (PCA) - quantitative
Extension to imputation with multilevel SVD + categorical
9 / 33
Introduction Multilevel Distribution
Outline
1 Introduction
2 Single imputation for mixed multilevel data
3 Distribution
10 / 33
Introduction Multilevel Distribution
PCA reconstruction
-2.00 -2.74
-1.56 -0.77
-1.11 -1.59
-0.67 -1.13
-0.22 -1.22
0.22 -0.52
0.67 1.46
1.11 0.63
1.56 1.10
2.00 1.00
-2.16 -2.58
-0.96 -1.35
-1.15 -1.55
-0.70 -1.09
-0.53 -0.92
0.04 -0.34
1.24 0.89
1.05 0.69
1.50 1.15
1.67 1.33
X
-3 -2 -1 0 1 2 3
-3-2-10123
x1
x2
^μ X F ^μ
V'
≈
⇒ Minimizes distance between observations and their projection
⇒ Approx Xn×p with a low rank matrix k < p A 2
2 = tr(AA ):
argminµ X − µ 2
2 : rank (µ) ≤ k
SVD X: ˆµPCA = Un×kDk×kVp×k
= Fn×kVp×k
F = UD PC - scores
V principal axes - loadings
11 / 33
Introduction Multilevel Distribution
PCA reconstruction
-2.00 -2.74
NA -0.77
-1.11 -1.59
-0.67 -1.13
-0.22 NA
0.22 -0.52
0.67 1.46
NA 0.63
1.56 1.10
2.00 1.00
-2.16 -2.58
-0.96 -1.35
-1.15 -1.55
-0.70 -1.09
-0.53 -0.92
0.04 -0.34
1.24 0.89
1.05 0.69
1.50 1.15
1.67 1.33
X
-3 -2 -1 0 1 2 3
-3-2-10123
x1
x2
^μ X F ^μ
V'
≈
⇒ Minimizes distance between observations and their projection
⇒ Approx Xn×p with a low rank matrix k < p A 2
2 = tr(AA ):
argminµ X − µ 2
2 : rank (µ) ≤ k
SVD X: ˆµPCA = Un×kDk×kVp×k
= Fn×kVp×k
F = UD PC - scores
V principal axes - loadings
11 / 33
Introduction Multilevel Distribution
Missing values in PCA
⇒ PCA: least squares
argminµ Xn×p − µn×p
2
2 : rank (µ) ≤ k
⇒ PCA with missing values: weighted least squares
argminµ Wn×p (X − µ) 2
2 : rank (µ) ≤ k
with Wij = 0 if Xij is missing, Wij = 1 otherwise; elementwise
multiplication
Many algorithms:
Gabriel & Zamir, 1979: weighted alternating least squares (without
explicit imputation)
Kiers, 1997: iterative PCA (with imputation)
12 / 33
Introduction Multilevel Distribution
Iterative PCA
1 initialization = 0: X0 (mean imputation)
2 step :
(a) PCA on the completed data → (U , D , V ); k dim kept
(b) (ˆµ )k
= U D V X = W X + (1 − W ) ˆµ
3 steps of estimation and imputation are repeated
13 / 33
Introduction Multilevel Distribution
Iterative PCA
1 initialization = 0: X0 (mean imputation)
2 step :
(a) PCA on the completed data → (U , D , V ); k dim kept
(b) (ˆµ )k
= U D V X = W X + (1 − W ) ˆµ
3 steps of estimation and imputation are repeated
⇒ ˆµ from incomplete data: EM algo: X = FV + ε
X = µ + ε, εij
iid
∼ N 0, σ2 with µ of low rank
⇒ Completed data: good imputation (matrix completion, Netflix)
(Udell & Townsend Nice Latent Variable Models Have Log-Rank, 2017)
Selecting k? Generalized cross-validation (Josse & Husson, 2012)
13 / 33
Introduction Multilevel Distribution
Iterative PCA
1 initialization = 0: X0 (mean imputation)
2 step :
(a) PCA on the completed data → (U , D , V ); k dim kept
(b) (ˆµ )k
= U D V X = W X + (1 − W ) ˆµ
3 steps of estimation and imputation are repeated
⇒ ˆµ from incomplete data: EM algo: X = FV + ε
X = µ + ε, εij
iid
∼ N 0, σ2 with µ of low rank
⇒ Completed data: good imputation (matrix completion, Netflix)
(Udell & Townsend Nice Latent Variable Models Have Log-Rank, 2017)
Selecting k? Generalized cross-validation (Josse & Husson, 2012)
13 / 33
Introduction Multilevel Distribution
Multilevel component analysis
Ex: inhabitants nested within countries X ∈ RK×J
• similarities between countries? level 1
• similarities between inhabitants within each
country? level 2
• relationship between variables at each level
i
I
1
1
k1
1
1
kI
ki
1 J
Xi
XI
X1
xijki = x.j. + (xij. − x.j.) + (xijki − xij.)
Between + Within
Analysis of variance: split the sum of squares for each variable j
I
i=1
ki
k=1
(xijki )2
=
I
i=1
ki (x.j.)2
+
I
i=1
ki (xij. −x.j.)2
+
I
i=1
ki
k=1
(xijki −xij.)2
14 / 33
Introduction Multilevel Distribution
Multilevel PCA MLPCA
⇒ Model for the between and within part i = 1, ..., I groups, J var
Xi(ki ×J)
= 1ki m + 1ki Fb
i V b + Fw
i V w + Ei
• Fb
i (Qb × 1) between component scores of group i
• V b (J × Qb) between loading matrix
• Fw
i (ki × Qw ) within component scores of group i
• Vw (J × Qw ) within loading matrix. Constant across groups
Fitted by solving the least squares (Timmerman, 2006)
argmin F(m, Fb
i , V b
, Fw
i , V w
) =
I
i=1
Xi − 1ki m − 1ki Fb
i V b
− Fw
i V w
2
I
i=1 ki Fb
i = 0Qb
and 1ki
Fw
i = 0Qw , ∀i for identifiability.
15 / 33
Introduction Multilevel Distribution
MLPCA - quantitative data
i = 1, ..., I groups, J var, ki nb obs in group i
⇒ Estimation: minimize the RSS
argmin F() =
I
i=1 Xi − 1ki m − 1ki Fb
i V b
− Fw
i V w
2
,
I
i=1 ki Fb
i = 0Qb
and 1ki
Fw
i = 0Qw
, ∀i for identifiability.
(ˆFb
, ˆV b
): Weigthed PCA on the between part: SVD on WXm; Xm (I × J)
the means of the variables per group, W (I × I) wii =
√
ki
(ˆFw , ˆV w ) PCA on the within part: SVD on the centered data per
group Xw (K × J), K = i
ki
⇒ With missing values: Weighted Least Squares
⇒ Iterative imputation algorithm (imputation - estimation)
16 / 33
Introduction Multilevel Distribution
Iterative MLPCA
2. iteration : estimation of the between structure
• SVD WXm = PDQ ; Qb eigenvectors are kept:
ˆFb
i = [W −1
PQb ]i , ˆFb
concatenation by row of [1ki
ˆFb
i ]
ˆV b
= QQb DQb , (J × Qb)
• the between hat matrix is computed: (ˆXb
) = ˆFb ˆV b
3. iteration : imputation of the missing values with the fitted values
• ˆX = 1K ˆm( −1)
+ (ˆXb
) + (ˆXw
)( −1)
. The newly imputed dataset is
X = W X + (1K × 1J − W ) ˆX
• ˆm is computed on X
4. iteration : estimation of the within structure
• SVD (Xw
) = PDQ ; Qw eigenvectors are kept:
Fw
= PQw (K × Qw )
V w
= QQw DQw (J × Qw )
• the within hat matrix is computed (ˆXw
) = ˆFw ˆV w
5. iteration : imputation of the missing values with the fitted values
• X +1
= M X + (1K × 1J − M) 1K ˆm( )
+ (ˆXb
) + (ˆXw
)
• ˆm +1
is computed on X +1
17 / 33
Introduction Multilevel Distribution
Multiple Correspondence Analysis (MCA) Greenacre
Xn×m m categorical variables coded with indicator matrix A
y . . . attack
y . . . attack
y . . . attack
n . . . suicide
X =
n . . . accident
n . . . suicide
1 0 . . . 1 0 0
1 0 . . . 1 0 0
1 0 . . . 1 0 0
0 1 . . . 0 1 0
A =
0 1 . . . 0 0 1
0 1 . . . 0 1 0
p1 0
Dp =
.
.
.
0 pJ
For a category c, the frequency of the category: pc = nc /n.
A SVD on weighted matrix: Z = 1√
mn
(A − 1pT
)D
−1/2
p = UDV
The PC (F = UD) satisfies: arg maxFq∈Rn
1
m
m
j=1 η2
(Fq, Xj )
η2
(F, Xj ) =
Cj
c=1 nc (F.c − F..)2
n
i=1
Cj
c=1(Fic )2
=
RSS between
RSS tot
Benzecri, 1973 :"In data analysis the mathematical problems reduces to computing
eigenvectors; all the science (the art) is in finding the right matrix to diagonalize"
18 / 33
Introduction Multilevel Distribution
Multilevel MCA
⇒ Start with the matrix of dummy variables A and define a
between and a within part
⇒ Then, MCA is applied on each part
Between: Apply MCA on the matrix with the mean of A per group
i (proportion of obs taking each category in group i) (proportion of
some disease in a particular hospital). ˆAb = FbV b D
1/2
p + 1np
Within part Apply MCA on the data where the between part has
been swept out (SVD is applied to 1
np A − ˆAb D
−1/2
p )
ˆAw = (np)Fw V w D
1/2
p .
ˆA = ˆAb
+ ˆAw
19 / 33
Introduction Multilevel Distribution
Regularized iterative Multilevel MCA
20 / 33
Introduction Multilevel Distribution
Regularized iterative Multilevel MCA
• Initialization: imputation of the indicator matrix (proportions)
20 / 33
Introduction Multilevel Distribution
Regularized iterative Multilevel MCA
• Initialization: imputation of the indicator matrix (proportions)
• Iterate until convergence
1 estimation: Multilevel MCA on the completed data →
ˆFb
, ˆV b
, ˆFw
, ˆV w
20 / 33
Introduction Multilevel Distribution
Regularized iterative Multilevel MCA
• Initialization: imputation of the indicator matrix (proportions)
• Iterate until convergence
1 estimation: Multilevel MCA on the completed data →
ˆFb
, ˆV b
, ˆFw
, ˆV w
2 imputation with the fitted matrix ˆA = ˆAb
+ ˆAw
20 / 33
Introduction Multilevel Distribution
Regularized iterative Multilevel MCA
• Initialization: imputation of the indicator matrix (proportions)
• Iterate until convergence
1 estimation: Multilevel MCA on the completed data →
ˆFb
, ˆV b
, ˆFw
, ˆV w
2 imputation with the fitted matrix ˆA = ˆAb
+ ˆAw
3 column margins are updated
20 / 33
Introduction Multilevel Distribution
Regularized iterative Multilevel MCA
• Initialization: imputation of the indicator matrix (proportions)
• Iterate until convergence
1 estimation: Multilevel MCA on the completed data →
ˆFb
, ˆV b
, ˆFw
, ˆV w
2 imputation with the fitted matrix ˆA = ˆAb
+ ˆAw
3 column margins are updated
V1 V2 V3 … V14 V1_a V1_b V1_c V2_e V2_f V3_g V3_h …
ind 1 a NA g … u ind 1 1 0 0 0.71 0.29 1 0 …
ind 2 NA f g u ind 2 0.12 0.29 0.59 0 1 1 0 …
ind 3 a e h v ind 3 1 0 0 1 0 0 1 …
ind 4 a e h v ind 4 1 0 0 1 0 0 1 …
ind 5 b f h u ind 5 0 1 0 0 1 0 1 …
ind 6 c f h u ind 6 0 0 1 0 1 0 1 …
ind 7 c f NA v ind 7 0 0 1 0 1 0.37 0.63 …
… … … … … … … … … … … … … …
ind 1232 c f h v ind 1232 0 0 1 0 1 0 1 …
⇒ the imputed values can be seen as degree of membership
20 / 33
Introduction Multilevel Distribution
Regularized iterative Multilevel MCA
• Initialization: imputation of the indicator matrix (proportions)
• Iterate until convergence
1 estimation: Multilevel MCA on the completed data →
ˆFb
, ˆV b
, ˆFw
, ˆV w
2 imputation with the fitted matrix ˆA = ˆAb
+ ˆAw
3 column margins are updated
Two ways to impute categories: majority or draw
20 / 33
Introduction Multilevel Distribution
Regularized iterative Multilevel MCA
• Initialization: imputation of the indicator matrix (proportions)
• Iterate until convergence
1 estimation: Multilevel MCA on the completed data →
ˆFb
, ˆV b
, ˆFw
, ˆV w
2 imputation with the fitted matrix ˆA = ˆAb
+ ˆAw
3 column margins are updated
Two ways to impute categories: majority or draw
20 / 33
Introduction Multilevel Distribution
Public Assistance - Paris Hospitals
Traumabase: 15000 patients/ 250 variables/ 6 hospitals
Center Accident Age Sex Weight Height BMI BP SBP
1 Beaujon Fall 54 m 85 NR NR 180 110
2 Lille Other 33 m 80 1.8 24.69 130 62
3 Pitie Salpetriere Gun 26 m NR NR NR 131 62
4 Beaujon AVP moto 63 m 80 1.8 24.69 145 89
6 Pitie Salpetriere AVP bicycle 33 m 75 NR NR 104 86
7 Pitie Salpetriere AVP pedestrian 30 w NR NR NR 107 66
9 HEGP White weapon 16 m 98 1.92 26.58 118 54
10 Toulon White weapon 20 m NR NR NR 124 73
11 Bicetre Fall 61 m 84 1.7 29.07 144 105
...................
SpO2 Temperature Lactates Hb Glasgow Transfusion ...........
1 97 35.6 <NA> 12.7 12 yes
2 100 36.5 4.8 11.1 15 no
3 100 36 3.9 11.4 3 no
4 100 36.7 1.66 13 15 yes
6 100 36 NM 14.4 15 no
7 100 36.6 NM 14.3 15 yes
9 100 37.5 13 15.9 15 yes
10 100 36.9 NM 13.7 15 no
11 100 36.6 1.2 14.2 14 no
............
21 / 33
Introduction Multilevel Distribution
Imputed Paris Hospitals data
Traumabase: 15000 patients/ 250 variables/ 6 hospitals
Center Accident Age Sex Weight Height BMI BP SBP
1 Beaujon Fall 54 m 85.00 1.84 27.04 83 13
2 Lille Other 33 m 80.00 1.80 24.69 33 98
3 Pitie Salpetriere Gun 26 m 81.78 1.85 24.33 34 98
4 Beaujon AVP moto 63 m 80.00 1.80 24.69 48 125
6 Pitie Salpetriere AVP bicycle 33 m 75.00 1.83 24.53 6 122
7 Pitie Salpetriere AVP pedestri 30 m 81.89 1.82 25.24 9 102
9 HEGP White weapon 16 m 98.00 1.92 26.58 21 90
10 Toulon White weapon 20 m 81.68 1.82 25.05 27 109
11 Bicetre Fall 61 m 84.00 1.70 29.07 47 8
SpO2 Temperature Lactates Hb Glasgow............
1 46 61 289.07 33 14
2 2 72 464.00 16 14
3 2 65 416.00 19 7
4 2 74 130.00 36 6
6 2 65 285.91 50 6
7 2 73 244.99 49 6
9 2 83 196.00 65 6
10 2 76 262.44 43 6
11 2 73 84.00 48 5
22 / 33
Introduction Multilevel Distribution
Results
Competitors - Prediction error: 1
KJ (xijki − ˆxijki )2
• Conditional model with random effect regression (mice)
• Random forests (bühlmann, 2012) (not designed)
• Global PCA - Separate PCA
• Global mixed PCA (with hospital)
q
q
FAMD global Jomo Mult PCA global PCA sep
2.002.052.102.152.202.252.30
MSE
sigma=2, pNA=0.2, J=10
q
qq
1 2 3 4 5
050100150200250
(PCA separate − multilevel PCA) per group
group
Vary % of missing, type of missing values, n, nb of groups, etc. 23 / 33
Introduction Multilevel Distribution
Results
J = 10 J = 30 5cat 5cat
Global PCA 0.09 0.3
mice 11 282
Multilevel SVD 1.5 1.2 2 7
Global mixed PCA 0.4 0.7 1 4
Random forest 59 200 27 246
Table: Time in seconds for a dataset with 20% NA, I = 5 ki = 200
• PCA mixed as Random Forest
• mice (random effect model): difficulties with large dimensions
• Separate PCA: pb with many missing values
• Multilevel SVD = global SVD when no group effect
• Imputation properties depends on the method (linear)
• Other methods do not handle categorical variables
24 / 33
Introduction Multilevel Distribution
Outline
1 Introduction
2 Single imputation for mixed multilevel data
3 Distribution
25 / 33
Introduction Multilevel Distribution
6 hospital databases
Combining data from different institutional databases promises
many advantages in personalizing medical care (large n, more
chance for finding patients like me)
⇒ NIH requires sharing of data from funded projects
26 / 33
Introduction Multilevel Distribution
6 hospital databases
Combining data from different institutional databases promises
many advantages in personalizing medical care (large n, more
chance for finding patients like me)
⇒ NIH requires sharing of data from funded projects
⇒ The problem: high barriers to aggregation of medical data
• lack of standardization of ontologies
• privacy concerns
• proprietary attitude towards data, reluctance to cede control
• complexity/size of aggregated data, updates problems
26 / 33
Introduction Multilevel Distribution
Solution: distributed computation
⇒ Data aggregation is not always necessary
⇒ NIH splits the storage of aggregated data across several centers
⇒ Data can stay at site
⇒ Computations can be distributed (share burden)
⇒ Hospitals only share intermediate results instead of the raw data
27 / 33
Introduction Multilevel Distribution
Topology: master-workers (Wolfson, et. al (2010))
⇒ Ex: Each site share the sum of age ˜Xi and the number of
patients ni . The master computes ¯X = ni
˜Xi / ni 28 / 33
Introduction Multilevel Distribution
Solution: distributed computation
⇒ Many models fitting can be implemented:
• Maximizing a likelihood. Intermediate computations break up
into sums of quantities computed on local data at sites.
Log-likelihood, score function and Fisher information can
partition into sums. (OK for logistic regression)
• Singular Value Decomposition. Iterative algorithms available
for SVD using quantities computed on local data at sites.
• And more.
Implemented in the R package discomp (Narasimhan et. al., 2017)
29 / 33
Introduction Multilevel Distribution
Singular value decomposition
SVD: Xn×p : Un×kDk×kVp×k
Power method to get the first direction:
Other dims: "deflation", same procedure in the residuals
(X − udv )
⇒ Involves inner products and sums: distributed
30 / 33
Introduction Multilevel Distribution
Privacy preserving rank k SVD
31 / 33
Introduction Multilevel Distribution
Multilevel imputation
i
I
1
1
k1
1
1
kI
ki
1 J
Xi
XI
X1
?
?
?
?
?
?
?
?
?
?
?
?
?
?
⇒ Impute multilevel data with Multilevel SVD
⇒ Distributed multilevel imputation
⇒ Impute the data of one hospital using the data of the others
⇒ Incentive to encourage the hospitals to participate in the project
⇒ Apply other statistical methods on the imputed data (logistic
regression)
32 / 33
Introduction Multilevel Distribution
Take home message - On going work
Multilevel PCA powerful for single imputation of continuous & categorical
multilevel data: reduce the dimensionality - capture the similarities
between rows and relationship between variables at both levels.
⇒ Computationaly fast - distributed - Implemented R package missMDA
• Numbers of components Qb and Qw ?
• Inference after imputation. Underestimation of the variance with single
imputation
33 / 33
Introduction Multilevel Distribution
Take home message - On going work
Multilevel PCA powerful for single imputation of continuous & categorical
multilevel data: reduce the dimensionality - capture the similarities
between rows and relationship between variables at both levels.
⇒ Computationaly fast - distributed - Implemented R package missMDA
• Numbers of components Qb and Qw ?
cross-validation?
• Inference after imputation. Underestimation of the variance with single
imputation
33 / 33
Introduction Multilevel Distribution
Take home message - On going work
Multilevel PCA powerful for single imputation of continuous & categorical
multilevel data: reduce the dimensionality - capture the similarities
between rows and relationship between variables at both levels.
⇒ Computationaly fast - distributed - Implemented R package missMDA
• Numbers of components Qb and Qw ?
cross-validation?
• Inference after imputation. Underestimation of the variance with single
imputation
Multiple imputation: bootstrap + drawn from the predictive distribution
N 1K ˆm + ˆFb ˆBb
+ ˆFw ˆBw
, ˆσ2
What confidence should be given to the results obtained from a
completed data set?
33 / 33

More Related Content

Similar to Multilevel Matrix Completion for Medical Databases

eggs_project_interm
eggs_project_intermeggs_project_interm
eggs_project_intermRoopan Verma
 
Big Data Analytics for Healthcare
Big Data Analytics for HealthcareBig Data Analytics for Healthcare
Big Data Analytics for HealthcareChandan Reddy
 
Network meta-analysis & models for inconsistency
Network meta-analysis & models for inconsistencyNetwork meta-analysis & models for inconsistency
Network meta-analysis & models for inconsistencycheweb1
 
Imputation techniques for missing data in clinical trials
Imputation techniques for missing data in clinical trialsImputation techniques for missing data in clinical trials
Imputation techniques for missing data in clinical trialsNitin George
 
Streaming multiscale anomaly detection
Streaming multiscale anomaly detectionStreaming multiscale anomaly detection
Streaming multiscale anomaly detectionRavi Kiran B.
 
12 ch ken black solution
12 ch ken black solution12 ch ken black solution
12 ch ken black solutionKrunal Shah
 
FUNCTION OF RIVAL SIMILARITY IN A COGNITIVE DATA ANALYSIS
FUNCTION OF RIVAL SIMILARITY IN A COGNITIVE DATA ANALYSISFUNCTION OF RIVAL SIMILARITY IN A COGNITIVE DATA ANALYSIS
FUNCTION OF RIVAL SIMILARITY IN A COGNITIVE DATA ANALYSISIrene Pochinok
 
Repeated events analyses
Repeated events analysesRepeated events analyses
Repeated events analysesMike LaValley
 
Chi square Test Using SPSS
Chi square Test Using SPSSChi square Test Using SPSS
Chi square Test Using SPSSDr Athar Khan
 
Research Methodology anova
  Research Methodology anova  Research Methodology anova
Research Methodology anovaPraveen Minz
 
Diabetic Retinopathy Detection
Diabetic Retinopathy DetectionDiabetic Retinopathy Detection
Diabetic Retinopathy DetectionSPb_Data_Science
 
Self-sampling Strategies for Multimemetic Algorithms in Unstable Computationa...
Self-sampling Strategies for Multimemetic Algorithms in Unstable Computationa...Self-sampling Strategies for Multimemetic Algorithms in Unstable Computationa...
Self-sampling Strategies for Multimemetic Algorithms in Unstable Computationa...Rafael Nogueras
 
Effects of A Simulated Power Cut in AMS on Milk Yield Valued by Statistics Model
Effects of A Simulated Power Cut in AMS on Milk Yield Valued by Statistics ModelEffects of A Simulated Power Cut in AMS on Milk Yield Valued by Statistics Model
Effects of A Simulated Power Cut in AMS on Milk Yield Valued by Statistics ModelIJERA Editor
 
2016-04-20-Moscow-EstimationError-V01
2016-04-20-Moscow-EstimationError-V012016-04-20-Moscow-EstimationError-V01
2016-04-20-Moscow-EstimationError-V01Francesco Sarracino
 
Discrete probability distribution.pptx
Discrete probability distribution.pptxDiscrete probability distribution.pptx
Discrete probability distribution.pptxDHARMENDRAKUMAR216662
 

Similar to Multilevel Matrix Completion for Medical Databases (20)

Input analysis
Input analysisInput analysis
Input analysis
 
eggs_project_interm
eggs_project_intermeggs_project_interm
eggs_project_interm
 
Big Data Analytics for Healthcare
Big Data Analytics for HealthcareBig Data Analytics for Healthcare
Big Data Analytics for Healthcare
 
One-Way ANOVA
One-Way ANOVAOne-Way ANOVA
One-Way ANOVA
 
Network meta-analysis & models for inconsistency
Network meta-analysis & models for inconsistencyNetwork meta-analysis & models for inconsistency
Network meta-analysis & models for inconsistency
 
Imputation techniques for missing data in clinical trials
Imputation techniques for missing data in clinical trialsImputation techniques for missing data in clinical trials
Imputation techniques for missing data in clinical trials
 
Streaming multiscale anomaly detection
Streaming multiscale anomaly detectionStreaming multiscale anomaly detection
Streaming multiscale anomaly detection
 
12 ch ken black solution
12 ch ken black solution12 ch ken black solution
12 ch ken black solution
 
FUNCTION OF RIVAL SIMILARITY IN A COGNITIVE DATA ANALYSIS
FUNCTION OF RIVAL SIMILARITY IN A COGNITIVE DATA ANALYSISFUNCTION OF RIVAL SIMILARITY IN A COGNITIVE DATA ANALYSIS
FUNCTION OF RIVAL SIMILARITY IN A COGNITIVE DATA ANALYSIS
 
Repeated events analyses
Repeated events analysesRepeated events analyses
Repeated events analyses
 
Chi square Test Using SPSS
Chi square Test Using SPSSChi square Test Using SPSS
Chi square Test Using SPSS
 
Student’s t test
Student’s  t testStudent’s  t test
Student’s t test
 
Survival analysis
Survival analysisSurvival analysis
Survival analysis
 
Research Methodology anova
  Research Methodology anova  Research Methodology anova
Research Methodology anova
 
Diabetic Retinopathy Detection
Diabetic Retinopathy DetectionDiabetic Retinopathy Detection
Diabetic Retinopathy Detection
 
Self-sampling Strategies for Multimemetic Algorithms in Unstable Computationa...
Self-sampling Strategies for Multimemetic Algorithms in Unstable Computationa...Self-sampling Strategies for Multimemetic Algorithms in Unstable Computationa...
Self-sampling Strategies for Multimemetic Algorithms in Unstable Computationa...
 
Effects of A Simulated Power Cut in AMS on Milk Yield Valued by Statistics Model
Effects of A Simulated Power Cut in AMS on Milk Yield Valued by Statistics ModelEffects of A Simulated Power Cut in AMS on Milk Yield Valued by Statistics Model
Effects of A Simulated Power Cut in AMS on Milk Yield Valued by Statistics Model
 
2016-04-20-Moscow-EstimationError-V01
2016-04-20-Moscow-EstimationError-V012016-04-20-Moscow-EstimationError-V01
2016-04-20-Moscow-EstimationError-V01
 
Discrete probability distribution.pptx
Discrete probability distribution.pptxDiscrete probability distribution.pptx
Discrete probability distribution.pptx
 
Mech ma6452 snm_notes
Mech ma6452 snm_notesMech ma6452 snm_notes
Mech ma6452 snm_notes
 

More from Paris Women in Machine Learning and Data Science

More from Paris Women in Machine Learning and Data Science (20)

Managing international tech teams, by Natasha Dimban
Managing international tech teams, by Natasha DimbanManaging international tech teams, by Natasha Dimban
Managing international tech teams, by Natasha Dimban
 
Optimizing GenAI apps, by N. El Mawass and Maria Knorps
Optimizing GenAI apps, by N. El Mawass and Maria KnorpsOptimizing GenAI apps, by N. El Mawass and Maria Knorps
Optimizing GenAI apps, by N. El Mawass and Maria Knorps
 
Perspectives, by M. Pannegeon
Perspectives, by M. PannegeonPerspectives, by M. Pannegeon
Perspectives, by M. Pannegeon
 
Evaluation strategies for dealing with partially labelled or unlabelled data
Evaluation strategies for dealing with partially labelled or unlabelled dataEvaluation strategies for dealing with partially labelled or unlabelled data
Evaluation strategies for dealing with partially labelled or unlabelled data
 
Combinatorial Optimisation with Policy Adaptation using latent Space Search, ...
Combinatorial Optimisation with Policy Adaptation using latent Space Search, ...Combinatorial Optimisation with Policy Adaptation using latent Space Search, ...
Combinatorial Optimisation with Policy Adaptation using latent Space Search, ...
 
An age-old question, by Caroline Jean-Pierre
An age-old question, by Caroline Jean-PierreAn age-old question, by Caroline Jean-Pierre
An age-old question, by Caroline Jean-Pierre
 
Applying Churn Prediction Approaches to the Telecom Industry, by Joëlle Lautré
Applying Churn Prediction Approaches to the Telecom Industry, by Joëlle LautréApplying Churn Prediction Approaches to the Telecom Industry, by Joëlle Lautré
Applying Churn Prediction Approaches to the Telecom Industry, by Joëlle Lautré
 
How to supervise a thesis in NLP in the ChatGPT era? By Laure Soulier
How to supervise a thesis in NLP in the ChatGPT era? By Laure SoulierHow to supervise a thesis in NLP in the ChatGPT era? By Laure Soulier
How to supervise a thesis in NLP in the ChatGPT era? By Laure Soulier
 
Global Ambitions Local Realities, by Anna Abreu
Global Ambitions Local Realities, by Anna AbreuGlobal Ambitions Local Realities, by Anna Abreu
Global Ambitions Local Realities, by Anna Abreu
 
Plug-and-Play methods for inverse problems in imagine, by Julie Delon
Plug-and-Play methods for inverse problems in imagine, by Julie DelonPlug-and-Play methods for inverse problems in imagine, by Julie Delon
Plug-and-Play methods for inverse problems in imagine, by Julie Delon
 
Sales Forecasting as a Data Product by Francesca Iannuzzi
Sales Forecasting as a Data Product by Francesca IannuzziSales Forecasting as a Data Product by Francesca Iannuzzi
Sales Forecasting as a Data Product by Francesca Iannuzzi
 
Identifying and mitigating bias in machine learning, by Ruta Binkyte
Identifying and mitigating bias in machine learning, by Ruta BinkyteIdentifying and mitigating bias in machine learning, by Ruta Binkyte
Identifying and mitigating bias in machine learning, by Ruta Binkyte
 
“Turning your ML algorithms into full web apps in no time with Python" by Mar...
“Turning your ML algorithms into full web apps in no time with Python" by Mar...“Turning your ML algorithms into full web apps in no time with Python" by Mar...
“Turning your ML algorithms into full web apps in no time with Python" by Mar...
 
Nature Language Processing for proteins by Amélie Héliou, Software Engineer @...
Nature Language Processing for proteins by Amélie Héliou, Software Engineer @...Nature Language Processing for proteins by Amélie Héliou, Software Engineer @...
Nature Language Processing for proteins by Amélie Héliou, Software Engineer @...
 
Sandrine Henry presents the BechdelAI project
Sandrine Henry presents the BechdelAI projectSandrine Henry presents the BechdelAI project
Sandrine Henry presents the BechdelAI project
 
Anastasiia Tryputen_War in Ukraine or how extraordinary courage reshapes geop...
Anastasiia Tryputen_War in Ukraine or how extraordinary courage reshapes geop...Anastasiia Tryputen_War in Ukraine or how extraordinary courage reshapes geop...
Anastasiia Tryputen_War in Ukraine or how extraordinary courage reshapes geop...
 
Khrystyna Grynko WiMLDS - From marketing to Tech.pdf
Khrystyna Grynko WiMLDS - From marketing to Tech.pdfKhrystyna Grynko WiMLDS - From marketing to Tech.pdf
Khrystyna Grynko WiMLDS - From marketing to Tech.pdf
 
Iana Iatsun_ML in production_20Dec2022.pdf
Iana Iatsun_ML in production_20Dec2022.pdfIana Iatsun_ML in production_20Dec2022.pdf
Iana Iatsun_ML in production_20Dec2022.pdf
 
41 WiMLDS Kyiv Paris Poznan.pdf
41 WiMLDS Kyiv Paris Poznan.pdf41 WiMLDS Kyiv Paris Poznan.pdf
41 WiMLDS Kyiv Paris Poznan.pdf
 
Emergency plan to secure winter: what are the measures set up by RTE?
Emergency plan to secure winter: what are the measures set up by RTE?Emergency plan to secure winter: what are the measures set up by RTE?
Emergency plan to secure winter: what are the measures set up by RTE?
 

Recently uploaded

(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...ranjana rawat
 
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptxDecoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptxJoão Esperancinha
 
VICTOR MAESTRE RAMIREZ - Planetary Defender on NASA's Double Asteroid Redirec...
VICTOR MAESTRE RAMIREZ - Planetary Defender on NASA's Double Asteroid Redirec...VICTOR MAESTRE RAMIREZ - Planetary Defender on NASA's Double Asteroid Redirec...
VICTOR MAESTRE RAMIREZ - Planetary Defender on NASA's Double Asteroid Redirec...VICTOR MAESTRE RAMIREZ
 
Past, Present and Future of Generative AI
Past, Present and Future of Generative AIPast, Present and Future of Generative AI
Past, Present and Future of Generative AIabhishek36461
 
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130Suhani Kapoor
 
Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝
Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝
Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝soniya singh
 
microprocessor 8085 and its interfacing
microprocessor 8085  and its interfacingmicroprocessor 8085  and its interfacing
microprocessor 8085 and its interfacingjaychoudhary37
 
Oxy acetylene welding presentation note.
Oxy acetylene welding presentation note.Oxy acetylene welding presentation note.
Oxy acetylene welding presentation note.eptoze12
 
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...Dr.Costas Sachpazis
 
Study on Air-Water & Water-Water Heat Exchange in a Finned Tube Exchanger
Study on Air-Water & Water-Water Heat Exchange in a Finned Tube ExchangerStudy on Air-Water & Water-Water Heat Exchange in a Finned Tube Exchanger
Study on Air-Water & Water-Water Heat Exchange in a Finned Tube ExchangerAnamika Sarkar
 
Introduction to Microprocesso programming and interfacing.pptx
Introduction to Microprocesso programming and interfacing.pptxIntroduction to Microprocesso programming and interfacing.pptx
Introduction to Microprocesso programming and interfacing.pptxvipinkmenon1
 
Internship report on mechanical engineering
Internship report on mechanical engineeringInternship report on mechanical engineering
Internship report on mechanical engineeringmalavadedarshan25
 
main PPT.pptx of girls hostel security using rfid
main PPT.pptx of girls hostel security using rfidmain PPT.pptx of girls hostel security using rfid
main PPT.pptx of girls hostel security using rfidNikhilNagaraju
 
Sachpazis Costas: Geotechnical Engineering: A student's Perspective Introduction
Sachpazis Costas: Geotechnical Engineering: A student's Perspective IntroductionSachpazis Costas: Geotechnical Engineering: A student's Perspective Introduction
Sachpazis Costas: Geotechnical Engineering: A student's Perspective IntroductionDr.Costas Sachpazis
 
Current Transformer Drawing and GTP for MSETCL
Current Transformer Drawing and GTP for MSETCLCurrent Transformer Drawing and GTP for MSETCL
Current Transformer Drawing and GTP for MSETCLDeelipZope
 
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130Suhani Kapoor
 
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...Soham Mondal
 
(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts
(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts
(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escortsranjana rawat
 
HARMONY IN THE NATURE AND EXISTENCE - Unit-IV
HARMONY IN THE NATURE AND EXISTENCE - Unit-IVHARMONY IN THE NATURE AND EXISTENCE - Unit-IV
HARMONY IN THE NATURE AND EXISTENCE - Unit-IVRajaP95
 

Recently uploaded (20)

(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
 
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptxDecoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
 
VICTOR MAESTRE RAMIREZ - Planetary Defender on NASA's Double Asteroid Redirec...
VICTOR MAESTRE RAMIREZ - Planetary Defender on NASA's Double Asteroid Redirec...VICTOR MAESTRE RAMIREZ - Planetary Defender on NASA's Double Asteroid Redirec...
VICTOR MAESTRE RAMIREZ - Planetary Defender on NASA's Double Asteroid Redirec...
 
Past, Present and Future of Generative AI
Past, Present and Future of Generative AIPast, Present and Future of Generative AI
Past, Present and Future of Generative AI
 
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
 
🔝9953056974🔝!!-YOUNG call girls in Rajendra Nagar Escort rvice Shot 2000 nigh...
🔝9953056974🔝!!-YOUNG call girls in Rajendra Nagar Escort rvice Shot 2000 nigh...🔝9953056974🔝!!-YOUNG call girls in Rajendra Nagar Escort rvice Shot 2000 nigh...
🔝9953056974🔝!!-YOUNG call girls in Rajendra Nagar Escort rvice Shot 2000 nigh...
 
Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝
Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝
Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝
 
microprocessor 8085 and its interfacing
microprocessor 8085  and its interfacingmicroprocessor 8085  and its interfacing
microprocessor 8085 and its interfacing
 
Oxy acetylene welding presentation note.
Oxy acetylene welding presentation note.Oxy acetylene welding presentation note.
Oxy acetylene welding presentation note.
 
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
 
Study on Air-Water & Water-Water Heat Exchange in a Finned Tube Exchanger
Study on Air-Water & Water-Water Heat Exchange in a Finned Tube ExchangerStudy on Air-Water & Water-Water Heat Exchange in a Finned Tube Exchanger
Study on Air-Water & Water-Water Heat Exchange in a Finned Tube Exchanger
 
Introduction to Microprocesso programming and interfacing.pptx
Introduction to Microprocesso programming and interfacing.pptxIntroduction to Microprocesso programming and interfacing.pptx
Introduction to Microprocesso programming and interfacing.pptx
 
Internship report on mechanical engineering
Internship report on mechanical engineeringInternship report on mechanical engineering
Internship report on mechanical engineering
 
main PPT.pptx of girls hostel security using rfid
main PPT.pptx of girls hostel security using rfidmain PPT.pptx of girls hostel security using rfid
main PPT.pptx of girls hostel security using rfid
 
Sachpazis Costas: Geotechnical Engineering: A student's Perspective Introduction
Sachpazis Costas: Geotechnical Engineering: A student's Perspective IntroductionSachpazis Costas: Geotechnical Engineering: A student's Perspective Introduction
Sachpazis Costas: Geotechnical Engineering: A student's Perspective Introduction
 
Current Transformer Drawing and GTP for MSETCL
Current Transformer Drawing and GTP for MSETCLCurrent Transformer Drawing and GTP for MSETCL
Current Transformer Drawing and GTP for MSETCL
 
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
 
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
 
(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts
(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts
(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts
 
HARMONY IN THE NATURE AND EXISTENCE - Unit-IV
HARMONY IN THE NATURE AND EXISTENCE - Unit-IVHARMONY IN THE NATURE AND EXISTENCE - Unit-IV
HARMONY IN THE NATURE AND EXISTENCE - Unit-IV
 

Multilevel Matrix Completion for Medical Databases

  • 1. Introduction Multilevel Distribution Distributed multilevel matrix completion for medical databases Julie Josse Ecole Polytechnique, INRIA Women in Machine Learning & Data Science Meetup January 25, 2018 1 / 33
  • 2. Introduction Multilevel Distribution Overview 1 Introduction 2 Single imputation for mixed multilevel data 3 Distribution 2 / 33
  • 3. Introduction Multilevel Distribution Outline 1 Introduction 2 Single imputation for mixed multilevel data 3 Distribution 3 / 33
  • 4. Introduction Multilevel Distribution Research activities • Dimensionality reduction methods to visualize complex data (PCA based): multi-sources, textual, arrays, questionnaire • Missing values - matrix completion • Low rank estimation, selection of regularization parameters • Fields of application: bio-sciences (agronomy, sensory analysis), health data (hospital data) • R community: book R for Statistics, R foundation, R taskforce Forwards, JSS papers and R packages: FactoMineR explore continuous, categorical, multiple contingency tables (correspondence analysis), combine clustering and PC, .. MissMDA for single and multiple imputation, PCA with missing denoiseR to denoise data 4 / 33
  • 5. Introduction Multilevel Distribution Collaborators Imputation of mixed data with multilevel SVD Distributed computation Genevieve Robin, François Husson, Balasubramanian Narasimhan Polytechnique, Agrocampus, Stanford (stat/biomedical) 5 / 33
  • 6. Introduction Multilevel Distribution Public Assistance - Paris Hospitals Traumabase: 15000 patients/ 250 variables/ Center Accident Age Sex Weight Height BMI BP SBP 1 Beaujon Fall 54 m 85 NR NR 180 110 2 Lille Other 33 m 80 1.8 24.69 130 62 3 Pitie Salpetriere Gun 26 m NR NR NR 131 62 4 Beaujon AVP moto 63 m 80 1.8 24.69 145 89 6 Pitie Salpetriere AVP bicycle 33 m 75 NR NR 104 86 7 Pitie Salpetriere AVP pedestrian 30 w NR NR NR 107 66 9 HEGP White weapon 16 m 98 1.92 26.58 118 54 10 Toulon White weapon 20 m NR NR NR 124 73 11 Bicetre Fall 61 m 84 1.7 29.07 144 105 ................... SpO2 Temperature Lactates Hb Glasgow Transfusion ........... 1 97 35.6 <NA> 12.7 12 yes 2 100 36.5 4.8 11.1 15 no 3 100 36 3.9 11.4 3 no 4 100 36.7 1.66 13 15 yes 6 100 36 NM 14.4 15 no 7 100 36.6 NM 14.3 15 yes 9 100 37.5 13 15.9 15 yes 10 100 36.9 NM 13.7 15 no 11 100 36.6 1.2 14.2 14 no ............ ⇒ Predict the Glasgow score, whether to start a blood transfusion, to administer fresh frozen plasma, etc... ⇒ (Logistic) regressions with categorical/continuous values 6 / 33
  • 7. Introduction Multilevel Distribution Public Assistance - Paris Hospitals Traumabase: 15000 patients/ 250 variables/ 6 hospitals Center Accident Age Sex Weight Height BMI BP SBP 1 Beaujon Fall 54 m 85 NR NR 180 110 2 Lille Other 33 m 80 1.8 24.69 130 62 3 Pitie Salpetriere Gun 26 m NR NR NR 131 62 4 Beaujon AVP moto 63 m 80 1.8 24.69 145 89 6 Pitie Salpetriere AVP bicycle 33 m 75 NR NR 104 86 7 Pitie Salpetriere AVP pedestrian 30 w NR NR NR 107 66 9 HEGP White weapon 16 m 98 1.92 26.58 118 54 10 Toulon White weapon 20 m NR NR NR 124 73 11 Bicetre Fall 61 m 84 1.7 29.07 144 105 ................... SpO2 Temperature Lactates Hb Glasgow Transfusion ........... 1 97 35.6 <NA> 12.7 12 yes 2 100 36.5 4.8 11.1 15 no 3 100 36 3.9 11.4 3 no 4 100 36.7 1.66 13 15 yes 6 100 36 NM 14.4 15 no 7 100 36.6 NM 14.3 15 yes 9 100 37.5 13 15.9 15 yes 10 100 36.9 NM 13.7 15 no 11 100 36.6 1.2 14.2 14 no ............ ⇒ Predict the Glasgow score, whether to start a blood transfusion, to administer fresh frozen plasma, etc... ⇒ (Logistic) regressions with missing categorical/continuous values 6 / 33
  • 8. Introduction Multilevel Distribution Public Assistance - Paris Hospitals ⇒ 6 hospitals: multilevel structure (patients within hospital) Hopital effect: lack of standardization ⇒ Missing values Center Accident Age Sex Weight Height BMI BP SBP 1 Beaujon Fall 54 m 85 NR NR 180 110 2 Lille Other 33 m 80 1.8 24.69 130 62 3 Pitie Salpetriere Gun 26 m NR NR NR 131 62 4 Beaujon AVP moto 63 m 80 1.8 24.69 145 89 6 Pitie Salpetriere AVP bicycle 33 m 75 NR NR 104 86 7 Pitie Salpetriere AVP pedestrian 30 w NR NR NR 107 66 9 HEGP White weapon 16 m 98 1.92 26.58 118 54 ................... SpO2 Temperature Lactates Hb Glasgow Transfusion ........... 1 97 35.6 <NA> 12.7 12 yes 2 100 36.5 4.8 11.1 15 no 3 100 36 3.9 11.4 3 no 4 100 36.7 1.66 13 15 yes 6 100 36 NM 14.4 15 no 7 100 36.6 NM 14.3 15 yes 9 100 37.5 13 15.9 15 yes 1) Handling missing values in multilevel data 7 / 33
  • 9. Introduction Multilevel Distribution Public Assistance - Paris Hospitals ⇒ 6 hospitals: multilevel structure (patients within hospital) Hopital effect: lack of standardization 2) Data not aggregated but which stay on each site ⇒ Missing values Center Accident Age Sex Weight Height BMI BP SBP 1 Beaujon Fall 54 m 85 NR NR 180 110 2 Lille Other 33 m 80 1.8 24.69 130 62 3 Pitie Salpetriere Gun 26 m NR NR NR 131 62 4 Beaujon AVP moto 63 m 80 1.8 24.69 145 89 6 Pitie Salpetriere AVP bicycle 33 m 75 NR NR 104 86 7 Pitie Salpetriere AVP pedestrian 30 w NR NR NR 107 66 9 HEGP White weapon 16 m 98 1.92 26.58 118 54 ................... SpO2 Temperature Lactates Hb Glasgow Transfusion ........... 1 97 35.6 <NA> 12.7 12 yes 2 100 36.5 4.8 11.1 15 no 3 100 36 3.9 11.4 3 no 4 100 36.7 1.66 13 15 yes 6 100 36 NM 14.4 15 no 7 100 36.6 NM 14.3 15 yes 9 100 37.5 13 15.9 15 yes 1) Handling missing values in multilevel data 7 / 33
  • 10. Introduction Multilevel Distribution Missing values are everywhere: unanswered questions in a survey, lost data, damaged plants, machines that fail... The best thing to do with missing values is not to have any” Gertrude Mary Cox. ⇒ Still an issue with "big data" Data integration: data from different sources i I 1 1 k1 1 1 kI ki 1 J Xi XI X1 ? ? ? ? ? ? ? ? ? ? ? ? ? ? Multilevel structure: sporadically - systematic missing values (one variable missing in one hospital) 8 / 33
  • 11. Introduction Multilevel Distribution Solutions to handle missing values Litterature: Schaefer (2002); Little & Rubin (2002); Gelman & Meng (2004); Kim & Shao (2013); Carpenter & Kenward (2013); van Buuren (2015) ⇒ Modify the estimation process to deal with missing values. Maximum likelihood: EM algorithm to obtain point estimates + Supplemented EM (Meng & Rubin, 1991) ; Louis for their variability One specific algorithm for each statistical method... ⇒ Imputation (multiple) to get a completed data set on which you can perform any statistical method (Rubin, 1976) Famous imputation based on SVD (PCA) - quantitative 9 / 33
  • 12. Introduction Multilevel Distribution Solutions to handle missing values Litterature: Schaefer (2002); Little & Rubin (2002); Gelman & Meng (2004); Kim & Shao (2013); Carpenter & Kenward (2013); van Buuren (2015) ⇒ Modify the estimation process to deal with missing values. Maximum likelihood: EM algorithm to obtain point estimates + Supplemented EM (Meng & Rubin, 1991) ; Louis for their variability One specific algorithm for each statistical method... ⇒ Imputation (multiple) to get a completed data set on which you can perform any statistical method (Rubin, 1976) Famous imputation based on SVD (PCA) - quantitative Extension to imputation with multilevel SVD + categorical 9 / 33
  • 13. Introduction Multilevel Distribution Outline 1 Introduction 2 Single imputation for mixed multilevel data 3 Distribution 10 / 33
  • 14. Introduction Multilevel Distribution PCA reconstruction -2.00 -2.74 -1.56 -0.77 -1.11 -1.59 -0.67 -1.13 -0.22 -1.22 0.22 -0.52 0.67 1.46 1.11 0.63 1.56 1.10 2.00 1.00 -2.16 -2.58 -0.96 -1.35 -1.15 -1.55 -0.70 -1.09 -0.53 -0.92 0.04 -0.34 1.24 0.89 1.05 0.69 1.50 1.15 1.67 1.33 X -3 -2 -1 0 1 2 3 -3-2-10123 x1 x2 ^μ X F ^μ V' ≈ ⇒ Minimizes distance between observations and their projection ⇒ Approx Xn×p with a low rank matrix k < p A 2 2 = tr(AA ): argminµ X − µ 2 2 : rank (µ) ≤ k SVD X: ˆµPCA = Un×kDk×kVp×k = Fn×kVp×k F = UD PC - scores V principal axes - loadings 11 / 33
  • 15. Introduction Multilevel Distribution PCA reconstruction -2.00 -2.74 NA -0.77 -1.11 -1.59 -0.67 -1.13 -0.22 NA 0.22 -0.52 0.67 1.46 NA 0.63 1.56 1.10 2.00 1.00 -2.16 -2.58 -0.96 -1.35 -1.15 -1.55 -0.70 -1.09 -0.53 -0.92 0.04 -0.34 1.24 0.89 1.05 0.69 1.50 1.15 1.67 1.33 X -3 -2 -1 0 1 2 3 -3-2-10123 x1 x2 ^μ X F ^μ V' ≈ ⇒ Minimizes distance between observations and their projection ⇒ Approx Xn×p with a low rank matrix k < p A 2 2 = tr(AA ): argminµ X − µ 2 2 : rank (µ) ≤ k SVD X: ˆµPCA = Un×kDk×kVp×k = Fn×kVp×k F = UD PC - scores V principal axes - loadings 11 / 33
  • 16. Introduction Multilevel Distribution Missing values in PCA ⇒ PCA: least squares argminµ Xn×p − µn×p 2 2 : rank (µ) ≤ k ⇒ PCA with missing values: weighted least squares argminµ Wn×p (X − µ) 2 2 : rank (µ) ≤ k with Wij = 0 if Xij is missing, Wij = 1 otherwise; elementwise multiplication Many algorithms: Gabriel & Zamir, 1979: weighted alternating least squares (without explicit imputation) Kiers, 1997: iterative PCA (with imputation) 12 / 33
  • 17. Introduction Multilevel Distribution Iterative PCA 1 initialization = 0: X0 (mean imputation) 2 step : (a) PCA on the completed data → (U , D , V ); k dim kept (b) (ˆµ )k = U D V X = W X + (1 − W ) ˆµ 3 steps of estimation and imputation are repeated 13 / 33
  • 18. Introduction Multilevel Distribution Iterative PCA 1 initialization = 0: X0 (mean imputation) 2 step : (a) PCA on the completed data → (U , D , V ); k dim kept (b) (ˆµ )k = U D V X = W X + (1 − W ) ˆµ 3 steps of estimation and imputation are repeated ⇒ ˆµ from incomplete data: EM algo: X = FV + ε X = µ + ε, εij iid ∼ N 0, σ2 with µ of low rank ⇒ Completed data: good imputation (matrix completion, Netflix) (Udell & Townsend Nice Latent Variable Models Have Log-Rank, 2017) Selecting k? Generalized cross-validation (Josse & Husson, 2012) 13 / 33
  • 19. Introduction Multilevel Distribution Iterative PCA 1 initialization = 0: X0 (mean imputation) 2 step : (a) PCA on the completed data → (U , D , V ); k dim kept (b) (ˆµ )k = U D V X = W X + (1 − W ) ˆµ 3 steps of estimation and imputation are repeated ⇒ ˆµ from incomplete data: EM algo: X = FV + ε X = µ + ε, εij iid ∼ N 0, σ2 with µ of low rank ⇒ Completed data: good imputation (matrix completion, Netflix) (Udell & Townsend Nice Latent Variable Models Have Log-Rank, 2017) Selecting k? Generalized cross-validation (Josse & Husson, 2012) 13 / 33
  • 20. Introduction Multilevel Distribution Multilevel component analysis Ex: inhabitants nested within countries X ∈ RK×J • similarities between countries? level 1 • similarities between inhabitants within each country? level 2 • relationship between variables at each level i I 1 1 k1 1 1 kI ki 1 J Xi XI X1 xijki = x.j. + (xij. − x.j.) + (xijki − xij.) Between + Within Analysis of variance: split the sum of squares for each variable j I i=1 ki k=1 (xijki )2 = I i=1 ki (x.j.)2 + I i=1 ki (xij. −x.j.)2 + I i=1 ki k=1 (xijki −xij.)2 14 / 33
  • 21. Introduction Multilevel Distribution Multilevel PCA MLPCA ⇒ Model for the between and within part i = 1, ..., I groups, J var Xi(ki ×J) = 1ki m + 1ki Fb i V b + Fw i V w + Ei • Fb i (Qb × 1) between component scores of group i • V b (J × Qb) between loading matrix • Fw i (ki × Qw ) within component scores of group i • Vw (J × Qw ) within loading matrix. Constant across groups Fitted by solving the least squares (Timmerman, 2006) argmin F(m, Fb i , V b , Fw i , V w ) = I i=1 Xi − 1ki m − 1ki Fb i V b − Fw i V w 2 I i=1 ki Fb i = 0Qb and 1ki Fw i = 0Qw , ∀i for identifiability. 15 / 33
  • 22. Introduction Multilevel Distribution MLPCA - quantitative data i = 1, ..., I groups, J var, ki nb obs in group i ⇒ Estimation: minimize the RSS argmin F() = I i=1 Xi − 1ki m − 1ki Fb i V b − Fw i V w 2 , I i=1 ki Fb i = 0Qb and 1ki Fw i = 0Qw , ∀i for identifiability. (ˆFb , ˆV b ): Weigthed PCA on the between part: SVD on WXm; Xm (I × J) the means of the variables per group, W (I × I) wii = √ ki (ˆFw , ˆV w ) PCA on the within part: SVD on the centered data per group Xw (K × J), K = i ki ⇒ With missing values: Weighted Least Squares ⇒ Iterative imputation algorithm (imputation - estimation) 16 / 33
  • 23. Introduction Multilevel Distribution Iterative MLPCA 2. iteration : estimation of the between structure • SVD WXm = PDQ ; Qb eigenvectors are kept: ˆFb i = [W −1 PQb ]i , ˆFb concatenation by row of [1ki ˆFb i ] ˆV b = QQb DQb , (J × Qb) • the between hat matrix is computed: (ˆXb ) = ˆFb ˆV b 3. iteration : imputation of the missing values with the fitted values • ˆX = 1K ˆm( −1) + (ˆXb ) + (ˆXw )( −1) . The newly imputed dataset is X = W X + (1K × 1J − W ) ˆX • ˆm is computed on X 4. iteration : estimation of the within structure • SVD (Xw ) = PDQ ; Qw eigenvectors are kept: Fw = PQw (K × Qw ) V w = QQw DQw (J × Qw ) • the within hat matrix is computed (ˆXw ) = ˆFw ˆV w 5. iteration : imputation of the missing values with the fitted values • X +1 = M X + (1K × 1J − M) 1K ˆm( ) + (ˆXb ) + (ˆXw ) • ˆm +1 is computed on X +1 17 / 33
  • 24. Introduction Multilevel Distribution Multiple Correspondence Analysis (MCA) Greenacre Xn×m m categorical variables coded with indicator matrix A y . . . attack y . . . attack y . . . attack n . . . suicide X = n . . . accident n . . . suicide 1 0 . . . 1 0 0 1 0 . . . 1 0 0 1 0 . . . 1 0 0 0 1 . . . 0 1 0 A = 0 1 . . . 0 0 1 0 1 . . . 0 1 0 p1 0 Dp = . . . 0 pJ For a category c, the frequency of the category: pc = nc /n. A SVD on weighted matrix: Z = 1√ mn (A − 1pT )D −1/2 p = UDV The PC (F = UD) satisfies: arg maxFq∈Rn 1 m m j=1 η2 (Fq, Xj ) η2 (F, Xj ) = Cj c=1 nc (F.c − F..)2 n i=1 Cj c=1(Fic )2 = RSS between RSS tot Benzecri, 1973 :"In data analysis the mathematical problems reduces to computing eigenvectors; all the science (the art) is in finding the right matrix to diagonalize" 18 / 33
  • 25. Introduction Multilevel Distribution Multilevel MCA ⇒ Start with the matrix of dummy variables A and define a between and a within part ⇒ Then, MCA is applied on each part Between: Apply MCA on the matrix with the mean of A per group i (proportion of obs taking each category in group i) (proportion of some disease in a particular hospital). ˆAb = FbV b D 1/2 p + 1np Within part Apply MCA on the data where the between part has been swept out (SVD is applied to 1 np A − ˆAb D −1/2 p ) ˆAw = (np)Fw V w D 1/2 p . ˆA = ˆAb + ˆAw 19 / 33
  • 26. Introduction Multilevel Distribution Regularized iterative Multilevel MCA 20 / 33
  • 27. Introduction Multilevel Distribution Regularized iterative Multilevel MCA • Initialization: imputation of the indicator matrix (proportions) 20 / 33
  • 28. Introduction Multilevel Distribution Regularized iterative Multilevel MCA • Initialization: imputation of the indicator matrix (proportions) • Iterate until convergence 1 estimation: Multilevel MCA on the completed data → ˆFb , ˆV b , ˆFw , ˆV w 20 / 33
  • 29. Introduction Multilevel Distribution Regularized iterative Multilevel MCA • Initialization: imputation of the indicator matrix (proportions) • Iterate until convergence 1 estimation: Multilevel MCA on the completed data → ˆFb , ˆV b , ˆFw , ˆV w 2 imputation with the fitted matrix ˆA = ˆAb + ˆAw 20 / 33
  • 30. Introduction Multilevel Distribution Regularized iterative Multilevel MCA • Initialization: imputation of the indicator matrix (proportions) • Iterate until convergence 1 estimation: Multilevel MCA on the completed data → ˆFb , ˆV b , ˆFw , ˆV w 2 imputation with the fitted matrix ˆA = ˆAb + ˆAw 3 column margins are updated 20 / 33
  • 31. Introduction Multilevel Distribution Regularized iterative Multilevel MCA • Initialization: imputation of the indicator matrix (proportions) • Iterate until convergence 1 estimation: Multilevel MCA on the completed data → ˆFb , ˆV b , ˆFw , ˆV w 2 imputation with the fitted matrix ˆA = ˆAb + ˆAw 3 column margins are updated V1 V2 V3 … V14 V1_a V1_b V1_c V2_e V2_f V3_g V3_h … ind 1 a NA g … u ind 1 1 0 0 0.71 0.29 1 0 … ind 2 NA f g u ind 2 0.12 0.29 0.59 0 1 1 0 … ind 3 a e h v ind 3 1 0 0 1 0 0 1 … ind 4 a e h v ind 4 1 0 0 1 0 0 1 … ind 5 b f h u ind 5 0 1 0 0 1 0 1 … ind 6 c f h u ind 6 0 0 1 0 1 0 1 … ind 7 c f NA v ind 7 0 0 1 0 1 0.37 0.63 … … … … … … … … … … … … … … … ind 1232 c f h v ind 1232 0 0 1 0 1 0 1 … ⇒ the imputed values can be seen as degree of membership 20 / 33
  • 32. Introduction Multilevel Distribution Regularized iterative Multilevel MCA • Initialization: imputation of the indicator matrix (proportions) • Iterate until convergence 1 estimation: Multilevel MCA on the completed data → ˆFb , ˆV b , ˆFw , ˆV w 2 imputation with the fitted matrix ˆA = ˆAb + ˆAw 3 column margins are updated Two ways to impute categories: majority or draw 20 / 33
  • 33. Introduction Multilevel Distribution Regularized iterative Multilevel MCA • Initialization: imputation of the indicator matrix (proportions) • Iterate until convergence 1 estimation: Multilevel MCA on the completed data → ˆFb , ˆV b , ˆFw , ˆV w 2 imputation with the fitted matrix ˆA = ˆAb + ˆAw 3 column margins are updated Two ways to impute categories: majority or draw 20 / 33
  • 34. Introduction Multilevel Distribution Public Assistance - Paris Hospitals Traumabase: 15000 patients/ 250 variables/ 6 hospitals Center Accident Age Sex Weight Height BMI BP SBP 1 Beaujon Fall 54 m 85 NR NR 180 110 2 Lille Other 33 m 80 1.8 24.69 130 62 3 Pitie Salpetriere Gun 26 m NR NR NR 131 62 4 Beaujon AVP moto 63 m 80 1.8 24.69 145 89 6 Pitie Salpetriere AVP bicycle 33 m 75 NR NR 104 86 7 Pitie Salpetriere AVP pedestrian 30 w NR NR NR 107 66 9 HEGP White weapon 16 m 98 1.92 26.58 118 54 10 Toulon White weapon 20 m NR NR NR 124 73 11 Bicetre Fall 61 m 84 1.7 29.07 144 105 ................... SpO2 Temperature Lactates Hb Glasgow Transfusion ........... 1 97 35.6 <NA> 12.7 12 yes 2 100 36.5 4.8 11.1 15 no 3 100 36 3.9 11.4 3 no 4 100 36.7 1.66 13 15 yes 6 100 36 NM 14.4 15 no 7 100 36.6 NM 14.3 15 yes 9 100 37.5 13 15.9 15 yes 10 100 36.9 NM 13.7 15 no 11 100 36.6 1.2 14.2 14 no ............ 21 / 33
  • 35. Introduction Multilevel Distribution Imputed Paris Hospitals data Traumabase: 15000 patients/ 250 variables/ 6 hospitals Center Accident Age Sex Weight Height BMI BP SBP 1 Beaujon Fall 54 m 85.00 1.84 27.04 83 13 2 Lille Other 33 m 80.00 1.80 24.69 33 98 3 Pitie Salpetriere Gun 26 m 81.78 1.85 24.33 34 98 4 Beaujon AVP moto 63 m 80.00 1.80 24.69 48 125 6 Pitie Salpetriere AVP bicycle 33 m 75.00 1.83 24.53 6 122 7 Pitie Salpetriere AVP pedestri 30 m 81.89 1.82 25.24 9 102 9 HEGP White weapon 16 m 98.00 1.92 26.58 21 90 10 Toulon White weapon 20 m 81.68 1.82 25.05 27 109 11 Bicetre Fall 61 m 84.00 1.70 29.07 47 8 SpO2 Temperature Lactates Hb Glasgow............ 1 46 61 289.07 33 14 2 2 72 464.00 16 14 3 2 65 416.00 19 7 4 2 74 130.00 36 6 6 2 65 285.91 50 6 7 2 73 244.99 49 6 9 2 83 196.00 65 6 10 2 76 262.44 43 6 11 2 73 84.00 48 5 22 / 33
  • 36. Introduction Multilevel Distribution Results Competitors - Prediction error: 1 KJ (xijki − ˆxijki )2 • Conditional model with random effect regression (mice) • Random forests (bühlmann, 2012) (not designed) • Global PCA - Separate PCA • Global mixed PCA (with hospital) q q FAMD global Jomo Mult PCA global PCA sep 2.002.052.102.152.202.252.30 MSE sigma=2, pNA=0.2, J=10 q qq 1 2 3 4 5 050100150200250 (PCA separate − multilevel PCA) per group group Vary % of missing, type of missing values, n, nb of groups, etc. 23 / 33
  • 37. Introduction Multilevel Distribution Results J = 10 J = 30 5cat 5cat Global PCA 0.09 0.3 mice 11 282 Multilevel SVD 1.5 1.2 2 7 Global mixed PCA 0.4 0.7 1 4 Random forest 59 200 27 246 Table: Time in seconds for a dataset with 20% NA, I = 5 ki = 200 • PCA mixed as Random Forest • mice (random effect model): difficulties with large dimensions • Separate PCA: pb with many missing values • Multilevel SVD = global SVD when no group effect • Imputation properties depends on the method (linear) • Other methods do not handle categorical variables 24 / 33
  • 38. Introduction Multilevel Distribution Outline 1 Introduction 2 Single imputation for mixed multilevel data 3 Distribution 25 / 33
  • 39. Introduction Multilevel Distribution 6 hospital databases Combining data from different institutional databases promises many advantages in personalizing medical care (large n, more chance for finding patients like me) ⇒ NIH requires sharing of data from funded projects 26 / 33
  • 40. Introduction Multilevel Distribution 6 hospital databases Combining data from different institutional databases promises many advantages in personalizing medical care (large n, more chance for finding patients like me) ⇒ NIH requires sharing of data from funded projects ⇒ The problem: high barriers to aggregation of medical data • lack of standardization of ontologies • privacy concerns • proprietary attitude towards data, reluctance to cede control • complexity/size of aggregated data, updates problems 26 / 33
  • 41. Introduction Multilevel Distribution Solution: distributed computation ⇒ Data aggregation is not always necessary ⇒ NIH splits the storage of aggregated data across several centers ⇒ Data can stay at site ⇒ Computations can be distributed (share burden) ⇒ Hospitals only share intermediate results instead of the raw data 27 / 33
  • 42. Introduction Multilevel Distribution Topology: master-workers (Wolfson, et. al (2010)) ⇒ Ex: Each site share the sum of age ˜Xi and the number of patients ni . The master computes ¯X = ni ˜Xi / ni 28 / 33
  • 43. Introduction Multilevel Distribution Solution: distributed computation ⇒ Many models fitting can be implemented: • Maximizing a likelihood. Intermediate computations break up into sums of quantities computed on local data at sites. Log-likelihood, score function and Fisher information can partition into sums. (OK for logistic regression) • Singular Value Decomposition. Iterative algorithms available for SVD using quantities computed on local data at sites. • And more. Implemented in the R package discomp (Narasimhan et. al., 2017) 29 / 33
  • 44. Introduction Multilevel Distribution Singular value decomposition SVD: Xn×p : Un×kDk×kVp×k Power method to get the first direction: Other dims: "deflation", same procedure in the residuals (X − udv ) ⇒ Involves inner products and sums: distributed 30 / 33
  • 45. Introduction Multilevel Distribution Privacy preserving rank k SVD 31 / 33
  • 46. Introduction Multilevel Distribution Multilevel imputation i I 1 1 k1 1 1 kI ki 1 J Xi XI X1 ? ? ? ? ? ? ? ? ? ? ? ? ? ? ⇒ Impute multilevel data with Multilevel SVD ⇒ Distributed multilevel imputation ⇒ Impute the data of one hospital using the data of the others ⇒ Incentive to encourage the hospitals to participate in the project ⇒ Apply other statistical methods on the imputed data (logistic regression) 32 / 33
  • 47. Introduction Multilevel Distribution Take home message - On going work Multilevel PCA powerful for single imputation of continuous & categorical multilevel data: reduce the dimensionality - capture the similarities between rows and relationship between variables at both levels. ⇒ Computationaly fast - distributed - Implemented R package missMDA • Numbers of components Qb and Qw ? • Inference after imputation. Underestimation of the variance with single imputation 33 / 33
  • 48. Introduction Multilevel Distribution Take home message - On going work Multilevel PCA powerful for single imputation of continuous & categorical multilevel data: reduce the dimensionality - capture the similarities between rows and relationship between variables at both levels. ⇒ Computationaly fast - distributed - Implemented R package missMDA • Numbers of components Qb and Qw ? cross-validation? • Inference after imputation. Underestimation of the variance with single imputation 33 / 33
  • 49. Introduction Multilevel Distribution Take home message - On going work Multilevel PCA powerful for single imputation of continuous & categorical multilevel data: reduce the dimensionality - capture the similarities between rows and relationship between variables at both levels. ⇒ Computationaly fast - distributed - Implemented R package missMDA • Numbers of components Qb and Qw ? cross-validation? • Inference after imputation. Underestimation of the variance with single imputation Multiple imputation: bootstrap + drawn from the predictive distribution N 1K ˆm + ˆFb ˆBb + ˆFw ˆBw , ˆσ2 What confidence should be given to the results obtained from a completed data set? 33 / 33