BioSHaRE conference July 28th, 2015, Milan - Latest tools and services for data sharing
Stream 3: Study application and results
Contact info:
Prof. Edwin van den Heuvel
University of Eindhoven
e.r.v.d.heuvel@tue.nl
key words: biobank, bioshare, cohort, data sharing, epidemiology, harmonisation, statistics
3. INTRODUCTION
Meta-Analysis
Combining data from different sources most likely started with Carl Friedrich Gauss (1777-1855)
He used data from astronomers to calculate planet orbits
He developed least squares and the classical reliability theory: the true parameter is observed with noise
4. INTRODUCTION
Meta-Analysis
Combining data was a true problem at the beginning of the 20th century
Potency estimates from bioassays showed tremendous heterogeneity
Least squares was unsatisfactory
Landmark paper of Cochran in 1954 discussed various weighted means
This field implicitly used the random effects model
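[Figure: bioassay response versus concentration for the Reference (R) and Unknown (U) preparations; the relative potency ρ is expressed through the intercepts αR and αU and the slope β]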
5. INTRODUCTION
Meta-Analysis
Gene Glass introduces the term (aggregate data)
meta-analysis in 1976 as the analysis of analyses
This paper did not refer to the bioassay field at all
Pools estimates from published papers
A meta-analysis assumes the existence of:
The estimate bi of the association at study i
A standard error si of the estimate bi
The number of degrees of freedom di for the standard error si
Different statistical approaches are available to pool
the estimates
6. INTRODUCTION
Meta-Analysis
Fixed effects meta-analysis model
bi = β + ei, ei ~ N(0, σi²)
the standard error si is an estimate of σi
Random effects meta-analysis model
bi = β + Ui + ei, ei ~ N(0, σi²), Ui ~ N(0, τ²)
the standard error si is an estimate of σi
τ² represents heterogeneity in the estimates
In case heterogeneity is present (τ ≠ 0) the fixed
effects analysis underestimates the standard error of
the pooled estimate
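The difference between the two models can be illustrated with a short numerical sketch. The code below pools estimates bi with standard errors si by inverse-variance weighting. For the random effects model it uses the DerSimonian-Laird moment estimator of τ², which is an assumption made here for illustration, since the slides do not say which estimator is meant. The function names and the input values are hypothetical.

```python
import numpy as np

def pool_fixed(b, s):
    """Inverse-variance pooled estimate and its SE under bi = β + ei."""
    b, s = np.asarray(b, float), np.asarray(s, float)
    w = 1.0 / s**2                          # weights 1/si², with si standing in for σi
    return np.sum(w * b) / np.sum(w), np.sqrt(1.0 / np.sum(w))

def pool_random(b, s):
    """Pooled estimate, SE and τ² under bi = β + Ui + ei.

    τ² is estimated with the DerSimonian-Laird moment estimator
    (an assumption of this sketch, not taken from the slides).
    """
    b, s = np.asarray(b, float), np.asarray(s, float)
    w = 1.0 / s**2
    beta_fixed, _ = pool_fixed(b, s)
    q = np.sum(w * (b - beta_fixed) ** 2)   # Cochran's Q statistic
    c = np.sum(w) - np.sum(w**2) / np.sum(w)
    tau2 = max(0.0, (q - (len(b) - 1)) / c) # truncated at zero
    w_star = 1.0 / (s**2 + tau2)            # weights now include the heterogeneity
    est = np.sum(w_star * b) / np.sum(w_star)
    return est, np.sqrt(1.0 / np.sum(w_star)), tau2

# Made-up estimates from three studies, chosen to be clearly heterogeneous
b = [0.1, 0.5, 0.9]
s = [0.1, 0.1, 0.1]
print("fixed :", pool_fixed(b, s))
print("random:", pool_random(b, s))
```

With these made-up inputs both models give the same pooled estimate, but the random effects standard error is larger, in line with the remark that the fixed effects analysis underestimates it when τ ≠ 0.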
7. INTRODUCTION
Meta-Analysis
Not all researchers are in favor of meta-analysis
Van Houwelingen (1997) wrote:
“…popular practice of analysing summary measures from
selected publications is a poor man’s solution.”
“…I hope that we will have full multi-center multi-study
databases that can be analysed by appropriate random
effects models considering both random variation within
and between studies and/or centres.”
Thus there is a strong need for individual participant
data analysis
8. INTRODUCTION
Meta-Analysis
IPD meta-analysis can be performed in two ways:
One-stage analysis: All individual data are analyzed simultaneously (possibly with sophisticated statistical models)
Two-stage or coordinated meta-analysis: Each study is analyzed separately and the model parameters are pooled with the original meta-analysis tools
Two-stage analysis seems easier to implement, since it does not require the data to be pooled at one location
One-stage IPD meta-analysis that does not pool data at one location is called federated data analysis
9. FEDERATED DATA ANALYSIS
Linear Regression
Consider the following setting
Yij is the response of subject j in study i
Xij is the exposure of subject j in study i
Zij is a confounder of subject j in study i
The simplest linear regression model is
M1: Yij = β0 + βZ·Zij + βX·Xij + eij
Model M1 assumes:
The populations are homogeneous (common intercept)
Associations are homogeneous
Residual variances are homogeneous
The ratio of sample size to population size is homogeneous
10. FEDERATED DATA ANALYSIS
Linear Regression
Federated data analysis for estimation of β0, βZ, βX, and σ² requires the following summary statistics from each study:
Number of observations
Sum of the confounders
Sum of the exposures
Sum of the squared confounders
Sum of the squared exposures
Sum of the confounder – exposure product
Sum of the response
Sum of the response – confounder product
Sum of the response – exposure product
Sum of the squared responses (for the SE)
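As a rough sketch of why these sums suffice, the code below computes them inside each study and then fits model M1 from their totals alone, without moving any individual records. Arranging the sums as the matrix W'W, the vector W'y and the scalar y'y, the function names, and the simulated data are all assumptions for illustration; this is not the BioSHaRE or DataSHIELD implementation.

```python
import numpy as np

def study_summaries(y, z, x):
    """Computed locally in each study; only these sums leave the study."""
    w = np.column_stack([np.ones(len(y)), z, x])  # intercept, confounder, exposure
    return {
        "n": len(y),
        "wtw": w.T @ w,        # holds n, ΣZ, ΣX, ΣZ², ΣX², ΣZX
        "wty": w.T @ y,        # holds ΣY, ΣYZ, ΣYX
        "yty": float(y @ y),   # ΣY², needed for σ² and the standard errors
    }

def federated_ols(summaries):
    """Least-squares fit of M1 from the summed study summaries."""
    n = sum(s["n"] for s in summaries)
    wtw = sum(s["wtw"] for s in summaries)
    wty = sum(s["wty"] for s in summaries)
    yty = sum(s["yty"] for s in summaries)
    beta = np.linalg.solve(wtw, wty)              # (β0, βZ, βX)
    rss = yty - beta @ wty                        # residual sum of squares at the solution
    sigma2 = rss / (n - len(beta))
    se = np.sqrt(sigma2 * np.diag(np.linalg.inv(wtw)))
    return beta, sigma2, se

# Simulated studies (hypothetical values, for illustration only)
rng = np.random.default_rng(0)
studies = []
for n in (200, 20, 30):
    z = rng.normal(66, 10, n)
    x = rng.integers(0, 2, n).astype(float)
    y = 7.5 - 0.015 * z + 0.04 * x + rng.normal(0, 0.9, n)
    studies.append((y, z, x))

beta, sigma2, se = federated_ols([study_summaries(*s) for s in studies])
print(beta, sigma2, se)
```

Because least squares depends on the data only through W'W, W'y and y'y, the result equals the fit on the pooled individual data up to rounding.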
12. FEDERATED DATA ANALYSIS
Linear Regression
Models M2 and M3a:
Have a homogeneous association for the exposure
Require a federated data analysis
Involve the same summary statistics as the federated data analysis of model M1
Models M3b, M3c, and M4:
Can be estimated with the same summary statistics used in the federated data analysis
Require aggregate data meta-analysis to pool the estimates bX,i from the different studies
13. FEDERATED DATA ANALYSIS
Linear Regression
Simulation studies show that an aggregate data meta-analysis for model M3a produces strong heterogeneity in the estimates bX,i even though the association is homogeneous
Treating the regression parameters in models M2, M3a, M3b, M3c, and M4 as fixed effects will underestimate the pooled association βX
Thus models M2, M3a, M3b, M3c, and M4 need to assume that the parameters are random: mixed effects models, like the random effects meta-analysis
14. FEDERATED DATA ANALYSIS
Mixed Effects
Model M2 becomes
M2: Yij = β0 + βZ·Zij + βX·Xij + Ui + eij
The associations are still assumed homogeneous
The residual variance is homogeneous
Intercept is heterogeneous, Ui ~ N(0, τ²): random intercept model
Federated data analysis for mixed models is less
straightforward – random term complicates method
The Expectation-Maximization algorithm can be used
to estimate the model parameters in a
federated approach
15. EM-ALGORITHM
Mixed Effects Models
Step 0: Choose starting values β0(0), βZ(0), βX(0),
τ(0), and σ(0) for β0, βZ, βX, τ, and σ
Step 1: E-Step: using the estimates from the
previous step, estimate Ui
M-Step: Using the result from the E-step
determine β0(1), βZ(1), βX(1), and τ(1)
Evaluate: how much the estimates have changed
If the changes are small enough → convergence
If the changes are still too large → conduct step 1 again using the last available estimates
EM uses the same summary statistics in every iteration
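The steps above can be sketched for the random intercept model M2 as follows. This is a minimal illustration under stated assumptions, not the R-program used for the validation: it works with τ² and σ² rather than τ and σ, it also updates σ² in the M-step, and the summary layout (W'W, W'y, y'y per study), the function names and the simulated data are hypothetical.

```python
import numpy as np

def study_summaries(y, z, x):
    """Per-study sums; the first column of W is the intercept."""
    w = np.column_stack([np.ones(len(y)), z, x])
    return {"n": len(y), "wtw": w.T @ w, "wty": w.T @ y, "yty": float(y @ y)}

def em_random_intercept(summaries, beta, tau2, sigma2, tol=1e-8, max_iter=100_000):
    """EM for Yij = wij'β + Ui + eij using only the per-study sums.

    Because W starts with the intercept, wtw[0] equals Σj wij and wty[0] equals Σj yij.
    """
    beta = np.asarray(beta, float)
    k = len(summaries)
    n_tot = sum(s["n"] for s in summaries)
    wtw_tot = sum(s["wtw"] for s in summaries)
    for it in range(1, max_iter + 1):
        # E-step: posterior mean m_i and variance v_i of each Ui
        m, v = [], []
        for s in summaries:
            r = s["wty"][0] - s["wtw"][0] @ beta       # Σj (yij - wij'β)
            denom = sigma2 + s["n"] * tau2
            m.append(tau2 * r / denom)
            v.append(sigma2 * tau2 / denom)
        # M-step: update β, τ² and σ² given the E-step moments
        rhs = sum(s["wty"] - s["wtw"][0] * mi for s, mi in zip(summaries, m))
        new_beta = np.linalg.solve(wtw_tot, rhs)
        new_tau2 = sum(mi**2 + vi for mi, vi in zip(m, v)) / k
        ss = 0.0
        for s, mi, vi in zip(summaries, m, v):
            rss = s["yty"] - 2 * new_beta @ s["wty"] + new_beta @ s["wtw"] @ new_beta
            r = s["wty"][0] - s["wtw"][0] @ new_beta
            ss += rss - 2 * mi * r + s["n"] * (mi**2 + vi)  # E[Σj (yij - wij'β - Ui)²]
        new_sigma2 = ss / n_tot
        # Evaluate: largest absolute change over all parameters
        change = max(np.max(np.abs(new_beta - beta)),
                     abs(new_tau2 - tau2), abs(new_sigma2 - sigma2))
        beta, tau2, sigma2 = new_beta, new_tau2, new_sigma2
        if change < tol:
            break
    return beta, tau2, sigma2, it

# Simulated studies with a study-specific intercept shift (hypothetical values)
rng = np.random.default_rng(1)
summaries = []
for n in (200, 48, 39):
    z = rng.normal(65, 10, n)
    x = rng.integers(0, 2, n).astype(float)
    y = 7.5 - 0.015 * z + 0.04 * x + rng.normal(0, 0.5) + rng.normal(0, 0.9, n)
    summaries.append(study_summaries(y, z, x))

fit_zero = em_random_intercept(summaries, [1, 1, 1], tau2=0.0, sigma2=1.0)
# From a positive start EM can need very many iterations, so this run is capped
# and may stop before the convergence criterion is met.
fit_pos = em_random_intercept(summaries, [1, 1, 1], tau2=1.0, sigma2=1.0, max_iter=2000)
print(fit_zero)
print(fit_pos)
```

In this sketch a starting value of τ² = 0 makes every posterior mean and variance of Ui equal to zero, so τ² stays at zero and the fit reduces to the least-squares fit of M1. That is one reason to compare a run started at zero with a run started at a positive value.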
16. VALIDATION RESULTS
TEST DATA: MULTICENTER TRIAL
Data from a multicenter trial was used
Two responses: Hemoglobin in blood (g/dl) and blood loss during surgery (mL)
Exposure: Treatment (control; new)
Covariate: Age (years)
Three centers (1, 2, 3); different centers were selected for the two responses
The EM-algorithm was used with the summary
statistics needed to estimate M1
A random intercept model with maximum likelihood
was applied on the full data set
17. VALIDATION RESULTS
TEST DATA: MULTICENTER TRIAL
Description of the validation data
Hemoglobin
                Center 1       Center 2       Center 3       P-value
                (n=200)        (n=20)         (n=30)
Hb (Std)        6.50 (0.890)   6.81 (1.065)   6.80 (0.859)   0.179
Age (Std)       66.4 (9.78)    67.3 (9.97)    66.1 (8.21)    0.864
Treatment (%)   100 (50)       8 (40)         15 (50)        0.692

Blood loss
                Center 1       Center 2       Center 3       P-value
                (n=200)        (n=48)         (n=39)
BL (Std)        641 (701)      763 (527)      748 (428)      <0.001
Age (Std)       66.4 (9.78)    64.6 (9.64)    61.7 (9.73)    0.007
Treatment (%)   100 (50)       21 (44)        21 (54)        0.622
18. VALIDATION RESULTS
TEST DATA: MULTICENTER TRIAL
Hemoglobin
EM-nr indicates the number of iterations used in EM
Convergence criterion for all parameters was set at 10⁻⁸
A start value of 0 for τ leads to incorrect results
Convergence is relatively fast and close to the truth for positive starting values
            β0       βZ         βX        τ²         σ²
EM-0        1        1          1         1          1
EM-190206   7.6086   -0.01548   0.04021   0.006659   0.7887
EM-0        1        1          1         0          1
EM-3        7.5683   -0.01556   0.03838   0          0.7933
SAS         7.6086   -0.01548   0.04021   0.006657   0.7887
19. VALIDATION RESULTS
TEST DATA: MULTICENTER TRIAL
Blood loss
EM-nr indicates the number of iterations used in EM
Convergence criterion for all parameters was set at 10⁻⁸
A start value of 1 for τ leads to incorrect results
Convergence is really fast and close to the truth when τ = 0 is used as the starting value
             β0       βZ       βX         τ²        σ²
EM-0         1        1        1          1         1
EM-1139723   608.23   1.7755   -97.9651   0.01140   411547
EM-0         1        1        1          0         1
EM-3         608.23   1.7755   -97.9651   0         411547
SAS          608.23   1.7755   -97.9651   0         411547
20. VALIDATION RESULTS
TEST DATA: MULTICENTER TRIAL
Both sets of starting values are needed to make
appropriate inference
When the two sets provide identical estimates of the fixed parameters, the set with τ = 0 provides the answer
When the two sets provide different estimates of the fixed parameters, the set with τ ≠ 0 provides the answer
The standard errors can also be determined
This has not yet been incorporated in the R-program
Manual calculations demonstrate that the results coincide
with the SAS output, when the appropriate estimates are
taken into account
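Read against the two validation tables (hemoglobin: the fixed estimates differ between the two starts and SAS reports τ² > 0; blood loss: the fixed estimates coincide and SAS reports τ² = 0), the selection rule can be sketched as a small helper. The function name and the relative tolerance are assumptions for illustration; the numbers are copied from the tables.

```python
import numpy as np

def choose_fit(fixed_tau_zero, fixed_tau_pos, rtol=1e-4):
    """Pick the run to report by comparing the fixed-parameter estimates.

    rtol is a hypothetical tolerance for calling two estimates identical.
    """
    same = np.allclose(fixed_tau_zero, fixed_tau_pos, rtol=rtol, atol=0.0)
    return "tau = 0" if same else "tau != 0"

# Fixed-parameter estimates (β0, βZ, βX) from the two validation tables
hb_zero, hb_pos = [7.5683, -0.01556, 0.03838], [7.6086, -0.01548, 0.04021]
bl_zero, bl_pos = [608.23, 1.7755, -97.9651], [608.23, 1.7755, -97.9651]

print(choose_fit(hb_zero, hb_pos))   # hemoglobin: estimates differ
print(choose_fit(bl_zero, bl_pos))   # blood loss: estimates coincide
```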
21. VALIDATION RESULTS
BIOSHARE DATA
Response: Systolic blood pressure
Exposure: Noise
Confounders: Age, Sex, PM10
Two cohorts: HUNT and LifeLines
EM-algorithm applied to fit the random intercept model
Comparison with model M1:
Using the standard algorithm
Using DataSHIELD glm
22. VALIDATION RESULTS
BIOSHARE DATA
Systolic blood pressure
The EM-algorithm seems to demonstrate heterogeneity in the intercepts
The analyses with model M1 and with EM are identical when the starting value is τ = 0, since both use the same summaries
DataSHIELD glm seems to deviate somewhat, but this did not happen on the test data
             β0       βAGE     βSEX      βPM10      βNOISE     τ²       σ²
EM-0         1        1        1         1          1          1        1
EM-Final     111.59   0.4141   -7.255    0.04627    -0.01351   1.8992   217.43
EM-0         1        1        1         1          1          0        1
EM-Final     114.68   0.4143   -7.2473   -0.16617   -0.00300   0        217.45
Model M1     114.68   0.4143   -7.2473   -0.16617   -0.00300   NA       217.45
DataSHIELD   114.95   0.4149   -7.2449   -0.18129   -0.00230   NA       217.41
23. CONCLUDING REMARKS
Follow-up steps for BioSHARE (in August):
1. Complete the existing algorithm for linear random
intercept models including standard errors
2. Implement this algorithm in DataSHIELD
3. Finalize statistics paper on algorithms for federated data
analysis for mixed models
Extensions for DataSHIELD after BioSHARE:
1. To handle missing data as well
2. To linear random coefficient models
3. To generalized random coefficient models
24. Acknowledgement
The research leading to these results has received funding from the
European Union Seventh Framework Programme (FP7/2007-2013) under
grant agreement n° 261433 (Biobank Standardisation and Harmonisation
for Research Excellence in the European Union - BioSHaRE-EU)