July 2, 2020
Open Science for research questions,
data, and analyses?
Ewout W. Steyerberg, PhD
Professor of Clinical Biostatistics and
Medical Decision Making
Thanks to many for assistance and inspiration,
including the GAP3 consortium, CENTER-TBI Study
Open Science: what is it at LMU?
2-Jul-202 Insert > Header & footer
Open Science: what is it in the Netherlands?
https://www.openscience.nl/
https://www.coalition-s.org/
Open vs closed science
Long ago
- Performed by a few elite scientists
- Doing private experiments
- Discussion in small, closed communities
Recent
- Science as a profession
- Protect data + code as intellectual property
- Aim for shocking findings in high IF journals
https://www.sciencemag.org/news/2020/06/whos-blame-these-three-scientists-are-heart-surgisphere-covid-19-scandal
Overall claim
“Open Science will make research better”
Vote pro / con
Aims today:
- Highlight some strong points in Open Science
- Hint at some challenges in Open Science
Reflections based on personal 30-yr research experience,
specific focus on prediction / decision making
Open Science to better address
Big research questions
Open science research questions: case 1
Example 1: Red cards and dark-skin-toned soccer players
https://psyarxiv.com/qkwst/
Open science research questions: case 1
• 29 teams involving 61 analysts; same dataset; same research question:
whether soccer referees are more likely to give red cards to dark-skin-toned
players than to light-skin-toned players
• Estimated odds ratios 0.89–2.93 (median 1.3)
• 20 teams: statistically significant positive effect; 9 teams: non-significant relation
Estimated odds ratios by 29 research teams
“Logistic regression”
Open science research questions: case 1
• 29 teams involving 61 analysts; same dataset; same research question:
whether soccer referees are more likely to give red cards to dark-skin-toned
players than to light-skin-toned players
• Estimated odds ratios 0.89–2.93 (median 1.3)
• 20 teams: statistically significant positive effect; 9 teams: non-significant relation
• 21 unique combinations of covariates
• “Variation in analysis of complex data may be difficult to
avoid, even by experts with honest intentions”
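One way covariate choices can move an odds ratio, as a minimal sketch: a crude odds ratio compared with a Mantel-Haenszel odds ratio stratified by a single covariate. All counts are invented for illustration and do not come from the red-card dataset; the function names are hypothetical.

```python
import math

def odds_ratio(a, b, c, d):
    """Odds ratio for a 2x2 table.
    a, b: events / non-events in the exposed group
    c, d: events / non-events in the unexposed group
    """
    return (a * d) / (b * c)

def wald_ci(a, b, c, d, z=1.96):
    """Approximate 95% Wald confidence interval on the odds-ratio scale."""
    log_or = math.log(odds_ratio(a, b, c, d))
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    return math.exp(log_or - z * se), math.exp(log_or + z * se)

def mantel_haenszel_or(strata):
    """Mantel-Haenszel odds ratio pooled over strata of 2x2 tables."""
    num = sum(a * d / (a + b + c + d) for a, b, c, d in strata)
    den = sum(b * c / (a + b + c + d) for a, b, c, d in strata)
    return num / den

# Hypothetical counts (a, b, c, d) in two strata of one covariate.
strata = [(8, 92, 30, 370), (40, 160, 18, 82)]

# Ignoring the covariate: collapse the strata into a single table.
crude_table = tuple(sum(col) for col in zip(*strata))
crude = odds_ratio(*crude_table)
ci = wald_ci(*crude_table)

# Adjusting for the covariate: pool the stratum-specific tables.
adjusted = mantel_haenszel_or(strata)

print(f"crude OR    = {crude:.2f} (95% CI {ci[0]:.2f} to {ci[1]:.2f})")
print(f"adjusted OR = {adjusted:.2f}")
```

In this toy example the crude estimate is clearly larger than the stratified one, so two honest analysts could report different answers from the same data.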
Open science research questions: case 2
Machine learning vs conventional modeling
1. Findings convincing?
2. Systematic / “it depends”?
Findings not convincing
Cox, #4, 30 vars, max c = 0.793
RF, #7, 600 vars, c = 0.797
Elastic, #9, 600 vars, c = 0.801
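For readers less familiar with the c statistic, a minimal sketch for a binary outcome: the proportion of event / non-event pairs in which the event has the higher predicted risk, with ties counted as one half. This is a simplification; the models above are survival models, where a censoring-aware version would be needed. The data and the function name are hypothetical.

```python
def c_statistic(y_true, y_score):
    """Concordance (c) statistic for a binary outcome.

    y_true:  0/1 outcomes
    y_score: predicted risks (any monotone score works)
    """
    events = [s for y, s in zip(y_true, y_score) if y == 1]
    nonevents = [s for y, s in zip(y_true, y_score) if y == 0]
    if not events or not nonevents:
        raise ValueError("need at least one event and one non-event")

    concordant = 0.0
    for e in events:
        for n in nonevents:
            if e > n:
                concordant += 1.0   # event ranked above non-event
            elif e == n:
                concordant += 0.5   # tie counts as half
    return concordant / (len(events) * len(nonevents))

# Hypothetical outcomes and predicted risks.
y = [1, 1, 0, 0, 1, 0]
risk = [0.9, 0.4, 0.3, 0.5, 0.7, 0.2]
c = c_statistic(y, risk)
print(f"c = {c:.3f}")
```

On this scale, 0.5 is no better than chance and 1.0 is perfect ranking, which helps put differences in the third decimal into perspective.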
Machine learning vs conventional modeling
1. Findings convincing?
“We found that random forests did not outperform Cox models despite their
inherent ability to accommodate nonlinearities and interactions. …
Elastic nets achieved the highest discrimination performance …, demonstrating
the ability of regularisation to select relevant variables and optimise model
coefficients in an EHR context.”
Machine learning vs conventional modeling
1. Findings convincing? Not in case-study
2. Systematic / “it depends”?
Open science research questions: case 2
• 243 real datasets from “the OpenML database”
• RF performed better than LR:
mean difference between RF and LR was 0.041 (95% CI [0.031, 0.053]) for
the Area Under the ROC Curve
• Results were dependent on the inclusion criteria used to select the example
datasets
• ES: Results rely on 10 x 10-fold cross-validation
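What "10 x 10-fold cross-validation" means in practice, as a minimal sketch: the data are shuffled 10 times, and each shuffle is split into 10 folds, giving 100 train/test splits per dataset. Model fitting is left out on purpose; the function name, the seed, and the sample size are assumptions for illustration, not the setup of the cited comparison.

```python
import random

def repeated_kfold(n, k=10, repeats=10, seed=0):
    """Yield (repeat, fold, train_idx, test_idx) for repeated k-fold CV.

    Each repeat reshuffles the n row indices; within a repeat every
    row lands in exactly one test fold.
    """
    rng = random.Random(seed)
    for r in range(repeats):
        idx = list(range(n))
        rng.shuffle(idx)
        for f in range(k):
            test = idx[f::k]                 # every k-th shuffled index
            test_set = set(test)
            train = [i for i in idx if i not in test_set]
            yield r, f, train, test

# Hypothetical dataset with 53 rows.
splits = list(repeated_kfold(n=53))
print(len(splits), "train/test splits")

# A model would be fitted on `train` and scored on `test` for each
# split; averaging the 100 test-set AUCs gives one estimate per dataset.
```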
Open science research questions: case 2
• More clarification needed on when ML / RF works best; at the least, a large N is needed
Systematic review on ML vs classic modeling
Differences in discrimination
Summary on examples of Open Science
to better address Big research questions
• 1 dataset
• Multiple modelers
• Multiple modeling options
• 1 neutral comparison; 243 OpenML datasets
• Review of 282 comparative studies: meta-research
Open Science: data sharing
Heterogeneity in data ... ignored
Data sharing
• Pro:
  - Allowed for larger sample size in a rare disease
• Cons:
  - Heterogeneity?
  - Substantial politics / efforts
Open Science: analyses and interpretation
OHDSI: bridging data sharing - analyses
Analyses: OHDSI model
OHDSI: COVID and other research topics
The power of OHDSI
OMOP common data model enables sharing of
model development code
Performance for different outcomes in multiple cohorts
OHDSI: bridging data sharing - analyses
• Keep data local
• Run locally started, centrally available analyses
• Share results centrally
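The three steps above can be sketched in a few lines, assuming a deliberately simple analysis (an outcome rate). This is not OHDSI code or its API: the class, the function names, and the site data are all hypothetical. The point is only that patient-level rows stay at the site and aggregate results travel.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SiteSummary:
    """Aggregate result that is allowed to leave a site."""
    site: str
    n: int
    events: int

def run_local_analysis(site, outcomes):
    """Centrally written analysis, executed at the site on local data.
    Only counts are returned, never patient-level rows."""
    return SiteSummary(site=site, n=len(outcomes), events=sum(outcomes))

def pool_results(summaries):
    """Central step: combine the shared aggregates."""
    total_n = sum(s.n for s in summaries)
    total_events = sum(s.events for s in summaries)
    return {
        "n": total_n,
        "events": total_events,
        "rate": total_events / total_n,
        "per_site": {s.site: s.events / s.n for s in summaries},
    }

# Hypothetical local data (0/1 outcomes) held at three sites.
local_data = {
    "site_a": [0, 1, 0, 0, 1],
    "site_b": [1, 1, 0, 1],
    "site_c": [0, 0, 0, 0, 0, 1],
}

summaries = [run_local_analysis(name, rows) for name, rows in local_data.items()]
pooled = pool_results(summaries)
print(pooled)
```

The per-site rates in the pooled result already hint at the next challenge: sites can differ considerably.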
Open Science: analyses and interpretation
Open Science challenge:
dealing with heterogeneity
Heterogeneity
• Study design
• Selection of subjects
• Measurement of covariates
• Measurement of outcomes
• Associations of covariates with outcome
• Overall outcome rates
• Performance of prediction models
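Heterogeneity in any of these quantities can be put into numbers once each cohort provides an estimate and a standard error. A minimal sketch, assuming the standard meta-analysis summaries (Cochran's Q, I², and the DerSimonian-Laird between-cohort variance); the estimates below are invented log odds ratios, and the function name is hypothetical.

```python
def heterogeneity(estimates, std_errors):
    """Fixed-effect pooled estimate plus Q, I^2 and DerSimonian-Laird tau^2.

    estimates:  per-cohort estimates on an additive scale (e.g. log OR)
    std_errors: their standard errors
    """
    weights = [1.0 / se ** 2 for se in std_errors]
    sum_w = sum(weights)
    pooled = sum(w * e for w, e in zip(weights, estimates)) / sum_w

    # Cochran's Q: weighted squared deviations from the pooled estimate.
    q = sum(w * (e - pooled) ** 2 for w, e in zip(weights, estimates))
    df = len(estimates) - 1

    # I^2: share of total variation attributed to between-cohort differences.
    i2 = max(0.0, (q - df) / q) if q > 0 else 0.0

    # DerSimonian-Laird moment estimator of the between-cohort variance.
    c = sum_w - sum(w ** 2 for w in weights) / sum_w
    tau2 = max(0.0, (q - df) / c)

    return {"pooled": pooled, "Q": q, "df": df, "I2": i2, "tau2": tau2}

# Hypothetical log odds ratios for one predictor in four cohorts.
log_or = [0.10, 0.35, 0.60, 0.20]
se = [0.10, 0.12, 0.15, 0.10]
result = heterogeneity(log_or, se)
print({k: round(v, 3) for k, v in result.items()})
```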
Analyses: dealing with heterogeneity
15 cohorts: 11 RCTs, 4 observational studies
Heterogeneous case-mix
Heterogeneous predictor effects
Heterogeneous predictions
Heterogeneity in individual predictions
“Open Science will make research better”
1. Research questions in competitions
   • Red cards
   • Neutral comparisons / meta-analysis
2. Data sharing
   • Old-fashioned?
3. Analyses
   • OHDSI: modern
   • Heterogeneity
Open science research extends discussions from meta-analysis;
contrast Cochrane reviews vs Big Data

Open science LMU session contribution E Steyerberg 2jul20
