Data re-use in the CALIBER programme
Anoop Shah (a.shah@ucl.ac.uk)
Clinical Epidemiology Group, University College London
An overview of work being performed to make research data easier to manage, analyse and use in the CALIBER programme. Presentation given by Anoop Shah of UCL at the Data Management in Practice workshop, which took place on 14th November 2013 at the London School of Hygiene and Tropical Medicine.


Data re-use in the CALIBER programme

  1. Data re-use in the CALIBER programme
     Anoop Shah (a.shah@ucl.ac.uk), Clinical Epidemiology Group, University College London, 14th November 2013
  2. Outline
     1 The CALIBER programme
     2 Why make research data re-usable?
     3 The CALIBER approach
     4 Summary
  3. The CALIBER programme
     UCL & LSHTM collaboration, funded by NIHR and Wellcome Trust
     [Diagram: four sources feed the CALIBER linked research database: General practice, MINAP registry, Death registrations, Hospital Episode Statistics]
  4. CALIBER data
  5. Defining continuous variables
     Clinical, e.g. blood pressure; laboratory, e.g. white cell count
     • Recorded in CPRD (primary care)
     • Identified by ‘entity code’ and medcode (more granular)
     • Lab data now electronically transferred
     • Problems:
         • Missing units
         • Erroneous values
         • Inconsistent recording
         • Missing data
  6. Medcodes associated with a test result
     Example: neutrophil counts (a type of white blood cell) – may be absolute or percentage

     Medcode   Percent   Term
     18        89.6      Neutrophil count
     17622      9.9      Percentage neutrophils
     23114      0.3      Granulocyte count
     23115      0.1      Percentage granulocytes
     13777      0.1      Neutrophil count NOS
  7. Distribution of values for different units
  8. Most common units
  9. Analysis issues
     • Extraction algorithm
     • Remove biologically implausible extreme values
         • In a huge dataset with no restriction on possible values, there will be some errors
     • Standardise units
     • Decide how to analyse
         • Timing, e.g. relative to index date
         • Repeat measures
         • Transformation, splines, categories etc.
         • Missing data (e.g. multiple imputation)
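
The unit standardisation and extreme-value steps on the slide above could look roughly like the following Python sketch. It is illustrative only and is not the CALIBER extraction algorithm: the `LabResult` type, the unit synonym table and the plausibility range are all hypothetical, and dropping results with a missing or unrecognised unit is just one possible choice.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical spellings of the same unit (10^9 cells per litre).
UNIT_SYNONYMS = {
    "10^9/l": "10^9/L",
    "10*9/l": "10^9/L",
    "x10^9/l": "10^9/L",
}

# Hypothetical plausibility range for an absolute count in 10^9/L.
PLAUSIBLE_RANGE = (0.0, 100.0)


@dataclass
class LabResult:
    patient_id: int
    medcode: int
    value: float
    unit: Optional[str]  # None when no unit was recorded


def standardise_unit(unit: Optional[str]) -> Optional[str]:
    """Map a recorded unit to its standard spelling, or None if unknown."""
    if unit is None:
        return None
    return UNIT_SYNONYMS.get(unit.strip().lower())


def clean_results(results: List[LabResult]) -> List[LabResult]:
    """Keep results with a recognised unit and a value inside the range."""
    low, high = PLAUSIBLE_RANGE
    kept = []
    for r in results:
        unit = standardise_unit(r.unit)
        if unit is None:
            continue  # missing or unrecognised unit (assumed policy: drop)
        if not (low <= r.value <= high):
            continue  # biologically implausible extreme value
        kept.append(LabResult(r.patient_id, r.medcode, r.value, unit))
    return kept
```

In practice the range and the handling of missing units would need clinical review for each variable; the point of the sketch is only that both rules live in one reusable function rather than in each study's own script.
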
  10. Observation time in GP practice
     • Observation time – when registered at GP practice
     • Practice ‘up to standard date’ – date after which we expect that data are recorded
     • If nothing recorded while registered at GP:
         • Patient may be abroad
         • Patient may be genuinely healthy
         • Excluding observation time with no records risks bias
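
One way to turn the two dates on the slide above into an observation window is sketched below in Python. It assumes that observation starts at the later of the patient's registration date and the practice's up-to-standard date; the slide does not state this rule, and the function names are hypothetical.

```python
from datetime import date


def observation_start(registration_date: date, up_to_standard_date: date) -> date:
    """Assumed rule: observation starts at the later of the two dates."""
    return max(registration_date, up_to_standard_date)


def observed_days(registration_date: date,
                  up_to_standard_date: date,
                  end_date: date) -> int:
    """Days of observation up to end_date, never negative."""
    start = observation_start(registration_date, up_to_standard_date)
    return max((end_date - start).days, 0)
```

Note that the window depends only on registration and practice dates, not on whether anything was recorded, which matches the slide's warning that excluding record-free time risks bias.
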
  11. Defining a diagnosis, e.g. atrial fibrillation
  12. Defining a diagnosis
     • Cross-map against different datasets
     • Individual data sources may miss cases, so consider using linked datasets
         • Important for accurate measures of incidence
         • May be less important for associations between disease and risk factor, as long as the risk factor does not influence recording
  13. Non-fatal myocardial infarction – all sources miss cases
     [Venn diagram of three sources: MINAP disease registry, Primary care (CPRD), Hospital Episode Statistics. Region percentages as extracted: 8%, 6%, 18%, 7%, 20%, 10%]
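
The comparison behind the slide above can be sketched in Python as follows: pool the cases found in any linked source, then report the share of that pool each source misses. This is an illustrative sketch, not the analysis that produced the slide's figures; the function name and the patient identifiers in the usage example are hypothetical.

```python
from typing import Dict, Set


def missed_share(sources: Dict[str, Set[int]]) -> Dict[str, float]:
    """For each source, the fraction of all pooled cases it does not contain."""
    all_cases: Set[int] = set()
    for ids in sources.values():
        all_cases |= ids
    if not all_cases:
        return {name: 0.0 for name in sources}
    return {
        name: len(all_cases - ids) / len(all_cases)
        for name, ids in sources.items()
    }
```

For example, with made-up identifiers `{"CPRD": {1, 2, 3}, "HES": {2, 3, 4}, "MINAP": {3, 5}}`, the pooled set has five cases and each source alone misses at least two of them.
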
  14. Motivations for re-using data
     • Time taken to prepare data and define variables
     • Cost
     • Different definitions used by different groups
     • Lack of transparency and reproducibility
  15. Possible approaches
     • Ad hoc sharing of codelists and algorithms within a group
     • Publish codelists and algorithms with papers
     • The CALIBER approach
         • Repository of codelists and algorithms
         • Web portal for researcher access
  16. CALIBER ‘LEGO’ data access model
     [Diagram: raw coded records are mapped to intermediate variables, which are combined into a final research variable. Text as extracted:]

       1001, 2000-01-01, 23,1,NULL,I48
       1001, 1994-08-11,1234,1,3,7L1H300
       1001, 1993-01-01, 253,1,1,793Mz00
       1231, 2012-03-03, 23,1,123,K65
       1121, 2013-05-04, 7,1,3,5,14AN.00
       1121, 2011-05-21, 81,1,9, G573100
       1511, 1993-01-11, 91,1,6,9hF1.00
       1511, 199-03-11, 91,1,6, G573100
       9913, 2012-05-21, 81,1,9, G573100
       67222, 1994-11-01,1234,1,3,7L1H300
       67222, 1995-12-21,1234,1,3,7L1H300
       67222, 1991-03-03,1234,1,3,7L1H310
       682444, 1993-01-01, 253,1,1,793Mz00

       1001, 2000-01-01, af_gprd=1
       1231, 2012-03-03, af_hes=3
       1121, 2013-05-04, af_procs_gprd=1
       1511, 1993-01-11, heart_valve_gprd=2
       9913, 2012-05-21, af_hes=1
       67222, 1994-08-11, af_hes=1
       682444, 1993-01-01, heart_valve_hes=2

       af=1, af_diag_date=2001-12-01
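
The layered idea on the slide above (raw coded records, then source-specific variables, then one research variable) can be sketched in Python as below. The codelist mapping is hypothetical and chosen only for illustration, since the slide does not show which code feeds which variable; taking the earliest matching date as the diagnosis date is likewise an assumption, not the documented CALIBER rule.

```python
from datetime import date
from typing import Dict, List, Tuple

# Hypothetical codelist: raw code -> intermediate variable name.
CODELIST: Dict[str, str] = {
    "I48": "af_hes",
    "G573100": "af_gprd",
}

RawRecord = Tuple[int, date, str]      # (patient_id, event_date, code)
DerivedRecord = Tuple[int, date, str]  # (patient_id, event_date, variable)


def derive_variables(records: List[RawRecord]) -> List[DerivedRecord]:
    """Layer 1 -> layer 2: keep records whose code is in the codelist."""
    return [(pid, d, CODELIST[code]) for pid, d, code in records if code in CODELIST]


def first_af_date(derived: List[DerivedRecord]) -> Dict[int, date]:
    """Layer 2 -> layer 3: earliest date of any af_* variable per patient."""
    first: Dict[int, date] = {}
    for pid, d, variable in derived:
        if variable.startswith("af_") and (pid not in first or d < first[pid]):
            first[pid] = d
    return first
```

Keeping the codelist as data, separate from the functions that apply it, is what lets the same building blocks be reused across studies.
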
  17. CALIBER phenotypes (research variables)
     • Consistent definitions for multiple studies (over 300 variables curated)
     • Read, ICD-9, ICD-10, OPCS codelists
     • Web portal to view variable definitions, and registered users can view codelists (https://www.caliberresearch.org/portal)
     • Future: able to download scripts (e.g. Stata, R, SQL)
  18. CALIBER data portal
  19. Open data
  20. CALIBER data portal
     • Encourage researchers to define variables in a way that will be of use to others
     • Final validated versions of codelists and variables
     • Review by clinician and researcher
  21. CALIBER analysis software
     • R packages for managing codelists and data preparation (http://caliberanalysis.r-forge.r-project.org/)
     • Lookup tables and data dictionaries
     • Functions to simplify / automate common steps in data preparation
  22. CALIBER expects researchers to contribute to the resource
     [Flow diagram; labels as extracted: Investigators; Noninvestigators; Nonexperienced; Experienced; Research coordinator; Industry; Website form; Approvals; Data; Analysis; Publication; Impacts; Website content; Project feasibility and prioritization; Unified data access form; LEGO data access model; Contribute phenotyping algorithms, linkages; Contribute to knowledge base; Open access; Advancement of knowledge; Translation; Legislation, policy, guidelines; Economic benefit, industry]
  23. Difficulties encountered
     • Setting up the data portal takes time, needs dedicated staff
     • Researchers need to think outside their own project
     • Variables are updated / corrected; need to store different versions
  24. Summary
     • When analysing routine data, think about how the data were collected, and cross-check different sources of information
     • Data sharing and re-use can bring benefits but needs time and resources to manage
