Data re-use in the CALIBER programme


An overview of work under way to make research data easier to manage, analyse and use in the CALIBER programme. Presentation given by Anoop Shah of UCL at the Data Management in Practice workshop, held on 14th November 2013 at the London School of Hygiene and Tropical Medicine.




Usage Rights

© All Rights Reserved

    Presentation Transcript

    • Data re-use in the CALIBER programme
      Anoop Shah (a.shah@ucl.ac.uk), Clinical Epidemiology Group, University College London
      14th November 2013
    • Outline
      1. The CALIBER programme
      2. Why make research data re-usable?
      3. The CALIBER approach
      4. Summary
    • The CALIBER programme
      • UCL & LSHTM collaboration
      • CALIBER linked research database, shown with four data sources: general practice, MINAP registry, death registrations, Hospital Episode Statistics
      • Funded by NIHR and Wellcome Trust
    • CALIBER data
    • Defining continuous variables
      • Clinical, e.g. blood pressure; laboratory, e.g. white cell count
      • Recorded in CPRD (primary care)
      • Identified by ‘entity code’ and medcode (more granular)
      • Lab data now electronically transferred
      • Problems:
        • Missing units
        • Erroneous values
        • Inconsistent recording
        • Missing data
    • Medcodes associated with a test result
      Example: neutrophil counts (a type of white blood cell), which may be absolute or percentage

      Medcode   Percent   Term
      18        89.6      Neutrophil count
      17622     9.9       Percentage neutrophils
      23114     0.3       Granulocyte count
      23115     0.1       Percentage granulocytes
      13777     0.1       Neutrophil count NOS
    • Distribution of values for different units
    • Most common units
    • Analysis issues
      • Extraction algorithm
      • Remove biologically implausible extreme values
        • In a huge dataset with no restriction on possible values, there will be some errors
      • Standardise units
      • Decide how to analyse
        • Timing, e.g. relative to index date
        • Repeat measures
        • Transformation, splines, categories etc.
        • Missing data (e.g. multiple imputation)
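The unit standardisation and plausibility filtering steps listed above might look something like the following. This is a minimal sketch, not CALIBER's extraction algorithm: the `LabResult` type, the unit labels, the conversion factors and the plausibility limits are all hypothetical values chosen for illustration.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical conversion factors to 10^9 cells/L (not taken from CALIBER).
UNIT_FACTORS = {
    "10^9/L": 1.0,
    "10^6/L": 0.001,
}

# Hypothetical plausibility limits for an absolute neutrophil count, in 10^9/L.
PLAUSIBLE_MIN = 0.0
PLAUSIBLE_MAX = 100.0


@dataclass
class LabResult:
    patient_id: int
    value: Optional[float]
    unit: Optional[str]


def standardise(result: LabResult) -> Optional[float]:
    """Return the value in 10^9/L, or None if the record cannot be used."""
    if result.value is None or result.unit is None:
        return None  # missing value or missing units
    factor = UNIT_FACTORS.get(result.unit)
    if factor is None:
        return None  # unit not recognised, so the value cannot be converted
    converted = result.value * factor
    if not (PLAUSIBLE_MIN <= converted <= PLAUSIBLE_MAX):
        return None  # biologically implausible extreme value
    return converted
```

Records that return `None` here are dropped rather than guessed at; whether to impute them instead is one of the analysis decisions the slide lists.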
    • Observation time in GP practice
      • Observation time: when registered at GP practice
      • Practice ‘up to standard date’: date after which we expect that data are recorded
      • If nothing recorded while registered at GP:
        • Patient may be abroad
        • Patient may be genuinely healthy
        • Excluding observation time with no records risks bias
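One way to combine the registration period with the practice's up to standard date is sketched below. It assumes, as a convention not stated on the slide, that observation starts at the later of the two dates; the function and parameter names are hypothetical.

```python
from datetime import date
from typing import Optional, Tuple


def observation_window(
    registration_start: date,
    registration_end: date,
    up_to_standard: date,
) -> Optional[Tuple[date, date]]:
    """Return (start, end) of usable observation time, or None if there is none.

    Assumption: observation begins at the later of the registration date and
    the practice's up to standard date.
    """
    start = max(registration_start, up_to_standard)
    if start > registration_end:
        return None  # patient left before the practice data were up to standard
    return (start, registration_end)
```

The window depends only on dates, not on whether anything was recorded, so a registered patient with no records still contributes observation time.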
    • Defining a diagnosis, e.g. atrial fibrillation
    • Defining a diagnosis
      • Cross-map against different datasets
      • Individual data sources may miss cases, so consider using linked datasets
        • Important for accurate measures of incidence
        • May be less important for associations between disease and risk factor, as long as the risk factor does not influence recording
    • Non-fatal myocardial infarction: all sources miss cases
      [Venn diagram of three sources: MINAP disease registry, Primary care (CPRD), Hospital Episode Statistics; region labels as extracted: 8%, 6%, 18%, 7%, 20%, 10%]
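A cross-check of this kind can be sketched as follows, assuming each source has already been reduced to a set of patient IDs with the diagnosis. The function names and the example data are hypothetical. Note that the denominator is cases found in at least one source, so cases missed by every source are not counted.

```python
from typing import Dict, Set


def cases_by_source(sources: Dict[str, Set[int]]) -> Dict[int, Set[str]]:
    """Map each patient ID to the set of sources that recorded the diagnosis."""
    found: Dict[int, Set[str]] = {}
    for source_name, patient_ids in sources.items():
        for patient_id in patient_ids:
            found.setdefault(patient_id, set()).add(source_name)
    return found


def fraction_missed(sources: Dict[str, Set[int]], source_name: str) -> float:
    """Fraction of cases seen in any source that the named source did not record."""
    all_cases: Set[int] = set().union(*sources.values())
    if not all_cases:
        return 0.0
    return len(all_cases - sources[source_name]) / len(all_cases)


# Made-up patient IDs, for illustration only.
example = {
    "minap": {1, 2, 3},
    "cprd": {2, 3, 4},
    "hes": {3, 4, 5},
}
```

With the made-up data above, `cases_by_source(example)[3]` contains all three source names, while patient 1 appears only under `"minap"`.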
    • Motivations for re-using data
      • Time taken to prepare data and define variables
      • Cost
      • Different definitions used by different groups
      • Lack of transparency and reproducibility
    • Possible approaches
      • Ad hoc sharing of codelists and algorithms within a group
      • Publish codelists and algorithms with papers
      • The CALIBER approach
        • Repository of codelists and algorithms
        • Web portal for researcher access
    • CALIBER ‘LEGO’ data access model
      [Diagram; extracted rows follow in three groups]

      1001, 2000-01-01, 23,1,NULL,I48
      1001, 1994-08-11,1234,1,3,7L1H300
      1001, 1993-01-01, 253,1,1,793Mz00
      1231, 2012-03-03, 23,1,123,K65
      1121, 2013-05-04, 7,1,3,5,14AN.00
      1121, 2011-05-21, 81,1,9, G573100
      1511, 1993-01-11, 91,1,6,9hF1.00
      1511, 199-03-11, 91,1,6, G573100
      9913, 2012-05-21, 81,1,9, G573100
      67222, 1994-11-01,1234,1,3,7L1H300
      67222, 1995-12-21,1234,1,3,7L1H300
      67222, 1991-03-03,1234,1,3,7L1H310
      682444, 1993-01-01, 253,1,1,793Mz00

      1001, 2000-01-01, af_gprd=1
      1231, 2012-03-03, af_hes=3
      1121, 2013-05-04, af_procs_gprd=1
      1511, 1993-01-11, heart_valve_gprd=2
      9913, 2012-05-21, af_hes=1
      67222, 1994-08-11, af_hes=1
      682444, 1993-01-01, heart_valve_hes=2

      af=1, af_diag_date=2001-12-01
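The slide appears to show coded records being turned into source-specific variables, which are then combined into a final variable. A minimal sketch of that idea follows. The `Record` type, the codelist mapping and the rule that the diagnosis date is the earliest flagged record are all assumptions for illustration; they are not CALIBER's actual codelists or algorithm.

```python
from datetime import date
from typing import Dict, List, NamedTuple, Optional, Tuple


class Record(NamedTuple):
    patient_id: int
    event_date: date
    code: str


# Illustrative mapping from code to source-specific variable name.
# This is NOT a real CALIBER codelist.
CODELIST: Dict[str, str] = {
    "I48": "af_hes",
    "7L1H300": "af_procs_gprd",
}

Flag = Tuple[int, date, str]


def flag_records(records: List[Record]) -> List[Flag]:
    """Keep records whose code is in the codelist, labelled with the variable name."""
    return [
        (r.patient_id, r.event_date, CODELIST[r.code])
        for r in records
        if r.code in CODELIST
    ]


def derive_af(flags: List[Flag], patient_id: int) -> Tuple[int, Optional[date]]:
    """Combine source-specific flags into (af, af_diag_date) for one patient.

    Assumption: the diagnosis date is the earliest flagged record.
    """
    dates = [d for pid, d, name in flags if pid == patient_id and name.startswith("af_")]
    if not dates:
        return (0, None)
    return (1, min(dates))
```

Keeping the codelist as data rather than hard-coding it in the function is what lets the same building block be re-used across studies.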
    • CALIBER phenotypes (research variables)
      • Consistent definitions for multiple studies (over 300 variables curated)
      • Read, ICD-9, ICD-10, OPCS codelists
      • Web portal to view variable definitions; registered users can also view codelists (https://www.caliberresearch.org/portal)
      • Future: able to download scripts (e.g. Stata, R, SQL)
    • CALIBER data portal
    • Open data
    • CALIBER data portal
      • Encourage researchers to define variables in a way that will be of use to others
      • Final validated versions of codelists and variables
      • Review by clinician and researcher
    • CALIBER analysis software
      • R packages for managing codelists and data preparation (http://caliberanalysis.r-forge.r-project.org/)
      • Lookup tables and data dictionaries
      • Functions to simplify / automate common steps in data preparation
    • CALIBER expects researchers to contribute to the resource
      [Flow diagram; labels as extracted]
      • Investigators; Non-investigators; Non-experienced; Experienced; Research coordinator; Industry
      • Website form; Approvals; Data; Analysis; Publication; Impacts
      • Website content; Project feasibility and prioritization; Unified data access form; LEGO data access model; Contribute phenotyping algorithms, linkages; Contribute to knowledge base; Open access
      • Advancement of knowledge; Translation; Legislation, policy, guidelines; Economic benefit, industry
    • Difficulties encountered
      • Setting up the data portal takes time and needs dedicated staff
      • Researchers need to think outside their own project
      • Variables are updated / corrected; need to store different versions
    • Summary
      • When analysing routine data, think about how the data were collected, and cross-check different sources of information
      • Data sharing and re-use can bring benefits but needs time and resources to manage