Data Management for Predictive Tools
Paul Fearn, MBA
NLM Informatics Research Fellow
Biomedical and Health Informatics
University of Washington | Fred Hutchinson Cancer Research Center
Seattle, Washington
PROSTATE CANCER: PREDICTIVE MODELS FOR DECISION MAKING
April 7th – 9th, 2011 - MSKCC - New York, NY
Data Management Requirements
  • Need to assemble large datasets for predictive modeling
  • Pooling data across sites, systems and countries
  • Linking data across clinical, specimen and lab repositories
  • Quality assurance (for reproducibility of results)
  • Tradeoffs between accuracy and reproducibility of data points
  • Transparency of data processing
  • Complete and up-to-date datasets
  • Ease of accessing, sorting, filtering and exporting data
  • Statistical analysis in Stata, R, SPSS, SAS, Excel
  • SQL queries and reports
  • Sustainability
  • Secondary (N-ary) use of clinical and research data
  • Cumulative cost of data entry
  • Cumulative cost of staff training and turnover
  • Cumulative risks and opportunity costs of staff entrenchment
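As a rough illustration of the linking, filtering and export requirements listed on this slide, the following is a minimal sketch using Python's standard sqlite3 and csv modules. The table names, column names and values are hypothetical and invented for this example; they do not reflect the Caisis schema or any real dataset.

```python
import csv
import io
import sqlite3

# Hypothetical tables and columns, invented for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE patients  (patient_id INTEGER PRIMARY KEY, site TEXT, age INTEGER);
    CREATE TABLE specimens (specimen_id INTEGER PRIMARY KEY, patient_id INTEGER, type TEXT);
    CREATE TABLE labs      (lab_id INTEGER PRIMARY KEY, patient_id INTEGER, psa REAL);
""")
conn.executemany("INSERT INTO patients VALUES (?, ?, ?)",
                 [(1, "Site A", 61), (2, "Site B", 58), (3, "Site A", 70)])
conn.executemany("INSERT INTO specimens VALUES (?, ?, ?)",
                 [(10, 1, "biopsy"), (11, 3, "prostatectomy")])
conn.executemany("INSERT INTO labs VALUES (?, ?, ?)",
                 [(100, 1, 4.2), (101, 2, 6.8), (102, 3, 9.1)])

# Link the three repositories on the shared patient identifier,
# then filter on a lab value and sort the result.
query = """
    SELECT p.patient_id, p.site, p.age, s.type AS specimen_type, l.psa
    FROM patients p
    JOIN labs l ON l.patient_id = p.patient_id
    LEFT JOIN specimens s ON s.patient_id = p.patient_id
    WHERE l.psa >= ?
    ORDER BY l.psa DESC
"""
cursor = conn.execute(query, (4.0,))
header = [col[0] for col in cursor.description]
rows = cursor.fetchall()

# Export as CSV text, a format the statistical packages named on the slide can import.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(header)
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)
```

The LEFT JOIN keeps patients who have a lab value but no specimen on file, so a gap in one repository does not silently drop a case from the pooled dataset.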
The Growth Problem
  • Lu Z. PubMed and Beyond. Database 2011;2011:baq036 21245076[pmid]
The Growth Problem
  • http://www.ncbi.nlm.nih.gov/genbank/genbankstats.html
The Growth Problem
  • http://www.ncbi.nlm.nih.gov/books/NBK44423/
The Breaking Point
  • 1000 cases
The Growth Problem
  • Microsoft Access databases
      • 1999 ProstateDB 1.0
      • 2000 PRDB / Prostabase
  • ColdFusion & SQL Server web-based database
      • 2002 Valhalla 1.0 – 1.1: Prostate
      • 2003 Valhalla 1.2 (7,994 patients): Billing/EMR compliant populated clinic forms
  • ASP.NET & SQL Server web-based database
      • 2004 CAISIS 2.0 – 2.1 (26,470 patients): Integrated bladder, kidney, testis
      • 2005 CAISIS 3.0 – 3.1 (44,000 patients): Prostatectomy eForm, protocol manager, tumor maps
      • 2006 CAISIS 3.5 (55,000 patients): GU and Urology Prostate Follow-up eForms
      • 2007 CAISIS 4.0 (80,000 patients): Metadata, dynamic forms, new diseases and eForms
      • 2008 CAISIS 4.1 (98,000 patients): Email eForms, advanced find, specimen tracking
      • 2009 CAISIS 4.5 (120,000+ patients): Project tracking, patient education, virtual fields, reporting module
      • 2010 CAISIS 5.0x
The Curation Problem
  • Increasing volume of data
  • More data points for annotation
      • Clinical / patient
      • Genomic / biological
      • Public health / environment
  • Parallel curation issues in modern clinical and biological research databases (Krallinger 2008*)
  • Development of NLP system to support clinical research operations (Savova 2010**)
  *18834499[pmid], **20819853[pmid]
On the Other Hand…
  • Long tail of research efforts
  • Small heterogeneous labs and projects
  • Subsets of data
  • Specialized requirements
  • Innovative approaches
Spectrum of Approaches
  • One dataset per project (i.e. study based systems)
  • Registry databases (i.e. one treatment or disease)
  • Data warehouse or data repository
  • Common schema (data model)
  • “Amalgamation” of heterogeneous datasets
  • Common security and access
  • Common syntax (data format)
  • Defined links between records
  • Indexed for searching and retrieval
  • Federation / grid of semantically integrated data
  • Common vocabulary / terminology
  • Formal models (caBIG)
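One way to picture the "common schema" approach on this slide is a per-site field mapping applied before records are pooled. The sketch below is only an illustration under stated assumptions: the site field names, the common field names and the to_common_schema helper are all hypothetical, and none of them come from Caisis, caBIG or any other real system.

```python
# Hypothetical field names for two sites; neither reflects a real system's schema.
SITE_A_MAP = {"pt_id": "patient_id", "dx_date": "diagnosis_date", "psa_ng_ml": "psa"}
SITE_B_MAP = {"PatientID": "patient_id", "DateOfDiagnosis": "diagnosis_date", "PSA": "psa"}


def to_common_schema(record, field_map, site):
    """Rename a site's fields to the common names and tag the record's origin."""
    common = {field_map[name]: value
              for name, value in record.items()
              if name in field_map}  # fields without a mapping are dropped
    common["source_site"] = site  # keep provenance so processing stays transparent
    return common


site_a = [{"pt_id": 1, "dx_date": "2009-03-14", "psa_ng_ml": 5.1}]
site_b = [{"PatientID": 7, "DateOfDiagnosis": "2010-11-02", "PSA": 7.4, "Notes": "free text"}]

# Pool both sites into one list of records that share the same keys.
pooled = ([to_common_schema(r, SITE_A_MAP, "A") for r in site_a] +
          [to_common_schema(r, SITE_B_MAP, "B") for r in site_b])

for row in pooled:
    print(row)
```

Note that the unmapped "Notes" field from the second site is discarded, which shows the cost of a common schema: specialized local data is lost unless the shared model is extended to hold it.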
Loosely Linking Data
  • http://www.ncbi.nlm.nih.gov/sites/gquery
Tightly Integrating Data
  • Vocabulary / Terminology
      • NCI Thesaurus (NCIt)
      • NLM UMLS
  • Standard data models
      • caBIG / caDSR
      • HL7/FDA/NCI CDISC / BRIDG
  • Web services*
  • Common syntax / format
  *Stein. Creating a bioinformatics nation. Nature (2002) vol. 417 (6885) pp. 119-20 12000935[pmid]
The CAISIS System
Appendix: 394 people at 60 sites visited from Aug, 2008 to Jun, 2009
  • Driving
  • Flying
Rise of collaborative networks (e.g. CTSAs)
Costly curation and support of research databases
Widespread and large scale implementation of EMRs

NY Prostate Cancer Conference - P.A. Fearn - Session 1: Data management for predictive tools

  • 18.
    Development of data warehouses and repositories
  • 19.
  • 20.
    Difficulties accessing and retrieving research data
  • 21.
  • 22.
    Prevalence of Microsoft Access and Excel solutions
  • 23.
    Shifts to less expensive and more open source platforms
  • 24.
    REDCap, CAISIS, caTissue, Python and Bioconductor
    Appendix: Site Visit Findings
  • 25.
    Appendix: Clinical Systems
      • Surgical Reports
      • Radiation Therapy Reports
      • Pathology Reports
      • Laboratory Reports
      • Radiology Reports
      • Review of Systems and Patient Reported Outcomes
      • Electronic Medical / Health Records
      • Registration / demographics
      • Clinical trials eligibility and recruitment
      • Scheduling and operations
  • 26.
    Appendix: Engaging Patients in Data Management
      • Pre-first visit questionnaires
      • Web-based survey systems (e.g. REDCap)
      • Patient reported outcomes
      • Longitudinal follow-up process
      • Tablets, iPads and mobile applications

Editor's Notes

  • #2 I hope you will give a broad overview of the key features of the database that would allow the development of optimal predictive models, demonstrate how Caisis works to collect clinical and research data, and show how it has proved to be so valuable to the development of predictive models.
  • #3 Constraints on data entry increase reproducibility, but may decrease accuracy: conducive to quantitative research and hypothesis testing. Open fields / coding may increase accuracy, but decrease reproducibility: conducive to qualitative research and discovery.
  • #9 Krallinger et al. Linking genes to literature: text mining, information extraction, and retrieval applications for biology. Genome Biol (2008) vol. 9 Suppl 2 pp. S8.
    Savova et al. Mayo clinical Text Analysis and Knowledge Extraction System (cTAKES): architecture, component evaluation and applications. J Am Med Inform Assoc (2010) vol. 17 (5) pp. 507-13.
  • #11 Caisis is a data repository. One data model to rule them all
  • #13 How much time and effort does it take to pool databases and spreadsheets for predictive modeling?
    Stein. Creating a bioinformatics nation. Nature (2002) vol. 417 (6885) pp. 119-20. 12000935[pmid]
    If there is a need for large aggregated datasets from heterogeneous sources to support predictive modeling, we need to plan for this model. Building for one site and rolling out to other sites successfully is rare.
  • #16 Most people proclaimed that they did not want to “reinvent the wheel”, but proceeded to do so. Disconnect between beliefs and actions.
    Harris et al. Research electronic data capture (REDCap)--a metadata-driven methodology and workflow process for providing translational research informatics support. J Biomed Inform (2009) vol. 42 (2) pp. 377-81.
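
The trade-off described in note #3, between constrained entry and open fields, can be sketched in a few lines of Python. This is an illustration only: the allowed values, the function names and the sample entry are hypothetical and are not taken from Caisis or any other system mentioned in the slides.

```python
ALLOWED_STAGES = {"T1", "T2", "T3", "T4"}  # hypothetical controlled vocabulary


def enter_constrained(value):
    """Accept only values from the fixed list; anything else is rejected."""
    stage = value.strip().upper()
    if stage not in ALLOWED_STAGES:
        raise ValueError(f"{value!r} is not an allowed stage")
    return stage


def enter_free_text(value):
    """Store whatever was typed, nuance included."""
    return value.strip()


observation = "T2, possibly T3 on MRI"

# The open field keeps the nuance, but two abstractors may phrase it differently.
print(enter_free_text(observation))

# The constrained field refuses the nuance; the abstractor must pick one code,
# which is reproducible but may be less accurate.
try:
    enter_constrained(observation)
except ValueError as err:
    print("rejected:", err)
```

A constrained field is easy to count and compare across sites, which suits hypothesis testing; the open field preserves detail that may matter for discovery but needs curation before it can be analyzed.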