NY Prostate Cancer Conference - P.A. Fearn - Session 1: Data management for predictive tools

Published in: Technology, Business
  • I hope you will give a broad overview of the key features of the database that would allow the development of optimal predictive models, demonstrate how Caisis works to collect clinical and research data, and show how it has proved so valuable to the development of predictive models.
  • Constraints on data entry increase reproducibility, but may decrease accuracy. Conducive to quantitative research and hypothesis testing.
    Open fields / coding may increase accuracy, but decrease reproducibility. Conducive to qualitative research and discovery.
  • Krallinger et al. Linking genes to literature: text mining, information extraction, and retrieval applications for biology. Genome Biol (2008) vol. 9 Suppl 2 pp. S8
    Savova et al. Mayo clinical Text Analysis and Knowledge Extraction System (cTAKES): architecture, component evaluation and applications. J Am Med Inform Assoc (2010) vol. 17 (5) pp. 507-13
  • Caisis is a data repository. One data model to rule them all
  • How much time and effort does it take to pool databases and spreadsheets for predictive modeling?
    Stein. Creating a bioinformatics nation. Nature (2002) vol. 417 (6885) pp. 119-20 12000935[pmid]
    If there is a need for large aggregated datasets from heterogeneous sources to support predictive modeling, we need to plan for this model.
    Building for one site and rolling out to other sites successfully is rare.
  • Most people proclaimed that they did not want to “reinvent the wheel”, but proceeded to do so. Disconnect between beliefs and actions.
    Harris et al. Research electronic data capture (REDCap)--a metadata-driven methodology and workflow process for providing translational research informatics support. J Biomed Inform (2009) vol. 42 (2) pp. 377-81
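The tradeoff noted above between constrained and open data entry can be made concrete with a small sketch. Everything here is an assumption for illustration: the pick list is a partial, hypothetical set of clinical stage values, and the function names are not taken from Caisis or any other system mentioned in the talk.

```python
# Hypothetical, incomplete pick list used only to illustrate a constrained field.
ALLOWED_CLINICAL_STAGE = {"T1c", "T2a", "T2b", "T2c", "T3a", "T3b"}


def enter_constrained(value: str) -> str:
    """Accept only pick-list values: every record is comparable, nuance is lost."""
    cleaned = value.strip()
    if cleaned not in ALLOWED_CLINICAL_STAGE:
        raise ValueError(f"{cleaned!r} is not in the pick list")
    return cleaned


def enter_open(value: str) -> str:
    """Accept free text: nuance is kept, but two abstractors may word it differently."""
    return value.strip()


note = "T2a, possibly T2b on MRI"

# The open field stores the abstractor's wording as written.
open_value = enter_open(note)

# The constrained field forces a choice; the uncertainty cannot be recorded.
try:
    enter_constrained(note)
    constrained_ok = True
except ValueError:
    constrained_ok = False

constrained_value = enter_constrained(" T2a ")
print(open_value, constrained_ok, constrained_value)
```

The constrained value can go straight into a quantitative analysis, while the open value would first need manual review or text processing before it could be pooled.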

    1. Data Management for Predictive Tools
       Paul Fearn, MBA
       NLM Informatics Research Fellow
       Biomedical and Health Informatics
       University of Washington | Fred Hutchinson Cancer Research Center
       Seattle, Washington
       PROSTATE CANCER: PREDICTIVE MODELS FOR DECISION MAKING
       April 7th – 9th, 2011 - MSKCC - New York, NY
    2. Data Management Requirements
       • Need to assemble large datasets for predictive modeling
       • Pooling data across sites, systems and countries
       • Linking data across clinical, specimen and lab repositories
       • Quality assurance (for reproducibility of results)
       • Tradeoffs between accuracy and reproducibility of data points
       • Transparency of data processing
       • Complete and up-to-date datasets
       • Ease to access, sort, filter and export data
       • Statistical analysis in Stata, R, SPSS, SAS, Excel
       • SQL queries and reports
       • Sustainability
       • Secondary (N-ary) use of clinical and research data
       • Cumulative cost of data entry
       • Cumulative cost of staff training and turnover
       • Cumulative risks and opportunity costs of staff entrenchment
    3. The Growth Problem
       Lu Z. PubMed and Beyond. Database 2011;2011:baq036 21245076[pmid]
    4. The Growth Problem
       http://www.ncbi.nlm.nih.gov/genbank/genbankstats.html
    5. The Growth Problem
       http://www.ncbi.nlm.nih.gov/books/NBK44423/
    6. The Breaking Point
       • 1000 cases
    7. The Growth Problem
       • Microsoft Access databases
       • 1999 ProstateDB 1.0
       • 2000 PRDB / Prostabase
       • ColdFusion & SQL Server web-based database
       • 2002 Valhalla 1.0 – 1.1
       • Prostate
       • 2003 Valhalla 1.2 (7,994 patients)
       • Billing/EMR compliant populated clinic forms
       • ASP.NET & SQL Server web-based database
       • 2004 CAISIS 2.0 – 2.1 (26,470 patients)
       • Integrated bladder, kidney, testis
       • 2005 CAISIS 3.0 – 3.1 (44,000 patients)
       • Prostatectomy eForm, protocol manager, tumor maps
       • 2006 CAISIS 3.5 – (55,000 patients)
       • GU and Urology Prostate Follow-up eForms
       • 2007 CAISIS 4.0 – (80,000 patients)
       • Metadata, dynamic forms, new diseases and eForms
       • 2008 CAISIS 4.1 – (98,000 patients)
       • Email eForms, advanced find, specimen tracking
       • 2009 CAISIS 4.5 – (120,000+ patients)
       • Project tracking, patient education, virtual fields, reporting module
       • 2010 CAISIS 5.0x
    8. The Curation Problem
       • Increasing volume of data
       • More data points for annotation
       • Clinical / patient
       • Genomic / biological
       • Public health / environment
       • Parallel curation issues in modern clinical and biological research databases (Krallinger 2008*)
       • Development of NLP system to support clinical research operations (Savova 2010**)
       *18834499[pmid], **20819853[pmid]
    9. On the Other Hand…
       • Long tail of research efforts
       • Small heterogeneous labs and projects
       • Subsets of data
       • Specialized requirements
       • Innovative approaches
    10. Spectrum of Approaches
        • One dataset per project (i.e. study based systems)
        • Registry databases (i.e. one treatment or disease)
        • Data warehouse or data repository
        • Common schema (data model)
        • “Amalgamation” of heterogeneous datasets
        • Common security and access
        • Common syntax (data format)
        • Defined links between records
        • Indexed for searching and retrieval
        • Federation / grid of semantically integrated data
        • Common vocabulary / terminology
        • Formal models (caBIG)
    11. Loosely Linking Data
        http://www.ncbi.nlm.nih.gov/sites/gquery
    12. Tightly Integrating Data
        • Vocabulary / Terminology
        • NCI Thesaurus (NCIt)
        • NLM UMLS
        • Standard data models
        • caBIG / caDSR
        • HL7/FDA/NCI CDISC / BRIDG
        • Web services*
        • Common syntax / format
        *Stein. Creating a bioinformatics nation. Nature (2002) vol. 417 (6885) pp. 119-20 12000935[pmid]
    13. The CAISIS System
    14. Appendix: 394 people at 60 sites visited from Aug, 2008 to Jun, 2009
        • Driving
        • Flying
    15. Appendix: Site Visit Findings
        • Rise of collaborative networks (e.g. CTSAs)
        • Costly curation and support of research databases
        • Widespread and large scale implementation of EMRs
        • Development of data warehouses and repositories
        • Integrating biospecimen repository systems and data
        • Difficulties accessing and retrieving research data
        • Skewed distribution of data systems
        • Prevalence of Microsoft Access and Excel solutions
        • Shifts to less expensive and more open source platforms
        • REDCap, CAISIS, caTissue, Python and Bioconductor
    16. Appendix: Clinical Systems
        • Surgical Reports
        • Radiation Therapy Reports
        • Pathology Reports
        • Laboratory Reports
        • Radiology Reports
        • Review of Systems and Patient Reported Outcomes
        • Electronic Medical / Health Records
        • Registration / demographics
        • Clinical trials eligibility and recruitment
        • Scheduling and operations
    17. Appendix: Engaging Patients in Data Management
        • Pre-first visit questionnaires
        • Web-based survey systems (e.g. REDCap)
        • Patient reported outcomes
        • Longitudinal follow-up process
        • Tablets, iPads and mobile applications
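
Several requirements from the slides (pooling data across sites, a common schema, defined links between records, indexing for retrieval, SQL queries, and export for statistical packages) can be illustrated together in a minimal sketch. It is not how CAISIS or any system named in the talk is implemented. The site extracts, column names, field mappings and table layout are all hypothetical, and SQLite stands in for whatever database a site actually runs.

```python
import csv
import io
import sqlite3

# Hypothetical per-site extracts; column names and values are invented.
SITE_A = [{"mrn": "A-001", "psa_ng_ml": "4.2", "gleason": "7"},
          {"mrn": "A-002", "psa_ng_ml": "11.0", "gleason": "8"}]
SITE_B = [{"patient_id": "17", "PSA": "6.5", "gleason_sum": "6"}]

# Hypothetical mapping from each site's local column names to the common schema.
FIELD_MAPS = {
    "site_a": {"mrn": "local_id", "psa_ng_ml": "psa", "gleason": "gleason"},
    "site_b": {"patient_id": "local_id", "PSA": "psa", "gleason_sum": "gleason"},
}


def to_common(site, row):
    """Rename a site's columns and coerce values to the common schema's types."""
    mapped = {FIELD_MAPS[site][key]: value for key, value in row.items()}
    return (site, mapped["local_id"], float(mapped["psa"]), int(mapped["gleason"]))


def build_repository(conn):
    """Create the common schema: patients, linked specimens, and an index."""
    conn.executescript("""
        CREATE TABLE patient (
            patient_key INTEGER PRIMARY KEY,
            site        TEXT NOT NULL,
            local_id    TEXT NOT NULL,
            psa         REAL,
            gleason     INTEGER,
            UNIQUE (site, local_id)
        );
        CREATE TABLE specimen (
            specimen_key  INTEGER PRIMARY KEY,
            patient_key   INTEGER NOT NULL REFERENCES patient(patient_key),
            specimen_type TEXT NOT NULL
        );
        CREATE INDEX idx_specimen_patient ON specimen(patient_key);
    """)


def pool(conn, site, rows):
    """Load one site's extract into the shared patient table."""
    conn.executemany(
        "INSERT INTO patient (site, local_id, psa, gleason) VALUES (?, ?, ?, ?)",
        [to_common(site, row) for row in rows],
    )


def export_csv(conn):
    """Run one SQL query over the pooled data and return it as CSV text."""
    cur = conn.execute("""
        SELECT p.site, p.local_id, p.psa, p.gleason,
               COUNT(s.specimen_key) AS n_specimens
        FROM patient p
        LEFT JOIN specimen s ON s.patient_key = p.patient_key
        GROUP BY p.patient_key
        ORDER BY p.site, p.local_id
    """)
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow([col[0] for col in cur.description])
    writer.writerows(cur)
    return buf.getvalue()


conn = sqlite3.connect(":memory:")
build_repository(conn)
pool(conn, "site_a", SITE_A)
pool(conn, "site_b", SITE_B)
# Link one specimen record to a pooled patient record.
conn.execute(
    "INSERT INTO specimen (patient_key, specimen_type) "
    "SELECT patient_key, 'serum' FROM patient "
    "WHERE site = 'site_a' AND local_id = 'A-001'"
)
pooled_csv = export_csv(conn)
print(pooled_csv)
```

The resulting CSV text can be read by Stata, R, SPSS, SAS or Excel. The per-site field map is where most of the pooling effort lands in this sketch: each new site needs its own mapping before its rows fit the common schema.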
