I hope you will give a broad overview of the key features of the database that would allow the development of optimal predictive models, demonstrate how Caisis works to collect clinical and research data, and show why it has proved so valuable to the development of predictive models.
Constraints on data entry increase reproducibility, but may decrease accuracy. Conducive to quantitative research and hypothesis testing. Open fields / coding may increase accuracy, but decrease reproducibility. Conducive to qualitative research and discovery.
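The tradeoff in this note can be illustrated with a minimal sketch. It assumes a hypothetical clinical stage field; the vocabulary `ALLOWED_STAGES` and both function names are invented for illustration and are not taken from Caisis or any other system. The constrained field rejects anything outside the vocabulary, so values are comparable across records, while the open field keeps the abstractor's nuance at the cost of comparability.

```python
# Hypothetical controlled vocabulary for a clinical stage field (assumption).
ALLOWED_STAGES = {"T1c", "T2a", "T2b", "T2c", "T3a", "T3b"}


def validate_constrained(value: str) -> str:
    """Accept only values from the controlled vocabulary.

    Reproducible: every stored value is one of a known set.
    Possibly less accurate: an uncertain case must be forced into one code.
    """
    cleaned = value.strip()
    if cleaned not in ALLOWED_STAGES:
        raise ValueError(f"{value!r} is not an allowed stage")
    return cleaned


def accept_open(value: str) -> str:
    """Accept any free text.

    Possibly more accurate: nuance is preserved.
    Less reproducible: two abstractors may phrase the same case differently.
    """
    return value.strip()
```

For example, `accept_open("T2a, possibly T2b on MRI")` stores the full note, whereas `validate_constrained` raises `ValueError` on the same input and forces the abstractor to choose a single code.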
Krallinger et al. Linking genes to literature: text mining, information extraction, and retrieval applications for biology. Genome Biol (2008) vol. 9 Suppl 2 pp. S8
Savova et al. Mayo clinical Text Analysis and Knowledge Extraction System (cTAKES): architecture, component evaluation and applications. J Am Med Inform Assoc (2010) vol. 17 (5) pp. 507-13
Caisis is a data repository. One data model to rule them all.
How much time and effort does it take to pool databases and spreadsheets for predictive modeling?
Stein. Creating a bioinformatics nation. Nature (2002) vol. 417 (6885) pp. 119-20 12000935[pmid]
If there is a need for large aggregated datasets from heterogeneous sources to support predictive modeling, we need to plan for this model. Building for one site and then rolling out successfully to other sites is rare.
Most people proclaimed that they did not want to “reinvent the wheel”, but proceeded to do so. Disconnect between beliefs and actions.
Harris et al. Research electronic data capture (REDCap)--a metadata-driven methodology and workflow process for providing translational research informatics support. J Biomed Inform (2009) vol. 42 (2) pp. 377-81
NY Prostate Cancer Conference - P.A. Fearn - Session 1: Data management for predictive tools
Data Management for Predictive Tools<br />Paul Fearn, MBA<br />NLM Informatics Research Fellow<br />Biomedical and Health Informatics<br />University of Washington | Fred Hutchinson Cancer Research Center<br />Seattle, Washington<br />PROSTATE CANCER: PREDICTIVE MODELS FOR DECISION MAKING<br />April 7th – 9th, 2011 - MSKCC - New York, NY<br />
Data Management Requirements<br />Need to assemble large datasets for predictive modeling<br />Pooling data across sites, systems and countries<br />Linking data across clinical, specimen and lab repositories<br />Quality assurance (for reproducibility of results)<br />Tradeoffs between accuracy and reproducibility of data points<br />Transparency of data processing<br />Complete and up-to-date datasets<br />Easy to access, sort, filter and export data<br />Statistical analysis in Stata, R, SPSS, SAS, Excel<br />SQL queries and reports<br />Sustainability<br />Secondary (N-ary) use of clinical and research data<br />Cumulative cost of data entry<br />Cumulative cost of staff training and turnover<br />Cumulative risks and opportunity costs of staff entrenchment<br />
The Growth Problem<br />Lu Z. PubMed and Beyond. Database 2011;2011:baq036 21245076[pmid]<br />
The Growth Problem<br />http://www.ncbi.nlm.nih.gov/genbank/genbankstats.html<br />
The Growth Problem<br />http://www.ncbi.nlm.nih.gov/books/NBK44423/<br />
The Curation Problem<br />Increasing volume of data<br />More data points for annotation<br />Clinical / patient<br />Genomic / biological<br />Public health / environment<br />Parallel curation issues in modern clinical and biological research databases (Krallinger 2008*)<br />Development of an NLP system to support clinical research operations (Savova 2010**)<br />*18834499[pmid], **20819853[pmid]<br />
On the Other Hand…<br />Long tail of research efforts<br />Small heterogeneous labs and projects<br />Subsets of data<br />Specialized requirements<br />Innovative approaches<br />
Spectrum of Approaches<br />One dataset per project (i.e. study-based systems)<br />Registry databases (i.e. one treatment or disease)<br />Data warehouse or data repository<br />Common schema (data model)<br />“Amalgamation” of heterogeneous datasets<br />Common security and access<br />Common syntax (data format)<br />Defined links between records<br />Indexed for searching and retrieval<br />Federation / grid of semantically integrated data<br />Common vocabulary / terminology<br />Formal models (caBIG)<br />
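To make the "common schema" option concrete, here is a minimal sketch of pooling records from two sites into one shared data model. Everything in it is an assumption made for illustration: the site names, the local column names, and the three common fields are hypothetical and do not describe the Caisis schema or any real site's data.

```python
# Hypothetical shared schema (assumption): every pooled row has these fields.
COMMON_FIELDS = ("patient_id", "psa_ng_ml", "site")

# Hypothetical per-site mappings from local column names to common fields.
SITE_MAPPINGS = {
    "site_a": {"PatientID": "patient_id", "PSA": "psa_ng_ml"},
    "site_b": {"mrn": "patient_id", "psa_value": "psa_ng_ml"},
}


def to_common(record: dict, site: str) -> dict:
    """Rename one site's columns to the common schema and tag the source site."""
    mapping = SITE_MAPPINGS[site]
    row = {common: record[local] for local, common in mapping.items()}
    # Normalize the type so values from different sources are comparable.
    row["psa_ng_ml"] = float(row["psa_ng_ml"])
    row["site"] = site
    return row


def pool(datasets: dict) -> list:
    """Flatten {site: [records]} into one list of common-schema rows."""
    return [
        to_common(record, site)
        for site, records in datasets.items()
        for record in records
    ]


pooled = pool({
    "site_a": [{"PatientID": "A1", "PSA": "4.2"}],
    "site_b": [{"mrn": "B7", "psa_value": 6.0}],
})
```

Each new site in this sketch costs one mapping entry rather than a new pooling script, which is the practical argument for agreeing on a schema before aggregation is needed.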