This document discusses data management requirements for predictive modeling with large datasets drawn from multiple clinical, specimen, and laboratory repositories. Such modeling requires assembling complete, up-to-date datasets while maintaining quality assurance and transparency. Over time, data storage systems run into three recurring problems: exponential data growth, the difficulty of manual data curation, and the challenge of integrating heterogeneous databases across research groups. The document examines a spectrum of data management approaches and highlights collaborative networks and open source platforms as ways to address these problems.