
Research data management for medical data with pyradigm



A Python data structure for biomedical data that manages multiple tables linked via patient information or other hashable IDs. By supporting continuous validation, this data structure improves both ease of use and the integrity of the dataset.

Published in: Science


  1. Research data management for medical data with pyradigm. Pradeep Reddy Raamana, crossinvalidation.com, github.com/raamana
  2-6. Research data management lifecycle: Plan → Create → Process → Analyze → Preserve → Share → Reuse.
     • Goal: reduce data entropy in a few parts of this lifecycle!
     • Data entropy: the normal degradation in the information content associated with data and metadata over time (© DataONE)
     • I focus not on files, but on derived features, i.e., tables for machine learning: Research Feature Management (RFM)?
  7-8. Dataset lifecycle in ML: input or raw data on disk (folder hierarchy, metadata, etc.) → intermediate outputs 1 and 2 → final outputs 1, 2, and 3. Along the way: various data types, widely varying formats, and a diversity of needs and users.
  9-17. Challenges in RDM for medical data:
     • Too many tables to manage, even for a small project!
     • Mixed data types: primary and secondary features; several attributes of mixed types
     • Multiple tables need to be linked and integrated with unique IDs
     • Provenance needs to be captured: processing steps and their metadata
     • Ad hoc scripts to read and manage CSVs do not work at all
     • Frequent change of hands: students, RAs, staff, etc., often with limited training
     • A recipe for disaster: features can get mixed up easily. Having to dig through multiple scripts and Word documents to figure out where everything is, what it means, and whether it is all properly linked is a nightmare!
     • This library was built to reduce my own pain!
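The linking hazard above can be made concrete with a small sketch (plain Python; the subject IDs and column names are hypothetical): an ad hoc merge of two tables on a subject ID silently drops every subject missing from either table, which is exactly the class of error that goes unnoticed until much later.

```python
# Two tables from different sources, keyed by subject ID (hypothetical data).
demographics = {"sub01": {"age": 64}, "sub02": {"age": 71}, "sub03": {"age": 58}}
features     = {"sub01": [0.2, 1.4], "sub03": [0.9, 0.1], "sub04": [1.1, 0.7]}

# A naive ad hoc merge keeps only the intersection of IDs...
merged = {sid: {**demographics[sid], "features": features[sid]}
          for sid in demographics if sid in features}

# ...and silently drops sub02 and sub04: no error, no warning.
print(sorted(merged))  # only sub01 and sub03 survive

# A managed data structure would instead surface the mismatch explicitly:
missing = set(demographics) ^ set(features)
print("unmatched IDs:", sorted(missing))
```

Nothing in the merge itself signals that two subjects vanished; only an explicit symmetric-difference check reveals it.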
  18-21. Need for accessibility and domain adaptation:
     • Existing libraries for data tables (e.g. pandas) add a big barrier: cognitive burden, too contrived, terminology misused or mistakenly used
     • Domain adaptation: there are always a few domain-specific minor issues to handle, e.g. preprocessing, naming, validation
     • Diversity of data types:
       • Hashable IDs: integers, alphanumeric, etc.
       • Features: simple vectors of numbers, or more structured data like graphs and trees
       • Targets: integers, categorical (healthy vs. disease), multi-output (disease 1 AND disease 2, etc.)
     • Trying to reduce "data entropy" in key parts of RDM. Reminder: data lasts MUCH LONGER than the project itself!
  22-27. Core structure of pyradigm: X (N samplets × p features), y (k targets), and A (m attributes), linking diverse types of data for the same ID / hash / subject.
     • Common predictive modeling, machine learning, and biomarker workflows need to deal with multiple types of data (features, targets, confounds) and multiple data types (numerical, categorical; scalar, vector), potentially differing in length.
     • This is challenging and error-prone, especially when multiple experiments and comparisons are performed, e.g. across different sub-groups, different targets, different covariate regressions, etc.
     • Targets y: continuous (score, severity, age, etc.) → regression; categorical (healthy vs. disease) → classification.
     • Attributes A: covariates or confounds such as age, gender, site. Usually scalars, but sometimes vectors too!
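The linkage between X, y, and A can be sketched in a few lines of plain Python (this is an illustration of the idea, not pyradigm's actual API; the IDs and attribute names are made up): one dict, keyed by a hashable samplet ID, carries the feature row, the target, and the attributes together, so they can never fall out of alignment.

```python
# One record per samplet, keyed by a hashable ID, ties together a row of X,
# an entry of y, and a row of A (illustrative sketch, not the library's API).
dataset = {}

def add_samplet(samplet_id, features, target, attributes=None):
    dataset[samplet_id] = {
        "features": list(features),      # row of X (p features)
        "target": target,                # entry of y (class label or score)
        "attributes": attributes or {},  # row of A (age, site, ...)
    }

add_samplet("sub01", [0.3, 1.2, 0.8], "healthy", {"age": 64, "site": "A"})
add_samplet("sub02", [0.5, 0.9, 1.1], "disease", {"age": 71, "site": "B"})

# Features, target and attributes travel together per ID: no separate
# tables to keep in sync.
record = dataset["sub02"]
print(record["target"], record["attributes"]["age"])
```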
  28-35. Implementation details:
     • BaseDataset: an abstract base class defining the coarse structure and properties: a collection of hashable IDs (dict keys), each ID expecting data and a target, plus an optional set of attributes, with different types of validation
     • Methods: add, summarize, retrieve, delete; arithmetic (combine, transform, etc.); sampling by target values, by attribute properties, or randomly; exporting to different formats
     • A few derived classes impose specific conditions on target properties, such as whether the target is categorical or numerical:
       • ClassificationDataset: target is often a string (healthy, disease) or an integer (-1, 1, 2)
       • RegressionDataset: target is a continuous float value, e.g. disease severity score, age
     • Many other possibilities, depending on domain and use case
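The class design described above can be sketched as follows. The class names follow the slides, but the bodies are simplified stand-ins written for illustration, not pyradigm's actual implementation: the abstract base owns the ID-keyed storage and the add path, and each derived class supplies its own target validation.

```python
# Sketch of the described hierarchy: BaseDataset defines structure and the
# validation hook; derived classes constrain what a valid target is.
from abc import ABC, abstractmethod
from numbers import Number

class BaseDataset(ABC):
    def __init__(self):
        self._data = {}  # hashable ID -> (features, target)

    @abstractmethod
    def _check_target(self, target):
        """Derived classes decide what counts as a valid target."""

    def add_samplet(self, samplet_id, features, target):
        self._check_target(target)           # validate before storing
        self._data[samplet_id] = (list(features), target)

    def __len__(self):
        return len(self._data)

class ClassificationDataset(BaseDataset):
    def _check_target(self, target):
        # categorical: a string ('healthy') or an integer code (-1, 1, 2)
        if not isinstance(target, (str, int)):
            raise TypeError("classification target must be str or int")

class RegressionDataset(BaseDataset):
    def _check_target(self, target):
        # continuous: severity score, age, etc.
        if not isinstance(target, Number):
            raise TypeError("regression target must be numeric")

clf = ClassificationDataset()
clf.add_samplet("sub01", [0.3, 1.2], "healthy")

reg = RegressionDataset()
reg.add_samplet("sub01", [0.3, 1.2], 27.5)
```

Placing the check in an abstract hook means every add path is validated, and new domain-specific dataset classes only need to override `_check_target`.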
  36. Implementation details contd.: classes and data are managed via a dict of dicts, which is convenient for developers. The current setup is fine for our domain (1000s of rows, a few 1000s of columns); larger and more complex domains would need fine-tuning. Serialization: pickle files for now; HDF etc. are possible, and help from contributors is requested.
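One consequence of the dict-of-dicts layout is that serialization needs nothing beyond the standard library, which presumably motivates pickle as the current on-disk format. A minimal round-trip sketch (the keys and nesting here are illustrative, not the library's exact layout):

```python
# Pickle round-trip of a dict-of-dicts dataset: the nested structure
# serializes and restores without any custom code.
import os
import pickle
import tempfile

dataset = {
    "sub01": {"features": [0.3, 1.2], "target": "healthy"},
    "sub02": {"features": [0.5, 0.9], "target": "disease"},
}

path = os.path.join(tempfile.mkdtemp(), "dataset.pkl")
with open(path, "wb") as f:
    pickle.dump(dataset, f)

with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored == dataset)  # round-trips without loss
```

Pickle is convenient but Python-only and not safe to load from untrusted sources; a format like HDF5 would add language-neutral, partial-read access, which is likely why the slides mention it as a possible extension.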
  37-38. Usage: once the dataset is built by someone, no one else has to worry about it. They can slice and dice it in any number of ways they desire. You don't need the script that built this data structure, as it's more or less self-explanatory.
  39-40. Dataset iteration & arithmetic: a lot more intuitive! You work with a higher-level organization, not rows, columns, and comments. Metadata gets propagated automatically → life is much more productive. Achieving this with CSVs is a huge pain!
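The kind of "dataset arithmetic" meant above can be sketched in plain Python (illustrative only; the feature names and IDs are made up, and this is not pyradigm's real API): two feature sets for the same subjects are combined by ID, so alignment and metadata are handled once rather than per CSV.

```python
# Combine two per-subject feature sets by shared ID, then iterate samplets.
thickness = {"sub01": [2.1, 2.4], "sub02": [1.9, 2.2]}
volume    = {"sub02": [880.0],    "sub01": [910.0]}   # note: different order
targets   = {"sub01": "healthy",  "sub02": "disease"}

combined = {}
for sid in sorted(set(thickness) & set(volume)):  # align on shared IDs
    combined[sid] = {"features": thickness[sid] + volume[sid],
                     "target": targets[sid]}

# Iteration stays at the level of samplets, not rows and columns:
for sid, rec in combined.items():
    print(sid, len(rec["features"]), rec["target"])  # p1 + p2 features each
```

Because the join key is the samplet ID, row order in the source tables is irrelevant; with raw CSVs, the same operation requires manual index bookkeeping.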
  41-45. Advantages:
     • Intuitive for niche domains: easy to use and teach
     • Continuous validation, as part of .add_samplet(), .add_attr(), etc.: infinite, invalid, or unexpected values; duplicate rows; columns of all 0s. Also allows arbitrary user-defined or domain-specific checks!
     • Errors are caught early, instead of much later, e.g. while using some other toolbox and then having to painfully trace the error back
     • Improves integrity
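Continuous validation at insertion time can be sketched as follows. This is a simplified stand-in for the checks the slides attribute to `.add_samplet()`, not the library's code; the `no_negative_ages` check is a hypothetical example of a user-defined rule.

```python
# Validation runs the moment data enters, so bad values never propagate.
import math

def make_validator(extra_checks=()):
    data = {}
    def add(samplet_id, features):
        if samplet_id in data:                        # duplicate-ID check
            raise ValueError(f"duplicate ID: {samplet_id}")
        if any(not math.isfinite(v) for v in features):
            raise ValueError("non-finite feature value")
        for check in extra_checks:                    # domain-specific checks
            check(samplet_id, features)
        data[samplet_id] = list(features)
    return add, data

def no_negative_ages(sid, feats):   # hypothetical user-defined check
    if feats[0] < 0:
        raise ValueError(f"{sid}: age cannot be negative")

add, data = make_validator(extra_checks=(no_negative_ages,))
add("sub01", [64.0, 0.3])           # accepted

try:
    add("sub02", [float("nan"), 0.5])
except ValueError as err:
    print("rejected at the source:", err)
```

The payoff is the one named on the slide: the error surfaces with the offending ID at insertion, instead of as a cryptic failure in a downstream toolbox.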
  46-52. Advanced use cases: MultiDataset. N samplets share k targets (y) and m attributes (A), linked to multiple feature sets of differing widths: X1 (p1 features), X2 (p2 features), X3 (p3 features), X4 (p4 features).
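The MultiDataset idea shown above can be sketched as a structure where several feature sets of different widths hang off one shared set of IDs, targets, and attributes (an illustrative layout, not pyradigm's internals; the modality names are made up):

```python
# Several feature sets (X1, X2, ...) share one set of IDs, targets (y)
# and attributes (A); consistency across modalities is checked explicitly.
multi = {
    "targets":    {"sub01": "healthy", "sub02": "disease"},   # y
    "attributes": {"sub01": {"age": 64}, "sub02": {"age": 71}},  # A
    "modalities": {
        "X1_thickness": {"sub01": [2.1, 2.4], "sub02": [1.9, 2.2]},  # p1 = 2
        "X2_volume":    {"sub01": [910.0],    "sub02": [880.0]},     # p2 = 1
    },
}

# Every modality must cover exactly the same samplets as the targets:
ids = set(multi["targets"])
for name, X in multi["modalities"].items():
    assert set(X) == ids, f"{name} is out of sync with the targets"
print("all", len(multi["modalities"]), "modalities aligned over", len(ids), "samplets")
```

Keeping y and A in one place while the X's vary is what makes cross-modality comparisons (same subjects, same targets, different features) safe to set up.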
  53. Thank you
     • Check it out here: github.com/raamana
     • Follow me @ twitter.com/raamana_
     • Contributors most welcome.
