Research data management for
medical data with pyradigm
Pradeep Reddy Raamana
crossinvalidation.com github.com/raamana
Research data management

Lifecycle: Plan → Create → Process → Analyze → Preserve → Share → Reuse

Goal: reduce data entropy in a few parts of this lifecycle!

Data entropy: the normal degradation in the information content associated with data and metadata over time (© DataONE)
I focus not on files, but on derived features: tables for machine learning. Research Feature Management (RFM)?
Dataset Lifecycle in ML

• Input or raw data: on disk, in folder hierarchies, with metadata etc.
• Intermediate outputs (1, 2): various types, widely varying formats
• Final outputs (1, 2, 3): diversity of needs, diversity of users
Challenges in RDM for Medical Data

Too many tables to manage, even for a small project!

• Mixed data types
  • Primary & secondary features
  • Several attributes of mixed types
• Multiple tables need to be linked and integrated with unique IDs
• Provenance needs to be captured
  • Processing steps and their metadata
• Ad hoc scripts to read and manage CSVs do not work at all
• Frequent change of hands
  • Students, RAs, staff etc., often with limited training
• Recipe for disaster
  • Features can get mixed up easily
  • Having to dig through multiple scripts and Word documents to figure out where everything is, what it means, and whether it is all properly linked is a nightmare!
• pyradigm is a library built to reduce my own pain!
Need for accessibility and domain adaptation

• Existing libraries for data tables, e.g. pandas, add a big barrier
  • Cognitive burden
  • Too contrived
  • Terminology misuse and mistaken use
• Domain adaptation
  • There are always a few minor domain-specific issues we need to handle
  • Preprocessing, naming, validation etc.
• Diversity of data types
  • Hashable IDs: integers, alphanumeric etc.
  • Features: simple vectors of numbers, or more structured data like graphs and trees
  • Targets: integers, categorical (healthy vs. disease), multi-output (disease 1 AND disease 2, etc.)
• Trying to reduce "data entropy" in key parts of RDM
  • Reminder: data lasts MUCH LONGER than the project itself!
Core structure of pyradigm

• Common predictive modeling, machine learning and biomarker workflows need to deal with
  • multiple types of data: features, targets, confounds
  • multiple data types: numerical, categorical; scalar, vector; potentially differing in length
• This is challenging and error-prone!
  • esp. when multiple experiments and comparisons are performed, e.g. across different sub-groups, different targets, different covariate regressions etc.
• pyradigm links these diverse types of data for the same ID / hash / subject:
  • X (N samplets × p features), y (k targets), A (m attributes)
  • Targets y can be continuous (score, severity, age etc. → regression) or categorical (healthy vs. disease → classification)
  • Attributes A are covariates or confounds such as age, gender, site; usually scalars, but sometimes vectors too!
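The core idea above can be sketched in plain Python. This is an illustrative sketch of the structure, not the actual pyradigm API: one samplet ID links its features (X), target (y) and attributes (A) together, so they can never get misaligned across tables.

```python
# Minimal sketch: a dataset is a mapping from one hashable samplet ID
# to everything known about that samplet.
dataset = {}

def add_samplet(samplet_id, features, target, attributes=None):
    """Store all data for one subject under a single hashable ID."""
    if samplet_id in dataset:
        raise KeyError(f"Duplicate samplet ID: {samplet_id}")
    dataset[samplet_id] = {
        "features": list(features),            # p-dimensional feature vector
        "target": target,                      # categorical or continuous
        "attributes": dict(attributes or {}),  # covariates / confounds
    }

add_samplet("sub01", [0.7, 1.2, 3.4], "healthy", {"age": 61, "site": "A"})
add_samplet("sub02", [0.9, 0.8, 2.1], "disease", {"age": 58, "site": "B"})

# Retrieval by ID returns everything at once -- no joins needed.
print(dataset["sub01"]["target"])  # healthy
```

Because each ID owns its features, target and attributes as one unit, slicing by subject or by sub-group cannot scramble rows across tables.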
Implementation details

• BaseDataset
  • Abstract base class, defining the coarse structure and properties
  • A collection of hashable IDs (dict keys), each ID expecting data and a target, plus an optional set of attributes
  • Validation of different types
• Methods
  • Add, summarize, retrieve, delete
  • Arithmetic: combine, transform etc.
  • Sampling: by target values, by attribute properties, or randomly
  • Exporting to different formats
• A few derived classes, imposing specific conditions on target properties, such as whether the target is categorical or numerical
  • ClassificationDataset: target is often a string (healthy, disease) or an integer (-1, 1, 2)
  • RegressionDataset: target is a continuous float value, such as a disease severity score or age
  • Many other possibilities, depending on domain and use-case
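The class layout above can be sketched as follows. The names mirror the slide, but this is a hypothetical skeleton, not pyradigm's exact implementation: the base class owns the ID-keyed storage, and each derived class enforces its own target type.

```python
# Sketch: abstract base owns the structure; subclasses validate targets.
from abc import ABC, abstractmethod
from numbers import Number

class BaseDataset(ABC):
    def __init__(self):
        self._data = {}      # samplet ID -> feature vector
        self._targets = {}   # samplet ID -> target

    @abstractmethod
    def _check_target(self, target):
        """Derived classes impose conditions on target properties."""

    def add_samplet(self, samplet_id, features, target):
        self._check_target(target)           # validate before storing
        self._data[samplet_id] = features
        self._targets[samplet_id] = target

class ClassificationDataset(BaseDataset):
    def _check_target(self, target):
        # categorical: a string label or an integer class code
        if not isinstance(target, (str, int)):
            raise TypeError("classification target must be str or int")

class RegressionDataset(BaseDataset):
    def _check_target(self, target):
        # continuous: any real number (severity score, age, ...)
        if not isinstance(target, Number):
            raise TypeError("regression target must be numeric")

clf = ClassificationDataset()
clf.add_samplet("sub01", [1.0, 2.0], "healthy")   # accepted
reg = RegressionDataset()
reg.add_samplet("sub01", [1.0, 2.0], 27.5)        # accepted
```

Putting the target check in one abstract hook keeps all add/retrieve/sampling machinery shared, while each domain-specific dataset type stays a few lines long.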
Implementation details contd.

• Classes and data are managed via a dict of dicts
  • Convenient for developers
  • The current setup is fine for our domain: thousands of rows, a few thousand columns
  • Larger and more complex domains would need fine-tuning
• Serialization
  • Pickle files for now; HDF5 etc. are possible
  • Requesting help from contributors
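The pickle-based serialization mentioned above amounts to a one-call round-trip of the dict-of-dicts state; a minimal sketch (HDF5 would instead need a library such as h5py):

```python
# Serialization sketch: the whole dict-of-dicts state round-trips
# losslessly through a single pickle file.
import os
import pickle
import tempfile

dataset = {
    "sub01": {"features": [0.7, 1.2], "target": "healthy"},
    "sub02": {"features": [0.9, 0.8], "target": "disease"},
}

path = os.path.join(tempfile.mkdtemp(), "dataset.pkl")
with open(path, "wb") as f:
    pickle.dump(dataset, f)   # save everything in one shot

with open(path, "rb") as f:
    restored = pickle.load(f)

assert restored == dataset    # identical after the round-trip
```

The usual pickle caveats apply: it is Python-only and should not be loaded from untrusted sources, which is one motivation for offering HDF5 as an alternative.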
Usage

• Once the dataset is built by someone, no one else has to worry about it.
• They can slice and dice it in any number of ways they desire.
• You don't need the script that built this data structure, as it is more or less self-explanatory.
Dataset iteration & arithmetic

• A lot more intuitive!
• Higher-level organization, not rows, columns and comments!
• Metadata gets propagated automatically, so life is much more productive!
• Achieving this with CSVs is a huge pain!
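The kind of ID-aware arithmetic described above can be sketched as follows. The `combine` helper here is hypothetical, illustrating the idea rather than pyradigm's actual method: when two feature sets are combined, alignment happens by samplet ID, not by row position.

```python
# Two feature sets for overlapping (but not identical) samplet IDs.
ds_thickness = {"sub01": [2.5, 2.7], "sub02": [2.4, 2.6], "sub03": [2.8, 2.9]}
ds_volume    = {"sub01": [1400.0],   "sub02": [1350.0]}

def combine(ds_a, ds_b):
    """Concatenate features for the IDs common to both datasets."""
    common_ids = ds_a.keys() & ds_b.keys()
    return {sid: ds_a[sid] + ds_b[sid] for sid in sorted(common_ids)}

merged = combine(ds_thickness, ds_volume)
for samplet_id, features in merged.items():   # iteration is by ID
    print(samplet_id, features)
# sub03 is dropped automatically: it had no volume features,
# whereas positional concatenation of two CSVs would silently misalign.
```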
Advantages

• Intuitive for niche domains: easy to use and teach
• Continuous validation
  • As part of .add_samplet() or .add_attr() etc.
  • Catches infinite, invalid or unexpected values
  • Catches duplicate rows and columns of all zeros
  • Allows arbitrary user-defined or domain-specific checks!
• Errors are caught early!
  • Instead of much later, e.g. in some other toolbox, and then having to painfully trace them back
• Improves integrity
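The "continuous validation" idea above can be sketched as a check that runs inside every add call, so bad values are rejected at insertion time instead of surfacing much later downstream. A minimal sketch (not pyradigm's actual checks):

```python
# Validation-on-add sketch: each samplet is checked the moment it enters.
import math

class ValidatedDataset:
    def __init__(self):
        self._data = {}

    def add_samplet(self, samplet_id, features, target):
        if any(not math.isfinite(v) for v in features):
            raise ValueError(f"non-finite value in features of {samplet_id}")
        if not any(features):   # e.g. an all-zeros feature vector
            raise ValueError(f"all-zero feature vector for {samplet_id}")
        self._data[samplet_id] = (features, target)

ds = ValidatedDataset()
ds.add_samplet("sub01", [0.4, 1.1], "healthy")   # passes all checks

try:
    ds.add_samplet("sub02", [float("nan"), 2.0], "disease")
except ValueError as err:
    print("caught early:", err)   # the error points at the exact samplet
```

User-defined or domain-specific checks would slot into the same place: extra predicates evaluated inside `add_samplet` before anything is stored.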
Advanced use cases: MultiDataset

• N samplets share a single set of k targets (y) and m attributes (A)
• Multiple feature sets of differing dimensionality (X1 with p1 features, X2 with p2, X3 with p3, X4 with p4) are all linked to the same samplet IDs
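The MultiDataset idea can be sketched as several feature sets keyed by the same samplet IDs, all sharing one set of targets and attributes. This is a hypothetical structure illustrating the figure, not the exact pyradigm MultiDataset API.

```python
# Shared targets (y) and attributes (A), keyed by samplet ID.
targets    = {"sub01": "healthy", "sub02": "disease"}
attributes = {"sub01": {"age": 61}, "sub02": {"age": 58}}

# Multiple feature sets (X1, X2, ...) of differing dimensionality.
feature_sets = {
    "thickness": {"sub01": [2.5, 2.7, 2.6], "sub02": [2.4, 2.6, 2.5]},
    "volume":    {"sub01": [1400.0],        "sub02": [1350.0]},
}

def modality_view(name):
    """Return (features, target) pairs for one modality, aligned by ID."""
    return {sid: (feats, targets[sid])
            for sid, feats in feature_sets[name].items()}

view = modality_view("volume")
assert view["sub01"] == ([1400.0], "healthy")
```

Running the same experiment across modalities then reduces to looping over `feature_sets`, with targets and attributes guaranteed to be identical in every view.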
Thank you

• Check it out at github.com/raamana
• Follow me at twitter.com/raamana_
• Contributors most welcome.
