Research data management for
medical data with pyradigm
Pradeep Reddy Raamana
crossinvalidation.com github.com/raamana
Research data management

Lifecycle: Plan → Create → Process → Analyze → Preserve → Share → Reuse

Goal: reduce data entropy in a few parts of this lifecycle!

Data entropy: the normal degradation in the information content associated with data and metadata over time (© DataONE)
I focus not on files, but on derived features: tables for machine learning. Research Feature Management (RFM)?
Dataset Lifecycle in ML

• Input or raw data: on disk, in folder hierarchies, with metadata etc.
• Intermediate outputs (1, 2): various types, widely varying formats
• Final outputs (1, 2, 3): diversity of needs, diversity of users
Challenges in RDM for Medical Data

Too many tables to manage, even for a small project!

• Mixed data types
  • Primary & secondary features
  • Several attributes of mixed types
• Multiple tables need to be linked and integrated with unique IDs
• Provenance needs to be captured
  • Processing steps and their metadata
• Ad hoc scripts to read and manage CSVs do not work at all
• Frequent change of hands
  • Students, RAs, staff etc., often with limited training
• Recipe for disaster
  • Features can get mixed up easily
  • Having to dig through multiple scripts and Word documents to figure out where everything is, what it means, and whether it is all properly linked is a nightmare!
• pyradigm is a library built to reduce my own pain!
Need for accessibility and domain adaptation

• Existing libraries for data tables, e.g. pandas, add a big barrier
  • Cognitive burden
  • Too contrived
  • Terminology misuse and mistaken use
• Domain adaptation
  • There are always a few minor domain-specific issues we need to handle
  • Preprocessing, naming, validation etc.
• Diversity of data types
  • Hashable IDs: integers, alphanumeric etc.
  • Features: simple vectors of numbers, or more structured data like graphs and trees
  • Targets: integers, categorical (healthy vs. disease), multi-output (disease 1 AND disease 2, etc.)
• Trying to reduce "data entropy" in key parts of RDM
  • Reminder: data lasts MUCH LONGER than the project itself!
Core structure of pyradigm

• Common predictive modeling, machine learning and biomarker workflows need to deal with
  • multiple types of data: features, targets, confounds
  • multiple data types: numerical, categorical; scalar, vector; potentially differing in length
• This is challenging and error-prone!
  • esp. when multiple experiments and comparisons are performed, e.g. across different sub-groups, different targets, different covariate regressions etc.
• pyradigm links these diverse types of data for the same ID / hash / subject:
  • X (N samplets × p features), y (k targets), A (m attributes)
  • Targets y can be continuous (score, severity, age etc. → regression) or categorical (healthy vs. disease → classification)
  • Attributes A are covariates or confounds such as age, gender, site; usually scalars, but sometimes vectors too!
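The core idea above can be sketched in plain Python. This is an illustrative sketch of the structure, not the actual pyradigm API: one samplet ID links its features (X), target (y) and attributes (A) together, so they can never get misaligned across tables.

```python
# Minimal sketch: a dataset is a mapping from one hashable samplet ID
# to everything known about that samplet.
dataset = {}

def add_samplet(samplet_id, features, target, attributes=None):
    """Store all data for one subject under a single hashable ID."""
    if samplet_id in dataset:
        raise KeyError(f"Duplicate samplet ID: {samplet_id}")
    dataset[samplet_id] = {
        "features": list(features),            # p-dimensional feature vector
        "target": target,                      # categorical or continuous
        "attributes": dict(attributes or {}),  # covariates / confounds
    }

add_samplet("sub01", [0.7, 1.2, 3.4], "healthy", {"age": 61, "site": "A"})
add_samplet("sub02", [0.9, 0.8, 2.1], "disease", {"age": 58, "site": "B"})

# Retrieval by ID returns everything at once -- no joins needed.
print(dataset["sub01"]["target"])  # healthy
```

Because each ID owns its features, target and attributes as one unit, slicing by subject or by sub-group cannot scramble rows across tables.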
Implementation details

• BaseDataset
  • Abstract base class, defining the coarse structure and properties
  • A collection of hashable IDs (dict keys), each ID expecting data and a target, plus an optional set of attributes
  • Validation of different types
• Methods
  • Add, summarize, retrieve, delete
  • Arithmetic: combine, transform etc.
  • Sampling: by target values, by attribute properties, or randomly
  • Exporting to different formats
• A few derived classes, imposing specific conditions on target properties, such as whether the target is categorical or numerical
  • ClassificationDataset: target is often a string (healthy, disease) or an integer (-1, 1, 2)
  • RegressionDataset: target is a continuous float value, such as a disease severity score or age
  • Many other possibilities, depending on domain and use-case
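The class layout above can be sketched as follows. The names mirror the slide, but this is a hypothetical skeleton, not pyradigm's exact implementation: the base class owns the ID-keyed storage, and each derived class enforces its own target type.

```python
# Sketch: abstract base owns the structure; subclasses validate targets.
from abc import ABC, abstractmethod
from numbers import Number

class BaseDataset(ABC):
    def __init__(self):
        self._data = {}      # samplet ID -> feature vector
        self._targets = {}   # samplet ID -> target

    @abstractmethod
    def _check_target(self, target):
        """Derived classes impose conditions on target properties."""

    def add_samplet(self, samplet_id, features, target):
        self._check_target(target)           # validate before storing
        self._data[samplet_id] = features
        self._targets[samplet_id] = target

class ClassificationDataset(BaseDataset):
    def _check_target(self, target):
        # categorical: a string label or an integer class code
        if not isinstance(target, (str, int)):
            raise TypeError("classification target must be str or int")

class RegressionDataset(BaseDataset):
    def _check_target(self, target):
        # continuous: any real number (severity score, age, ...)
        if not isinstance(target, Number):
            raise TypeError("regression target must be numeric")

clf = ClassificationDataset()
clf.add_samplet("sub01", [1.0, 2.0], "healthy")   # accepted
reg = RegressionDataset()
reg.add_samplet("sub01", [1.0, 2.0], 27.5)        # accepted
```

Putting the target check in one abstract hook keeps all add/retrieve/sampling machinery shared, while each domain-specific dataset type stays a few lines long.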
Implementation details contd.

• Classes and data are managed via a dict of dicts
  • Convenient for developers
  • The current setup is fine for our domain: thousands of rows, a few thousand columns
  • Larger and more complex domains would need fine-tuning
• Serialization
  • Pickle files for now; HDF5 etc. are possible
  • Requesting help from contributors
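The pickle-based serialization mentioned above amounts to a one-call round-trip of the dict-of-dicts state; a minimal sketch (HDF5 would instead need a library such as h5py):

```python
# Serialization sketch: the whole dict-of-dicts state round-trips
# losslessly through a single pickle file.
import os
import pickle
import tempfile

dataset = {
    "sub01": {"features": [0.7, 1.2], "target": "healthy"},
    "sub02": {"features": [0.9, 0.8], "target": "disease"},
}

path = os.path.join(tempfile.mkdtemp(), "dataset.pkl")
with open(path, "wb") as f:
    pickle.dump(dataset, f)   # save everything in one shot

with open(path, "rb") as f:
    restored = pickle.load(f)

assert restored == dataset    # identical after the round-trip
```

The usual pickle caveats apply: it is Python-only and should not be loaded from untrusted sources, which is one motivation for offering HDF5 as an alternative.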
Usage

• Once the dataset is built by someone, no one else has to worry about it.
• They can slice and dice it in any number of ways they desire.
• You don't need the script that built this data structure, as it is more or less self-explanatory.
Dataset iteration & arithmetic

• A lot more intuitive!
• Higher-level organization, not rows, columns and comments!
• Metadata gets propagated automatically, so life is much more productive!
• Achieving this with CSVs is a huge pain!
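The kind of ID-aware arithmetic described above can be sketched as follows. The `combine` helper here is hypothetical, illustrating the idea rather than pyradigm's actual method: when two feature sets are combined, alignment happens by samplet ID, not by row position.

```python
# Two feature sets for overlapping (but not identical) samplet IDs.
ds_thickness = {"sub01": [2.5, 2.7], "sub02": [2.4, 2.6], "sub03": [2.8, 2.9]}
ds_volume    = {"sub01": [1400.0],   "sub02": [1350.0]}

def combine(ds_a, ds_b):
    """Concatenate features for the IDs common to both datasets."""
    common_ids = ds_a.keys() & ds_b.keys()
    return {sid: ds_a[sid] + ds_b[sid] for sid in sorted(common_ids)}

merged = combine(ds_thickness, ds_volume)
for samplet_id, features in merged.items():   # iteration is by ID
    print(samplet_id, features)
# sub03 is dropped automatically: it had no volume features,
# whereas positional concatenation of two CSVs would silently misalign.
```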
Advantages

• Intuitive for niche domains: easy to use and teach
• Continuous validation
  • As part of .add_samplet() or .add_attr() etc.
  • Catches infinite, invalid or unexpected values
  • Catches duplicate rows and columns of all zeros
  • Allows arbitrary user-defined or domain-specific checks!
• Errors are caught early!
  • Instead of much later, e.g. in some other toolbox, and then having to painfully trace them back
• Improves integrity
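The "continuous validation" idea above can be sketched as a check that runs inside every add call, so bad values are rejected at insertion time instead of surfacing much later downstream. A minimal sketch (not pyradigm's actual checks):

```python
# Validation-on-add sketch: each samplet is checked the moment it enters.
import math

class ValidatedDataset:
    def __init__(self):
        self._data = {}

    def add_samplet(self, samplet_id, features, target):
        if any(not math.isfinite(v) for v in features):
            raise ValueError(f"non-finite value in features of {samplet_id}")
        if not any(features):   # e.g. an all-zeros feature vector
            raise ValueError(f"all-zero feature vector for {samplet_id}")
        self._data[samplet_id] = (features, target)

ds = ValidatedDataset()
ds.add_samplet("sub01", [0.4, 1.1], "healthy")   # passes all checks

try:
    ds.add_samplet("sub02", [float("nan"), 2.0], "disease")
except ValueError as err:
    print("caught early:", err)   # the error points at the exact samplet
```

User-defined or domain-specific checks would slot into the same place: extra predicates evaluated inside `add_samplet` before anything is stored.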
Advanced use cases: MultiDataset

• N samplets share a single set of k targets (y) and m attributes (A)
• Multiple feature sets of differing dimensionality (X1 with p1 features, X2 with p2, X3 with p3, X4 with p4) are all linked to the same samplet IDs
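The MultiDataset idea can be sketched as several feature sets keyed by the same samplet IDs, all sharing one set of targets and attributes. This is a hypothetical structure illustrating the figure, not the exact pyradigm MultiDataset API.

```python
# Shared targets (y) and attributes (A), keyed by samplet ID.
targets    = {"sub01": "healthy", "sub02": "disease"}
attributes = {"sub01": {"age": 61}, "sub02": {"age": 58}}

# Multiple feature sets (X1, X2, ...) of differing dimensionality.
feature_sets = {
    "thickness": {"sub01": [2.5, 2.7, 2.6], "sub02": [2.4, 2.6, 2.5]},
    "volume":    {"sub01": [1400.0],        "sub02": [1350.0]},
}

def modality_view(name):
    """Return (features, target) pairs for one modality, aligned by ID."""
    return {sid: (feats, targets[sid])
            for sid, feats in feature_sets[name].items()}

view = modality_view("volume")
assert view["sub01"] == ([1400.0], "healthy")
```

Running the same experiment across modalities then reduces to looping over `feature_sets`, with targets and attributes guaranteed to be identical in every view.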
Thank you

• Check it out at github.com/raamana
• Follow me at twitter.com/raamana_
• Contributors most welcome.
