From data lakes to actionable data
(adventures in data curation)
Andrea Splendiani, PhD
BioData, Basel
November 29th, 2018
NIBR Informatics
NIBR Informatics, TMS
Content
• From data to knowledge: The What and Why of data curation (and data lakes)
• What we do
• How we think
• Perspectives
Public use
What & Why: data lakes
Noise vs actionable data
The data lake paradigm
1. Collect “all” data
2. Index it, make it searchable
3. …
4. Analyze it
5. …
6. Generate value
What & Why: an example
• A data item (sample annotation):
– It is published
– It is ingested in a larger repository
– Some extraction/normalization is done
– The data item is now in a larger context (it can be queried with other data)
What & Why: an example
Sample annotation: structured and unstructured information
What & Why: an example
• Information is incorporated in a repository
• Some normalization/mapping of information, e.g.:
– Adult -> EFO:0001272
– Ethanol -> CHEBI:16236
• Structured representation
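The normalization step above can be sketched in a few lines of Python. This is a minimal illustration, not the pipeline used in the talk: the lookup table `TERM_TO_ONTOLOGY` and the function `normalize` are hypothetical names, and a real system would likely query an ontology service rather than a hard-coded dictionary. The EFO identifier is written with seven digits (EFO:0001272), following the sample table shown later in the deck.

```python
# Toy lookup table (assumption): free-text term -> ontology identifier.
TERM_TO_ONTOLOGY = {
    "adult": "EFO:0001272",
    "ethanol": "CHEBI:16236",
}


def normalize(value):
    """Return (original value, ontology ID), or (original value, None) if unmapped."""
    key = value.strip().lower()
    return value, TERM_TO_ONTOLOGY.get(key)


# A few properties from the sample annotation used in this deck.
annotation = {
    "development stage": "adult",
    "storage conditions": "Ethanol",
    "strain": "wild caught",
}

# Structured representation: every property keeps its text and gains an ID.
structured = {prop: normalize(val) for prop, val in annotation.items()}

for prop, (text, term_id) in structured.items():
    print(prop, "|", text, "|", term_id)
```

Note that the mapping looks only at the value, not at the property it belongs to. That is enough to produce a structured record, but it discards context that matters once the record is queried together with other data.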
What & Why: an example
• Information is put in a larger context and can be queried across different datasets:
– E.g.: all samples treated with alcohol (CHEBI:16236)
– E.g.: differentially expressed genes for samples treated with alcohol
An ideal lake
What & Why: an example
All genes affected by alcohol, a close look at results.
Sample ID: SAMN03105804
Description: Model organism or animal sample from Hoplostethus atlanticus
Biological characteristics (structure description):
Property | Value | Ontology (annotation from EBI)
biomaterial provider | Peter Ritchie (Victoria University of Wellington) | EFO_0000001 (experimental factor)
development stage | adult | EFO_0001272
latitude and longitude | 46.50 S 166.00 E | EFO_0000001
organism part | Muscle | UBERON_0001015
strain | wild caught | EFO_0000001
geographic location | New Zealand: Puysegur | EFO_0000001
storage conditions | Ethanol | CHEBI_16236
sample code | OR00579 | EFO_0000001
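The point of this example can be sketched as a small query. The fish sample SAMN03105804 carries CHEBI_16236 because it was stored in ethanol, so a query that only looks for the ontology ID would return it among samples "treated with alcohol". The sketch below is an illustration under stated assumptions: the second sample, the `treatment` property name, and both function names are invented for contrast and do not come from the talk.

```python
# Each annotation is (property, value, ontology ID), as in the table above.
SAMPLES = [
    {
        "id": "SAMN03105804",  # the fish sample from the slide
        "annotations": [
            ("organism part", "Muscle", "UBERON_0001015"),
            ("storage conditions", "Ethanol", "CHEBI_16236"),
        ],
    },
    {
        "id": "HYPOTHETICAL-001",  # invented sample, for contrast only
        "annotations": [("treatment", "Ethanol", "CHEBI_16236")],
    },
]


def samples_mentioning(samples, term_id):
    """Naive query: any annotation that carries the ontology ID."""
    return [
        s["id"]
        for s in samples
        if any(onto == term_id for _, _, onto in s["annotations"])
    ]


def samples_with_property(samples, term_id, properties):
    """Context-aware query: the ID must annotate one of the given properties."""
    return [
        s["id"]
        for s in samples
        if any(
            onto == term_id and prop in properties
            for prop, _, onto in s["annotations"]
        )
    ]


naive = samples_mentioning(SAMPLES, "CHEBI_16236")
aware = samples_with_property(SAMPLES, "CHEBI_16236", {"treatment"})
print(naive)  # includes the ethanol-stored fish sample
print(aware)  # only the sample where ethanol is a treatment
```

The naive query is what a lake with value-only harmonization supports; the second query needs the property context to survive curation.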
What & Why: an example
The value of query results across large data assets depends on the quality of the data harmonization
FAIR
In the previous examples, errors would not have impacted the overall results.
But as we lose track of details, can we know how errors propagate?
What we do
Proactive and reactive approaches to data curation
• Reactive: cleanse (meta)data already produced
• Proactive: influence the production of (meta)data
Data Curation Framework
Can we “augment” a curator with NLP that scales?
Can we make human processes reproducible?
• Why: formalize curation processes
– Efficiency
– Reproducibility
• What: a rule-based environment to design “curation protocols”.
– Embed atomic operations, such as NLP-based ontology mapping, text extraction, computations…
• How: build-by-example approach
• Who: “power user”, who has a stake in the standard definition process
Data Curation Framework (The theory behind)
Framing the data curation process: making multiple dimensions explicit
Validation state (Confidence): Valid; Valid; Valid
Curation goal (The need): Required; Required; Required; Required; Required
Semantic type¹ (Meaning): Identifier about Sample; ID² about Organism; Name about Organism; Name about Gender; Identifier about Gender; Description about Age; Age Unit about Age
Field name (the “location” in the source): ID; taxID; Organism; Gender; age
Value: GSM701607; 10090; Mus Musculus; 6 weeks old
¹ All semantic types are expressed via an ontology (here presented as a simplified definition)
² Identifiers also require a domain specification
Example: extract from NCBI GEO GSM701607 (only a subset of fields from the previous slide is considered)
Data Curation Framework (The theory behind): abstract rules, operators
Compute missing identifier:
If (E.X.type=“Identifier” and E.X.Goal=“Required” and E.X.Value=“”
and exists (E.Y: E.Y.type.about=E.X.type.about and E.Y.type=“Description” and E.Y.Value!=“”))
then E.X.Value=extract(isAbout(E.Y.type), E.Y.Value)
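The "compute missing identifier" rule can be read as a small program. The sketch below is one possible Python rendering, not the framework's actual implementation: the `Element` class, the `TOY_LEXICON` table and the substring-based `extract` are assumptions standing in for the ontology-backed types and NLP operators mentioned in the talk. The taxonomy ID 10090 for Mus Musculus is taken from the GEO example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Element:
    kind: str        # semantic type: "Identifier", "Name", "Description", ...
    about: str       # what the element is about, e.g. "Organism"
    goal: str        # curation goal, e.g. "Required" or "Optional"
    value: str = ""  # empty string means "missing"


# Toy stand-in for an NLP-based extraction operator (assumption).
TOY_LEXICON = {"Organism": {"mus musculus": "10090"}}


def extract(about: str, text: str) -> Optional[str]:
    """Return an identifier for `about` found in `text`, or None."""
    lowered = text.lower()
    for label, identifier in TOY_LEXICON.get(about, {}).items():
        if label in lowered:
            return identifier
    return None


def compute_missing_identifiers(entity: List[Element]) -> List[Element]:
    """Fill empty required identifiers from a description about the same thing."""
    for x in entity:
        if x.kind == "Identifier" and x.goal == "Required" and x.value == "":
            for y in entity:
                if y.about == x.about and y.kind == "Description" and y.value != "":
                    found = extract(y.about, y.value)
                    if found is not None:
                        x.value = found
                        break
    return entity


sample = [
    Element("Identifier", "Organism", "Required"),
    Element("Description", "Organism", "Optional", "Mus Musculus, 6 weeks old"),
]
compute_missing_identifiers(sample)
print(sample[0].value)
```

Keeping the rule as a function over typed elements, rather than over column names, is what lets the same rule apply to any source whose fields have been given semantic types.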
Data Curation Framework (from theory to practice)
Rules to set curation tasks
This curation rule can be saved, shared, and executed…
What we do
Proactive and reactive approaches to data curation
• Reactive: cleanse (meta)data already produced
• Proactive: influence the production of (meta)data
Templates: end user interaction
sampleAnnot-template
• Why: propagate data standards.
• What:
– a simple Excel-like template (can be shared via a URL)
– a central system to serve and process templates
• How:
– rules behind the template allow normalization.
– a central repository can capture variations.
• Who: a power user designs a template; end users use it (and change it).
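As a rough sketch of how rules behind a template column could normalize entries while a central store captures variations, consider the Python below. Everything in it is an assumption made for illustration: the `TemplateColumn` class, its synonym table and the example values are hypothetical, and only the field name "Sample Source Sex" comes from the deck.

```python
from collections import Counter


class TemplateColumn:
    """One template column with a normalization rule attached (hypothetical)."""

    def __init__(self, name, allowed, synonyms=None):
        self.name = name
        self.allowed = set(allowed)           # standard values
        self.synonyms = dict(synonyms or {})  # known variants -> standard value
        self.variations = Counter()           # unrecognized entries, for review

    def normalize(self, raw):
        """Return the standard value, or None after recording the variation."""
        key = raw.strip().lower()
        key = self.synonyms.get(key, key)
        if key in self.allowed:
            return key
        self.variations[raw.strip()] += 1
        return None


sex = TemplateColumn(
    "Sample Source Sex",
    allowed={"male", "female"},
    synonyms={"m": "male", "f": "female"},
)

entries = ["M", "Female ", "unknown", "unknown"]
normalized = [sex.normalize(e) for e in entries]
print(normalized)
print(dict(sex.variations))
```

The recorded variations are the feedback loop: a power user can review them and decide whether to extend the standard or add a synonym.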
How we think
Sample schema mapping (the basic problem)
Standardized list of fields:
• Sample Source Species
• Sample Source Anatomical par
• Sample Source Sex
• Sample Storage Conditions
• …
Can we combine lexical similarities and data distributions for better predictions?
Assumption: within the same study, properties with the same name have the same meaning
Tamr/ML/interactive training
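One way to picture the combination of lexical similarity and data distributions is a weighted score over both signals. The sketch below is not Tamr's method and not the approach evaluated in the talk; it is a minimal stand-in that uses `difflib.SequenceMatcher` for name similarity and Jaccard overlap of observed values for the distribution signal. The function names, the weight, and the example values are all assumptions.

```python
from difflib import SequenceMatcher


def lexical_similarity(a, b):
    """Similarity of two field names, in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def value_overlap(values_a, values_b):
    """Jaccard overlap of the distinct (case-insensitive) values of two fields."""
    set_a = {v.strip().lower() for v in values_a}
    set_b = {v.strip().lower() for v in values_b}
    if not set_a or not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)


def combined_score(name_a, values_a, name_b, values_b, weight=0.5):
    """Weighted mix of name similarity and value overlap."""
    return (weight * lexical_similarity(name_a, name_b)
            + (1 - weight) * value_overlap(values_a, values_b))


def best_match(source_name, source_values, targets, weight=0.5):
    """Pick the standard field with the highest combined score."""
    return max(
        targets,
        key=lambda t: combined_score(source_name, source_values, t, targets[t], weight),
    )


# Standard fields from the deck, with invented example values.
targets = {
    "Sample Source Sex": ["male", "female"],
    "Sample Source Species": ["Mus musculus", "Homo sapiens"],
}

match = best_match("gender", ["male", "female", "male"], targets)
print(match)
```

Here the source field name "gender" shares little with either standard name, so the name signal alone is weak; the overlap of values is what ties it to "Sample Source Sex".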
Experimental process
• Hypothesis: what can we leverage?
• Problem framing
• Execution (training/learning in Tamr)
• Data review
• Assessment of results (sampling)
Ideas
• Properties are not independent: can we use co-occurrence?
– Two species detected. Rule: need to specify if transplant or not
• Can we use context for better mappings?
– Source?
– Submitter?
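The co-occurrence idea can be made concrete with the transplant rule. The following is a hedged sketch: the record layout (a `species` list and an optional `transplant` flag) and the function name are assumptions, chosen only to show how one property can trigger a requirement on another.

```python
def check_transplant_rule(sample):
    """Return curation issues for a sample annotation (hypothetical layout)."""
    # Distinct, non-empty species names, compared case-insensitively.
    species = {s.strip().lower() for s in sample.get("species", []) if s.strip()}
    issues = []
    # Two (or more) species in one sample: the transplant status must be stated.
    if len(species) >= 2 and sample.get("transplant") is None:
        issues.append(
            "Two species detected: specify whether the sample is a transplant"
        )
    return issues


mixed = {"species": ["Homo sapiens", "Mus musculus"]}
declared = {"species": ["Homo sapiens", "Mus musculus"], "transplant": True}
single = {"species": ["Mus musculus", "mus musculus "]}

print(check_transplant_rule(mixed))
print(check_transplant_rule(declared))
print(check_transplant_rule(single))
```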
Perspectives
Reflection points
• What is the value of “curation”?
• What model for curation?
• What metrics?
• What measurements?
What is the value of data curation?
A devil’s advocate perspective: is data re-use really valuable?
• 80% of analyst time is spent discovering and preparing data [1]. Why are there no solutions?
• A major proportion of published data is reported to be irreproducible [3]
• Lost know-how about the data.
• Cost of producing data (e.g.: genomics) is exponentially decreasing: what is the value of past data? [2]
• Is it at all possible to have homogeneous data? Do we have homogeneous knowledge?
What model for curation?
• Curation at source: can we anticipate all use cases?
• Curation at ingestion: hard-coding a use case
• Curation on demand: at all feasible?
What metrics?
• Can we quantify how much data is “curated enough”?
• Can we quantify the value of data curation?
• FAIRness metrics, measures of quality:
– Can we assess how data fits a purpose?
– (if you had to invest X money in N datasets, which criteria would you use to choose?)
http://bit.ly/valueOfData
What measurements?
• Data (and its context) evolve.
• Whichever measure for “value” we choose, it will change in time.
• How do we monitor such “value”?
http://yummydata.org
Conclusions/recap
1. The more data we have in data lakes, the more we need to think about how to relate data together
– Especially if data is observational and coming from different sources
2. We can implement both reactive and proactive approaches to normalize data.
3. Is curation meta-“data science”?
4. How can we quantify the value of curation?
Acknowledgments
• Daniel Cronenberger (SW Engineering)
• Frederic Sutter (SW Engineering)
• Dorothy Reilly (Data curation)
• Jean Marc Von-Allmen (Data curation)
• Anosha Siripala (Data curation)
• Joseph Kunkel (Data science)
• Martin Zablocki (Data science, Trivadis)
• Ted Snyder (Data science, Tamr)
Thank you
References and picture credits
• References
– [1] https://hbr.org/2017/05/whats-your-data-strategy
– [2] https://www.collaborativedrug.com/provocative-thoughts-from-chris-lipinski/?utm_source=hs_email&utm_medium=email&utm_content=67107351&_hsenc=p2ANqtz-8_ZZ-AFSiGUcDnAIumoS6GgXnCeZDA55mY2WwDl9XLuZUchRSG53bFpfqJNgtp3CsXRG2uj62yG_L6PKvcc8o-Q8MwLdHjHA_zxfYtzQD7iERrbO8&_hsmi=67107352#older-data
– [3] C.G. Begley, L.M. Ellis, Drug development: raise standards for preclinical cancer research, Nature 483 (7391) (2012) 531–533.
• Picture credits
– https://www.pexels.com/photo/full-frame-shot-of-abstract-pattern-247719/
– https://www.semanticscholar.org/paper/The-EBI-RDF-platform%3A-linked-open-data-for-the-life-Jupp-Malone/6516a4a5885847438ba2ec7f7f32000c50389a04
– https://www.wamc.org/post/epa-fund-clean-water-project-hopewell-junction
– https://upload.wikimedia.org/wikipedia/commons/d/d8/Inle_Lake_%28Myanmar%29.jpg
– https://www.ebi.ac.uk/ols/ontologies/uberon/terms/graph?iri=http://purl.obolibrary.org/obo/UBERON_0002107
– https://en.wikipedia.org/wiki/The_Devil%27s_Advocate_(1997_film)#/media/File:Faust_and_Mephisto.png
– https://www.researchgate.net/figure/Part-1-of-the-Conceptual-Data-Model-that-stores-the-Learning-Objects-in-Creator_fig1_277477099
– https://commons.wikimedia.org/wiki/File:Etl-process.svg
Business Use Only37

More Related Content

What's hot

1.2 steps and functionalities
1.2 steps and functionalities1.2 steps and functionalities
1.2 steps and functionalitiesRajendran
 
Knowledge discovery thru data mining
Knowledge discovery thru data miningKnowledge discovery thru data mining
Knowledge discovery thru data miningDevakumar Jain
 
Introduction-to-Knowledge Discovery in Database
Introduction-to-Knowledge Discovery in DatabaseIntroduction-to-Knowledge Discovery in Database
Introduction-to-Knowledge Discovery in DatabaseKartik Kalpande Patil
 
Data science syllabus
Data science syllabusData science syllabus
Data science syllabusanoop bk
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data ScienceGabriel Moreira
 
Data miningppt378
Data miningppt378Data miningppt378
Data miningppt378nitttin
 
Melissa Informatics - Data Quality and AI
Melissa Informatics - Data Quality and AIMelissa Informatics - Data Quality and AI
Melissa Informatics - Data Quality and AImelissadata
 
Data Mining based on Hashing Technique
Data Mining based on Hashing TechniqueData Mining based on Hashing Technique
Data Mining based on Hashing Techniqueijtsrd
 
Data mining seminar report
Data mining seminar reportData mining seminar report
Data mining seminar reportmayurik19
 
Data mining concepts and work
Data mining concepts and workData mining concepts and work
Data mining concepts and workAmr Abd El Latief
 
Data mining and data warehouse lab manual updated
Data mining and data warehouse lab manual updatedData mining and data warehouse lab manual updated
Data mining and data warehouse lab manual updatedYugal Kumar
 
Fairification experience clarifying the semantics of data matrices
Fairification experience clarifying the semantics of data matricesFairification experience clarifying the semantics of data matrices
Fairification experience clarifying the semantics of data matricesPistoia Alliance
 
Chapter - 5 Data Mining Concepts and Techniques 2nd Ed slides Han & Kamber
Chapter - 5 Data Mining Concepts and Techniques 2nd Ed slides Han & KamberChapter - 5 Data Mining Concepts and Techniques 2nd Ed slides Han & Kamber
Chapter - 5 Data Mining Concepts and Techniques 2nd Ed slides Han & Kambererror007
 
Data analytics beyond data processing and how it affects Industry 4.0
Data analytics beyond data processing and how it affects Industry 4.0Data analytics beyond data processing and how it affects Industry 4.0
Data analytics beyond data processing and how it affects Industry 4.0Mathieu d'Aquin
 
knowledge discovery and data mining approach in databases (2)
knowledge discovery and data mining approach in databases (2)knowledge discovery and data mining approach in databases (2)
knowledge discovery and data mining approach in databases (2)Kartik Kalpande Patil
 

What's hot (18)

1.2 steps and functionalities
1.2 steps and functionalities1.2 steps and functionalities
1.2 steps and functionalities
 
Knowledge discovery thru data mining
Knowledge discovery thru data miningKnowledge discovery thru data mining
Knowledge discovery thru data mining
 
Introduction-to-Knowledge Discovery in Database
Introduction-to-Knowledge Discovery in DatabaseIntroduction-to-Knowledge Discovery in Database
Introduction-to-Knowledge Discovery in Database
 
Data science syllabus
Data science syllabusData science syllabus
Data science syllabus
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data Science
 
Data mining and knowledge Discovery
Data mining and knowledge DiscoveryData mining and knowledge Discovery
Data mining and knowledge Discovery
 
Data miningppt378
Data miningppt378Data miningppt378
Data miningppt378
 
Melissa Informatics - Data Quality and AI
Melissa Informatics - Data Quality and AIMelissa Informatics - Data Quality and AI
Melissa Informatics - Data Quality and AI
 
Data Mining based on Hashing Technique
Data Mining based on Hashing TechniqueData Mining based on Hashing Technique
Data Mining based on Hashing Technique
 
Data mining seminar report
Data mining seminar reportData mining seminar report
Data mining seminar report
 
Data Mining
Data MiningData Mining
Data Mining
 
Data mining concepts and work
Data mining concepts and workData mining concepts and work
Data mining concepts and work
 
Data mining and data warehouse lab manual updated
Data mining and data warehouse lab manual updatedData mining and data warehouse lab manual updated
Data mining and data warehouse lab manual updated
 
data mining
data miningdata mining
data mining
 
Fairification experience clarifying the semantics of data matrices
Fairification experience clarifying the semantics of data matricesFairification experience clarifying the semantics of data matrices
Fairification experience clarifying the semantics of data matrices
 
Chapter - 5 Data Mining Concepts and Techniques 2nd Ed slides Han & Kamber
Chapter - 5 Data Mining Concepts and Techniques 2nd Ed slides Han & KamberChapter - 5 Data Mining Concepts and Techniques 2nd Ed slides Han & Kamber
Chapter - 5 Data Mining Concepts and Techniques 2nd Ed slides Han & Kamber
 
Data analytics beyond data processing and how it affects Industry 4.0
Data analytics beyond data processing and how it affects Industry 4.0Data analytics beyond data processing and how it affects Industry 4.0
Data analytics beyond data processing and how it affects Industry 4.0
 
knowledge discovery and data mining approach in databases (2)
knowledge discovery and data mining approach in databases (2)knowledge discovery and data mining approach in databases (2)
knowledge discovery and data mining approach in databases (2)
 

Similar to From data lakes to actionable data (adventures in data curation)

Data analytcis-first-steps
Data analytcis-first-stepsData analytcis-first-steps
Data analytcis-first-stepsShesha R
 
A Threshold fuzzy entropy based feature selection method applied in various b...
A Threshold fuzzy entropy based feature selection method applied in various b...A Threshold fuzzy entropy based feature selection method applied in various b...
A Threshold fuzzy entropy based feature selection method applied in various b...IJMER
 
Machine Learning for Data Extraction
Machine Learning for Data ExtractionMachine Learning for Data Extraction
Machine Learning for Data ExtractionDasha Herrmannova
 
Scalable Action Mining Hybrid Method for Enhanced User Emotions in Education ...
Scalable Action Mining Hybrid Method for Enhanced User Emotions in Education ...Scalable Action Mining Hybrid Method for Enhanced User Emotions in Education ...
Scalable Action Mining Hybrid Method for Enhanced User Emotions in Education ...IJCI JOURNAL
 
ODSC East 2017: Data Science Models For Good
ODSC East 2017: Data Science Models For GoodODSC East 2017: Data Science Models For Good
ODSC East 2017: Data Science Models For GoodKarry Lu
 
NeuroVault and the vision for data sharing in neuroimaging
NeuroVault and the vision for data sharing in neuroimagingNeuroVault and the vision for data sharing in neuroimaging
NeuroVault and the vision for data sharing in neuroimagingKrzysztof Gorgolewski
 
A Comparative Study of Various Data Mining Techniques: Statistics, Decision T...
A Comparative Study of Various Data Mining Techniques: Statistics, Decision T...A Comparative Study of Various Data Mining Techniques: Statistics, Decision T...
A Comparative Study of Various Data Mining Techniques: Statistics, Decision T...Editor IJCATR
 
Propagating Data Policies - A User Study
Propagating Data Policies - A User StudyPropagating Data Policies - A User Study
Propagating Data Policies - A User StudyEnrico Daga
 
0912f50eedb48e44d7000000
0912f50eedb48e44d70000000912f50eedb48e44d7000000
0912f50eedb48e44d7000000Rakesh Sharma
 
Introduction to Data Analytics.pptx
Introduction to Data Analytics.pptxIntroduction to Data Analytics.pptx
Introduction to Data Analytics.pptxDikshantSharma63
 
grizzly - informal overview - pydata boston 2013
grizzly - informal overview - pydata boston 2013 grizzly - informal overview - pydata boston 2013
grizzly - informal overview - pydata boston 2013 adrianheilbut
 
Converged IT and Data Commons
Converged IT and Data CommonsConverged IT and Data Commons
Converged IT and Data CommonsSimon Twigger
 
Large Graph Mining
Large Graph MiningLarge Graph Mining
Large Graph MiningSabri Skhiri
 
Selecting the correct Data Mining Method: Classification & InDaMiTe-R
Selecting the correct Data Mining Method: Classification & InDaMiTe-RSelecting the correct Data Mining Method: Classification & InDaMiTe-R
Selecting the correct Data Mining Method: Classification & InDaMiTe-RIOSR Journals
 
Machinr Learning and artificial_Lect1.pdf
Machinr Learning and artificial_Lect1.pdfMachinr Learning and artificial_Lect1.pdf
Machinr Learning and artificial_Lect1.pdfSaketBansal9
 
Data mining 2012 generalwithmethods
Data mining  2012 generalwithmethodsData mining  2012 generalwithmethods
Data mining 2012 generalwithmethodsMichael Gilman
 
Introduction to feature subset selection method
Introduction to feature subset selection methodIntroduction to feature subset selection method
Introduction to feature subset selection methodIJSRD
 

Similar to From data lakes to actionable data (adventures in data curation) (20)

Data analytcis-first-steps
Data analytcis-first-stepsData analytcis-first-steps
Data analytcis-first-steps
 
A Threshold fuzzy entropy based feature selection method applied in various b...
A Threshold fuzzy entropy based feature selection method applied in various b...A Threshold fuzzy entropy based feature selection method applied in various b...
A Threshold fuzzy entropy based feature selection method applied in various b...
 
Machine Learning for Data Extraction
Machine Learning for Data ExtractionMachine Learning for Data Extraction
Machine Learning for Data Extraction
 
Scalable Action Mining Hybrid Method for Enhanced User Emotions in Education ...
Scalable Action Mining Hybrid Method for Enhanced User Emotions in Education ...Scalable Action Mining Hybrid Method for Enhanced User Emotions in Education ...
Scalable Action Mining Hybrid Method for Enhanced User Emotions in Education ...
 
ODSC East 2017: Data Science Models For Good
ODSC East 2017: Data Science Models For GoodODSC East 2017: Data Science Models For Good
ODSC East 2017: Data Science Models For Good
 
NeuroVault and the vision for data sharing in neuroimaging
NeuroVault and the vision for data sharing in neuroimagingNeuroVault and the vision for data sharing in neuroimaging
NeuroVault and the vision for data sharing in neuroimaging
 
A Comparative Study of Various Data Mining Techniques: Statistics, Decision T...
A Comparative Study of Various Data Mining Techniques: Statistics, Decision T...A Comparative Study of Various Data Mining Techniques: Statistics, Decision T...
A Comparative Study of Various Data Mining Techniques: Statistics, Decision T...
 
Chapter 1. Introduction.ppt
Chapter 1. Introduction.pptChapter 1. Introduction.ppt
Chapter 1. Introduction.ppt
 
Propagating Data Policies - A User Study
Propagating Data Policies - A User StudyPropagating Data Policies - A User Study
Propagating Data Policies - A User Study
 
0912f50eedb48e44d7000000
0912f50eedb48e44d70000000912f50eedb48e44d7000000
0912f50eedb48e44d7000000
 
Data science guide
Data science guideData science guide
Data science guide
 
Introduction to Data Analytics.pptx
Introduction to Data Analytics.pptxIntroduction to Data Analytics.pptx
Introduction to Data Analytics.pptx
 
grizzly - informal overview - pydata boston 2013
grizzly - informal overview - pydata boston 2013 grizzly - informal overview - pydata boston 2013
grizzly - informal overview - pydata boston 2013
 
Converged IT and Data Commons
Converged IT and Data CommonsConverged IT and Data Commons
Converged IT and Data Commons
 
Large Graph Mining
Large Graph MiningLarge Graph Mining
Large Graph Mining
 
Selecting the correct Data Mining Method: Classification & InDaMiTe-R
Selecting the correct Data Mining Method: Classification & InDaMiTe-RSelecting the correct Data Mining Method: Classification & InDaMiTe-R
Selecting the correct Data Mining Method: Classification & InDaMiTe-R
 
Machinr Learning and artificial_Lect1.pdf
Machinr Learning and artificial_Lect1.pdfMachinr Learning and artificial_Lect1.pdf
Machinr Learning and artificial_Lect1.pdf
 
Data mining 2012 generalwithmethods
Data mining  2012 generalwithmethodsData mining  2012 generalwithmethods
Data mining 2012 generalwithmethods
 
Introduction to feature subset selection method
Introduction to feature subset selection methodIntroduction to feature subset selection method
Introduction to feature subset selection method
 
Data Mining Intro
Data Mining IntroData Mining Intro
Data Mining Intro
 

Recently uploaded

Genome Projects : Human, Rice,Wheat,E coli and Arabidopsis.
Genome Projects : Human, Rice,Wheat,E coli and Arabidopsis.Genome Projects : Human, Rice,Wheat,E coli and Arabidopsis.
Genome Projects : Human, Rice,Wheat,E coli and Arabidopsis.Cherry
 
TransientOffsetin14CAftertheCarringtonEventRecordedbyPolarTreeRings
TransientOffsetin14CAftertheCarringtonEventRecordedbyPolarTreeRingsTransientOffsetin14CAftertheCarringtonEventRecordedbyPolarTreeRings
TransientOffsetin14CAftertheCarringtonEventRecordedbyPolarTreeRingsSérgio Sacani
 
Pteris : features, anatomy, morphology and lifecycle
Pteris : features, anatomy, morphology and lifecyclePteris : features, anatomy, morphology and lifecycle
Pteris : features, anatomy, morphology and lifecycleCherry
 
Molecular phylogeny, molecular clock hypothesis, molecular evolution, kimuras...
Molecular phylogeny, molecular clock hypothesis, molecular evolution, kimuras...Molecular phylogeny, molecular clock hypothesis, molecular evolution, kimuras...
Molecular phylogeny, molecular clock hypothesis, molecular evolution, kimuras...Cherry
 
Phenolics: types, biosynthesis and functions.
Phenolics: types, biosynthesis and functions.Phenolics: types, biosynthesis and functions.
Phenolics: types, biosynthesis and functions.Cherry
 
development of diagnostic enzyme assay to detect leuser virus
development of diagnostic enzyme assay to detect leuser virusdevelopment of diagnostic enzyme assay to detect leuser virus
development of diagnostic enzyme assay to detect leuser virusNazaninKarimi6
 
Taphonomy and Quality of the Fossil Record
Taphonomy and Quality of the  Fossil RecordTaphonomy and Quality of the  Fossil Record
Taphonomy and Quality of the Fossil RecordSangram Sahoo
 
Human genetics..........................pptx
Human genetics..........................pptxHuman genetics..........................pptx
Human genetics..........................pptxCherry
 
COMPOSTING : types of compost, merits and demerits
COMPOSTING : types of compost, merits and demeritsCOMPOSTING : types of compost, merits and demerits
COMPOSTING : types of compost, merits and demeritsCherry
 
Terpineol and it's characterization pptx
Terpineol and it's characterization pptxTerpineol and it's characterization pptx
Terpineol and it's characterization pptxMuhammadRazzaq31
 
LUNULARIA -features, morphology, anatomy ,reproduction etc.
LUNULARIA -features, morphology, anatomy ,reproduction etc.LUNULARIA -features, morphology, anatomy ,reproduction etc.
LUNULARIA -features, morphology, anatomy ,reproduction etc.Cherry
 
CYTOGENETIC MAP................ ppt.pptx
CYTOGENETIC MAP................ ppt.pptxCYTOGENETIC MAP................ ppt.pptx
CYTOGENETIC MAP................ ppt.pptxCherry
 
Method of Quantifying interactions and its types
Method of Quantifying interactions and its typesMethod of Quantifying interactions and its types
Method of Quantifying interactions and its typesNISHIKANTKRISHAN
 
Energy is the beat of life irrespective of the domains. ATP- the energy curre...
Energy is the beat of life irrespective of the domains. ATP- the energy curre...Energy is the beat of life irrespective of the domains. ATP- the energy curre...
Energy is the beat of life irrespective of the domains. ATP- the energy curre...Nistarini College, Purulia (W.B) India
 
Lipids: types, structure and important functions.
Lipids: types, structure and important functions.Lipids: types, structure and important functions.
Lipids: types, structure and important functions.Cherry
 
Role of AI in seed science Predictive modelling and Beyond.pptx
Role of AI in seed science  Predictive modelling and  Beyond.pptxRole of AI in seed science  Predictive modelling and  Beyond.pptx
Role of AI in seed science Predictive modelling and Beyond.pptxArvind Kumar
 
GBSN - Biochemistry (Unit 2) Basic concept of organic chemistry
GBSN - Biochemistry (Unit 2) Basic concept of organic chemistry GBSN - Biochemistry (Unit 2) Basic concept of organic chemistry
GBSN - Biochemistry (Unit 2) Basic concept of organic chemistry Areesha Ahmad
 
FS P2 COMBO MSTA LAST PUSH past exam papers.
FS P2 COMBO MSTA LAST PUSH past exam papers.FS P2 COMBO MSTA LAST PUSH past exam papers.
FS P2 COMBO MSTA LAST PUSH past exam papers.takadzanijustinmaime
 
(May 9, 2024) Enhanced Ultrafast Vector Flow Imaging (VFI) Using Multi-Angle ...
(May 9, 2024) Enhanced Ultrafast Vector Flow Imaging (VFI) Using Multi-Angle ...(May 9, 2024) Enhanced Ultrafast Vector Flow Imaging (VFI) Using Multi-Angle ...
(May 9, 2024) Enhanced Ultrafast Vector Flow Imaging (VFI) Using Multi-Angle ...Scintica Instrumentation
 

Recently uploaded (20)

Genome Projects : Human, Rice,Wheat,E coli and Arabidopsis.
Genome Projects : Human, Rice,Wheat,E coli and Arabidopsis.Genome Projects : Human, Rice,Wheat,E coli and Arabidopsis.
Genome Projects : Human, Rice,Wheat,E coli and Arabidopsis.
 
TransientOffsetin14CAftertheCarringtonEventRecordedbyPolarTreeRings
TransientOffsetin14CAftertheCarringtonEventRecordedbyPolarTreeRingsTransientOffsetin14CAftertheCarringtonEventRecordedbyPolarTreeRings
TransientOffsetin14CAftertheCarringtonEventRecordedbyPolarTreeRings
 
Pteris : features, anatomy, morphology and lifecycle
Pteris : features, anatomy, morphology and lifecyclePteris : features, anatomy, morphology and lifecycle
Pteris : features, anatomy, morphology and lifecycle
 
Molecular phylogeny, molecular clock hypothesis, molecular evolution, kimuras...
Molecular phylogeny, molecular clock hypothesis, molecular evolution, kimuras...Molecular phylogeny, molecular clock hypothesis, molecular evolution, kimuras...
Molecular phylogeny, molecular clock hypothesis, molecular evolution, kimuras...
 
Phenolics: types, biosynthesis and functions.
Phenolics: types, biosynthesis and functions.Phenolics: types, biosynthesis and functions.
Phenolics: types, biosynthesis and functions.
 
development of diagnostic enzyme assay to detect leuser virus
development of diagnostic enzyme assay to detect leuser virusdevelopment of diagnostic enzyme assay to detect leuser virus
development of diagnostic enzyme assay to detect leuser virus
 
Taphonomy and Quality of the Fossil Record
Taphonomy and Quality of the  Fossil RecordTaphonomy and Quality of the  Fossil Record
Taphonomy and Quality of the Fossil Record
 
Human genetics..........................pptx
Human genetics..........................pptxHuman genetics..........................pptx
Human genetics..........................pptx
 
From data lakes to actionable data (adventures in data curation)

  • 1. From data lakes to actionable data (adventures in data curation). Andrea Splendiani, PhD. BioData, Basel, November 29th, 2018. NIBR Informatics
  • 2. NIBR Informatics, TMS. Public use. Content: From data to knowledge: the What and Why of data curation (and data lakes); What we do; How we think; Perspectives.
  • 3. What & Why: data lakes. Noise vs actionable data. The data lake paradigm: 1. Collect “all” data; 2. Index it, make it searchable; 3. …; 4. Analyze it; 5. …; 6. Generate value.
  • 4. What & Why: an example. A data item (sample annotation): it is published; it is ingested in a larger repository; some extraction/normalization is done; the data item is now in a larger context (it can be queried with other data).
  • 5. What & Why: an example. Sample annotation: structured and unstructured information.
  • 6. What & Why: an example. Information is incorporated in a repository. Some normalization/mapping of information is done, e.g.: Adult -> EFO:0001272; Ethanol -> CHEBI:16236. Structured representation.
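The normalization step on this slide can be sketched as a lookup from free-text values to ontology identifiers. This is a minimal illustration, not the pipeline used by the authors: `TERM_MAP` and `normalize` are hypothetical names, and the only mappings included are the two given on the slide (the identifier for adult is written as EFO:0001272, following the table on slide 8).

```python
from typing import Optional

# Hypothetical lookup table; it holds only the two mappings shown on the slide.
TERM_MAP = {
    "adult": "EFO:0001272",
    "ethanol": "CHEBI:16236",
}


def normalize(value: str) -> Optional[str]:
    """Return the ontology identifier for a free-text value, or None if unmapped."""
    return TERM_MAP.get(value.strip().lower())


# Values taken from the sample annotation on slide 8.
annotation = {
    "development stage": "Adult",
    "storage conditions": "Ethanol",
    "strain": "wild caught",
}

# Unmapped values stay visible as None instead of being silently dropped.
normalized = {field: normalize(value) for field, value in annotation.items()}
print(normalized)
```

Returning `None` for unmapped values keeps the gaps explicit, so a curator can review them rather than guess.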
  • 7. What & Why: an example. An ideal lake: information is put in a larger context and can be queried across different datasets, e.g.: all samples treated with alcohol (CHEBI:16236); differentially expressed genes for samples treated with alcohol.
  • 8. What & Why: an example. All genes affected by alcohol, a close look at results. Sample ID: SAMN03105804. Description: Model organism or animal sample from Hoplostethus atlanticus. Biological characteristics (structure description):
      Property | Value | Ontology (annotation from EBI)
      biomaterial provider | Peter Ritchie (Victoria University of Wellington) | EFO_0000001 (experimental factor)
      development stage | adult | EFO_0001272
      latitude and longitude | 46.50 S 166.00 E | EFO_0000001
      organism part | Muscle | UBERON_0001015
      strain | wild caught | EFO_0000001
      geographic location | New Zealand: Puysegur | EFO_0000001
      storage conditions | Ethanol | CHEBI_16236
      sample code | OR00579 | EFO_0000001
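One reading of this slide is that a cross-dataset query for CHEBI_16236 would also return this sample, although ethanol is recorded here as a storage condition rather than a treatment. The sketch below illustrates that pitfall with three rows from the table above. The tuple layout, the function names `naive_hits` and `property_aware_hits`, and the property names in `TREATMENT_PROPERTIES` are assumptions made for illustration only.

```python
# Rows from the slide's table: (sample_id, property, value, ontology_term)
ROWS = [
    ("SAMN03105804", "development stage", "adult", "EFO_0001272"),
    ("SAMN03105804", "organism part", "Muscle", "UBERON_0001015"),
    ("SAMN03105804", "storage conditions", "Ethanol", "CHEBI_16236"),
]


def naive_hits(rows, term):
    """Samples annotated with the term anywhere, regardless of the property."""
    return {sample for sample, _prop, _value, onto in rows if onto == term}


def property_aware_hits(rows, term, properties):
    """Samples annotated with the term under one of the given properties."""
    return {
        sample
        for sample, prop, _value, onto in rows
        if onto == term and prop in properties
    }


# Hypothetical property names that would denote an actual treatment.
TREATMENT_PROPERTIES = {"treatment", "compound"}

# The naive query matches the fish sample stored in ethanol.
print(naive_hits(ROWS, "CHEBI_16236"))
# Restricting the match to treatment-like properties excludes it.
print(property_aware_hits(ROWS, "CHEBI_16236", TREATMENT_PROPERTIES))
```

The second query only helps if the property names themselves have been harmonized, which is the dependency on curation quality that the talk goes on to discuss.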
  • 9. What & Why: an example. The value of results of queries across large data assets depends on the quality of the data harmonization. FAIR. In the previous examples, errors would not have impacted the overall results. But as we lose track of details, can we know how errors propagate?
  • 10. Content: From data to knowledge: the What and Why of data curation (and data lakes); What we do; How we think; Perspectives.
  • 11. What we do. Proactive and reactive approaches to data curation. Reactive: cleanse (meta)data already produced. Proactive: influence the production of (meta)data.
  • 12. Data Curation Framework. Can we “augment” a curator with NLP that scales? Can we make human processes reproducible? Why: formalize curation processes (efficiency, reproducibility). What: a rule-based environment to design “curation protocols”, embedding atomic operations such as NLP-based ontology mapping, text extraction, computations… How: build-by-example approach. Who: a “power user” who has a stake in the standard definition process.
  • 13. Data Curation Framework (the theory behind). Framing the data curation process: multiple dimensions explicit. Example: extract from NCBI GEO GSM701607 (only a subset of fields from the previous slide are considered). Validation state (confidence): Valid, Valid, Valid. Curation goal (the need): Required, Required, Required, Required, Required. Semantic type¹ (meaning): Identifier about Sample ID² about Organism Name about Organism Name about Gender Identifier about Gender Description about Age Age Unit about Age. Field name (the “location” in the source): ID, taxID, Organism, Gender, age. Value: GSM701607, 10090, Mus Musculus, 6 weeks old. ¹ All semantic types are expressed via an ontology (here presented as a simplified definition). ² Identifiers also require a domain specification.
  • 14. Data Curation Framework (the theory behind): abstract rules, operators. Compute missing identifier: if (E.X.type = "Identifier" and E.X.Goal = "Required" and E.X.Value = "" and exists (E.Y: E.Y.type.about = E.X.type.about and E.Y.type = "Description" and E.Y.Value != "")) then E.X.Value = extract(isAbout(E.Y.type), E.Y.Value)
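The abstract rule on this slide can be read as executable logic. The following is a sketch under stated assumptions, not the framework's implementation: the `Element` class, the `KNOWN_IDS` table and the toy `extract` function are hypothetical stand-ins (on the slide, `extract` would be an operator such as NLP-based ontology mapping). The organism name and taxonomy identifier come from the GSM701607 example on slide 13.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Element:
    kind: str        # semantic type, e.g. "Identifier" or "Description"
    about: str       # what the element is about, e.g. "Organism"
    goal: str        # curation goal, e.g. "Required"
    value: str = ""  # empty string means "missing"


# Hypothetical stand-in for the extract() operator: a tiny per-subject lookup.
KNOWN_IDS = {"Organism": {"mus musculus": "10090"}}


def extract(about: str, text: str) -> str:
    """Toy extractor: map a description to an identifier, or "" if unknown."""
    return KNOWN_IDS.get(about, {}).get(text.strip().lower(), "")


def compute_missing_identifier(elements: List[Element]) -> None:
    """Fill required, empty identifiers from a non-empty description
    about the same subject, as stated by the rule on the slide."""
    for x in elements:
        if x.kind == "Identifier" and x.goal == "Required" and x.value == "":
            for y in elements:
                if y.about == x.about and y.kind == "Description" and y.value != "":
                    x.value = extract(y.about, y.value)
                    break


record = [
    Element("Identifier", "Organism", "Required", ""),
    Element("Description", "Organism", "Required", "Mus Musculus"),
]
compute_missing_identifier(record)
print(record[0].value)
```

Keeping the rule as a plain function over typed elements is one way to make it something that can be saved, shared and re-run on other records.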
  • 15. Data Curation Framework (from theory to practice): rules to set curation tasks. This curation rule can be saved, shared, and executed…
  • 16. What we do. Proactive and reactive approaches to data curation. Reactive: cleanse (meta)data already produced. Proactive: influence the production of (meta)data.
  • 17. Templates: end user interaction (sampleAnnot-template). Why: propagate data standards. What: a simple Excel-like template (can be shared via a URL); a central system to serve and process templates. How: rules behind the template allow normalization; a central repository can capture variations. Who: power users design a template, end users use it (and change it).
  • 18. Content: From data to knowledge: the What and Why of data curation (and data lakes); What we do; How we think; Perspectives.
  • 19. Sample schema mapping (the basic problem). Standardized list of fields: Sample Source Species; Sample Source Anatomical par; Sample Source Sex; Sample Storage Conditions; …
  • 20. Can we combine lexical similarities and data distributions for better predictions? Assumption: within the same study, properties with the same name have the same meaning. Tamr/ML/interactive training.
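The question on this slide can be made concrete with a small scoring sketch. This is not how Tamr works and not the authors' model. It only shows one simple way to blend a lexical score on field names with an overlap score on field values. The function names, the equal weighting and the example values are all assumptions; the target field names are taken from slide 19.

```python
from difflib import SequenceMatcher


def lexical_similarity(a: str, b: str) -> float:
    """Similarity of two field names, between 0 and 1."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def value_overlap(a_values, b_values) -> float:
    """Jaccard overlap of the distinct, case-folded values of two fields."""
    a = {v.lower() for v in a_values}
    b = {v.lower() for v in b_values}
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def combined_score(name_a, values_a, name_b, values_b, weight=0.5):
    """Weighted blend of name similarity and value overlap (weight is arbitrary)."""
    return (weight * lexical_similarity(name_a, name_b)
            + (1 - weight) * value_overlap(values_a, values_b))


# Illustrative source field and candidate target fields with made-up values.
source_name, source_values = "gender", ["male", "female", "male"]
candidates = {
    "Sample Source Sex": ["Male", "Female"],
    "Sample Source Species": ["Mus musculus", "Homo sapiens"],
}

best = max(
    candidates,
    key=lambda name: combined_score(source_name, source_values, name, candidates[name]),
)
print(best)
```

Here the names "gender" and "Sample Source Sex" share little text, so the value distribution carries the match. That is the kind of case where lexical similarity alone would be expected to struggle.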
  • 21. Can we combine lexical similarities and data distributions for better predictions?
  • 22. Experimental process. Hypothesis: what can we leverage? Problem framing. Execution (training/learning in Tamr). Data review. Assessment of results (sampling).
  • 27. Ideas. Properties are not independent: can we use co-occurrence? Two species detected. Rule: need to specify if transplant or not. Can we use context for better mappings? Source? Submitter?
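The co-occurrence rule mentioned on this slide (two species detected, so the annotation needs to say whether the sample is a transplant) could be checked along the following lines. This is a hedged sketch: the function name `check_species_rule` and the field names `species` and `transplant` are hypothetical and do not come from the talk.

```python
def check_species_rule(sample: dict) -> list:
    """Return curation issues for one sample annotation.

    Rule from the slide: if two species are detected, the annotation
    needs to specify whether the sample is a transplant or not.
    """
    issues = []
    species = {s.strip().lower() for s in sample.get("species", []) if s.strip()}
    if len(species) >= 2 and "transplant" not in sample:
        issues.append("two species detected: specify whether the sample is a transplant")
    return issues


# Two species and no transplant field: the rule flags the annotation.
print(check_species_rule({"species": ["Homo sapiens", "Mus musculus"]}))
# Same species, transplant status given: nothing to flag.
print(check_species_rule({"species": ["Homo sapiens", "Mus musculus"], "transplant": True}))
```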
  • 28. Content: From data to knowledge: the What and Why of data curation (and data lakes); What we do; How we think; Perspectives.
  • 29. Reflection points. What is the value of “curation”? What model for curation? What metrics? What measurements?
  • 30. What is the value of data curation? A devil’s advocate perspective: is data re-use really valuable? 80% of analyst time is spent discovering and preparing data [1]. Why are there no solutions? The major proportion of published data is irreproducible [3]. Lost know-how about the data. Cost of producing data (e.g.: genomics) is exponentially decreasing: what is the value of past data? [2] Is it at all possible to have homogeneous data? Do we have homogeneous knowledge?
  • 31. What model for curation? Curation at source: can we anticipate all use cases? Curation at ingestion: hard coding a use case. Curation on demand? At all feasible?
  • 32. What metrics? Can we quantify how much data is “curated enough”? Can we quantify the value of data curation? FAIRness metrics, measures of quality: can we assess how data fits a purpose? (If you had to invest X money in N datasets, which criteria would you use to choose?) http://bit.ly/valueOfData
  • 33. What measurements? Data (and its context) evolve. Whichever measure for “value” we choose, it will change in time. How do we monitor such “value”? http://yummydata.org
  • 34. Conclusions/recap. 1. The more data we have in data lakes, the more we need to think about how to relate data together, especially if data is observational and coming from different sources. 2. We can implement both reactive and proactive approaches to normalize data. 3. Is curation meta-“data science”? 4. How can we quantify the value of curation?
  • 35. Acknowledgments: Daniel Cronenberger (SW Engineering); Frederic Sutter (SW Engineering); Dorothy Reilly (Data curation); Jean Marc Von-Allmen (Data curation); Anosha Siripala (Data curation); Joseph Kunkel (Data science); Martin Zablocki (Data science, Trivadis); Ted Snyder (Data science, Tamr).
  • 37. References and picture credits.
      References:
      – [1] https://hbr.org/2017/05/whats-your-data-strategy
      – [2] https://www.collaborativedrug.com/provocative-thoughts-from-chris-lipinski/?utm_source=hs_email&utm_medium=email&utm_content=67107351&_hsenc=p2ANqtz-8_ZZ-AFSiGUcDnAIumoS6GgXnCeZDA55mY2WwDl9XLuZUchRSG53bFpfqJNgtp3CsXRG2uj62yG_L6PKvcc8o-Q8MwLdHjHA_zxfYtzQD7iERrbO8&_hsmi=67107352#older-data
      – [3] C.G. Begley, L.M. Ellis, Drug development: raise standards for preclinical cancer research, Nature 483 (7391) (2012) 531–533.
      Picture credits:
      – https://www.pexels.com/photo/full-frame-shot-of-abstract-pattern-247719/
      – https://www.semanticscholar.org/paper/The-EBI-RDF-platform%3A-linked-open-data-for-the-life-Jupp-Malone/6516a4a5885847438ba2ec7f7f32000c50389a04
      – https://www.wamc.org/post/epa-fund-clean-water-project-hopewell-junction
      – https://upload.wikimedia.org/wikipedia/commons/d/d8/Inle_Lake_%28Myanmar%29.jpg
      – https://www.ebi.ac.uk/ols/ontologies/uberon/terms/graph?iri=http://purl.obolibrary.org/obo/UBERON_0002107
      – https://en.wikipedia.org/wiki/The_Devil%27s_Advocate_(1997_film)#/media/File:Faust_and_Mephisto.png
      – https://www.researchgate.net/figure/Part-1-of-the-Conceptual-Data-Model-that-stores-the-Learning-Objects-in-Creator_fig1_277477099
      – https://commons.wikimedia.org/wiki/File:Etl-process.svg