AI for Data Curation
Yes, can we?
Andrea Splendiani, AD, Information Systems
London
September 28, 2017
NIBR Informatics
Business or Operating Unit/Franchise or Department
Agenda
1. Focus: metadata and reference data
2. Knowledge Engineering and AI
3. Data curation: a use case for AI?
4. Ideas and experiences
5. Conclusions
Public2
What we do, in context
Some considerations at 10,000 ft
Holistic view on a process (1,000 ft)
Details
Reflections at 10,000 ft
Focus: metadata and reference data
1. What:
– Annotation of datasets
– Standards
– Ontologies
– Reference information
2. Why:
– Support analysis
– Support search and query answering
– Support extraction
– Building knowledge networks / information discovery and inference
3. Where
– Typically in research
Knowledge Engineering and AI
Can Artificial Intelligence solve biology? (a stopper)
• 10 years ago: AI approaches to Systems Biology
• Ontology-based knowledge bases (Semantic Web)
• ANN/fuzzy systems even older
Knowledge Engineering and AI
Can Artificial Intelligence solve biology? (taken seriously)
• Now: AI and ML are at the height of the hype
• Interest in the life sciences industries
Knowledge Engineering and AI
• What helped the resurgence of ML?
– Massive data available
– Massive computational power available
– Few technical improvements
– Success stories (Deep learning)
• Do these also apply to ontology- and Semantic Web-based systems?
– UniProt: 5.7B triples in 2009, 30+B triples in 2017
– EBI RDF Platform (2015)
– Wikidata (2014?)
Source: https://tools.wmflabs.org/wikidata-todo/stats.php
Knowledge Engineering and AI
• The way information is represented has implications for what is built on it (e.g., analytics, data mining)
– Networks: are parallel paths to be read as AND or as OR?
– Annotations: explicit mention of negative information
Knowledge Engineering and AI
• Metadata is important in a data-centric world (and in at least some ML applications)
• Knowledge representation matters, beyond metadata (examples: AND/OR in pathways, NOT in annotations…)
• We start to have large, distributed knowledge-bases
– Is there a role for AI systems based on logic/KR?
– Can we combine symbolic and sub-symbolic reasoning ?
– Is this already happening ?
Data curation
• Annotation
• Metadata
• Standards
• Model
• Literature
• Databases
• …
Source BioCuration 2017 Abstracts via wordscloud.com
An example: public data curation
https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM701607
https://www.ebi.ac.uk/rdf/services/describe?uri=http%3A%2F%2Frdf.ebi.ac.uk%2Fresource%2Fbiosamples%2Fsample%2FSAMEA1189935
An example: public data curation
(data view)
https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM701607
Property | Value | Ontology | Bio-characteristic?
Sample_source_name | WT6 biological rep 1, Affy processing batch 2 | EFO_0000001 |
Organism | Mus musculus | EFO_0000001, NCBITaxon_10090 |
strain | 129S6/Sv/Ev | EFO_0000001 | Bio
genotype | wild type | EFO_0000001, EFO_0005168 | Bio
Sex | male | EFO_0000001, EFO_0001266, PATO_0000384 |
age | 6 weeks old | EFO_0000001 | Bio
https://www.ebi.ac.uk/rdf/services/describe?uri=http%3A%2F%2Frdf.ebi.ac.uk%2Fresource%2Fbiosamples%2Fsample%2FSAMEA1189935
An example: public data curation
https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM701607
https://www.ebi.ac.uk/rdf/services/describe?uri=http%3A%2F%2Frdf.ebi.ac.uk%2Fresource%2Fbiosamples%2Fsample%2FSAMEA1189935
Supports:
• Aggregation
• Analysis
• Search
• Link discovery
• “Machine learning”
Can we use AI for Data Curation?
Why?
– Data curation is an intellectually intensive and time-consuming activity
– Given the increasing role and amount of data, curation risks becoming a bottleneck
Example of exponential growth in data
AI for data curation: characteristics and constraints
• Can we automate data curation?
• Difficult:
– Missing data
– Discretion (e.g., level of granularity)
• Looks reasonable:
– Repetition
– Consistency
– Data/distance evaluations (clustering/attractors)
• We need to combine human aspects and automatable aspects
AI for data curation
Framing the problem: what
• Should this value be normalized?
• Meaning. E.g., is “age” the same as “years”?
• Confidence: is this information true?
• The need. E.g., is this information required, and when?
• Is this a valid identifier?
Example: extract from NCBI GEO GSM701607
AI for data curation
Framing the problem: how
We consider curation activities as functions in a “curation space” that is exemplified via a “curation record”.
Validation state (Confidence): Valid | Valid | Valid
Curation goal (The need): Required | Required | Required | Required | Required
Semantic type [1] (Meaning): Identifier about Sample | ID [2] about Organism | Name about Organism | Name about Gender | Identifier about Gender | Description about Age | Age Unit about Age
Field name (the “location” in the source): ID | taxID | Organism | Gender | age
Value: GSM701607 | 10090 | Mus musculus | 6 weeks old
[1] All semantic types are expressed via an ontology (here presented as simplified definitions)
[2] Identifiers also require a domain specification
Example: extract from NCBI GEO GSM701607 (only a subset of the fields from the previous slide is considered)
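The curation record above can be sketched as a small data structure. This is a minimal Python sketch, not the implementation behind these slides: the names `Cell`, `CurationRecord`, `semantic_type`, `goal` and `validation` are assumptions chosen to mirror the rows of the record, and the pairing of goals and validation states with individual cells is illustrative only.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Cell:
    """One column of the curation record (all attribute names are hypothetical)."""
    field_name: Optional[str]          # the "location" in the source, e.g. "taxID"
    semantic_type: str                 # the meaning, e.g. "Identifier about Organism"
    value: Optional[str] = None        # the raw or curated value
    goal: Optional[str] = None         # the need, e.g. "Required"
    validation: Optional[str] = None   # the confidence, e.g. "Valid"


@dataclass
class CurationRecord:
    cells: list[Cell] = field(default_factory=list)

    def missing_required(self) -> list[Cell]:
        """Cells that the curation goal requires but that have no value yet."""
        return [c for c in self.cells if c.goal == "Required" and not c.value]


# Illustrative record for GSM701607; goals and validation states are assumed.
record = CurationRecord([
    Cell("ID", "Identifier about Sample", "GSM701607", "Required", "Valid"),
    Cell("taxID", "Identifier about Organism", "10090", "Required", "Valid"),
    Cell("Organism", "Name about Organism", "Mus musculus", "Required", "Valid"),
    Cell("Gender", "Name about Gender", None, "Required"),
    Cell("age", "Description about Age", "6 weeks old"),
])

print([c.semantic_type for c in record.missing_required()])
```

Keeping meaning, need and confidence next to each value lets a curation step be written as a function that reads and updates one of these rows.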
AI and data curation
Using a record to modularize curation processes
• Different classes of operations
– Schema mapping (assigning a type)
– Standard setting (assigning a goal)
– Validation (setting a validation value)
Validation state: Valid | Valid | Valid
Curation goal: Required | Required | Required | Required
Semantic type: Identifier about Sample | Name about Gender | Identifier about Gender | Description about Age | Age Unit about Age
Field name: ID | Gender | age
Value: GSM701607 | 6 weeks old

Validation state: Valid | Valid
Curation goal: Required
Semantic type: Identifier about Sample | Name about Organism | Name about Gender
Field name: ID | Organism | Gender
Value: GSM701607 | Mus musculus

Validation state: Valid | Valid | Valid
Curation goal: Required | Required | Required | Required
Semantic type: Identifier about Sample | Name about Gender | Identifier about Gender | Description about Age | Age Unit about Age
Field name: ID | Gender | age
Value: GSM701607 | 6 weeks old
AI and data curation
Using a record to modularize curation processes
• Different classes of operations
– Normalization (filling a column)
– Enrichment (adding a column)
Normalization, before:
Validation state: Valid | Valid
Curation goal: Required | Required
Semantic type: Identifier about Sample | Name about Gender | Identifier about Gender
Field name: ID | Gender
Value: GSM701607 | male

Normalization, after:
Validation state: Valid | Valid
Curation goal: Required | Required
Semantic type: Identifier about Sample | Name about Gender | Identifier about Gender
Field name: ID | Gender
Value: GSM701607 | male | PATO:0000384

Enrichment, before:
Validation state: Valid | Valid
Curation goal: Required | Required
Semantic type: Identifier about Sample | ID about Organism | Name about Organism | Description about Age
Field name: ID | taxID | Organism | age
Value: GSM701607 | 10090 | Mus musculus | 6 weeks old

Enrichment, after:
Validation state: Valid | Valid
Curation goal: Required | Required
Semantic type: Identifier about Sample | ID about Organism | Name about Organism | Description about Age | Identifier about Sample
Field name: ID | taxID | Organism | age | EBI ref.
Value: GSM701607 | 10090 | Mus musculus | 6 weeks old | SAMEA1189935
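Normalization and enrichment can be read as two small functions over a record: one fills an existing empty cell, the other appends a new one. The sketch below is only an illustration under stated assumptions: a record is a list of plain dicts, the keys `field`, `type` and `value` are invented for the example, and the two lookup tables stand in for whatever ontology or cross-reference service a real pipeline would query.

```python
# A record is modelled as a list of cells; each cell is a plain dict.
# Keys ("field", "type", "value") and both lookup tables are assumptions.

GENDER_TERMS = {"male": "PATO:0000384"}           # term shown on the slide
EBI_SAMPLE_REFS = {"GSM701607": "SAMEA1189935"}   # pairing shown on the slide


def find(record, semantic_type):
    """Return the first cell with the given semantic type, or None."""
    return next((c for c in record if c["type"] == semantic_type), None)


def normalize_gender(record):
    """Normalization: fill the existing, empty 'Identifier about Gender' cell."""
    name = find(record, "Name about Gender")
    ident = find(record, "Identifier about Gender")
    if name and name["value"] and ident and not ident["value"]:
        ident["value"] = GENDER_TERMS.get(name["value"].lower())
    return record


def enrich_with_ebi_ref(record):
    """Enrichment: append a new cell holding a cross-reference for the sample."""
    sample = find(record, "Identifier about Sample")
    ref = EBI_SAMPLE_REFS.get(sample["value"]) if sample else None
    if ref:
        record.append({"field": "EBI ref.", "type": "Identifier about Sample", "value": ref})
    return record


record = [
    {"field": "ID", "type": "Identifier about Sample", "value": "GSM701607"},
    {"field": "Gender", "type": "Name about Gender", "value": "male"},
    {"field": "Gender", "type": "Identifier about Gender", "value": None},
]
enrich_with_ebi_ref(normalize_gender(record))
for cell in record:
    print(cell["field"], "|", cell["type"], "|", cell["value"])
```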
Big picture
Quantity/quality trade-off
[Figure: quality/validity versus time/cost]
• Is the optimal trade-off the same for all data?
• Can this change for the same data over time and across use cases?
• Can we embed a “cost function” in curation processes?
Big picture
(Meta)data evolution, immutability

1. Initial condition: organism name present, missing ID
Entity: 1234 | Information: V1 | Meta-Info: V1
Validation state: Valid | Valid
Curation goal: Required | Required
Semantic type: Identifier about Sample | Name about Gender | Identifier about Gender
Field name: ID | Gender
Value: GSM701607 | male

2. Initial condition: identifier extracted, not verified
Entity: 1234 | Information: V2 | Meta-Info: V2
Validation state: Valid | Valid
Curation goal: Required | Required
Semantic type: Identifier about Sample | Name about Gender | Identifier about Gender
Field name: ID | Gender
Value: GSM701607 | male | PATO:0000384

3. Identifier extracted and verified
Entity: 1234 | Information: V2 | Meta-Info: V3
Validation state: Valid | Valid | Valid
Curation goal: Required | Required
Semantic type: Identifier about Sample | Name about Gender | Identifier about Gender
Field name: ID | Gender
Value: GSM701607 | male | PATO:0000384
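The three states above suggest treating each curation step as producing a new immutable snapshot, with separate version counters for the information and the meta-information. The following is a minimal sketch of that idea, not the system described in the talk; the class `Version`, its attributes and the "Not verified" state are assumptions made for illustration.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Version:
    """A snapshot of one entity (names are hypothetical).

    frozen=True blocks attribute reassignment; the dicts are always copied,
    never mutated in place, so earlier snapshots stay unchanged.
    """
    entity: str
    information: dict      # curated values, keyed by semantic type
    meta: dict             # validation states, keyed by semantic type
    info_version: int = 1
    meta_version: int = 1


def update_information(v, key, value):
    """New snapshot with changed information; its meta-information changes too."""
    info = {**v.information, key: value}
    meta = {**v.meta, key: "Not verified"}
    return replace(v, information=info, meta=meta,
                   info_version=v.info_version + 1,
                   meta_version=v.meta_version + 1)


def update_meta(v, key, state):
    """New snapshot in which only the meta-information changes."""
    return replace(v, meta={**v.meta, key: state}, meta_version=v.meta_version + 1)


v1 = Version("1234", {"Name about Gender": "male"}, {"Name about Gender": "Valid"})
v2 = update_information(v1, "Identifier about Gender", "PATO:0000384")
v3 = update_meta(v2, "Identifier about Gender", "Valid")

for v in (v1, v2, v3):
    print(v.entity, f"Information: V{v.info_version}", f"Meta-Info: V{v.meta_version}")
```

Verifying an identifier changes what is known about the data but not the data itself, which is why the last step bumps only the meta-information version.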
Ideas and experiences
Some details
Data and metadata transformations (deterministic actions + extractors)
• Curation processes can be expressed (by curators) in terms of rules
• Rules embed “atomic operations”, e.g., extractors, transformations, …
• Simple rules go a very long way…
<ruleConfig method="Extract">
  <param name="setType" value="UNIT"/>
  <param name="setAmbiguous" value="true"/>
  <param name="setFullMatch" value="false"/>
  <param name="setResultInJson" value="false"/>
  <param name="setSimpleJson" value="false"/>
  <param name="setText">
    <ruleConfig method="GetCell">
      <param name="setAttr" value="AgeDescription"/>
      <param name="setBase" value="XCF_1"/>
    </ruleConfig>
  </param>
</ruleConfig>
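Read from the inside out, the rule above fetches the `AgeDescription` cell from `XCF_1` and passes its text to an extractor of type `UNIT`. The Python sketch below mimics that composition. It is an illustration only: the rule engine itself is not shown in the slides, so `get_cell`, `extract_units`, the unit vocabulary, and the interpretation of `setFullMatch` and `setAmbiguous` are all assumptions.

```python
import re

# Hypothetical vocabulary of age units; the real extractor's dictionary is not shown.
AGE_UNITS = {"day": "day", "days": "day", "week": "week", "weeks": "week",
             "month": "month", "months": "month", "year": "year", "years": "year"}


def get_cell(tables, base, attr):
    """Stand-in for the GetCell rule: read attribute `attr` from table `base`."""
    return tables[base][attr]


def extract_units(text, full_match=False, ambiguous=True):
    """Stand-in for the Extract rule with setType=UNIT (semantics assumed).

    full_match=False lets the unit appear anywhere in the text;
    ambiguous=True returns every candidate instead of rejecting multiple hits.
    """
    lowered = text.lower()
    if full_match:
        hits = [AGE_UNITS[lowered]] if lowered in AGE_UNITS else []
    else:
        hits = [AGE_UNITS[w] for w in re.findall(r"[a-z]+", lowered) if w in AGE_UNITS]
    unique = list(dict.fromkeys(hits))   # de-duplicate while keeping order
    if len(unique) > 1 and not ambiguous:
        return []
    return unique


tables = {"XCF_1": {"AgeDescription": "6 weeks old"}}
print(extract_units(get_cell(tables, "XCF_1", "AgeDescription")))
```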
Abstract rules and meta-rules
• Rules can rely on abstraction/inference for greater generality
• They can also be used to produce meta-information
Example rules (pseudo-syntax)
• Compute missing identifier: if (E.X.type = “Identifier” and E.X.goal = “Required” and E.X.value = “” and exists (E.Y: E.Y.type.about = E.X.type.about and E.Y.type = “Description” and E.Y.value != “”)) then E.X.value = extract(isAbout(E.Y.type), E.Y.value)
• Set a curation goal: if subClassOf(E.OrganismID.value, NCBI_40674) then E.GenderID.goal = “Required” (NCBI taxon 40674 is Mammalia)
• Assert validity on condition: if one identifier is unambiguously extracted from a species name, then validation state = Valid
Validation state: Valid | Valid
Curation goal: Required | Required | Required
Semantic type: Identifier about Sample | ID about Organism | Name about Organism | Name about Gender | Identifier about Gender
Field name (the “location” in the source): ID | taxID | Organism | Gender
Value: GSM701607 | 10090 | Mus musculus
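The three example rules can be sketched as plain functions over a record. This is a simplified illustration, not the rule engine from the talk: the first rule is specialised to the organism name and taxon ID instead of quantifying over all cells, and the dict layout, `SPECIES_TO_TAXID` and `TAXON_PARENTS` are hypothetical stand-ins for a taxonomy service.

```python
# Hypothetical lookup tables; a real system would query a taxonomy/ontology service.
SPECIES_TO_TAXID = {"mus musculus": ["10090"]}
TAXON_PARENTS = {"10090": "40674"}   # simplified: mouse placed directly under Mammalia


def is_subclass_of(taxid, ancestor):
    """Walk the (simplified) taxonomy upwards until the ancestor or the root."""
    while taxid is not None:
        if taxid == ancestor:
            return True
        taxid = TAXON_PARENTS.get(taxid)
    return False


def compute_missing_identifier(rec):
    """Rules 1 and 3: fill a required, empty identifier from the organism name,
    and mark it valid only when exactly one candidate is found."""
    ident, name = rec["taxID"], rec["Organism"]
    if ident["goal"] == "Required" and not ident["value"] and name["value"]:
        candidates = SPECIES_TO_TAXID.get(name["value"].lower(), [])
        if len(candidates) == 1:
            ident["value"] = candidates[0]
            ident["validation"] = "Valid"


def set_gender_goal(rec):
    """Rule 2: a gender identifier becomes required for mammals."""
    taxid = rec["taxID"]["value"]
    if taxid and is_subclass_of(taxid, "40674"):
        rec["GenderID"]["goal"] = "Required"


rec = {
    "taxID": {"value": "", "goal": "Required", "validation": None},
    "Organism": {"value": "Mus musculus", "goal": "Required", "validation": "Valid"},
    "GenderID": {"value": "", "goal": None, "validation": None},
}
compute_missing_identifier(rec)
set_gender_goal(rec)
print(rec["taxID"], rec["GenderID"])
```

The order matters here: the goal-setting rule can only fire once the identifier rule has filled in the taxon ID, which is one reason rules are useful as composable units.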
“Approximate” transformations
• Some transformations cannot (easily) be expressed in terms of rules
– Complex and ad hoc relations
– Discretionary elements
• Examples:
– Entity de-duplication
  – Whether two mentions of homonymous authors refer to the same author is a complex function of an extended range of the authors’ features (where they work, contact information, subject of study, …)
– Schema mapping
  – Determining the meaning of an attribute (e.g., time) is a complex function of the values this attribute takes, as well as of other parameters (is this a duration, a time point, or an execution timestamp?)
  – Is “Sample tracking number” to be mapped to “Tracking number” or to “Identifier”?
Implementation of de-duplication and schema mapping via Tamr
• One approach that we have chosen to provide approximate schema-mapping and de-duplication functions is Tamr (tamr.com)
• Tamr is a data unification platform that combines machine learning with human expertise.
– E.g., to support schema mapping, Tamr combines several features:
  – Data distribution
  – Property names
  – Property metadata
– It learns how to compose such functions via machine learning, through an iterative process in which human experts can provide input and improve predictions
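To make the idea of combining features concrete, here is a minimal sketch that scores candidate attribute mappings from property names and value distributions. It is not Tamr's API or algorithm: the functions, the fixed weights and the sample values are all assumptions, and a learning system would fit the combination from curator feedback instead of hard-coding it.

```python
from difflib import SequenceMatcher


def name_similarity(a, b):
    """String similarity of two property names, in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def value_overlap(values_a, values_b):
    """Jaccard overlap of the distinct values observed for two properties."""
    a, b = set(values_a), set(values_b)
    return len(a & b) / len(a | b) if a | b else 0.0


def mapping_score(src, tgt, w_name=0.5, w_values=0.5):
    """Weighted combination of the two features (weights fixed for illustration)."""
    return (w_name * name_similarity(src["name"], tgt["name"])
            + w_values * value_overlap(src["values"], tgt["values"]))


def suggest(src, targets):
    """Rank candidate target attributes by descending score."""
    return sorted(((mapping_score(src, t), t["name"]) for t in targets), reverse=True)


# Invented sample values, for illustration only.
source = {"name": "Sample tracking number", "values": ["TN-001", "TN-002", "TN-003"]}
targets = [
    {"name": "Tracking number", "values": ["TN-001", "TN-002", "TN-104"]},
    {"name": "Identifier", "values": ["GSM701607", "GSM701608"]},
]
for score, name in suggest(source, targets):
    print(f"{name}: {score:.2f}")
```

A curator confirming or rejecting the top suggestion would, in a learning setup, become a training example used to adjust the weights.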
Schema-mapping (Tamr)
Users are offered a range of potential mappings, each with a confidence score. They can confirm them or suggest different mappings. New predictions are routinely provided as more input is accumulated.
User interface for curators showing potential attribute matches
Entity de-duplication (Tamr)
User interface for curators showing potential duplicates
Users are shown a set of potential duplicates, each with a confidence score. They can accept or reject these suggestions, thus providing training data and iteratively refining predictions.
Entity de-duplication (Tamr)
Details of the implementation of the deduplication process (courtesy of Tamr)
Re-introducing logic
• Can we predict (or suggest) the association between parameters and entities in a template?
– An ontology models the “real world”: entities, qualities, processes
– Parameters are annotated with axioms based on this ontology
– Inference provides multiple classifications of parameters, as well as possible/necessary associations between parameters and entities.
• Can this work?
Re-introducing logic
Extract from an ontology representing entities and qualities
Example of axiomatic mapping between a parameter and an entity and qualities ontology
Deductions for parameter ReportID:
must refer to: Report, Document, Descriptive Entity, Concrete Entity, Entity, Information Entity, Immaterial Entity
may refer to: Report, InternalReport
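One way to read the deductions for ReportID: if a parameter is asserted to refer to a class, it must refer to every superclass of that class and may refer to any of its subclasses. The sketch below reproduces the two lists with a plain graph traversal in place of a description-logic reasoner. The hierarchy in `PARENTS` is a hypothetical arrangement of the class names on the slide, not the actual ontology, and the axiom in `PARAMETER_AXIOMS` is likewise assumed.

```python
# Hypothetical fragment of a class hierarchy (child -> parents); the actual
# ontology behind the slide is not reproduced here.
PARENTS = {
    "InternalReport": ["Report"],
    "Report": ["Document"],
    "Document": ["Information Entity", "Descriptive Entity", "Concrete Entity"],
    "Information Entity": ["Immaterial Entity"],
    "Immaterial Entity": ["Entity"],
    "Descriptive Entity": ["Entity"],
    "Concrete Entity": ["Entity"],
    "Entity": [],
}


def ancestors_or_self(cls):
    """Every class that an instance of `cls` necessarily belongs to."""
    seen, stack = set(), [cls]
    while stack:
        c = stack.pop()
        if c not in seen:
            seen.add(c)
            stack.extend(PARENTS.get(c, []))
    return seen


def descendants_or_self(cls):
    """Every class that an instance of `cls` could additionally belong to."""
    return {c for c in PARENTS if cls in ancestors_or_self(c)}


# Assumed axiom: the parameter ReportID refers to some Report.
PARAMETER_AXIOMS = {"ReportID": "Report"}

target = PARAMETER_AXIOMS["ReportID"]
print("must refer to:", sorted(ancestors_or_self(target)))
print("may refer to:", sorted(descendants_or_self(target)))
```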
Exploring automatic ontology matching
• 26 submissions. Algorithms covering structural approaches, axiomatic mappings and use of background knowledge
• Phenotype track sponsored by the Pistoia Alliance Ontologies Mapping Project
• Evaluation results for Phenotype track submitted to Journal of Biomedical Semantics
http://oaei.ontologymatching.org/2016
Conclusions: on rules, standards and data ethnography
• Data curation: “AI” may help (not limited to ML)
– Formal knowledge representation is part of the goal
• The need for explanations
– We need to define (document) a process
– We have theorems for proofs: can we do without?
– Is there a role for “ML” gurus?
• The “human side” of data
– Data normalization is based on assumptions (e.g., what can be considered the same and what cannot): there is a cultural side to this.
– Would we accept an AI “editor”?
Acknowledgments
• NIBR
• Daniel Cronenberger
• Ming Fang
• Frederic Sutter
• Anosha Siripala
• Fabien Pernot
• Jean Marc von Allmen
• Martin Petracchi
• Dorothy Reilly
• Pierre Parisot
• Therese Vachon
• Tamr.com
• Pistoia Alliance Ontology Matching Project team
Thank you

 
Mammalian Pineal Body Structure and Also Functions
Mammalian Pineal Body Structure and Also FunctionsMammalian Pineal Body Structure and Also Functions
Mammalian Pineal Body Structure and Also FunctionsYOGESH DOGRA
 
Climate extremes likely to drive land mammal extinction during next supercont...
Climate extremes likely to drive land mammal extinction during next supercont...Climate extremes likely to drive land mammal extinction during next supercont...
Climate extremes likely to drive land mammal extinction during next supercont...Sérgio Sacani
 
Astronomy Update- Curiosity’s exploration of Mars _ Local Briefs _ leadertele...
Astronomy Update- Curiosity’s exploration of Mars _ Local Briefs _ leadertele...Astronomy Update- Curiosity’s exploration of Mars _ Local Briefs _ leadertele...
Astronomy Update- Curiosity’s exploration of Mars _ Local Briefs _ leadertele...NathanBaughman3
 
PRESENTATION ABOUT PRINCIPLE OF COSMATIC EVALUATION
PRESENTATION ABOUT PRINCIPLE OF COSMATIC EVALUATIONPRESENTATION ABOUT PRINCIPLE OF COSMATIC EVALUATION
PRESENTATION ABOUT PRINCIPLE OF COSMATIC EVALUATIONChetanK57
 
biotech-regenration of plants, pharmaceutical applications.pptx
biotech-regenration of plants, pharmaceutical applications.pptxbiotech-regenration of plants, pharmaceutical applications.pptx
biotech-regenration of plants, pharmaceutical applications.pptxANONYMOUS
 
Aerodynamics. flippatterncn5tm5ttnj6nmnynyppt
Aerodynamics. flippatterncn5tm5ttnj6nmnynypptAerodynamics. flippatterncn5tm5ttnj6nmnynyppt
Aerodynamics. flippatterncn5tm5ttnj6nmnynypptsreddyrahul
 
GLOBAL AND LOCAL SCENARIO OF FOOD AND NUTRITION.pptx
GLOBAL AND LOCAL SCENARIO OF FOOD AND NUTRITION.pptxGLOBAL AND LOCAL SCENARIO OF FOOD AND NUTRITION.pptx
GLOBAL AND LOCAL SCENARIO OF FOOD AND NUTRITION.pptxSultanMuhammadGhauri
 
FAIRSpectra - Towards a common data file format for SIMS images
FAIRSpectra - Towards a common data file format for SIMS imagesFAIRSpectra - Towards a common data file format for SIMS images
FAIRSpectra - Towards a common data file format for SIMS imagesAlex Henderson
 

Recently uploaded (20)

In silico drugs analogue design: novobiocin analogues.pptx
In silico drugs analogue design: novobiocin analogues.pptxIn silico drugs analogue design: novobiocin analogues.pptx
In silico drugs analogue design: novobiocin analogues.pptx
 
Pests of Green Manures_Bionomics_IPM_Dr.UPR.pdf
Pests of Green Manures_Bionomics_IPM_Dr.UPR.pdfPests of Green Manures_Bionomics_IPM_Dr.UPR.pdf
Pests of Green Manures_Bionomics_IPM_Dr.UPR.pdf
 
Hemoglobin metabolism_pathophysiology.pptx
Hemoglobin metabolism_pathophysiology.pptxHemoglobin metabolism_pathophysiology.pptx
Hemoglobin metabolism_pathophysiology.pptx
 
ESR_factors_affect-clinic significance-Pathysiology.pptx
ESR_factors_affect-clinic significance-Pathysiology.pptxESR_factors_affect-clinic significance-Pathysiology.pptx
ESR_factors_affect-clinic significance-Pathysiology.pptx
 
National Biodiversity protection initiatives and Convention on Biological Di...
National Biodiversity protection initiatives and  Convention on Biological Di...National Biodiversity protection initiatives and  Convention on Biological Di...
National Biodiversity protection initiatives and Convention on Biological Di...
 
Gliese 12 b, a temperate Earth-sized planet at 12 parsecs discovered with TES...
Gliese 12 b, a temperate Earth-sized planet at 12 parsecs discovered with TES...Gliese 12 b, a temperate Earth-sized planet at 12 parsecs discovered with TES...
Gliese 12 b, a temperate Earth-sized planet at 12 parsecs discovered with TES...
 
Cancer cell metabolism: special Reference to Lactate Pathway
Cancer cell metabolism: special Reference to Lactate PathwayCancer cell metabolism: special Reference to Lactate Pathway
Cancer cell metabolism: special Reference to Lactate Pathway
 
EY - Supply Chain Services 2018_template.pptx
EY - Supply Chain Services 2018_template.pptxEY - Supply Chain Services 2018_template.pptx
EY - Supply Chain Services 2018_template.pptx
 
SCHIZOPHRENIA Disorder/ Brain Disorder.pdf
SCHIZOPHRENIA Disorder/ Brain Disorder.pdfSCHIZOPHRENIA Disorder/ Brain Disorder.pdf
SCHIZOPHRENIA Disorder/ Brain Disorder.pdf
 
NuGOweek 2024 Ghent - programme - final version
NuGOweek 2024 Ghent - programme - final versionNuGOweek 2024 Ghent - programme - final version
NuGOweek 2024 Ghent - programme - final version
 
THYROID-PARATHYROID medical surgical nursing
THYROID-PARATHYROID medical surgical nursingTHYROID-PARATHYROID medical surgical nursing
THYROID-PARATHYROID medical surgical nursing
 
THE IMPORTANCE OF MARTIAN ATMOSPHERE SAMPLE RETURN.
THE IMPORTANCE OF MARTIAN ATMOSPHERE SAMPLE RETURN.THE IMPORTANCE OF MARTIAN ATMOSPHERE SAMPLE RETURN.
THE IMPORTANCE OF MARTIAN ATMOSPHERE SAMPLE RETURN.
 
Mammalian Pineal Body Structure and Also Functions
Mammalian Pineal Body Structure and Also FunctionsMammalian Pineal Body Structure and Also Functions
Mammalian Pineal Body Structure and Also Functions
 
Climate extremes likely to drive land mammal extinction during next supercont...
Climate extremes likely to drive land mammal extinction during next supercont...Climate extremes likely to drive land mammal extinction during next supercont...
Climate extremes likely to drive land mammal extinction during next supercont...
 
Astronomy Update- Curiosity’s exploration of Mars _ Local Briefs _ leadertele...
Astronomy Update- Curiosity’s exploration of Mars _ Local Briefs _ leadertele...Astronomy Update- Curiosity’s exploration of Mars _ Local Briefs _ leadertele...
Astronomy Update- Curiosity’s exploration of Mars _ Local Briefs _ leadertele...
 
PRESENTATION ABOUT PRINCIPLE OF COSMATIC EVALUATION
PRESENTATION ABOUT PRINCIPLE OF COSMATIC EVALUATIONPRESENTATION ABOUT PRINCIPLE OF COSMATIC EVALUATION
PRESENTATION ABOUT PRINCIPLE OF COSMATIC EVALUATION
 
biotech-regenration of plants, pharmaceutical applications.pptx
biotech-regenration of plants, pharmaceutical applications.pptxbiotech-regenration of plants, pharmaceutical applications.pptx
biotech-regenration of plants, pharmaceutical applications.pptx
 
Aerodynamics. flippatterncn5tm5ttnj6nmnynyppt
Aerodynamics. flippatterncn5tm5ttnj6nmnynypptAerodynamics. flippatterncn5tm5ttnj6nmnynyppt
Aerodynamics. flippatterncn5tm5ttnj6nmnynyppt
 
GLOBAL AND LOCAL SCENARIO OF FOOD AND NUTRITION.pptx
GLOBAL AND LOCAL SCENARIO OF FOOD AND NUTRITION.pptxGLOBAL AND LOCAL SCENARIO OF FOOD AND NUTRITION.pptx
GLOBAL AND LOCAL SCENARIO OF FOOD AND NUTRITION.pptx
 
FAIRSpectra - Towards a common data file format for SIMS images
FAIRSpectra - Towards a common data file format for SIMS imagesFAIRSpectra - Towards a common data file format for SIMS images
FAIRSpectra - Towards a common data file format for SIMS images
 

Artificial Intelligence in Data Curation

  • 1. AI for Data Curation Yes, can we? Andrea Splendiani, AD, Information Systems London September 28, 2017 NIBR Informatics
  • 2. Agenda: 1. Focus: metadata and reference data (what we do, in context). 2. Knowledge Engineering and AI (some considerations at 10000ft). 3. Data curation: a use case for AI? (holistic view on a process, 1000ft). 4. Ideas and experiences (details). 5. Conclusions (reflections at 10000ft).
  • 3. Focus: metadata and reference data. 1. What: annotation of datasets; standards; ontologies; reference information. 2. Why: support analysis; support search and query answering; support extraction; building knowledge networks / information discovery and inference. 3. Where: typically in research.
  • 4. Knowledge Engineering and AI. Can Artificial Intelligence solve biology? (a stopper). 10 years ago: AI approaches to Systems Biology; ontology-based knowledge bases (Semantic Web); ANN/fuzzy systems are even older.
  • 5. Knowledge Engineering and AI. Can Artificial Intelligence solve biology? (taken seriously). Now: AI and ML are at the height of the hype; there is interest from the Life Sciences industries.
  • 6. Knowledge Engineering and AI. What helped the resurgence of ML? Massive data available; massive computational power available; a few technical improvements; success stories (deep learning). Do these also apply to ontology/Semantic Web based systems? UniProt: 5.7B triples in 2009, 30+B triples in 2017; EBI RDF Platform (2015); Wikidata (2014?). Source: https://tools.wmflabs.org/wikidata-todo/stats.php
  • 7. Knowledge Engineering and AI. The way information is represented has implications for what is built on it (e.g. analytics, data mining). Networks: are parallel executions combined in AND or in OR? Annotations: is negative information mentioned explicitly?
  • 8. Knowledge Engineering and AI. Metadata is important in a data-centric world (and in at least part of ML applications). Knowledge representation matters beyond metadata (examples: AND/OR in pathways, NOT in annotations…). We are starting to have large, distributed knowledge bases: Is there a role for AI systems based on logic/KR? Can we combine symbolic and sub-symbolic reasoning? Is this already happening?
  • 9. Data curation (word cloud): Annotation, Metadata, Standards, Model, Literature, Databases, … Source: BioCuration 2017 abstracts, via wordscloud.com
  • 10. An example: public data curation. https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM701607 https://www.ebi.ac.uk/rdf/services/describe?uri=http%3A%2F%2Frdf.ebi.ac.uk%2Fresource%2Fbiosamples%2Fsample%2FSAMEA1189935
  • 11. An example: public data curation (data view). https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM701607 https://www.ebi.ac.uk/rdf/services/describe?uri=http%3A%2F%2Frdf.ebi.ac.uk%2Fresource%2Fbiosamples%2Fsample%2FSAMEA1189935
      Property | Value | Ontology | Bio-characteristic?
      Sample_source_name | WT6 biological rep 1, Affy processing batch 2 | EFO_0000001 |
      Organism | Mus musculus | EFO_0000001, NCBITaxon_10090 |
      strain | 129S6/Sv/Ev | EFO_0000001 | Bio
      genotype | wild type | EFO_0000001, EFO_0005168 | Bio
      Sex | male | EFO_0000001, EFO_0001266, PATO_0000384 |
      age | 6 weeks old | EFO_0000001 | Bio
  • 12. An example: public data curation (data view).
      Property | Value | Ontology | Bio-characteristic?
      Sample_source_name | WT6 biological rep 1, Affy processing batch 2 | EFO_0000001 |
      Organism | Mus musculus | EFO_0000001, NCBITaxon_10090 |
      strain | 129S6/Sv/Ev | EFO_0000001 | Bio
      genotype | wild type | EFO_0000001, EFO_0005168 | Bio
      Sex | male | EFO_0000001, EFO_0001266, PATO_0000384 |
      age | 6 weeks old | EFO_0000001 | Bio
  • 13. An example: public data curation. https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM701607 https://www.ebi.ac.uk/rdf/services/describe?uri=http%3A%2F%2Frdf.ebi.ac.uk%2Fresource%2Fbiosamples%2Fsample%2FSAMEA1189935 Supports: aggregation; analysis; search; link discovery; “machine learning”.
  • 14. Can we use AI for data curation? Why? Data curation is an intellectually intensive and time-consuming activity. Given the increasing role and amount of data, curation risks becoming a bottleneck. (Figure: example of exponential growth in data.)
  • 15. AI for data curation: characteristics and constraints. Can we automate data curation? Difficult: missing data; discretion (e.g. the level of granularity). Looks reasonable: repetition; consistency; data/distance evaluations (clustering/attractors). We need to combine the human aspects with the machine-executable aspects.
  • 16. AI for data curation. Framing the problem: what. Questions raised by an extract from NCBI GEO GSM701607: Should this value be normalized? Meaning, e.g.: is “age” the same as “years”? Confidence: is this information true? The need, e.g.: is this required information, and when? Is this a valid identifier?
  • 17. AI for data curation. Framing the problem: how. We consider curation activities as functions in a “curation space” that is exemplified via a “curation record”. Example, extract from NCBI GEO GSM701607 (only a subset of fields from the previous slide is considered):
      Validation state (Confidence): Valid; Valid; Valid
      Curation goal (The need): Required; Required; Required; Required; Required
      Semantic type [1] (Meaning): Identifier about Sample; ID [2] about Organism; Name about Organism; Name about Gender; Identifier about Gender; Description about Age; Age Unit about Age
      Field Name (the “location” in the source): ID; taxID; Organism; Gender; age
      Value: GSM701607; 10090; Mus Musculus; 6 weeks old
      [1] All semantic types are expressed via an ontology (here presented as a simplified definition).
      [2] Identifiers also require a domain specification.
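The curation record above can be pictured as a small data structure. The following is a minimal sketch, not the authors' implementation: the class name `CurationCell`, the helper `open_tasks`, and the way goals and validation states are assigned to individual columns are all assumptions made for illustration (the slide does not make the column alignment recoverable).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CurationCell:
    """One column of a curation record (hypothetical structure)."""
    semantic_type: str                 # meaning, e.g. "Identifier about Sample"
    field_name: Optional[str]          # location in the source, None if absent
    value: Optional[str]               # source or curated value, None if missing
    goal: Optional[str] = None         # the need, e.g. "Required"
    validation: Optional[str] = None   # confidence, e.g. "Valid"


# Illustrative record; which cells are Required/Valid is an assumption.
record = [
    CurationCell("Identifier about Sample", "ID", "GSM701607", "Required", "Valid"),
    CurationCell("Name about Organism", "Organism", "Mus Musculus"),
    CurationCell("Identifier about Gender", None, None, "Required"),
    CurationCell("Description about Age", "age", "6 weeks old"),
]


def open_tasks(cells):
    """Return the semantic types that are required but still have no value."""
    return [c.semantic_type for c in cells if c.goal == "Required" and not c.value]


pending = open_tasks(record)
```

Here `pending` lists the cells a curator (or an automated step) still has to fill, which is one way to read the "curation goal" row.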
  • 18. AI and data curation. Using a record to modularize curation processes. Different classes of operations: schema mapping (assign a type); standard setting (assign a goal); validation (setting a validation value).
      Validation state: Valid; Valid; Valid
      Curation goal: Required; Required; Required; Required
      Semantic type: Identifier about Sample; Name about Gender; Identifier about Gender; Description about Age; Age Unit about Age
      Field Name: ID; Gender; age
      Value: GSM701607; 6 weeks old

      Validation state: Valid; Valid
      Curation goal: Required
      Semantic type: Identifier about Sample; Name about Organism; Name about Gender
      Field Name: ID; Organism; Gender
      Value: GSM701607; Mus Musculus

      Validation state: Valid; Valid; Valid
      Curation goal: Required; Required; Required; Required
      Semantic type: Identifier about Sample; Name about Gender; Identifier about Gender; Description about Age; Age Unit about Age
      Field Name: ID; Gender; age
      Value: GSM701607; 6 weeks old
  • 19. AI and data curation. Using a record to modularize curation processes. Different classes of operations: normalization (filling a column); enrichment (adding a column).
      Validation state: Valid; Valid
      Curation goal: Required; Required
      Semantic type: Identifier about Sample; Name about Gender; Identifier about Gender
      Field Name: ID; Gender
      Value: GSM701607; male

      Validation state: Valid; Valid
      Curation goal: Required; Required
      Semantic type: Identifier about Sample; Name about Gender; Identifier about Gender
      Field Name: ID; Gender
      Value: GSM701607; male; PATO:0000384

      Validation state: Valid; Valid
      Curation goal: Required; Required
      Semantic type: Identifier about Sample; ID [2] about Organism; Name about Organism; Description about Age
      Field Name: ID; taxID; Organism; age
      Value: GSM701607; 10090; Mus Musculus; 6 weeks old

      Validation state: Valid; Valid
      Curation goal: Required; Required
      Semantic type: Identifier about Sample; ID [2] about Organism; Name about Organism; Description about Age; Identifier about Sample
      Field Name: ID; taxID; Organism; age; EBI ref.
      Value: GSM701607; 10090; Mus Musculus; 6 weeks old; SAMEA1189935
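The two operation classes on this slide, normalization (filling a column) and enrichment (adding a column), can be sketched as pure functions over a record. This is an illustrative sketch only: the dictionary layout, the `GenderID` field name, and the function names are hypothetical; the values `male`, `PATO:0000384` and `SAMEA1189935` are taken from the slides.

```python
# Lookup used for normalization; the single entry comes from the slide example.
GENDER_TERMS = {"male": "PATO:0000384"}


def normalize_gender(record):
    """Normalization: fill the existing (empty) gender identifier column."""
    out = dict(record)  # copy, so the input record is left untouched
    name = (out.get("Gender") or "").strip().lower()
    if not out.get("GenderID") and name in GENDER_TERMS:
        out["GenderID"] = GENDER_TERMS[name]
    return out


def enrich(record, field_name, value):
    """Enrichment: add a column that the source record did not have."""
    if field_name in record:
        raise KeyError(f"column {field_name!r} already exists")
    out = dict(record)
    out[field_name] = value
    return out


source = {"ID": "GSM701607", "Gender": "male", "GenderID": None}
normalized = normalize_gender(source)
enriched = enrich(normalized, "EBI ref.", "SAMEA1189935")
```

Keeping each operation as a function from record to record is one way to make the curation steps composable and individually testable.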
  • 20. Big picture: quantity/quality trade-off (plot of quality/validity against time/cost). Is the optimal trade-off the same for all data? Can this change for the same data over time and across use cases? Can we embed a “cost function” in curation processes?
  • 21. Big picture: (meta)data evolution, immutability. Initial condition: organism name present, missing ID (Entity: 1234, Information: V1, Meta-Info: V1). Initial condition: identifier extracted, not verified (Entity: 1234, Information: V2, Meta-Info: V2). Identifier extracted and verified (Entity: 1234, Information: V2, Meta-Info: V3).
      Validation state: Valid; Valid
      Curation goal: Required; Required
      Semantic type: Identifier about Sample; Name about Gender; Identifier about Gender
      Field Name: ID; Gender
      Value: GSM701607; male

      Validation state: Valid; Valid
      Curation goal: Required; Required
      Semantic type: Identifier about Sample; Name about Gender; Identifier about Gender
      Field Name: ID; Gender
      Value: GSM701607; male; PATO:0000384

      Validation state: Valid; Valid; Valid
      Curation goal: Required; Required
      Semantic type: Identifier about Sample; Name about Gender; Identifier about Gender
      Field Name: ID; Gender
      Value: GSM701607; male; PATO:0000384
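One way to read the three states on this slide is as an append-only history of immutable snapshots, where information and meta-information are versioned independently. The sketch below assumes a hypothetical `Snapshot` class and state labels ("missing", "extracted", "verified"); the entity number and the organism values are reused from the slides.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class Snapshot:
    """Immutable state of one entity (hypothetical structure)."""
    entity_id: str
    organism_name: str
    organism_id: Optional[str]   # information
    id_state: str                # meta-information about the identifier


# V1: organism name present, identifier missing.
history = [Snapshot("1234", "Mus musculus", None, "missing")]

# V2: identifier extracted; both information and meta-information change.
history.append(replace(history[-1], organism_id="10090", id_state="extracted"))

# V3: identifier verified; only the meta-information changes.
history.append(replace(history[-1], id_state="verified"))
```

Because earlier snapshots are never modified, any analysis can state exactly which version of the data and of its curation status it relied on.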
  • 23. Data and metadata transformations (deterministic actions + extractors). Curation processes can be expressed (by curators) in terms of rules. Rules embed “atomic operations”, e.g. extractors, transformations, … Simple rules go a very long way… Example rule configuration:
      <ruleConfig method="Extract">
        <param name="setType" value="UNIT"/>
        <param name="setAmbiguous" value="true"/>
        <param name="setFullMatch" value="false"/>
        <param name="setResultInJson" value="false"/>
        <param name="setSimpleJson" value="false"/>
        <param name="setText">
          <ruleConfig method="GetCell">
            <param name="setAttr" value="AgeDescription"/>
            <param name="setBase" value="XCF_1"/>
          </ruleConfig>
        </param>
      </ruleConfig>
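To make the rule configuration more concrete, here is a rough Python analogue of an "Extract UNIT from the AgeDescription cell" rule. It is a sketch under stated assumptions, not the rule engine shown on the slide: the functions `get_cell` and `extract_units`, the regular expression, and the interpretation of `setFullMatch` (whole text must be a unit) and `setAmbiguous` (all candidates are returned) are guesses made for illustration.

```python
import re

# Hypothetical vocabulary of time units; a real extractor would use a curated list.
UNIT_PATTERN = re.compile(r"\b(day|week|month|year)s?\b", re.IGNORECASE)


def get_cell(row, attr):
    """Analogue of the GetCell step: read one attribute from a source row."""
    return row.get(attr, "")


def extract_units(text, full_match=False):
    """Analogue of the Extract step with type UNIT.

    With full_match=True the whole text must be a unit; otherwise every
    unit mentioned anywhere in the text is returned (possibly several).
    """
    if full_match:
        m = UNIT_PATTERN.fullmatch(text.strip())
        return [m.group(1).lower()] if m else []
    return [m.group(1).lower() for m in UNIT_PATTERN.finditer(text)]


row = {"AgeDescription": "6 weeks old"}
units = extract_units(get_cell(row, "AgeDescription"))
```

Nesting `get_cell` inside `extract_units` mirrors the way the `GetCell` rule is nested inside the `setText` parameter of the `Extract` rule.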
  • 24. Abstract rules and meta-rules. Rules can rely on abstraction/inference for higher genericity. They can also be used to produce meta-information. Example rules (pseudo-syntax):
      Compute missing identifier: If (E.X.type=“Identifier” and E.X.Goal=“Required” and E.X.Value=“” and exists (E.Y: E.Y.type.about=E.X.type.about and E.Y.type=“Description” and E.Y.Value!=“”)) then E.X.Value=extract(isAbout(E.Y.type), E.Y.value)
      Set a curation goal: If subClassOf(E.OrganismID.Value, NCBI_40674), then E.GenderID.Goal=“Required”
      Assert validity on condition: If one identifier is unambiguously extracted from a species name, then Validation State=Valid
      Validation state: Valid; Valid
      Curation goal: Required; Required; Required
      Semantic type: Identifier about Sample; ID about Organism; Name about Organism; Name about Gender; Identifier about Gender
      Field Name (the “location” in the source): ID; taxID; Organism; Gender
      Value: GSM701607; 10090; Mus Musculus
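The three example rules can be sketched in Python over a list of cells. This is an illustration under several assumptions: the cell layout (`kind`, `about`, `value`, `goal`, `state`), the lookup table standing in for `extract`, the set standing in for the `subClassOf(..., NCBI_40674)` test, and the decision to let a “Name” cell serve as the source text alongside “Description” are all hypothetical.

```python
# Stand-ins for an extractor and for ontology inference (both hypothetical).
SPECIES_TO_TAXID = {"mus musculus": "10090"}
MAMMAL_TAXA = {"10090"}  # taxa assumed to fall under the class used in the rule


def extract(about, text):
    """Return an identifier for `text`, or "" when nothing is found."""
    if about == "Organism":
        return SPECIES_TO_TAXID.get(text.strip().lower(), "")
    return ""


def compute_missing_identifiers(cells, source_kinds=("Description", "Name")):
    """Rules 1 and 3: fill required, empty identifiers and mark them Valid."""
    for x in cells:
        if x["kind"] == "Identifier" and x["goal"] == "Required" and not x["value"]:
            for y in cells:
                if y["about"] == x["about"] and y["kind"] in source_kinds and y["value"]:
                    x["value"] = extract(x["about"], y["value"])
                    if x["value"]:  # a table lookup yields one unambiguous result
                        x["state"] = "Valid"
                    break


def set_gender_goal(cells):
    """Rule 2: for mammals, the gender identifier becomes required."""
    organism = next((c for c in cells
                     if c["kind"] == "Identifier" and c["about"] == "Organism"), None)
    if organism and organism["value"] in MAMMAL_TAXA:
        for c in cells:
            if c["kind"] == "Identifier" and c["about"] == "Gender":
                c["goal"] = "Required"


cells = [
    {"kind": "Identifier", "about": "Organism", "value": "", "goal": "Required", "state": None},
    {"kind": "Name", "about": "Organism", "value": "Mus Musculus", "goal": None, "state": None},
    {"kind": "Identifier", "about": "Gender", "value": "", "goal": None, "state": None},
]
compute_missing_identifiers(cells)
set_gender_goal(cells)
```

After both rules run, the organism identifier is filled and validated, and the gender identifier has become a required (still open) curation task, which shows how one rule's output can trigger another.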
  • 25. “Approximate” transformations. Some transformations cannot (easily) be expressed in terms of rules: complex and ad hoc relations; discretionary elements. Examples: Entity de-duplication: whether two mentions of homonymous authors refer to the same author is a complex function of an extended range of the author’s features (where they work, contact information, subject of study, …). Schema mapping: determining the meaning of an attribute (e.g. time) is a complex function of the values this attribute takes, as well as of other parameters (is this a duration, a time point, or an execution timestamp?). Is “Sample tracking number” to be mapped to “Tracking number” or to “Identifier”?
  • 26. Implementation of de-duplication and schema mapping via Tamr. One approach that we have chosen to provide approximate schema-mapping and de-duplication functions is Tamr (tamr.com). Tamr is a data unification platform that combines machine learning with human expertise. E.g., to support schema mapping, Tamr combines several features: data distribution; property names; property metadata. It learns how to compose such functions via machine learning, through an iterative process in which human experts can provide input and improve predictions.
  • 27. Schema mapping (Tamr). User interface for curators showing potential attribute matches. Users are offered a range of potential mappings, each with a confidence score. They can confirm them or suggest different mappings. New predictions are routinely provided as more input is accumulated.
  • 28. Entity de-duplication (Tamr). User interface for curators showing potential duplicates. Users are shown a set of potential duplicates with a confidence score. They can accept or reject such suggestions, thus providing training data and iteratively refining predictions.
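The suggest-and-review loop described above can be illustrated with a toy example. This is not Tamr's API or algorithm: the scoring below uses fixed, hand-picked weights and Python's standard `difflib` string similarity as a stand-in for a learned model, and all function names, weights and records are hypothetical.

```python
from difflib import SequenceMatcher
from itertools import combinations


def pair_score(a, b):
    """Toy confidence score combining name similarity and shared affiliation."""
    name_sim = SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()
    same_affiliation = 1.0 if a["affiliation"] == b["affiliation"] else 0.0
    return 0.7 * name_sim + 0.3 * same_affiliation


def suggest_duplicates(records, feedback, threshold=0.8):
    """Return (i, j, score) for undecided pairs scoring above the threshold.

    `feedback` maps already reviewed pairs (i, j) to the curator's decision,
    so reviewed pairs are not suggested again.
    """
    suggestions = []
    for (i, a), (j, b) in combinations(enumerate(records), 2):
        if (i, j) in feedback:
            continue
        score = pair_score(a, b)
        if score >= threshold:
            suggestions.append((i, j, round(score, 2)))
    return sorted(suggestions, key=lambda s: -s[2])


authors = [
    {"name": "J. Smith", "affiliation": "Univ A"},
    {"name": "J Smith", "affiliation": "Univ A"},
    {"name": "Jane Smithson", "affiliation": "Univ B"},
]
first_round = suggest_duplicates(authors, feedback={})
second_round = suggest_duplicates(authors, feedback={(0, 1): True})
```

In a learning system the curator's decisions would also be used to re-fit the scoring function; here they only remove reviewed pairs from the queue.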
  • 29. Entity de-duplication (Tamr). Details of the implementation of the de-duplication process (courtesy of Tamr).
  • 30. Re-introducing logic. Can we predict (or suggest) the association between parameters and entities in a template? An ontology models the “real world”: entities, qualities, processes. Parameters are annotated with axioms based on this ontology. Inference provides multiple classifications of parameters, as well as possible/necessary associations between parameters and entities. Can this work?
  • 31. Re-introducing logic. Figures: extract from an ontology representing entities and qualities; example of an axiomatic mapping between a parameter and an ontology of entities and qualities. Deductions for parameter ReportID: must refer to: Report, Document, Descriptive Entity, Concrete Entity, Entity, Information Entity, Immaterial Entity; may refer to: Report, InternalReport.
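The "must refer to" and "may refer to" deductions can be approximated with plain class-hierarchy traversal: a parameter asserted to refer to a class necessarily refers to all of its superclasses and may refer to any of its subclasses. The sketch below uses a simplified, hypothetical hierarchy (the slide's ontology is not reproduced in the transcript), and a real system would use an OWL reasoner rather than this traversal.

```python
# Hypothetical, simplified subclass -> superclasses hierarchy.
PARENTS = {
    "InternalReport": ["Report"],
    "Report": ["Document"],
    "Document": ["Information Entity"],
    "Information Entity": ["Entity"],
}


def ancestors(cls):
    """All direct and indirect superclasses of `cls`."""
    seen = set()
    stack = [cls]
    while stack:
        current = stack.pop()
        for parent in PARENTS.get(current, []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def must_refer_to(cls):
    """Necessary associations: the asserted class and every superclass."""
    return {cls} | ancestors(cls)


def may_refer_to(cls):
    """Possible associations: the asserted class and every subclass."""
    return {cls} | {c for c in PARENTS if cls in ancestors(c)}


must = must_refer_to("Report")
may = may_refer_to("Report")
```

With this toy hierarchy, `may` contains Report and InternalReport, matching the shape of the deduction shown for ReportID.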
  • 32. Exploring automatic ontology matching. 26 submissions. Algorithms covering structural approaches, axiomatic mappings and use of background knowledge. Phenotype track sponsored by the Pistoia Alliance Ontologies Mapping Project. Evaluation results for the Phenotype track submitted to the Journal of Biomedical Semantics. http://oaei.ontologymatching.org/2016
  • 33. Conclusions: on rules, standards and data ethnography. Data curation: “AI” may help (not limited to ML); formal knowledge representation is part of the goal. The need for explanations: we need to define (document) a process; we have theorems for proofs: can we do without? Is there a role for “ML” gurus? The “human side” of data: data normalization is based on assumptions (e.g. what can be considered the same and what cannot), so there is a cultural side to this. Would we accept an AI “editor”?
  • 34. Acknowledgments. NIBR: Daniel Cronenberger, Ming Fang, Frederic Sutter, Anosha Siripala, Fabien Pernot, Jean Marc von Allmen, Martin Petracchi, Dorothy Reilly, Pierre Parisot, Therese Vachon. Tamr.com. Pistoia Alliance Ontology Matching Project team.