Artificial Intelligence in Data Curation

Novartis Institutes for BioMedical Research
Putting data in order (@Novartis)
AI for Data Curation
Yes, can we?
Andrea Splendiani, AD, Information Systems
London
September 28, 2017
NIBR Informatics
Business or Operating Unit/Franchise or Department
Agenda
1. Focus: metadata and reference data (what we do, in context)
2. Knowledge Engineering and AI (some considerations at 10000ft)
3. Data curation: a use case for AI? (holistic view on a process, 1000ft)
4. Ideas and experiences (details)
5. Conclusions (reflections at 10000ft)
Focus: metadata and reference data
1. What:
– Annotation of datasets
– Standards
– Ontologies
– Reference information
2. Why:
– Support analysis
– Support search and query answering
– Support extraction
– Building knowledge networks / information discovery and inference
3. Where
– Typically in research
Knowledge Engineering and AI
Can Artificial Intelligence solve biology? (a stopper)
• 10 years ago: AI approaches to Systems Biology
• Ontology-based knowledge bases (Semantic Web)
• ANN/fuzzy systems are even older
Knowledge Engineering and AI
Can Artificial Intelligence solve biology? (taken seriously)
• Now: AI and ML are riding a wave of hype
• Interest in the Life Sciences industries
Knowledge Engineering and AI
• What helped the resurgence of ML?
– Massive data available
– Massive computational power available
– A few technical improvements
– Success stories (Deep learning)
• Do these also apply to Ontology/Sem-Web based systems?
– UniProt: 5.7B triples in 2009, 30+B triples in 2017
– EBI RDF Platform (2015)
– Wikidata (2014?)
Source: https://tools.wmflabs.org/wikidata-todo/stats.php
Knowledge Engineering and AI
• The way information is represented has implications for what is built on it (e.g.: analytics, data mining)
– Networks: are parallel paths combined in AND or in OR?
– Annotations: explicit mention of negative information
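As a small illustration of why the AND/OR distinction matters, the sketch below propagates activation through a toy network under both readings. The network, the node names and the function `reachable` are hypothetical and only serve to show that the same edges give different conclusions depending on how parallel inputs are interpreted.

```python
def reachable(start, rules, mode):
    """Propagate activation; mode is all (AND) or any (OR) over a node's inputs."""
    active = set(start)
    changed = True
    while changed:
        changed = False
        for target, inputs in rules.items():
            if target not in active and mode(i in active for i in inputs):
                active.add(target)
                changed = True
    return active


# Hypothetical network: C has two incoming edges (from A and B), D follows C.
rules = {"C": ["A", "B"], "D": ["C"]}

with_or = reachable({"A"}, rules, any)   # A alone is enough to activate C
with_and = reachable({"A"}, rules, all)  # C needs both A and B
```

With only A active, the OR reading reaches C and D, while the AND reading stops at A. Any analytics built on the network inherits this difference.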
Knowledge Engineering and AI
• Metadata is important in a data-centric world (and in at least some ML applications)
• Knowledge representation matters, beyond metadata (examples: AND/OR in pathways, NOT in annotations…)
• We are starting to have large, distributed knowledge bases
– Is there a role for AI systems based on logic/KR?
– Can we combine symbolic and sub-symbolic reasoning?
– Is this already happening?
Data curation
• Annotation
• Metadata
• Standards
• Model
• Literature
• Databases
• …
Source: BioCuration 2017 abstracts, via wordscloud.com
An example: public data curation
https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM701607
https://www.ebi.ac.uk/rdf/services/describe?uri=http%3A%2F%2Frdf.ebi.ac.uk%2Fresource%2Fbiosamples%2Fsample%2FSAMEA1189935
An example: public data curation
(data view)
https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM701607
Property | Value | Ontology | Bio-Characteristic?
Sample_source_name | WT6 biological rep 1, Affy processing batch 2 | EFO_0000001 |
Organism | Mus musculus | EFO_0000001, NCBITaxon_10090 |
strain | 129S6/Sv/Ev | EFO_0000001 | Bio
genotype | wild type | EFO_0000001, EFO_0005168 | Bio
Sex | male | EFO_0000001, EFO_0001266, PATO_0000384 |
age | 6 weeks old | EFO_0000001 | Bio
https://www.ebi.ac.uk/rdf/services/describe?uri=http%3A%2F%2Frdf.ebi.ac.uk%2Fresource%2Fbiosamples%2Fsample%2FSAMEA1189935
An example: public data curation
https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM701607
https://www.ebi.ac.uk/rdf/services/describe?uri=http%3A%2F%2Frdf.ebi.ac.uk%2Fresource%2Fbiosamples%2Fsample%2FSAMEA1189935
Supports:
• Aggregation
• Analysis
• Search
• Link discovery
• “Machine learning”
Can we use AI for Data Curation?
Why?
– Data curation is an intellectually demanding, time-consuming activity
– Given the increasing role and amount of data, curation risks becoming a bottleneck
Figure: example of exponential growth in data
AI for data curation:
characteristics and constraints
• Can we automate data curation?
• Difficult:
– Missing data
– Discretion (e.g.: level of granularity)
• Looks reasonable:
– Repetition
– Consistency
– Data/distance evaluations (clustering/attractors)
• We need to combine human aspects and machine-automatable aspects
AI for data curation
Framing the problem: what
• Normalization: should this value be normalized?
• Meaning, e.g.: is “age” the same as “years”?
• Confidence: is this information true?
• The need, e.g.: is this required information? When?
• Validity: is this a valid identifier?
Example: extract from NCBI GEO GSM701607
AI for data curation
Framing the problem: how
We consider curation activities as functions in a “curation space” that is exemplified via a “curation record”
Validation state (Confidence): Valid | Valid | Valid
Curation goal (The need): Required | Required | Required | Required | Required
Semantic type¹ (Meaning): Identifier about Sample | ID² about Organism | Name about Organism | Name about Gender | Identifier about Gender | Description about Age | Age Unit about Age
Field Name (the “location” in the source): ID | taxID | Organism | Gender | age
Value: GSM701607 | 10090 | Mus Musculus | 6 weeks old
¹ All semantic types are expressed via an ontology (presented here as a simplified definition)
² Identifiers also require a domain specification
Example: extract from NCBI GEO GSM701607 (only a subset of fields from the previous slide is considered)
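One way to picture the curation record is as a list of cells, each carrying a semantic type, an optional source field, a value, a goal and a validation state. The following is a minimal sketch under that assumption. The class `CurationCell`, its attribute names and the helper `open_tasks` are hypothetical, and the goal/validation assignments are illustrative rather than taken from the slide.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CurationCell:
    """One column of a curation record (hypothetical structure)."""
    semantic_type: str                 # meaning, e.g. "ID about Organism"
    field_name: Optional[str] = None   # location in the source, if any
    value: Optional[str] = None        # raw or curated value
    goal: Optional[str] = None         # the need, e.g. "Required"
    validation: Optional[str] = None   # confidence, e.g. "Valid"


# Values come from the GSM701607 extract; goal and validation
# assignments are assumptions made for illustration only.
record = [
    CurationCell("Identifier about Sample", "ID", "GSM701607", "Required", "Valid"),
    CurationCell("ID about Organism", "taxID", "10090", "Required", "Valid"),
    CurationCell("Name about Organism", "Organism", "Mus Musculus", "Required", "Valid"),
    CurationCell("Description about Age", "age", "6 weeks old", "Required"),
    CurationCell("Age Unit about Age", goal="Required"),
]


def open_tasks(cells):
    """Return the semantic types that are required but not yet valid."""
    return [c.semantic_type for c in cells
            if c.goal == "Required" and c.validation != "Valid"]
```

Calling `open_tasks(record)` lists the cells that still need curation work, which is one way a record could drive the operations discussed next.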
AI and data curation
Using a record to modularize curation processes
• Different classes of operations
– Schema mapping (assign a type)
– Standard setting (assign a goal)
– Validation (setting a validation value)
Example records:
Validation state: Valid | Valid | Valid
Curation goal: Required | Required | Required | Required
Semantic type: Identifier about Sample | Name about Gender | Identifier about Gender | Description about Age | Age Unit about Age
Field Name: ID | Gender | age
Value: GSM701607 | 6 weeks old

Validation state: Valid | Valid
Curation goal: Required
Semantic type: Identifier about Sample | Name about Organism | Name about Gender
Field Name: ID | Organism | Gender
Value: GSM701607 | Mus Musculus
AI and data curation
Using a record to modularize curation processes
• Different classes of operations
– Normalization (filling a column)
– Enrichment (adding a column)
Normalization, before:
Validation state: Valid | Valid
Curation goal: Required | Required
Semantic type: Identifier about Sample | Name about Gender | Identifier about Gender
Field Name: ID | Gender
Value: GSM701607 | male

Normalization, after:
Validation state: Valid | Valid
Curation goal: Required | Required
Semantic type: Identifier about Sample | Name about Gender | Identifier about Gender
Field Name: ID | Gender
Value: GSM701607 | male | PATO:0000384

Enrichment, before:
Validation state: Valid | Valid
Curation goal: Required | Required
Semantic type: Identifier about Sample | ID about Organism | Name about Organism | Description about Age
Field Name: ID | taxID | Organism | age
Value: GSM701607 | 10090 | Mus Musculus | 6 weeks old

Enrichment, after:
Validation state: Valid | Valid
Curation goal: Required | Required
Semantic type: Identifier about Sample | ID about Organism | Name about Organism | Description about Age | Identifier about Sample
Field Name: ID | taxID | Organism | age | EBI ref.
Value: GSM701607 | 10090 | Mus Musculus | 6 weeks old | SAMEA1189935
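The two operations can be sketched as functions over a record: normalization fills a column that already exists, while enrichment adds a new one. The dictionary layout, the lookup tables `GENDER_TERMS` and `EBI_SAMPLE_REFS`, and the function names are assumptions for illustration. A real system would resolve these values against an ontology or a reference service.

```python
import copy

# Hypothetical lookup tables standing in for ontology/reference services.
GENDER_TERMS = {"male": "PATO:0000384"}
EBI_SAMPLE_REFS = {"GSM701607": "SAMEA1189935"}


def normalize_gender(record):
    """Normalization: fill the existing 'Identifier about Gender' column."""
    out = copy.deepcopy(record)
    name = out["Name about Gender"]["value"]
    out["Identifier about Gender"]["value"] = GENDER_TERMS.get(name)
    return out


def enrich_with_ebi_ref(record):
    """Enrichment: add a new column holding a second sample identifier."""
    out = copy.deepcopy(record)
    sample_id = out["Identifier about Sample"]["value"]
    out["Identifier about Sample (EBI ref.)"] = {
        "field": "EBI ref.",
        "value": EBI_SAMPLE_REFS.get(sample_id),
    }
    return out


record = {
    "Identifier about Sample": {"field": "ID", "value": "GSM701607"},
    "Name about Gender": {"field": "Gender", "value": "male"},
    "Identifier about Gender": {"field": None, "value": None},
}
normalized = normalize_gender(record)
enriched = enrich_with_ebi_ref(normalized)
```

Both functions return a copy instead of editing the input, so the original record stays available for comparison.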
Big picture
Quantity/Quality tradeoff
Figure: quality/validity versus time/cost
• Is the optimal trade-off the same for all data?
• Can this change for the same data over time and across use cases?
• Can we embed a “cost function” in curation processes?
Big picture
(Meta)data evolution, immutability
1. Initial condition: organism name present, missing ID
Entity: 1234, Information: V1, Meta-Info: V1
Validation state: Valid | Valid
Curation goal: Required | Required
Semantic type: Identifier about Sample | Name about Gender | Identifier about Gender
Field Name: ID | Gender
Value: GSM701607 | male
2. Initial condition: identifier extracted, not verified
Entity: 1234, Information: V2, Meta-Info: V2
Validation state: Valid | Valid
Curation goal: Required | Required
Semantic type: Identifier about Sample | Name about Gender | Identifier about Gender
Field Name: ID | Gender
Value: GSM701607 | male | PATO:0000384
3. Identifier extracted and verified
Entity: 1234, Information: V2, Meta-Info: V3
Validation state: Valid | Valid | Valid
Curation goal: Required | Required
Semantic type: Identifier about Sample | Name about Gender | Identifier about Gender
Field Name: ID | Gender
Value: GSM701607 | male | PATO:0000384
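The three states can be read as immutable snapshots of the same entity, where the information version and the meta-information version advance independently. The sketch below assumes a frozen dataclass. The names `Snapshot`, `extract_identifier` and `verify` are hypothetical and not part of the system described here.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class Snapshot:
    entity: str
    info_version: int         # advances when the information itself changes
    meta_version: int         # advances when the meta-information changes
    gender_id: Optional[str]
    validated: bool


def extract_identifier(s: Snapshot, identifier: str) -> Snapshot:
    """New information is added: both versions advance."""
    return replace(s, gender_id=identifier,
                   info_version=s.info_version + 1,
                   meta_version=s.meta_version + 1)


def verify(s: Snapshot) -> Snapshot:
    """Only meta-information changes: the information version is untouched."""
    return replace(s, validated=True, meta_version=s.meta_version + 1)


history = [Snapshot("1234", 1, 1, None, False)]
history.append(extract_identifier(history[-1], "PATO:0000384"))
history.append(verify(history[-1]))
```

Because snapshots are never modified in place, earlier states remain available, which matches the V1/V1, V2/V2, V2/V3 progression.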
Ideas and experiences
Some details
Data and metadata transformations
(deterministic actions + extractors)
• Curation processes can be expressed (by curators) in terms of rules
• Rules embed “atomic operations”, e.g.: extractors, transformations,…
• Simple rules go a very long way…
<ruleConfig method="Extract">
  <param name="setType" value="UNIT"/>
  <param name="setAmbiguous" value="true"/>
  <param name="setFullMatch" value="false"/>
  <param name="setResultInJson" value="false"/>
  <param name="setSimpleJson" value="false"/>
  <param name="setText">
    <ruleConfig method="GetCell">
      <param name="setAttr" value="AgeDescription"/>
      <param name="setBase" value="XCF_1"/>
    </ruleConfig>
  </param>
</ruleConfig>
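The nesting in the configuration above, where the text parameter of an extractor is itself produced by another operation, can be sketched in Python as function composition. This is an illustration only: the in-memory `SHEETS` table, the regular expression and the function names are assumptions, not the behaviour of the actual rule engine.

```python
import re

# Hypothetical in-memory source: base name -> attribute -> cell text.
SHEETS = {"XCF_1": {"AgeDescription": "6 weeks old"}}

UNIT_PATTERN = re.compile(r"\b(day|week|month|year)s?\b", re.IGNORECASE)


def get_cell(base, attr):
    """Atomic operation: read one cell from a named source."""
    return SHEETS[base][attr]


def extract_unit(text, full_match=False):
    """Atomic operation: pull a time unit out of free text."""
    match = (UNIT_PATTERN.fullmatch(text) if full_match
             else UNIT_PATTERN.search(text))
    return match.group(1).lower() if match else None


def run_rule():
    """Rule: the extractor's text input is the output of another operation."""
    return extract_unit(get_cell("XCF_1", "AgeDescription"), full_match=False)
```

Keeping each operation small and side-effect free is what lets a curator combine them declaratively.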
Abstract rules and meta-rules
• Rules can rely on abstraction/inference for higher genericity
• They can also be used to produce meta-information
Example rules (pseudo-syntax)
• Compute missing identifier: If (E.X.type=“Identifier” ^ E.X.Goal=“Required” ^ E.X.Value=“” ^ exists (E.Y: E.Y.type.about=E.X.type.about ^ E.Y.type=“Description” ^ E.Y.Value!=“”)) then E.X.Value=extract(isAbout(E.Y.type), E.Y.Value)
• Set a curation goal: If subClassOf(E.OrganismID.Value, NCBI_40674) then E.GenderID.Goal=“Required”
• Assert validity on condition: If one identifier is unambiguously extracted from a species name, then Validation State=Valid
Validation state: Valid | Valid
Curation goal: Required | Required | Required
Semantic type: Identifier about Sample | ID about Organism | Name about Organism | Name about Gender | Identifier about Gender
Field Name (the “location” in the source): ID | taxID | Organism | Gender
Value: GSM701607 | 10090 | Mus Musculus
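The example rules can be approximated in a few lines of Python over a toy record. This is a sketch under stated assumptions: the lookup `SPECIES_TO_TAXID` and the set `MAMMALS` stand in for ontology services (a real `subClassOf` check would query the NCBI taxonomy), and the cell layout is hypothetical.

```python
# Hypothetical stand-ins for ontology services.
SPECIES_TO_TAXID = {"mus musculus": "10090"}
MAMMALS = {"10090", "9606"}   # assumed descendants of NCBI taxon 40674


def cell(type_, about, value="", goal=None, state=None):
    return {"type": type_, "about": about, "value": value,
            "goal": goal, "state": state}


def compute_missing_identifiers(record):
    """Fill empty required identifiers from a description of the same subject."""
    for x in record:
        if x["type"] == "Identifier" and x["goal"] == "Required" and not x["value"]:
            for y in record:
                if (y["about"] == x["about"] and y["type"] == "Description"
                        and y["value"]):
                    found = SPECIES_TO_TAXID.get(y["value"].lower())
                    if found:
                        x["value"] = found
                        # Meta-rule: an unambiguous extraction is marked valid.
                        x["state"] = "Valid"


def set_gender_goal(record):
    """Require a gender identifier when the organism is a mammal."""
    organism = next(c for c in record
                    if c["type"] == "Identifier" and c["about"] == "Organism")
    if organism["value"] in MAMMALS:
        for c in record:
            if c["type"] == "Identifier" and c["about"] == "Gender":
                c["goal"] = "Required"


record = [
    cell("Identifier", "Organism", goal="Required"),
    cell("Description", "Organism", value="Mus Musculus"),
    cell("Identifier", "Gender"),
]
compute_missing_identifiers(record)
set_gender_goal(record)
```

The first rule never names a specific field. It works for any identifier whose subject also has a description, which is the kind of genericity the abstract rules aim for.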
“Approximate” transformations
• Some transformations cannot (easily) be expressed in terms of rules
– Complex and ad hoc relations
– Discretionary elements
• Examples:
– Entity de-duplication: whether two mentions of homonymous authors refer to the same author or not is a complex function of an extended range of the authors’ features (where they work, contact information, subject of study,…)
– Schema mapping: determining the meaning of an attribute (e.g.: time) is a complex function of the values this attribute takes, as well as other parameters (is this a duration, a time point, or an execution timestamp?). Is “Sample tracking number” to be mapped to “Tracking number” or to “Identifier”?
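To make the idea of a "complex function of features" concrete, here is a deliberately simple sketch that scores whether two author mentions refer to the same person as a weighted sum of per-feature string similarities. The weights, feature names and sample records are invented for illustration. In practice such a function would be learned from curator feedback instead of being fixed by hand.

```python
from difflib import SequenceMatcher

# Hypothetical weights; in practice these would be learned or tuned.
WEIGHTS = {"name": 0.4, "affiliation": 0.3, "email": 0.2, "subject": 0.1}


def similarity(a, b):
    """String similarity in [0, 1]; missing values contribute nothing."""
    if not a or not b:
        return 0.0
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def same_author_score(x, y):
    """Weighted combination of per-feature similarities."""
    return sum(w * similarity(x.get(f), y.get(f)) for f, w in WEIGHTS.items())


a = {"name": "J. Smith", "affiliation": "Example University",
     "email": "j.smith@example.org", "subject": "genomics"}
b = {"name": "J. Smith", "affiliation": "Example University",
     "email": "j.smith@example.org", "subject": "genomics"}
c = {"name": "J. Smith", "affiliation": "Another Institute",
     "email": None, "subject": "linguistics"}
```

Mentions `a` and `c` share a name but little else, so their score is lower than that of `a` and `b`. Where to put the acceptance threshold is exactly the discretionary element that is hard to capture in a rule.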
Implementation of de-duplication and schema mapping via Tamr
• One approach that we have chosen to provide approximate schema-mapping and de-duplication functions is via Tamr (tamr.com)
• Tamr is a data unification platform that combines machine learning with human expertise.
– E.g.: to support schema mapping, Tamr combines several features: data distribution, property names, property metadata
– It learns how to combine these features via machine learning, through an iterative process where human experts can provide input and improve predictions
Schema-mapping (Tamr)
Users are offered a range of potential mappings, each with a confidence score. They can confirm them or suggest different mappings. New predictions are routinely provided as more input is accumulated.
User interface for curators showing potential attribute matches
Entity de-duplication (Tamr)
User interface for curators showing potential duplicates
Users are shown a set of potential duplicates, each with a confidence score. They can accept or reject such suggestions, thus providing training data and iteratively refining predictions.
Entity de-duplication (Tamr)
Details of the implementation of the deduplication process (courtesy of Tamr)
Re-introducing logic
• Can we predict (or suggest) the association between
parameters and entities in a template?
– An ontology models the “real world”: entities, qualities, processes
– Parameters are annotated with axioms based on this ontology
– Inference provides multiple classifications of parameters, as well as
possible/necessary associations between parameters and entities.
• Can this work?
Re-introducing logic
Extract from an ontology representing entities and
qualities
Example of axiomatic mapping between a
parameter and an entity and qualities ontology
Deductions for parameter ReportID:
must refer to: Report, Document, Descriptive Entity, Concrete Entity, Entity,
Information Entity, Immaterial Entity
may refer to: Report, InternalReport
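The "must refer to" and "may refer to" deductions can be read as upward and downward closures over a class hierarchy. The sketch below uses a small hypothetical hierarchy, not the ontology shown on the slide, and plain set operations in place of a description-logic reasoner.

```python
# Hypothetical fragment of an entity ontology: class -> direct superclasses.
PARENTS = {
    "InternalReport": ["Report"],
    "Report": ["Document"],
    "Document": ["Information Entity"],
    "Information Entity": ["Entity"],
    "Entity": [],
}


def ancestors(cls):
    """The class itself plus every superclass, by transitive closure."""
    seen, stack = set(), [cls]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(PARENTS[current])
    return seen


def descendants(cls):
    """The class itself plus every class that has it as an ancestor."""
    return {c for c in PARENTS if cls in ancestors(c)}


# A parameter annotated as "refers to some Report":
must_refer_to = ancestors("Report")    # whatever it refers to is all of these
may_refer_to = descendants("Report")   # it could be any of these more specific kinds
```

A full reasoner would also handle qualities, multiple inheritance and richer axioms. The closures only show the shape of the inference.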
Exploring automatic ontology matching
• 26 submissions. Algorithms covering structural approaches, axiomatic mappings and use of background knowledge
• Phenotype track sponsored by the Pistoia Alliance Ontologies Mapping Project
• Evaluation results for Phenotype track submitted to Journal of Biomedical Semantics
http://oaei.ontologymatching.org/2016
Conclusions: On rules, standards and data ethnography
• Data Curation: “AI” may help (not limited to ML)
– Formal knowledge representation is part of the goal
• The need for explanations
– We need to define (document) a process
– We have theorems for proofs: can we do without?
– Is there a role for “ML” gurus?
• The “human side” of data
– Data normalization is based on assumptions (e.g.: what can be considered the same, what not): there is a cultural side to this.
– Would we accept an AI “editor”?
Acknowledgments
• NIBR
• Daniel Cronenberger
• Ming Fang
• Frederic Sutter
• Anosha Siripala
• Fabien Pernot
• Jean Marc von Allmen
• Martin Petracchi
• Dorothy Reilly
• Pierre Parisot
• Therese Vachon
• Tamr.com
• Pistoia Alliance Ontology Matching Project team
Thank you
1 of 35

Recommended

From data lakes to actionable data (adventures in data curation) by
From data lakes to actionable data (adventures in data curation)From data lakes to actionable data (adventures in data curation)
From data lakes to actionable data (adventures in data curation)Novartis Institutes for BioMedical Research
779 views37 slides
The Genopolis Microarray database by
The Genopolis Microarray databaseThe Genopolis Microarray database
The Genopolis Microarray databaseNovartis Institutes for BioMedical Research
527 views47 slides
Data Mining: Concepts and techniques: Chapter 13 trend by
Data Mining: Concepts and techniques: Chapter 13 trendData Mining: Concepts and techniques: Chapter 13 trend
Data Mining: Concepts and techniques: Chapter 13 trendSalah Amean
5.3K views52 slides
Introduction to Data Mining by
Introduction to Data MiningIntroduction to Data Mining
Introduction to Data MiningKai Koenig
2.2K views58 slides
Ghhh by
GhhhGhhh
Ghhhagammya
1.4K views63 slides
1.2 steps and functionalities by
1.2 steps and functionalities1.2 steps and functionalities
1.2 steps and functionalitiesRajendran
366 views17 slides

More Related Content

What's hot

The 8 Step Data Mining Process by
The 8 Step Data Mining ProcessThe 8 Step Data Mining Process
The 8 Step Data Mining ProcessMarc Berman
10.5K views15 slides
01 Introduction to Data Mining by
01 Introduction to Data Mining01 Introduction to Data Mining
01 Introduction to Data MiningValerii Klymchuk
2.8K views14 slides
Knowledge discovery thru data mining by
Knowledge discovery thru data miningKnowledge discovery thru data mining
Knowledge discovery thru data miningDevakumar Jain
5K views37 slides
Introduction to Datamining Concept and Techniques by
Introduction to Datamining Concept and TechniquesIntroduction to Datamining Concept and Techniques
Introduction to Datamining Concept and TechniquesSơn Còm Nhom
858 views29 slides
Data science syllabus by
Data science syllabusData science syllabus
Data science syllabusanoop bk
168 views2 slides
Introduction to Data Science by
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data ScienceGabriel Moreira
721 views51 slides

What's hot(19)

The 8 Step Data Mining Process by Marc Berman
The 8 Step Data Mining ProcessThe 8 Step Data Mining Process
The 8 Step Data Mining Process
Marc Berman10.5K views
Knowledge discovery thru data mining by Devakumar Jain
Knowledge discovery thru data miningKnowledge discovery thru data mining
Knowledge discovery thru data mining
Devakumar Jain5K views
Introduction to Datamining Concept and Techniques by Sơn Còm Nhom
Introduction to Datamining Concept and TechniquesIntroduction to Datamining Concept and Techniques
Introduction to Datamining Concept and Techniques
Sơn Còm Nhom858 views
Data science syllabus by anoop bk
Data science syllabusData science syllabus
Data science syllabus
anoop bk168 views
Fairification experience clarifying the semantics of data matrices by Pistoia Alliance
Fairification experience clarifying the semantics of data matricesFairification experience clarifying the semantics of data matrices
Fairification experience clarifying the semantics of data matrices
Pistoia Alliance465 views
Data mining seminar report by mayurik19
Data mining seminar reportData mining seminar report
Data mining seminar report
mayurik1915.6K views
Additional themes of data mining for Msc CS by Thanveen
Additional themes of data mining for Msc CSAdditional themes of data mining for Msc CS
Additional themes of data mining for Msc CS
Thanveen2.6K views
Melissa Informatics - Data Quality and AI by melissadata
Melissa Informatics - Data Quality and AIMelissa Informatics - Data Quality and AI
Melissa Informatics - Data Quality and AI
melissadata46 views
Data analytics beyond data processing and how it affects Industry 4.0 by Mathieu d'Aquin
Data analytics beyond data processing and how it affects Industry 4.0Data analytics beyond data processing and how it affects Industry 4.0
Data analytics beyond data processing and how it affects Industry 4.0
Mathieu d'Aquin885 views
Data mining concepts by Basit Rafiq
Data mining conceptsData mining concepts
Data mining concepts
Basit Rafiq709 views

Viewers also liked

The Opera of Phantome - 2017 (presented at the 22nd Biennial Evergreen Phage ... by
The Opera of Phantome - 2017 (presented at the 22nd Biennial Evergreen Phage ...The Opera of Phantome - 2017 (presented at the 22nd Biennial Evergreen Phage ...
The Opera of Phantome - 2017 (presented at the 22nd Biennial Evergreen Phage ...Ramy K. Aziz
553 views44 slides
Introduction to Network Medicine by
Introduction to Network MedicineIntroduction to Network Medicine
Introduction to Network MedicineMarc Santolini
2.4K views28 slides
Gene expression concept and analysis by
Gene expression concept and analysisGene expression concept and analysis
Gene expression concept and analysisNoha Lotfy Ibrahim
7.3K views82 slides
RT-PCR by
RT-PCRRT-PCR
RT-PCRNoha Lotfy Ibrahim
8.1K views37 slides
Gene Expression Data Analysis by
Gene Expression Data AnalysisGene Expression Data Analysis
Gene Expression Data AnalysisJhoirene Clemente
7.6K views42 slides
Graph properties of biological networks by
Graph properties of biological networksGraph properties of biological networks
Graph properties of biological networksngulbahce
1.6K views46 slides

Viewers also liked(11)

The Opera of Phantome - 2017 (presented at the 22nd Biennial Evergreen Phage ... by Ramy K. Aziz
The Opera of Phantome - 2017 (presented at the 22nd Biennial Evergreen Phage ...The Opera of Phantome - 2017 (presented at the 22nd Biennial Evergreen Phage ...
The Opera of Phantome - 2017 (presented at the 22nd Biennial Evergreen Phage ...
Ramy K. Aziz553 views
Introduction to Network Medicine by Marc Santolini
Introduction to Network MedicineIntroduction to Network Medicine
Introduction to Network Medicine
Marc Santolini2.4K views
Graph properties of biological networks by ngulbahce
Graph properties of biological networksGraph properties of biological networks
Graph properties of biological networks
ngulbahce1.6K views
Systems biology & Approaches of genomics and proteomics by sonam786
 Systems biology & Approaches of genomics and proteomics Systems biology & Approaches of genomics and proteomics
Systems biology & Approaches of genomics and proteomics
sonam78610.6K views
Systems biology - Understanding biology at the systems level by Lars Juhl Jensen
Systems biology - Understanding biology at the systems levelSystems biology - Understanding biology at the systems level
Systems biology - Understanding biology at the systems level
Lars Juhl Jensen3.6K views
System biology and its tools by Gaurav Diwakar
System biology and its toolsSystem biology and its tools
System biology and its tools
Gaurav Diwakar3.4K views
Introduction to systems biology by lemberger
Introduction to systems biologyIntroduction to systems biology
Introduction to systems biology
lemberger8.8K views

Similar to Artificial Intelligence in Data Curation

Evaluating Taxonomies by
Evaluating TaxonomiesEvaluating Taxonomies
Evaluating TaxonomiesJoseph Busch
368 views25 slides
Göteborg university(condensed) by
Göteborg university(condensed)Göteborg university(condensed)
Göteborg university(condensed)Zenodia Charpy
385 views103 slides
Nordic health data metadata by
Nordic health data   metadataNordic health data   metadata
Nordic health data metadataFredric Landqvist
86 views49 slides
Wild hairtech bih by
Wild hairtech   bihWild hairtech   bih
Wild hairtech bihTyrell Thornton
244 views23 slides
Be Digital or Die - Predictive Analytics for Digital Transformation by
Be Digital or Die - Predictive Analytics for Digital TransformationBe Digital or Die - Predictive Analytics for Digital Transformation
Be Digital or Die - Predictive Analytics for Digital TransformationFintricity
3K views16 slides
The Evolution Of Competitive Intelligence Dec09 Final by
The Evolution Of Competitive Intelligence Dec09 FinalThe Evolution Of Competitive Intelligence Dec09 Final
The Evolution Of Competitive Intelligence Dec09 Finalrotciv
1.4K views32 slides

Similar to Artificial Intelligence in Data Curation(20)

Evaluating Taxonomies by Joseph Busch
Evaluating TaxonomiesEvaluating Taxonomies
Evaluating Taxonomies
Joseph Busch368 views
Göteborg university(condensed) by Zenodia Charpy
Göteborg university(condensed)Göteborg university(condensed)
Göteborg university(condensed)
Zenodia Charpy385 views
Be Digital or Die - Predictive Analytics for Digital Transformation by Fintricity
Be Digital or Die - Predictive Analytics for Digital TransformationBe Digital or Die - Predictive Analytics for Digital Transformation
Be Digital or Die - Predictive Analytics for Digital Transformation
Fintricity3K views
The Evolution Of Competitive Intelligence Dec09 Final by rotciv
The Evolution Of Competitive Intelligence Dec09 FinalThe Evolution Of Competitive Intelligence Dec09 Final
The Evolution Of Competitive Intelligence Dec09 Final
rotciv1.4K views
AI for information management: why and how by Anna Divoli
AI for information management: why and howAI for information management: why and how
AI for information management: why and how
Anna Divoli574 views
Big data and Predictive Analytics By : Professor Lili Saghafi by Professor Lili Saghafi
Big data and Predictive Analytics By : Professor Lili SaghafiBig data and Predictive Analytics By : Professor Lili Saghafi
Big data and Predictive Analytics By : Professor Lili Saghafi
Introduction to Business and Data Analysis Undergraduate.pdf by AbdulrahimShaibuIssa
Introduction to Business and Data Analysis Undergraduate.pdfIntroduction to Business and Data Analysis Undergraduate.pdf
Introduction to Business and Data Analysis Undergraduate.pdf
Enterprise Search: Addressing the First Problem of Big Data & Analytics - Sta... by StampedeCon
Enterprise Search: Addressing the First Problem of Big Data & Analytics - Sta...Enterprise Search: Addressing the First Problem of Big Data & Analytics - Sta...
Enterprise Search: Addressing the First Problem of Big Data & Analytics - Sta...
StampedeCon655 views
Denodo’s Data Catalog: Bridging the Gap between Data and Business (APAC) by Denodo
Denodo’s Data Catalog: Bridging the Gap between Data and Business (APAC)Denodo’s Data Catalog: Bridging the Gap between Data and Business (APAC)
Denodo’s Data Catalog: Bridging the Gap between Data and Business (APAC)
Denodo 184 views
What is Data Science? by Ahmed Banafa
What is Data Science?What is Data Science?
What is Data Science?
Ahmed Banafa277 views
Actionable analytics with mongo db mongophilly-2011 by MongoDB
Actionable analytics with mongo db   mongophilly-2011Actionable analytics with mongo db   mongophilly-2011
Actionable analytics with mongo db mongophilly-2011
MongoDB929 views
Optimising Your Content for Findability by Findwise
Optimising Your Content for FindabilityOptimising Your Content for Findability
Optimising Your Content for Findability
Findwise3.2K views

Recently uploaded

Experimental animal Guinea pigs.pptx by
Experimental animal Guinea pigs.pptxExperimental animal Guinea pigs.pptx
Experimental animal Guinea pigs.pptxMansee Arya
40 views16 slides
Study on Drug Drug Interaction Through Prescription Analysis of Type II Diabe... by
Study on Drug Drug Interaction Through Prescription Analysis of Type II Diabe...Study on Drug Drug Interaction Through Prescription Analysis of Type II Diabe...
Study on Drug Drug Interaction Through Prescription Analysis of Type II Diabe...Anmol Vishnu Gupta
28 views12 slides
Determination of color fastness to rubbing(wet and dry condition) by crockmeter. by
Determination of color fastness to rubbing(wet and dry condition) by crockmeter.Determination of color fastness to rubbing(wet and dry condition) by crockmeter.
Determination of color fastness to rubbing(wet and dry condition) by crockmeter.ShadmanSakib63
6 views6 slides
NUTRITION IN BACTERIA.pdf by
NUTRITION IN BACTERIA.pdfNUTRITION IN BACTERIA.pdf
NUTRITION IN BACTERIA.pdfNandadulalSannigrahi
37 views14 slides
CYTOSKELETON STRUCTURE.ppt by
CYTOSKELETON STRUCTURE.pptCYTOSKELETON STRUCTURE.ppt
CYTOSKELETON STRUCTURE.pptEstherShobhaR
14 views19 slides
Ellagic Acid and Its Metabolites as Potent and Selective Allosteric Inhibitor... by
Ellagic Acid and Its Metabolites as Potent and Selective Allosteric Inhibitor...Ellagic Acid and Its Metabolites as Potent and Selective Allosteric Inhibitor...
Ellagic Acid and Its Metabolites as Potent and Selective Allosteric Inhibitor...Trustlife
114 views17 slides

Recently uploaded(20)

Experimental animal Guinea pigs.pptx by Mansee Arya
Experimental animal Guinea pigs.pptxExperimental animal Guinea pigs.pptx
Experimental animal Guinea pigs.pptx
Mansee Arya40 views
Study on Drug Drug Interaction Through Prescription Analysis of Type II Diabe... by Anmol Vishnu Gupta
Study on Drug Drug Interaction Through Prescription Analysis of Type II Diabe...Study on Drug Drug Interaction Through Prescription Analysis of Type II Diabe...
Study on Drug Drug Interaction Through Prescription Analysis of Type II Diabe...
Determination of color fastness to rubbing(wet and dry condition) by crockmeter. by ShadmanSakib63
Determination of color fastness to rubbing(wet and dry condition) by crockmeter.Determination of color fastness to rubbing(wet and dry condition) by crockmeter.
Determination of color fastness to rubbing(wet and dry condition) by crockmeter.
ShadmanSakib636 views
Ellagic Acid and Its Metabolites as Potent and Selective Allosteric Inhibitor... by Trustlife
Ellagic Acid and Its Metabolites as Potent and Selective Allosteric Inhibitor...Ellagic Acid and Its Metabolites as Potent and Selective Allosteric Inhibitor...
Ellagic Acid and Its Metabolites as Potent and Selective Allosteric Inhibitor...
Trustlife114 views
Small ruminant keepers’ knowledge, attitudes and practices towards peste des ... by ILRI

Artificial Intelligence in Data Curation

  • 1. AI for Data Curation: Yes, can we? Andrea Splendiani, AD, Information Systems. London, September 28, 2017. NIBR Informatics.
  • 2. Agenda
      1. Focus: metadata and reference data (what we do, in context)
      2. Knowledge Engineering and AI (some considerations at 10000ft)
      3. Data curation: a use case for AI? (holistic view on a process, 1000ft)
      4. Ideas and experiences (details)
      5. Conclusions (reflections at 10000ft)
  • 3. Focus: metadata and reference data
      1. What:
         – Annotation of datasets
         – Standards
         – Ontologies
         – Reference information
      2. Why:
         – Support analysis
         – Support search and query answering
         – Support extraction
         – Building knowledge networks / information discovery and inference
      3. Where:
         – Typically in research
  • 4. Knowledge Engineering and AI. Can Artificial Intelligence solve biology? (a stopper)
      • 10 years ago: AI approaches to Systems Biology
      • Ontology-based knowledge bases (Semantic Web)
      • ANN/Fuzzy systems even older
  • 5. Knowledge Engineering and AI. Can Artificial Intelligence solve biology? (taken seriously)
      • Now: AI and ML are at the peak of the hype
      • Interest in Life Sciences industries
  • 6. Knowledge Engineering and AI
      • What helped the resurgence of ML?
         – Massive data available
         – Massive computational power available
         – Few technical improvements
         – Success stories (Deep learning)
      • Do these also apply to Ontology/Sem-Web based systems?
         – UniProt: 5.7B triples in 2009, 30+B triples in 2017
         – EBI RDF Platform (2015)
         – Wikidata (2014?). Source: https://tools.wmflabs.org/wikidata-todo/stats.php
  • 7. Knowledge Engineering and AI
      • The way information is represented has implications for what is built on it (e.g., analytics, data mining)
         – Networks: are parallel executions in AND or in OR?
         – Annotations: explicit mention of negative information
  • 8. Knowledge Engineering and AI
      • Metadata is important in a data-centric world (and in at least part of ML applications)
      • Knowledge representation matters, beyond metadata (examples: AND/OR in pathways, NOT in annotations…)
      • We are starting to have large, distributed knowledge bases
         – Is there a role for AI systems based on logic/KR?
         – Can we combine symbolic and sub-symbolic reasoning?
         – Is this already happening?
  • 9. Data curation
      [Word cloud] Prominent terms: Annotation, Metadata, Standards, Model, Literature, Databases, …
      Source: BioCuration 2017 abstracts, via wordscloud.com
  • 10. An example: public data curation
      https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM701607
      https://www.ebi.ac.uk/rdf/services/describe?uri=http%3A%2F%2Frdf.ebi.ac.uk%2Fresource%2Fbiosamples%2Fsample%2FSAMEA1189935
  • 11. An example: public data curation (data view)
      Property           | Value                                         | Ontology                               | Bio-Characteristic?
      Sample_source_name | WT6 biological rep 1, Affy processing batch 2 | EFO_0000001                            |
      Organism           | Mus musculus                                  | EFO_0000001, NCBITaxon_10090           |
      strain             | 129S6/Sv/Ev                                   | EFO_0000001                            | Bio
      genotype           | wild type                                     | EFO_0000001, EFO_0005168               | Bio
      Sex                | male                                          | EFO_0000001, EFO_0001266, PATO_0000384 |
      age                | 6 weeks old                                   | EFO_0000001                            | Bio
      Sources:
      https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM701607
      https://www.ebi.ac.uk/rdf/services/describe?uri=http%3A%2F%2Frdf.ebi.ac.uk%2Fresource%2Fbiosamples%2Fsample%2FSAMEA1189935
  • 12. An example: public data curation (data view)
      Property           | Value                                         | Ontology                               | Bio-Characteristic?
      Sample_source_name | WT6 biological rep 1, Affy processing batch 2 | EFO_0000001                            |
      Organism           | Mus musculus                                  | EFO_0000001, NCBITaxon_10090           |
      strain             | 129S6/Sv/Ev                                   | EFO_0000001                            | Bio
      genotype           | wild type                                     | EFO_0000001, EFO_0005168               | Bio
      Sex                | male                                          | EFO_0000001, EFO_0001266, PATO_0000384 |
      age                | 6 weeks old                                   | EFO_0000001                            | Bio
  • 13. An example: public data curation
      https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM701607
      https://www.ebi.ac.uk/rdf/services/describe?uri=http%3A%2F%2Frdf.ebi.ac.uk%2Fresource%2Fbiosamples%2Fsample%2FSAMEA1189935
      Supports:
      • Aggregation
      • Analysis
      • Search
      • Link discovery
      • “Machine learning”
  • 14. Can we use AI for Data Curation? Why?
      – Data curation is an intellectually intensive and time-consuming activity
      – Given the increasing role and amount of data, curation risks becoming a bottleneck
      [Figure: example of exponential growth in data]
  • 15. AI for data curation: characteristics and constraints
      • Can we automate data curation?
      • Difficult:
         – Missing data
         – Discretionary choices (e.g., level of granularity)
      • Looks reasonable:
         – Repetition
         – Consistency
         – Data/distance evaluations (clustering/attractors)
      • We need to combine human aspects and automatable aspects
  • 16. AI for data curation. Framing the problem: what
      Questions raised by an example record (extract from NCBI GEO GSM701607):
      – Should this value be normalized?
      – Meaning: e.g., is “age” the same as “years”?
      – Confidence: is this information true?
      – The need: e.g., is this information required? When?
      – Is this a valid identifier?
  • 17. AI for data curation. Framing the problem: how
      We consider curation activities as functions in a “curation space” that is exemplified via a “curation record”.
      Curation record, listed row by row (example: extract from NCBI GEO GSM701607; only a subset of the fields from the previous slide is considered):
      – Validation state (Confidence): Valid, Valid, Valid
      – Curation goal (The need): Required, Required, Required, Required, Required
      – Semantic type [1] (Meaning): Identifier about Sample; ID [2] about Organism; Name about Organism; Name about Gender; Identifier about Gender; Description about Age; Age Unit about Age
      – Field Name (the “location” in the source): ID, taxID, Organism, Gender, age
      – Value: GSM701607, 10090, Mus Musculus, 6 weeks old
      [1] All semantic types are expressed via an ontology (here presented as a simplified definition).
      [2] Identifiers also require a domain specification.
  • 18. AI and data curation. Using a record to modularize curation processes
      • Different classes of operations:
         – Schema mapping (assign a type)
         – Standard setting (assign a goal)
         – Validation (setting a validation value)
      Example records, listed row by row:
      Record 1:
         – Validation state: Valid, Valid, Valid
         – Curation goal: Required, Required, Required, Required
         – Semantic type: Identifier about Sample; Name about Gender; Identifier about Gender; Description about Age; Age Unit about Age
         – Field Name: ID, Gender, age
         – Value: GSM701607, 6 weeks old
      Record 2:
         – Validation state: Valid, Valid
         – Curation goal: Required
         – Semantic type: Identifier about Sample; Name about Organism; Name about Gender
         – Field Name: ID, Organism, Gender
         – Value: GSM701607, Mus Musculus
      Record 3: same content as Record 1
  • 19. AI and data curation. Using a record to modularize curation processes
      • Different classes of operations:
         – Normalization (filling a column)
         – Enrichment (adding a column)
      Example records, listed row by row:
      Normalization, before:
         – Validation state: Valid, Valid
         – Curation goal: Required, Required
         – Semantic type: Identifier about Sample; Name about Gender; Identifier about Gender
         – Field Name: ID, Gender
         – Value: GSM701607, male
      Normalization, after:
         – Validation state: Valid, Valid
         – Curation goal: Required, Required
         – Semantic type: Identifier about Sample; Name about Gender; Identifier about Gender
         – Field Name: ID, Gender
         – Value: GSM701607, male, PATO:0000384
      Enrichment, before:
         – Validation state: Valid, Valid
         – Curation goal: Required, Required
         – Semantic type: Identifier about Sample; ID about Organism; Name about Organism; Description about Age
         – Field Name: ID, taxID, Organism, age
         – Value: GSM701607, 10090, Mus Musculus, 6 weeks old
      Enrichment, after:
         – Validation state: Valid, Valid
         – Curation goal: Required, Required
         – Semantic type: Identifier about Sample; ID about Organism; Name about Organism; Description about Age; Identifier about Sample
         – Field Name: ID, taxID, Organism, age, EBI ref.
         – Value: GSM701607, 10090, Mus Musculus, 6 weeks old, SAMEA1189935
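The five classes of operations on slides 18 and 19 can be sketched as small functions over an immutable record. This is only an illustration, assuming a hypothetical `Cell` type with one attribute per row of the curation record; the names `Cell`, `map_schema`, `set_goal`, `validate`, `normalize` and `enrich` are not from the talk, and the example merges the gender and EBI reference examples of slide 19 into one record.

```python
from dataclasses import dataclass, replace
from typing import Optional, Tuple


@dataclass(frozen=True)
class Cell:
    """One column of a curation record (hypothetical structure)."""
    field_name: Optional[str]            # "location" in the source; None if created by curation
    value: Optional[str]                 # None means "not filled yet"
    semantic_type: Optional[str] = None  # e.g. "Name about Gender"
    goal: Optional[str] = None           # e.g. "Required"
    validation: Optional[str] = None     # e.g. "Valid"


Record = Tuple[Cell, ...]


def _update(record: Record, index: int, **changes) -> Record:
    # Return a new record; the input record is never modified.
    return tuple(replace(c, **changes) if i == index else c
                 for i, c in enumerate(record))


def map_schema(record: Record, index: int, semantic_type: str) -> Record:
    return _update(record, index, semantic_type=semantic_type)  # assign a type


def set_goal(record: Record, index: int, goal: str) -> Record:
    return _update(record, index, goal=goal)                    # assign a goal


def validate(record: Record, index: int, state: str = "Valid") -> Record:
    return _update(record, index, validation=state)             # set a validation value


def normalize(record: Record, index: int, value: str) -> Record:
    return _update(record, index, value=value)                  # fill a column


def enrich(record: Record, cell: Cell) -> Record:
    return record + (cell,)                                     # add a column


# Values taken from the slides (GSM701607, male, PATO:0000384, SAMEA1189935).
record: Record = (Cell("ID", "GSM701607"), Cell("Gender", "male"), Cell(None, None))
record = map_schema(record, 0, "Identifier about Sample")
record = map_schema(record, 1, "Name about Gender")
record = map_schema(record, 2, "Identifier about Gender")
record = set_goal(record, 2, "Required")
record = normalize(record, 2, "PATO:0000384")
record = validate(record, 2)
record = enrich(record, Cell("EBI ref.", "SAMEA1189935", "Identifier about Sample"))
```

Keeping each operation a pure function from record to record is one way to make curation steps composable and easy to log; whether the original system works this way is not stated in the slides.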
  • 20. Big picture: quantity/quality trade-off
      [Figure: quality/validity plotted against time/cost]
      • Is the optimal trade-off the same for all data?
      • Can this change for the same data over time and use cases?
      • Can we embed a “cost function” in curation processes?
  • 21. Big picture: (meta)data evolution, immutability
      State 1. Initial condition: organism name present, missing ID. Entity: 1234, Information: V1, Meta-Info: V1
         – Validation state: Valid, Valid
         – Curation goal: Required, Required
         – Semantic type: Identifier about Sample; Name about Gender; Identifier about Gender
         – Field Name: ID, Gender
         – Value: GSM701607, male
      State 2. Initial condition: identifier extracted, not verified. Entity: 1234, Information: V2, Meta-Info: V2
         – Validation state: Valid, Valid
         – Curation goal: Required, Required
         – Semantic type: Identifier about Sample; Name about Gender; Identifier about Gender
         – Field Name: ID, Gender
         – Value: GSM701607, male, PATO:0000384
      State 3. Identifier extracted and verified. Entity: 1234, Information: V2, Meta-Info: V3
         – Validation state: Valid, Valid, Valid
         – Curation goal: Required, Required
         – Semantic type: Identifier about Sample; Name about Gender; Identifier about Gender
         – Field Name: ID, Gender
         – Value: GSM701607, male, PATO:0000384
  • 23. Data and metadata transformations (deterministic actions + extractors)
      • Curation processes can be expressed (by curators) in terms of rules
      • Rules embed “atomic operations”, e.g., extractors, transformations,…
      • Simple rules go a very long way…
      Example rule configuration:
      <ruleConfig method="Extract">
        <param name="setType" value="UNIT"/>
        <param name="setAmbiguous" value="true"/>
        <param name="setFullMatch" value="false"/>
        <param name="setResultInJson" value="false"/>
        <param name="setSimpleJson" value="false"/>
        <param name="setText">
          <ruleConfig method="GetCell">
            <param name="setAttr" value="AgeDescription"/>
            <param name="setBase" value="XCF_1"/>
          </ruleConfig>
        </param>
      </ruleConfig>
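To make the rule configuration above more concrete, here is a rough Python sketch of what an "Extract UNIT from the AgeDescription cell" rule might do. The unit vocabulary, the function names and the reading of `setAmbiguous` (return every candidate) and `setFullMatch` (require the whole text to be a unit) are assumptions for illustration; the slides do not describe the rule engine's actual semantics.

```python
import re
from typing import Dict, List

# Hypothetical unit vocabulary: surface form -> normalized unit.
UNIT_SYNONYMS: Dict[str, str] = {
    "day": "day", "days": "day",
    "week": "week", "weeks": "week",
    "month": "month", "months": "month",
    "year": "year", "years": "year",
}


def get_cell(record: Dict[str, str], attr: str) -> str:
    """Stand-in for the nested GetCell rule: read one attribute of a record."""
    return record.get(attr, "")


def extract_units(text: str, full_match: bool = False) -> List[str]:
    """Return all unit candidates found in the text (possibly more than one)."""
    if full_match:
        # The whole text must be a known unit.
        key = text.strip().lower()
        return [UNIT_SYNONYMS[key]] if key in UNIT_SYNONYMS else []
    # Partial match: any alphabetic token that is a known unit counts.
    tokens = re.findall(r"[A-Za-z]+", text.lower())
    return [UNIT_SYNONYMS[t] for t in tokens if t in UNIT_SYNONYMS]


sample = {"AgeDescription": "6 weeks old"}   # value from GSM701607
units = extract_units(get_cell(sample, "AgeDescription"))
```

With the partial-match setting, `units` holds the single candidate found in "6 weeks old"; with `full_match=True` the same text yields no candidate, because the text as a whole is not a unit.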
  • 24. Abstract rules and meta-rules
      • Rules can rely on abstraction/inference for greater generality
      • They can also be used to produce meta-information
      Example rules (pseudo-syntax):
      • Compute missing identifier:
          If (E.X.type="Identifier" ^ E.X.Goal="Required" ^ E.X.Value="" ^ exists (E.Y: E.Y.type.about=E.X.type.about ^ E.Y.type="Description" ^ E.Y.Value!=""))
          then E.X.Value=extract(isAbout(E.Y.type), E.Y.Value)
      • Set a curation goal:
          If subClassOf(E.OrganismID.Value, NCBI_40674), then E.GenderID.Goal="Required"
      • Assert validity on condition:
          If one identifier is unambiguously extracted from a species name, then Validation State=Valid
      Example record, listed row by row:
         – Validation state: Valid, Valid
         – Curation goal: Required, Required, Required
         – Semantic type: Identifier about Sample; ID about Organism; Name about Organism; Name about Gender; Identifier about Gender
         – Field Name (the “location” in the source): ID, taxID, Organism, Gender
         – Value: GSM701607, 10090, Mus Musculus
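The three example rules can be sketched in Python as follows. This is a minimal illustration, not the rule engine from the talk: the `Cell` structure, the lookup table standing in for `extract(...)` and the one-edge lineage standing in for `subClassOf(...)` are hypothetical. The slide's first rule looks only at "Description" fields; the sketch also accepts "Name" fields (an assumption) so that the organism example from the record works.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Cell:
    kind: str             # "Identifier", "Name" or "Description"
    about: str            # what the field is about, e.g. "Organism"
    value: str = ""
    goal: str = ""
    validation: str = ""


# Hypothetical lookup standing in for extract(isAbout(type), text).
NAME_TO_IDS: Dict[tuple, List[str]] = {("Organism", "mus musculus"): ["10090"]}
# Hypothetical, heavily simplified lineage: Mus musculus placed directly under Mammalia (40674).
PARENT: Dict[str, str] = {"10090": "40674"}
TEXT_KINDS = {"Name", "Description"}   # assumption: the slide names only "Description"


def extract(about: str, text: str) -> List[str]:
    return NAME_TO_IDS.get((about, text.strip().lower()), [])


def sub_class_of(taxon: str, ancestor: str) -> bool:
    current: Optional[str] = taxon
    while current is not None:
        if current == ancestor:
            return True
        current = PARENT.get(current)
    return False


def compute_missing_identifiers(record: Dict[str, Cell]) -> None:
    """Rules 1 and 3: fill a required, empty identifier; mark it valid if unambiguous."""
    for x in record.values():
        if not (x.kind == "Identifier" and x.goal == "Required" and x.value == ""):
            continue
        for y in record.values():
            if y.about == x.about and y.kind in TEXT_KINDS and y.value != "":
                candidates = extract(y.about, y.value)
                if len(candidates) == 1:   # exactly one identifier: unambiguous
                    x.value = candidates[0]
                    x.validation = "Valid"
                    break


def set_gender_goal(record: Dict[str, Cell]) -> None:
    """Rule 2: for mammals, the gender identifier becomes a required field."""
    if sub_class_of(record["taxID"].value, "40674"):
        record["GenderID"].goal = "Required"


record = {
    "taxID": Cell("Identifier", "Organism", goal="Required"),
    "Organism": Cell("Name", "Organism", value="Mus Musculus"),
    "GenderID": Cell("Identifier", "Gender"),
}
compute_missing_identifiers(record)
set_gender_goal(record)
```

The order matters here: the goal-setting rule only fires once the organism identifier has been computed, which hints at why a rule engine with inference is useful once rules start depending on each other's output.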
  • 25. “Approximate” transformations
      • Some transformations cannot (easily) be expressed in terms of rules
         – Complex and ad hoc relations
         – Discretionary elements
      • Examples:
         – Entity de-duplication: whether two mentions of homonymous authors refer to the same author is a complex function of an extended range of the author’s features (where they work, contact information, subject of study,…)
         – Schema mapping: determining the meaning of an attribute (e.g., time) is a complex function of the values this attribute takes, as well as of other parameters (is this a duration, a time point, or an execution timestamp?)
         – Is “Sample tracking number” to be mapped to “Tracking number” or to “Identifier”?
  • 26. Implementation of de-duplication and schema mapping via Tamr
      • One approach that we have chosen to provide approximate schema-mapping and de-duplication functions is via Tamr (tamr.com)
      • Tamr is a data unification platform that combines machine learning with human expertise
         – E.g., to support schema mapping, Tamr combines several features: data distribution, property names, property metadata
         – It learns how to compose such functions via machine learning, through an iterative process where human experts can provide input and improve predictions
  • 27. Schema mapping (Tamr)
      [Screenshot: user interface for curators showing potential attribute matches]
      Users are offered a range of potential mappings, each with a confidence score. They can confirm them or suggest different mappings. New predictions are routinely provided as more input is accumulated.
  • 28. Entity de-duplication (Tamr)
      [Screenshot: user interface for curators showing potential duplicates]
      Users are shown a set of potential duplicates with a confidence score. They can accept or refuse such suggestions, thus providing training data and iteratively refining predictions.
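The suggest, accept/refuse, refine loop described above can be illustrated with a toy example. This is not Tamr's algorithm or API, which the slides do not detail: the features, the weighted-average confidence, the weight update and all sample data below are invented for illustration only.

```python
from itertools import combinations
from typing import Dict, List, Tuple

Mention = Dict[str, str]


def jaccard(a: str, b: str) -> float:
    """Token overlap between two strings, in [0, 1]."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if (ta | tb) else 0.0


def features(m1: Mention, m2: Mention) -> List[float]:
    # Toy versions of the features mentioned in the talk: name, workplace, contact information.
    return [
        jaccard(m1["name"], m2["name"]),
        1.0 if m1["affiliation"] == m2["affiliation"] else 0.0,
        1.0 if m1["email"] == m2["email"] else 0.0,
    ]


def confidence(weights: List[float], m1: Mention, m2: Mention) -> float:
    """Weighted average of the features."""
    return sum(w * x for w, x in zip(weights, features(m1, m2))) / sum(weights)


def suggest(mentions: List[Mention], weights: List[float],
            threshold: float = 0.3) -> List[Tuple[int, int, float]]:
    """Candidate duplicate pairs above the threshold, most confident first."""
    scored = [(i, j, confidence(weights, mentions[i], mentions[j]))
              for i, j in combinations(range(len(mentions)), 2)]
    return sorted((p for p in scored if p[2] >= threshold), key=lambda p: -p[2])


def learn(weights: List[float], m1: Mention, m2: Mention,
          accepted: bool, rate: float = 0.5) -> List[float]:
    """Nudge the weights of the features that fired, up if accepted, down if refused."""
    sign = 1.0 if accepted else -1.0
    return [max(0.1, w + sign * rate * x) for w, x in zip(weights, features(m1, m2))]


# Invented author mentions.
mentions = [
    {"name": "J Smith", "affiliation": "Institute A", "email": "js@a.example"},
    {"name": "John Smith", "affiliation": "Institute A", "email": "js@a.example"},
    {"name": "J Smith", "affiliation": "Institute B", "email": "j.smith@b.example"},
]
weights = [1.0, 1.0, 1.0]
before = suggest(mentions, weights)
# The curator refuses the homonym pair (0, 2): same name, different person.
weights = learn(weights, mentions[0], mentions[2], accepted=False)
after = suggest(mentions, weights)
```

In this toy run, the homonym pair is suggested at first because the names match exactly; after the curator refuses it, the name feature loses weight and the pair drops below the threshold, while the pair that shares affiliation and contact information stays at the top.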
  • 29. Entity de-duplication (Tamr)
      [Figure: details of the implementation of the de-duplication process (courtesy of Tamr)]
  • 30. Re-introducing logic
      • Can we predict (or suggest) the association between parameters and entities in a template?
         – An ontology models the “real world”: entities, qualities, processes
         – Parameters are annotated with axioms based on this ontology
         – Inference provides multiple classifications of parameters, as well as possible/necessary associations between parameters and entities
      • Can this work?
  • 31. Re-introducing logic
      [Figure: extract from an ontology representing entities and qualities]
      [Figure: example of axiomatic mapping between a parameter and an entity and qualities ontology]
      Deductions for parameter ReportID:
         – must refer to: Report, Document, Descriptive Entity, Concrete Entity, Entity, Information Entity, Immaterial Entity
         – may refer to: Report, InternalReport
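One plausible reading of these deductions is that "must refer to" collects the annotated class and all its superclasses, while "may refer to" collects the class and its subclasses. The sketch below follows that reading over a small class hierarchy. The parent/child edges are assumptions chosen so that the class names from the slide fit together; the actual ontology and reasoner used in the talk are not shown in the transcript.

```python
from typing import Dict, List, Set

# Hypothetical hierarchy (child -> parents); only the class names come from the slide.
PARENTS: Dict[str, List[str]] = {
    "InternalReport": ["Report"],
    "Report": ["Document"],
    "Document": ["Descriptive Entity", "Concrete Entity"],
    "Descriptive Entity": ["Information Entity"],
    "Information Entity": ["Immaterial Entity"],
    "Immaterial Entity": ["Entity"],
    "Concrete Entity": ["Entity"],
    "Entity": [],
}


def ancestors(cls: str) -> Set[str]:
    """All strict superclasses of cls, following every parent edge."""
    seen: Set[str] = set()
    stack = [cls]
    while stack:
        for parent in PARENTS.get(stack.pop(), []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def must_refer_to(cls: str) -> Set[str]:
    # Anything that is a `cls` is necessarily an instance of every superclass.
    return {cls} | ancestors(cls)


def may_refer_to(cls: str) -> Set[str]:
    # Something that is a `cls` is possibly an instance of any of its subclasses.
    return {cls} | {c for c in PARENTS if cls in ancestors(c)}


# Assume the parameter ReportID is annotated as referring to a Report.
must = must_refer_to("Report")
may = may_refer_to("Report")
```

Under these assumed edges, `must` and `may` contain the same class names as the two lists on the slide. A description-logic reasoner would derive them from the axioms rather than from a hand-written parent table.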
  • 32. Exploring automatic ontology matching
      • 26 submissions. Algorithms covering structural approaches, axiomatic mappings and use of background knowledge
      • Phenotype track sponsored by the Pistoia Alliance Ontologies Mapping Project
      • Evaluation results for the Phenotype track submitted to the Journal of Biomedical Semantics
      http://oaei.ontologymatching.org/2016
  • 33. Conclusions: on rules, standards and data ethnography
      • Data curation: “AI” may help (not limited to ML)
         – Formal knowledge representation is part of the goal
      • The need for explanations
         – We need to define (document) a process
         – We have theorems for proofs: can we do without?
         – Is there a role for “ML” gurus?
      • The “human side” of data
         – Data normalization is based on assumptions (e.g., what can be considered the same and what cannot): there is a cultural side to this
         – Would we accept an AI “editor”?
  • 34. Acknowledgments
      • NIBR
      • Daniel Cronenberger
      • Ming Fang
      • Frederic Sutter
      • Anosha Siripala
      • Fabien Pernot
      • Jean Marc von Allmen
      • Martin Petracchi
      • Dorothy Reilly
      • Pierre Parisot
      • Therese Vachon
      • Tamr.com
      • Pistoia Alliance Ontology Matching Project team