Artificial Intelligence in Data Curation

A set of ideas on the use of artificial intelligence for data curation, presented at the Pharma-IT conference (London, 2017) in the artificial intelligence track.
It begins with a broad discussion of the semantic web, knowledge representation, machine learning and artificial intelligence. It then focuses on how a "data curation" problem can be framed and hints at some possible examples.

Published in: Science

Artificial Intelligence in Data Curation

  1. 1. AI for Data Curation: Yes, can we? Andrea Splendiani, AD, Information Systems. London, September 28, 2017. NIBR Informatics
  2. 2. Business or Operating Unit/Franchise or Department Agenda
       1. Focus: metadata and reference data (what we do in context)
       2. Knowledge Engineering and AI (some considerations at 10000ft)
       3. Data curation: a use case for AI? (holistic view on a process, 1000ft)
       4. Ideas and experiences (details)
       5. Conclusions (reflections at 10000ft)
       Public2
  3. 3. Focus: metadata and reference data
       1. What: – Annotation of datasets – Standards – Ontologies – Reference information
       2. Why: – Support analysis – Support search and query answering – Support extraction – Building knowledge networks / information discovery and inference
       3. Where: – Typically in research
  4. 4. Knowledge Engineering and AI. Can Artificial Intelligence solve biology? (a stopper) • 10 years ago: AI approaches to Systems Biology • Ontology-based knowledge bases (Semantic Web) • ANN/fuzzy systems are even older
  5. 5. Knowledge Engineering and AI. Can Artificial Intelligence solve biology? (taken seriously) • Now: AI and ML are at the height of the hype • Interest from the life sciences industries
  6. 6. Knowledge Engineering and AI
       • What helped the resurgence of ML? – Massive data available – Massive computational power available – Few technical improvements – Success stories (deep learning)
       • Do these also apply to Ontology/Sem-Web based systems? – UniProt: 5.7B triples in 2009, 30+B triples in 2017 – EBI RDF Platform (2015) – Wikidata (2014?)
       Source: https://tools.wmflabs.org/wikidata-todo/stats.php
  7. 7. Knowledge Engineering and AI • The way information is represented has implications for what is built on it (e.g.: analytics, data mining) – Networks: are parallel executions combined in AND or in OR? – Annotations: explicit mention of negative information
  8. 8. Knowledge Engineering and AI
       • Metadata is important in a data-centric world (and in at least part of ML applications)
       • Knowledge representation matters, beyond metadata (examples: AND/OR in pathways, NOT in annotations…)
       • We start to have large, distributed knowledge bases – Is there a role for AI systems based on logic/KR? – Can we combine symbolic and sub-symbolic reasoning? – Is this already happening?
  9. 9. Data curation • Annotation • Metadata • Standards • Model • Literature • Databases • … Source: BioCuration 2017 abstracts via wordscloud.com
  10. 10. An example: public data curation https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM701607 https://www.ebi.ac.uk/rdf/services/describe?uri=http%3A%2F%2Frdf.ebi.ac.uk%2Fresource%2Fbiosamples%2Fsample%2FSAMEA1189935
  11. 11. An example: public data curation (data view) https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM701607 https://www.ebi.ac.uk/rdf/services/describe?uri=http%3A%2F%2Frdf.ebi.ac.uk%2Fresource%2Fbiosamples%2Fsample%2FSAMEA1189935
        Property | Value | Ontology | Bio-Characteristic?
        Sample_source_name | WT6 biological rep 1, Affy processing batch 2 | EFO_0000001 |
        Organism | Mus musculus | EFO_0000001, NCBITaxon_10090 |
        strain | 129S6/Sv/Ev | EFO_0000001 | Bio
        genotype | wild type | EFO_0000001, EFO_0005168 | Bio
        Sex | male | EFO_0000001, EFO_0001266, PATO_0000384 |
        age | 6 weeks old | EFO_0000001 | Bio
  12. 12. An example: public data curation (data view)
        Property | Value | Ontology | Bio-Characteristic?
        Sample_source_name | WT6 biological rep 1, Affy processing batch 2 | EFO_0000001 |
        Organism | Mus musculus | EFO_0000001, NCBITaxon_10090 |
        strain | 129S6/Sv/Ev | EFO_0000001 | Bio
        genotype | wild type | EFO_0000001, EFO_0005168 | Bio
        Sex | male | EFO_0000001, EFO_0001266, PATO_0000384 |
        age | 6 weeks old | EFO_0000001 | Bio
  13. 13. An example: public data curation https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM701607 https://www.ebi.ac.uk/rdf/services/describe?uri=http%3A%2F%2Frdf.ebi.ac.uk%2Fresource%2Fbiosamples%2Fsample%2FSAMEA1189935 Supports: • Aggregation • Analysis • Search • Link discovery • “Machine learning”
  14. 14. Can we use AI for Data Curation? Why? – Data curation is an intellectually demanding and time-consuming activity – Given the increasing role and amount of data, curation risks becoming a bottleneck. Figure: example of exponential growth in data
  15. 15. AI for data curation: characteristics and constraints
        • Can we automate data curation?
        • Difficult: – Missing data – Discretionary choices (e.g.: level of granularity)
        • Looks reasonable: – Repetition – Consistency – Data/distance evaluations (clustering/attractors)
        • We need to combine human aspects with aspects that a machine can handle
  16. 16. AI for data curation. Framing the problem: what • Should this value be normalized? • Meaning, e.g.: is “age” the same as “years”? • Confidence: is this information true? • The need, e.g.: is this required information? When? • Is this a valid identifier? Example: extract from NCBI GEO GSM701607
  17. 17. AI for data curation. Framing the problem: how. We consider curation activities as functions in a “curation space” that is exemplified via a “curation record”.
        Validation state (Confidence): Valid; Valid; Valid
        Curation goal (The need): Required; Required; Required; Required; Required
        Semantic type [1] (Meaning): Identifier about Sample; ID [2] about Organism; Name about Organism; Name about Gender; Identifier about Gender; Description about Age; Age Unit about Age
        Field Name (the “location” in the source): ID; taxID; Organism; Gender; age
        Value: GSM701607; 10090; Mus Musculus; 6 weeks old
        [1] All semantic types are expressed via an ontology (here presented as a simplified definition).
        [2] Identifiers also require a domain specification.
        Example: extract from NCBI GEO GSM701607 (only a subset of the fields from the previous slide is considered).
  18. 18. AI and data curation. Using a record to modularize curation processes
        • Different classes of operations – Schema mapping (assign a type) – Standard setting (assign a goal) – Validation (setting a validation value)
        Example record:
          Validation state: Valid; Valid; Valid
          Curation goal: Required; Required; Required; Required
          Semantic type: Identifier about Sample; Name about Gender; Identifier about Gender; Description about Age; Age Unit about Age
          Field Name: ID; Gender; age
          Value: GSM701607; 6 weeks old
        Example record:
          Validation state: Valid; Valid
          Curation goal: Required
          Semantic type: Identifier about Sample; Name about Organism; Name about Gender
          Field Name: ID; Organism; Gender
          Value: GSM701607; Mus Musculus
  19. 19. AI and data curation. Using a record to modularize curation processes
        • Different classes of operations – Normalization (filling a column) – Enrichment (adding a column)
        Normalization, before:
          Validation state: Valid; Valid
          Curation goal: Required; Required
          Semantic type: Identifier about Sample; Name about Gender; Identifier about Gender
          Field Name: ID; Gender
          Value: GSM701607; male
        Normalization, after:
          Validation state: Valid; Valid
          Curation goal: Required; Required
          Semantic type: Identifier about Sample; Name about Gender; Identifier about Gender
          Field Name: ID; Gender
          Value: GSM701607; male; PATO:0000384
        Enrichment, before:
          Validation state: Valid; Valid
          Curation goal: Required; Required
          Semantic type: Identifier about Sample; ID [2] about Organism; Name about Organism; Description about Age
          Field Name: ID; taxID; Organism; age
          Value: GSM701607; 10090; Mus Musculus; 6 weeks old
        Enrichment, after:
          Validation state: Valid; Valid
          Curation goal: Required; Required
          Semantic type: Identifier about Sample; ID [2] about Organism; Name about Organism; Description about Age; Identifier about Sample
          Field Name: ID; taxID; Organism; age; EBI ref.
          Value: GSM701607; 10090; Mus Musculus; 6 weeks old; SAMEA1189935
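The operation classes named on slides 18 and 19 can be sketched as small functions over a record. This is only an illustration under stated assumptions: the `Cell` class, the function names, the "Invalid" label and the `GSM` prefix check are hypothetical and do not come from the slides. The values are the ones shown in the example records.

```python
from dataclasses import dataclass, replace
from typing import Callable, Optional, Tuple


@dataclass(frozen=True)
class Cell:
    """One column of a curation record (hypothetical structure)."""
    semantic_type: Optional[str] = None  # meaning, e.g. "Name about Gender"
    field_name: Optional[str] = None     # "location" in the source, if any
    value: Optional[str] = None
    goal: Optional[str] = None           # the need, e.g. "Required"
    validation: Optional[str] = None     # confidence, e.g. "Valid"


Record = Tuple[Cell, ...]


def map_schema(cell: Cell, semantic_type: str) -> Cell:
    """Schema mapping: assign a type."""
    return replace(cell, semantic_type=semantic_type)


def set_goal(cell: Cell, goal: str) -> Cell:
    """Standard setting: assign a goal."""
    return replace(cell, goal=goal)


def validate(cell: Cell, check: Callable[[str], bool]) -> Cell:
    """Validation: set a validation value ("Invalid" is an assumed label)."""
    ok = cell.value is not None and check(cell.value)
    return replace(cell, validation="Valid" if ok else "Invalid")


def normalize(record: Record, semantic_type: str, value: str) -> Record:
    """Normalization: fill an existing but empty column."""
    return tuple(
        replace(c, value=value)
        if c.semantic_type == semantic_type and c.value is None
        else c
        for c in record
    )


def enrich(record: Record, cell: Cell) -> Record:
    """Enrichment: add a column."""
    return record + (cell,)


# Rebuild the gender example, then add the EBI reference column.
sample_id = Cell(field_name="ID", value="GSM701607")
sample_id = set_goal(map_schema(sample_id, "Identifier about Sample"), "Required")
sample_id = validate(sample_id, lambda v: v.startswith("GSM"))  # toy check
gender = set_goal(
    map_schema(Cell(field_name="Gender", value="male"), "Name about Gender"),
    "Required",
)
gender_id = Cell(semantic_type="Identifier about Gender")  # empty column
record: Record = (sample_id, gender, gender_id)
record = normalize(record, "Identifier about Gender", "PATO:0000384")
record = enrich(
    record,
    Cell(semantic_type="Identifier about Sample",
         field_name="EBI ref.", value="SAMEA1189935"),
)
```

Each function returns a new cell or record instead of modifying its input, which keeps the operations composable and matches the immutability theme of slide 21.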
  20. 20. Big picture: Quantity/Quality tradeoff (plot axes: Quality/validity against Time/cost) • Is the optimal trade-off the same for all data? • Can this change for the same data over time and across use cases? • Can we embed a “cost function” in curation processes?
  21. 21. Big picture: (Meta)data evolution, immutability
        Initial condition: organism name present, missing ID (Entity: 1234, Information: V1, Meta-Info: V1)
          Validation state: Valid; Valid
          Curation goal: Required; Required
          Semantic type: Identifier about Sample; Name about Gender; Identifier about Gender
          Field Name: ID; Gender
          Value: GSM701607; male
        Initial condition: identifier extracted, not verified (Entity: 1234, Information: V2, Meta-Info: V2)
          Validation state: Valid; Valid
          Curation goal: Required; Required
          Semantic type: Identifier about Sample; Name about Gender; Identifier about Gender
          Field Name: ID; Gender
          Value: GSM701607; male; PATO:0000384
        Identifier extracted and verified (Entity: 1234, Information: V2, Meta-Info: V3)
          Validation state: Valid; Valid; Valid
          Curation goal: Required; Required
          Semantic type: Identifier about Sample; Name about Gender; Identifier about Gender
          Field Name: ID; Gender
          Value: GSM701607; male; PATO:0000384
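The version sequence on slide 21 (V1/V1, then V2/V2, then V2/V3) can be read as an append-only history in which the information version changes only when values change, while the meta-information version changes on any revision. The sketch below is one possible reading, not the presenter's implementation. `Snapshot`, `revise` and the field keys are hypothetical names.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Snapshot:
    """One immutable state of an entity (hypothetical structure)."""
    entity: str
    info_version: int
    meta_version: int
    values: Dict[str, str]       # the information itself
    validation: Dict[str, str]   # meta-information about the values


def revise(history: List[Snapshot],
           values: Optional[Dict[str, str]] = None,
           validation: Optional[Dict[str, str]] = None) -> List[Snapshot]:
    """Return a new history with one more snapshot; earlier ones are kept."""
    last = history[-1]
    new_values = {**last.values, **(values or {})}
    new_validation = {**last.validation, **(validation or {})}
    info_changed = new_values != last.values
    meta_changed = info_changed or new_validation != last.validation
    if not meta_changed:
        return history  # nothing to record
    snapshot = Snapshot(
        entity=last.entity,
        info_version=last.info_version + int(info_changed),
        meta_version=last.meta_version + 1,
        values=new_values,
        validation=new_validation,
    )
    return history + [snapshot]


# Entity 1234: identifier extracted, then verified.
h1 = [Snapshot("1234", 1, 1, {"Gender": "male"}, {"Gender": "Valid"})]
h2 = revise(h1, values={"Gender ID": "PATO:0000384"})
h3 = revise(h2, validation={"Gender ID": "Valid"})
versions = [(s.info_version, s.meta_version) for s in h3]
```

Here `versions` follows the slide's sequence, and `h1` still holds only the initial state, so every earlier condition of the record remains available.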
  22. 22. Ideas and experiences Some details
  23. 23. Data and metadata transformations (deterministic actions + extractors) • Curation processes can be expressed (by curators) in terms of rules • Rules embed “atomic operations”, e.g.: extractors, transformations, … • Simple rules go a very long way… Example rule configuration (excerpt):
        <ruleConfig method="Extract">
          <param name="setType" value="UNIT"/>
          <param name="setAmbiguous" value="true"/>
          <param name="setFullMatch" value="false"/>
          <param name="setResultInJson" value="false"/>
          <param name="setSimpleJson" value="false"/>
          <param name="setText">
            <ruleConfig method="GetCell">
              <param name="setAttr" value="AgeDescription"/>
              <param name="setBase" value="XCF_1"/>
            </ruleConfig>
  24. 24. Abstract rules and meta-rules • Rules can rely on abstraction/inference to be more generic • They can also be used to produce meta-information
        Example rules (pseudo-syntax)
        • Compute missing identifier: If (E.X.type=“Identifier” and E.X.Goal=“Required” and E.X.Value=“” and exists (E.Y: E.Y.type.about=E.X.type.about and E.Y.type=“Description” and E.Y.Value!=“”)) then E.X.Value=extract(isAbout(E.Y.type), E.Y.value)
        • Set a curation goal: If subClassOf(E.OrganismID.Value, NCBI_40674), then E.GenderID.Goal=“Required”
        • Assert validity on condition: If one identifier is unambiguously extracted from a species name, then Validation State=Valid
        Example record:
          Validation state: Valid; Valid
          Curation goal: Required; Required; Required
          Semantic type: Identifier about Sample; ID about Organism; Name about Organism; Name about Gender; Identifier about Gender
          Field Name (the “location” in the source): ID; taxID; Organism; Gender
          Value: GSM701607; 10090; Mus Musculus
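The three example rules above can be sketched in Python as follows. Everything in this block is an illustrative assumption. `LOOKUP` stands in for a real extractor, `PARENT` is a toy hierarchy that collapses the lineage from Mus musculus (10090) to the class referenced in the rule (40674) into one step, and the `source_kinds` parameter is added so that the rule can also read from a "Name" cell, as in the example record, whereas the slide's rule reads from a "Description".

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class Cell:
    kind: str             # "Identifier", "Name", "Description", ...
    about: str            # "Sample", "Organism", "Gender", ...
    value: str = ""
    goal: str = ""
    validation: str = ""


# Hypothetical extractor: a lookup table instead of a real text extractor.
LOOKUP: Dict[Tuple[str, str], str] = {("Organism", "mus musculus"): "10090"}

# Toy taxonomy (child -> parent); the real lineage has many more steps.
PARENT: Dict[str, str] = {"10090": "40674"}


def extract(about: str, text: str) -> Optional[str]:
    """Return the single identifier found for `text`, or None."""
    return LOOKUP.get((about, text.strip().lower()))


def compute_missing_identifiers(cells: List[Cell],
                                source_kinds=("Description",)) -> None:
    """Rule 1, plus rule 3: an unambiguous extraction is marked Valid."""
    for x in cells:
        if x.kind == "Identifier" and x.goal == "Required" and x.value == "":
            for y in cells:
                if y.about == x.about and y.kind in source_kinds and y.value:
                    found = extract(x.about, y.value)
                    if found is not None:
                        x.value = found
                        x.validation = "Valid"
                        break


def is_subclass_of(taxon: Optional[str], ancestor: str) -> bool:
    """Walk up the toy hierarchy until `ancestor` or the root is reached."""
    while taxon is not None:
        if taxon == ancestor:
            return True
        taxon = PARENT.get(taxon)
    return False


def set_gender_goal(cells: List[Cell]) -> None:
    """Rule 2: require a gender identifier for organisms under 40674."""
    organism_ids = [c.value for c in cells
                    if c.kind == "Identifier" and c.about == "Organism"]
    if any(is_subclass_of(t, "40674") for t in organism_ids if t):
        for c in cells:
            if c.kind == "Identifier" and c.about == "Gender":
                c.goal = "Required"


cells = [
    Cell("Identifier", "Sample", "GSM701607", "Required", "Valid"),
    Cell("Identifier", "Organism", goal="Required"),
    Cell("Name", "Organism", "Mus Musculus"),
    Cell("Identifier", "Gender"),
]
compute_missing_identifiers(cells, source_kinds=("Name", "Description"))
set_gender_goal(cells)
```

The rules run in sequence here: the organism identifier is filled first, which then lets the goal-setting rule fire on it. A rule engine would have to decide such ordering itself.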
  25. 25. “Approximate” transformations
        • Some transformations cannot (easily) be expressed in terms of rules – Complex and ad hoc relations – Discretionary elements
        • Examples:
          – Entity de-duplication: whether two mentions of authors with the same name refer to the same author is a complex function of an extended range of the author’s features (where they work, contact information, subject of study, …)
          – Schema mapping: determining the meaning of an attribute (e.g.: time) is a complex function of the values this attribute takes, as well as of other parameters (is this a duration, a time point, or an execution timestamp?)
          – Is “Sample tracking number” to be mapped to “Tracking number” or to “Identifier”?
  26. 26. Implementation of de-duplication and schema mapping via Tamr
        • One approach that we have chosen to provide approximate schema-mapping and de-duplication functions is via Tamr (tamr.com)
        • Tamr is a data unification platform that combines machine learning with human expertise.
          – E.g.: to support schema mapping, Tamr combines several features: data distribution, property names, property metadata
          – It learns how to compose such functions via machine learning, through an iterative process where human experts can provide input and improve predictions
  27. 27. Schema-mapping (Tamr). Users are offered a range of potential mappings, each with a confidence score. They can confirm them or suggest different mappings. New predictions are routinely provided as more input is accumulated. Figure: user interface for curators showing potential attribute matches
  28. 28. Entity de-duplication (Tamr). Users are shown a set of potential duplicates with a confidence score. They can accept or refuse such suggestions, thus providing training data and iteratively refining predictions. Figure: user interface for curators showing potential duplicates
  29. 29. Entity de-duplication (Tamr). Figure: details of the implementation of the deduplication process (courtesy of Tamr)
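The review loop described on slides 27 and 28 (suggest with a confidence score, collect accept/refuse decisions, refine predictions) can be illustrated with a very small model. This is not Tamr's implementation, which the slides do not detail. It is a sketch that assumes a linear score over per-feature string similarities, updated with a simple delta rule. The feature names and the two author records are invented for the example.

```python
from difflib import SequenceMatcher
from typing import Dict

# Hypothetical features describing an author mention.
FEATURES = ("name", "affiliation", "email")


def similarities(a: Dict[str, str], b: Dict[str, str]) -> Dict[str, float]:
    """Per-feature string similarity in [0, 1]."""
    return {
        f: SequenceMatcher(None, a[f].lower(), b[f].lower()).ratio()
        for f in FEATURES
    }


def confidence(weights: Dict[str, float], sims: Dict[str, float]) -> float:
    """Weighted sum of similarities, clamped to [0, 1]."""
    raw = sum(weights[f] * sims[f] for f in FEATURES)
    return min(1.0, max(0.0, raw))


def learn(weights: Dict[str, float], sims: Dict[str, float],
          is_duplicate: bool, rate: float = 0.2) -> Dict[str, float]:
    """Delta-rule update from one curator decision (accept or refuse)."""
    error = (1.0 if is_duplicate else 0.0) - confidence(weights, sims)
    return {f: weights[f] + rate * error * sims[f] for f in FEATURES}


# Two mentions with the same name but different context (invented data).
a = {"name": "J. Smith", "affiliation": "NIBR Basel",
     "email": "j.smith@example.org"}
b = {"name": "J. Smith", "affiliation": "University of Oslo",
     "email": "john@example.com"}

weights = {f: 1.0 / len(FEATURES) for f in FEATURES}
sims = similarities(a, b)
before = confidence(weights, sims)

# The curator refuses the suggestion; the model lowers its score for the pair.
weights = learn(weights, sims, is_duplicate=False)
after = confidence(weights, sims)
```

A single refusal lowers the score for this pair. With many decisions, the weights would shift toward the features that actually separate homonymous authors, which is the "complex function" of features mentioned on slide 25.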
  30. 30. Re-introducing logic
        • Can we predict (or suggest) the association between parameters and entities in a template?
          – An ontology models the “real world”: entities, qualities, processes
          – Parameters are annotated with axioms based on this ontology
          – Inference provides multiple classifications of parameters, as well as possible/necessary associations between parameters and entities.
        • Can this work?
  31. 31. Re-introducing logic. Figures: extract from an ontology representing entities and qualities; example of axiomatic mapping between a parameter and an entity and qualities ontology.
        Deductions for parameter ReportID:
          must refer to: Report, Document, Descriptive Entity, Concrete Entity, Entity, Information Entity, Immaterial Entity
          may refer to: Report, InternalReport
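One simple way to obtain deductions of this shape is to read "must refer to" as the annotated class plus all of its superclasses, and "may refer to" as the annotated class plus its subclasses. The sketch below does this over a plain subclass graph. The hierarchy in `SUBCLASS_OF` is hypothetical: the slide's ontology figure is not available in the transcript, so the edges were chosen only to be consistent with the listed deductions, and a real system would use an OWL reasoner instead of this traversal.

```python
from typing import Dict, Set

# Hypothetical class hierarchy (class -> direct superclasses).
SUBCLASS_OF: Dict[str, Set[str]] = {
    "InternalReport": {"Report"},
    "Report": {"Document"},
    "Document": {"Descriptive Entity", "Concrete Entity"},
    "Descriptive Entity": {"Information Entity"},
    "Information Entity": {"Immaterial Entity"},
    "Immaterial Entity": {"Entity"},
    "Concrete Entity": {"Entity"},
}


def ancestors(cls: str) -> Set[str]:
    """All direct and indirect superclasses of `cls`."""
    seen: Set[str] = set()
    stack = [cls]
    while stack:
        current = stack.pop()
        for parent in SUBCLASS_OF.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def descendants(cls: str) -> Set[str]:
    """All classes that have `cls` among their superclasses."""
    return {c for c in SUBCLASS_OF if cls in ancestors(c)}


def must_refer_to(cls: str) -> Set[str]:
    """Necessary classifications: the class and everything above it."""
    return {cls} | ancestors(cls)


def may_refer_to(cls: str) -> Set[str]:
    """Possible classifications: the class and everything below it."""
    return {cls} | descendants(cls)


# Parameter ReportID is assumed to be annotated as referring to a Report.
must = must_refer_to("Report")
may = may_refer_to("Report")
```

With this toy hierarchy, `must` and `may` reproduce the two lists given for ReportID on the slide.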
  32. 32. Exploring automatic ontology matching • 26 submissions. Algorithms covering structural approaches, axiomatic mappings and use of background knowledge • Phenotype track sponsored by the Pistoia Alliance Ontologies Mapping Project • Evaluation results for the Phenotype track submitted to the Journal of Biomedical Semantics. http://oaei.ontologymatching.org/2016
  33. 33. Conclusions: On rules, standards and data ethnography
        • Data Curation: “AI” may help (not limited to ML) – Formal knowledge representation is part of the goal
        • The need for explanations – We need to define (document) a process – We have theorems for proofs: can we do without? – Is there a role for “ML” gurus?
        • The “human side” of data – Data normalization is based on assumptions (e.g.: what can be considered the same and what cannot): there is a cultural side to this. – Would we accept an AI “editor”?
  34. 34. Acknowledgments • NIBR • Daniel Cronenberger • Ming Fang • Frederic Sutter • Anosha Siripala • Fabien Pernot • Jean Marc von Allmen • Martin Petracchi • Dorothy Reilly • Pierre Parisot • Therese Vachon • Tamr.com • Pistoia Alliance Ontology Matching Project team
  35. 35. Thank you
