Open Data in Agrifood and Life Sciences: Models, Standards, Tools, Use Cases
Paris, 17/9/2019
Marco Brandizi <marco.brandizi@rothamsted.ac.uk>
Keywan Hassani-Pak <keywan.hassani-pak@rothamsted.ac.uk>
Find these slides on SlideShare
Why this workshop (ideally)
• Focus on sharing machine-readable data for agrifood and related areas (eg, weather)
• See (examples of) where we are, current trends, etc
• Share experiences, best practices, solutions, etc
• (At least) outline some common efforts (eg, to have a common schema)
The new oil (*)
https://lod-cloud.net/
https://goo.gl/n4m5xL
https://www.economist.com/node/21521548
(*) or the old mess?
And of course, in the agri-food field
What do we want to get from data?
What are the genes involved in yellow rust and the proteins they encode?
In which pathways are they involved?
What publications and field trials exist as evidence?
How can we get it? => FAIR+
• Data need to be available in raw form; PDF-only or HTML-only web sites are not good enough (Accessible, Reusable)
• Datasets need metadata (Findable)
  • Metadata should be FAIR too, in particular interoperable
• Common formats, schemas and ontologies (Interoperable)
  • RDF in the linked data world
  • OWL in the Semantic Web world (lightweight schemas are ever more popular)
  • JSON, APIs, JSON-Schema elsewhere
• Common identifiers (Interoperable)
  • URIs in the linked data world (related to F and I too); accessions and code lists elsewhere
• Common query language(s) (FAIR principles affected)
  • SPARQL in the linked data world; a plethora of competing QLs elsewhere (eg, GraphQL, Cypher, SQL-like)
• Proper licences, preferably open (Reusable)
• Ideally, text translated to and published as FAIR+ data
• Data should be of good quality, and quality should be measurable and backed by evidence (Reusable and Useful)
  • Metrics, automated tests, frequent-enough updates (report publish dates, prod and version dependencies), completeness
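As a rough illustration of the metadata, licence and update-date points above, the sketch below describes a dataset with schema.org terms, serialised as JSON-LD, and runs a tiny automated completeness check on it. The dataset name, identifier, dates and the `missing_fields` helper are made-up placeholders, and schema.org/Dataset is just one possible lightweight schema, not one these slides prescribe.

```python
import json

# Hypothetical dataset description using schema.org/Dataset terms.
# Every value below is a placeholder for illustration only.
dataset_metadata = {
    "@context": "https://schema.org/",
    "@type": "Dataset",
    "name": "Example wheat field trial measurements",
    "description": "Placeholder description of a machine-readable dataset.",
    "identifier": "https://example.org/datasets/wheat-trial-001",  # common identifier (URI)
    "license": "https://creativecommons.org/licenses/by/4.0/",     # open licence
    "datePublished": "2019-09-17",
    "dateModified": "2019-09-17",                                   # reported update date
    "version": "1.0",
    "encodingFormat": "text/csv",                                   # raw, machine-readable format
}


def missing_fields(metadata, required=("name", "identifier", "license", "dateModified")):
    """Return the required metadata fields that are absent or empty."""
    return [field for field in required if not metadata.get(field)]


json_ld = json.dumps(dataset_metadata, indent=2)
print(json_ld)
print("Missing fields:", missing_fields(dataset_metadata))
```

A check like this could run in a test suite each time the dataset is republished, which is one cheap way to make quality measurable.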
How to implement it?
In the beginning, there was the Semantic Web, then Linked Data, then…
How to implement it?
We’re still in Babel, still with the same issues
Example: GraphQL
https://countries.trevorblades.com/, https://github.com/trevorblades/coun
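For readers who have not used GraphQL, here is a minimal sketch of how a client could query the demo API linked above. The `country(code: ...)` field and its `name` and `capital` sub-fields are assumptions about that API's schema, and the helper names are hypothetical; the request is built with the standard library only and is not sent unless `fetch_country` is called.

```python
import json
import urllib.request

# Public demo API named on the slide.
COUNTRIES_API = "https://countries.trevorblades.com/"

# Assumed schema: a `country(code: ...)` field exposing `name` and `capital`.
QUERY = """
query Country($code: ID!) {
  country(code: $code) {
    name
    capital
  }
}
"""


def build_graphql_request(code):
    """Build (but do not send) an HTTP POST request carrying a GraphQL query."""
    payload = json.dumps({"query": QUERY, "variables": {"code": code}}).encode("utf-8")
    return urllib.request.Request(
        COUNTRIES_API,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def fetch_country(code):
    """Send the request and return the decoded JSON response (needs network access)."""
    with urllib.request.urlopen(build_graphql_request(code)) as response:
        return json.load(response)


request = build_graphql_request("FR")
print(request.get_method(), request.full_url)
```

The client names exactly the fields it wants, which is the main contrast with a fixed REST/JSON response; the query language itself says nothing about shared vocabularies or identifiers.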
Example: AI meets (RDF) Knowledge Graphs
Example: Data Lakes
https://docs.italia.it/italia/daf/daf-docs/en/bozza/
(or, models/schemas/interoperability/standardisation aren’t so much our problems…)
What do we have in Agrifood?
(Figure labels: Linked Data; APIs; APIs + standards; Guidelines; Tabular formats; Vocabularies and ontology services; Specific formats)
The Knetminer use case
• Green (in the figure): Ondex plug-ins
• rdf2neo is a generic, non-Ondex-specific RDF->Neo4j conversion tool
• Brandizi et al, IB-2018 (https://dx.doi.org/10.1515%2Fjib-2018-0023)
• Brandizi et al, SWAT4LS-2018 (https://doi.org/10.6084/m9.figshare.7314323.v1)
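To make the idea of an RDF-to-property-graph conversion more concrete, here is a toy sketch of one common mapping: `rdf:type` statements become node labels, literal-valued statements become node properties, and statements between resources become relationships. This is not rdf2neo's actual implementation; the `ex:` URIs, the `Literal` wrapper and the function names are invented for illustration.

```python
from collections import namedtuple

RDF_TYPE = "rdf:type"

# Wrapper that marks an object as a literal value rather than a resource URI.
Literal = namedtuple("Literal", "value")


def local_name(uri):
    """Keep the part after the last '#', '/' or ':' as a short name."""
    for separator in ("#", "/", ":"):
        uri = uri.rsplit(separator, 1)[-1]
    return uri


def triples_to_property_graph(triples):
    """Map RDF-like triples to (nodes, relationships) of a property graph."""
    nodes = {}          # uri -> {"labels": set, "properties": dict}
    relationships = []  # (source uri, relationship type, target uri)

    def node(uri):
        return nodes.setdefault(uri, {"labels": set(), "properties": {}})

    for subject, predicate, obj in triples:
        if predicate == RDF_TYPE:
            node(subject)["labels"].add(local_name(obj))
        elif isinstance(obj, Literal):
            node(subject)["properties"][local_name(predicate)] = obj.value
        else:
            node(subject)
            node(obj)
            relationships.append((subject, local_name(predicate), obj))
    return nodes, relationships


example_triples = [
    ("ex:prot1", RDF_TYPE, "bk:Protein"),
    ("ex:prot1", "bk:prefName", Literal("Example protein")),
    ("ex:react1", RDF_TYPE, "bk:Reaction"),
    ("ex:prot1", "bk:consumed_by", "ex:react1"),
]

nodes, relationships = triples_to_property_graph(example_triples)
print(nodes)
print(relationships)
```

A real converter also has to decide which RDF resources become nodes at all and how to handle properties on relationships, which is where a configurable tool earns its keep.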
The Knetminer use case
Cypher examples:
MATCH
  // branching via ‘|’
  (prot:Protein) - [:produced_by|consumed_by] -> (:Reaction)
  // variable-length chains
  - [:part_of*1..3] -> (pway:Path)
RETURN
  prot.name, pway LIMIT 1000
// Very compact forms available:
MATCH (prot:Protein) -- (pway:Path) RETURN pway
• RDF + OWL used as a standardised modelling/representation language (see BioKNO ontology: github.com/Rothamsted/bioknet-onto)
• SPARQL is available too; both SPARQL and Cypher have pros and cons (see our benchmark: github.com/Rothamsted/graphdb-benchmarks)
• Cypher is being used for “Semantic Motif” queries, linking genes to entities of interest (work in progress)
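To unpack what the variable-length pattern `[:part_of*1..3]` in the Cypher example above means, the following sketch emulates it over a tiny in-memory edge list: it collects every node reachable by following between 1 and 3 directed `part_of` edges. The node names and the `reachable_within` helper are invented, the data are assumed to be acyclic, and this only illustrates the semantics, not how Neo4j evaluates the query.

```python
def reachable_within(edges, start, min_hops=1, max_hops=3):
    """Return the nodes reachable from `start` via min_hops..max_hops directed edges."""
    outgoing = {}
    for source, target in edges:
        outgoing.setdefault(source, []).append(target)

    found = set()
    frontier = {start}
    for hops in range(1, max_hops + 1):
        # Move one hop further along the outgoing edges of the current frontier.
        frontier = {target for node in frontier for target in outgoing.get(node, [])}
        if hops >= min_hops:
            found |= frontier
    return found


# Hypothetical part_of hierarchy: a reaction nested inside ever larger pathways.
part_of_edges = [
    ("reaction1", "pathA"),
    ("pathA", "pathB"),
    ("pathB", "pathC"),
    ("pathC", "pathD"),
]

print(sorted(reachable_within(part_of_edges, "reaction1")))
```

With the `1..3` bounds, `pathD` is not returned because it is four hops away; widening the range in the query trades completeness against query cost.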
Towards interoperability: AgriSchemas
• Experimental data (EBI GXA)
• Molecular biology (Knetminer)
• Host-pathogen interaction (PHI-Base)
Example queries, showing Integration & Interoperability
https://github.com/Rothamsted/agri-schemas/blob/master/drafts/201904-dfw-hackathon/ebi-gxa-use-case/SPARQL-Queries.md
(Figure panels: Knetminer Info, GXA Info)
Conclusions?
• Actually, the questions above are for you: where are we going? Where should we go? Personally, I can only offer my 2 cents
• We have many FAIRification efforts, mostly based on custom formats, APIs, downloads
• We’re still missing integration, interoperability, standards
  • These are needed for queries like the one shown above (show me genes linked to a phenotype, known knowledge, experimental evidence, etc)
  • This used to be the focus of the Semantic Web and linked data
• Other approaches have become popular (JSON, APIs, NoSQL)
  • Only recently have they started addressing the same old problems (eg, GraphQL, JSON-Schema)
• Schematisation has become lightweight, even in LOD (eg, schema.org or SHACL vs OWL or OBO ontologies)
  • Though “true” ontologies are still important in life sciences, mostly for annotations
Acknowledgements
• Ajit Singh, Software Engineer
• Keywan Hassani-Pak, KnetMiner Team Leader
• Chris Rawlings, Head of Computational & Analytical Sciences
• Joseph Hearnshaw, Software Engineer
• Sandeep Amberkar, Bioinformatician, Data Curator
• Alice Minotto, Earlham Inst, hosting providers
• Monika Mistry, Master's student, Data Curator
• William Brown, IT admin
• Madhu Donepudi, Richard Holland, external contractors, developers
Interactive Session: Proposals
• Model your data of interest with the AgriSchemas approach
• And/or review what we have already drafted and discuss it
• Experiment with the Knetminer SPARQL/Neo4j endpoints (which include an experimental import from GXA)
• Take a closer look at the Knetminer ELT pipeline, from external sources to XML/OXL, RDF, Neo4j
Playing Knetminer endpoints
• Eg, using SPARQL queries at the endpoint, find proteins related to "oxygenic photosynthesis" and their related publications
• Use https://github.com/Rothamsted/bioknet-onto for info about the data model
• Also, see the figure: https://github.com/Rothamsted/graphdb-benchmarks/blob/master/results/ara_knet_pattern.png
• Suggestion: use the relation bk:pub_in, relating bk:Protein to bk:Publication
• Provisional endpoint: http://marcobrandizi.info:9090/lodestar/sparql
• Solution: https://gist.github.com/marco-brandizi/d51f823a879630f46b5ba582f1450a3c
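As a possible starting point for this exercise (not the linked solution), the sketch below builds a SPARQL protocol request that follows the suggested `bk:pub_in` relation from proteins to publications. The `bk:` namespace IRI is an assumption to be checked against the BioKNO ontology, the query does not yet filter on "oxygenic photosynthesis", and whether the provisional URL accepts plain protocol requests is also assumed. The request is only built here; `run_query` needs network access.

```python
import json
import urllib.parse
import urllib.request

# Provisional endpoint from the slide; it may no longer be online.
SPARQL_ENDPOINT = "http://marcobrandizi.info:9090/lodestar/sparql"

# The bk: namespace IRI is an assumption; check the BioKNO ontology for the actual one.
QUERY = """
PREFIX bk: <http://knetminer.org/data/rdf/terms/biokno/>
SELECT ?protein ?publication
WHERE {
  ?protein a bk:Protein ;
           bk:pub_in ?publication .
  ?publication a bk:Publication .
}
LIMIT 10
"""


def build_sparql_request(endpoint, query):
    """Build a SPARQL protocol GET request asking for JSON results."""
    url = endpoint + "?" + urllib.parse.urlencode({"query": query})
    return urllib.request.Request(url, headers={"Accept": "application/sparql-results+json"})


def run_query(endpoint, query):
    """Send the query and return the result bindings (needs network access)."""
    with urllib.request.urlopen(build_sparql_request(endpoint, query)) as response:
        return json.load(response)["results"]["bindings"]


sparql_request = build_sparql_request(SPARQL_ENDPOINT, QUERY)
print(sparql_request.full_url[:80])
```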
Playing Knetminer endpoints
• Using Cypher and the Neo4j endpoint, explore the components (part_of relation) of the path (Path node type) for the pathway titled (prefName) "chlorophyll a biosynthesis I"
• Provisional endpoint: http://marcobrandizi.info:7474 (ib2019/ib2019)
• Solution: https://gist.github.com/marco-brandizi/7b37d815e2dd539361e76d5817a5d99c
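A possible first attempt at this exercise (again, not the linked solution) is sketched below: a parameterised Cypher query wrapped in the JSON body that Neo4j's transactional HTTP API accepts. The direction of `part_of` (component to path) is taken from the Cypher example earlier in the deck; the exact HTTP path to post to depends on the server version and is left out, and the helper name is hypothetical.

```python
import json

# Components directly linked to the pathway via part_of; the direction follows
# the pattern used in the deck's Cypher example.
CYPHER = """
MATCH (component) -[:part_of]-> (pway:Path {prefName: $pathwayName})
RETURN component
LIMIT 100
"""


def build_statement_payload(pathway_name):
    """Return the JSON body for a single parameterised Cypher statement."""
    return json.dumps(
        {
            "statements": [
                {"statement": CYPHER, "parameters": {"pathwayName": pathway_name}}
            ]
        }
    )


payload = build_statement_payload("chlorophyll a biosynthesis I")
print(payload)
```

Passing the pathway title as a parameter rather than pasting it into the query text avoids quoting problems and lets the server reuse the query plan. The same query can simply be pasted into the Neo4j browser, replacing `$pathwayName` with the quoted title.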


Editor's Notes

  • #11 SW-based knowledge bases are still much used by big players, but behind the scenes (eg, to build Google snippets). Ground-based AI (machine learning, neural networks, etc) is used to sort out the mess in modelling (preferred to symbolic/formal approaches, like OWL).
  • #12 The commercial/general-purpose world is focused more on big data, not so much on modelling and standardisation.
  • #21–#24 Details for the downloader, or the second part.
  • #25 We’re deploying our own SPARQL endpoint, where wheat and arabidopsis datasets are merged. We can play with it via the LODEStar browser.
  • #26 Data from different sources are merged together in the RDF coming from URI resolution. The LODEStar browser can show that, and can also resolve the URI (via content negotiation).
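The merging described in the last note works because different sources use the same URI for the same entity. A minimal sketch of that idea follows: statements from two sources are grouped by subject URI, so everything known about one gene ends up in one place. The `ex:` URIs, property names and the `merge_by_uri` helper are invented; real data would come from the Knetminer and GXA RDF.

```python
def merge_by_uri(*sources):
    """Group (subject, predicate, object) statements from several sources by subject URI."""
    merged = {}
    for source in sources:
        for subject, predicate, obj in source:
            merged.setdefault(subject, set()).add((predicate, obj))
    return merged


# Hypothetical statements about the same gene, published by two different sources.
knetminer_statements = [
    ("ex:gene1", "ex:prefName", "GENE1"),
    ("ex:gene1", "ex:encodes", "ex:prot1"),
]
gxa_statements = [
    ("ex:gene1", "ex:expressedIn", "ex:condition1"),
    ("ex:gene2", "ex:expressedIn", "ex:condition1"),
]

merged = merge_by_uri(knetminer_statements, gxa_statements)
for predicate, obj in sorted(merged["ex:gene1"]):
    print("ex:gene1", predicate, obj)
```

If the two sources used different identifiers for the same gene, the merge would silently produce two unrelated entries, which is why common identifiers matter so much for interoperability.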