AgriFood Data, Models, Standards, Tools, Use Cases

Open Data in Agrifood and Life
Sciences: Models, Standards, Tools,
Use Cases
Paris, 17/9/2019
Marco Brandizi <marco.brandizi@rothamsted.ac.uk>
Keywan Hassani-Pak <keywan.hassani-pak@rothamsted.ac.uk>
Find these slides on SlideShare

Why this workshop (ideally)
• Focus on sharing machine-readable data for agrifood and related areas (eg, weather)
• See (examples of) where we stand, current trends, etc
• Share experiences, best practices, solutions, etc
• (At least) outline some common efforts (eg, towards a common schema)

The new oil (*)
https://lod-cloud.net/
https://goo.gl/n4m5xL
https://www.economist.com/node/21521548
(*) or the old mess?

And of course, in the agri-food field

What do we want to get from data?
What are the genes involved in yellow rust and the proteins they encode?
In which pathways are they involved?
What publications and field trials exist as evidence?

How can we get it? => FAIR+
• Data need to be raw; PDFs or HTML-only web sites are not very good (Accessible, Reusable)
• Datasets need meta-data (Findable)
  • Which should be FAIR too, in particular interoperable
• Common formats, schemas and ontologies (Interoperable)
  • RDF in the linked data world
  • OWL in the Semantic Web world (lightweight schemas ever more popular)
  • JSON, APIs, JSON-Schema elsewhere
• Common identifiers (Interoperable)
  • URIs in the linked data world (related to F and I too); accessions, code lists elsewhere
• Common query language(s) (all FAIR principles affected)
  • SPARQL in the linked data world. A plethora of competing QLs elsewhere (eg, GraphQL, Cypher, SQL-like)
• Proper licences, preferably open (Reusable)
• Ideally, text translated to and published as FAIR+ data
• Data should be of good quality, and quality should be measurable and backed by evidence (Reusable and Useful)
  • metrics, automated tests, frequent-enough updates (report publish dates, prod and version dependencies), completeness
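
As a small illustration of the "datasets need meta-data" and "proper licences" points above, a dataset can be described with a lightweight schema such as schema.org's Dataset type, serialised as JSON-LD. This is only a sketch: the helper name `dataset_metadata` and all the values (name, identifier, keywords) are made-up placeholders, not something taken from the talk.

```python
import json


def dataset_metadata(name, description, license_url, identifier, keywords):
    """Build a schema.org Dataset description as a JSON-LD dictionary."""
    return {
        "@context": "https://schema.org/",
        "@type": "Dataset",
        "name": name,
        "description": description,
        "license": license_url,      # an explicit, preferably open, licence (Reusable)
        "identifier": identifier,    # a resolvable URI helps Findability
        "keywords": list(keywords),
    }


# Hypothetical values, for illustration only.
example = dataset_metadata(
    name="Example wheat field trial",
    description="Hypothetical yield measurements, for illustration only.",
    license_url="https://creativecommons.org/licenses/by/4.0/",
    identifier="https://example.org/dataset/wheat-trial-1",
    keywords=["wheat", "field trial"],
)
jsonld_text = json.dumps(example, indent=2)
```

The resulting `jsonld_text` could be embedded in a dataset landing page or sent to a catalogue; which properties a given catalogue actually requires is outside this sketch.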

How to implement it?
In the beginning, there was the Semantic Web, then Linked Data, then…

How to implement it?
We’re still in Babel, still with the same issues

Example: GraphQL
https://countries.trevorblades.com/, https://github.com/trevorblades/coun
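
A GraphQL request against the endpoint above might look like the following sketch, which uses only the Python standard library. The query shape (`country(code: ...)` with `name`, `capital`, `currency`) is an assumption about that service's schema and should be checked against its documentation; the function names are hypothetical.

```python
import json
import urllib.request

# Endpoint taken from the slide; the fields below are assumed, not verified here.
ENDPOINT = "https://countries.trevorblades.com/"

QUERY = """
query CountryByCode($code: ID!) {
  country(code: $code) {
    name
    capital
    currency
  }
}
"""


def build_request(code):
    """Prepare (without sending) a GraphQL POST request for one country code."""
    body = json.dumps({"query": QUERY, "variables": {"code": code}}).encode("utf-8")
    return urllib.request.Request(
        ENDPOINT,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def fetch_country(code):
    """Send the request; needs network access and a live endpoint."""
    with urllib.request.urlopen(build_request(code)) as response:
        return json.load(response)["data"]["country"]
```

The point of the example is that the client names exactly the fields it wants, and the schema lives on the server, which is one way the JSON/API world is addressing the old schema problem.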

Example: AI meets (RDF) Knowledge Graphs

Example: Data Lakes
https://docs.italia.it/italia/daf/daf-docs/en/bozza/
(or, models/schemas/interoperability/standardisation aren’t so much our problems…)

What do we have in Agrifood?
• Linked Data
• APIs
• APIs + standards
• Guidelines
• Tabular formats
• Vocabularies and ontology services
• Specific formats

The Knetminer use case
• Green: Ondex plug-ins
• rdf2neo is a generic, non Ondex-specific RDF->Neo4j conversion tool
• Brandizi et al, IB-2018 (https://dx.doi.org/10.1515%2Fjib-2018-0023)
• Brandizi et al, SWAT4LS-2018 (https://doi.org/10.6084/m9.figshare.7314323.v1)

The Knetminer use case
Cypher examples:
MATCH
  // branching via '|'
  (prot:Protein) - [:produced_by|consumed_by] -> (:Reaction)
  // variable-length chains
  - [:part_of*1..3] -> (pway:Path)
RETURN
  prot.name, pway LIMIT 1000
// Very compact forms available:
MATCH (prot:Protein) -- (pway:Path) RETURN pway
• RDF + OWL used as a standardised modelling/representation language (see BioKNO ontology: github.com/Rothamsted/bioknet-onto)
• SPARQL available too, both having pros/cons (see our benchmark: github.com/Rothamsted/graphdb-benchmarks)
• Cypher being used for “Semantic Motif” queries, linking genes to entities of interest (work in progress)
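
To make the variable-length pattern `[:part_of*1..3]` in the Cypher example above more concrete, here is a plain-Python sketch of what "between 1 and 3 hops along one relation type" means. The toy graph and the function name `reachable` are hypothetical, and this is not how Neo4j evaluates patterns internally (for instance, Cypher also avoids reusing a relationship within one path, which does not matter in this acyclic toy graph).

```python
# Toy graph: (source node, relation type) -> target nodes. Entirely made up.
EDGES = {
    ("prot1", "consumed_by"): ["react1"],
    ("react1", "part_of"): ["pathA"],
    ("pathA", "part_of"): ["pathB"],
    ("pathB", "part_of"): ["pathC"],
    ("pathC", "part_of"): ["pathD"],
}


def reachable(start, rel_type, min_hops, max_hops):
    """Return nodes reachable from start in min_hops..max_hops steps of rel_type."""
    found = [start] if min_hops == 0 else []
    frontier = [start]
    for hops in range(1, max_hops + 1):
        # Expand every node in the current frontier by exactly one hop.
        frontier = [
            target
            for node in frontier
            for target in EDGES.get((node, rel_type), [])
        ]
        if hops >= min_hops:
            for target in frontier:
                if target not in found:
                    found.append(target)
    return found


# Analogue of: (react1) - [:part_of*1..3] -> (pway)
pathways = reachable("react1", "part_of", 1, 3)
```

Here `pathD` is four hops away from `react1`, so it falls outside the `1..3` range, just as it would be excluded by the Cypher pattern.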

Towards interoperability: AgriSchemas

• Experimental data (EBI GXA)
• Molecular biology (Knetminer)
• Host-pathogen interaction (PHI-Base)

Example queries, showing Integration & Interoperability
https://github.com/Rothamsted/agri-schemas/blob/master/drafts/201904-dfw-hackathon/ebi-gxa-use-case/SPARQL-Queries.md

Knetminer Info
GXA Info

Conclusions?
• Actually, these are questions for you: where are we going? Where should we go? Personally, I can only offer my 2 cents
• We have many FAIRification efforts, mostly based on custom formats, APIs, downloads
• We’re still missing integration, interoperability, standards
• These are needed for queries like the ones shown above (show me genes linked to a phenotype, known knowledge, experimental evidence, etc)
• This used to be the focus of the Semantic Web and linked data
• Other approaches have become popular (JSON, APIs, NoSQL)
• Only recently have they started addressing the same old problems (eg, GraphQL, JSON-Schema)
• Schematisation has become lightweight, even in LOD (eg, schema.org or SHACL vs OWL or OBO ontologies)
• Though “true” ontologies are still important in life sciences, mostly for annotations

Acknowledgements
• Ajit Singh, Software Engineer
• Keywan Hassani-Pak, KnetMiner Team Leader
• Chris Rawlings, Head of Computational & Analytical Sciences
• Joseph Hearnshaw, Software Engineer
• Sandeep Amberkar, Bioinformatician, Data Curator
• Alice Minotto, Earlham Inst, hosting providers
• Monika Mistry, Master's student, Data Curator
• William Brown, IT admin
• Madhu Donepudi, Richard Holland, ext contractors, developers

Interactive Session: Proposals
• Model your data of interest with the AgriSchemas approach
• And/or review what we have already drafted and discuss it
• Experiment with the Knetminer SPARQL/Neo4j endpoints (which include an experimental import from GXA)
• Take a closer look at the Knetminer ELT pipeline, from external sources to XML/OXL, RDF, Neo4j

Playing with Knetminer endpoints
• Eg, using queries at the SPARQL endpoint, find proteins related to "oxygenic photosynthesis" and their related publications
• Use https://github.com/Rothamsted/bioknet-onto for info about the data model
• Also, see the figure: https://github.com/Rothamsted/graphdb-benchmarks/blob/master/results/ara_knet_pattern.png
• Suggestion: use the relation bk:pub_in, relating bk:Protein to bk:Publication
• Provisional endpoint: http://marcobrandizi.info:9090/lodestar/sparql
• Solution: https://gist.github.com/marco-brandizi/d51f823a879630f46b5ba582f1450a3c
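
A possible starting point for this exercise is sketched below, using the standard SPARQL protocol (query sent as a `query` URL parameter, results requested in the SPARQL JSON format). It is not the linked solution: the `bk:` namespace URI is an assumption to be checked against the bioknet-onto repository, the query only lists protein/publication pairs without the "oxygenic photosynthesis" filter, and the function names are hypothetical.

```python
import json
import urllib.parse
import urllib.request

# Provisional endpoint from the slide; it may no longer be online.
ENDPOINT = "http://marcobrandizi.info:9090/lodestar/sparql"

# ASSUMPTION: the bk: namespace URI below is a guess, verify it in bioknet-onto.
QUERY = """
PREFIX bk: <http://knetminer.org/data/rdf/terms/biokno/>
SELECT ?protein ?publication
WHERE {
  ?protein a bk:Protein ;
           bk:pub_in ?publication .
  ?publication a bk:Publication .
}
LIMIT 10
"""


def build_request(query=QUERY, endpoint=ENDPOINT):
    """Prepare (without sending) a SPARQL protocol GET request."""
    url = endpoint + "?" + urllib.parse.urlencode({"query": query})
    return urllib.request.Request(
        url, headers={"Accept": "application/sparql-results+json"}
    )


def bindings_to_rows(results):
    """Flatten a SPARQL JSON result document into a list of plain dicts."""
    return [
        {var: cell["value"] for var, cell in binding.items()}
        for binding in results["results"]["bindings"]
    ]


def run_query(query=QUERY):
    """Send the query; needs network access and a live endpoint."""
    with urllib.request.urlopen(build_request(query)) as response:
        return bindings_to_rows(json.load(response))
```

From here, the exercise amounts to extending the `WHERE` clause with the path from proteins to the entity named "oxygenic photosynthesis", using the data model figure linked above.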

Playing with Knetminer endpoints
• Using Cypher and the Neo4j endpoint, explore the components (part_of relation) of the pathway (Path node type) titled (prefName) "chlorophyll a biosynthesis I"
• Provisional endpoint: http://marcobrandizi.info:7474 (ib2019/ib2019)
• Solution: https://gist.github.com/marco-brandizi/7b37d815e2dd539361e76d5817a5d99c
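
Besides the Neo4j browser, the same exercise could be approached programmatically. The sketch below prepares a request for the transactional HTTP API of Neo4j 3.x (`/db/data/transaction/commit`); that the provisional server runs a 3.x version is an assumption, as is the `prefName` property on component nodes. It is not the linked solution, and the function names are hypothetical.

```python
import base64
import json
import urllib.request

# Provisional server and credentials from the slide; the API path assumes Neo4j 3.x.
BASE_URL = "http://marcobrandizi.info:7474"
TX_PATH = "/db/data/transaction/commit"

CYPHER = """
MATCH (component) - [:part_of] -> (pway:Path)
WHERE pway.prefName = $title
RETURN labels(component) AS types, component.prefName AS name
LIMIT 100
"""


def build_payload(title):
    """Wrap the Cypher statement and its parameter in the HTTP API envelope."""
    return {"statements": [{"statement": CYPHER, "parameters": {"title": title}}]}


def build_request(title, user="ib2019", password="ib2019"):
    """Prepare (without sending) an authenticated POST request."""
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        BASE_URL + TX_PATH,
        data=json.dumps(build_payload(title)).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Accept": "application/json",
            "Authorization": "Basic " + token,
        },
        method="POST",
    )


def rows_from_response(response):
    """Turn the API's columns/data layout into a list of dicts, failing on errors."""
    if response.get("errors"):
        raise RuntimeError(response["errors"])
    result = response["results"][0]
    return [dict(zip(result["columns"], item["row"])) for item in result["data"]]
```

Passing the pathway title as the `$title` parameter, rather than pasting it into the query string, keeps the statement reusable for other pathways.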
