Sharing data with lightweight data standards, such as schema.org and Bioschemas. The Knetminer case: an application for the agri-food domain and molecular biology.
Presented at Open Data Sicilia (#ODS2021)
Publishing and Consuming FAIR Data: A Case in the Agri-Food Domain
1. Publishing and Consuming FAIR Data
A Case in the Agri-Food Domain
#ODS 2021, April 17th, 2021
Marco Brandizi <marco.brandizi@rothamsted.ac.uk>
Find this presentation on SlideShare
background source: https://www.eurekalert.org/multimedia/pub/248200.php
2. Hello!
• Geek since the 1980s and the C=64 days
• Started working with life science data in 2003
• Started with Semantic Web and LOD
• Univ. of Milano-Bicocca, EMBL-EBI
• and now Rothamsted Research
• Meanwhile, (h)activism in open source, open data
• Especially in Italy (SOD)
• Still with Semantic Web and LOD, but ...
3. A Major Problem with (Open) Data
How many oil paintings from the 1600s are available in Italy? What are their locations?
Source: Wikipedia:Cattedrale_di_Caltanissetta
4. A Major Problem with (Open) Data
How many oil paintings from the 1600s are available in Italy? What are their locations?
• 2 regions using a common CSV
• 1 using its own CSV
• 1 using completely custom RDF (!)
• None using Cultural-ON or another standard
Source: Brandizi, Agenda Digitale (2018), tinyurl.com/y72wjhm8 github.com/marco-brandizi/cultural_on_ex
5. A Common Problem in Many Domains
Source: Kamdar, Musen, 2021, https://www.nature.com/articles/s41597-021-00797-y
Source: Brandizi, IB2019, https://tinyurl.com/y6p78968
6. What we Do for (Plant) Biology and Agriculture
Based on publications, which genes are related to the yellow rust disease?
In which biological processes are their encoded proteins involved?
7. Towards FAIRer Data
Based on publications, which genes are related to the yellow rust disease? In which
biological processes are their encoded proteins involved?
[Diagram: AgriSchemas, ontology (BioKNO), ETL tools, knetminer.org]
8. Want a demo?
• Count Data Sources
• Integration of Knetminer publications and EBI/GXA gene expression experiments
• Using data with Jupyter (and Neo4j, see more here)
10. Why schema.org?
Web-Oriented, Standard and FAIR
Source and recommended read: https://tinyurl.com/yxocd3b9
(3) Findable
Register the dataset DOI on datasetsearch.research.google.com
Recognised via schema.org
(2) Accessible
Resolvable URIs make data accessible
(1) Interoperable
Recognised via schema.org, links to bio-ontologies, standard IDs
Query/representation standards (SPARQL, Cypher, GraphQL, JSON-LD)
(4) Reusable
Clear licence
Ideally, a machine-readable licence (e.g., CCREL)
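A minimal sketch of how the FAIR points above can map onto a schema.org `Dataset` description serialised as JSON-LD. The helper name `make_dataset_jsonld` and every value passed to it (name, URL, DOI) are hypothetical placeholders, not actual Knetminer metadata; only the `@context`, `@type` and property names come from schema.org.

```python
import json


def make_dataset_jsonld(name, description, url, doi, license_url):
    """Build a minimal schema.org Dataset description as a JSON-LD dict."""
    return {
        "@context": "https://schema.org",
        "@type": "Dataset",
        "name": name,
        "description": description,
        "url": url,                              # resolvable URI (Accessible)
        "identifier": f"https://doi.org/{doi}",  # persistent identifier (Findable)
        "license": license_url,                  # licence given as a URL (Reusable)
    }


# All values below are placeholders, assumed for illustration only.
example = make_dataset_jsonld(
    name="Example plant knowledge graph",
    description="Hypothetical dataset used to illustrate schema.org markup.",
    url="https://example.org/datasets/plant-kg",
    doi="10.1234/example",
    license_url="https://creativecommons.org/licenses/by/4.0/",
)

print(json.dumps(example, indent=2))
```

Embedding the printed JSON in a web page inside a `<script type="application/ld+json">` element is the usual way to expose such a description to crawlers.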
12. However, we’re schema-agnostic
• Pipelines based on incremental workflows (Snakemake)
• Dependency management (Anaconda)
• RDF-to-RDF conversion via SPARQL
• Ontology API and Ontology annotator (via APIs)
• Want more details? Check it out on github
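To illustrate what "incremental workflow" means here, the following is a plain-Python sketch of the make-style rule that tools such as Snakemake apply: a step is re-run only when its output is missing or older than one of its inputs. This is not Snakemake code and not the project's actual pipeline; `needs_rebuild`, `run_step` and the toy `concat_upper` transformation are hypothetical names introduced for the example.

```python
import os
import tempfile


def needs_rebuild(inputs, output):
    """True if the output is missing or older than any input (make-style rule)."""
    if not os.path.exists(output):
        return True
    out_mtime = os.path.getmtime(output)
    return any(os.path.getmtime(path) > out_mtime for path in inputs)


def run_step(inputs, output, action):
    """Run the action only when needed; return True if it actually ran."""
    if not needs_rebuild(inputs, output):
        return False
    action(inputs, output)
    return True


def concat_upper(inputs, output):
    # Hypothetical transformation standing in for a real ETL step.
    with open(output, "w") as out:
        for path in inputs:
            with open(path) as src:
                out.write(src.read().upper())


with tempfile.TemporaryDirectory() as workdir:
    src = os.path.join(workdir, "raw.txt")
    dst = os.path.join(workdir, "converted.txt")
    with open(src, "w") as handle:
        handle.write("gene\n")
    first = run_step([src], dst, concat_upper)   # output missing, so the step runs
    second = run_step([src], dst, concat_upper)  # output up to date, so it is skipped
    print(first, second)
```

In a real pipeline the same dependency rule is declared per step (inputs, outputs, command) and the workflow engine decides what to re-run.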
13. Hence, we could collaborate!
• Do you have your data integration project?
• To perform analysis?
• To try machine learning / artificial intelligence?
• Are you in the agri-food domain?
• Or life sciences, ecology, biomedicine, healthcare?
• Want to build visualisations, data explorers, UI components, etc?
• For known schemas/ontologies, ie, reusable!
• Are you a student? A teacher?
14. Acknowledgements
Ajit Singh, Software Engineer
• Samiul Haque, Ed Eyles, IT admins
• Joseph Hearnshaw, software engineer
• Louis Timberlake, visiting student
• Alice Minotto, Earlham Institute, hosting providers
• Robert Davey, Earlham Institute, DFW WP4 coordinator
• William Brown, Ricardo Gregorio, IT admins
• Monika Mistry, master's student, data curator
• Sandeep Amberkar, bioinformatician, data curator
• Madhu Donepudi, Richard Holland, external contractors, developers
Keywan Hassani-Pak, KnetMiner Team Leader
Chris Rawlings, Head of Computational & Analytical Sciences
Jeremy Parsons, Bioinformatics Scientist
15. Simple & Complementary (the Profiles Approach)
Source: https://bioschemas.org/profiles/Study/0.2-DRAFT/
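A rough sketch of the profiles idea: a profile adds a small set of constraints on top of a schema.org-style type, listing which properties are expected at each marginality level. The property lists in `STUDY_PROFILE` are assumed for illustration and are not copied from the actual Bioschemas Study 0.2-DRAFT profile; `check_against_profile` and the sample record are likewise hypothetical.

```python
# Hypothetical profile: the property names per level are illustrative only,
# not the real contents of the Bioschemas Study profile.
STUDY_PROFILE = {
    "minimum": ["@type", "identifier", "name"],
    "recommended": ["description", "url"],
}


def check_against_profile(record, profile):
    """Return, for each marginality level, the properties missing from the record."""
    return {
        level: [prop for prop in props if prop not in record]
        for level, props in profile.items()
    }


# Placeholder record, assumed for the example.
record = {
    "@context": "https://schema.org",
    "@type": "Study",
    "identifier": "example:study-1",
    "name": "Hypothetical wheat field trial",
}

report = check_against_profile(record, STUDY_PROFILE)
print(report)
```

A record with nothing missing at the minimum level would count as compliant in this toy check; missing recommended properties are reported but do not fail it.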