Publishing and Consuming FAIR Data
A Case in the Agri-Food Domain
#ODS 2021, April 17th, 2021
Marco Brandizi <marco.brandizi@rothamsted.ac.uk>
Find this presentation on SlideShare
background source: https://www.eurekalert.org/multimedia/pub/248200.php
Hello!
• Geek since the 1980s and the C=64 days
• Started working with Life Science Data in 2003
• Started with Semantic Web and LOD
• Univ. of Milano-Bicocca, EMBL-EBI
• and now Rothamsted Research
• Meanwhile, (h)activism in open source, open data
• Especially in Italy (SOD)
• Still with Semantic Web and LOD, but ...
A Major Problem with (Open) Data
How many oil paintings from the 1600s are available in Italy? What are their locations?
Source: Wikipedia:Cattedrale_di_Caltanissetta
• 2 regions using common CSV
• 1 using its own CSV
• 1 using completely custom RDF (!)
• None using Cultural-ON or another standard
Source: Brandizi, Agenda Digitale (2018), tinyurl.com/y72wjhm8 github.com/marco-brandizi/cultural_on_ex
A Common Problem in Many Domains
Source: Kamdar, Musen, 2021, https://www.nature.com/articles/s41597-021-00797-y
Source: Brandizi, IB2019, https://tinyurl.com/y6p78968
What we Do for (Plant) Biology and Agriculture
Based on publications, which genes are related to the yellow rust disease?
In which biological processes are their encoded proteins involved?
Towards FAIRer Data
Based on publications, which genes are related to the yellow rust disease?
In which biological processes are their encoded proteins involved?
[Diagram: AgriSchemas, ontology (BioKNO), ETL Tools, knetminer.org]
Want a demo?
• Count Data Sources
• Integration of Knetminer publications and EBI/GXA gene expression experiments
• Using data with Jupyter (and Neo4j, see more here)
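As a rough idea of what a "count" query could look like from a notebook, here is a minimal sketch, not the demo's actual code. It counts instances per class with a generic SPARQL query, builds the GET URL defined by the SPARQL protocol, and parses a response in the SPARQL JSON results format. The endpoint URL and the sample response are made-up placeholders, and no network call is made.

```python
import json
from urllib.parse import urlencode, urlparse, parse_qs

# Placeholder endpoint for illustration only, NOT the project's real one.
ENDPOINT = "https://example.org/sparql"

# Generic query: how many instances exist for each class?
COUNT_BY_TYPE = """
SELECT ?type (COUNT(?s) AS ?n)
WHERE { ?s a ?type }
GROUP BY ?type
ORDER BY DESC(?n)
"""

def sparql_get_url(endpoint, query):
    """Build a SPARQL-protocol GET URL carrying the query string."""
    return endpoint + "?" + urlencode({"query": query.strip()})

def parse_counts(results_json):
    """Turn a SPARQL JSON results document into {class URI: count}."""
    rows = json.loads(results_json)["results"]["bindings"]
    return {row["type"]["value"]: int(row["n"]["value"]) for row in rows}

# Made-up response, shaped like the SPARQL JSON results format.
sample_response = json.dumps({
    "head": {"vars": ["type", "n"]},
    "results": {"bindings": [
        {"type": {"type": "uri", "value": "https://schema.org/ScholarlyArticle"},
         "n": {"type": "literal", "value": "5"}},
        {"type": {"type": "uri", "value": "https://schema.org/Dataset"},
         "n": {"type": "literal", "value": "2"}},
    ]},
})

url = sparql_get_url(ENDPOINT, COUNT_BY_TYPE)
counts = parse_counts(sample_response)
print(url)
print(counts)
```

In a real notebook the URL would be fetched with an HTTP client, asking for `application/sparql-results+json`, and the response body passed to `parse_counts`.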
Why schema.org?
Simple & Complementary
Why schema.org?
Web-Oriented, Standard and FAIR
Source and recommended read: https://tinyurl.com/yxocd3b9
(3) Findable
Register the dataset DOI on datasetsearch.research.google.com
Recognised via schema.org
(2) Accessible
Resolvable URIs make data accessible
(1) Interoperable
Recognised via schema.org, links to bio-ontologies, standard IDs
Query/representation standards (SPARQL, Cypher, GraphQL, JSON-LD)
(4) Reusable
Clear licence
Ideally, machine-readable licence (eg, CCREL)
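To make the points above concrete, here is a minimal sketch of a schema.org `Dataset` description serialised as JSON-LD, the form that web crawlers and dataset search engines can pick up from a page. The helper name `dataset_jsonld` and every value passed to it (name, URL, DOI) are placeholders invented for illustration. Only the schema.org property names (`name`, `description`, `url`, `identifier`, `license`) come from the vocabulary itself.

```python
import json

def dataset_jsonld(name, description, url, doi, licence_url):
    """Build a schema.org Dataset description as a JSON-LD dictionary."""
    return {
        "@context": "https://schema.org",
        "@type": "Dataset",
        "name": name,
        "description": description,
        "url": url,                               # Accessible: resolvable URI
        "identifier": f"https://doi.org/{doi}",   # Findable: persistent ID
        "license": licence_url,                   # Reusable: explicit licence
    }

# All values below are placeholders, not a real dataset.
example = dataset_jsonld(
    name="Example plant gene dataset",
    description="Placeholder description of an example dataset.",
    url="https://example.org/datasets/plant-genes",
    doi="10.1234/example",
    licence_url="https://creativecommons.org/licenses/by/4.0/",
)

# Wrap the JSON-LD in the script tag used to embed it in a web page.
html_snippet = (
    '<script type="application/ld+json">\n'
    + json.dumps(example, indent=2)
    + "\n</script>"
)
print(html_snippet)
```

Pointing `license` at a licence URL, rather than describing the terms in free text, is one simple way to keep the licence machine-readable.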
However, we’re schema-agnostic
• Pipelines based on incremental workflows (Snakemake)
• Dependency management (Anaconda)
• RDF-to-RDF conversion via SPARQL
• Ontology API and Ontology annotator (via APIs)
• Want more details? Check it out on GitHub
Hence, we could collaborate!
• Do you have your own data integration project?
• To perform analysis?
• To try machine learning / artificial intelligence?
• Are you in the agri-food domain?
• Or life sciences, ecology, biomedicine, healthcare?
• Want to build visualisations, data explorers, UI components, etc?
• For known schemas/ontologies, ie, reusable!
• Are you a student? A teacher?
Acknowledgements
Ajit Singh, Software Engineer
Keywan Hassani-Pak, KnetMiner Team Leader
Chris Rawlings, Head of Computational & Analytical Sciences
Jeremy Parsons, Bioinformatics Scientist
• Samiul Haque, Ed Eyles, IT admins
• Joseph Hearnshaw, software engineer
• Louis Timberlake, visiting student
• Alice Minotto, Earlham Institute, hosting providers
• Robert Davey, Earlham Institute, DFW WP4 coordinator
• William Brown, Ricardo Gregorio, IT admins
• Monika Mistry, master's student, data curator
• Sandeep Amberkar, bioinformatician, data curator
• Madhu Donepudi, Richard Holland, ext contractors, developers
Simple & Complementary (the Profiles Approach)
Source: https://bioschemas.org/profiles/Study/0.2-DRAFT/
Why schema.org? Web-oriented
Source: https://bioschemas.org/liveDeploys/
