Making data FAIR using InterMine


Making data findable, accessible, interoperable and reusable using the InterMine life sciences data system

Published in: Science

  1. 1. Making data FAIR using InterMine Justin Clark-Casey (@justincc), Software Engineer, InterMine
  2. 2. Making data FAIR using InterMine 1. What is FAIR? 2. InterMine and FAIR
  3. 3. What is FAIR? ● A set of data reuse principles first formally published in Scientific Data in 2016 − “Comment: The FAIR Guiding Principles for scientific data management and stewardship” − By Mark Wilkinson, Michel Dumontier, Carole Goble, Barend Mons and 49 others ● Generated by academia, funding agencies, industry and scholarly publishers ● Spurred on by trends such as big data, machine learning, reproducibility, Linked Data and an increasingly diverse and distributed data ecosystem ● Aims to clarify what good data management means, which has been largely undefined. ● Principles apply to data, algorithms, tools and workflows – anything necessary for humans and machines to work with scientific outputs
  4. 4. What are the FAIR principles? 4 Guiding Principles Findability Accessibility Interoperability Reusability Practical entailments expounded in 15 statements, 3 to 4 per principle Subject to some interpretation, not explained further in the paper! Not all apply to InterMine Can be adhered to in any combination and incrementally (a continuum)
  5. 5. InterMine and FAIR ● InterMine already has elements of the FAIR principles ● For example, data in InterMine is ● Findable via search and templates ● Accessible through reports, links, PathQuery and webservices ● Interoperable by cross-references to other mines and sources ● Reusable by giving license information for sources But FAIR has crystallized areas where we can improve and get ahead
  6. 6. InterMine and FAIR ● So in 2016, Gos decided to apply for a grant with the BBSRC in the UK to bring InterMine to the vanguard of FAIR ● Formulated a compendium of improvements to meet FAIR principles ● Grant funds 2 developers (Daniela, myself) for 3 years ● Features released on an ongoing basis as part of mainline InterMine ● Time to engage with the community to promote FAIR InterMine
  7. 7. InterMine and FAIR ● In the rest of this talk: ● I will go through the FAIR InterMine improvements planned for the next 3 years and show how they relate to the FAIR principles ● Discussion of these plans during this talk is extremely welcome – please interrupt me at any time, now and in the future ● Nothing is set in stone ● Details will change as we understand the problems better ● Can adapt and pursue emerging opportunities ● But first, we’ll go through the 15 practical FAIR entailments
  8. 8. Findability F1. (meta)data are assigned a globally unique and persistent identifier F2. data are described with rich metadata (defined by R1 below) F3. metadata clearly and explicitly include the identifier of the data it describes F4. (meta)data are registered or indexed in a searchable resource
  9. 9. Accessibility A1. (meta)data are retrievable by their identifier using a standardized communications protocol A1.1 the protocol is open, free, and universally implementable A1.2 the protocol allows for an authentication and authorization procedure, where necessary A2. metadata are accessible, even when the data are no longer available
  10. 10. Interoperability I1. (meta)data use a formal, accessible, shared, and broadly applicable language for knowledge representation. I2. (meta)data use vocabularies that follow FAIR principles I3. (meta)data include qualified references to other (meta)data
  11. 11. Reusability R1. (meta)data are richly described with a plurality of accurate and relevant attributes R1.1. (meta)data are released with a clear and accessible data usage license R1.2. (meta)data are associated with detailed provenance R1.3. (meta)data meet domain-relevant community standards
  12. 12. Elements of FAIR InterMine InterMine Stable URIs Register URIs externally Ontologies in data model Embed metadata in webpages Add metadata to query results Objects available in XML/JSON Generate RDF Generate bulk RDF Objects available in RDF SPARQL Better linking Better licensing metadata
  13. 13. F1 (meta)data are assigned a globally unique and persistent identifier ● At the moment, InterMine objects each have their own URI (e.g.→ PPARG_HUMAN protein) ● But this incorporates an ID which is not stable over data reloads ● Share option provides a more persistent URI but is not perfect ● Current idea is to construct an ID incorporating the class names and IDs that form a primary key for any object − e.g. − (note, this would not be enough for phytomine with multiple organisms on one taxon) ● Looking to make this the primary URI instead of the one embedding the InterMine ID ● Good enough? What if/when gene IDs change? Stable URIs
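The idea of building an identifier from a class name plus its primary-key values can be sketched as below. This is only an illustration under stated assumptions: the base URL, the path layout and the function name `stable_uri` are hypothetical and are not InterMine's actual scheme, which the slides describe as still under discussion.

```python
from urllib.parse import quote


def stable_uri(base_url: str, class_name: str, key_values: list[str]) -> str:
    """Build a URI from a class name and the values of its primary-key fields.

    Hypothetical layout: <base>/<ClassName>/<key1>/<key2>/...
    """
    if not key_values:
        raise ValueError("at least one primary-key value is required")
    # Percent-encode every part so identifiers containing '/' or ':'
    # stay inside a single path segment.
    parts = [quote(class_name, safe="")] + [quote(v, safe="") for v in key_values]
    return base_url.rstrip("/") + "/" + "/".join(parts)
```

Because the URI depends only on the class and key values, it survives a data reload as long as those keys do. It does not answer the open question on the slide: if a gene ID itself changes, the URI changes with it.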
  14. 14. ● InterMine has internal query capability (search, PathQuery) but limited visibility in external systems without manual action ● Two levels − At whole mine level, provide 1-click ability to register installations with external coarse-grained repositories such as BioSharing, Elixir ● Not so important for established mines that have done this manually − Provide facilities to register top-level mine objects with fine-grained registries such as ● For example, stable HumanMine gene URIs could be registered in their own namespace or as an alternative information source for IDs in an existing collection (e.g. NCBI taxon namespace) ● Questions of control, avoiding spamming, etc. Register URIs externally
  15. 15. ● Currently, InterMine: − is based on the Sequence Ontology − loads ontologies for integration with other data sources (e.g. GO annotations) − But does not provide a way to attach terms to object properties or non-SO classes ● Need to address this as ontologies really drive data integration, FAIR or otherwise ● So will extend data model mechanisms to allow ontology terms to be attached to these ● Will also annotate core model with suitable existing ontologies − e.g. possibly Dublin Core for Publications − WILL NOT create ontologies ourselves Ontologies in data model
  16. 16. ● There is a general problem with finding datasets related to a particular bioentity (gene, protein, etc.) that are spread out over many different databases (especially the long tail) ● A European initiative called Bioschemas is looking to address this − Of which InterMine is an active part ● Idea is to embed extremely basic metadata in webpages (e.g. with JSON-LD) − So that general search engines (Google, Bing, etc.) and specialized indexers can make biological data on a particular subject more findable − Start with basics (DataSet), work up to Gene, Protein, etc. ● InterMine will drive this out of the data model (e.g. create Bioschemas DataSet JSON-LD from InterMine DataSource and DataSet entries). Embed metadata in webpages
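A minimal sketch of generating dataset JSON-LD from model entries follows. It assumes the entries arrive as plain dictionaries with hypothetical keys (`name`, `description`, `url`); the real InterMine DataSet and DataSource fields may differ. The output uses the schema.org `Dataset` and `DataCatalog` types, on which Bioschemas builds.

```python
import json


def dataset_to_jsonld(dataset: dict, source: dict) -> str:
    """Render one data set entry and its data source as schema.org JSON-LD.

    The dictionary keys read here are assumptions made for illustration.
    """
    doc = {
        "@context": "https://schema.org",
        "@type": "Dataset",
        "name": dataset["name"],
        "description": dataset.get("description", ""),
        "url": dataset.get("url", ""),
        # The originating data source is expressed as the enclosing catalog.
        "includedInDataCatalog": {"@type": "DataCatalog", "name": source["name"]},
    }
    return json.dumps(doc, indent=2)
```

The resulting string would typically be embedded in the page inside a `<script type="application/ld+json">` element so that indexers can pick it up.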
  17. 17. ● InterMine returns query and template results as tables, as either CSV, XML, JSON or as rows of results in client APIs ● Metadata provided varies by format – CSV is just the data whilst JSON also provides column headers ● We will look to − Make the provision of existing metadata (column headers) consistent across output formats − Add more metadata, such as ontology term URIs presented in the data model for objects and their fields ● This is important for making data reusable by other systems ● May require changes/extension to webservices/client APIs Add metadata to query results
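One way to picture consistent metadata across formats is to drive every serializer from the same column descriptions. The sketch below is an assumption-laden illustration, not InterMine's webservice output: the column dictionaries, their `header` and `term` keys, and both function names are hypothetical.

```python
import csv
import io
import json


def to_csv(columns: list[dict], rows: list[list]) -> str:
    """Write a header row taken from the shared column descriptions, then the data."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow([c["header"] for c in columns])
    writer.writerows(rows)
    return buf.getvalue()


def to_json(columns: list[dict], rows: list[list]) -> str:
    """Emit the same column descriptions, including any ontology term URI, with the rows."""
    return json.dumps({"columns": columns, "results": rows})
```

CSV can only carry the header text, whereas JSON has room for richer per-column metadata such as a term URI; sharing one column list keeps the two from drifting apart.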
  18. 18. ● Information on InterMine entities is currently only retrievable in a selective manner, i.e. by naming specific view columns ● However, this does not mesh well with the idea of machine findability – instead you have to know which fields you want and write a query to retrieve them ● So will make it possible to go to a bioentity’s stable URI and with HTTP content negotiation or a URL fragment retrieve all the data (and metadata) for that object in XML or JSON (or RDF...) − The machine equivalent of a currently human-only report page ● One challenge is determining how much data gets included − Sections of a report page can have thousands of results and be expensive to generate (for the human report page this is controlled by paging) ● Links to other bioentities will be embedded as their stable URIs Objects available in XML/JSON
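HTTP content negotiation means the server picks a representation from the client's `Accept` header. The following is a simplified sketch of that selection step only; the `SUPPORTED` table and `choose_format` name are hypothetical, and wildcards such as `*/*` are deliberately ignored, so they fall back to the default.

```python
# Hypothetical mapping from media type to an internal format name.
SUPPORTED = {
    "application/json": "json",
    "application/xml": "xml",
    "text/html": "html",
}


def choose_format(accept_header: str, default: str = "html") -> str:
    """Pick the supported media type with the highest q value.

    Ties are broken by the order in which types appear in the header.
    """
    candidates = []
    for position, item in enumerate(accept_header.split(",")):
        fields = [f.strip() for f in item.split(";")]
        media_type = fields[0].lower()
        q = 1.0  # q defaults to 1 when the client does not state it
        for param in fields[1:]:
            name, _, value = param.partition("=")
            if name.strip() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0  # treat an unparsable weight as "not acceptable"
        if media_type in SUPPORTED and q > 0:
            candidates.append((-q, position, SUPPORTED[media_type]))
    return min(candidates)[2] if candidates else default
```

A browser sending `text/html` would still get the human report page, while a script sending `Accept: application/json` would get the machine-readable object from the same URI.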
  19. 19. ● A big stream within FAIR (but not an overbearing part) is that of Linked Data and the Semantic Web ● RDF has a simple uniform data model that describes everything in terms of SUBJECT, PREDICATE, OBJECT triples, commonly referencing URIs − e.g. (,, “4859”) ● In principle, where different data sources share ontologies and URIs to common objects, RDF triples from each source can be more automatically integrated, better meeting FAIR’s interoperability principle than InterMine specific formats. ● Maxime Déraspe from Michel Dumontier’s lab has already done work to ‘RDFize’ 6 model organism InterMines in MOLD. ● We will productize this work and provide RDF as another download format. Generate RDF
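To make the SUBJECT, PREDICATE, OBJECT shape concrete, the sketch below serializes one triple in the N-Triples line format. The subject and predicate URIs used in any call are placeholders, since the URIs from the original slide example did not survive; only the literal "4859" comes from the slide. The function name `ntriple` is hypothetical, and the escaping covers only the most common characters.

```python
def ntriple(subject: str, predicate: str, obj: str, obj_is_uri: bool = False) -> str:
    """Serialize one triple as a single N-Triples line.

    Subject and predicate are always URIs; the object is either a URI
    or a plain string literal.
    """
    if obj_is_uri:
        obj_term = f"<{obj}>"
    else:
        # Escape backslash first, then quote and line-break characters.
        escaped = (
            obj.replace("\\", "\\\\")
            .replace('"', '\\"')
            .replace("\n", "\\n")
            .replace("\r", "\\r")
        )
        obj_term = f'"{escaped}"'
    return f"<{subject}> <{predicate}> {obj_term} ."
```

For instance, `ntriple("http://example.org/gene/1", "http://example.org/vocab/value", "4859")` yields one line that any RDF tool accepting N-Triples can load next to triples from other sources.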
  20. 20. ● Another way for someone to integrate RDF is to download the entire graph of triples for an InterMine installation, or some subset thereof ● They can put this into a triplestore (a specialized database for RDF) with other data and run queries on their local systems rather than relying on federated query (of which more shortly) ● So we will provide a mechanism to generate this bulk RDF that InterMine operators can publish if they choose ● As a separately generated bulk download, this has no impact on runtime InterMine performance. Generate bulk RDF
  21. 21. ● In addition to making entire bioentities available as XML and JSON, we will also make them available as RDF ● This is part of the Linked Data idea, where RDF retrieved from one system will contain stable URIs to objects in other systems which can be navigated to retrieve their RDF (and so on) Objects available in RDF
  22. 22. ● Another way of retrieving RDF is to submit SPARQL queries to a data source ● prefix rdf: <> prefix flymine-res: <> SELECT ?gene ?label WHERE {?gene rdf:type flymine-res:flymine_Gene . ?gene rdfs:label ?label .} LIMIT 20 SPARQL
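A query of the same shape can be assembled programmatically, as in this hedged sketch. The `rdf` and `rdfs` namespaces are the standard W3C ones; the resource namespace passed in is a placeholder, because the prefix URIs on the slide were lost, and the function name is hypothetical. Note that the query needs an `rdfs` prefix declaration for `rdfs:label` to resolve.

```python
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
RDFS = "http://www.w3.org/2000/01/rdf-schema#"


def labelled_instances_query(resource_ns: str, class_name: str, limit: int = 20) -> str:
    """Build a SPARQL SELECT listing instances of a class with their labels.

    resource_ns is an assumed namespace for the mine's classes.
    """
    if limit < 1:
        raise ValueError("limit must be a positive integer")
    return (
        f"PREFIX rdf: <{RDF}>\n"
        f"PREFIX rdfs: <{RDFS}>\n"
        f"PREFIX res: <{resource_ns}>\n"
        "SELECT ?gene ?label WHERE {\n"
        f"  ?gene rdf:type res:{class_name} .\n"
        "  ?gene rdfs:label ?label .\n"
        "}\n"
        f"LIMIT {limit}"
    )
```

The returned string can be sent to any SPARQL endpoint over the standard SPARQL protocol; which endpoint a given mine exposes is outside the scope of this sketch.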
  23. 23. ● This is much like submitting PathQuery to InterMine with the following differences − SPARQL is a standardized query language, so more tools and information ● And so better meets FAIR’s interoperability principle. − SPARQL provides ways to involve data model ontology annotations in queries ● Doing this in PathQuery may require language extensions, needs discussion/thought − SPARQL is generally much more expressive than PathQuery ● So more is possible ● But more complex to understand and there are performance concerns SPARQL
  24. 24. ● Because of performance concerns, plan to provide SPARQL support as a separate Docker image − Building on MOLD work − Explore techniques for better SPARQL reliability ● Such as Linked Data Fragments Optimization approach on client-side SPARQL
  25. 25. ● As mentioned previously, linking between data sources is important in FAIR, since it improves Findability (via following links) and Interoperability (by showing connections between different data sources). ● InterMine already provides some cross-referencing by publishing this information where it is provided by an input data source (chiefly as cross-references to other databases). ● However, we don’t save enough information on the input data source itself to construct links back to that data source (as you know, the right-hand sidebar external links are specified separately outside of the InterMine data model). ● We will address this and provide a way to link bioentities to third-party data sources as well, such as URIs in the Bio2RDF project. ● We will get these links from fine-grained URI registries such as Better linking
  26. 26. ● Currently, for most mines, information on the licensing of the integrated data sources is generated manually and published in the data sources tab. ● As per FAIR’s reusability principle, we want to collect and publish this information more systematically, and in a consistent machine-readable format. ● Work in this area is young: there are pilot and emerging specifications (e.g. the Creative Commons Rights Expression Language and bioCADDIE’s Machine Actionable Licenses) but nothing generally deployed as far as we know. ● But reviewers highlighted this as a major area of concern, with suggestions such as a traffic light signal on report pages showing whether all the integrated data was suitable for reuse. ● So we will look to pursue this objective vigorously. Better licensing metadata
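The traffic light suggestion could be computed from per-source licence information roughly as follows. This is a sketch under assumptions: the three-colour mapping (red if any source forbids reuse, amber if any is unknown, green otherwise) is one possible reading of the reviewers' suggestion, and the input shape and function name are hypothetical.

```python
from typing import Optional


def reuse_signal(sources: dict[str, Optional[bool]]) -> str:
    """Summarize reuse status for the data sources behind a report page.

    Each value is True (reuse permitted), False (reuse not permitted)
    or None (licence unknown or not yet recorded).
    """
    values = list(sources.values())
    if any(v is False for v in values):
        return "red"
    # No sources at all, or any source without licence data, cannot be called safe.
    if not values or any(v is None for v in values):
        return "amber"
    return "green"
```

Treating unknown licences as amber rather than green is a deliberately cautious choice: a missing licence record says nothing about whether reuse is allowed.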
  27. 27. Providing FAIR capability in InterMine Advantages (many of these are classic InterMine advantages) Much of this comes ‘for free’ for existing InterMine operators on update Allows operators to say FAIR if asked Tailoring can be done through configuration Driven by the model which already lies at the heart of InterMine Contributions to FAIR facilities in InterMine from others curated by Cambridge and shared by all
  28. 28. Thank you Questions? Ongoing discussion very welcome @justincc