
Transcript FAIR 3 -I-for-interoperable-13-9-17



#3 INTEROPERABLE covers: -- an overview of the three Interoperable principles, which cover formal languages for knowledge representation, standardised vocabularies, and qualified references to other (meta)data -- resources to support institutional awareness and uptake of the Interoperable principles

Speakers:
1) Keith Russell (ANDS) provides an overview of the key components of Interoperability
2) Simon Cox and Jonathan Yu (CSIRO) present on how they have made the research data in the OzNome project Interoperable, not only for humans but also for machines

Full YouTube recording:



[Unclear] words are denoted in square brackets. FAIR Data webinar series #3: I for Interoperable – ANDS Webinar 13 September 2017. Video & slides available from ANDS website. START OF TRANSCRIPT

Keith Russell: My name's Keith Russell, I work for the Australian National Data Service, and I am your host for today. My colleague, Susannah Sabine, is behind the scenes co-hosting the webinar with me. Just the usual little bit of background: the Australian National Data Service works with research organisations around Australia to establish trusted partnerships, reliable services, and enhanced capability in the research sector. We work together with two other NCRIS-funded projects - RDS, Research Data Services, and Nectar - to create an aligned set of joint investments to deliver transformation in the research sector. So this webinar is part of a series of activities we are undertaking which aim to support the Australian research community in increasing our ability to manage our research data as a national asset. As I mentioned earlier, this is the third in a series of webinars around FAIR. We've already had the webinars on findable and accessible, today is interoperable, and next week reusable.
So today I will give a brief introduction to what interoperable means as described under the FAIR data principles from FORCE11. Then I'm very grateful that Simon and Jonathan are available to talk about what they did in practice in the OzNome project to make their data interoperable. I think it's a great example to show how this quite complex topic can actually be carried forward in practice.

So this is what FORCE11 says about interoperable, and first of all a few things to keep in mind, just reiterating a few points from the very first webinar. As you look at these headings you'll see that they talk about data and metadata, so interoperable applies both to the metadata describing the data collection and to the actual data itself. Another point to keep in mind is that throughout the FAIR principles they think a lot about data being usable not only for humans, but also for machines. That provides huge benefits in bringing together disparate datasets, in bringing together bits of knowledge that are distributed over different datasets. Interoperable is a key element there to make sure that data can be brought together, so that we can get those benefits out of bringing data together: new knowledge discovery, new relationships to be discovered, new patterns to be recognised. So as we look at the three headings listed under interoperable, the first one is that data and metadata use a formal, accessible, shared and broadly applicable language for knowledge representation. The point to keep in mind there is that not only for you as the researcher that has created the data, but also for another researcher that wants to understand and use the data, it's useful that they understand the language you have used.
That is, a standardised language, something that other users can also pick up and use. So ideally that is the case for the metadata -
sorry, that is definitely the case for the metadata, and ideally that would also be used in the actual data itself. A very basic example: if researchers observe a magpie they can write in, I saw a magpie. But it's much more useful for a researcher somewhere on the other side of the world if you write in that it's an Australian magpie, and that is Cracticus tibicen. Using a standard language, a researcher on the other side of the world will actually be able to better understand what you meant and what that description is about. Now it's not just the actual wording used, the vocabulary used; it's also useful to have a framework around that which will allow the data to be machine readable, so it can be picked up by machines and interpreted. One obvious example which gets mentioned quite a lot is using RDF and ontologies. That is quite common in the life sciences, and a number of life science researchers were quite active in the FORCE11 group. But one thing they emphasise is that it doesn't have to be through RDF and ontologies; there might be other solutions, and they don't want to make it exclusive to those technologies. So that's something to keep in mind. Regarding making data interoperable, that's what I've invited Simon and Jonathan to come and talk about, and they'll be able to cover it in much more detail. The second point here is around using vocabularies. They emphasise that if you use a vocabulary, first of all try to use one that already exists and is agreed on by the community. If you have terms that are not in that vocabulary, but it otherwise fits, try to get them added to that vocabulary. Finally, if that is not possible, then, and only then, start creating your own vocabulary.
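As a toy illustration of that idea (not from the webinar itself), a machine-readable version of the magpie observation can link the record to a shared taxon concept rather than free text. The example.org URIs and the predicate name below are invented placeholders, not real vocabulary entries:

```python
# Hypothetical sketch: expressing "I saw a magpie" as an RDF-style
# statement against a community taxon vocabulary, stdlib only.
# All example.org URIs are illustrative placeholders.

def ntriple(subject: str, predicate: str, obj: str) -> str:
    """Format one RDF statement in N-Triples syntax."""
    return f"<{subject}> <{predicate}> <{obj}> ."

stmt = ntriple(
    "http://example.org/obs/42",                  # this observation
    "http://example.org/vocab/observedTaxon",     # hypothetical predicate
    "http://example.org/taxa/cracticus-tibicen",  # shared concept for Cracticus tibicen
)
print(stmt)
```

A consumer anywhere in the world can resolve the taxon URI to the same definition, which is what makes the statement interoperable rather than merely readable.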
So please don't go out and create vocabularies for everything. Rather, look whether there is already a community-agreed vocabulary. Also make
sure that the vocabulary itself is FAIR: findable, accessible, interoperable, reusable. So your dataset should include a reference to the vocabulary you are using, and that vocabulary should remain findable for as long as your dataset can be found. The final point they make is that the data and the metadata should include qualified references to other data and metadata. What they mean there is that it shouldn't just be a reference to another dataset, for example, but also an indication of what that relationship is. So not just that it's somehow related to this other dataset, but perhaps that it is a subset of another dataset, or that it builds on another dataset, using standardised terminology. A little more on qualified references: from the perspective of the metadata especially, it's valuable not only to refer to other players or other elements around your dataset, but to do so using identifiers. For example, if you are describing your dataset and saying that somebody was involved in creating it, provide a qualified reference saying that that person was, for example, the author of the dataset, and if possible also use an identifier to identify that person. That allows other relationships and further connections to be made, and that information to be picked up and used, especially when being analysed by machines. So just a list here of possible identifiers; these are just examples, there are more identifiers out there. For example, if you are referring to an author include their ORCID; if you are referring to a publication use the DOI related to that publication. If you are referring to software, nowadays you can assign a DOI to a software package and refer to that DOI, et cetera. Well, I think I've rambled on enough for now.
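To make the "qualified reference" idea concrete, here is a minimal sketch of dataset metadata carrying both the relationship type and a persistent identifier for each related entity. The ORCID and DOI values are made-up placeholders, and the key names are illustrative rather than from any particular metadata standard:

```python
import json

# Each reference carries *what* the relationship is plus a resolvable
# identifier, so a machine can follow and interpret the link.
# Identifier values below are placeholders, not real PIDs.
metadata = {
    "title": "Example observation dataset",
    "creator": {
        "role": "author",  # the qualified relationship
        "name": "A. Researcher",
        "identifier": "https://orcid.org/0000-0000-0000-0000",  # placeholder ORCID
    },
    "relatedDataset": {
        "relationType": "isSubsetOf",  # not just "related somehow"
        "identifier": "https://doi.org/10.0000/parent-dataset",  # placeholder DOI
    },
}
print(json.dumps(metadata, indent=2))
```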
So I would like to hand over to Simon and Jonathan, I'm very grateful that they have made their time available. So just a brief introduction. Simon is a research scientist at CSIRO Land and Water's Environment Information
Systems research program. He specialises in distributed architectures and information standards for environmental data, focusing on geosciences and water. Jonathan Yu is a research computer scientist specialising in information architectures, data integration, linked data, semantic web, data analytics and visualisation. He's part of the Environmental Informatics Group in CSIRO Land and Water. Together they have been very active in applying their thinking around making data interoperable in the OzNome project. One thing I want to point out is that in the OzNome project they did a whole series of work around the FAIR data principles in all their different aspects. Today I have asked them specially to focus on interoperable, but please keep in mind that they have also done a whole bunch of other work. So without any further ado I'd like to hand over to Simon and Jonathan. I'm very intrigued by how they've picked up interoperability and used it in the OzNome project.

Jonathan Yu: Okay, thanks Keith, and thanks for the introductions as well. Today we'll be presenting on some of the work we did in the OzNome initiative, particularly looking at Land and Water and the data that we have in CSIRO, and how to make that interoperable according to some of the principles that FAIR espouses. As we will talk about, we have also explored some implementations that turn the FAIR principles into actionable questions to assess how FAIR your data is. So if you haven't come across OzNome, this is a CSIRO-led initiative aiming to connect information ecosystems throughout Australia. The OzNome name was coined echoing the genome project: Oz being Australia, and the Nome inspired by genome. But really what we're looking at here is tools, services, products, methods, approaches and practices, and infrastructure to support having more connected information infrastructures.
In the previous year, as Keith mentioned, we focused on environmental information infrastructures. There's a couple of links there you can follow. Today we'll be talking about an example in the water space.

Simon Cox: Okay, so as part of establishing the OzNome architecture and infrastructure, we felt that we needed to assist potential data providers to understand what good data is - in the context of this seminar series, what FAIR data is; we called it OzNome data. Basically we developed a set of rating criteria and a tool to allow data providers to assess the data that they are providing. On the right-hand side of the screen here you can see a screen capture of the kick-off page of the tool. You'll also notice that we've got a slightly adapted version of the FAIR criteria - findable, accessible, interoperable and reusable - but we also add in the last line there, trusted. That appears to go a little bit beyond what has been conceived in FAIR until now, but we suggest it would be a useful addition. We're bundling interoperable and reusable together; we see those as being very closely related. Obviously, it's teasing out some of the issues around what it is that makes data interoperable. Keith's given a high-level overview and indicated what some of the concerns might be. We've done our own take on this, leaning fairly strongly on our experience of more than a decade now of working in the data standards communities, in particular the geospatial data standards communities, and applying some of the learning from there directly here. Our heritage, where we've largely been working, is environmental data, and a lot of that is geospatial, so it makes sense to be building on that.
Just a bit of a reminder, the FORCE11 FAIR principles, this is a summary slide from Michel Dumontier, who's one of the original authors of the papers and the developers of the FAIR principles. They
got the guiding principles with the four key words, teased out into three or four sub-principles in each case under the F, A, I and R letters. We're looking at the interoperable set here, which Keith has already shown. It's interesting that Michel has recently done a study evaluating a number of repositories, particularly in Europe though some are broader than that; here's the list of repositories that were evaluated. He scored those on the FAIR principles - the data's actually available in this form, this table, which shoots off to the right of the screen, and there's lots more going on there. But looking at the summary of the results, it's fairly notable that the tallest red bar here is in the interoperable category. So what this is saying is that of the FAIR data principles this is the one which is hardest to meet, the one that's hardest to conform to. So really that's the focus of the approach that we've taken, which is to lead people through how they can make their data more FAIR, more OzNomic, more interoperable. The particular way in which we've broken out the question of interoperability - if you look at the numbered terms here - is: is it loadable, is it usable, is it comprehensible, is it linked, as well as is it licensed. I'm just going to go through some of the details of those, and you'll sense it's fairly repetitive of some of the concerns that Keith explained at the beginning. But we're putting some more concrete examples onto these criteria, just to indicate to our data providers that when we say a standard data format we mean something like CSV or JSON or XML or netCDF. These are all important file formats; the ones towards the left are kind of general, while netCDF is one that's used a lot in the remote sensing and environmental science communities.
So we've got a bit of a ladder here of different levels of conformance which you can reach about whether a dataset would be loadable. Is it in a unique file format? Well, that means that you've got to have some
unique software to load it. Or is it in a standard data format, which would normally be denoted by one of the standard MIME types? Best of all would be for data to be provided in multiple standard formats, giving users a choice so they can use whatever their favourite platform for loading data is. Next question: even when you've loaded it, can you use it? If the structures within the dataset are unclear, even once it's loaded, then it's not going to be very usable. That comes down to the matter of whether a schema is provided which makes explicit the structures within the dataset. In a lot of traditional data there's a structure in there, but the schema's not available independently of the data - if you like, the schema is implicit. It's not formalised. The schema is maybe different every time. A lot of spreadsheets are done that way: a spreadsheet has got a lot of boxes, but if every time you use it you add different columns and use the pages in the spreadsheet in a different way, then it takes a little while for users to get their heads around what's going on before they can use it. So there are various explicit schema languages, like DDL, which is used for relational systems, and XML Schema. There's something coming out in the open knowledge world these days called data packaging, which allows you essentially to describe a schema for a CSV file. Then in the RDF and semantic web space you've got RDFS and OWL. Even JSON has a schema language these days, although it's not broadly used. So it's nice to provide data with a schema, but best of all would be to say: for this data, I am using a community schema. For example, the Open Geospatial Consortium provides a number of community schemas for observations, for time series, for hydrology, for geoscience.
If you're publishing or attempting to share data in any of these disciplines then best to go off and find a community schema.
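As a sketch of the "data packaging" idea mentioned above, a Table Schema-style descriptor can make a CSV file's implicit structure explicit. The field names and descriptions here are invented for illustration, not taken from any real dataset:

```python
# A minimal Table Schema-style descriptor for a CSV file, so consumers
# don't have to reverse-engineer the columns. Invented example fields.
schema = {
    "fields": [
        {"name": "site_id", "type": "string",
         "description": "Monitoring site identifier"},
        {"name": "obs_date", "type": "date",
         "description": "Date of observation"},
        {"name": "value", "type": "number",
         "description": "Observed value in the declared unit"},
    ]
}

# Published alongside the CSV, the schema lets a consumer verify the
# header row matches the declared structure before loading the data:
csv_header = ["site_id", "obs_date", "value"]
declared = [field["name"] for field in schema["fields"]]
assert csv_header == declared
```

The design point is that the schema travels separately from the data, so every file following it can be loaded the same way, rather than each spreadsheet inventing its own layout.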
Then even when you've got it loaded and you understand what the structures are, you've still got the question of what the words and numbers inside the boxes are. Are the column headings explicit enough to understand, or are they just shorthand for something which the project leader, when he or she was developing the data, knew they would understand the next week - but which even he or she, coming back to it the next year, may not understand? Best, of course, is if the field labels are linked and do have explanations, probably in plain text. Better still is to use standard labels, for example the Unified Code for Units of Measure (UCUM) unit codes, or the Climate and Forecast conventions coming out of the FluidEarth community. So the ladder that we've got here asks: are you using standard labels? Are just some of the field names linked to standard, externally managed vocabularies, or are all of the field names linked to standard, externally managed vocabularies? The ladder gets better and better. Then there's the question of how well linked your data is. If it's just a file sitting on a server somewhere and there's no links in or out, well, you're lucky to find it. What this community would expect for most datasets is that they are indexed in a catalogue or available from a landing page; that's the situation where you've got inbound links to the dataset. Best of all is when there are outbound links embedded or implicit in the data structures of a dataset which say exactly how it's related. This links in with some of the previous concerns we had there about field names and these kinds of things. So I'm going to hand back to Jonathan to tease through a case study that we've got here based on the AWRA-L - the Australian Water Resources Assessment datasets. So Jonathan.
Jonathan Yu: Yes, so as mentioned earlier, in the OzNome project we looked at a practical example and a case study in the AWRA-L dataset. This is a
continental cell dataset that has historical time series from 1911. The Bureau published an operational version online, and you can find that on their website. But often scientists have to deal with this dataset by knowing where it is and knowing how to use it implicitly - knowing how to reference the requisite geospatial features and understand the field name values. So - oh sorry, the next slide shows the assessment of it using our tool. Just focusing on the interoperable side of things, we have rated it as a web service: you can get it via the web. However, the reference definitions are text only, and they are localised in the dataset itself. I'll give an example in the next slide. This is coming out of the netCDF metadata of this dataset; you can access this online through THREDDS or via netCDF tools. This is a summary of the metadata that comes along with the data. We've got the long name here, Potential evapotranspiration; we've got the name, which is a label for the field, e0_avg; units, mm; and a standard name, which is a netCDF convention for referring to the actual property, here also e0_avg - which in this case isn't part of the CF conventions often used with this format. So if you are an expert in this area and you've used this dataset many times, you will know what this is. If you are a newcomer you have to do a little bit of work to understand what this data field actually means. In the OzNome project what we did was enrich this with externally defined variables. So if you go to the next slide, Simon - this is the same field. The added lines at the bottom here tease out what this particular data field means in the context of externally defined vocabularies. So we've now enriched this with a scaled quantity kind identifier, Potential evapotranspiration.
It's an http URI which you can resolve to get a definition. Similarly for substance or taxon, unit ID and feature of interest.
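The enrichment step can be pictured as adding linked attributes alongside the original inline ones. The added attribute names and the example.org URIs below are illustrative stand-ins based on the description above, not the exact AWRA-L convention:

```python
# Original, dataset-local netCDF attributes as described in the talk:
original_attrs = {
    "long_name": "Potential evapotranspiration",
    "name": "e0_avg",
    "units": "mm",
    "standard_name": "e0_avg",  # local label; not a CF standard name
}

# Added attributes pointing out to externally defined vocabulary terms
# (placeholder URIs; each should resolve to a definition):
linked_attrs = {
    "scaled_quantity_kind": "http://example.org/def/property/potential-evapotranspiration",
    "substance_or_taxon": "http://example.org/def/substance/water",
    "unit_id": "http://example.org/def/unit/mm",
    "feature_of_interest": "http://example.org/def/feature/land-surface",
}

enriched_attrs = {**original_attrs, **linked_attrs}
```

A newcomer's tooling can now dereference each URI for a definition, rather than guessing what e0_avg means from the label alone.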
We'll just talk about what they are. So part of the project was to explore whether we could define vocabularies for these, from which we could reference outbound links from the data to the definitions. This is just a summary of what we did in the context of the AWRA-L dataset, with potential evapotranspiration as the example. We've got a conceptual model here with broader notions of potential evapotranspiration, and we've got linked relationships out to things like feature of interest, object of interest, and unit of measure. So this view provides a vocabulary entry for potential evapotranspiration: not only the identifier for it, not only the description for it, but a richer model than you would get if you just had something inline. You've got outbound relationships from this concept to its related concepts, essentially. So this is a demonstration of defining the concepts externally, having them quite richly explained through this medium, but having the ability to link from the dataset itself to this definition to make it more interoperable - so that if we had another dataset that talked about potential evapotranspiration, it could potentially be linked and interoperable. Doing a revised OzNome maturity estimation using the OzNome five-star tool, and just focusing on the interoperable field, assessing against the same criteria we've gone up from two stars to more than four stars in the interoperable space. The reason for that is that we now have reference definitions as linked data and externally hosted observed property vocabulary definitions, rather than just inline labels. That provides more interoperability, and if the vocabulary were standardised then we would have a higher estimation in that field. But it's just a demonstration of how we went about making something more interoperable through the OzNome project.
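A vocabulary entry of the kind described above - an identifier, a human-readable definition, and outbound links to related concepts - might look like the following SKOS-flavoured sketch. All example.org URIs are placeholders, and the definition text is our own paraphrase rather than the published entry:

```python
# SKOS-style vocabulary entry with outbound relationships to related
# concepts; all example.org URIs are illustrative placeholders.
vocab_entry = {
    "@id": "http://example.org/def/property/potential-evapotranspiration",
    "prefLabel": "Potential evapotranspiration",
    "definition": ("The evapotranspiration that would occur "
                   "if sufficient water were available."),
    "broader": "http://example.org/def/property/evapotranspiration",
    "featureOfInterest": "http://example.org/def/feature/land-surface",
    "unitOfMeasure": "http://example.org/def/unit/mm",
}

# A dataset points at the @id; a second dataset pointing at the same
# @id is, by construction, talking about the same property.
print(vocab_entry["@id"])
```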
Simon Cox: Yeah, I'll just pick up at the end here and just comment that when we were starting this data ratings exercise we actually didn't look at FAIR at the beginning. We developed our own set of criteria, these key
words here, and then subsequently correlated them with the FAIR principles. One of the interesting things was that there were three lines in this table - the ones in red - which didn't correlate with concerns that had been identified within FAIR. The first one might be seen as trivial, but we thought it was a question worth asking, particularly when working with research scientists and talking about making their data available: is your data intended to be used by anybody else? There's lots of data generated which is never shared. Now that's not necessarily a good thing, and to a certain extent having the question there highlights the fact that there is a question to be asked, and that some scientists - researchers - need to be encouraged to think about making their data available, about publishing it. So in terms of the FAIR principles this one was kind of the implicit starting point: if it's published, then implicitly it can be FAIR. Another concern comes up particularly because we've worked a lot with agencies that have systematic data collection processes, with systematic curation and maintenance: a dataset is refreshed every day or every month or every year, and so on. That concern didn't seem to be particularly addressed in the FAIR principles as they stand. So we'd say the concern about whether the data is expected to be updated and maintained goes maybe a bit beyond FAIR. The bottom row there, as well, is the concern about - this is, if you like, an elaboration of the assessment of data that you might do - getting some information about how well trusted it is. Now a lot of that is about who else is using it; that's often the criterion you'll use.
Who else is using it, how many times is it being used, what other products have been generated from this dataset and so can I trust it?
So just emphasising that the row there on standards corresponds with interoperability, which is what we've really been focusing on today: the use of standards, I guess. Standards is a funny word; you have to be a bit careful with it. Capital-S Standard - sometimes people think that's just to do with ISO or Australian Standards or whatever. Really the point about standards is that they are community agreements, agreements which are available for additional members of the community to join in on. It's important to think of them as agreements - agreements to do things in a common way. So finally, just a slide with some links to some of the material we've been showing today. We'll say thank you for listening.

Keith Russell: Thank you Simon, thank you Jonathan. That was really interesting and a really useful way to see what it actually means in practice. Because I think interoperable can be quite a complex, difficult subject, sometimes also one that requires much more knowledge of the actual field of research you're talking about. So I think this is a great example of working in a specific field to try and make that data more interoperable. Thanks very much for your time; this was a really interesting discussion, really starting to tease out a number of the issues and a number of the things that will probably need developing further. I've just put up a slide which links off to a number of resources, some of which Simon already mentioned. ANDS has a service, Research Vocabularies Australia, which anybody around the country - or actually internationally as well - can use if you don't have your own tool to set up a vocabulary. That is a possible way of doing it, and there are also already existing vocabularies in there, so have a look at that if it's of interest. We also have an interest group that works in this space.
If you are looking at the metadata and having qualified relationships within the metadata and using identifiers, there's a few links there to
places where you can find information about possible identifiers. We are also trying to pull together that metadata describing datasets and share it internationally through a number of hubs. That's taking place through the Scholix project. Research Data Australia is sort of an Australian hub contributing into that international effort. So have a look there if you're interested. We did 23 research data things last year, and two of those things are relevant for our discussion today. If you are interested in digging in a little further and discovering what the vocabularies mean in practice, have a go at Thing 12. Or if you are more interested in identifiers and linked data, have a look at Thing 14. Finally, I would like to first of all thank Simon and Jonathan again for their time and for the excellent presentation and the insights that they brought to the table. We would also like to acknowledge NCRIS, the National Collaborative Research Infrastructure Strategy program, which provides the funding for ANDS. So thanks again, and I look forward to seeing you all next week. END OF TRANSCRIPT