Transcript - Tracking Research Data Footprints via Integration with Research Graph

Slides and full video available from the ANDS website: http://www.ands.org.au/news-and-events/presentations/2018


[Unclear] words are denoted in brackets

Webinar: Tracking Research Data Footprints via Integration with Research Graph
1 March 2018
Video & slides available from ANDS website

START OF TRANSCRIPT

Facilitator: Good afternoon everyone, thanks for coming to the webinar today. We have a talk today on the topic of tracking the footprint of research data across infrastructures, using the Research Graph API. The speakers today are Doctor Ben Evans from NCI, Associate Director of NCI, and Doctor Jingbo Wang who's a Collection Manager in NCI. So with that introduction, I'll actually hand over the talk to Ben for starting the talk.

Ben Evans: So we're going to be talking about work that's going on to help track research data and how it's used in a broader setting. I should mention, NCI's got a lot of partners as a part of this that have been backing and worked with us in this, including from NCRIS and Bureau of Meteorology, Geoscience Australia, CSIRO, the ANU and a host of other partners and collaborators, including ANDS in particular, for this work.

So some of the open questions, motivating questions, beyond just getting data management in place is - so say you publish data and datasets, is how is the research community actually connecting with that data? After you've put it into a public arena they could be connecting with it in various ways and making [use], so how do you track that? Also, how do you track the impact of that investment of that
research data for other derived products downstream? So that's a challenging question that we can't answer fully [with inside] of a single centre; you're really into an international world. That motivated us a lot to be working on this particular project which is part of the solution.

So I should say that the standing of this work and this piece of infrastructure that we'll be going through on Research Graph started with a fairly small partnership. But now it's grown quite a bit and RDA, Research Data Alliance, have picked it up as this Registry Interoperability Working Group. It's got a number of players. You can see some of the players who've been strongly supporting this work over a period of time listed there and you can follow that link on RD Alliance website to track this. But, furthermore, now really through Amir's good work and others, the European Commission have picked this up and said, yes, this needs to now be pushed into an ICT specification. So all that is to say that this work is now on a pretty strong pathway and well worth paying attention to now as it goes forward.

So there's four types of what we call nodes in this graph network when you're publishing data and using data. So one is the researcher, one's the dataset, one's the publication, one's grants. There could be other nodes as well, but the status of these whole graphs at the moment is basically built up of those fundamental areas. When we get down to it inside of the tool, you can see the attributes through that graphic on the right-hand side. Researcher is always in green and datasets are in orange, publication is blue and grants in yellow. You can see some of the attributes that are listed there and we'll talk about that.

The other thing is that this graph network that's been built up understands very well-known metadata standards like ISO 19115-4; that's geospatial data, a lot of geospatial data fits into that.
But also things like RIF-CS that's used in the librarian world, and inside of Research Data Australia - if you know that catalogue - uses RIF-CS, and MARC 21 and there are others as well. So just to say that this graph system is already supporting that framework.
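As a rough illustration of the four node types described above (researcher, dataset, publication, grant), here is a minimal in-memory sketch in Python. It is not the Research Graph schema or its Neo4j model; the class names, the key format such as `nci:dataset/bran`, and the placeholder researcher are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from enum import Enum


class NodeType(Enum):
    # The four node types named in the talk.
    RESEARCHER = "researcher"
    DATASET = "dataset"
    PUBLICATION = "publication"
    GRANT = "grant"


@dataclass(frozen=True)
class Node:
    key: str        # hypothetical identifier format, chosen for this sketch
    kind: NodeType
    title: str


@dataclass
class Graph:
    nodes: dict = field(default_factory=dict)
    edges: set = field(default_factory=set)

    def add_node(self, node):
        self.nodes[node.key] = node

    def relate(self, a, b):
        # Store an undirected "related to" link between two known nodes.
        if a not in self.nodes or b not in self.nodes:
            raise KeyError("both nodes must be added before relating them")
        self.edges.add(frozenset((a, b)))

    def neighbours(self, key):
        # Keys of every node directly linked to the given node.
        return sorted(
            other
            for edge in self.edges
            if key in edge
            for other in edge
            if other != key
        )


graph = Graph()
graph.add_node(Node("nci:researcher/1", NodeType.RESEARCHER, "Example Researcher"))
graph.add_node(Node("nci:dataset/bran", NodeType.DATASET, "Bluelink ReANalysis"))
graph.relate("nci:researcher/1", "nci:dataset/bran")
print(graph.neighbours("nci:dataset/bran"))
```

Keeping edges undirected mirrors the way the talk describes a researcher as "somehow associated" with a dataset; a real deployment would more likely carry typed, directed relationships.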
For NCI, we make a number of major national reference datasets available on NCI. We've curated them and put them into a certain form. They come, in principle, from a lot of the science agencies, being Bureau of Meteorology and Geoscience Australia and so forth, also sometimes from our research community itself. But they're being classified as really the major national reference collections that are associated with NCI. You can see some of the things listed there, climate, weather and satellite imagery, bathymetry, elevation, all of these earth systems, geospatial data in particular.

As an example of a dataset now is - so we've got this thing called Bluelink ReANalysis dataset. On the left-hand side it gives you a summary of what it is. On the right-hand side many people are familiar and work with catalogue systems, so we're using GeoNetwork as part of our core catalogue system. So you get the title, so that's the blue - you can see on the right-hand side it's circled there and an abstract about it. You can see points of contact. So this is all part of this ISO 19115 standard, that's how all of this is recorded, how to get hold of that data.

So the question that you've got off something like this is what researchers are working on that, or related datasets, how they're publishing, is there anything else connected to it. So you end up with this little graph of stuff. Just down on the bottom right-hand side here, just off this basic diagram here, you can see [Peter Oak], who's the main contact for that dataset, is somehow associated with this BRAN - Bluelink ReANalysis - dataset. So they're somehow associated with that even off our local information.

So you can find out a little bit more about Peter. We have other information systems that have got Peter's details, so what project he's working on, publications somehow linked to him, his contact detail and a pretty picture there of Peter looking very spritely. So we have that information in NCI.
So on the left-hand side, in this dotted line that you can see with the NCI logo around it, we know a fair bit about Peter, that's the number one with the green, there he is.
There he is with his - as a researcher and an identity and attributes inside of our local information. We know various things about datasets that Peter is associated with. But there's other things that live outside of NCI. In particular, on the right-hand side there, you can say out in the real world, or out in the external world, Peter Oak has what's called an ORCID ID, and many of you know this. Inside of - associated with his ORCID ID we know things about his publication record.

So the trick for all of this stuff is to try and associate our internal information to the external information. There's a number of steps that we go through here. Number one, let's have the information recorded inside of a little graph that we'll go through in a second. Then we can augment the graph with how it gets connected up with the ORCID ID. Then we can find out further information, in particular about other external records like his publication record.

So almost redescribing this same [step] is, in a fundamental way what we do is we've got a GeoNetwork catalogue with a lot of this information; the utilities in the Research Graph system harvest that and put it into Neo4j, which is a type of graph database, just the one that we happen to be using for this. That Neo4j is just hosted inside of the cloud. That has our information, it's just a recasting of the local information and put inside of this system. Then what we do is go out into a broader Research Graph on the outside world, and we augment then the local graph database with that extra information. Then we can visualise it in various ways.

So that's what this image - and there is a graphical tool that comes along with this, to start seeing a whole bunch of connected things to do with this data that can start to be exploited. So if we just had the local information of various datasets, then all we would have is the left-hand side of this.
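The augmentation step described above can be sketched, under stated assumptions, as a merge keyed on a shared ORCID identifier. This is not the Research Graph tooling or a Neo4j query; the dictionary layout, the `augment` function, the node keys, and the all-zero ORCID placeholder are hypothetical and exist only to show the idea of linking a local researcher to an external record and pulling in that record's neighbours.

```python
import copy


def augment(local, external):
    """Return a copy of the local graph with matching external records added.

    A local researcher is matched to an external researcher when both carry
    the same ORCID. The external researcher node and everything directly
    linked to it (for example publications) are copied into the result.
    """
    merged = copy.deepcopy(local)
    orcid_to_local = {
        node["orcid"]: key
        for key, node in local["nodes"].items()
        if node.get("orcid")
    }
    for ext_key, ext_node in external["nodes"].items():
        local_key = orcid_to_local.get(ext_node.get("orcid"))
        if local_key is None:
            continue  # no local researcher shares this identifier
        merged["nodes"][ext_key] = ext_node
        merged["edges"].append((local_key, ext_key))
        for a, b in external["edges"]:
            if ext_key in (a, b):
                other = b if a == ext_key else a
                merged["nodes"][other] = external["nodes"][other]
                merged["edges"].append((ext_key, other))
    return merged


PLACEHOLDER_ORCID = "0000-0000-0000-0000"  # placeholder, not a real person's ORCID

local = {
    "nodes": {
        "nci:researcher/1": {"type": "researcher", "orcid": PLACEHOLDER_ORCID},
        "nci:dataset/bran": {"type": "dataset"},
    },
    "edges": [("nci:researcher/1", "nci:dataset/bran")],
}
external = {
    "nodes": {
        "orcid:researcher/1": {"type": "researcher", "orcid": PLACEHOLDER_ORCID},
        "orcid:publication/1": {"type": "publication"},
    },
    "edges": [("orcid:researcher/1", "orcid:publication/1")],
}

merged = augment(local, external)
print(sorted(merged["nodes"]))
```

The external researcher is kept as its own node linked to the local one rather than being folded into it, so the provenance of each record stays visible in the merged graph.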
Through that extra augmentation, going and querying in the international Research Graph and then augmenting for the local data, we end up with a much richer set of information about what each of the individual
datasets and researchers and what they're doing and their associations. So that's pretty simply what's going on.

The Research Graph system that's been put in place really by the partners, and particularly Amir driving this, interoperates with a whole bunch of different services; ORCID, DataCite, Scholix has come on board, and other major datacentres like [ASIS] and so on and so forth. So there's a list there, and a growing list, of information being put into an interoperable graph system. So now there's richer and deeper details that we can start harvesting. There's actually - we did the simplest augmentation, is the description on this previous page. But, actually, you can run several levels of augmentation and we're still I guess trying to explore what's the best way of augmenting the data of the questions that we're trying to face.

So, look, I'm going to hand over now to Jingbo who's going to take us a little bit more through some of the details of Research Graph and where it's going.

Jingbo Wang: Thank you, Ben. Hi, from this point of time I wanted to go through a couple of slides, in the next 10 minutes or so, to demonstrate how we implement the Research Graph [pack line]. Also, report what are we currently working on, plus some future plans going forward.

So in this slide, it shows you what is the input and what is the output. The input is NCI's metadata database. As you see in the previous slides by Ben, our dataset available in GeoNetwork in various formats - it could be CSV or XML or JSON - they are the input so that the Jenkins server takes that input from the [data hub] and builds the NCI graph. So the output will be NCI graph. On the right-hand side, the bottom screenshot just shows you how easy it is to maintain and update the database with only one click of the button.
The five different modules, in green colour, shows you the step-by-step inside of the Jenkins server to build the NCI graph and also augmentation with other database such as a geo - [ORCID]. So what we get eventually is an NCI graph [ML]. There are different ways
to visualise the graph. One way, which was not presented here, is we can use the [GAVI] software to visualise. But a more popular way would be we present our graph in a web-based format. So if you click that link or type this link in your browser, you can actually see this is online. I'm going to show you three screenshots on this webpage, followed by a little live demo afterwards.

Basically, this is the interesting part, once we get the graph and we're going to analyse the graph and try to tell the story from the graph. The first screenshot just really gives you an overview of how many publications in our augmented graph and how many datasets and how many researchers here. I'm going to run a little live demo to repeat the story that Ben told you about Peter Oak. If you type this, researchgraph.org/NCI.

Jingbo Wang: Alright, in the web browser you can see a webpage about NCI's graph. Click that orange button, it'll open a new tab to show the graph. This is what the actual graph looks like. If I find Peter Oak as a researcher and click that one, it only shows the connection with this researcher. The colour code of the dot is that this is the dataset which is the Bluelink ReANalysis data associated with Peter Oak. If you notice, there is another green dot over here and this is the augmented part from ORCID. The blue dot represents the publication associated with this researcher. So this really demonstrates that, through the augmentation, our own database with the dataset and researcher are connected to the rest of the world.

Let me go back to my presentation again. I should say that we did play around with the different analytics and this is the most interesting part. We demonstrate a few cases that we think people are interested in. For example, which researcher has the most publications, and this researcher is always identified with the ORCID ID. Also, which researcher has the most datasets associated with him, with his affiliation.
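The two analytics questions mentioned above (who has the most publications, who has the most datasets) reduce to counting typed neighbours in the graph. The sketch below shows that counting in plain Python; it is not the notebook code behind the demo page, and the node keys, the `count_linked` helper, and the toy data are assumptions made for illustration.

```python
from collections import Counter


def count_linked(node_types, edges, source_type, target_type):
    """Count, for each node of source_type, its direct links to target_type nodes."""
    counts = Counter()
    for a, b in edges:
        # Edges are undirected, so check both orientations.
        for src, dst in ((a, b), (b, a)):
            if node_types[src] == source_type and node_types[dst] == target_type:
                counts[src] += 1
    return counts


# Toy data: keys and links are invented for this example.
node_types = {
    "researcher/a": "researcher",
    "researcher/b": "researcher",
    "publication/1": "publication",
    "publication/2": "publication",
    "publication/3": "publication",
    "dataset/1": "dataset",
    "dataset/2": "dataset",
}
edges = [
    ("researcher/a", "publication/1"),
    ("researcher/a", "publication/2"),
    ("publication/3", "researcher/b"),
    ("researcher/b", "dataset/1"),
    ("researcher/b", "dataset/2"),
]

publications = count_linked(node_types, edges, "researcher", "publication")
datasets = count_linked(node_types, edges, "researcher", "dataset")
print(publications.most_common(1))
print(datasets.most_common(1))
```

In a graph database the same question would normally be a single aggregate query; the loop form is used here only so the example runs without a database.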
On the right-hand side, if you are still with the web browser,
you can actually put your mouse onto some of the names. It will only show the connections between this researcher and other researchers. So it's more like an interactive mode. I should also say that this augmentation is still work in progress. It means that we can augment with other databases, such as DataCite or other European data repository, and we can actually make our graph bigger and bigger.

The last screenshot is just showing the number of publications along the year. As I said, this is not a static graph because we can always augment with other database and we can introduce more publication if it is not in the ORCID database. So behind the scene we use the Jupyter Notebook to generate this web interactive format. We plan to play around more by providing maybe predefined query, so that people can put the person's name on ORCID, find out what is the connection between this researcher and the publication and the dataset and, in the future, even the grants if it's available in our database.

So next is we think that Research Graph can be useful for a number of different groups of people. We think also providing Research Graph in the linked-data format would be beneficial for people who want to work with more machine-searchable and actionable approach. So what we've done is we did a bit of proof-of-concept work by extending our current format of the Research Graph in JSON to JSON-LD, using schema.org to enhance the semantic feature of the Research Graph. We have a publication last year talking about the approach and the ideas, so the reference is at the bottom of the slide.

The other thing is, once we build the Research Graph there are a lot of interesting analysis that we can do. So we are currently exploring the new ways of analysing the information in the Research Graph and trying to pick up the good stories about what Research Graph can tell us.
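To give a feel for the JSON to JSON-LD step mentioned above, here is a minimal sketch that maps a plain dataset record onto schema.org terms (`Dataset`, `Person`, `name`, `description`, `creator`). The input field names (`key`, `title`, `abstract`, `creators`), the example URL, and the all-zero ORCID are assumptions; the actual Research Graph JSON layout and the mapping used in the proof of concept are not given in the talk.

```python
import json


def dataset_to_jsonld(record):
    """Convert a plain dataset record into a schema.org JSON-LD document."""
    creators = []
    for person in record.get("creators", []):
        entry = {"@type": "Person", "name": person["name"]}
        if person.get("orcid"):
            # An ORCID iD resolves as a URL, which makes it a natural @id.
            entry["@id"] = "https://orcid.org/" + person["orcid"]
        creators.append(entry)
    return {
        "@context": "https://schema.org",
        "@type": "Dataset",
        "@id": record["key"],
        "name": record["title"],
        "description": record.get("abstract", ""),
        "creator": creators,
    }


# Hypothetical input record; every value below is a placeholder.
record = {
    "key": "https://example.org/dataset/bran",
    "title": "Bluelink ReANalysis",
    "abstract": "Placeholder abstract.",
    "creators": [{"name": "Example Researcher", "orcid": "0000-0000-0000-0000"}],
}

doc = dataset_to_jsonld(record)
print(json.dumps(doc, indent=2))
```

Because the result is ordinary JSON with a `@context`, existing JSON consumers keep working while linked-data tools can interpret the same document against the schema.org vocabulary.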
The other thing is, because we are the national data repository we actually encourage people to do the cross-disciplinary research based on our high-performance platform. If we can demonstrate the value of [cross-system] and disciplinary research, by showing that when
different type of dataset available on the same platform, more research, more publication and more funding was granted, it will be quite good to demonstrate the impact of our data management practice.

So in summary, I think Research Graph really means a couple of things for a different group of user. For example, for a user itself of the data repository, they can understand the dynamic research integration through these analytics. I remember when some researcher submit an ARC grant, they sometimes show their publication citation along the year being increasingly better and better. But with the Research Graph they can actually show more information, not just publication but also their contribution of the dataset and their award on other additional funding using the Research Graph.

For the higher-level executive and board, as a data repository we can demonstrate the value of our good data management practice and provide the interoperability of the data services through these more advanced services. We also advance the science research by having more publication and more impact in the metrics.

Finally, for the funding body, since they invested a good amount of money for the data repository, we can demonstrate the impact of the investment on the data repository by showing the quantitative analysis of the impact metrics within the research community.

So if you want to learn more about the graph, we have the GitHub source code and we also have the interactive demo of the graph, and there is Twitter also if you wanted to socialise it. I think that's it.

Facilitator: Okay, thanks Jingbo. I'd like to thank Ben and Jingbo for giving this talk and thank you, everyone, for attending the webinar. Thank you.

END OF TRANSCRIPT